Tutorial 8

Answer Model

Data Fundamentals & CRS Management

st_crs(nc)$epsg returns the EPSG code identifier (e.g., 32119), while st_crs(nc)$proj4string returns the full PROJ.4 parameter string defining the coordinate system mathematically. EPSG is a shorthand registry code; proj4string contains explicit projection parameters.

library(sf)

Linking to GEOS 3.14.1, GDAL 3.12.4, PROJ 9.8.1; sf_use_s2() is TRUE

nc <- st_read(system.file("shape/nc.shp", package="sf"))

Reading layer `nc' from data source 
  `/home/bas/R/x86_64-redhat-linux-gnu-library/4.6/sf/shape/nc.shp' 
  using driver `ESRI Shapefile'
Simple feature collection with 100 features and 14 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
Geodetic CRS:  NAD27

st_crs(nc)$epsg

[1] 4267

st_crs(nc)$proj4string

[1] "+proj=longlat +datum=NAD27 +no_defs"

nc_4326 <- st_transform(nc, 4326)
nc_32119 <- st_transform(nc, 32119)

Transforming to EPSG:4326 (WGS84 geographic coordinates) changes county shapes more dramatically than EPSG:32119 (NAD83 / North Carolina projection). Geographic CRS (4326) uses angular units (degrees) that distort shapes at regional scales due to Earth’s curvature, while projected CRS (32119) preserves local geometry using meters with minimal distortion.

Area calculations differ because the original CRS (NAD27 / NC) uses feet as units, while EPSG:32119 uses meters with an area-preserving projection.

buncombe <- nc[nc$NAME == "Buncombe", ]
area_original <- st_area(buncombe) / 1e6  # Convert sq ft to km² (approximate)
area_32119 <- st_area(st_transform(buncombe, 32119)) / 1e6

area_original

1691.385 [m^2]

area_32119

1691.168 [m^2]

Results differ because the original CRS is not area-preserving and uses non-metric units. EPSG:32119 is a Lambert Conformal Conic projection optimized for North Carolina with minimal area distortion. For economic analysis requiring area-weighted variables (e.g., GDP/km²), EPSG:32119 (or any equal-area projection) is correct—geographic CRS like 4326 should never be used for area calculations.

CRS mismatch error occurs when attempting geometric operations on objects with different coordinate reference systems.

nc_a <- st_transform(nc, 32119)
nc_b <- st_transform(nc, 2264)

st_intersection(nc_a[1,], nc_b[1,])

Error in `geos_op2_geom()`:
! st_crs(x) == st_crs(y) is not TRUE

# "st_crs(x) == st_crs(y) is not TRUE"

# Fix: Transform nc_b to match nc_a's CRS
nc_b_fixed <- st_transform(nc_b, st_crs(nc_a))
intersection <- st_intersection(nc_a[1,], nc_b_fixed[1,])

Warning: attribute variables are assumed to be spatially constant throughout
all geometries

intersection_area <- st_area(intersection) / 1e6  # km²

Vector Operations – Buffers, Centroids & Topology

Identify county with highest 1979 birth rate, create buffer, and find counties whose centroids fall within buffer.

library(ggplot2)
high_birth_county <- nc[which.max(nc$BIR79), ]
buffer_25km <- st_buffer(high_birth_county, dist = 25000)
centroids <- st_centroid(nc)

Warning: st_centroid assumes attributes are constant over geometries

qualifying <- centroids[st_within(centroids, buffer_25km, sparse=F), ]

n_qualifying <- nrow(qualifying)
n_qualifying

[1] 5

ggplot() +
  geom_sf(data = nc, fill = "white", color = "gray50") +
  geom_sf(data = buffer_25km, fill = "red", alpha = 0.3) +
  geom_sf(data = high_birth_county, fill = "blue", alpha = 0.7) +
  geom_sf(data = qualifying, color = "gold", size = 2) +
  theme_minimal()

Visualization shows spatial extent of demographic influence.

Border analysis and inner buffer calculation:

touching <- nc[st_touches(high_birth_county, nc, sparse = FALSE)[1, ], ]
inner_buffer <- st_buffer(high_birth_county, dist = -10000)

Warning in st_buffer.sfc(st_geometry(x), dist, nQuadSegs, endCapStyle =
endCapStyle, : st_buffer does not correctly buffer longitude/latitude data

dist is assumed to be in decimal degrees (arc_degrees).

inner_area_pct <- as.numeric(st_area(inner_buffer) / st_area(high_birth_county)) * 100

The inner buffer represents the county’s “core” excluding peripheral border zones. A high percentage (>60%) suggests compact development where economic activity concentrates centrally; low percentages indicate elongated/rural counties where borders constitute significant economic interfaces. This core-periphery distinction matters for policy targeting (e.g., infrastructure investment).

Nearest neighbor analysis by centroid distance:

centroids <- st_centroid(nc)

Warning: st_centroid assumes attributes are constant over geometries

dist_matrix <- st_distance(centroids, centroids)
diag(dist_matrix) <- Inf  # Exclude self
nearest_idx <- max.col(-dist_matrix, "first")
nearest_dists <- apply(dist_matrix, 1, min)

min_pair <- which.min(nearest_dists)
max_pair <- which.max(nearest_dists)

ggplot(nc) +
  geom_sf(aes(fill = nearest_dists), color = "white") +
  scale_fill_viridis_c(name = "Distance to\nnearest neighbor (m)") +
  theme_minimal()

Smallest distance: Typically adjacent urban counties. Largest distance: Isolated rural counties. Distance heterogeneity reflects settlement patterns affecting labor market integration.

Concentric buffers for Wake County with population rings:

library(ggplot2)
wake <- nc[nc$NAME == "Wake", ]
# Create buffers at increasing distances
buffer_0 <- wake
buffer_10 <- st_buffer(wake, dist = 10000)
buffer_25 <- st_buffer(wake, dist = 25000)
buffer_50 <- st_buffer(wake, dist = 50000)

# Create rings using st_difference
ring1 <- st_difference(buffer_10, buffer_0)          # 0-10km

Warning: attribute variables are assumed to be spatially constant throughout
all geometries

ring2 <- st_difference(buffer_25, buffer_10)         # 10-25km

Warning: attribute variables are assumed to be spatially constant throughout
all geometries

ring3 <- st_difference(buffer_50, buffer_25)         # 25-50km

Warning: attribute variables are assumed to be spatially constant throughout
all geometries

# Assign ring IDs for plotting
rings <- rbind(ring1, ring2, ring3) 
rings$ring_id <- factor(c("0-10km", "10-25km", "25-50km"))

# Calculate population in each ring using area-weighted interpolation
ring1_pop <- sum(st_interpolate_aw(nc["BIR74"], ring1, extensive = TRUE)$BIR74)

Warning in st_interpolate_aw.sf(nc["BIR74"], ring1, extensive = TRUE):
st_interpolate_aw assumes attributes are constant or uniform over areas of x

ring2_pop <- sum(st_interpolate_aw(nc["BIR74"], ring2, extensive = TRUE)$BIR74)

Warning in st_interpolate_aw.sf(nc["BIR74"], ring2, extensive = TRUE):
st_interpolate_aw assumes attributes are constant or uniform over areas of x

ring3_pop <- sum(st_interpolate_aw(nc["BIR74"], ring3, extensive = TRUE)$BIR74)

Warning in st_interpolate_aw.sf(nc["BIR74"], ring3, extensive = TRUE):
st_interpolate_aw assumes attributes are constant or uniform over areas of x

ring_pops <- c(ring1 = ring1_pop, ring2 = ring2_pop, ring3 = ring3_pop)
ring_pops

    ring1     ring2     ring3 
 7412.975 12860.919 31167.598

# Visualization
ggplot() +
  geom_sf(data = nc, fill = "lightgray", color = "white") +
  geom_sf(data = rings, aes(fill = ring_id), alpha = 0.7) +
  geom_sf(data = wake, fill = "darkblue", alpha = 0.3) +
  scale_fill_brewer(palette = "YlOrRd", name = "Distance ring") +
  theme_minimal() +
  labs(title = "Wake County Population Rings (BIR74)",
       subtitle = paste("Ring populations:", paste0(round(ring_pops), collapse=",")))

Disjoint counties from high-SIDS areas:

median_sid <- median(nc$SID74)
high_sid <- nc[nc$SID74 > median_sid, ]
disjoint <- nc[!st_touches(nc, st_union(high_sid), sparse = FALSE)[,1], ]

Geographic pattern: Disjoint counties typically cluster in western mountains or northeastern coastal plains—regions physically separated from central Piedmont (where high-SIDS counties concentrate). Economic mechanism: Health outcome clustering may reflect spatial sorting by income (wealthier households self-select into healthier environments) or spillover effects from urban pollution/healthcare access concentrated in central regions.

Spatial Joins & Real-World Applications

Pollution exposure simulation:

set.seed(123)
library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

sources <- st_sample(st_as_sfc(st_bbox(nc)), 200) |> st_cast("POINT") |> st_sf()

Warning in st_poly_sample(x, size = size, ..., type = type, by_polygon =
by_polygon, : coordinate ranges not computed along great circles; install
package lwgeom to get rid of this warning
Warning in st_poly_sample(x, size = size, ..., type = type, by_polygon =
by_polygon, : coordinate ranges not computed along great circles; install
package lwgeom to get rid of this warning

selected <- sources[sample(nrow(sources), 5), ]
buffers <- st_buffer(selected, dist = runif(5, 5000, 30000))
buffers$zone_id <- 1:5

# County-zone intersections
county_zones <- st_join(nc, buffers, join = st_intersects)
exposure <- aggregate(area ~ NAME, 
                      data = st_intersection(county_zones, buffers) |> 
                        st_as_sf() |> 
                        mutate(area = as.numeric(st_area(geometry))),
                      sum)

Warning: attribute variables are assumed to be spatially constant throughout
all geometries

Warning: attribute variables are assumed to be spatially constant throughout
all geometries

exposure

        NAME       area
1      Anson 1386609993
2  Granville    9213896
3 Montgomery   70721395
4     Person   93999085
5   Richmond  383328506
6     Stanly   52438324
7      Union  251781693
8      Vance   12279012
9     Warren  441185980

Counties intersecting multiple zones show cumulative exposure risk. Total exposure score enables environmental justice analysis (e.g., regressing health outcomes on exposure scores).

City-state spatial join:

library(rnaturalearth)
library(sf)
library(dplyr)

# 1. Get US state boundaries (admin level 1)
us_states <- ne_states(country = "United States of America", returnclass = "sf")

# 2. Download populated places (scale 110 = low-res global layer, ~1000 cities worldwide)
places <- ne_download(
  scale = 110,
  type  = "populated_places",
  category = "cultural",
  returnclass = "sf"
)

Reading 'ne_110m_populated_places.zip' from naturalearth...

# 3. Filter to US places only (use the SOV0NAME or ADM0NAME column)
us_places <- places |>
  filter(ADM0NAME == "United States of America")

# 4. Make sure both layers share the same CRS
us_states  <- st_transform(us_states,  crs = 4326)
us_places  <- st_transform(us_places,  crs = 4326)

# 5. Spatial join: assign each city to the state polygon it falls within
joined <- st_join(us_places, us_states["name"], join = st_within)

# 6. Which state has the most cities?
city_counts <- joined |>
  st_drop_geometry() |>
  count(name, sort = TRUE) |>
  rename(state = name, n_cities = n)

cat("State with most populated places:\n")

State with most populated places:

print(head(city_counts, 5))

                 state n_cities
1           California        2
2             Colorado        1
3 District of Columbia        1
4              Florida        1
5              Georgia        1

# 7. Average latitude per state (northernmost -> southernmost)
avg_lat <- joined |>
  mutate(lat = st_coordinates(geometry)[, 2]) |>
  st_drop_geometry() |>
  group_by(state = name) |>
  summarise(avg_lat = mean(lat, na.rm = TRUE),
            n_cities = n()) |>
  arrange(desc(avg_lat))

cat("\nAverage city latitude per state (N ? S):\n")


Average city latitude per state (N ? S):

print(avg_lat)

# A tibble: 8 × 3
  state                avg_lat n_cities
  <chr>                  <dbl>    <int>
1 Illinois                41.8        1
2 New York                40.7        1
3 Colorado                39.7        1
4 District of Columbia    38.9        1
5 California              35.9        2
6 Georgia                 33.7        1
7 Texas                   29.7        1
8 Florida                 25.8        1

California typically contains the most cities. Average latitude per state reveals north-south settlement gradients affecting climate-dependent economic activity.

Firm location simulation and correlation test:

library(sf)
library(units)

udunits database from /usr/share/udunits/udunits2.xml

# Load data (assuming this is already in your environment)
nc <- st_read(system.file("shape/nc.shp", package="sf"), quiet = TRUE)

# 1. Calculate the area of the COUNTIES first, adding it as a column to 'nc'
nc$area_km2 <- st_area(nc) |> set_units("km^2")

set.seed(456)
firms <- st_sample(st_union(nc), 150) |> st_cast("POINT") |> st_sf()

Warning in st_poly_sample(x, size = size, ..., type = type, by_polygon =
by_polygon, : coordinate ranges not computed along great circles; install
package lwgeom to get rid of this warning

firms$revenue <- runif(150, 1, 50)
firms$industry <- sample(c("manufacturing", "retail", "services"), 150, 
                         prob = c(0.3, 0.4, 0.3), replace = TRUE)

# 2. Join firms with nc (including the new area column)
firms_joined <- st_join(firms, nc[, c("NAME", "area_km2")])

# 3. Filter for manufacturing
manufacturing <- firms_joined[firms_joined$industry == "manufacturing", ]

# 4. Run correlation using the columns from the joined data
cor.test(as.numeric(manufacturing$area_km2), 
         manufacturing$revenue)


    Pearson's product-moment correlation

data:  as.numeric(manufacturing$area_km2) and manufacturing$revenue
t = 0.80257, df = 38, p-value = 0.4272
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.1900476  0.4235791
sample estimates:
     cor 
0.129105

Positive correlation would suggest manufacturing firms locate in larger counties (more land for factories), while negative correlation might indicate urban concentration despite smaller land area.

Border demographic discontinuities:

touch_matrix <- st_touches(nc, nc, sparse = FALSE)
diffs <- data.frame()
for (i in 1:nrow(nc)) {
  for (j in which(touch_matrix[i, ])) {
    if (i < j) {  # Avoid duplicates
      diffs <- rbind(diffs, data.frame(
        county1 = nc$NAME[i], county2 = nc$NAME[j],
        bir_diff = abs(nc$BIR74[i] - nc$BIR74[j]),
        nwbir_diff = abs(nc$NWBIR74[i] - nc$NWBIR74[j])
      ))
    }
  }
}
largest_bir <- diffs[which.max(diffs$bir_diff), ]

Largest discontinuity often occurs at urban-rural borders (e.g., Mecklenburg vs. adjacent rural counties). Economic relevance: Such sharp borders enable spatial regression discontinuity designs to estimate policy effects (e.g., school funding differences across county lines).

Raster Data Fundamentals

Landsat band analysis:

library(stars)

Loading required package: abind

landsat <- read_stars(system.file("tif/L7_ETMs.tif", package = "stars"))
dim(landsat)[3]  # 6 bands

band 
   6

st_dimensions(landsat)$x$delta  # Resolution: 28.5m

[1] 28.5

band4 <- landsat[,,,4]
plot(band4, main = "Band 4: Near-Infrared")

Band 4 (NIR) is valuable because healthy vegetation strongly reflects NIR radiation—enabling NDVI calculation for crop health monitoring, yield prediction, and agricultural subsidy targeting.

Circular crop and zonal statistics:

library(stars)
library(sf)

pt <- st_point(c(294000, 9116000)) |> 
  st_sfc(crs = st_crs(landsat)) |> 
  st_buffer(dist = 500)

cropped <- st_crop(landsat, pt)

# FIX: Use st_apply() instead of apply()
band_means <- st_apply(cropped, MARGIN = 3, FUN = mean, na.rm = TRUE)

# View result
print(band_means)

stars object with 1 dimensions and 1 attribute
attribute(s):
          Min.  1st Qu.  Median     Mean  3rd Qu.     Max.
mean  57.92547 59.89674 66.6972 69.60559 73.89984 92.38302
dimension(s):
     from to
band    1  6

Mean reflectance per band characterizes land cover within the circle (e.g., high Band 4 = vegetation).

County-level raster aggregation:

set.seed(789)
nc_sample <- nc[sample(nrow(nc), 5), ] |> 
  st_transform(st_crs(landsat))
band1 <- landsat[,,,1]
county_means <- aggregate(band1, by = nc_sample, FUN = mean, na.rm = TRUE)
nc_sample$mean_band1 <- county_means[[1]]

ggplot(nc_sample) +
  geom_sf(aes(fill = mean_band1)) +
  scale_fill_viridis_c() +
  theme_minimal()

Reflectance differences correlate with land use (e.g., low Band 1 = water bodies), enabling remote sensing proxies for economic activity.

Advanced Integration & Economics Applications

Climate-province integration:

library(geodata)

Loading required package: terra

terra 1.9.27

temp <- worldclim_country(country = "Netherlands", var = "tavg", path = tempdir())

Cached as: /tmp/Rtmp11STQ1/climate/wc2.1_country/NLD_wc2.1_30s_tavg.tif

temp_stars <- st_as_stars(temp)
jan_2020 <- temp_stars[,,,1]  # Assuming first layer is Jan 2020

provinces <- ne_states("Netherlands", returnclass = "sf")
provinces$temp <- aggregate(jan_2020, by = provinces, FUN = mean, na.rm = TRUE)[[1]]

# Economic linkage approach:
# 1. Obtain provincial GDP data (e.g., from CBS Statistics Netherlands)
# 2. Merge: provinces <- merge(provinces, gdp_data, by = "province_name")
# 3. Model: lm(log_gdp_pc ~ temp + controls, data = provinces)
# 4. Address endogeneity via instrumental variables (e.g., coastal proximity)

Critical considerations: Control for urbanization (confounder), use panel data to exploit temporal variation, and account for spatial autocorrelation in errors.

Spatial lag calculation:

library(spdep)

Loading required package: spData

To access larger datasets in this package, install the spDataLarge
package with: `install.packages('spDataLarge',
repos='https://nowosad.github.io/drat/', type='source')`


Attaching package: 'spData'

The following object is masked _by_ '.GlobalEnv':

    us_states

nb <- poly2nb(nc)  # Queen contiguity
listw <- nb2listw(nb, style = "W")  # Row-standardized
nc$lag_BIR74 <- lag.listw(listw, nc$BIR74)

plot(nc$BIR74, nc$lag_BIR74, 
     xlab = "County Births (BIR74)", 
     ylab = "Spatial Lag (Avg. Neighbor Births)",
     main = "Spatial Autocorrelation of Births")
abline(lm(lag_BIR74 ~ BIR74, data = nc), col = "red")

High lag values imply counties are surrounded by demographically similar neighbors—suggesting spatial spillovers in fertility behavior (e.g., cultural norms, healthcare access) or sorting (families with similar preferences cluster together).

University geocoding and distance:

library(tidygeocoder)
library(sf)

unis <- tibble::tibble(
  name    = c("Radboud University, Nijmegen", "Erasmus University Rotterdam"),
  address = c("Radboud University, Nijmegen", "Erasmus University Rotterdam")
) |> geocode(address, method = "osm")

ru  <- st_as_sf(unis[1,], coords = c("long", "lat"), crs = 4326)
eur <- st_as_sf(unis[2,], coords = c("long", "lat"), crs = 4326)

ru_utm  <- st_transform(ru,  32631)
eur_utm <- st_transform(eur, 32631)
distance_km <- as.numeric(st_distance(ru_utm, eur_utm)) / 1000

buffer_ru  <- st_buffer(ru_utm,  2000)
buffer_eur <- st_buffer(eur_utm, 2000)
overlap <- st_intersection(buffer_ru, buffer_eur)

Distance ≈ 85 km. Overlapping buffer area represents potential student mobility zone or competitive labor market for academic talent.

Climate time-series analysis:

library(stars)
climate <- read_stars(system.file("nc/bcsd_obs_1999.nc", package = "stars"), quiet = TRUE)
dim(climate)[3]

time 
  12

jan1  <- climate[,,,1]
jan12 <- climate[,,,12]

#adrop() removes the length-1 time dimension entirely, leaving both objects with only x and y, so the subtraction works cleanly.
temp_change <- adrop(jan12) - adrop(jan1)

# Economics link: 
# - Temperature volatility proxies for crop failure risk
# - Use as instrument for agricultural income shocks
# - Regress household consumption on temp_change to estimate risk coping

Daily temperature changes capture short-term climate volatility affecting planting/harvest decisions—critical for modeling farmer risk exposure in developing economies.

Spatial treatment effect estimation:

library(sf)
library(spData)

# Project to a suitable planar CRS (US Albers, metres)
nc_proj <- st_transform(nc, crs = 5070)

nc_proj$treatment <- as.integer(nc_proj$BIR74 > quantile(nc_proj$BIR74, 0.8))

treat_buffer <- st_buffer(nc_proj[nc_proj$treatment == 1, ], dist = 50000)  # 50 km ✓

nc_proj$spillover <- as.integer(
  rowSums(st_intersects(nc_proj, treat_buffer, sparse = FALSE)) > 0 &
    nc_proj$treatment == 0
)

nc_proj$outcome <- 100 + 15*nc_proj$treatment + 8*nc_proj$spillover +
                   rnorm(nrow(nc_proj), 0, 10)

# a) Direct effect (pure control vs treated, no spillover units)
direct_sub <- nc_proj[nc_proj$spillover == 0, ]
stopifnot(length(unique(direct_sub$treatment)) == 2)  # sanity check
direct_model <- lm(outcome ~ treatment, data = direct_sub)
summary(direct_model)$coefficients["treatment", ]

    Estimate   Std. Error      t value     Pr(>|t|) 
1.792513e+01 4.162123e+00 4.306727e+00 1.633613e-04

# b) Spillover effect (among non-treated only)
spillover_sub <- nc_proj[nc_proj$treatment == 0, ]
spillover_model <- lm(outcome ~ spillover, data = spillover_sub)
summary(spillover_model)$coefficients["spillover", ]

    Estimate   Std. Error      t value     Pr(>|t|) 
1.300168e+01 2.736072e+00 4.751952e+00 9.022925e-06