Module 8: Attribute Data & Databases

8.5 Spatial Joins and Relates

The fundamental operation for combining datasets by location rather than by shared keys.

Lesson 42 of 100·18 min read

Key takeaways

A spatial join attaches attributes from one dataset to another based on geometric relationship.

Choose the predicate (intersects, within, contains, nearest) based on what you want to mean.

Spatial indexes and careful predicate choice separate a 1-second query from a 1-hour query.

Introduction

An ordinary join connects rows by shared keys (order.customer_id = customer.id). A spatial join connects rows by geometric relationship ("for each point, find the polygon that contains it"). Spatial joins are arguably the most useful spatial operation — they turn raw geometries into meaningful attributed data.

The five common predicates

Predicate	Use case
`intersects`	Any overlap
`within`	A entirely inside B
`contains`	B entirely inside A
`dwithin(d)`	A within distance d of B
`nearest(k)`	K nearest B to A

All of these live in PostGIS as ST_* functions, in GeoPandas as sjoin / sjoin_nearest variants, and in every serious GIS.

Points in polygons

The most common spatial join: attach the containing polygon's attributes to each point.

SQL

1SELECT p.id, p.name, c.neighborhood_name
2FROM points_of_interest p
3LEFT JOIN neighborhoods c
4  ON ST_Within(p.geom, c.geom);

In GeoPandas:

Python

1joined = gpd.sjoin(points, neighborhoods,
2                   predicate='within', how='left')

Polygons to polygons

When two polygon layers overlap but don't contain each other (e.g., zoning × floodplain), use ST_Intersects:

SQL

1SELECT z.zone_id, f.flood_zone, ST_Area(ST_Intersection(z.geom, f.geom)) AS overlap_m2
2FROM zoning z
3JOIN floodplain f ON ST_Intersects(z.geom, f.geom);

For proportional attribute transfer (e.g., estimate population in a new district from overlapping census blocks), you need area-weighted aggregation — see Module 8.6.

Distance joins

"Cafés within 500 m of subway entrances":

SQL

1SELECT s.id, s.name, c.id AS cafe_id, c.name AS cafe_name,
2       ST_Distance(s.geom::geography, c.geom::geography) AS metres
3FROM stations s
4JOIN cafes c
5  ON ST_DWithin(s.geom::geography, c.geom::geography, 500);

Always use ST_DWithin rather than ST_Distance < 500 — the former is index-optimised.

Nearest joins

"For each school, the nearest hospital":

SQL

1SELECT s.id, s.name AS school,
2       h.name AS nearest_hospital,
3       ST_Distance(s.geom::geography, h.geom::geography) AS metres
4FROM schools s
5CROSS JOIN LATERAL (
6  SELECT name, geom
7  FROM hospitals
8  ORDER BY s.geom <-> geom
9  LIMIT 1
10) h;

CROSS JOIN LATERAL + <-> lets each row pick its own nearest candidate efficiently. In GeoPandas, use sjoin_nearest.

One-to-many joins

When a feature can match multiple features in the other layer, you get multiple rows. That's often what you want — aggregate afterwards:

SQL

1SELECT s.id, s.name, COUNT(c.id) AS cafes_within_500m
2FROM stations s
3LEFT JOIN cafes c
4  ON ST_DWithin(s.geom::geography, c.geom::geography, 500)
5GROUP BY s.id, s.name;

Cardinality and NULLs

Beware:

INNER JOIN drops rows with no match — easy to lose stations with zero cafés nearby.
LEFT JOIN preserves them with NULL on the joined side.
A point on a polygon boundary is within under some predicates and not others — use ST_Covers to explicitly include boundaries.

Performance

With spatial indexes in place, spatial joins scale to millions of features. Without them, they don't. Rules:

CREATE INDEX ... USING GIST (geom) on both tables.
Predicates that use bounding boxes (ST_Intersects, ST_DWithin) use the index; function-wrapped geometries don't.
Reproject both layers to the same CRS before the join; don't call ST_Transform inside the predicate.
VACUUM ANALYZE after bulk loads.

Worked pattern — point in polygon with area-weighted sum

How many residents live in each school's 1 km buffer?

SQL

1WITH school_buffers AS (
2  SELECT id, ST_Buffer(geom::geography, 1000)::geometry AS geom
3  FROM schools
4)
5SELECT s.id,
6       SUM(
7         cb.population *
8         ST_Area(ST_Intersection(s.geom, cb.geom)) /
9         ST_Area(cb.geom)
10       ) AS estimated_pop
11FROM school_buffers s
12JOIN census_blocks cb ON ST_Intersects(s.geom, cb.geom)
13GROUP BY s.id;

Each census block contributes population proportional to the fraction of its area inside the buffer. This is the simplest form of areal interpolation.

In GeoPandas

Python

1import geopandas as gpd
2points = gpd.read_file('poi.geojson').to_crs(3857)
3polys = gpd.read_file('neigh.geojson').to_crs(3857)
4Standard sjoin
5joined = points.sjoin(polys, predicate='within', how='left')
6Nearest join
7nearest = points.sjoin_nearest(polys, how='left',
8max_distance=500,
9distance_col='metres')
10Aggregation

Self-check exercises

1. Difference between inner and left spatial join — practical consequence?

An inner join returns only rows with a match in the other layer; a left join returns all rows from the left layer and NULL for any right-side match. If you want to preserve "stations with zero cafés nearby", you need a left join; with inner, those stations disappear from the result.

2. Why use ST_DWithin(a, b, 500) instead of ST_Distance(a, b) < 500?

ST_DWithin is optimised to use spatial indexes via bounding-box pre-filtering — the planner filters candidates quickly before computing precise distances. ST_Distance < 500 forces a distance computation for every row pair, which defeats the index and runs O(N × M).

3. Your spatial join runs for an hour on 1 M × 1 M features. What three things would you check?

(1) Are there GiST indexes on both geometry columns? Create them if missing. (2) Are the two layers in the same CRS? Reproject first, don't transform inside the predicate. (3) Are you using an indexable predicate (ST_Intersects, ST_DWithin)? A function like ST_Distance < 500 isn't indexable — switch to ST_DWithin. After these, run VACUUM ANALYZE to refresh statistics.

Summary

Spatial joins connect rows by geometric relationship, not shared keys.
Pick the predicate carefully: intersects, within, contains, dwithin, nearest.
Index both sides and use indexable predicates; reproject before joining.
Left joins preserve unmatched rows — essential for complete summary tables.