Module 8: Attribute Data & Databases

8.3 SQL for GIS

The SQL every GIS analyst should know — filtering, joining, aggregating, with spatial flavour.

Lesson 40 of 100·22 min read

Key takeaways

SQL is the lingua franca of GIS analytics once your data is in a database.

The core is SELECT ... FROM ... WHERE ... GROUP BY ... JOIN; PostGIS adds ST_* spatial functions.

Writing readable SQL pays off — rebuilding a tangled query costs more than writing it carefully the first time.

Introduction

SQL has been the dominant data language since 1974. For GIS, it pairs naturally with PostGIS to handle every query type — attribute filter, join, aggregate, spatial predicate. This lesson is a fast tour for readers new to SQL, using examples that reflect real analyst workloads.

The five building blocks

SELECT — which columns?

SQL

SELECT name, population FROM cities;

Use * to get everything; prefer explicit columns in production code (performance + readability).

WHERE — filter rows

SQL

1SELECT name FROM cities WHERE population > 100000;
2SELECT * FROM parks WHERE country = 'Denmark' AND area_ha >= 5;

ORDER BY — sort

SQL

1SELECT name, population FROM cities
2ORDER BY population DESC
3LIMIT 10;

GROUP BY + aggregate — summarise

SQL

1SELECT country, COUNT(*) AS n_cities, AVG(population) AS avg_pop
2FROM cities
3GROUP BY country
4ORDER BY n_cities DESC;

Aggregates: COUNT, SUM, AVG, MIN, MAX, STDDEV, PERCENTILE_CONT.

JOIN — combine tables

SQL

1SELECT c.name, co.continent
2FROM cities c
3JOIN countries co ON c.iso3 = co.iso3
4WHERE co.continent = 'Africa';

Types:

INNER JOIN — matching rows from both.
LEFT JOIN — all from left, matching from right (NULL otherwise).
RIGHT JOIN — mirror.
FULL OUTER JOIN — all from both.
CROSS JOIN — cartesian; dangerous on large tables.

Spatial flavour

PostGIS adds ~300 ST_* functions. The essentials:

Geometry constructors

SQL

1ST_GeomFromText('POINT(-73.96 40.78)', 4326)
2ST_MakePoint(-73.96, 40.78)
3ST_GeomFromGeoJSON('{"type":"Point","coordinates":[-73.96,40.78]}')

Spatial predicates (return boolean)

SQL

1ST_Intersects(a.geom, b.geom)
2ST_Contains(a.geom, b.geom)
3ST_Within(a.geom, b.geom)
4ST_DWithin(a.geom, b.geom, 500)   -- within 500 units
5ST_Touches(a.geom, b.geom)
6ST_Crosses(a.geom, b.geom)

Geometry manipulations (return geometry)

SQL

1ST_Buffer(geom, 100)
2ST_Union(geom)        -- aggregate dissolve
3ST_Intersection(a.geom, b.geom)
4ST_Difference(a.geom, b.geom)
5ST_Simplify(geom, 10)
6ST_Centroid(geom)

Measurements (return numeric)

SQL

1ST_Area(geom::geography)          -- m²
2ST_Length(geom::geography)        -- metres
3ST_Distance(a.geom::geography, b.geom::geography)
4ST_Perimeter(geom::geography)

Cast to ::geography for metre-accurate calculations on lat/lon data.

Practical queries

Points in polygons

SQL

1SELECT d.name, COUNT(*) AS n_trees
2FROM districts d
3LEFT JOIN trees t ON ST_Within(t.geom, d.geom)
4GROUP BY d.name
5ORDER BY n_trees DESC;

Nearest neighbour

SQL

1-- Closest hospital to each school
2SELECT s.id, s.name AS school,
3       h.name AS nearest_hospital,
4       ST_Distance(s.geom::geography, h.geom::geography) AS metres
5FROM schools s
6CROSS JOIN LATERAL (
7  SELECT name, geom
8  FROM hospitals
9  ORDER BY s.geom <-> geom         -- bounding-box distance (indexed)
10  LIMIT 1
11) h;

Aggregate within a buffer

SQL

1SELECT s.name,
2       COUNT(c.id) AS cafes_within_500m
3FROM stations s
4LEFT JOIN cafes c
5  ON ST_DWithin(s.geom::geography, c.geom::geography, 500)
6GROUP BY s.id, s.name;

Dissolve by category

SQL

1SELECT landuse_class, ST_Union(geom) AS geom
2FROM land_parcels
3GROUP BY landuse_class;

Common table expressions (CTEs)

WITH clauses let you break queries into named steps:

SQL

1WITH risk_areas AS (
2  SELECT ST_Buffer(geom, 500) AS geom
3  FROM hazardous_waste_sites
4),
5risk_population AS (
6  SELECT SUM(population) AS pop
7  FROM census_blocks b
8  JOIN risk_areas r ON ST_Intersects(b.geom, r.geom)
9)
10SELECT pop FROM risk_population;

Readable, debuggable, and often no slower than nested subqueries.

Window functions

Compute ranks, running sums, and moving averages:

SQL

1SELECT city,
2       population,
3       RANK() OVER (PARTITION BY country ORDER BY population DESC) AS rank_in_country
4FROM cities;

Explain and optimise

Before running a long query, use EXPLAIN ANALYZE to see the plan:

SQL

1EXPLAIN ANALYZE
2SELECT * FROM parks WHERE ST_Intersects(geom, ST_MakeEnvelope(...));

Look for Index Scan (good) vs Seq Scan (full scan — suspect on large tables).

Readability matters

A 50-line SQL query hidden in application code is painful. Conventions:

UPPER CASE for SQL keywords.
Columns in multi-line lists for readability.
CTEs to name intermediate results.
Comments where logic is non-obvious.
One statement per file for complex ETL.

Future you will thank you.

Self-check exercises

1. Why cast to ::geography for distance calculations on lat/lon data?

Geometry operations on lat/lon interpret coordinates as planar and return degrees. Geography operations use spherical (ellipsoidal) distance algorithms and return metres. A query like ST_Distance(geom, geom2) without the cast returns degrees, which is almost never what you want. ::geography is the fix; for very high volumes, consider reprojecting to a metric CRS instead to avoid the per-query cost.

2. Write an SQL query that counts points within each polygon.

SQL

1SELECT d.id, d.name, COUNT(p.id) AS n_points
2FROM districts d
3LEFT JOIN points p ON ST_Within(p.geom, d.geom)
4GROUP BY d.id, d.name
5ORDER BY n_points DESC;

Left join preserves polygons with zero points; inner join drops them.

3. Your query does a sequential scan instead of using a spatial index. What might be wrong?

Common causes: (1) no spatial index created — CREATE INDEX ... USING GIST (geom). (2) a function wraps the geometry (ST_Buffer(geom, ...)), preventing index use — move the buffer into a CTE. (3) statistics are outdated — run VACUUM ANALYZE. (4) query touches the whole dataset anyway; sometimes a seq scan is genuinely faster.

Summary

SQL core: SELECT, WHERE, ORDER BY, GROUP BY, JOIN.
PostGIS adds ST_* for geometry, predicates, measurements.
Use CTEs for readable multi-step queries.
EXPLAIN ANALYZE diagnoses performance.