Module 8: Attribute Data & Databases

8.2 Relational Databases for GIS

Why GIS eventually moves from files to databases — and the principles that make the move worthwhile.

Lesson 39 of 100·18 min read

Key takeaways

Databases scale beyond single-user shapefile / GeoPackage workflows.

Tables, keys, normalisation, and transactions are the fundamentals every spatial database uses.

PostgreSQL + PostGIS is the de facto open-source spatial database; SQL Server Spatial and Oracle Spatial are commercial alternatives.

Introduction

When you outgrow single-user files, the next step is a relational database. This lesson covers the fundamentals — what relational databases are, how they relate to files, and why PostGIS has become the default for spatial data.

What's a relational database?

A relational database stores data in tables, with rows and columns, and lets you query across tables using SQL. Key properties:

Schema — tables have defined columns and types.
Constraints — keys, not-null, check constraints.
Transactions — groups of changes succeed or fail atomically (ACID).
Concurrency — many users can read and write simultaneously.
Indexes — speed queries.
Views — virtual tables defined by queries.
Stored procedures / functions — custom logic in the database.

Files vs databases — when to switch

Factor	Files (shp / GPKG)	Database
Users	1–2	Many
Data size	< 2 GB (shp), < 50 GB (GPKG)	Terabytes
Concurrency	Single writer	Many writers
Backups	Copy the file	Continuous, point-in-time
Cross-table queries	Limited	Rich SQL joins
Security	Filesystem	Row/column-level permissions

Rule of thumb: a solo analyst on a small dataset is fine with files. A team, public-facing app, or > 50 GB dataset benefits from a database.

Normalisation

Good database design avoids duplicating data. The classic example:

Denormalised (bad):

event_id	country_name	country_iso
1	Denmark	DK
2	Denmark	DK
3	Denmark	DK

Normalised (good):

events(event_id, country_iso) + countries(country_iso, country_name).

Normalisation prevents update anomalies (changing Denmark's name requires one update, not many) and shrinks storage. Over-normalisation slows queries; there's a balance.

Keys again

A primary key uniquely identifies each row. A foreign key references a primary key in another table. The database enforces these — you can't insert an event for a country that doesn't exist.

Transactions

A transaction groups multiple statements. Either all succeed or all fail — no partial writes. Example:

SQL

1BEGIN;
2UPDATE accounts SET balance = balance - 100 WHERE id = 1;
3UPDATE accounts SET balance = balance + 100 WHERE id = 2;
4COMMIT;

If the power cuts between the two updates, the database rolls back the first and you don't end up missing £100.

For GIS: if an edit changes both geometry and attributes across multiple tables, you want them all committed atomically.

Indexes, revisited

A spatial index (GiST in PostGIS, R-tree in SpatiaLite) accelerates spatial predicates (ST_Intersects, ST_DWithin). Create one per geometry column:

SQL

CREATE INDEX idx_parks_geom ON parks USING GIST (geom);

Without this, a query scanning millions of features is O(N); with it, O(log N) for common queries.

PostgreSQL + PostGIS

PostgreSQL is a rock-solid open-source relational database. PostGIS is its spatial extension, adding:

Geometry and geography types.
~300 spatial functions (ST_*).
Spatial indexes.
Coordinate transformation.
Raster support (via postgis_raster).
Topology extension.

PostGIS is the default spatial database for most professional GIS teams. It's free, scales to petabytes, and integrates with every major tool.

Commercial alternatives

Oracle Spatial — enterprise, strong geodetic support.
SQL Server Spatial — tight Windows/.NET integration.
MySQL spatial — basic features; weaker than PostGIS.
SpatiaLite — single-file, covered in Module 6.3.
DuckDB spatial — analytics-focused, embeddable, growing fast.

Cloud spatial databases

Amazon Aurora PostgreSQL + PostGIS.
Google Cloud SQL for PostgreSQL + PostGIS.
Azure Database for PostgreSQL + PostGIS.
BigQuery GEOGRAPHY type — serverless, great for massive aggregations.
Snowflake GEOGRAPHY / GEOMETRY — similar.

For serverless analytics on huge datasets, column-store warehouses (BigQuery, Snowflake) often beat row-store PostGIS.

A tiny PostGIS example

SQL

1-- Create a spatial table
2CREATE TABLE parks (
3  id SERIAL PRIMARY KEY,
4  name TEXT NOT NULL,
5  geom GEOMETRY(Polygon, 4326) NOT NULL
6);
7-- Spatial index
8CREATE INDEX idx_parks_geom ON parks USING GIST (geom);
9-- Insert a feature
10INSERT INTO parks (name, geom)
11VALUES ('Central Park',
12ST_GeomFromText('POLYGON((...))', 4326));

Four lines of SQL do a real spatial query — what used to require specialised GIS software.

Migrating from files to PostGIS

Shell

1# Bulk import a shapefile into PostGIS
2ogr2ogr -f PostgreSQL \
3  PG:"host=db.example.com user=gis dbname=gisdb password=XXX" \
4  parcels.shp \
5  -nln parcels -overwrite -lco GEOMETRY_NAME=geom

One line, shapefile → PostGIS. Run a CREATE INDEX, VACUUM, and you're ready to query.

Self-check exercises

1. You have a shapefile with 500 000 parcels and one analyst editing. Should you move to PostGIS?

Probably not yet. One analyst on 500 k features is comfortable with a GeoPackage; you don't need concurrent writers or ACID across tables. Move to PostGIS when you add a second editor, start integrating web apps that need the data live, or cross 10 M features / multi-GB sizes.

2. Why is a spatial index important for spatial queries?

Without it, a query like "which parks are within 500 m of this school" must scan every park feature — O(N). A GiST / R-tree index narrows the candidates by bounding-box pre-filter to a tiny fraction, making the query O(log N). For 1 M features, this is the difference between seconds and milliseconds.

3. What's the difference between ACID transactions and eventual consistency?

ACID (Atomicity, Consistency, Isolation, Durability) transactions guarantee all-or-nothing semantics and immediate visibility of committed data — PostGIS provides this. Eventual consistency (common in distributed / NoSQL systems) allows temporary divergence between nodes, with convergence over time; faster but harder to reason about for correctness-critical data like cadastral records.

Summary

Files for solo work; databases for teams, scale, and concurrency.
PostgreSQL + PostGIS is the default open-source spatial database.
Normalisation, indexes, and transactions make databases reliable.
Cloud databases and warehouses add elastic scale and serverless options.