Module 8: Attribute Data & Databases

8.1 Attribute Tables

The tabular half of every GIS dataset — types, keys, and why discipline here saves analysis time.

Lesson 38 of 100·15 min read

Key takeaways

Every vector feature carries an attribute row; keeping this table clean is half of good GIS.

Data types (integer, float, string, date, boolean) must match the real semantics of each field.

Unique identifiers and documented units turn data into a durable asset.

Introduction

Vector data has two halves: geometry and attributes. Geometry gets most of the glamour, but analysts spend as much time on attribute data — cleaning, joining, classifying, aggregating — as they do on geometry. This lesson covers the patterns that make attribute tables a pleasure rather than a pain.

The basics

An attribute table is a table. Each row corresponds to one feature (point, line, polygon). Each column is a property. Columns have:

A name.
A data type (integer, float, string, date, boolean, ...).
Optionally a length or precision.
Optionally a default value.
Optionally a constraint (not null, unique).

Example for a hydrant table:

hydrant_id (int PK)	city (str)	flow_gpm (int)	installed (date)	out_of_service (bool)	geom (point)
101	Copenhagen	750	1998-07-01	false	POINT(...)
102	Copenhagen	600	2002-11-12	false	POINT(...)
103	Copenhagen	0	1965-03-01	true	POINT(...)

Data types — choose deliberately

Integer for counts, IDs, year-only dates, enumerated categories (as codes).
Float / double for measurements, averages, ratios.
String for names, free text. Specify max length where the schema allows.
Boolean for yes/no flags.
Date / datetime for dates. Avoid storing "2024-07" as an integer 202407.
Enum / category for small, fixed sets (e.g., zoning classes).

Over time, the wrong type causes queries to fail or silently slow. A string "07" sorts before "12" correctly but after "10"; as an integer, ordering is correct.

Primary keys

Every table needs a stable, unique identifier — a primary key (PK). Options:

Surrogate key — an auto-incrementing integer (hydrant_id). Opaque but never changes.
Natural key — a meaningful identifier (OSM ID, parcel number). Changes when the real-world reference changes.
UUID — 128-bit random ID. Useful across systems; not human-readable.

Prefer a surrogate PK in analytical data, with the natural key as a separate indexed field.

Foreign keys

A foreign key (FK) links rows in one table to rows in another. A maintenance_log table might have a hydrant_id FK pointing back to the hydrant it describes. In a database, you can enforce referential integrity (no orphan logs).

Units — document them

The field area is ambiguous. The field area_m2 isn't. Bake units into column names or metadata. Temperature in Celsius? Fahrenheit? Kelvin? "temp" loses this; "temp_c" keeps it.

Some organisations put units in parentheses: flow (gpm). Either works; what matters is that they're explicit.

Handling missing data

Two common representations:

NULL — "value unknown". Most databases support it; be careful — NULL != NULL in SQL.
Sentinel — a reserved value (e.g., -9999, "N/A"). Older systems use this; it's error-prone (do you treat -9999 as data?).

Document missingness explicitly. When computing summaries, decide whether to drop, impute, or explicitly include missing.

Indexes

For attribute columns you query often (city, date, status), create a database index. Index strategies:

B-tree — default; supports equality and range queries.
Hash — equality only; rare in spatial GIS.
Unique — enforces uniqueness.
Partial — only indexes rows matching a predicate.

Indexes speed reads at some cost to writes; plan per workload.

Calculated fields

Rather than store area_m2 as a fixed column, some tools let you define a calculated field that recomputes from geometry. Decide per context:

Frequent reads, infrequent geometry changes → store the value.
Live editing or polymorphic geometry → calculate on demand.

Attribute validation

Before you trust a table, run quick checks:

Is every PK unique and non-null?
Are values within expected ranges (ages 0–120, dates between 1900 and today)?
Are referenced FKs present in the parent table?
Are enumerated columns limited to the allowed set?
Are required fields filled?

Tools: pandas, great_expectations, SQL with CHECK constraints.

Pivoting and reshaping

Some analyses need data in a different shape:

Long to wide — one column per category (e.g., population by year becomes pop_2020, pop_2025).
Wide to long — collapse many columns into key/value pairs.

pandas.melt, pivot_table, and equivalent SQL constructs handle both.

Attributes matter for symbology

Thematic maps depend on attributes:

Sequential ramps need a numeric attribute.
Qualitative palettes need categorical strings.
Classification methods need distribution-friendly values.

Tidy data makes tidy maps.

Self-check exercises

1. A shapefile stores dates as strings "YYYYMMDD". What's the risk, and the fix?

Sorting, comparison, and date arithmetic won't work correctly — "20250301" sorts after "20251001" numerically but alphabetically it sorts correctly; on the other hand, operations like "subtract 30 days" require a real date type. Fix by casting to date on import (pandas pd.to_datetime; PostGIS ::date) and storing as a proper date column going forward.

2. Why prefer a surrogate integer PK over a natural key like OSM ID?

OSM IDs change when features are split, merged, or re-entered. A surrogate integer never changes, so downstream joins and references stay valid. Keep OSM ID as a separate indexed column for traceability, but reference rows via the surrogate key internally.

3. Your dataset has a column `temp` with values from -40 to 120. Celsius or Fahrenheit?

Ambiguous — that's the point. A range spanning -40 to 120 is plausible for both (Fahrenheit would cover most weather; Celsius extremes would need an exceptional context). Always name columns with units: temp_c or temp_f. If you inherited this ambiguity, dig into metadata and clarify before using the data.

Summary

Attribute tables are half of vector GIS.
Choose types deliberately, document units, use primary keys.
Validate before analysing; index before scaling.
Clean attribute tables make cleaner maps and cleaner analyses.