8.1 Attribute Tables
The tabular half of every GIS dataset — types, keys, and why discipline here saves analysis time.
Key takeaways
- Every vector feature carries an attribute row; keeping this table clean is half of good GIS.
- Data types (integer, float, string, date, boolean) must match the real semantics of each field.
- Unique identifiers and documented units turn data into a durable asset.
Introduction
Vector data has two halves: geometry and attributes. Geometry gets most of the glamour, but analysts spend as much time on attribute data — cleaning, joining, classifying, aggregating — as they do on geometry. This lesson covers the patterns that make attribute tables a pleasure rather than a pain.
The basics
An attribute table is a table. Each row corresponds to one feature (point, line, polygon). Each column is a property. Columns have:
- A name.
- A data type (integer, float, string, date, boolean, ...).
- Optionally a length or precision.
- Optionally a default value.
- Optionally a constraint (not null, unique).
Example for a hydrant table:
| hydrant_id (int PK) | city (str) | flow_gpm (int) | installed (date) | out_of_service (bool) | geom (point) |
|---|---|---|---|---|---|
| 101 | Copenhagen | 750 | 1998-07-01 | false | POINT(...) |
| 102 | Copenhagen | 600 | 2002-11-12 | false | POINT(...) |
| 103 | Copenhagen | 0 | 1965-03-01 | true | POINT(...) |
Data types — choose deliberately
- Integer for counts, IDs, year-only dates, enumerated categories (as codes).
- Float / double for measurements, averages, ratios.
- String for names, free text. Specify max length where the schema allows.
- Boolean for yes/no flags.
- Date / datetime for dates. Avoid storing "2024-07" as an integer 202407.
- Enum / category for small, fixed sets (e.g., zoning classes).
Over time, the wrong type causes queries to fail or silently slow. A string "07" sorts before "12" correctly but after "10"; as an integer, ordering is correct.
Primary keys
Every table needs a stable, unique identifier — a primary key (PK). Options:
- Surrogate key — an auto-incrementing integer (
hydrant_id). Opaque but never changes. - Natural key — a meaningful identifier (OSM ID, parcel number). Changes when the real-world reference changes.
- UUID — 128-bit random ID. Useful across systems; not human-readable.
Prefer a surrogate PK in analytical data, with the natural key as a separate indexed field.
Foreign keys
A foreign key (FK) links rows in one table to rows in another. A maintenance_log table might have a hydrant_id FK pointing back to the hydrant it describes. In a database, you can enforce referential integrity (no orphan logs).
Units — document them
The field area is ambiguous. The field area_m2 isn't. Bake units into column names or metadata. Temperature in Celsius? Fahrenheit? Kelvin? "temp" loses this; "temp_c" keeps it.
Some organisations put units in parentheses: flow (gpm). Either works; what matters is that they're explicit.
Handling missing data
Two common representations:
- NULL — "value unknown". Most databases support it; be careful —
NULL != NULLin SQL. - Sentinel — a reserved value (e.g., -9999, "N/A"). Older systems use this; it's error-prone (do you treat -9999 as data?).
Document missingness explicitly. When computing summaries, decide whether to drop, impute, or explicitly include missing.
Indexes
For attribute columns you query often (city, date, status), create a database index. Index strategies:
- B-tree — default; supports equality and range queries.
- Hash — equality only; rare in spatial GIS.
- Unique — enforces uniqueness.
- Partial — only indexes rows matching a predicate.
Indexes speed reads at some cost to writes; plan per workload.
Calculated fields
Rather than store area_m2 as a fixed column, some tools let you define a calculated field that recomputes from geometry. Decide per context:
- Frequent reads, infrequent geometry changes → store the value.
- Live editing or polymorphic geometry → calculate on demand.
Attribute validation
Before you trust a table, run quick checks:
- Is every PK unique and non-null?
- Are values within expected ranges (ages 0–120, dates between 1900 and today)?
- Are referenced FKs present in the parent table?
- Are enumerated columns limited to the allowed set?
- Are required fields filled?
Tools: pandas, great_expectations, SQL with CHECK constraints.
Pivoting and reshaping
Some analyses need data in a different shape:
- Long to wide — one column per category (e.g., population by year becomes
pop_2020,pop_2025). - Wide to long — collapse many columns into key/value pairs.
pandas.melt, pivot_table, and equivalent SQL constructs handle both.
Attributes matter for symbology
Thematic maps depend on attributes:
- Sequential ramps need a numeric attribute.
- Qualitative palettes need categorical strings.
- Classification methods need distribution-friendly values.
Tidy data makes tidy maps.
Self-check exercises
1. A shapefile stores dates as strings "YYYYMMDD". What's the risk, and the fix?
Sorting, comparison, and date arithmetic won't work correctly — "20250301" sorts after "20251001" numerically but alphabetically it sorts correctly; on the other hand, operations like "subtract 30 days" require a real date type. Fix by casting to date on import (pandas pd.to_datetime; PostGIS ::date) and storing as a proper date column going forward.
2. Why prefer a surrogate integer PK over a natural key like OSM ID?
OSM IDs change when features are split, merged, or re-entered. A surrogate integer never changes, so downstream joins and references stay valid. Keep OSM ID as a separate indexed column for traceability, but reference rows via the surrogate key internally.
3. Your dataset has a column `temp` with values from -40 to 120. Celsius or Fahrenheit?
Ambiguous — that's the point. A range spanning -40 to 120 is plausible for both (Fahrenheit would cover most weather; Celsius extremes would need an exceptional context). Always name columns with units: temp_c or temp_f. If you inherited this ambiguity, dig into metadata and clarify before using the data.
Summary
- Attribute tables are half of vector GIS.
- Choose types deliberately, document units, use primary keys.
- Validate before analysing; index before scaling.
- Clean attribute tables make cleaner maps and cleaner analyses.
Further reading
- Wickham, H. — Tidy Data (Journal of Statistical Software).
- PostGIS documentation — indexing and constraints.
- pandas documentation — dtypes, missing data.
- SQL Antipatterns (Karwin) — practical database design lessons.