Module 8: Attribute Data & Databases

8.1 Attribute Tables

The tabular half of every GIS dataset — types, keys, and why discipline here saves analysis time.

Lesson 38 of 100·15 min read

Key takeaways

  • Every vector feature carries an attribute row; keeping this table clean is half of good GIS.
  • Data types (integer, float, string, date, boolean) must match the real semantics of each field.
  • Unique identifiers and documented units turn data into a durable asset.

Introduction

Vector data has two halves: geometry and attributes. Geometry gets most of the glamour, but analysts spend as much time on attribute data — cleaning, joining, classifying, aggregating — as they do on geometry. This lesson covers the patterns that make attribute tables a pleasure rather than a pain.

The basics

An attribute table is a table. Each row corresponds to one feature (point, line, polygon). Each column is a property. Columns have:

  • A name.
  • A data type (integer, float, string, date, boolean, ...).
  • Optionally a length or precision.
  • Optionally a default value.
  • Optionally a constraint (not null, unique).

Example for a hydrant table:

hydrant_id (int PK)city (str)flow_gpm (int)installed (date)out_of_service (bool)geom (point)
101Copenhagen7501998-07-01falsePOINT(...)
102Copenhagen6002002-11-12falsePOINT(...)
103Copenhagen01965-03-01truePOINT(...)

Data types — choose deliberately

  • Integer for counts, IDs, year-only dates, enumerated categories (as codes).
  • Float / double for measurements, averages, ratios.
  • String for names, free text. Specify max length where the schema allows.
  • Boolean for yes/no flags.
  • Date / datetime for dates. Avoid storing "2024-07" as an integer 202407.
  • Enum / category for small, fixed sets (e.g., zoning classes).

Over time, the wrong type causes queries to fail or silently slow. A string "07" sorts before "12" correctly but after "10"; as an integer, ordering is correct.

Primary keys

Every table needs a stable, unique identifier — a primary key (PK). Options:

  • Surrogate key — an auto-incrementing integer (hydrant_id). Opaque but never changes.
  • Natural key — a meaningful identifier (OSM ID, parcel number). Changes when the real-world reference changes.
  • UUID — 128-bit random ID. Useful across systems; not human-readable.

Prefer a surrogate PK in analytical data, with the natural key as a separate indexed field.

Foreign keys

A foreign key (FK) links rows in one table to rows in another. A maintenance_log table might have a hydrant_id FK pointing back to the hydrant it describes. In a database, you can enforce referential integrity (no orphan logs).

Units — document them

The field area is ambiguous. The field area_m2 isn't. Bake units into column names or metadata. Temperature in Celsius? Fahrenheit? Kelvin? "temp" loses this; "temp_c" keeps it.

Some organisations put units in parentheses: flow (gpm). Either works; what matters is that they're explicit.

Handling missing data

Two common representations:

  • NULL — "value unknown". Most databases support it; be careful — NULL != NULL in SQL.
  • Sentinel — a reserved value (e.g., -9999, "N/A"). Older systems use this; it's error-prone (do you treat -9999 as data?).

Document missingness explicitly. When computing summaries, decide whether to drop, impute, or explicitly include missing.

Indexes

For attribute columns you query often (city, date, status), create a database index. Index strategies:

  • B-tree — default; supports equality and range queries.
  • Hash — equality only; rare in spatial GIS.
  • Unique — enforces uniqueness.
  • Partial — only indexes rows matching a predicate.

Indexes speed reads at some cost to writes; plan per workload.

Calculated fields

Rather than store area_m2 as a fixed column, some tools let you define a calculated field that recomputes from geometry. Decide per context:

  • Frequent reads, infrequent geometry changes → store the value.
  • Live editing or polymorphic geometry → calculate on demand.

Attribute validation

Before you trust a table, run quick checks:

  • Is every PK unique and non-null?
  • Are values within expected ranges (ages 0–120, dates between 1900 and today)?
  • Are referenced FKs present in the parent table?
  • Are enumerated columns limited to the allowed set?
  • Are required fields filled?

Tools: pandas, great_expectations, SQL with CHECK constraints.

Pivoting and reshaping

Some analyses need data in a different shape:

  • Long to wide — one column per category (e.g., population by year becomes pop_2020, pop_2025).
  • Wide to long — collapse many columns into key/value pairs.

pandas.melt, pivot_table, and equivalent SQL constructs handle both.

Attributes matter for symbology

Thematic maps depend on attributes:

  • Sequential ramps need a numeric attribute.
  • Qualitative palettes need categorical strings.
  • Classification methods need distribution-friendly values.

Tidy data makes tidy maps.

Self-check exercises

1. A shapefile stores dates as strings "YYYYMMDD". What's the risk, and the fix?

Sorting, comparison, and date arithmetic won't work correctly — "20250301" sorts after "20251001" numerically but alphabetically it sorts correctly; on the other hand, operations like "subtract 30 days" require a real date type. Fix by casting to date on import (pandas pd.to_datetime; PostGIS ::date) and storing as a proper date column going forward.

2. Why prefer a surrogate integer PK over a natural key like OSM ID?

OSM IDs change when features are split, merged, or re-entered. A surrogate integer never changes, so downstream joins and references stay valid. Keep OSM ID as a separate indexed column for traceability, but reference rows via the surrogate key internally.

3. Your dataset has a column `temp` with values from -40 to 120. Celsius or Fahrenheit?

Ambiguous — that's the point. A range spanning -40 to 120 is plausible for both (Fahrenheit would cover most weather; Celsius extremes would need an exceptional context). Always name columns with units: temp_c or temp_f. If you inherited this ambiguity, dig into metadata and clarify before using the data.

Summary

  • Attribute tables are half of vector GIS.
  • Choose types deliberately, document units, use primary keys.
  • Validate before analysing; index before scaling.
  • Clean attribute tables make cleaner maps and cleaner analyses.

Further reading

  • Wickham, H. — Tidy Data (Journal of Statistical Software).
  • PostGIS documentation — indexing and constraints.
  • pandas documentation — dtypes, missing data.
  • SQL Antipatterns (Karwin) — practical database design lessons.