Module 20: Data Quality, Ethics & Careers

20.2 Metadata Standards

The data about your data — why it matters and how to produce it.

Lesson 96 of 100·12 min read

Key takeaways

  • Metadata describes a dataset's source, vintage, coverage, accuracy, and licence.
  • Standards (ISO 19115, FGDC, INSPIRE) make metadata machine-readable.
  • STAC, schema.org, and DCAT are modern, web-friendly alternatives.

Introduction

A dataset without metadata is a dataset you can't trust. The key facts — where it came from, when, by whom, under what licence, at what accuracy — are exactly what gets lost when files are passed around without documentation. This lesson covers the standards and modern practice.

What goes in metadata

Essential:

  • Title and description.
  • Author / publisher.
  • Vintage (acquisition / publication date).
  • Spatial extent (bounding box or polygon).
  • Temporal extent if time-series.
  • CRS.
  • Spatial resolution / scale.
  • Lineage — how the data was produced.
  • Accuracy — positional and thematic.
  • Licence.
  • Contact for queries.

Nice to have:

  • Keywords / taxonomies.
  • Supplemental documentation URL.
  • Version history.
  • Known issues.

Formal standards

ISO 19115 / 19115-2

International metadata standard for geographic information. Comprehensive, verbose — 400+ elements. Required for many government portals.

FGDC Content Standard for Digital Geospatial Metadata (CSDGM)

Older US federal standard. Being replaced by ISO 19115-2 but still widespread.

INSPIRE

EU directive mandating ISO-compatible metadata across member states' environmental data.

Dublin Core

Simpler, generic metadata standard — 15 core elements (title, creator, subject, etc.). Good for lightweight cataloguing.

Modern web-friendly standards

STAC — SpatioTemporal Asset Catalog

JSON-based metadata for Earth-observation data. Successors of static ISO XML files for cloud-native archives.

JSON
1{
2  "type": "Feature",
3  "id": "S2A_MSIL2A_20240715",
4  "properties": {
5    "datetime": "2024-07-15T10:32:21Z",
6    "eo:cloud_cover": 5.2,
7    "proj:epsg": 32632
8  },
9  "geometry": {...},
10  "assets": {
11    "B04": {"href": "https://...B04.tif", "type": "image/tiff"},
12    "B08": {"href": "https://...B08.tif", "type": "image/tiff"}
13  }
14}

Enables API-based discovery across archives (Microsoft Planetary Computer, Element84, AWS Open Data).

DCAT — Data Catalog Vocabulary

W3C RDF-based standard for describing datasets on the web. Used by many open-data portals (data.gov, EU Open Data Portal).

schema.org / Dataset

Generic web metadata standard, ingested by Google Dataset Search. Lightweight and widely adopted.

Tools for creating metadata

  • GeoNetwork — open-source metadata catalogue server.
  • pycsw — OGC CSW server in Python.
  • Metador — Tooling for ISO 19115.
  • QGIS — has a Metadata tab per layer.
  • STAC CLI and libraries (pystac, stac-utils).

Metadata quality

  • Human-readable — short descriptions that a non-specialist can understand.
  • Machine-readable — structured fields that tools can query.
  • Discoverable — indexed by search engines / portals.
  • Actionable — includes download URLs, contact info.

Automate where possible — don't rely on humans to write every field. Deriving extent, CRS, and acquisition date from the data itself is standard.

When creating metadata pays off

  • Sharing with colleagues or the public.
  • Required by funders or regulators.
  • Data might outlive its creator's memory (anything over 6 months old).
  • Reuse and integration depend on understanding data's limits.

Self-check exercises

1. Why is STAC replacing ISO 19115 for modern Earth observation archives?

STAC is simpler (JSON vs XML), cloud-friendly (accessible over HTTP), and standardised across providers so clients can query any STAC-compliant archive. ISO 19115 is comprehensive but verbose; STAC captures 95% of what's needed for EO discovery with 20% of the complexity. Both coexist: STAC for active datasets, ISO for formal archives.

2. What's the single most important metadata field?

Vintage / acquisition date, closely followed by CRS and licence. Without a date, users can't judge relevance; without a CRS, they can't integrate; without a licence, they can't use. Title and description are essential but often inferable; these three are harder to reconstruct and cause the biggest downstream problems if missing.

3. Your open-data portal requires ISO 19115 compliance, but you have dozens of datasets. How do you minimise effort?

Auto-generate the bulk of fields — extent from data bounds, CRS from file metadata, format from extension, size from file system. Only manually fill subjective fields (description, keywords, purpose). Tools like GeoNetwork and QGIS's metadata editor pre-populate what they can. Templates per dataset family save repeated typing.

Summary

  • Metadata is non-optional for any shared or long-lived dataset.
  • ISO 19115 is the formal standard; STAC is the modern web standard.
  • Automate generation; write prose for the fields that need human judgment.
  • Machine-readable + human-readable + discoverable + actionable.

Further reading

  • STAC specification (stacspec.org).
  • ISO 19115 overview documents.
  • INSPIRE Metadata Technical Guidance.
  • schema.org Dataset documentation.