12.3 Address Parsing and Standardisation

Key takeaways

Address parsing decomposes a free-text address into structured components.

Standardisation normalises spelling, abbreviations, and formats.

Tools like libpostal, Pelias, and national registries are the professional choice.

Introduction

A CSV of addresses you receive from a client is almost guaranteed to be inconsistent. Before geocoding, you normalise. This short but high-impact lesson covers parsing and standardisation — the boring discipline that makes geocoding actually work.

What changes

Starting point:

Code

1600 pennsylvania ave nw wash dc 20500
1600 PENNSYLVANIA AVENUE N.W., WASHINGTON, D.C.
Sixteen hundred penn. ave NW, dist of columbia

After parsing and normalisation:

JSON

{
  "house_number": "1600",
  "road": "Pennsylvania Avenue NW",
  "city": "Washington",
  "state": "District of Columbia",
  "postcode": "20500",
  "country": "United States"
}

All three variants resolve to the same structured record, which removes a source of error from every downstream step.

Libpostal

libpostal (Mapzen, open source) is the most robust open address parser. It handles:

60+ languages.
Non-Latin scripts.
Common abbreviations.
Tokenisation.

Python

from postal.parser import parse_address
from postal.expand import expand_address

parse_address("1600 Pennsylvania Ave NW, Washington DC 20500")
# [('1600', 'house_number'),
#  ('pennsylvania ave nw', 'road'),
#  ('washington', 'city'),
#  ('dc', 'state'),
#  ('20500', 'postcode')]

expand_address("1600 Penn Ave NW, Washington DC")
# ['1600 pennsylvania avenue northwest washington district of columbia', ...]

libpostal is written in C; the Python bindings (the postal package) wrap it. Installing it means building the C library and downloading its trained model data, roughly 1 GB.
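Because parse_address returns (value, label) pairs, most pipelines turn them into a label-keyed mapping before doing anything else. The following is a minimal sketch of that step. It uses a hard-coded copy of the output shown above so that it runs without libpostal installed. The helper name components_from_pairs is hypothetical, and joining repeated labels with a space is an illustrative assumption, not libpostal behaviour.

```python
def components_from_pairs(pairs):
    """Turn [(value, label), ...] into {label: value}.

    If a label occurs more than once, the values are joined with a space
    (an illustrative choice, not something libpostal does for you).
    """
    components = {}
    for value, label in pairs:
        if label in components:
            components[label] = components[label] + " " + value
        else:
            components[label] = value
    return components


# Hard-coded copy of the parser output shown earlier in this section.
parsed_pairs = [
    ("1600", "house_number"),
    ("pennsylvania ave nw", "road"),
    ("washington", "city"),
    ("dc", "state"),
    ("20500", "postcode"),
]

components = components_from_pairs(parsed_pairs)
print(components["road"])
```

Keying by label rather than by value matters: calling dict() directly on the pairs would produce a mapping from values to labels, and lookups such as components.get("road") would return nothing.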

Specialist alternatives

Pelias — open-source geocoder with integrated parsing.
SmartyStreets (now Smarty) — US-focused commercial service.
AddressBase Premium (UK) — comes with official parsed addresses.
DAWA (Denmark) — Danish address system with a REST API.
Geocode Earth — managed Pelias.

For US work, the USPS ZIP+4 database plus a commercial cleaner (SmartyStreets, Loqate) often outperforms generic libraries.

National formats

Address formats vary by country:

Country	Format
USA	Number Street, City State ZIP
UK	Number Street, Locality, Town, County POSTCODE
France	Number Street, POSTCODE City
Japan	Prefecture, City, District, Block, House
Denmark	Road Number, POSTCODE City

A single parser that handles all formats (libpostal) is convenient but not always best. Country-specific systems often outperform generic parsers within their domain.
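To make the ordering differences in the table concrete, here is a small sketch that renders one structured record in three of the layouts. The TEMPLATES mapping and the format_address helper are hypothetical names written for this illustration, and the templates are simplified from the table rather than taken from any postal standard. The same record is reused for every country only to show how field order changes.

```python
# Hypothetical templates, simplified from the table above.
TEMPLATES = {
    "US": "{house_number} {road}, {city} {state} {postcode}",
    "FR": "{house_number} {road}, {postcode} {city}",
    "DK": "{road} {house_number}, {postcode} {city}",
}


def format_address(components, country_code):
    """Render parsed components using the layout for one country."""
    template = TEMPLATES[country_code]
    # str.format ignores keys the template does not use (e.g. 'state' for FR).
    return template.format(**components)


record = {
    "house_number": "1600",
    "road": "Pennsylvania Avenue NW",
    "city": "Washington",
    "state": "District of Columbia",
    "postcode": "20500",
}

for code in ("US", "FR", "DK"):
    print(code, "->", format_address(record, code))
```

A record with a missing field raises KeyError here, which is deliberate in a sketch: it surfaces incomplete rows instead of emitting a malformed address.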

Common errors

Directional confusion — "NW" after the street vs "N" before.
Suite / unit suffixes — "Apt 5", "Ste 300", "#4B".
Typos — "pennsilvania", "wasington".
Missing fields — no postcode or no state.
Nonstandard punctuation — commas, semicolons, unusual separators.
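Two of these problems, unit suffixes and unusual separators, can be reduced with a light pre-cleaning pass before the text reaches a parser. The sketch below is one possible approach, not a replacement for a trained parser. The function name preclean and the short list of unit designators are assumptions chosen to cover the examples above; a production list would be longer and country-specific.

```python
import re

# Unit designators from the examples above, with an optional '#'.
# The trailing \b keeps names such as "Stewart" from matching "ste".
UNIT_PATTERN = re.compile(
    r"(?:\b(?:apt|ste|suite|unit)\b\.?\s*#?|#)\s*(\w+)",
    re.IGNORECASE,
)


def preclean(raw):
    """Return (cleaned_address, unit) for one raw address string."""
    text = raw.strip()
    unit = None
    match = UNIT_PATTERN.search(text)
    if match:
        unit = match.group(1)
        # Cut the unit out of the address text but keep its value.
        text = text[:match.start()] + text[match.end():]
    # Treat semicolons and pipes as commas, then collapse repeated commas.
    text = re.sub(r"[;|]", ",", text)
    text = re.sub(r"\s*,(\s*,)*\s*", ", ", text)
    # Collapse runs of whitespace and trim stray separators at the ends.
    text = re.sub(r"\s+", " ", text).strip(" ,")
    return text, unit


print(preclean("123 Main St Apt 4B; Springfield IL 62704"))
```

Returning the unit separately, rather than discarding it, keeps the information available for delivery and display.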

Building a cleaning pipeline

Python

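import pandas as pd
from postal.parser import parse_address

def clean_row(row):
    # parse_address returns (value, label) pairs, so key the dict by label
    parsed = {label: value for value, label in parse_address(row['raw_address'])}
    return pd.Series({
        'house_number': parsed.get('house_number'),
        'road': parsed.get('road'),
        'city': parsed.get('city'),
        'state': parsed.get('state'),
        'postcode': parsed.get('postcode'),
        # Falls back to 'US' when no country is parsed; adjust for your data
        'country': parsed.get('country', 'US'),
    })

# Usage, assuming df has a 'raw_address' column:
# cleaned = df.apply(clean_row, axis=1)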

Before geocoding, inspect the cleaned DataFrame: are there rows with a missing city or postcode? Those rows will fail later, so it is better to catch them now.
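The inspection step can be as simple as listing which required fields each record lacks. The sketch below uses plain dictionaries so that it runs without pandas or libpostal. The names REQUIRED_FIELDS and missing_fields are hypothetical, and the sample records are illustrative rather than real parser output.

```python
REQUIRED_FIELDS = ("city", "postcode")  # the two fields named above


def missing_fields(record, required=REQUIRED_FIELDS):
    """Return the required fields that are absent, None, or blank."""
    return [f for f in required if not str(record.get(f) or "").strip()]


# Illustrative records: one complete, one with a null, one with gaps.
records = [
    {"house_number": "1600", "road": "pennsylvania ave nw",
     "city": "washington", "postcode": "20500"},
    {"house_number": "1600", "road": "pennsylvania ave nw",
     "city": "washington", "postcode": None},
    {"road": "main st", "postcode": "  "},
]

for i, rec in enumerate(records):
    gaps = missing_fields(rec)
    if gaps:
        print(f"row {i}: missing {', '.join(gaps)}")
```

With a DataFrame, a comparable check is to call isna() on the city and postcode columns and filter on the result.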

Fuzzy matching

For input with typos, a Levenshtein or Jaro-Winkler match against a reference table can fix many errors before geocoding:

Python

from rapidfuzz import process

# row is one record; reference_cities is a list of known-good city names
best = process.extractOne(row['city'], reference_cities)
# returns (matched_value, score, index)

Use a confidence threshold (e.g., score > 85) for auto-accept; below that, flag for review.
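The accept-or-review rule can be sketched without any third-party dependency. The code below is a pure-Python stand-in, not the rapidfuzz implementation: it scores candidates with a normalised Levenshtein distance on a 0 to 100 scale, so its numbers will not necessarily match what rapidfuzz reports. The names levenshtein, similarity, and match_city, and the short reference list, are assumptions for illustration.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """Edit distance rescaled to 0-100, where 100 means identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)


def match_city(raw, reference, threshold=85):
    """Return (best_match, score, decision) for one raw city string."""
    cleaned = raw.strip().lower()
    best, best_score = None, -1.0
    for candidate in reference:
        score = similarity(cleaned, candidate.lower())
        if score > best_score:
            best, best_score = candidate, score
    decision = "accept" if best_score > threshold else "review"
    return best, best_score, decision


reference_cities = ["Washington", "Wilmington", "Arlington"]
print(match_city("wasington", reference_cities))
print(match_city("Wash", reference_cities))
```

The typo "wasington" is one edit away from "washington" and clears the threshold, while the abbreviation "Wash" scores far lower and is routed to review, which is the behaviour the threshold is meant to produce.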

After parsing: deduplication

Many datasets have the same address written multiple ways. Use the parsed, normalised form as a dedup key:

Python

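# Missing parts become empty strings so one NaN does not blank the whole key
df['norm_key'] = df['road'].fillna('') + '|' + df['house_number'].fillna('') + '|' + df['postcode'].fillna('')
# drop_duplicates returns a new DataFrame, so assign the result
df = df.drop_duplicates(subset='norm_key')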

The payoff

Clean addresses don't just geocode better — they also join correctly with other systems (customer databases, billing, property records). Investing in parsing once pays off across every downstream use.

Self-check exercises

1. Why can't you just use a regex to parse addresses?

Addresses are too variable — spelling, order, language, abbreviations, and edge cases — for a regex to handle reliably. Machine-learned parsers like libpostal use statistical sequence models trained on millions of real addresses and perform dramatically better, especially across countries and languages.

2. Your input is "123 Main St Apt 4B, Springfield IL 62704". Which field does libpostal typically put "Apt 4B" in?

libpostal typically assigns it to a unit field (or house in some versions), separate from the street address. Downstream geocoders generally ignore unit numbers when computing coordinates but may retain them for display. Keep them in your schema: they matter for delivery and postal mail.

3. A customer's "city" field is "New York" but their postcode is 08003 (a NJ ZIP). What should you do?

Flag the inconsistency. Either the city is wrong (they're in NJ, not NYC) or the postcode is a typo. Cross-checking city + postcode / region against an authoritative table catches many entry errors; otherwise geocoding will silently pick one and you'll get wrong coordinates. Manual review or user confirmation is the right response.
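A cross-check of this kind can be sketched as a lookup against a reference table. The table below is a tiny hand-written sample standing in for an authoritative postcode dataset, and the names POSTCODE_TABLE and check_city_postcode are hypothetical.

```python
# Hand-written sample: postcode -> (city, state). A real check would load
# an authoritative postcode table instead of this three-row stand-in.
POSTCODE_TABLE = {
    "08003": ("Cherry Hill", "NJ"),
    "10001": ("New York", "NY"),
    "20500": ("Washington", "DC"),
}


def check_city_postcode(city, postcode, table=POSTCODE_TABLE):
    """Return 'ok', 'mismatch', or 'unknown_postcode' for one record."""
    expected = table.get(postcode.strip())
    if expected is None:
        return "unknown_postcode"
    expected_city, _state = expected
    if city.strip().lower() == expected_city.lower():
        return "ok"
    return "mismatch"


print(check_city_postcode("New York", "08003"))
print(check_city_postcode("New York", "10001"))
```

Records that come back as mismatch or unknown_postcode are the ones to route to manual review or user confirmation rather than straight to the geocoder.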

Summary

Parse and normalise addresses before geocoding.
libpostal is the open-source default; national systems often beat it in-country.
Handle spelling, abbreviation, punctuation, and format variation.
Use parsed fields for dedup, joins, and downstream analysis.
