12.3 Address Parsing and Standardisation
Why every serious geocoding pipeline starts with cleaning — and the tools that do it.
Key takeaways
- Address parsing decomposes a free-text address into structured components.
- Standardisation normalises spelling, abbreviations, and formats.
- Tools like libpostal, Pelias, and national registries are the professional choice.
Introduction
A CSV of addresses you receive from a client is almost guaranteed to be inconsistent. Before geocoding, you normalise. This short but high-impact lesson covers parsing and standardisation — the boring discipline that makes geocoding actually work.
What changes
Starting point:

    1600 pennsylvania ave nw wash dc 20500
    1600 PENNSYLVANIA AVENUE N.W., WASHINGTON, D.C.
    Sixteen hundred penn. ave NW, dist of columbia

After parsing and normalisation:

    {
      "house_number": "1600",
      "road": "Pennsylvania Avenue NW",
      "city": "Washington",
      "state": "District of Columbia",
      "postcode": "20500",
      "country": "United States"
    }

Same structured output: one less headache for every downstream step.
Libpostal
libpostal (Mapzen, open source) is the most robust open address parser. It handles:
- 60+ languages.
- Non-Latin scripts.
- Common abbreviations.
- Tokenisation.
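
    from postal.parser import parse_address
    from postal.expand import expand_address

    raw = "1600 pennsylvania ave nw wash dc 20500"

    # parse_address returns a list of (value, label) tuples,
    # with labels such as "house_number", "road", "city", "postcode".
    for value, label in parse_address(raw):
        print(f"{label}: {value}")

    # expand_address returns a list of normalised variants of the input string.
    print(expand_address(raw))

libpostal is C-based; the Python bindings (the `postal` package) wrap it. Installing it requires a ~1 GB model download.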
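Downstream code usually wants the parser's output keyed by label rather than as a list of pairs. The following is a minimal sketch, assuming the parser returns `(value, label)` tuples in the way `parse_address` does. The helper name `components_to_dict` is hypothetical, and the sample pairs are written by hand to show the shape; they are not actual libpostal output.

```python
from collections import defaultdict

def components_to_dict(pairs):
    """Collapse (value, label) pairs into a {label: value} dict.

    Repeated labels are joined with a space, in the order they appear.
    """
    grouped = defaultdict(list)
    for value, label in pairs:
        grouped[label].append(value)
    return {label: " ".join(values) for label, values in grouped.items()}

# Hand-written sample in the (value, label) shape; not real parser output.
sample_pairs = [
    ("1600", "house_number"),
    ("pennsylvania ave nw", "road"),
    ("washington", "city"),
    ("dc", "state"),
    ("20500", "postcode"),
]

record = components_to_dict(sample_pairs)
print(record["road"])
```

Joining repeated labels is one possible choice; keeping a list per label would preserve more information if a parser splits a field into several tokens.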
Specialist alternatives
- Pelias: geocoder with integrated parsing.
- SmartyStreets (now Smarty): US-focused commercial service.
- AddressBase Premium (UK): comes with official parsed addresses.
- DAWA (Denmark): Danish address system with a REST API.
- Geocode Earth: managed Pelias.
For US work, the USPS ZIP+4 database plus a commercial cleaner (SmartyStreets, Loqate) often outperforms generic libraries.
National formats
Address formats vary by country:
| Country | Format |
|---|---|
| USA | Number Street, City State ZIP |
| UK | Number Street, Locality, Town, County POSTCODE |
| France | Number Street, POSTCODE City |
| Japan | Prefecture, City, District, Block, House |
| Denmark | Road Number, POSTCODE City |
A single parser that handles all formats (libpostal) is convenient but not always best. Country-specific systems often outperform generic parsers within their domain.
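The field orders in the table above can be written as per-country templates, which is also a handy way to render parsed components back into a display string. This is a rough sketch only: the templates are simplified from the table, the `TEMPLATES` mapping and `format_address` helper are hypothetical names, and the Danish sample address is made up.

```python
# Simplified line templates taken from the table above; real postal
# formats have more optional parts (units, localities, counties).
TEMPLATES = {
    "US": "{house_number} {road}, {city} {state} {postcode}",
    "FR": "{house_number} {road}, {postcode} {city}",
    "DK": "{road} {house_number}, {postcode} {city}",
}

def format_address(components, country_code):
    """Render parsed components in the given country's field order."""
    return TEMPLATES[country_code].format(**components)

us = {"house_number": "1600", "road": "Pennsylvania Avenue NW",
      "city": "Washington", "state": "DC", "postcode": "20500"}
# Fictitious Danish address, used only to show the road-before-number order.
dk = {"house_number": "12", "road": "Eksempelvej",
      "city": "Eksempelby", "postcode": "1234"}

print(format_address(us, "US"))
print(format_address(dk, "DK"))
```

Note how the house number moves from before the road (US, France) to after it (Denmark); a parser tuned to one order can mislabel the other.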
Common errors
- Directional confusion — "NW" after the street vs "N" before.
- Suite / unit suffixes — "Apt 5", "Ste 300", "#4B".
- Typos — "pennsilvania", "wasington".
- Missing fields — no postcode or no state.
- Nonstandard punctuation — commas, semicolons, unusual separators.
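Some of these errors can be reduced with a light pre-cleaning pass before the address reaches a parser. The sketch below is an assumption about what such a pass might look like, not a replacement for a real parser: it only splits off the three unit styles listed above and tidies separators. The `preclean` name and the regular expression are illustrative.

```python
import re

# Unit designators from the list above: "Apt 5", "Ste 300", "#4B".
# The \b after the keyword stops "Ste" matching inside words like "Steve".
UNIT_RE = re.compile(r"(?:\b(?:apt|ste|suite|unit)\b\.?\s*|#\s*)(\w+)", re.IGNORECASE)

def preclean(raw):
    """Return (cleaned_address, unit) with the unit designator split off."""
    unit = None
    match = UNIT_RE.search(raw)
    if match:
        unit = match.group(1)
        raw = raw[:match.start()] + raw[match.end():]
    text = re.sub(r"[;|]", ",", raw)          # unusual separators -> commas
    text = re.sub(r"\s*,\s*", ", ", text)     # uniform spacing around commas
    text = re.sub(r"(, )+", ", ", text)       # collapse empty segments
    text = re.sub(r"\s+", " ", text).strip(" ,")
    return text, unit

print(preclean("123 Main St Apt 4B, Springfield IL 62704"))
print(preclean("500 Market St; Ste 300 ; San Francisco"))
```

Keeping the unit as a separate value means it is not lost when the remaining string is handed to the parser or geocoder.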
Building a cleaning pipeline
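
    import pandas as pd
    from postal.parser import parse_address

    df = pd.read_csv("addresses.csv")  # illustrative file with a free-text "address" column
    # parse_address yields (value, label) pairs; flip each row into a label -> value dict
    parsed = df["address"].apply(lambda a: {label: value for value, label in parse_address(a)})
    df = df.join(pd.DataFrame(parsed.tolist(), index=df.index))

Before geocoding, inspect the cleaned DataFrame: are there rows with a missing city or postcode? Those will fail later, so it is better to catch them now.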
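The completeness check described above can be sketched in plain Python over a list of parsed records; in pandas the equivalent would be a boolean mask built from `isna()`. The `REQUIRED` tuple is an assumed minimum set of fields, and the sample records are made up for illustration.

```python
REQUIRED = ("road", "city", "postcode")  # assumed minimum for geocoding

def missing_fields(record, required=REQUIRED):
    """Return the required labels that are absent or blank in a parsed record."""
    return [field for field in required if not str(record.get(field) or "").strip()]

# Illustrative parsed records: complete, no postcode, blank city.
records = [
    {"house_number": "1600", "road": "pennsylvania avenue nw",
     "city": "washington", "postcode": "20500"},
    {"house_number": "1600", "road": "pennsylvania avenue nw",
     "city": "washington"},
    {"road": "main st", "city": " ", "postcode": "62704"},
]

for i, rec in enumerate(records):
    gaps = missing_fields(rec)
    if gaps:
        print(f"row {i}: missing {', '.join(gaps)}")
```

Rows reported here can be routed to manual review or enrichment instead of being sent to the geocoder.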
Fuzzy matching
For input with typos, a Levenshtein or Jaro-Winkler match against a reference table can fix many errors before geocoding:

    from rapidfuzz import process

    # The default scorer is WRatio (0-100); pass scorer=... to use another metric.
    best = process.extractOne(row['city'], reference_cities)
    # returns (matched_value, score, index)

Use a confidence threshold (e.g., score > 85) for auto-accept; below that, flag for review.
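To make the threshold idea concrete without any third-party dependency, here is a small sketch of a Levenshtein-based match with auto-accept. It is not how rapidfuzz scores matches: the normalisation to a 0-100 scale is an assumption made for this example, so scores are not directly comparable with rapidfuzz's default scorer. The function names and the reference list are illustrative.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def similarity(a, b):
    """Similarity on a 0-100 scale (100 = identical), based on edit distance."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (longest - levenshtein(a, b)) / longest

def match_city(value, reference, threshold=85):
    """Return (best_match, score, accepted) against a reference list."""
    cleaned = value.strip().casefold()
    best = max(reference, key=lambda ref: similarity(cleaned, ref.casefold()))
    score = similarity(cleaned, best.casefold())
    return best, score, score > threshold

reference_cities = ["Washington", "Springfield", "New York"]  # illustrative
print(match_city("wasington", reference_cities))
```

One missing letter in a ten-letter name still clears the threshold, while an unrelated string falls well below it and would be flagged for review.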
After parsing: deduplication
Many datasets have the same address written multiple ways. Use the parsed, normalised form as a dedup key:

    df['norm_key'] = df['road'] + '|' + df['house_number'] + '|' + df['postcode']
    df = df.drop_duplicates('norm_key')  # drop_duplicates returns a new DataFrame

The payoff
Clean addresses don't just geocode better — they also join correctly with other systems (customer databases, billing, property records). Investing in parsing once pays off across every downstream use.
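As a sketch of that payoff, the same normalised key used for deduplication can drive a join between two systems that spell an address differently. Everything here is hypothetical: the `norm_key` helper, the `customers` and `billing` records, and their field names are invented for illustration.

```python
def norm_key(record):
    """Build a join/dedup key from parsed fields, ignoring case and spacing."""
    parts = (record.get("road", ""),
             record.get("house_number", ""),
             record.get("postcode", ""))
    return "|".join(" ".join(str(p).split()).casefold() for p in parts)

# Two hypothetical systems holding the same address in different spellings.
customers = [{"customer_id": 1, "road": "Pennsylvania Avenue NW",
              "house_number": "1600", "postcode": "20500"}]
billing = [{"invoice": "INV-7", "road": "pennsylvania  avenue nw",
            "house_number": "1600", "postcode": "20500 "}]

# Index one side by key, then look up each record from the other side.
billing_by_key = {norm_key(b): b for b in billing}
joined = [
    {**c, "invoice": billing_by_key[norm_key(c)]["invoice"]}
    for c in customers
    if norm_key(c) in billing_by_key
]
print(joined)
```

A join on the raw strings would have missed this pair because of the casing and stray whitespace.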
Self-check exercises
1. Why can't you just use a regex to parse addresses?
Addresses are too variable — spelling, order, language, abbreviations, and edge cases — for a regex to handle reliably. Machine-learned parsers like libpostal use statistical sequence models trained on millions of real addresses and perform dramatically better, especially across countries and languages.
2. Your input is "123 Main St Apt 4B, Springfield IL 62704". Which field does libpostal typically put "Apt 4B" in?
libpostal assigns it to a unit field (or house in some versions), separate from the street address. Most downstream geocoders ignore unit numbers when computing coordinates but may retain them for display. Keep them in your schema: they matter for delivery and postal mail.
3. A customer's "city" field is "New York" but their postcode is 08003 (a NJ ZIP). What should you do?
Flag the inconsistency. Either the city is wrong (they're in NJ, not NYC) or the postcode is a typo. Cross-checking city + postcode / region against an authoritative table catches many entry errors; otherwise geocoding will silently pick one and you'll get wrong coordinates. Manual review or user confirmation is the right response.
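The cross-check from exercise 3 can be sketched as a lookup against a reference table. The two dictionaries below are tiny illustrative stand-ins for an authoritative postcode table, and `check_city_postcode` is a hypothetical helper name; a real pipeline would load full reference data rather than hard-code entries.

```python
# Illustrative lookups only: ZIP prefix -> state, and city -> state.
ZIP_PREFIX_TO_STATE = {"080": "NJ", "100": "NY", "205": "DC"}
CITY_TO_STATE = {"new york": "NY", "washington": "DC"}

def check_city_postcode(city, postcode):
    """Return 'ok', 'mismatch', or 'unknown' for a city/ZIP pair."""
    zip_state = ZIP_PREFIX_TO_STATE.get(postcode.strip()[:3])
    city_state = CITY_TO_STATE.get(city.strip().casefold())
    if zip_state is None or city_state is None:
        return "unknown"  # not in the reference table, so no judgement
    return "ok" if zip_state == city_state else "mismatch"

print(check_city_postcode("New York", "08003"))
print(check_city_postcode("Washington", "20500"))
```

Returning "unknown" separately from "mismatch" keeps rows that simply fall outside the reference table from being flagged as errors.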
Summary
- Parse and normalise addresses before geocoding.
- libpostal is the open-source default; national systems often beat it in-country.
- Handle spelling, abbreviation, punctuation, and format variation.
- Use parsed fields for dedup, joins, and downstream analysis.
Further reading
- libpostal project documentation.
- Barrett, S. — Address Standardization (USPS publications).
- OpenAddresses gazetteer documentation.
- rapidfuzz / fuzzywuzzy libraries for fuzzy matching.