18.2 Rasterio and NumPy for Raster Analysis
Python's raster toolkit — open, read, analyse, and write raster datasets.
Key takeaways
- Rasterio wraps GDAL with a Python-friendly API.
- NumPy arrays are the canonical in-memory raster representation.
- Xarray + Zarr handle multi-dimensional and cloud-native rasters.
Introduction
If Module 18.1 handled vector, this lesson handles raster. Rasterio + NumPy is the default Python stack for raster analysis — fast, expressive, and compatible with the whole scientific Python ecosystem.
Rasterio basics
1import rasterio
2import numpy as np
3[object Object]
4[object Object]
5Reading a windowed region
from rasterio.windows import WindowEfficient for huge rasters — only the bytes you need are read.
Reprojecting / warping
1from rasterio.warp import calculate_default_transform, reproject, Resampling
2[object Object]
3[object Object]
4Band math with NumPy
NDVI example:
1with rasterio.open('red.tif') as r, rasterio.open('nir.tif') as n:
2 red = r.read(1).astype('float32')
3 nir = n.read(1).astype('float32')
4 profile = r.profile
5[object Object]
6Masking and clipping
1from rasterio.mask import mask
2import geopandas as gpd
3[object Object]
4[object Object]
5Zonal statistics
Use rasterstats:
from rasterstats import zonal_statsXarray — multi-dimensional rasters
For time-series or stacked bands, Xarray is preferred:
1import xarray as xr
2import rioxarray
3[object Object]
4[object Object]
5[object Object]
6[object Object]
7rioxarray bridges rasterio into xarray, giving CRS-aware indexing.
Zarr — cloud-native chunked arrays
For petabyte-scale data (climate reanalysis, MODIS cubes), Zarr stores chunks separately and allows parallel IO:
1import xarray as xr
2ds = xr.open_zarr('s3://bucket/cube.zarr',
3 storage_options={'anon': True})Combined with Dask, this scales analysis across clusters.
Dask for large rasters
1import xarray as xr
2[object Object]
3[object Object]
4Dask handles larger-than-memory workloads by processing chunks in parallel.
Converting between CRSs with rioxarray
import rioxarrayMemory considerations
- Always use
with rasterio.open(...)to close files. - Read in windows for huge data.
- Cast to appropriate dtypes (float32 vs float64 halves memory).
- Chunk with Dask / Xarray when data exceeds RAM.
Self-check exercises
1. When would you use Xarray + rioxarray instead of plain rasterio?
When you have multi-band, multi-time, or multi-dimensional data — satellite time series, climate reanalysis, stacks of images. Xarray gives you labelled dimensions (time, band, x, y) and operations that respect them. Plain rasterio is simpler for single-file 2D analysis; Xarray shines when you're manipulating datacubes.
2. Why cast to float32 before computing NDVI?
Sentinel-2 raw data is 16-bit unsigned integer. Integer arithmetic truncates: (nir - red) / (nir + red) in integer math produces 0 for most pixels because the magnitude comparisons lose precision. Casting to float32 gives real fractional values in [-1, 1]. float64 is overkill and doubles memory; float32 is the right compromise.
3. You want to process a 50 GB Sentinel-2 tile on a 16 GB laptop. Strategy?
Use Dask with Xarray: da = xr.open_dataarray(..., chunks={'x': 1024, 'y': 1024}). Operations become lazy; actual computation processes tiles one at a time, keeping RAM use low. Alternatively, open with rasterio windowed reads in a loop. For long-running jobs, process with cloud compute (Earth Engine, Planetary Computer) instead of the laptop.
Summary
- Rasterio + NumPy is the core raster toolkit.
- Windowed reads and reprojection are simple and fast.
- Xarray and Dask handle multi-dimensional and cloud-scale data.
- Zonal stats via
rasterstatsis the standard vector-raster bridge.
Further reading
- Rasterio documentation (rasterio.readthedocs.io).
- Xarray documentation (xarray.dev).
- Dask documentation (dask.org).
- Pangeo community — open science data cubes with Python.