18.2 Rasterio and NumPy for Raster Analysis
Python's raster toolkit — open, read, analyse, and write raster datasets.
Key takeaways
- Rasterio wraps GDAL with a Python-friendly API.
- NumPy arrays are the canonical in-memory raster representation.
- Xarray + Zarr handle multi-dimensional and cloud-native rasters.
Introduction
If Module 18.1 handled vector, this lesson handles raster. Rasterio + NumPy is the default Python stack for raster analysis — fast, expressive, and compatible with the whole scientific Python ecosystem.
Rasterio basics
1import rasterio
2import numpy as np
3with rasterio.open('dem.tif') as src:
4print(src.crs)
5print(src.transform)
6print(src.width, src.height)
7print(src.count) # number of bands
8print(src.nodatavals)
9print(src.dtypes)
10band1 = src.read(1) # 2D numpy array
11Reading a windowed region
from rasterio.windows import WindowEfficient for huge rasters — only the bytes you need are read.
Reprojecting / warping
1from rasterio.warp import calculate_default_transform, reproject, Resampling
2with rasterio.open('in.tif') as src:
3dst_crs = 'EPSG:3857'
4transform, width, height = calculate_default_transform(
5src.crs, dst_crs, src.width, src.height, *src.bounds
6)
7profile = src.profile.copy()
8profile.update(crs=dst_crs, transform=transform, width=width, height=height)
9with rasterio.open('out.tif', 'w', **profile) as dst:
10 for i in range(1, src.count + 1):
11 reproject(
12 source=rasterio.band(src, i),
13 destination=rasterio.band(dst, i),
14 src_crs=src.crs, dst_crs=dst_crs,
15 resampling=Resampling.bilinear
16 )
17Band math with NumPy
NDVI example:
1with rasterio.open('red.tif') as r, rasterio.open('nir.tif') as n:
2 red = r.read(1).astype('float32')
3 nir = n.read(1).astype('float32')
4 profile = r.profile
5ndvi = (nir - red) / (nir + red + 1e-6)
6ndvi = np.clip(ndvi, -1, 1)Masking and clipping
1from rasterio.mask import mask
2import geopandas as gpd
3aoi = gpd.read_file('boundary.gpkg').to_crs('EPSG:32610')
4with rasterio.open('image.tif') as src:
5out_image, out_transform = mask(src, aoi.geometry, crop=True)
6out_meta = src.meta.copy()
7out_meta.update({
8'height': out_image.shape[1],
9'width': out_image.shape[2],
10'transform': out_transform
11})Zonal statistics
Use rasterstats:
from rasterstats import zonal_statsXarray — multi-dimensional rasters
For time-series or stacked bands, Xarray is preferred:
1import xarray as xr
2import rioxarray
3da = xr.open_dataarray('scene.tif') # lazy-loaded
4print(da.dims, da.coords)
5Slicing
6subset = da.sel(x=slice(400000, 500000),
7y=slice(4300000, 4200000))
8Arithmeticrioxarray bridges rasterio into xarray, giving CRS-aware indexing.
Zarr — cloud-native chunked arrays
For petabyte-scale data (climate reanalysis, MODIS cubes), Zarr stores chunks separately and allows parallel IO:
1import xarray as xr
2ds = xr.open_zarr('s3://bucket/cube.zarr',
3 storage_options={'anon': True})Combined with Dask, this scales analysis across clusters.
Dask for large rasters
1import xarray as xr
2da = xr.open_dataarray('huge.tif', chunks={'x': 512, 'y': 512})
3Operations are lazy; compute when you call .compute() or .to_netcdf()Dask handles larger-than-memory workloads by processing chunks in parallel.
Converting between CRSs with rioxarray
import rioxarrayMemory considerations
- Always use
with rasterio.open(...)to close files. - Read in windows for huge data.
- Cast to appropriate dtypes (float32 vs float64 halves memory).
- Chunk with Dask / Xarray when data exceeds RAM.
Self-check exercises
1. When would you use Xarray + rioxarray instead of plain rasterio?
When you have multi-band, multi-time, or multi-dimensional data — satellite time series, climate reanalysis, stacks of images. Xarray gives you labelled dimensions (time, band, x, y) and operations that respect them. Plain rasterio is simpler for single-file 2D analysis; Xarray shines when you're manipulating datacubes.
2. Why cast to float32 before computing NDVI?
Sentinel-2 raw data is 16-bit unsigned integer. Integer arithmetic truncates: (nir - red) / (nir + red) in integer math produces 0 for most pixels because the magnitude comparisons lose precision. Casting to float32 gives real fractional values in [-1, 1]. float64 is overkill and doubles memory; float32 is the right compromise.
3. You want to process a 50 GB Sentinel-2 tile on a 16 GB laptop. Strategy?
Use Dask with Xarray: da = xr.open_dataarray(..., chunks={'x': 1024, 'y': 1024}). Operations become lazy; actual computation processes tiles one at a time, keeping RAM use low. Alternatively, open with rasterio windowed reads in a loop. For long-running jobs, process with cloud compute (Earth Engine, Planetary Computer) instead of the laptop.
Summary
- Rasterio + NumPy is the core raster toolkit.
- Windowed reads and reprojection are simple and fast.
- Xarray and Dask handle multi-dimensional and cloud-scale data.
- Zonal stats via
rasterstatsis the standard vector-raster bridge.
Further reading
- Rasterio documentation (rasterio.readthedocs.io).
- Xarray documentation (xarray.dev).
- Dask documentation (dask.org).
- Pangeo community — open science data cubes with Python.