Module 18: Automation & Programming

18.2 Rasterio and NumPy for Raster Analysis

Python's raster toolkit — open, read, analyse, and write raster datasets.

Lesson 88 of 100·17 min read

Key takeaways

Rasterio wraps GDAL with a Python-friendly API.

NumPy arrays are the canonical in-memory raster representation.

Xarray + Zarr handle multi-dimensional and cloud-native rasters.

Introduction

If Module 18.1 handled vector, this lesson handles raster. Rasterio + NumPy is the default Python stack for raster analysis — fast, expressive, and compatible with the whole scientific Python ecosystem.

Rasterio basics

Python

1import rasterio
2import numpy as np
3with rasterio.open('dem.tif') as src:
4print(src.crs)
5print(src.transform)
6print(src.width, src.height)
7print(src.count)           # number of bands
8print(src.nodatavals)
9print(src.dtypes)
10band1 = src.read(1)        # 2D numpy array
11

Reading a windowed region

Python

from rasterio.windows import Window

Efficient for huge rasters — only the bytes you need are read.

Reprojecting / warping

Python

1from rasterio.warp import calculate_default_transform, reproject, Resampling
2with rasterio.open('in.tif') as src:
3dst_crs = 'EPSG:3857'
4transform, width, height = calculate_default_transform(
5src.crs, dst_crs, src.width, src.height, *src.bounds
6)
7profile = src.profile.copy()
8profile.update(crs=dst_crs, transform=transform, width=width, height=height)
9with rasterio.open('out.tif', 'w', **profile) as dst:
10    for i in range(1, src.count + 1):
11        reproject(
12            source=rasterio.band(src, i),
13            destination=rasterio.band(dst, i),
14            src_crs=src.crs, dst_crs=dst_crs,
15            resampling=Resampling.bilinear
16        )
17

Band math with NumPy

NDVI example:

Python

1with rasterio.open('red.tif') as r, rasterio.open('nir.tif') as n:
2    red = r.read(1).astype('float32')
3    nir = n.read(1).astype('float32')
4    profile = r.profile
5ndvi = (nir - red) / (nir + red + 1e-6)
6ndvi = np.clip(ndvi, -1, 1)

Masking and clipping

Python

1from rasterio.mask import mask
2import geopandas as gpd
3aoi = gpd.read_file('boundary.gpkg').to_crs('EPSG:32610')
4with rasterio.open('image.tif') as src:
5out_image, out_transform = mask(src, aoi.geometry, crop=True)
6out_meta = src.meta.copy()
7out_meta.update({
8'height': out_image.shape[1],
9'width': out_image.shape[2],
10'transform': out_transform
11})

Zonal statistics

Use rasterstats:

Python

from rasterstats import zonal_stats

Xarray — multi-dimensional rasters

For time-series or stacked bands, Xarray is preferred:

Python

1import xarray as xr
2import rioxarray
3da = xr.open_dataarray('scene.tif')          # lazy-loaded
4print(da.dims, da.coords)
5Slicing
6subset = da.sel(x=slice(400000, 500000),
7y=slice(4300000, 4200000))
8Arithmetic

rioxarray bridges rasterio into xarray, giving CRS-aware indexing.

Zarr — cloud-native chunked arrays

For petabyte-scale data (climate reanalysis, MODIS cubes), Zarr stores chunks separately and allows parallel IO:

Python

1import xarray as xr
2ds = xr.open_zarr('s3://bucket/cube.zarr',
3                  storage_options={'anon': True})

Combined with Dask, this scales analysis across clusters.

Dask for large rasters

Python

1import xarray as xr
2da = xr.open_dataarray('huge.tif', chunks={'x': 512, 'y': 512})
3Operations are lazy; compute when you call .compute() or .to_netcdf()

Dask handles larger-than-memory workloads by processing chunks in parallel.

Converting between CRSs with rioxarray

Python

import rioxarray

Memory considerations

Always use with rasterio.open(...) to close files.
Read in windows for huge data.
Cast to appropriate dtypes (float32 vs float64 halves memory).
Chunk with Dask / Xarray when data exceeds RAM.

Self-check exercises

1. When would you use Xarray + rioxarray instead of plain rasterio?

When you have multi-band, multi-time, or multi-dimensional data — satellite time series, climate reanalysis, stacks of images. Xarray gives you labelled dimensions (time, band, x, y) and operations that respect them. Plain rasterio is simpler for single-file 2D analysis; Xarray shines when you're manipulating datacubes.

2. Why cast to float32 before computing NDVI?

Sentinel-2 raw data is 16-bit unsigned integer. Integer arithmetic truncates: (nir - red) / (nir + red) in integer math produces 0 for most pixels because the magnitude comparisons lose precision. Casting to float32 gives real fractional values in [-1, 1]. float64 is overkill and doubles memory; float32 is the right compromise.

3. You want to process a 50 GB Sentinel-2 tile on a 16 GB laptop. Strategy?

Use Dask with Xarray: da = xr.open_dataarray(..., chunks={'x': 1024, 'y': 1024}). Operations become lazy; actual computation processes tiles one at a time, keeping RAM use low. Alternatively, open with rasterio windowed reads in a loop. For long-running jobs, process with cloud compute (Earth Engine, Planetary Computer) instead of the laptop.

Summary

Rasterio + NumPy is the core raster toolkit.
Windowed reads and reprojection are simple and fast.
Xarray and Dask handle multi-dimensional and cloud-scale data.
Zonal stats via rasterstats is the standard vector-raster bridge.

18.2 Rasterio and NumPy for Raster Analysis

Introduction

Rasterio basics

Reading a windowed region

Reprojecting / warping

Band math with NumPy

Masking and clipping

Zonal statistics

Xarray — multi-dimensional rasters

Zarr — cloud-native chunked arrays

Dask for large rasters

Converting between CRSs with rioxarray

Memory considerations

Self-check exercises

Summary

Further reading

Ready to level up your map-making process?