CoursesGIS Basics — A Complete Introduction18.2 Rasterio and NumPy for Raster Analysis
Module 18: Automation & Programming

18.2 Rasterio and NumPy for Raster Analysis

Python's raster toolkit — open, read, analyse, and write raster datasets.

Lesson 88 of 100·17 min read

Key takeaways

  • Rasterio wraps GDAL with a Python-friendly API.
  • NumPy arrays are the canonical in-memory raster representation.
  • Xarray + Zarr handle multi-dimensional and cloud-native rasters.

Introduction

If Module 18.1 handled vector, this lesson handles raster. Rasterio + NumPy is the default Python stack for raster analysis — fast, expressive, and compatible with the whole scientific Python ecosystem.

Rasterio basics

Python
1import rasterio
2import numpy as np
3with rasterio.open('dem.tif') as src:
4print(src.crs)
5print(src.transform)
6print(src.width, src.height)
7print(src.count)           # number of bands
8print(src.nodatavals)
9print(src.dtypes)
10band1 = src.read(1)        # 2D numpy array
11

Reading a windowed region

Python
from rasterio.windows import Window

Efficient for huge rasters — only the bytes you need are read.

Reprojecting / warping

Python
1from rasterio.warp import calculate_default_transform, reproject, Resampling
2with rasterio.open('in.tif') as src:
3dst_crs = 'EPSG:3857'
4transform, width, height = calculate_default_transform(
5src.crs, dst_crs, src.width, src.height, *src.bounds
6)
7profile = src.profile.copy()
8profile.update(crs=dst_crs, transform=transform, width=width, height=height)
9with rasterio.open('out.tif', 'w', **profile) as dst:
10    for i in range(1, src.count + 1):
11        reproject(
12            source=rasterio.band(src, i),
13            destination=rasterio.band(dst, i),
14            src_crs=src.crs, dst_crs=dst_crs,
15            resampling=Resampling.bilinear
16        )
17

Band math with NumPy

NDVI example:

Python
1with rasterio.open('red.tif') as r, rasterio.open('nir.tif') as n:
2    red = r.read(1).astype('float32')
3    nir = n.read(1).astype('float32')
4    profile = r.profile
5ndvi = (nir - red) / (nir + red + 1e-6)
6ndvi = np.clip(ndvi, -1, 1)

Masking and clipping

Python
1from rasterio.mask import mask
2import geopandas as gpd
3aoi = gpd.read_file('boundary.gpkg').to_crs('EPSG:32610')
4with rasterio.open('image.tif') as src:
5out_image, out_transform = mask(src, aoi.geometry, crop=True)
6out_meta = src.meta.copy()
7out_meta.update({
8'height': out_image.shape[1],
9'width': out_image.shape[2],
10'transform': out_transform
11})

Zonal statistics

Use rasterstats:

Python
from rasterstats import zonal_stats

Xarray — multi-dimensional rasters

For time-series or stacked bands, Xarray is preferred:

Python
1import xarray as xr
2import rioxarray
3da = xr.open_dataarray('scene.tif')          # lazy-loaded
4print(da.dims, da.coords)
5Slicing
6subset = da.sel(x=slice(400000, 500000),
7y=slice(4300000, 4200000))
8Arithmetic

rioxarray bridges rasterio into xarray, giving CRS-aware indexing.

Zarr — cloud-native chunked arrays

For petabyte-scale data (climate reanalysis, MODIS cubes), Zarr stores chunks separately and allows parallel IO:

Python
1import xarray as xr
2ds = xr.open_zarr('s3://bucket/cube.zarr',
3                  storage_options={'anon': True})

Combined with Dask, this scales analysis across clusters.

Dask for large rasters

Python
1import xarray as xr
2da = xr.open_dataarray('huge.tif', chunks={'x': 512, 'y': 512})
3Operations are lazy; compute when you call .compute() or .to_netcdf()

Dask handles larger-than-memory workloads by processing chunks in parallel.

Converting between CRSs with rioxarray

Python
import rioxarray

Memory considerations

  • Always use with rasterio.open(...) to close files.
  • Read in windows for huge data.
  • Cast to appropriate dtypes (float32 vs float64 halves memory).
  • Chunk with Dask / Xarray when data exceeds RAM.

Self-check exercises

1. When would you use Xarray + rioxarray instead of plain rasterio?

When you have multi-band, multi-time, or multi-dimensional data — satellite time series, climate reanalysis, stacks of images. Xarray gives you labelled dimensions (time, band, x, y) and operations that respect them. Plain rasterio is simpler for single-file 2D analysis; Xarray shines when you're manipulating datacubes.

2. Why cast to float32 before computing NDVI?

Sentinel-2 raw data is 16-bit unsigned integer. Integer arithmetic truncates: (nir - red) / (nir + red) in integer math produces 0 for most pixels because the magnitude comparisons lose precision. Casting to float32 gives real fractional values in [-1, 1]. float64 is overkill and doubles memory; float32 is the right compromise.

3. You want to process a 50 GB Sentinel-2 tile on a 16 GB laptop. Strategy?

Use Dask with Xarray: da = xr.open_dataarray(..., chunks={'x': 1024, 'y': 1024}). Operations become lazy; actual computation processes tiles one at a time, keeping RAM use low. Alternatively, open with rasterio windowed reads in a loop. For long-running jobs, process with cloud compute (Earth Engine, Planetary Computer) instead of the laptop.

Summary

  • Rasterio + NumPy is the core raster toolkit.
  • Windowed reads and reprojection are simple and fast.
  • Xarray and Dask handle multi-dimensional and cloud-scale data.
  • Zonal stats via rasterstats is the standard vector-raster bridge.

Further reading

  • Rasterio documentation (rasterio.readthedocs.io).
  • Xarray documentation (xarray.dev).
  • Dask documentation (dask.org).
  • Pangeo community — open science data cubes with Python.