
ST_ClusterKMeans

What is ST_ClusterKMeans?

ST_ClusterKMeans is a PostGIS window function that partitions a set of geometries into k clusters using the K-Means algorithm. Each geometry is assigned a cluster id between 0 and k - 1 based on proximity to the iteratively-computed cluster centres.

SQL
ST_ClusterKMeans(geometry winset geom, integer number_of_clusters, float max_radius = NULL) OVER () → integer

The optional max_radius argument keeps splitting clusters until none has a radius larger than max_radius, so the result can contain more than number_of_clusters clusters.
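Because ST_ClusterKMeans is a window function, the OVER clause decides which rows are clustered together. A minimal sketch, assuming a hypothetical region column on the customers table used in the examples on this page, clusters each region independently:

```sql
-- Assumption: customers has a "region" column (hypothetical, for illustration).
-- Each partition is clustered on its own, so cluster ids restart at 0 per region.
SELECT id,
       region,
       ST_ClusterKMeans(geom, 3) OVER (PARTITION BY region) AS cluster_id
FROM customers;
```

With this form, a group is identified by the pair (region, cluster_id); cluster_id alone is not unique across regions.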

When would you use ST_ClusterKMeans?

Use ST_ClusterKMeans when you need a fixed partition of your data into exactly k groups: defining sales territories, grouping stops onto k delivery routes, or creating zones for field crews. The function takes no seed argument. K-Means favours compact clusters, but it does not guarantee that each cluster holds the same number of members, so check the cluster sizes if balance matters.

SQL
SELECT id,
       ST_ClusterKMeans(geom, 5) OVER () AS cluster_id
FROM customers;
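To turn the per-row cluster ids into territories, the window result can be aggregated in an outer query. The following is a sketch under the same assumptions as the query above (a customers table with id and geom columns); the output column names are illustrative:

```sql
-- A window function cannot appear directly in GROUP BY,
-- so the clustering is computed in a CTE first.
WITH clustered AS (
    SELECT id,
           geom,
           ST_ClusterKMeans(geom, 5) OVER () AS cluster_id
    FROM customers
)
SELECT cluster_id,
       count(*)                        AS n_customers,  -- check how balanced the clusters are
       ST_Centroid(ST_Collect(geom))   AS centre,       -- centre of the cluster's members
       ST_ConvexHull(ST_Collect(geom)) AS territory     -- outline enclosing the members
FROM clustered
GROUP BY cluster_id
ORDER BY cluster_id;
```

The n_customers column is a quick way to see whether the clusters are as evenly sized as the use case needs.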

FAQs

How do I choose k?

K-Means requires you to pre-select the number of clusters. Common approaches: the "elbow method" (plot sum-of-squared-distances vs k and pick the kink), silhouette analysis, or domain knowledge (e.g. six delivery trucks → k = 6). If you don't know k, use ST_ClusterDBSCAN instead.
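For the elbow method, the quantity to plot is the within-cluster sum of squared distances. One possible way to compute it for a single k is sketched below, assuming point geometries in the customers table from the earlier example; rerun it with different values of k (here 5) and compare the results:

```sql
WITH clustered AS (
    SELECT geom,
           ST_ClusterKMeans(geom, 5) OVER () AS cluster_id  -- change 5 to the k being tested
    FROM customers
),
centres AS (
    -- For point input, the centroid of a cluster's members is their mean position.
    SELECT cluster_id,
           ST_Centroid(ST_Collect(geom)) AS centre
    FROM clustered
    GROUP BY cluster_id
)
SELECT sum(ST_Distance(c.geom, m.centre) ^ 2) AS within_cluster_ss
FROM clustered AS c
JOIN centres   AS m USING (cluster_id);
```

The value shrinks as k grows; the "kink" is the k after which it stops dropping sharply. Distances are in the units of the geometry's coordinate system.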

When should I prefer ST_ClusterKMeans over ST_ClusterDBSCAN?

Use K-Means when you want every point clustered, a fixed number of clusters, and roughly spherical/equal-size partitions. Use DBSCAN when clusters may be of arbitrary shape, the count is unknown, or you want outliers flagged as noise.
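The two functions can be run side by side to compare their behaviour on the same data. This sketch assumes the customers table from the earlier example; the eps and minpoints values are placeholders that would need tuning for real data:

```sql
SELECT id,
       -- Every row receives an id between 0 and 4.
       ST_ClusterKMeans(geom, 5) OVER () AS kmeans_id,
       -- Rows that belong to no dense group come back as NULL (noise).
       -- eps is a distance in the units of the geometry's coordinate system.
       ST_ClusterDBSCAN(geom, eps := 500, minpoints := 5) OVER () AS dbscan_id
FROM customers;
```

Rows where dbscan_id is NULL are the outliers that K-Means would still have assigned to some cluster.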

What does max_radius do?

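If any cluster ends up with a radius greater than max_radius, that cluster is subdivided further, so the result can contain more than k clusters. This is useful for keeping each cluster within a practical size (e.g. a delivery zone with a radius of at most 5 km, which is roughly 10 km across). The value is a radius, not a diameter, and is expressed in the units of the geometry's coordinate system. If omitted, the algorithm runs plain K-Means with exactly k clusters.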
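A minimal sketch of a radius-capped clustering follows. It assumes the customers table from the earlier example and that geom is stored in a projected, metre-based coordinate system, so that 5000 means 5 km:

```sql
-- Assumption: geom uses a metre-based projected coordinate system.
-- Start from k = 5 and let PostGIS split any cluster whose radius exceeds 5000.
SELECT id,
       ST_ClusterKMeans(geom, 5, 5000) OVER () AS cluster_id
FROM customers;
```

Because extra clusters may be created, the highest cluster_id in the output can be larger than 4.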

Is the output deterministic?

For a given input and parameters, PostGIS produces a stable result within a single query. When the data or the PostGIS version changes, cluster ids may differ, so treat them as opaque labels rather than stable identifiers across runs.
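If downstream work needs ids that do not change, one option is to store the assignment once and read it from the stored table rather than recomputing it. A sketch, assuming the customers table from the earlier example and a hypothetical target table name:

```sql
-- customer_clusters is a hypothetical table name chosen for illustration.
-- The assignment is computed once here and only changes when this is rerun.
CREATE TABLE customer_clusters AS
SELECT id,
       ST_ClusterKMeans(geom, 5) OVER () AS cluster_id
FROM customers;
```

Queries can then join customers to customer_clusters on id.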