| Title: | Cluster Origin-Destination Flow Data |
|---|---|
| Description: | Provides functionality for clustering origin-destination (OD) pairs, representing desire lines (or flows). This includes creating distance matrices between OD pairs and passing distance matrices to a clustering algorithm. See the academic paper Tao and Thill (2016) <doi:10.1111/gean.12100> for more details on spatial clustering of flows. See the paper on delineating demand-responsive operating areas by Mahfouz et al. (2025) <doi:10.1016/j.urbmob.2025.100135> for an example of how this package can be used to cluster flows for applied transportation research. |
| Authors: | Hussein Mahfouz [aut, cre] (ORCID: <https://orcid.org/0000-0003-1706-7802>), Robin Lovelace [aut] (ORCID: <https://orcid.org/0000-0001-5679-6536>) |
| Maintainer: | Hussein Mahfouz <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.2.1.9000 |
| Built: | 2026-05-19 08:28:01 UTC |
| Source: | https://github.com/hussein-mahfouz/flowcluster |
Also checks that 'origin' and 'destination' columns are present.
add_flow_length(x)

| x | sf object of flows (LINESTRING, projected CRS) |
|---|---|

sf object with an additional length_m column (OD length in meters)

    flows <- sf::st_transform(flows_leeds, 3857)
    flows <- add_flow_length(flows)
Add Start/End Coordinates & Flow IDs
add_xyuv(x)

| x | sf object of flows |
|---|---|
tibble with x, y, u, v, flow_ID columns

    flows <- sf::st_transform(flows_leeds, 3857)
    flows <- add_flow_length(flows)
    flows <- add_xyuv(flows)
This function aggregates flows within clusters and creates a single
representative line for each cluster. The start and end coordinates are
computed as weighted averages (weighted by flow counts or another variable),
or simple means if no weights are provided. Each cluster is represented
by one LINESTRING.
aggregate_clustered_flows(flows, weight = NULL, crs = sf::st_crs(flows))

| flows | An |
|---|---|
| weight | (optional) Name of a column in |
| crs | Coordinate reference system for the output (default: taken from |
An sf object with one line per cluster, containing:
count_total: total weight (if provided), otherwise number of flows
size: the cluster size (from the input, not recomputed)
geometry: a LINESTRING representing the aggregated OD flow

    # ----- 1. Basic Usage: A quick, runnable example ---
    # This demonstrates the function with minimal, fast data preparation.
    flows <- flowcluster::flows_leeds

    # Create the required input columns in a single, fast pipeline
    flows_clustered <- flows |>
      add_xyuv() |>
      # Manually create 3 dummy clusters for demonstration
      dplyr::mutate(cluster = sample(1:3, size = nrow(flows), replace = TRUE)) |>
      # The function requires a 'size' column, so we add it
      dplyr::group_by(cluster) |>
      dplyr::add_tally(name = "size") |>
      dplyr::ungroup()

    # Demonstrate the function
    flows_agg_w <- aggregate_clustered_flows(flows_clustered, weight = "count")
    print(flows_agg_w)

    # ----- 2. Detailed Workflow (not run by default) ---
    ## Not run:
    # This example shows the ideal end-to-end workflow, from raw data
    # to clustering and finally aggregation. It is not run during checks
    # because the clustering steps are too slow.

    # a) Prepare the data by filtering and adding coordinates
    flows_prep <- flowcluster::flows_leeds |>
      sf::st_transform(3857) |>
      add_flow_length() |>
      filter_by_length(length_min = 5000, length_max = 12000) |>
      add_xyuv()

    # b) Calculate distances and cluster the flows
    distances <- flow_distance(flows_prep, alpha = 1.5, beta = 0.5)
    dmat <- distance_matrix(distances)
    wvec <- weight_vector(dmat, flows_prep, weight_col = "count")
    flows_clustered_real <- cluster_flows_dbscan(dmat, wvec, flows_prep, eps = 8, minPts = 70)

    # c) Filter clusters and add a 'size' column
    flows_clustered_real <- flows_clustered_real |>
      dplyr::filter(cluster != 0) |> # Filter out noise points
      dplyr::group_by(cluster) |>
      dplyr::mutate(size = dplyr::n()) |>
      dplyr::ungroup()

    # d) Now, use the function on the clustered data
    flows_agg_real <- aggregate_clustered_flows(flows_clustered_real, weight = "count")
    print(flows_agg_real)

    # e) Visualize the results
    if (requireNamespace("tmap", quietly = TRUE)) {
      library(tmap)
      # This plot uses modern tmap v4 syntax.
      tm_shape(flows_clustered_real, facet = "cluster") +
        tm_lines(col = "grey50", alpha = 0.5) +
        tm_shape(flows_agg_real) +
        tm_lines(col = "red", lwd = 2) +
        tm_layout(title = "Original Flows (Grey) and Aggregated Flows (Red)")
    }
    ## End(Not run)
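The weighted averaging described for aggregate_clustered_flows() can be illustrated without the package. The base R sketch below is not the package's implementation: the data frame `toy` and its values are made up, and only the column names (x, y, u, v, cluster, count, count_total) mirror those used in this documentation.

```r
# Toy flows: start (x, y), end (u, v), a cluster label and a weight.
# All values are hypothetical.
toy <- data.frame(
  x = c(0, 2, 10), y = c(0, 2, 10),
  u = c(4, 6, 20), v = c(4, 6, 20),
  cluster = c(1, 1, 2),
  count = c(1, 3, 5)
)

# Weighted mean of each coordinate within each cluster,
# plus the total weight per cluster
agg <- do.call(rbind, lapply(split(toy, toy$cluster), function(d) {
  data.frame(
    cluster = d$cluster[1],
    x = weighted.mean(d$x, d$count),
    y = weighted.mean(d$y, d$count),
    u = weighted.mean(d$u, d$count),
    v = weighted.mean(d$v, d$count),
    count_total = sum(d$count)
  )
}))
print(agg)
```

In cluster 1 the flow with count 3 pulls the representative start point towards (2, 2), giving x = 1.5 rather than the unweighted mean of 1.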
See dbscan for details on the DBSCAN algorithm.
cluster_flows_dbscan(dist_mat, w_vec, x, eps, minPts)

| dist_mat | distance matrix |
|---|---|
| w_vec | weight vector |
| x | flows tibble with flow_ID |
| eps | DBSCAN epsilon parameter |
| minPts | DBSCAN minPts parameter |
flows tibble with an additional cluster column

    flows <- sf::st_transform(flows_leeds, 3857)
    flows <- head(flows, 100) # for testing
    # Add flow lengths and coordinates
    flows <- add_flow_length(flows)
    # filter by length
    flows <- filter_by_length(flows, length_min = 5000, length_max = 12000)
    flows <- add_xyuv(flows)
    # Calculate distances
    distances <- flow_distance(flows, alpha = 1.5, beta = 0.5)
    dmat <- distance_matrix(distances)
    wvec <- weight_vector(dmat, flows, weight_col = "count")
    clustered <- cluster_flows_dbscan(dmat, wvec, flows, eps = 8, minPts = 70)
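To build intuition for how eps, minPts and the weight vector interact, the base R sketch below checks which flows would qualify as DBSCAN core points. It assumes that, with weights, the density of a neighbourhood is the summed weight of the flows within eps (the flow itself included) rather than their number. This is an illustration with made-up values, not the package's or the dbscan package's code.

```r
# Hypothetical symmetric distance matrix for four flows
d <- matrix(c(0, 1, 2, 9,
              1, 0, 1, 9,
              2, 1, 0, 9,
              9, 9, 9, 0), nrow = 4, byrow = TRUE)
w <- c(10, 20, 30, 5) # hypothetical weights, e.g. flow counts
eps <- 1.5
min_pts <- 50

# Summed weight of all flows within eps of each flow (itself included)
neighbour_weight <- as.vector(((d <= eps) * 1) %*% w)

# Assumed core-point rule: neighbourhood weight reaches min_pts
is_core <- neighbour_weight >= min_pts
print(data.frame(flow = 1:4, neighbour_weight, is_core))
```

Under this assumption, minPts is best thought of in the units of the weight column (for example, number of trips) rather than as a number of desire lines.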
This function lets you test different combinations of the epsilon and minPts parameters when clustering flows with DBSCAN. It can be used to determine which parameter values make sense for your data.
dbscan_sensitivity(dist_mat, flows, options_epsilon, options_minpts, w_vec = NULL)

| dist_mat | a precalculated distance matrix between desire lines (output of distance_matrix()) |
|---|---|
| flows | the original flows tibble (must contain flow_ID and 'count' column) |
| options_epsilon | a vector of options for the epsilon parameter |
| options_minpts | a vector of options for the minPts parameter |
| w_vec | Optional precomputed weight vector (otherwise computed internally from 'count' column) |
a tibble with columns: id (to identify eps and minpts), cluster, size (number of desire lines in cluster), count_sum (total count per cluster)

    flows <- sf::st_transform(flows_leeds, 3857)
    flows <- head(flows, 1000) # for testing
    # Add flow lengths and coordinates
    flows <- add_flow_length(flows)
    # filter by length
    flows <- filter_by_length(flows, length_min = 5000, length_max = 12000)
    # Add x, y, u, v coordinates to flows
    flows <- add_xyuv(flows)
    # Calculate distance matrix
    distances <- flow_distance(flows, alpha = 1.5, beta = 0.5)
    dmat <- distance_matrix(distances)
    # Generate weight vector
    w_vec <- weight_vector(dmat, flows, weight_col = "count")
    # Define the parameters for sensitivity analysis
    options_epsilon <- seq(1, 10, by = 2)
    options_minpts <- seq(10, 100, by = 10)

    # Run the sensitivity analysis
    results <- dbscan_sensitivity(
      dist_mat = dmat,
      flows = flows,
      options_epsilon = options_epsilon,
      options_minpts = options_minpts,
      w_vec = w_vec
    )
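The number of parameter combinations is the product of the two option vectors, so it is worth checking the grid size before running the analysis. The base R sketch below enumerates the combinations for the option vectors used in the example. The `id` label format is hypothetical and the package's id column may be formatted differently.

```r
options_epsilon <- seq(1, 10, by = 2)    # 1, 3, 5, 7, 9
options_minpts <- seq(10, 100, by = 10)  # 10, 20, ..., 100

# One row per (eps, minPts) combination
grid <- expand.grid(eps = options_epsilon, minpts = options_minpts)

# Hypothetical label identifying each combination
grid$id <- paste0("eps_", grid$eps, "_minpts_", grid$minpts)

nrow(grid) # number of combinations to evaluate
head(grid)
```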
Convert Long-Format Distance Tibble to Matrix
distance_matrix(distances, distance_col = "fds")

| distances | tibble with columns flow_ID_a, flow_ID_b, and distance |
|---|---|
| distance_col | column name for distance (default "fds") |
distance matrix (tibble with rownames). The matrix has flow_ID_a as rownames and flow_ID_b as column names.
This function converts the output of flow_distance() into a format suitable for the dbscan clustering algorithm.

    flows <- sf::st_transform(flows_leeds, 3857)
    flows <- head(flows, 100) # for testing
    # Add flow lengths and coordinates
    flows <- add_flow_length(flows)
    flows <- add_xyuv(flows)
    # Calculate distances
    distances <- flow_distance(flows, alpha = 1.5, beta = 0.5)
    dmat <- distance_matrix(distances)
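The long-to-wide conversion can be pictured with a small base R sketch. It is not the package's implementation: the three flow IDs and their distances are made up, and only the column names flow_ID_a, flow_ID_b and fds follow this documentation.

```r
# Hypothetical long-format distances between three flows
# (every ordered pair, including self-pairs)
ids <- c("a", "b", "c")
long <- expand.grid(flow_ID_a = ids, flow_ID_b = ids, stringsAsFactors = FALSE)
pos <- c(a = 0, b = 3, c = 7) # made-up 1-D positions used to fake a distance
long$fds <- unname(abs(pos[long$flow_ID_a] - pos[long$flow_ID_b]))

# Pivot to a square matrix: flow_ID_a as row names, flow_ID_b as column names
mat <- matrix(NA_real_, nrow = length(ids), ncol = length(ids),
              dimnames = list(ids, ids))
mat[cbind(long$flow_ID_a, long$flow_ID_b)] <- long$fds
print(mat)
```

The result is symmetric with zeros on the diagonal, which is the shape a distance-based clustering algorithm expects.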
Filter Flows by Length
filter_by_length(x, length_min = 0, length_max = Inf)

| x | sf object with length_m |
|---|---|
| length_min | minimum length (default 0) |
| length_max | maximum length (default Inf) |
filtered sf object. Flows with length_m outside the specified range are removed.

    flows <- sf::st_transform(flows_leeds, 3857)
    flows <- add_flow_length(flows)
    flows <- filter_by_length(flows, length_min = 5000, length_max = 12000)
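As a rough picture of the filtering step, the base R sketch below keeps rows whose length_m lies within a range. It assumes both bounds are inclusive, which this documentation does not state, and uses a made-up data frame in place of an sf object.

```r
# Hypothetical flows with precomputed lengths in meters
flows_df <- data.frame(
  flow_ID = 1:4,
  length_m = c(3000, 5000, 9000, 15000)
)

length_min <- 5000
length_max <- 12000

# Keep flows inside the range (inclusive bounds are an assumption)
kept <- flows_df[flows_df$length_m >= length_min &
                   flows_df$length_m <= length_max, ]
print(kept)
```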
This function calculates flow distance and flow dissimilarity measures between all pairs of flows, based on the method described in Tao and Thill (2016).
flow_distance(x, alpha = 1, beta = 1)

| x | tibble with flow_ID, x, y, u, v, length_m |
|---|---|
| alpha | numeric, origin weight |
| beta | numeric, destination weight |
tibble of all pairs of flows with fd (flow distance) and fds (flow dissimilarity) columns
Tao, R., Thill, J.-C., 2016. Spatial cluster detection in spatial flow data. Geographical Analysis 48, 355–372. https://doi.org/10.1111/gean.12100

    flows <- sf::st_transform(flows_leeds, 3857)
    flows <- head(flows, 100) # for testing
    # Add flow lengths and coordinates
    flows <- add_flow_length(flows)
    flows <- add_xyuv(flows)
    # Calculate distances
    distances <- flow_distance(flows, alpha = 1.5, beta = 0.5)
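For a single pair of flows, the two measures can be sketched in base R as follows. The sketch assumes the forms commonly attributed to Tao and Thill (2016): flow distance as the square root of the weighted sum of squared origin and destination distances, and flow dissimilarity as the same quantity normalised by the product of the two flow lengths. The helper `flow_dist_pair()` is hypothetical and is not part of the package.

```r
# Hypothetical helper: distance measures for one pair of flows.
# f1 and f2 are lists with x, y (origin), u, v (destination) and length_m.
flow_dist_pair <- function(f1, f2, alpha = 1, beta = 1) {
  d_o2 <- (f1$x - f2$x)^2 + (f1$y - f2$y)^2 # squared origin distance
  d_d2 <- (f1$u - f2$u)^2 + (f1$v - f2$v)^2 # squared destination distance
  weighted <- alpha * d_o2 + beta * d_d2
  c(
    fd = sqrt(weighted),                             # assumed flow distance
    fds = sqrt(weighted / (f1$length_m * f2$length_m)) # assumed dissimilarity
  )
}

# Two parallel flows of length 10, offset by 3 units (made-up values)
f1 <- list(x = 0, y = 0, u = 0, v = 10, length_m = 10)
f2 <- list(x = 3, y = 0, u = 3, v = 10, length_m = 10)

res <- flow_dist_pair(f1, f2, alpha = 1, beta = 1)
print(res)
```

Under these assumed forms, raising alpha relative to beta penalises differences between origins more heavily than differences between destinations, and dividing by the flow lengths makes long flows more tolerant of the same absolute offset.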
Example flow data for Leeds from the 2021 census, containing all origin-destination flows at the MSOA level. For more information on census flow data, see the ONS documentation. See data-raw/flows_leeds.R for how this data was created.
flows_leeds
An object of class sf with LINESTRING geometry. It has the following columns:
MSOA code of origin zone
MSOA code of destination zone
number of people moving from origin to destination
desire line between origin and destination
https://www.nomisweb.co.uk/sources/census_2021_od
Generate Weight Vector from Flows
weight_vector(dist_mat, x, weight_col = "count")

| dist_mat | distance matrix |
|---|---|
| x | flows tibble with flow_ID and weight_col |
| weight_col | column to use as weights (default = "count") |
numeric weight vector. Each element corresponds to a flow in the distance matrix, and is used as a weight in the DBSCAN clustering algorithm.

    flows <- sf::st_transform(flows_leeds, 3857)
    flows <- head(flows, 100) # for testing
    # Add flow lengths and coordinates
    flows <- add_flow_length(flows)
    flows <- add_xyuv(flows)
    # Calculate distances
    distances <- flow_distance(flows, alpha = 1.5, beta = 0.5)
    dmat <- distance_matrix(distances)
    wvec <- weight_vector(dmat, flows, weight_col = "count")
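The important property of the weight vector is that its elements follow the row order of the distance matrix, which need not match the row order of the flows tibble. The base R sketch below shows one way to achieve that alignment by matching on flow_ID. It is an assumption about the idea, not the package's code, and all values are made up.

```r
# Hypothetical distance matrix whose rows are not in the tibble's order
ids <- c("f2", "f3", "f1")
dmat <- matrix(0, nrow = 3, ncol = 3, dimnames = list(ids, ids))

# Hypothetical flows table with a weight column
flows_tbl <- data.frame(
  flow_ID = c("f1", "f2", "f3"),
  count = c(10, 20, 30)
)

# Reorder the weights so that element i belongs to row i of dmat
w <- flows_tbl$count[match(rownames(dmat), flows_tbl$flow_ID)]
print(w)
```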