Out-of-core GIS with vectra

Gilles Colling

2026-06-28

The streaming envelope

Every desktop GIS runs its toolbox one in-memory layer at a time. Open a layer, run the tool, write the result. The working layer has to fit in RAM, and a country-scale point set or a national DEM often does not.

vectra runs the same toolbox one batch at a time. A query is pulled through the C engine in fixed-size row groups, each spatial step works on the batch in front of it, and the transformed batch spills back to disk as a fresh lazy node. Peak memory is one batch and whatever small resident layer the step needs, so a billion-row layer flows past a fixed memory budget.

Topology stays with sf and terra: vectra adds no GEOS or GDAL link. Geometry rides through the engine as hex-encoded WKB in an ordinary string column, and the coordinate reference system is carried on the node. What vectra contributes is the streaming envelope around the operations a desktop GIS keeps resident.

library(vectra)

From a raster to one row per cell

A raster is a dense grid of values. To feed it into a tabular pipeline, read it as one row per cell, each cell a record of its coordinate and value. That is the bridge between the raster cube and every verb in the engine.

vec_open_raster() opens a .vec raster cube, and vec_extract_points() samples it at coordinates. A whole national climate grid stays on disk while the occurrence points it is queried at stay in memory.

clim <- vec_open_raster("worldclim_bio.vec")

occ <- read.csv("occurrences.csv")          # species, x, y
occ <- cbind(occ, vec_extract_points(clim, occ$x, occ$y))
head(occ)

For a GeoTIFF the same sampling is tiff_extract_points(). From here the points are an ordinary vectra table, ready for any verb.

Select by location

The most-used vector tool keeps the features that stand in a spatial relation to another layer: points inside a study region, parcels touching a road, patches within a buffer. spatial_filter() streams the large layer and tests each batch against a small resident locator with an sf predicate, keeping or dropping whole features.

region <- sf::st_read("study_area.gpkg", quiet = TRUE)

inside <- tbl("occurrences.vtr") |>
  spatial_filter(region, coords = c("x", "y"), crs = 4326)

The result carries the same schema as the input, with the outside features removed. negate = TRUE keeps the outside features instead. To cut geometry along the region rather than keep whole features, spatial_clip() intersects each batch with the mask, and erase = TRUE keeps the part outside it.

clipped <- tbl("rivers.vtr") |> spatial_clip(region, crs = 4326)

To tag each feature with the polygon it falls in rather than filter, spatial_join() appends the polygon attributes. When both layers are larger than RAM, partition = grid(cellsize) bins them to a grid and joins one shard at a time, with the right layer itself a streamed node.

tagged <- tbl("gbif_points.vtr") |>
  spatial_join(tbl("countries.vtr"), coords = c("x", "y"),
               crs = 4326, partition = grid(1))

Rasterize a streamed point set

Folding a point set into a grid is the headline larger-than-RAM operation: the grid fits in memory, the points do not. rasterize() maps each point to its cell through the geotransform and accumulates a per-cell value in C, one batch at a time, no spill. This is the monoid-fold tier: bounded memory, a single pass.

# Point density on a continental grid, streamed from a billion-row file.
density <- tbl("gbif_points.vtr") |>
  rasterize(extent = c(-180, -90, 180, 90), res = 0.1,
            fun = "count", path = "density.vec")

fun is the per-cell reduction: "count", "sum", "mean", "min", "max". With a field it aggregates that column rather than tallies points. The output is a .vec cube or an in-memory matrix.

Going the other way, zonal() summarises a raster within zones, streaming the grid one tile-row strip at a time and folding per-zone moments in memory.

admin <- sf::st_read("regions.gpkg", quiet = TRUE)
zonal(clim, admin, fun = c("mean", "sd"), zone_field = "region_id")

Terrain on a streamed DEM

Moving-window operations read each output tile expanded by the kernel radius (a halo read), so a national DEM never has to be resident. focal() applies an arbitrary weight window, and terrain() derives slope, aspect, hillshade, TPI, roughness, and TRI with Horn’s method, matching terra::terrain(). When path is given the result streams straight back to a new .vec, so neither the input nor the output band is ever fully in memory.

terrain("dem.vec", v = c("slope", "aspect", "hillshade"),
        path = "dem_derivatives.vec")

To resample or reproject, warp() walks the output one tile-row strip at a time, projects each strip’s pixel centres into the source CRS (via PROJ when the two CRSs differ), and samples the bounded source window with nearest, bilinear, or cubic interpolation, matching terra::project().

warp("dem.vec", list(crs = 3035, res = 25),
     method = "bilinear", path = "dem_laea.vec")

Back from raster to vector

rasterize() carries points onto a grid; polygonize() carries a grid back to vector. It reads one tile-row strip at a time and dissolves equal-valued cells into one polygon per value, so a classified land-cover raster becomes a polygon layer out of core.

habitat <- polygonize("landcover.vec")       # one polygon per class

contours() traces iso-lines from a continuous surface with marching squares over the same haloed strip pass, then joins each level’s segments into continuous lines.

isolines <- contours("dem.vec", levels = seq(0, 3000, by = 200))

Masking and raster math

mask() clips a raster to a polygon layer one strip at a time, keeping the pixels whose centre falls inside it, the raster counterpart of spatial_clip().

study <- mask("dem.vec", region, path = "dem_study.vec")

rast_calc() evaluates a cellwise expression across aligned rasters, reading each input strip by strip. A vegetation index, a reclassification, or arithmetic across layers is one expression over the named rasters.

ndvi <- rast_calc(list(nir = "nir.vec", red = "red.vec"),
                  (nir - red) / (nir + red), path = "ndvi.vec")

mosaic() merges rasters that share a resolution and cell grid onto their union, resolving overlap with mean, first, last, min, max, or sum, one output strip at a time.

tile_merge <- mosaic(list("n50.vec", "n51.vec", "n52.vec"), fun = "mean")

Distance to the nearest feature

proximity() gives every cell its Euclidean distance to the nearest feature cell, in CRS units. Features are the non-NA cells, or the cells whose value is in target. The transform is separable: a pass along the rows, a pass along the columns, each linear in the line length, with an out-of-core transpose between them so the whole grid is never resident. Squared distances scale by the x and y resolution, so the answer is exact on non-square cells.

dist_to_road <- proximity("roads.vec", path = "road_distance.vec")
sea_distance <- proximity("landcover.vec", target = 0)   # 0 = water

Cost-distance, which accumulates a per-cell friction along the path, is a global shortest-path problem and stays resident: collect() the raster and run a resident solver for it.

The cost model

Each operation states its memory tier, so the toolbox reads as a cost model rather than a grab-bag. The tier tells you what a run holds resident.

That boundary is the honest part. The streaming envelope covers the operations whose memory can be bounded, and names the ones it cannot.

Where to go next