
vectra

CRAN status R-CMD-check ASAN/UBSAN Codecov test coverage License: MIT

vectra is an R-native columnar query engine for datasets larger than RAM.

Write dplyr-style pipelines against multi-GB files on a laptop. Data streams through a C11 pull-based engine one row group at a time, so peak memory stays bounded regardless of file size.

Quick Start

Point vectra at any file and query it with dplyr verbs. Nothing runs until collect().

library(vectra)

# CSV — lazy scan with type inference
tbl_csv("measurements.csv") |>
  filter(temperature > 30, year >= 2020) |>
  group_by(station) |>
  summarise(avg_temp = mean(temperature), n = n()) |>
  collect()

# GeoTIFF — climate rasters as tidy data
tbl_tiff("worldclim_bio1.tif") |>
  filter(band1 > 0) |>
  mutate(temp_c = band1 / 10) |>
  collect()

# Point extraction — sample raster values at coordinates, no terra needed
tiff_extract_points("worldclim_bio1.tif",
                    x = c(10.5, 11.2), y = c(47.1, 47.3))

# SQLite — zero-dependency, no DBI required
tbl_sqlite("survey.db", "responses") |>
  filter(year == 2025) |>
  left_join(tbl_sqlite("survey.db", "sites"), by = "site_id") |>
  collect()

For repeated queries, convert to vectra’s native .vtr format for faster reads:

write_vtr(big_df, "data.vtr", batch_size = 100000)

tbl("data.vtr") |>
  filter(x > 0, region == "EU") |>
  group_by(region) |>
  summarise(total = sum(value), n = n()) |>
  collect()

Append new data without rewriting the file, or do a key-based diff between two snapshots:

# Append new rows as a new row group — existing data untouched
append_vtr(new_rows_df, "data.vtr")

# Logical diff: what was added or deleted between two snapshots?
d <- diff_vtr("snapshot_old.vtr", "snapshot_new.vtr", key_col = "id")
collect(d$added)   # rows present in new but not old
d$deleted          # key values present in old but not new

Fuzzy string matching runs inside the C engine, with no round-trip to R:

tbl("taxa.vtr") |>
  filter(levenshtein(species, "Quercus robur") <= 2) |>
  mutate(similarity = jaro_winkler(species, "Quercus robur")) |>
  arrange(desc(similarity)) |>
  collect()

Register a star schema to avoid flat-table column creep. Define the links once, then pull only what you need:

s <- vtr_schema(
  fact    = tbl("observations.vtr"),
  species = link("sp_id", tbl("species.vtr")),
  site    = link("site_id", tbl("sites.vtr"))
)

# Pull columns from any dimension — joins are built automatically
lookup(s, count, species$name, site$habitat) |> collect()
#> species: all 500 keys matched
#> site: 3/500 unmatched keys (X1, X2, X3)

Use explain() to inspect the optimized plan:

tbl("data.vtr") |>
  filter(x > 0) |>
  select(id, x) |>
  explain()
#> vectra execution plan
#>
#> ProjectNode [streaming]
#>   FilterNode [streaming]
#>     ScanNode [streaming, 2/5 cols (pruned), predicate pushdown, v3 stats]
#>
#> Output columns (2):
#>   id <int64>
#>   x <double>

Why vectra

Querying large datasets in R usually means Arrow (requires compiled binaries matching your platform), DuckDB (links a 30 MB bundled library), or Spark (requires a JVM and cluster configuration).

vectra is a self-contained C11 engine compiled as a standard R extension. No external libraries, no JVM, no runtime configuration. It provides:

Features

| Category | Verbs |
|----------|-------|
| Transform | filter(), select(), mutate(), transmute(), rename(), relocate() |
| Aggregate | group_by(), summarise() (n, sum, mean, min, max, sd, var, first, last, any, all, median, n_distinct), count(), tally(), distinct() |
| Join | left_join(), inner_join(), right_join(), full_join(), semi_join(), anti_join(), cross_join(), lookup() |
| Order | arrange(), slice_head(), slice_tail(), slice_min(), slice_max(), slice() |
| Window | row_number(), rank(), dense_rank(), lag(), lead(), cumsum(), cummean(), cummin(), cummax(), ntile(), percent_rank(), cume_dist() |
| Date/Time | year(), month(), day(), hour(), minute(), second(), as.Date() (in filter()/mutate()) |
| String | nchar(), substr(), grepl(), tolower(), toupper(), trimws(), paste0(), gsub(), sub(), startsWith(), endsWith() (in filter()/mutate()) |
| String similarity | levenshtein(), levenshtein_norm(), dl_dist(), dl_dist_norm(), jaro_winkler() (fuzzy matching in filter()/mutate(), with optional max_dist early termination) |
| Expression | abs(), sqrt(), log(), exp(), floor(), ceiling(), round(), log2(), log10(), sign(), trunc(), if_else(), between(), %in%, as.numeric(), pmin(), pmax(), resolve(), propagate() (in filter()/mutate()) |
| Combine | bind_rows(), bind_cols(), across() |
| Schema | vtr_schema(), link(), lookup() (star schema definition and dimension lookup with match reporting) |
| I/O | tbl(), tbl_csv(), tbl_sqlite(), tbl_tiff(), write_vtr(), write_csv(), write_sqlite(), write_tiff(), tiff_extract_points(), tiff_metadata(), append_vtr(), delete_vtr(), diff_vtr() |
| Inspect | explain(), glimpse(), print(), pull() |

Full tidyselect support in select(), rename(), relocate(), and across(): starts_with(), ends_with(), contains(), matches(), where(), everything(), all_of(), any_of().

Installation

# CRAN
install.packages("vectra")

# Development version
pak::pak("gcol33/vectra")

Documentation

Support

“Software is like sex: it’s better when it’s free.” – Linus Torvalds

If this package saved you some time, buying me a coffee is a nice way to say thanks.

Buy Me A Coffee

License

MIT (see the LICENSE.md file)

Citation

@software{vectra,
  author = {Colling, Gilles},
  title = {vectra: Columnar Query Engine for Larger-Than-RAM Data},
  year = {2026},
  url = {https://github.com/gcol33/vectra}
}
