bigANNOY

Approximate nearest-neighbour search for bigmemory matrices with Annoy

Frédéric Bertrand

The bigANNOY package provides approximate nearest-neighbour search specialised for bigmemory::big.matrix objects through persisted Annoy indexes. It keeps the reference data in bigmemory storage during build and query workflows, supports repeated-query sessions through explicit open/load helpers, and can stream neighbour indices and distances directly into destination big.matrix objects.

Current features include:

native C++ bigmemory-backed build and search paths, with an R backend kept as a debug-only fallback,
persisted Annoy indexes plus sidecar metadata for safe reopen and validation,
Euclidean, angular, Manhattan, and dot-product Annoy metrics,
self-search and external-query workflows on dense matrices, big.matrix objects, descriptors, descriptor paths, and external pointers,
streamed output into file-backed or in-memory big.matrix destinations,
explicit lifecycle helpers such as annoy_open_index(), annoy_load_bigmatrix(), annoy_is_loaded(), annoy_close_index(), and annoy_validate_index(), and
benchmark helpers that can compare approximate Euclidean search against the exact bigKNN baseline when bigKNN is available.

These workflows make bigANNOY useful both as a standalone approximate search package and as the ANN side of an exact-versus-approximate evaluation pipeline built around bigKNN.

Installation

The package is currently easiest to install from GitHub:

# install.packages("remotes")
remotes::install_github("fbertran/bigANNOY")

If you prefer a local source install, clone the repository and run:

R CMD build bigANNOY
R CMD INSTALL bigANNOY_0.3.0.tar.gz

Options

The package defines a small set of runtime options:

Option	Default value	Description
`bigANNOY.block_size`	`1024L`	Default number of rows processed per build/search block.
`bigANNOY.progress`	`FALSE`	Emit simple progress messages during long-running builds, searches, and benchmarks.
`bigANNOY.backend`	`"cpp"`	Requested backend. `"cpp"` uses the native compiled backend, `"auto"` falls back to the R backend when the compiled symbols are not loaded, and `"r"` forces the debug-only R backend.

All options can be changed with options() at runtime. For example, options(bigANNOY.block_size = 2048L) increases the default block size used by the build and search helpers.
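
The following is a minimal sketch, using only base R, of changing these options for part of a session and then restoring the previous settings. The values chosen here are illustrative assumptions, not tuning recommendations.

```r
# Set two of the documented options and keep the previous values.
# options() returns the old settings invisibly, so they can be restored later.
old <- options(
  bigANNOY.block_size = 2048L,
  bigANNOY.progress = TRUE
)

# Read the active value, falling back to the documented default if unset.
block_size <- getOption("bigANNOY.block_size", default = 1024L)
block_size

# Restore whatever was configured before this block ran.
options(old)
```

Saving the return value of options() keeps the change local to the code that needs it, which helps when several scripts share one R session.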

Examples

The examples below use a small Euclidean reference matrix so the returned neighbours are easy to inspect.

Build and query an Annoy index

library(bigmemory)
library(bigANNOY)

reference <- as.big.matrix(matrix(
  c(0, 0,
    1, 0,
    0, 1,
    1, 1,
    2, 2),
  ncol = 2,
  byrow = TRUE
))

query <- matrix(
  c(0.1, 0.1,
    1.8, 1.9),
  ncol = 2,
  byrow = TRUE
)

index <- annoy_build_bigmatrix(
  reference,
  path = tempfile(fileext = ".ann"),
  metric = "euclidean",
  n_trees = 20L,
  seed = 123L,
  load_mode = "eager"
)

result <- annoy_search_bigmatrix(
  index,
  query = query,
  k = 2L,
  search_k = 100L
)

result$index
round(result$distance, 3)
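
With a reference matrix this small, the approximate result can be compared against a brute-force search. The sketch below uses only base R and recomputes the exact Euclidean neighbours for the same data. `exact_knn()` is a hypothetical helper written for this illustration, not part of bigANNOY, and it assumes 1-based row indices.

```r
# Same data as the example above, held as ordinary dense matrices.
reference_dense <- matrix(
  c(0, 0,
    1, 0,
    0, 1,
    1, 1,
    2, 2),
  ncol = 2,
  byrow = TRUE
)

query <- matrix(
  c(0.1, 0.1,
    1.8, 1.9),
  ncol = 2,
  byrow = TRUE
)

# Hypothetical helper: exact k nearest neighbours by exhaustive search.
exact_knn <- function(reference, query, k) {
  idx <- matrix(NA_integer_, nrow(query), k)
  dst <- matrix(NA_real_, nrow(query), k)
  for (i in seq_len(nrow(query))) {
    # Euclidean distance from query row i to every reference row.
    d <- sqrt(rowSums(sweep(reference, 2, query[i, ])^2))
    nearest <- order(d)[seq_len(k)]
    idx[i, ] <- nearest
    dst[i, ] <- d[nearest]
  }
  list(index = idx, distance = dst)
}

exact <- exact_knn(reference_dense, query, k = 2L)
exact$index
round(exact$distance, 3)
```

For the first query, reference rows 2 and 3 are equidistant, so either may appear as the second neighbour in an approximate result without indicating an error.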

Reopen and validate a persisted index

reopened <- annoy_open_index(index$path, load_mode = "lazy")

annoy_is_loaded(reopened)

report <- annoy_validate_index(
  reopened,
  strict = TRUE,
  load = TRUE
)

report$valid
annoy_is_loaded(reopened)

Stream results into bigmemory outputs

index_store <- big.matrix(nrow(query), 2L, type = "integer")
distance_store <- big.matrix(nrow(query), 2L, type = "double")

annoy_search_bigmatrix(
  index,
  query = query,
  k = 2L,
  xpIndex = index_store,
  xpDistance = distance_store
)

bigmemory::as.matrix(index_store)
round(bigmemory::as.matrix(distance_store), 3)

Benchmark approximate Euclidean search

benchmark_annoy_bigmatrix(
  n_ref = 2000L,
  n_query = 200L,
  n_dim = 20L,
  k = 10L,
  n_trees = 50L,
  search_k = 1000L,
  metric = "euclidean",
  exact = TRUE
)

If bigKNN is installed, the Euclidean benchmark helpers also report exact search timing and recall against the exact baseline.
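
To make the recall figure concrete, here is a minimal base R sketch of recall at k: the share of exact neighbours that also appear in the approximate result, averaged over queries. `recall_at_k()` and the small index matrices are hypothetical and written for this illustration. The benchmark helpers may compute recall differently.

```r
# Hypothetical helper: mean fraction of exact neighbours that the
# approximate search also returned, ignoring the order within each row.
recall_at_k <- function(approx_index, exact_index) {
  stopifnot(identical(dim(approx_index), dim(exact_index)))
  hits <- vapply(
    seq_len(nrow(exact_index)),
    function(i) length(intersect(approx_index[i, ], exact_index[i, ])),
    integer(1)
  )
  mean(hits / ncol(exact_index))
}

# Made-up neighbour indices for two queries with k = 3.
exact_index <- matrix(
  c(1L, 2L, 3L,
    4L, 5L, 6L),
  ncol = 3,
  byrow = TRUE
)
approx_index <- matrix(
  c(1L, 3L, 7L,
    6L, 5L, 4L),
  ncol = 3,
  byrow = TRUE
)

recall <- recall_at_k(approx_index, exact_index)
recall
```

Here the first query recovers two of its three exact neighbours and the second recovers all three, so the mean recall is 5/6.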

Installed Benchmark Runner

An installed command-line benchmark script is also available at:

system.file("benchmarks", "benchmark_annoy.R", package = "bigANNOY")

Example single-run command:

Rscript "$(Rscript -e 'cat(system.file("benchmarks", "benchmark_annoy.R", package = "bigANNOY"))')" \
  --mode=single \
  --n_ref=5000 \
  --n_query=500 \
  --n_dim=50 \
  --k=20 \
  --n_trees=100 \
  --search_k=5000 \
  --load_mode=eager

Vignettes

The package now ships with focused vignettes for the main workflows:

getting-started-bigannoy
persistent-indexes-and-lifecycle
file-backed-bigmemory-workflows
benchmarking-recall-and-latency
metrics-and-tuning
validation-and-sharing-indexes
bigannoy-vs-bigknn

Together they cover the basic ANN workflow, loaded-index lifecycle, file-backed bigmemory usage, benchmarking and recall evaluation, tuning, validation and sharing of persisted indexes, and the relationship between approximate bigANNOY search and exact bigKNN search.
