The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.

Benchmarking bigKNN Against Other Exact Implementations

This article summarizes the benchmark recorded in the bench/ directory for bigKNN 0.3.0. The goal is not to claim a universal winner. The goal is to show:

The benchmark driver is bench/exact_knn_benchmark.R, and the full recorded outputs are:

Comparator Set

The dense exact comparator set in the recorded run was:

Rfast::dista(..., index = TRUE)
FNN::get.knnx(..., algorithm = "brute")
dbscan::kNN(..., approx = 0)
RANN::nn2(..., eps = 0)
nabor::knn(..., eps = 0)
BiocNeighbors::queryKNN(..., BNPARAM = KmknnParam(distance = "Euclidean"))
a base-R brute-force Euclidean implementation on the smallest case for full index-and-distance validation

bigKNN itself was measured in three dense modes:

knn_bigmatrix()
knn_search_prepared()
knn_search_stream_prepared()

For larger scale runs, the benchmark used file-backed big.matrix references and measured:

knn_prepare_bigmatrix()
knn_search_prepared()
knn_search_stream_prepared()

Recorded Environment

The recorded benchmark was generated on:

R 4.5.2
bigKNN 0.3.0
bigmemory 4.6.4
Rfast 2.1.5.2
FNN 1.1.4.1
dbscan 1.2.4
RANN 2.6.2
nabor 0.5.0
BiocNeighbors 2.2.0

Problem Sizes

The benchmark covers three dense comparator cases and two larger file-backed scale cases.

Case	`n_ref`	`n_query`	`p`	`k`	Reference size	Full dense pairwise matrix
`dense_small`	`5,000`	`250`	`16`	`10`	`0.61 MB`	`0.01 GB`
`dense_medium`	`20,000`	`500`	`16`	`10`	`2.44 MB`	`0.07 GB`
`dense_large`	`50,000`	`1,000`	`16`	`10`	`6.10 MB`	`0.37 GB`
`filebacked_xlarge`	`100,000`	`1,000`	`32`	`10`	`24.41 MB`	`0.75 GB`
`filebacked_2xlarge`	`200,000`	`1,000`	`32`	`10`	`48.83 MB`	`1.49 GB`

The full_pairwise_gb column is intentionally included because it highlights what bigKNN does not materialize: a full dense query-by-reference distance matrix.

Correctness

All dense exact comparators matched bigKNN on neighbour indices for the recorded cases. All distance-capable dense comparators matched on distances as well.

That matters because the benchmark is intended to compare exact search implementations, not approximate recall. The validation table in bench/exact_knn_benchmark_validation.csv is therefore as important as the timing table.

Dense Benchmark Snapshot

The table below compares the recorded bigKNN_prepared median with the fastest external exact comparator in each dense case.

Case	`bigKNN_prepared` median	Fastest external exact comparator
`dense_small`	`0.110 s`	`0.042 s` (`FNN_brute` and `dbscan_kdtree` tie)
`dense_medium`	`0.822 s`	`0.328 s` (`RANN_kd`, `Rfast_index_only`, and `nabor_kd` tie)
`dense_large`	`4.642 s`	`1.289 s` (`nabor_kd`)

The broader dense median ranking in the recorded run was:

at smaller and medium dense sizes, the in-memory comparator packages were noticeably faster than bigKNN
at the largest dense case in this benchmark, the external exact comparator set still clustered around 1.29 s, while bigKNN_prepared remained around 4.64 s

This is a sensible result. The dense comparator packages are optimized for ordinary in-memory matrix search, while bigKNN is designed around bigmemory::big.matrix, reusable prepared references, streaming, and larger workflows that do not assume everything should round-trip through ordinary R matrices.

File-Backed Scale Snapshot

The scale portion of the benchmark focuses on the feature set that is more specific to bigKNN: exact search on file-backed big.matrix references and streamed output into destination big.matrix objects.

Case	`bigKNN_prepare`	`bigKNN_prepared_search`	`bigKNN_stream`
`filebacked_xlarge` (`100,000 x 1,000 x 32`)	`0.039 s`	`9.714 s`	`9.290 s`
`filebacked_2xlarge` (`200,000 x 1,000 x 32`)	`0.078 s`	`18.594 s`	`18.533 s`

Two points stand out:

preparation overhead is small relative to the search itself
doubling the reference size produced close to linear growth in search time

That second point is important in practice. It means the file-backed exact path behaves predictably as the reference grows, which is exactly the kind of workflow bigKNN is meant to support.

How To Read These Results

These benchmark numbers should be interpreted in context:

if your workload is a dense in-memory Euclidean query and you only need top-k neighbours back in ordinary R matrices, the specialist dense comparator packages can be faster
if your workload is built around bigmemory::big.matrix, reusable prepared references, streamed output, resumable jobs, or larger file-backed data, bigKNN provides functionality that goes beyond the dense in-memory comparator set

In other words, the benchmark is useful both for performance comparison and for positioning the package.

Reproducing The Benchmark

From the package root, rerun the benchmark with:

env LC_ALL=C LANG=C Rscript bench/exact_knn_benchmark.R

The script prefers benchmarking the source tree through pkgload::load_all() when available, and otherwise falls back to the installed bigKNN package.

Caveats

Benchmark results are machine-specific. The recorded values in this article are best treated as a reproducible snapshot rather than a universal ranking.

The benchmark is also deliberately split into two stories:

a dense exact comparison against several external implementations
a file-backed scale story that highlights the big.matrix-oriented workflow of bigKNN

That split keeps the benchmark honest about what is truly comparable and what is specific to the package’s design goals.

These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.