The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.
bigKNN provides exact nearest-neighbour and
radius-search routines that work directly on
bigmemory::big.matrix references. This quickstart walks
through a small end-to-end example: create a reference matrix, run exact
k-nearest neighbour and radius queries, and interpret the
objects that come back.
The reference data for bigKNN lives in a
bigmemory::big.matrix. For this quickstart we will use six
points in two dimensions and keep separate labels so that the returned
row indices are easy to read.
reference_points <- data.frame(
id = paste0("p", 1:6),
x1 = c(1, 2, 1, 2, 3, 4),
x2 = c(1, 1, 2, 2, 2, 3)
)
query_points <- data.frame(
id = c("q1", "q2"),
x1 = c(1.2, 2.8),
x2 = c(1.1, 2.2)
)
reference <- as.big.matrix(as.matrix(reference_points[c("x1", "x2")]))
query_matrix <- as.matrix(query_points[c("x1", "x2")])
reference_points
#> id x1 x2
#> 1 p1 1 1
#> 2 p2 2 1
#> 3 p3 1 2
#> 4 p4 2 2
#> 5 p5 3 2
#> 6 p6 4 3
query_points
#> id x1 x2
#> 1 q1 1.2 1.1
#> 2 q2 2.8 2.2In the examples below, reference is the
big.matrix searched by bigKNN, while
query_matrix is an ordinary dense R matrix. The same APIs
also accept queries stored in another big.matrix, and the
v3 search paths accept sparse query matrices as well.
k-Nearest-Neighbour SearchWith query = NULL, knn_bigmatrix() performs
a self-search. By default exclude_self = TRUE in that case,
so each row does not return itself as its own nearest neighbour.
self_knn <- knn_bigmatrix(reference, k = 2)
self_knn
#> <bigknn_knn_result>
#> metric: euclidean
#> k: 2
#> queries: 6
#> references: 6
#> backend: bruteforceThe raw result stores two n_query x k matrices:
index: 1-based row indices into the reference
matrixdistance: the corresponding exact distancesself_knn$index
#> [,1] [,2]
#> [1,] 2 3
#> [2,] 1 4
#> [3,] 1 4
#> [4,] 2 3
#> [5,] 4 2
#> [6,] 5 4
round(self_knn$distance, 3)
#> [,1] [,2]
#> [1,] 1.000 1.000
#> [2,] 1.000 1.000
#> [3,] 1.000 1.000
#> [4,] 1.000 1.000
#> [5,] 1.000 1.414
#> [6,] 1.414 2.236A small helper table makes the same result easier to read:
knn_table(self_knn, query_ids = reference_points$id, ref_ids = reference_points$id)
#> query rank neighbor distance
#> 1 p1 1 p2 1.000
#> 2 p1 2 p3 1.000
#> 3 p2 1 p1 1.000
#> 4 p2 2 p4 1.000
#> 5 p3 1 p1 1.000
#> 6 p3 2 p4 1.000
#> 7 p4 1 p2 1.000
#> 8 p4 2 p3 1.000
#> 9 p5 1 p4 1.000
#> 10 p5 2 p2 1.414
#> 11 p6 1 p5 1.414
#> 12 p6 2 p4 2.236To search new observations against the same reference, pass them
through the query argument. Here we ask for the three
nearest reference rows for two new points.
query_knn <- knn_bigmatrix(
reference,
query = query_matrix,
k = 3,
exclude_self = FALSE
)
query_knn
#> <bigknn_knn_result>
#> metric: euclidean
#> k: 3
#> queries: 2
#> references: 6
#> backend: bruteforce
knn_table(query_knn, query_ids = query_points$id, ref_ids = reference_points$id)
#> query rank neighbor distance
#> 1 q1 1 p1 0.2236
#> 2 q1 2 p2 0.8062
#> 3 q1 3 p3 0.9220
#> 4 q2 1 p5 0.2828
#> 5 q2 2 p4 0.8246
#> 6 q2 3 p6 1.4420The returned indices are still row numbers in reference.
That makes the object compact and easy to use in later workflows, while
a lookup table like reference_points can translate those
indices into human-readable labels when needed.
Radius search returns every neighbour within a fixed distance
threshold instead of a fixed k. This is useful when the
local neighbourhood size should vary by query.
radius_result <- radius_bigmatrix(
reference,
query = query_matrix,
radius = 1.15,
exclude_self = FALSE
)
radius_result
#> <bigknn_radius_result>
#> metric: euclidean
#> radius: 1.15
#> queries: 2
#> references: 6
#> matches: 5
radius_result$n_match
#> [1] 3 2
radius_result$offset
#> [1] 1 4 6radius_bigmatrix() uses a flattened output format:
index and distance hold all matches
back-to-backn_match gives the number of matches per queryoffset tells you where each query’s slice starts and
endsFor query i, the matching rows live in
index[offset[i]:(offset[i + 1] - 1)], with the same slice
in distance.
radius_slice(radius_result, 1, reference_points$id)
#> neighbor distance
#> 1 p1 0.2236
#> 2 p2 0.8062
#> 3 p3 0.9220
radius_slice(radius_result, 2, reference_points$id)
#> neighbor distance
#> 1 p5 0.2828
#> 2 p4 0.8246If you only need the counts,
count_within_radius_bigmatrix() avoids returning the
flattened neighbour vectors.
bigKNN currently supports three exact metrics:
"euclidean" for ordinary Euclidean distance"sqeuclidean" for squared Euclidean distance"cosine" for cosine distance, defined as
1 - cosine similarityThe squared Euclidean metric preserves the same neighbour ordering as ordinary Euclidean distance, but reports squared values. Cosine distance can choose a different neighbour because it prefers similar direction rather than similar absolute location.
metric_summary <- do.call(rbind, lapply(
c("euclidean", "sqeuclidean", "cosine"),
function(metric) {
result <- knn_bigmatrix(
reference,
query = query_matrix,
k = 1,
metric = metric,
exclude_self = FALSE
)
data.frame(
metric = metric,
query = query_points$id,
nearest = reference_points$id[result$index[, 1]],
distance = signif(result$distance[, 1], 4),
row.names = NULL
)
}
))
metric_summary
#> metric query nearest distance
#> 1 euclidean q1 p1 0.2236000
#> 2 euclidean q2 p5 0.2828000
#> 3 sqeuclidean q1 p1 0.0500000
#> 4 sqeuclidean q2 p5 0.0800000
#> 5 cosine q1 p1 0.0009438
#> 6 cosine q2 p6 0.0002524In this example, Euclidean and squared Euclidean agree on the nearest row for each query, while cosine distance can favour a different point because its direction is more similar. Cosine distance requires non-zero rows in both the reference and the query data.
This vignette focused on the smallest useful workflow. For larger or repeated jobs, the next places to look are:
knn_prepare_bigmatrix() and
knn_search_prepared() for repeated exact queries against
the same referenceknn_plan_bigmatrix(),
knn_stream_bigmatrix(), and
radius_stream_bigmatrix() for memory-aware and streamed
workflowsknn_graph_bigmatrix(),
mutual_knn_graph_bigmatrix(),
snn_graph_bigmatrix(), and
radius_graph_bigmatrix() for exact graph constructionrecall_against_exact() and
rerank_candidates_bigmatrix() when bigKNN is
being used as the exact ground-truth engine for approximate searchThese binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.