Annoy is a small, fast and lightweight library for Approximate Nearest Neighbours with a particular focus on efficient memory use and the ability to load a pre-saved index.
Annoy was written by Erik Bernhardsson for use at Spotify and is implemented in about 500 lines of a single C++ template header file, which Erik also wraps into a loadable Python module.
RcppAnnoy provides a nice example of Rcpp Modules and the use of templates:
Annoy uses two template data types (generally float and int32_t for efficiency) and one of two distance measures.
This package shows that it is easy to wrap both.
It also shows how easy it is to have Python and R share the exact same functionality by virtue of module bindings on both the Python side and the R side (where Rcpp helps).
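To make the notion of a distance measure concrete, here is a minimal base R sketch. It assumes the two measures are Euclidean and angular; the example further down uses AnnoyEuclidean, and the angular formula follows the description in the Annoy documentation. The function names euclidean_dist and angular_dist are hypothetical helpers for illustration and are not part of RcppAnnoy.

```r
# Euclidean distance between two numeric vectors
euclidean_dist <- function(u, v) sqrt(sum((u - v)^2))

# Angular distance: the Euclidean distance between the two vectors after
# scaling each to unit length, which equals sqrt(2 * (1 - cos(u, v)))
angular_dist <- function(u, v) {
    cosine <- sum(u * v) / (sqrt(sum(u^2)) * sqrt(sum(v^2)))
    sqrt(max(0, 2 * (1 - cosine)))   # max() guards against tiny negative rounding errors
}

u <- c(1, 0)
v <- c(0, 2)
euclidean_dist(u, v)   # sqrt(5)
angular_dist(u, v)     # sqrt(2), since the vectors are orthogonal
```

Note that the angular measure ignores vector length: scaling v by any positive constant leaves angular_dist(u, v) unchanged, while the Euclidean distance changes.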
Source code resides in the RcppAnnoy GitHub repo.
The example below is implemented as demo/simpleExample.R and mirrors the Python example on the Annoy repo page.
library(RcppAnnoy)
set.seed(123)                   # be reproducible
f <- 40                         # vector dimension
a <- new(AnnoyEuclidean, f)
n <- 50                         # not specified
for (i in seq(n)) {
    v <- rnorm(f)
    a$addItem(i-1, v)           # item indices are zero-based
}
a$build(50)                     # 50 trees
a$save("/tmp/test.tree")
b <- new(AnnoyEuclidean, f)     # new object, could be in another process
b$load("/tmp/test.tree")        # super fast, will just mmap the file
print(b$getNNsByItem(0, 40))
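Since the index returns approximate neighbours, it can help to see what an exact answer looks like. The following base R sketch regenerates the same 50 vectors and ranks them by brute-force Euclidean distance. The helper exact_nns_by_item is hypothetical and not part of RcppAnnoy, and the approximate result from the index is not guaranteed to match this ordering.

```r
set.seed(123)                        # same seed as the example above
f <- 40                              # vector dimension
n <- 50                              # number of items

# One item per row; row i holds the item stored under zero-based index i - 1
m <- matrix(0, nrow = n, ncol = f)
for (i in seq(n)) m[i, ] <- rnorm(f)

# Exact Euclidean nearest neighbours of a zero-based item index (hypothetical helper)
exact_nns_by_item <- function(m, item, k) {
    d <- sqrt(rowSums(sweep(m, 2, m[item + 1, ])^2))   # distance from the item to every row
    head(order(d), k) - 1                              # closest first, as zero-based indices
}

nn <- exact_nns_by_item(m, 0, 10)
print(nn)                            # the first entry is 0, the query item itself
```

A brute-force scan like this costs one distance computation per stored item for every query, which is the cost an approximate index is meant to avoid on large data sets.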
The package matches the behaviour of the original Python wrapper for the Annoy library.
It also replicates all unit tests written for the Python frontend, including a test for efficiently mmap-ing a binary index file.
While setting it up, some small contributions were made back to Annoy as well.
Because Annoy uses mmap for fast disk access to a stored index file, a Windows build is possible via MapViewOfFile (see e.g. Jeff Ryan’s mmap CRAN package), but we have not needed that functionality.
Clean pull requests to the Annoy or RcppAnnoy repos would be welcome.
Dirk Eddelbuettel