Annoy is a small, fast and lightweight library for Approximate Nearest Neighbours with a particular focus on efficient memory use and the ability to load a pre-saved index.
Annoy was written by Erik Bernhardsson for use at Spotify. It is implemented in about 500 lines of a single C++ template header file, which Erik also wraps as a loadable Python module.
It provides a nice example for Rcpp Modules and use of templates: Annoy uses two template data types (generally float and int32_t for efficiency) and one of two distance measures. This package shows that it is easy to wrap both.
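To make the template structure concrete, here is a minimal C++ sketch of an index class parameterised on two data types and a distance policy. It is not Annoy's code. It assumes the two types are an item id type and a value type; the names ToyIndex, ToyEuclidean and ToyAngular are hypothetical; the search is exhaustive rather than tree-based; and the angular formula (one minus cosine similarity) is an illustrative choice that may differ from Annoy's own.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical distance policies: each exposes one static distance() function.
struct ToyEuclidean {
    template <typename T>
    static T distance(const std::vector<T>& x, const std::vector<T>& y) {
        T sum = 0;
        for (std::size_t i = 0; i < x.size(); ++i) {
            const T d = x[i] - y[i];
            sum += d * d;
        }
        return std::sqrt(sum);
    }
};

struct ToyAngular {
    // Illustrative only: one minus cosine similarity.
    template <typename T>
    static T distance(const std::vector<T>& x, const std::vector<T>& y) {
        T dot = 0, xx = 0, yy = 0;
        for (std::size_t i = 0; i < x.size(); ++i) {
            dot += x[i] * y[i];
            xx += x[i] * x[i];
            yy += y[i] * y[i];
        }
        const T denom = std::sqrt(xx * yy);
        return denom > 0 ? T(1) - dot / denom : T(1);
    }
};

// S: item id type, T: value type, Distance: one of the policies above.
template <typename S, typename T, typename Distance>
class ToyIndex {
public:
    explicit ToyIndex(std::size_t f) : f_(f) {}

    void addItem(S id, const std::vector<T>& v) {
        assert(v.size() == f_);  // every vector must have f_ entries
        items_.push_back(std::make_pair(id, v));
    }

    // Exhaustive search: ids of the n stored items closest to q, nearest first.
    std::vector<S> getNNsByVector(const std::vector<T>& q, std::size_t n) const {
        std::vector<std::pair<T, S>> scored;
        for (const auto& item : items_) {
            scored.push_back(std::make_pair(Distance::distance(q, item.second), item.first));
        }
        std::sort(scored.begin(), scored.end());
        std::vector<S> ids;
        for (std::size_t i = 0; i < scored.size() && i < n; ++i) {
            ids.push_back(scored[i].second);
        }
        return ids;
    }

private:
    std::size_t f_;
    std::vector<std::pair<S, std::vector<T>>> items_;
};

// One concrete type per distance measure, with the usual int32_t / float pairing.
using ToyIndexEuclidean = ToyIndex<std::int32_t, float, ToyEuclidean>;
using ToyIndexAngular = ToyIndex<std::int32_t, float, ToyAngular>;
```

Each alias fixes all three template parameters, so a wrapper only has to expose a small number of concrete classes. The R example further down follows the same pattern with its AnnoyEuclidean class.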
It also shows how easy it is to have Python and R share the exact same functionality, with module bindings on both the Python side and the R side (where Rcpp helps).
Source code resides in the RcppAnnoy GitHub repo.
The following example is implemented as demo/simpleExample.R and mirrors the Python example on the Annoy repo page:
library(RcppAnnoy)
set.seed(123) # be reproducible
f <- 40
a <- new(AnnoyEuclidean, f)
n <- 50 # not specified
for (i in seq(n)) {
    v <- rnorm(f)
    a$addItem(i-1, v)
}
a$build(50) # 50 trees
a$save("/tmp/test.tree")
b <- new(AnnoyEuclidean, f) # new object, could be in another process
b$load("/tmp/test.tree") # super fast, will just mmap the file
print(b$getNNsByItem(0, 40))
The package matches the behaviour of the original Python wrapper for the Annoy library. It also replicates all unit tests written for the Python frontend, including a test for efficiently mmap-ing a binary index file. While setting it up, some small contributions were made back to Annoy as well.
Annoy uses mmap for fast disk access to a stored index file. A Windows build is possible via MapViewOfFile (see e.g. Jeff Ryan’s mmap CRAN package), but we have not needed that functionality. A clean pull request to the Annoy or RcppAnnoy repos would be welcome.
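For readers unfamiliar with mmap-based loading, the following POSIX-only C++ sketch shows the general idea: a binary file of floats is mapped read-only into memory and then read like an array, without copying it into a buffer first. The helper writeFloats and the class MappedFloats are hypothetical names for this illustration. This is not Annoy's loader and does not reflect its file layout.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical helper: dump raw floats to a binary file.
inline bool writeFloats(const std::string& path, const std::vector<float>& values) {
    std::FILE* fp = std::fopen(path.c_str(), "wb");
    if (!fp) return false;
    const std::size_t written = std::fwrite(values.data(), sizeof(float), values.size(), fp);
    const bool closed = (std::fclose(fp) == 0);
    return closed && written == values.size();
}

// Hypothetical read-only view over a file of floats, backed by mmap.
class MappedFloats {
public:
    explicit MappedFloats(const std::string& path) {
        const int fd = ::open(path.c_str(), O_RDONLY);
        if (fd < 0) return;
        struct stat st;
        if (::fstat(fd, &st) == 0 && st.st_size > 0) {
            const std::size_t len = static_cast<std::size_t>(st.st_size);
            void* p = ::mmap(nullptr, len, PROT_READ, MAP_SHARED, fd, 0);
            if (p != MAP_FAILED) {
                data_ = static_cast<const float*>(p);
                bytes_ = len;
            }
        }
        ::close(fd);  // the mapping remains valid after the descriptor is closed
    }

    ~MappedFloats() {
        if (data_) ::munmap(const_cast<float*>(data_), bytes_);
    }

    MappedFloats(const MappedFloats&) = delete;
    MappedFloats& operator=(const MappedFloats&) = delete;

    bool ok() const { return data_ != nullptr; }
    std::size_t size() const { return bytes_ / sizeof(float); }
    float operator[](std::size_t i) const { return data_[i]; }

private:
    const float* data_ = nullptr;
    std::size_t bytes_ = 0;
};
```

On Windows the same view would have to be created with CreateFileMapping and MapViewOfFile in place of the mmap and munmap calls.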
Dirk Eddelbuettel