Mon, 27 Nov 2017

#11: (Much) Faster Package (Re-)Installation via Caching

Welcome to the eleventh post in the rarely rued R rants series, or R4 for short. Time clearly flies as it has been three months since out last post on significantly reducing library size via stripping. I had been meaning to post on today's topic for quite some time, but somehow something (working on a paper, releasing a package, ...) got in the way.

Just a few days ago Colin (of Efficient R Programming fame) posted about speed(ing up) package installation. His recommendation? Remember that we (usually) have multiple cores and using several of them via options(Ncpus = XX). It is an excellent point, and it bears repeating.

But it turns I have not one but two salient recommendations too. Today covers the first, we should hopefully get pretty soon to the second. Both have one thing in common: you will be fastest if you avoid doing the work in the first place.

What?

One truly outstanding tool for this in the context of the installation of compiled packages is ccache. It is actually a pretty old tool that has been out for well over a decade, and it comes from the folks that gave the Samba filesystem.

What does it do? Well, it nutshell, it "hashes" a checksum of a source file once the preprocessor has operated on it and stores the resulting object file. In the case of rebuild with unchanged code you get the object code back pretty much immediately. The idea is very similar to memoisation (as implemented in R for example in the excellent little memoise package by Hadley, Jim, Kirill and Daniel. The idea is the same: if you have to do something even moderately expensive a few times, do it once and then recall it the other times.

This happens (at least to me) more often that not in package development. Maybe you change just one of several source files. Maybe you just change the R code, the Rd documentation or a test file---yet still need a full reinstallation. In all these cases, ccache can help tremdendously as illustrated below.

How?

Because essentially all our access to compilation happens through R, we need to set this in a file read by R. I use ~/.R/Makevars for this and have something like these lines on my machines:

VER=
CCACHE=ccache
CC=$(CCACHE) gcc$(VER)
CXX=$(CCACHE) g++$(VER)
CXX11=$(CCACHE) g++$(VER)
CXX14=$(CCACHE) g++$(VER)
FC=$(CCACHE) gfortran$(VER)
F77=$(CCACHE) gfortran$(VER)

That way, when R calls the compiler(s) it will prefix with ccache. And ccache will then speed up.

There is an additional issue due to R use. Often we install from a .tar.gz. These will be freshly unpackaged, and hence have "new" timestamps. This would usually lead ccache to skip to file (fear of "false positives") so we have to override this. Similarly, the tarball is usually unpackage in a temporary directory with an ephemeral name, creating a unique path. That too needs to be overwritten. So in my ~/.ccache/ccache.conf I have this:

max_size = 5.0G
# important for R CMD INSTALL *.tar.gz as tarballs are expanded freshly -> fresh ctime
sloppiness = include_file_ctime
# also important as the (temp.) directory name will differ
hash_dir = false

Show Me

A quick illustration will round out the post. Some packages are meatier than others. More C++ with more templates usually means longer build times. Below is a quick chart comparing times for a few such packages (ie RQuantLib, dplyr, rstan) as well as igraph ("merely" a large C package) and lme4 as well as Rcpp. The worst among theseis still my own RQuantLib package wrapping (still just parts of) the genormous and Boost-heavy QuantLib library.

Pretty dramatic gains. Best of all, we can of course combine these with other methods such as Colin's use of multiple CPUs, or even a simple MAKE=make -j4 to have multiple compilation units being considered in parallel. So maybe we all get to spend less time on social media and other timewasters as we spend less time waiting for our builds. Or maybe that is too much to hope for...

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

/code/r4 | permanent link