My talks introducing High Performance Computing with R (see e.g.
these
slides from a five-hour workshop at the ISM in Tokyo) frequently feature an example
of how to extend R with dedicated
compiled code for linear regressions. Romain and I also frequently use this
a motivating examples with our
Rcpp package for
seamless R and C++ integration. In fact, the examples directory for Rcpp still
contains an earlier version of a benchmark for `fastLm()`, a faster
alternative for R's `lm()` and `lm.fit()` functions. We have also
extended this with the
RcppArmadillo
package which brings Conrad Sanderson's excellent
Armadillo library with templated C++ code
for linear algebra to R, as well as a simple integration to the GNU GSL via our
RcppGSL
package. The Rcpp
section on my blog contains several posts about `fastLm` benchmarks.

Doug Bates has been a key Rcpp contributor, helping particularly with the initial Armadillo integration. His research, however, also requires highly performing sparse matrix operations which Armadillo does not yet offer. So Doug has started to explore the Eigen project---a free C++ template math library mainly focused on vectors, matrices, and linear algebra (note that we will refer to the Eigen, Eigen2 and Eigen3 APIs as just 'Eigen' here, focusing on the latest version, Eigen3). Better still, Doug went to work and pretty much single-handedly wrote a new package RcppEigen which integrates the templated C++ library Eigen with R using Rcpp.

RcppEigen also provides a `fastLm` implementation and benchmark
script. In fact, it contains a full six
different implementations as Doug is keenly interested in rank-revealing
decompositions which can guard against ill-conditioned model matrices. Some more
background information on this is also available in Doug's article on *Least
Squares Calculations in R* in
R News 4(1).

Doug's implementation also uses an elegant design. It comprises a base class with common functionality, and six subclasses which specialize accordingly for these six different decompositions approaches:

- a column-pivoting QR,
- a standard QR,
- an LL,
- an LDL,
- an SVD, and
- a standard symmetric one.

On my server, the result of running the included benchmark script
`lmBenchmark` is as follows:

From this first set of results, the preferred method may be 'PivQR', the pivoted QR. Strictly-speaking, it is the only one we can compare tolm benchmark for n = 100000 and p = 40: nrep = 20 test relative elapsed user.self sys.self 7 LLt 1.000000 0.918 0.91 0.00 3 LDLt 1.002179 0.920 0.92 0.00 5 SymmEig 3.021786 2.774 2.19 0.57 6 QR 5.136166 4.715 4.24 0.48 2 PivQR 5.303922 4.869 4.27 0.58 8 arma 6.592593 6.052 6.03 0.02 1 lm.fit 9.386710 8.617 7.14 1.45 4 SVD 33.858388 31.082 30.19 0.84 9 GSL 114.972767 105.545 104.79 0.63

As for pure speed, the LL and LDL decomposition have almost identical
performance, and are clearly faster than the other approaches. Compared to
`lm.fit()`, which is the best one could do with just R, we see an
**improvement by a factor of eight** which is quite impressive (albeit
not robust to rank-deficient model matrices). Apart from the SVD, all
approaches using Eigen are faster than the one using Armadillo, which itself
is still faster than R's `lm.fit()`. Doug and I were very surprised
by the poor performance of the GNU GSL (which also uses SVD) via RcppGSL.

Now, Eigen uses its own code for all linear algebra operations, bypassing BLAS and LAPACK libraries. The results above were achieved with the current Atlas package in Ubuntu. If we take advantage of the BLAS / LAPACK plug-in architecture offered on Debian / Ubuntu systems (see the vignette in my gcbd package for more) and use Goto BLAS which provide tuning as well as parallelism on multi-core machines, the results are as follow:

We see that the BLAS-using Armadillo approach improves a little and moves just slightly ahead of the pivoted QR. On the other hand,lm benchmark for n = 100000 and p = 40: nrep = 20 test relative elapsed user.self sys.self 3 LDLt 1.000000 0.907 0.90 0.00 7 LLt 1.000000 0.907 0.91 0.00 5 SymmEig 2.981257 2.704 2.14 0.56 6 QR 5.004410 4.539 4.03 0.50 8 arma 5.072767 4.601 15.30 3.05 2 PivQR 5.307607 4.814 4.27 0.55 1 lm.fit 8.302095 7.530 9.55 12.25 4 SVD 33.015436 29.945 29.06 0.85 9 GSL 195.413451 177.240 244.64 319.89

This post has illustrated some of the performance gains that can be obtained
from using Eigen via RcppEigen. When not
using rank-revealing methods, computing time can be reduced by up to eight
times relative to `lm.fit()`. Rank-revealing method can still improve
by almost a factor of two. The main disadvantage of Eigen may be one of the
reasons behind its impressive performance: its heavily templated code does
not use BLAS, and the resulting object code (as e.g. in RcppEigen) becomes
enormous (when compiling with debugging symbols). As one illustration, the
shared library for RcppEigen on my Ubuntu 64-bit system has a size of 24.6 mb
whereas RcppArmadillo comes in at a mere 0.78 mb; without debugging symbols it is a
more reasonable 0.52 mb.

The performance of Eigen is certainly intriguiging, and its API is rather complete. It seems safe to say that we may see more R projects going to make use of Eigen thanks to the RcppEigen wrapper.

*Update:* Clarified statement about large
object size which was entirely due to building with debugging support.