||Thinking inside the box|
Tue, 05 Jul 2011
Even faster linear model fits with R using RcppEigen
My talks introducing High Performance Computing with R (see e.g. these slides from a five-hour workshop at the ISM in Tokyo) frequently feature an example of how to extend R with dedicated compiled code for linear regressions. Romain and I also frequently use this a motivating examples with our Rcpp package for seamless R and C++ integration. In fact, the examples directory for Rcpp still contains an earlier version of a benchmark for fastLm(), a faster alternative for R's lm() and lm.fit() functions. We have also extended this with the RcppArmadillo package which brings Conrad Sanderson's excellent Armadillo library with templated C++ code for linear algebra to R, as well as a simple integration to the GNU GSL via our RcppGSL package. The Rcpp section on my blog contains several posts about fastLm benchmarks.
Doug Bates has been a key Rcpp contributor, helping particularly with the initial Armadillo integration. His research, however, also requires highly performing sparse matrix operations which Armadillo does not yet offer. So Doug has started to explore the Eigen project---a free C++ template math library mainly focused on vectors, matrices, and linear algebra (note that we will refer to the Eigen, Eigen2 and Eigen3 APIs as just 'Eigen' here, focusing on the latest version, Eigen3). Better still, Doug went to work and pretty much single-handedly wrote a new package RcppEigen which integrates the templated C++ library Eigen with R using Rcpp.
RcppEigen also provides a fastLm implementation and benchmark script. In fact, it contains a full six different implementations as Doug is keenly interested in rank-revealing decompositions which can guard against ill-conditioned model matrices. Some more background information on this is also available in Doug's article on Least Squares Calculations in R in R News 4(1).
Doug's implementation also uses an elegant design. It comprises a base class with common functionality, and six subclasses which specialize accordingly for these six different decompositions approaches:
On my server, the result of running the included benchmark script lmBenchmark is as follows:
From this first set of results, the preferred method may be 'PivQR', the pivoted QR. Strictly-speaking, it is the only one we can compare to lm.fit() which also uses a pivoting scheme. In the case of a degenerated model matrix, all the other methods, including the four fastest approaches, are susceptible to producing incorrect estimates. Doug plans to make SVD and SymmEig rank-revealing too.lm benchmark for n = 100000 and p = 40: nrep = 20 test relative elapsed user.self sys.self 7 LLt 1.000000 0.918 0.91 0.00 3 LDLt 1.002179 0.920 0.92 0.00 5 SymmEig 3.021786 2.774 2.19 0.57 6 QR 5.136166 4.715 4.24 0.48 2 PivQR 5.303922 4.869 4.27 0.58 8 arma 6.592593 6.052 6.03 0.02 1 lm.fit 9.386710 8.617 7.14 1.45 4 SVD 33.858388 31.082 30.19 0.84 9 GSL 114.972767 105.545 104.79 0.63
As for pure speed, the LL and LDL decomposition have almost identical performance, and are clearly faster than the other approaches. Compared to lm.fit(), which is the best one could do with just R, we see an improvement by a factor of eight which is quite impressive (albeit not robust to rank-deficient model matrices). Apart from the SVD, all approaches using Eigen are faster than the one using Armadillo, which itself is still faster than R's lm.fit(). Doug and I were very surprised by the poor performance of the GNU GSL (which also uses SVD) via RcppGSL.
Now, Eigen uses its own code for all linear algebra operations, bypassing BLAS and LAPACK libraries. The results above were achieved with the current Atlas package in Ubuntu. If we take advantage of the BLAS / LAPACK plug-in architecture offered on Debian / Ubuntu systems (see the vignette in my gcbd package for more) and use Goto BLAS which provide tuning as well as parallelism on multi-core machines, the results are as follow:
We see that the BLAS-using Armadillo approach improves a little and moves just slightly ahead of the pivoted QR. On the other hand, lm.fit(), which also uses a pivoting scheme and hence only level 1 BLAS operations, changes less. GSL performs even worse (and it is unclear why). Doug's post announcing RcppEigen on the Eigen list has a few more sets of results.lm benchmark for n = 100000 and p = 40: nrep = 20 test relative elapsed user.self sys.self 3 LDLt 1.000000 0.907 0.90 0.00 7 LLt 1.000000 0.907 0.91 0.00 5 SymmEig 2.981257 2.704 2.14 0.56 6 QR 5.004410 4.539 4.03 0.50 8 arma 5.072767 4.601 15.30 3.05 2 PivQR 5.307607 4.814 4.27 0.55 1 lm.fit 8.302095 7.530 9.55 12.25 4 SVD 33.015436 29.945 29.06 0.85 9 GSL 195.413451 177.240 244.64 319.89
This post has illustrated some of the performance gains that can be obtained from using Eigen via RcppEigen. When not using rank-revealing methods, computing time can be reduced by up to eight times relative to lm.fit(). Rank-revealing method can still improve by almost a factor of two. The main disadvantage of Eigen may be one of the reasons behind its impressive performance: its heavily templated code does not use BLAS, and the resulting object code (as e.g. in RcppEigen) becomes enormous (when compiling with debugging symbols). As one illustration, the shared library for RcppEigen on my Ubuntu 64-bit system has a size of 24.6 mb whereas RcppArmadillo comes in at a mere 0.78 mb; without debugging symbols it is a more reasonable 0.52 mb.
The performance of Eigen is certainly intriguiging, and its API is rather complete. It seems safe to say that we may see more R projects going to make use of Eigen thanks to the RcppEigen wrapper.
Update: Clarified statement about large object size which was entirely due to building with debugging support.