Thinking inside the box

Accelerating path-dependent loops: A quick Rcpp case study

User BobH asked on StackOverflow about accelerating path-dependent loops. He provided a simple example in which a vector gets filled conditional on the value of the preceding element. Simple to code, but hard to vectorise.

By the time I saw that question yesterday evening, Josh Ulrich had already posted a nice answer suggesting to switch from ifelse to if to escape the overhead of a vectorised expression on simple scalars. I meekly added a comment suggesting that Rcpp would likely do well on this and that someone should volunteer such an answer. Well, it is still August and Mr. Someone is on holiday, so this morning I followed up with such an answer. And as it turns out to work quite well indeed, we will repost it here.

Let's start with the general setup, and the two functions supplied by Josh. We also byte-compile these using the compiler package which is the culmination of several years of work by R core member Luke Tierney. This package was added to R during the last major release, and we assessed it in this earlier blog post as well as several Rcpp-related follow-ups. We also load the packages for on-the-fly 'inline' compilation and easy benchmarking:

library(inline)
library(rbenchmark)
library(compiler)

fun1 <- function(z) {
  for(i in 2:NROW(z)) {
    z[i] <- ifelse(z[i-1]==1, 1, 0)
  }
  z
}
fun1c <- cmpfun(fun1)


fun2 <- function(z) {
  for(i in 2:NROW(z)) {
    z[i] <- if(z[i-1]==1) 1 else 0
  }
  z
}
fun2c <- cmpfun(fun2)

We see that basic worker just assign to the i-th element based on the preceding element. Function two uses the aforementioned if statement, and both are also byte-compiled.

Writing the same code in C++ using both Rcpp (for the R/C++ integration) and inline for the on-the-fly compilation, linking and loading of C++ code into R is pretty straightforward too:

funRcpp <- cxxfunction(signature(zs="numeric"), plugin="Rcpp", body="
  Rcpp::NumericVector z = Rcpp::NumericVector(zs);
  int n = z.size();
  for (int i=1; i<n; i++) {
    z[i] = (z[i-1]==1.0 ? 1.0 : 0.0);
  }
  return(z);
")

This single R function call takes the code embedded in the argument to the body variable, builds a complete function in temporary file, and then compiles, links and loads it. We can now call funRcpp() just like the other four functions and do so via the benchmark() function of the rbenchmark package.

R> z <- rep(c(1,1,0,0,0,0), 100)
R> identical(fun1(z),fun2(z),fun1c(z),fun2c(z),funRcpp(z))
[1] TRUE
R> 
R> res <- benchmark(fun1(z), fun2(z),
+                  fun1c(z), fun2c(z),
+                  funRcpp(z),
+                  columns=c("test", "replications", "elapsed", "relative", "user.self", "sys.self"),
+                  order="relative",
+                  replications=1000)
R> print(res)
        test replications elapsed relative user.self sys.self
5 funRcpp(z)         1000   0.005      1.0      0.01        0
4   fun2c(z)         1000   0.482     96.4      0.48        0
2    fun2(z)         1000   1.989    397.8      1.98        0
3   fun1c(z)         1000  11.365   2273.0     11.37        0
1    fun1(z)         1000  13.210   2642.0     13.21        0

We can focus on the columns labelled elapsed and relative. The C++ version is faster by a factor of almost one-hundred compared to the byte-compiled version of funtion2, and almost four-hundred times faster than the plain-R variant of function2. And function1 is even worse, coming at well over twenty-two-hundred times the run-time of the C++ version. Byte-compilation also helps little here.

For comparison, we also ran the original example of a very short vector, called more frequently:

R> z <- c(1,1,0,0,0,0)
R> res2 <- benchmark(fun1(z), fun2(z),
+                  fun1c(z), fun2c(z),
+                  funRcpp(z),
+                  columns=c("test", "replications", "elapsed", "relative", "user.self", "sys.self"),
+                  order="relative",
+                  replications=10000)
R> print(res2)
        test replications elapsed relative user.self sys.self
5 funRcpp(z)        10000   0.047  1.00000      0.04        0
4   fun2c(z)        10000   0.134  2.85106      0.13        0
2    fun2(z)        10000   0.328  6.97872      0.32        0
3   fun1c(z)        10000   1.080 22.97872      1.08        0
1    fun1(z)        10000   1.243 26.44681      1.24        0

The qualitative ranking is unchanged: the Rcpp version dominates. Function2 using if instead of the vectorised ifelse is second-best with the byte-compiled version being about twice as fast that the plain R variant, but still almost three times slower than the C++ version. And the relative differences are less pronounced: relatively speaking, the function call overhead matters less here and the actual looping matters more: C++ gets a bigger advantage on the actual loop operations in the longer vectors. That it is an important result as it suggests that on more real-life sized data, the compiled version may reap a larger benefit.

All in all a nice demonstration of Rcpp, and a gain of almost one-hundred to the best byte-compiled version is nothing to sneeze at---especially when it is so easy to write and load a five-line C++ function thanks to Rcpp.

Before closing, a quick reminder that I will giving two classes on Rcpp in a few weeks. These will be in New York on September 24, and San Franciso on October 8, see this blog post as well as this page at Revolution Analytics (who are a co-organiser of the classes) for details and registration information.

/code/rcpp | permanent link

Tue, 23 Aug 2011

Accelerating path-dependent loops: A quick Rcpp case study