Thinking inside the box

Thu, 03 Aug 2017

R for System Adminstration

Just getting back from the most fun meetup I have been to in quite some time: episode 2³ (by their count) of Open Source Open Mic hosted by Matt Godbolt and Joe Walnes here in Chicago. Nothing but a sequence of lightning talks. Plus beer and pizza. Sounds awesome? It was!

We had fantastic talks across at least half a dozen languages, covering both new-ish (Pony) and interesting ones such (Rust, Go, ...) plus of course some Javascript and some Python, no Java (yay!) and a few batshit crazy things like a self-hosting database in its own (shell) code, a terminal gif viewer (!!), and more. And it gave me an opportunity to quickly (one evening and morning commute) jam out a presentation about what is in the title: R for system administration.

And I am only half-joking. I had used R a couple of years ago when I needed to select, subset, modify, ... a large number of image files given some timestamp and filename patterns. And given how well R works in a vectorised manner with both regular expressions and timestamps, as well as on top of essentially all standard POSIX-style operating system / file-system functions, I picked up that thread again on the problem of ... cleaning up the file storage underlying CRANberries which by now has well over fifty-seven thousand (!!) tarballs of CRAN packages based on now ten years of CRANberries. So I showed how to prune this in essentially half a dozen lines of R (and data.table code), plus some motivation---all just right for a lightning talk. Seemingly the talk went well enough as quite a few folks gave a thumbs up and compliments over beers afterwards.

But see for yourself as the slides are now uploaded to my standard talks page.

My thanks to Matt and Joe for organizing the meetup. I think I will be back.

/code/snippets | permanent link

Sat, 29 Jul 2017

Updated overbought/oversold plot function

A good six years ago I blogged about plotOBOS() which charts a moving average (from one of several available variants) along with shaded standard deviation bands. That post has a bit more background on the why/how and motivation, but as a teaser here is the resulting chart of the SP500 index (with ticker ^GSCP):

Example chart of overbought/oversold levels from plotOBOS() function

The code uses a few standard finance packages for R (with most of them maintained by Joshua Ulrich given that Jeff Ryan, who co-wrote chunks of these, is effectively retired from public life). Among these, xts had a recent release reflecting changes which occurred during the four (!!) years since the previous release, and covering at least two GSoC projects. With that came subtle API changes: something we all generally try to avoid but which is at times the only way forward. In this case, the shading code I used (via polygon() from base R) no longer cooperated with the beefed-up functionality of plot.xts(). Luckily, Ross Bennett incorporated that same functionality into a new function addPolygon --- which even credits this same post of mine.

With that, the updated code becomes

## plotOBOS -- displaying overbough/oversold as eg in Bespoke's plots
##
## Copyright (C) 2010 - 2017  Dirk Eddelbuettel
##
## This is free software: you can redistribute it and/or modify it
## under the terms of the GNU General Public License as published by
## the Free Software Foundation, either version 2 of the License, or
## (at your option) any later version.

suppressMessages(library(quantmod))     # for getSymbols(), brings in xts too
suppressMessages(library(TTR))          # for various moving averages

plotOBOS <- function(symbol, n=50, type=c("sma", "ema", "zlema"),
                     years=1, blue=TRUE, current=TRUE, title=symbol,
                     ticks=TRUE, axes=TRUE) {

    today <- Sys.Date()
    if (class(symbol) == "character") {
        X <- getSymbols(symbol, from=format(today-365*years-2*n), auto.assign=FALSE)
        x <- X[,6]                          # use Adjusted
    } else if (inherits(symbol, "zoo")) {
        x <- X <- as.xts(symbol)
        current <- FALSE                # don't expand the supplied data
    }

    n <- min(nrow(x)/3, 50)             # as we may not have 50 days

    sub <- ""
    if (current) {
        xx <- getQuote(symbol)
        xt <- xts(xx$Last, order.by=as.Date(xx$`Trade Time`))
        colnames(xt) <- paste(symbol, "Adjusted", sep=".")
        x <- rbind(x, xt)
        sub <- paste("Last price: ", xx$Last, " at ",
                     format(as.POSIXct(xx$`Trade Time`), "%H:%M"), sep="")
    }

    type <- match.arg(type)
    xd <- switch(type,                  # compute xd as the central location via selected MA smoother
                 sma = SMA(x,n),
                 ema = EMA(x,n),
                 zlema = ZLEMA(x,n))
    xv <- runSD(x, n)                   # compute xv as the rolling volatility

    strt <- paste(format(today-365*years), "::", sep="")
    x  <- x[strt]                       # subset plotting range using xts' nice functionality
    xd <- xd[strt]
    xv <- xv[strt]

    xyd <- xy.coords(.index(xd),xd[,1]) # xy coordinates for direct plot commands
    xyv <- xy.coords(.index(xv),xv[,1])

    n <- length(xyd$x)
    xx <- xyd$x[c(1,1:n,n:1)]           # for polygon(): from first point to last and back

    if (blue) {
        blues5 <- c("#EFF3FF", "#BDD7E7", "#6BAED6", "#3182BD", "#08519C") # cf brewer.pal(5, "Blues")
        fairlylight <<- rgb(189/255, 215/255, 231/255, alpha=0.625) # aka blues5[2]
        verylight <<- rgb(239/255, 243/255, 255/255, alpha=0.625) # aka blues5[1]
        dark <<- rgb(8/255, 81/255, 156/255, alpha=0.625) # aka blues5[5]
        ## buglet in xts 0.10-0 requires the <<- here
    } else {
        fairlylight <<- rgb(204/255, 204/255, 204/255, alpha=0.5)  # two suitable grays, alpha-blending at 50%
        verylight <<- rgb(242/255, 242/255, 242/255, alpha=0.5)
        dark <<- 'black'
    }

    plot(x, ylim=range(range(x, xd+2*xv, xd-2*xv, na.rm=TRUE)), main=title, sub=sub, 
         major.ticks=ticks, minor.ticks=ticks, axes=axes) # basic xts plot setup
    addPolygon(xts(cbind(xyd$y+xyv$y, xyd$y+2*xyv$y), order.by=index(x)), on=1, col=fairlylight)  # upper
    addPolygon(xts(cbind(xyd$y-xyv$y, xyd$y+1*xyv$y), order.by=index(x)), on=1, col=verylight)    # center
    addPolygon(xts(cbind(xyd$y-xyv$y, xyd$y-2*xyv$y), order.by=index(x)), on=1, col=fairlylight)  # lower
    lines(xd, lwd=2, col=fairlylight)   # central smooted location
    lines(x, lwd=3, col=dark)           # actual price, thicker
}

and the main change are the three calls to addPolygon. To illustrate, we call plotOBOS("SPY", years=2) with an updated plot of the ETF representing the SP500 over the last two years:

Updated example chart of overbought/oversold levels from plotOBOS() function

Comments and further enhancements welcome!

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

/code/snippets | permanent link

Wed, 22 Mar 2017

Suggests != Depends

A number of packages on CRAN use Suggests: casually.

They list other packages as "not required" in Suggests: -- as opposed to absolutely required via Imports: or the older Depends: -- yet do not test for their use in either examples or, more commonly, unit tests.

So e.g. the unit tests are bound to fail because, well, Suggests != Depends.

This has been accomodated for many years by all parties involved by treating Suggests as a Depends and installing unconditionally. As I understand it, CRAN appears to flip a switch to automatically install all Suggests from major repositories glossing over what I consider to be a packaging shortcoming. (As an aside, treatment of Additonal_repositories: is indeed optional; Brooke Anderson and I have a fine paper under review on this)

I spend a fair amount of time with reverse dependency ("revdep") checks of packages I maintain, and I will no longer accomodate these packages.

These revdep checks take long enough as it is, so I will now blacklist these packages that are guaranteed to fail when their "optional" dependencies are not present.

Writing R Extensions says in Section 1.1.3

All packages that are needed10 to successfully run R CMD check on the package must be listed in one of ‘Depends’ or ‘Suggests’ or ‘Imports’. Packages used to run examples or tests conditionally (e.g. via if(require(pkgname))) should be listed in ‘Suggests’ or ‘Enhances’. (This allows checkers to ensure that all the packages needed for a complete check are installed.)

In particular, packages providing “only” data for examples or vignettes should be listed in ‘Suggests’ rather than ‘Depends’ in order to make lean installations possible.

[...]

It used to be common practice to use require calls for packages listed in ‘Suggests’ in functions which used their functionality, but nowadays it is better to access such functionality via :: calls.

and continues in Section 1.1.3.1

Note that someone wanting to run the examples/tests/vignettes may not have a suggested package available (and it may not even be possible to install it for that platform). The recommendation used to be to make their use conditional via if(require("pkgname"))): this is fine if that conditioning is done in examples/tests/vignettes.

I will now exercise my option to use 'lean installations' as discussed here. If you want your package included in tests I run, please make sure it tests successfully when only its required packages are present.

/code/snippets | permanent link

Sun, 12 Feb 2017

Letting Travis keep a secret

More and more packages, be it for R or another language, are now interfacing different application programming interfaces (API) which are exposed to the web. And many of these may require an API key, or token, or account and password.

Which traditionally poses a problem in automated tests such as those running on the popular Travis CI service which integrates so well with GitHub. A case in point is the RPushbullet package where Seth Wenchel and I have been making a few recent changes and additions.

And yesterday morning, I finally looked more closely into providing Travis CI with the required API key so that we could in fact run continuous integration with unit tests following each commit. And it turns that it is both easy and quick to do, and yet another great showcase for ad-hoc Docker use.

The rest of this post will give a quick minimal run-down, this time using the gtrendsR package by Philippe Massicotte and myself. Start by glancing at the 'encrypting files' HOWTO from Travis itself.

We assume you have Docker installed, and a suitable base package. We will need Ruby, so any base Linux image will do. In what follows, I use Ubuntu 14.04 but many other Debian, Ubunti, Fedora, ... flavours could be used provided you know how to pick the relevant packages. What is shown here should work on any recent Debian or Ubuntu flavour 'as is'.

We start by firing off the Docker engine in the repo directory for which we want to create an encrypted file. The -v $(pwd):/mnt switch mounts the current directory as /mnt in the Docker instance:

edd@max:~/git/gtrendsr(master)$ docker run --rm -ti -v $(pwd):/mnt ubuntu:trusty
root@38b478356439:/# apt-get update    ## this takes a minute or two
Ign http://archive.ubuntu.com trusty InRelease
Get:1 http://archive.ubuntu.com trusty-updates InRelease [65.9 kB]
Get:2 http://archive.ubuntu.com trusty-security InRelease [65.9 kB]
# ... a dozen+ lines omitted ...
Get:21 http://archive.ubuntu.com trusty/restricted amd64 Packages [16.0 kB]    
Get:22 http://archive.ubuntu.com trusty/universe amd64 Packages [7589 kB]      
Fetched 22.4 MB in 6min 40s (55.8 kB/s)                                        
Reading package lists... Done
root@38b478356439:/#

We then install what is needed to actually install the travis (Ruby) gem, as well as git which is used by it:

root@38b478356439:/# apt-get install -y ruby ruby-dev gem build-essential git
Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following extra packages will be installed:
# ... lot of output ommitted ...
Processing triggers for ureadahead (0.100.0-16) ...
Processing triggers for sgml-base (1.26+nmu4ubuntu1) ...
root@38b478356439:/#

This too may take a few minutes, depending on the networking bandwidth and other factors, and should in general succeed without the need for any intervention. Once it has concluded, we can use the now-complete infrastructure to install the travis command-line client:

root@38b478356439:/# gem install travis
Fetching: multipart-post-2.0.0.gem (100%)
Fetching: faraday-0.11.0.gem (100%)
Fetching: faraday_middleware-0.11.0.1.gem (100%)
Fetching: highline-1.7.8.gem (100%)
Fetching: backports-3.6.8.gem (100%)
Fetching: multi_json-1.12.1.gem (100%
# ... many lines omitted ...
Installing RDoc documentation for websocket-1.2.4...
Installing RDoc documentation for json-2.0.3...
Installing RDoc documentation for pusher-client-0.6.2...
Installing RDoc documentation for travis-1.8.6...
root@38b478356439:/#

This in turn will take a moment.

Once done, we can use the travis client to login into GitHub. In my base this requires a password and a two-factor authentication code. Also note that we switch directories first to be in the actual repo we had mounted when launching docker.

root@38b478356439:/# cd /mnt/    ## change to repo directory
root@38b478356439:/mnt# travis --login
Shell completion not installed. Would you like to install it now? |y| y
We need your GitHub login to identify you.
This information will not be sent to Travis CI, only to api.github.com.
The password will not be displayed.

Try running with --github-token or --auto if you don't want to enter your password anyway.

Username: eddelbuettel
Password for eddelbuettel: ****************
Two-factor authentication code for eddelbuettel: xxxxxx
Successfully logged in as eddelbuettel!
root@38b478356439:/mnt#

Now the actual work of encrypting. For this particular package, we need a file .Rprofile containing a short option() segment setting a user-id and password:

root@38b478356439:/mnt# travis encrypt-file .Rprofile
Detected repository as PMassicotte/gtrendsR, is this correct? |yes| 
encrypting .Rprofile for PMassicotte/gtrendsR
storing result as .Rprofile.enc
storing secure env variables for decryption

Please add the following to your build script (before_install stage in your .travis.yml, for instance):

    openssl aes-256-cbc -K $encrypted_988d19a907a0_key -iv $encrypted_988d19a907a0_iv -in .Rprofile.enc -out .Rprofile -d

Pro Tip: You can add it automatically by running with --add.

Make sure to add .Rprofile.enc to the git repository.
Make sure not to add .Rprofile to the git repository.
Commit all changes to your .travis.yml.
root@38b478356439:/mnt#

That's it. Now we just need to follow-through as indicated, committing the .Rprofile.enc file, making sure to not commit its input file .Rprofile, and adding the proper openssl invocation with the keys known only to Travis to the file .travis.yml.

/code/snippets | permanent link

Sat, 21 Jan 2017

Updated Example Repo for RMarkdown and Metropolis/Mtheme

During useR! 2016, Nick Tierney had asked on Twitter about rmarkdown and metropolis and whether folks had used RMarkdown-driven LaTeX Beamer presentations. My firm hell yeah answer, based on having used mtheme outright or in local mods for quite some time (see my talks page), lead to this blog post of mine describing this GitHub repo I had quickly set up during breaks at useR! 2016. The corresponding blog post and the repo have some more details on how I do this, in particular about local packages (also with sources on GitHub) for the non-standard fonts I use.

This week I got around to updating the repo / example a little by making the default colours (in my example) a little less awful, and a page on blocks and, most importantly, turning the example into the animated gif below:

And thanks to the beautiful tint package -- see its repo and CRAN package --- I now know how to create a template package. So if there is interest (and spare time), we could build a template package for RStudio too.

With that, may I ask a personal favour of anybody still reading the post? Please do not hit my twitter handle with questions for support. All my code is an GitHub, and issue tickets there are much preferred. Larger projects like Rcpp also have their own mailing lists, and it is much better to use those. And if you like neither, maybe ask on StackOverflow. But please don't spam my Twitter handle. Thank you.

/code/snippets | permanent link

Thu, 30 Jun 2016

RMarkdown and Metropolis/Mtheme

Nick Tierney asked on Twitter about rmarkdown and metropolis about whether folks had used RMarkdown-driven LaTeX Beamer presentations. And the answer is a firm hell yeah. I have been using mtheme (and/or a local variant I called 'm2') as well as the newer (renamed) release mtheme for the last year or two for all my RMarkdown-based presentations as you can see from my presentations page.

And earlier this year back I cleaned this up and wrote myself local Ubuntu packages which are here on Launchpad. I also have two GitHub repos for the underlying .deb package code: - the pkg-latex-metropolis package for the LaTeX part (which is also in TeXlive in an older version) - the pkg-fonts-fira for the underlying (free) font (and this sadly cannot build on launchpad as it needs a download step).

To round things up, I now also created a public 'sample' repo on GitHub. It is complete for all but the custom per-presenteation header.tex that modifies colours, add local definitions etc as needed for each presentation.

With that, Happy Canada Day (tomorrow, though) -- never felt better to be part of something Glorious and Free, and also free of Brexit, Drumpf and other nonsense.

/code/snippets | permanent link

Sun, 26 Jul 2015

Evading the "Hadley tax": Faster Travis tests for R

Hadley is a popular figure, and rightly so as he successfully introduced many newcomers to the wonders offered by R. His approach strikes some of us old greybeards as wrong---I particularly take exception with some of his writing which frequently portrays a particular approach as both the best and only one. Real programming, I think, is often a little more nuanced and aware of tradeoffs which need to be balanced. As a book on another language once popularized: "There is more than one way to do things." But let us leave this discussion for another time.

As the reach of the Hadleyverse keeps spreading, we sometimes find ourselves at the receiving end of a cost/benefit tradeoff. That is what this post is about, and it uses a very concrete case I encountered yesterday.

As blogged earlier, the RcppZiggurat package was updated. I had not touched it in a year, but Brian Ripley had sent a brief and detailed note concerning something flagged by the Solaris compiler (correctly suggesting I replace fabs() with abs() on integer types). (Allow me to stray from the main story line here for a second to stress just how insane a work load he is carrying, essentially for all of us. R and the R community are so just so indebted to him for all his work---which makes the usual social media banter about him so unfortunate. But that too shall be left for another time.) Upon making the simple fix, and submitting to GitHub the usual Travis CI was triggered. And here is what I saw:

first travis build in a year
All happy, all green. Previous build a year ago, most recent build yesterday, both passed. But hold on: test time went from 2:54 minutes to 7:47 minutes for an increase of almost five minutes! And I knew that I had not added any new dependencies, or altered any build options. What did happen was that among the dependencies of my package, one had decided to now also depend on ggplot2. Which leads to a chain of sixteen additional packages being loaded besides the four I depend upon---when it used to be just one. And that took five minutes as all those packages are installed from source, and some are big and take a long time to compile.

There is however and easy alternative, and for that we have to praise Michael Rutter who looks after a number of things for R on Ubuntu. Among these are the R builds for Ubuntu but also the rrutter PPA as well as the c2d4u PPA. If you have not heard this alphabet soup before, a PPA is a package repository for Ubuntu where anyone (who wants to sign up) can upload (properly setup) source files which are then turned into Ubuntu binaries. With full dependency resolution and all other goodies we have come to expect from the Debian / Ubuntu universe. And Michael uses this facility with great skill and calm to provide us all with Ubuntu binaries for R itself (rebuilding what yours truly uploads into Debian), as well as a number of key packages available via the CRAN mirrors. Less know however is this "c2d4u" which stands for CRAN to Debian for Ubuntu. And this builds on something Charles Blundell once built under my mentorship in a Google Summer of Code. And Michael does a tremdous job covering well over a thousand CRAN source packages---and providing binaries for all. Which we can use for Travis!

What all that means is that I could now replace the line

 - ./travis-tool.sh install_r RcppGSL rbenchmark microbenchmark highlight

which implies source builds of the four listed packages and all their dependencies with the following line implying binary installations of already built packages:

 - ./travis-tool.sh install_aptget libgsl0-dev r-cran-rcppgsl r-cran-rbenchmark r-cran-microbenchmark r-cran-highlight

In this particular case I also needed to build a binary package of my RcppGSL package as this one is not (yet) handled by Michael. I happen to have (re-)discovered the beauty of PPAs for Travis earlier this year and revitalized an older and largely dormant launchpad account I had for this PPA of mine. How to build a simple .deb package will also have to left for a future post to keep this more concise.

This can be used with the existing r-travis setup---but one needs to use the older, initial variant in order to have the ability to install .deb packages. So in the .travis.yml of RcppZiggurat I just use

before_install:
## PPA for Rcpp and some other packages
- sudo add-apt-repository -y ppa:edd/misc
## r-travis by Craig Citro et al
- curl -OL http://raw.github.com/craigcitro/r-travis/master/scripts/travis-tool.sh
- chmod 755 ./travis-tool.sh
- ./travis-tool.sh bootstrap

to add my own PPA and all is good. If you do not have a PPA, or do not want to create your own packages you can still benefit from the PPAs by Michael and "mix and match" by installing from binary what is available, and from source what is not.

Here we were able to use an all-binary approach, so let's see the resulting performance:

latest travis build
Now we are at 1:03 to 1:15 minutes---much better.

So to conclude, while the every expanding universe of R packages is fantastic for us as users, it can be seen to be placing a burden on us as developers when installing and testing. Fortunately, the packaging infrastructure built on top of Debian / Ubuntu packages can help and dramatically reduce build (and hence test) times. Learning about PPAs can be a helpful complement to learning about Travis and continued integration. So maybe now I need a new reason to blame Hadley? Well, there is always snake case ...

Follow-up: The post got some pretty immediate feedback shortly after I posted it. Craig Citro pointed out (quite correctly) that I could use r_binary_install which would also install the Ubuntu binaries based on their R packages names. Having built R/CRAN packages for Debian for so long, I am simply more used to the r-cran-* notations, and I think I was also the one contributing install_aptget to r-travis ... Yihui Xie spoke up for the "new" Travis approach deploying containers, caching of packages and explicit whitelists. It was in that very (GH-based) discussion that I started to really lose faith in the new Travis approach as they want use to whitelist each and every package. With 6900 and counting at CRAN I fear this simply does not scale. But different approaches are certainly welcome. I posted my 1:03 to 1:15 minutes result. If the "New School" can do it faster, I'd be all ears.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

/code/snippets | permanent link

Fri, 28 Nov 2014

CRAN Task Views for Finance and HPC now (also) on GitHub

The CRAN Task View system is a fine project which Achim Zeileis initiated almost a decade ago. It is described in a short R Journal article in Volume 5, Number 1. I have been editor / maintainer of the Finance task view essentially since the very beginning of these CRAN Task Views, and added the High-Performance Computing one in the fall of 2008. Many, many people have helped by sending suggestions or even patches; email continues to be the main venue for the changes.

The maintainers of the Web Technologies task view were, at least as far as I know, the first to make the jump to maintaining the task view on GitHub. Karthik and I briefly talked about this when he was in town a few weeks ago for our joint Software Carpentry workshop at Northwestern.

So the topic had been on my mind, but it was only today that I realized that the near-limitless amount of awesome that is pandoc can probably help with maintenance. The task view code by Achim neatly converts the very regular, very XML, very boring original format into somewhat-CRAN-website-specific html. Pandoc, being as versatile as it is, can then make (GitHub-flavoured) markdown out of this, and with a minimal amount of sed magic, we get what we need.

And hence we now have these two new repos:

Contributions are now most welcome by pull request. You can run the included converter scripts, it differs between both repos only by one constant for the task view / file name. As an illustration, the one for Finance is below.

#!/usr/bin/r
## if you do not have /usr/bin/r from littler, just use Rscript

ctv <- "Finance"

ctvfile  <- paste0(ctv, ".ctv")
htmlfile <- paste0(ctv, ".html")
mdfile   <- "README.md"

## load packages
suppressMessages(library(XML))          # called by ctv
suppressMessages(library(ctv))

r <- getOption("repos")                 # set CRAN mirror
r["CRAN"] <- "http://cran.rstudio.com"
options(repos=r)

check_ctv_packages(ctvfile)             # run the check

## create html file from ctv file
ctv2html(read.ctv(ctvfile), htmlfile)

### these look atrocious, but are pretty straight forward. read them one by one
###  - start from the htmlfile
cmd <- paste0("cat ", htmlfile,
###  - in lines of the form  ^<a href="Word">Word.html</a>
###  - capture the 'Word' and insert it into a larger URL containing an absolute reference to task view 'Word'
  " | sed -e 's|^<a href=\"\\([a-zA-Z]*\\)\\.html|<a href=\"http://cran.rstudio.com/web/views/\\1.html\"|' | ",
###  - call pandoc, specifying html as input and github-flavoured markdown as output
              "pandoc -s -r html -w markdown_github | ",
###  - deal with the header by removing extra ||, replacing |** with ** and **| with **:              
              "sed -e's/||//g' -e's/|\\*\\*/\\*\\*/g' -e's/\\*\\*|/\\*\\* /g' -e's/|$/  /g' ",
###  - make the implicit URL to packages explicit
              "-e's|../packages/|http://cran.rstudio.com/web/packages/|g' ",
###  - write out mdfile
              "> ", mdfile)

system(cmd)                             # run the conversion

unlink(htmlfile)                        # remove temporary html file

cat("Done.\n")

I am quite pleased with this setup---so a quick thanks towards the maintainers of the Web Technologies task view; of course to Achim for creating CRAN Task Views in the first place, and maintaining them all those years; as always to John MacFarlance for the magic that is pandoc; and last but not least of course to anybody who has contributed to the CRAN Task Views.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

/code/snippets | permanent link

Wed, 16 Jul 2014

Introducing RcppParallel: Getting R and C++ to work (some more) in parallel

A common theme over the last few decades was that we could afford to simply sit back and let computer (hardware) engineers take care of increases in computing speed thanks to Moore's law. That same line of thought now frequently points out that we are getting closer and closer to the physical limits of what Moore's law can do for us.

So the new best hope is (and has been) parallel processing. Even our smartphones have multiple cores, and most if not all retail PCs now possess two, four or more cores. Real computers, aka somewhat decent servers, can be had with 24, 32 or more cores as well, and all that is before we even consider GPU coprocessors or other upcoming changes.

And sometimes our tasks are embarassingly simple as is the case with many data-parallel jobs: we can use higher-level operations such as those offered by the base R package parallel to spawn multiple processing tasks and gather the results. I covered all this in some detail in previous talks on High Performance Computing with R (and you can also consult the Task View on High Performance Computing with R which I edit).

But sometimes we can't use data-parallel approaches. Hence we have to redo our algorithms. Which is really hard. R itself has been relying on the (fairly mature) OpenMP standard for some of its operations. Luke Tierney's (awesome) keynote in May at our (sixth) R/Finance conference mentioned some of the issues related to OpenMP. One which matters is that OpenMP works really well on Linux, and either not so well (Windows) or not at all (OS X, due the usual issue with the gcc/clang switch enforced by Applem but the good news is that the OpenMP toolchain is expected to make it to OS X is some more performant form "soon"). R is still expected to make wider use of OpenMP in future versions.

Another tool which has been around for a few years, and which can be considered to be equally mature is the Intel Threaded Building Blocks library, or TBB. JJ recently started to wrap this up for use by R. The first approach resulted in a (now superseded, see below) package TBB. But hardware and OS issues bite once again, as the Intel TBB is not really building that well for the Windows toolchain used by R (and based on MinGW).

(And yes, there are two more options. But Boost Threads requires linking which precludes easy use as e.g. via our BH package. And C++11 with its threads library (based on Boost Threads) is not yet as widely available as R and Rcpp which means that it is not a real deployment option yet.)

Now, JJ, being as awesome as he is, went back to the drawing board and integrated a second threading toolkit: TinyThread++, a small header-only library without further dependencies. Not as feature-rich as Intel Threaded Building Blocks, but at least available everywhere. So a new package RcppParallel, so far only on GitHub, wraps around both TinyThread++ and Intel Threaded Building Blocks and offers a consistent interface available on all platforms used by R.

Better still, JJ also authored several pieces demonstrating this new package for the Rcpp Gallery:

All four are interesting and demonstrate different aspects of parallel computing via RcppParallel. But the last article is key. Based on a question by Jim Bullard, and then written with Jim, it shows how a particular matrix distance metric (which is missing from R) can be implemented in a serial manner in both R, and also via Rcpp. The key implementation, however, uses both Rcpp and RcppParallel and thereby achieves a truly impressive speed gain as the gains from using compiled code (via Rcpp) and from using a parallel algorithm (via RcppParallel) are multiplicative! Between JJ's and my four-core machines the gain was between 200 and 300 fold---which is rather considerable. For kicks, I also used a much bigger machine at work which came in at an even larger speed gain (but gains become clearly sublinear as the number of cores increases; there are however some tuning parameters).

So these are exciting times. I am sure there will be lots more to come. For now, head over to the RcppParallel package and start playing. Further contributions to the Rcpp Gallery are not only welcome but strongly encouraged.

/code/snippets | permanent link

Tue, 11 Mar 2014

QuantLib 1.4 packages also available for Ubuntu and Windoze

QuantLib release 1.4 came out a couple of days ago. Sticking with standard practice, I immediately updated the Debian packages I have been maintaining for over a dozen years now, and corresponding binaries are available for Debian flavors unstable ("sid") and testing ("jessie"). I also updated my RQuantLib package accordingly (though it needed no changes -- yay to stable APIs).

However, I also run Ubuntu on a few machines at work and home, and proceeded to create local packages for both 32-bit ("i386") and 64-bit ("amd64") variants of the current release 13.10 ("saucy"). As these may be of use to others, I am now making these available here from my box so that folks can run a current QuantLib version on those machines.

Lastly, Brian Ripley was kind enough to also fire up his cross-compilation setup on his Linux box to create similar Windows binaries, These are e.g. used on CRAN and by the win-builder service to compile the RQuantLib CRAN Windows builds of RQuantLib. That corresponding archive file can be obtained as Quantlib14.zip see this short page for details. My thanks to Brian for helping with this.

I hope this may be useful to the few people wishing to compile and/or extend RQuantLib and QuantLib.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

/code/snippets | permanent link

Wed, 23 Oct 2013

Introducing the CRAN Repository Policy Watch

CRAN is the repository network for R. It is a resounding success with (as of right now) almost 5000 packages, and growth rate which has been estimated (cf John Fox's keynote at useR! a few years ago) to be near 40% per year.

We as R community members owe a great deal of thanks to the CRAN maintainers, and the R Core team. The tight integration between the language and the contributed code repository is unique among programming languages, and one of the many reasons why CRAN has been such a success and driver of growth and adoption for R. And the amount of work the CRAN maintainers put into this is substantial, and we owe them.

Yet there is some friction between the repo maintainers, and the community of developers. There have numerous discussions on the main developer list about various aspect of how CRAN maintains the impeccable quality of the code in the archive. One particular aspect which has bugging (at least me) is the lack of communication when policy changes are made. It would be trivial to posts a set of changes to the developer list, and I suggested as much. Only to be soundly ignored.

But changes to text (or html) files can monitored rather easily, and when such changes occur an alert can be sent. So I cooked up rather simple system for this which I called the CRAN Policy Watch. In essence, a simple cronjob monitors changes, and records new versions in a Github repo (which you can follow or star). Alternatively, the cronjob now also tweets from the @CRANPolicyWatch account which I invite everyone to subscribe to as well.

If someone knows of a simple tool to summarize diffs of html or text files in static html pages, I'd be interested to expand the service to some github.io pages. Alernatively I could also just commit to a single file too and let Github summarize the changes.

This was an itch I needed to scratch, and I hope some other people will fine this useful too.

/code/snippets | permanent link

Thu, 31 Jan 2013

Introducing the BH package

Earlier today a new package BH arrived on CRAN. Over the years, Jay Emerson, Michael Kane and I had numerous discussions about a basic Boost infrastructure package providing Boost headers for other CRAN packages (and yes, we are talking packages using C++ here). JJ and Romain chipped in as well, and Jay finally took the lead by first creating a repo on R-Forge. And now the package is out, so I just put together a quick demo post over at the Rcpp Gallery.

As that post notes, BH is still pretty new and rough, and we probably missed some other useful Boost packages. If so, let one of us know.

/code/snippets | permanent link

Sat, 08 Dec 2012

Rcpp attributes: Even easier integration of GSL code into R

Following the Rcpp 0.10.0 release, I had written about simulating pi easily by using the wonderful new Rcpp Attributes feature. Now with Rcpp 0.10.1 released a good week ago, it is time to look at how Rcpp Attributes can help with external libraries. As this posts aims to show, it is a breeze!

One key aspect is the use of the plugins for the inline package. They provide something akin to a callback mechanism so that compilation and linking steps can be informed about header and library locations and names. We are going to illustrate this with an example from GNU Scientific Library (GSL). The example I picked uses B-Spline estimation from the GSL. This is a little redundant as R has its own spline routines and package, but serves well as a simple illustration---and, by reproducing an existing example, followed an established path. So we will look at Section 39.7 of the GSL manual which has a complete example as a standalone C program, generating both the data and the fit via cubic B-splines.

We can decompose this two parts: data generation, and fitting. We will provide one function each, and then use both from R. These two function will follow the aforementioned example from Section 39.7 somewhat closely.

We start with the first function to generate the data.

// [[Rcpp::depends(RcppGSL)]]
#include <RcppGSL.h>

#include <gsl/gsl_bspline.h>
#include <gsl/gsl_multifit.h>
#include <gsl/gsl_rng.h>
#include <gsl/gsl_randist.h>
#include <gsl/gsl_statistics.h>

const int N = 200;                              // number of data points to fit 
const int NCOEFFS = 12;                         // number of fit coefficients */
const int NBREAK = (NCOEFFS - 2);               // nbreak = ncoeffs + 2 - k = ncoeffs - 2 since k = 4 */

// [[Rcpp::export]]
Rcpp::List genData() {

    const size_t n = N;
    size_t i;
    double dy;
    gsl_rng *r;
    RcppGSL::vector<double> w(n), x(n), y(n);

    gsl_rng_env_setup();
    r = gsl_rng_alloc(gsl_rng_default);

    //printf("#m=0,S=0\n");
    /* this is the data to be fitted */

    for (i = 0; i < n; ++i) {
        double sigma;
        double xi = (15.0 / (N - 1)) * i;
        double yi = cos(xi) * exp(-0.1 * xi);

        sigma = 0.1 * yi;
        dy = gsl_ran_gaussian(r, sigma);
        yi += dy;

        gsl_vector_set(x, i, xi);
        gsl_vector_set(y, i, yi);
        gsl_vector_set(w, i, 1.0 / (sigma * sigma));
                
        //printf("%f %f\n", xi, yi);
    }

    Rcpp::DataFrame res = Rcpp::DataFrame::create(Rcpp::Named("x") = x,
                                                  Rcpp::Named("y") = y,
                                                  Rcpp::Named("w") = w);

    x.free();
    y.free();
    w.free();
    gsl_rng_free(r);

    return(res);
}

We include a few header files, define (in what is common for C programs) a few constants and then define a single function genData() which returns and Rcpp::List as a list object to R. A primary importance here are the two attributes: one to declare a dependence on the RcppGSL package, and one to declare the export of the data generator function. That is all it takes! The plugin of the RcppGSL will provide information about the headers and library, and Rcpp Attributes will do the rest.

The core of the function is fairly self-explanatory, and closely follows the original example. Space gets allocated, the RNG is setup and a simple functional form generates some data plus noise (see below). In the original, the data is written to the standard output; here we return it to R as three columns in a data.frame object familiar to R users. We then free the GSL vectors; this manual step is needed as they are implemented as C vectors which do not have a destructor.

Next, we can turn the fitting function.

// [[Rcpp::export]]
Rcpp::List fitData(Rcpp::DataFrame ds) {

    const size_t ncoeffs = NCOEFFS;
    const size_t nbreak = NBREAK;

    const size_t n = N;
    size_t i, j;

    Rcpp::DataFrame D(ds);              // construct the data.frame object
    RcppGSL::vector<double> y = D["y"]; // access columns by name, 
    RcppGSL::vector<double> x = D["x"]; // assigning to GSL vectors
    RcppGSL::vector<double> w = D["w"];

    gsl_bspline_workspace *bw;
    gsl_vector *B;
    gsl_vector *c; 
    gsl_matrix *X, *cov;
    gsl_multifit_linear_workspace *mw;
    double chisq, Rsq, dof, tss;

    bw = gsl_bspline_alloc(4, nbreak);      // allocate a cubic bspline workspace (k = 4)
    B = gsl_vector_alloc(ncoeffs);

    X = gsl_matrix_alloc(n, ncoeffs);
    c = gsl_vector_alloc(ncoeffs);
    cov = gsl_matrix_alloc(ncoeffs, ncoeffs);
    mw = gsl_multifit_linear_alloc(n, ncoeffs);

    gsl_bspline_knots_uniform(0.0, 15.0, bw);   // use uniform breakpoints on [0, 15] 

    for (i = 0; i < n; ++i) {                   // construct the fit matrix X 
        double xi = gsl_vector_get(x, i);

        gsl_bspline_eval(xi, B, bw);            // compute B_j(xi) for all j 

        for (j = 0; j < ncoeffs; ++j) {         // fill in row i of X 
            double Bj = gsl_vector_get(B, j);
            gsl_matrix_set(X, i, j, Bj);
        }
    }

    gsl_multifit_wlinear(X, w, y, c, cov, &chisq, mw);  // do the fit 
    
    dof = n - ncoeffs;
    tss = gsl_stats_wtss(w->data, 1, y->data, 1, y->size);
    Rsq = 1.0 - chisq / tss;
    
    Rcpp::NumericVector FX(151), FY(151);       // output the smoothed curve 
    double xi, yi, yerr;
    for (xi = 0.0, i=0; xi < 15.0; xi += 0.1, i++) {
        gsl_bspline_eval(xi, B, bw);
        gsl_multifit_linear_est(B, c, cov, &yi, &yerr);
        FX[i] = xi;
        FY[i] = yi;
    }

    Rcpp::List res =
      Rcpp::List::create(Rcpp::Named("X")=FX,
                         Rcpp::Named("Y")=FY,
                         Rcpp::Named("chisqdof")=Rcpp::wrap(chisq/dof),
                         Rcpp::Named("rsq")=Rcpp::wrap(Rsq));

    gsl_bspline_free(bw);
    gsl_vector_free(B);
    gsl_matrix_free(X);
    gsl_vector_free(c);
    gsl_matrix_free(cov);
    gsl_multifit_linear_free(mw);
    
    y.free();
    x.free();
    w.free();

    return(res);   
}

The second function closely follows the second part of the GSL example and, given the input data, fits the output data. Data structures are setup, the spline basis is created, data is fit and then the fit is evaluated at a number of points. These two vectors are returned along with two goodness of fit measures.

We only need to load the Rcpp package and source a file containing the two snippets shown above, and we are ready to deploy this:

library(Rcpp)
sourceCpp("bSpline.cpp")                # compile two functions
dat <- genData()                        # generate the data
fit <- fitData(dat)                     # fit the model, returns matrix and gof measures

And with that, we generate a chart such as

Spline fitting example from GSL manual redone with Rcpp Attributes

via a simple four lines, or as much as it took to create the C++ functions, generate the data and fit it!

op <- par(mar=c(3,3,1,1))
plot(dat[,"x"], dat[,"y"], pch=19, col="#00000044")
lines(fit[[1]], fit[[2]], col="orange", lwd=2)
par(op)

The RcppArmadillo and RcppEigen package support plugin use in the same way. Add an attribute to export a function, and an attribute for the depends -- and you're done. Extending R with (potentially much faster) C++ code has never been easier, and opens a whole new set of doors.

/code/snippets | permanent link

Tue, 20 Nov 2012

Rcpp attributes: A simple example 'making pi'

We introduced Rcpp 0.10.0 with a number of very nice new features a few days ago, and the activity on the rcpp-devel mailing list has been pretty responsive which is awesome.

But because few things beat a nice example, this post tries to build some more excitement. We will illustrate how Rcpp attributes makes it really easy to add C++ code to R session, and that that code is as easy to grasp as R code.

Our motivating example is everybody's favourite introduction to Monte Carlo simulation: estimating π. A common method uses the fact the unit circle has a surface area equal to π. We draw two uniform random numbers x and y, each between zero and one. We then check for the distance of the corresponding point (x,y) relative to the origin. If less than one (or equal), it is in the circle (or on it); if more than one it is outside. As the first quadrant is a quarter of a square of area one, the area of the whole circle is π -- so our first quadrant approximates π over four. The following figure, kindly borrowed from Wikipedia with full attribution and credit, illustrates this:

Example of simulating pi

Now, a vectorized version (drawing N such pairs at once) of this approach is provided by the following R function.

piR <- function(N) {
    x <- runif(N)
    y <- runif(N)
    d <- sqrt(x^2 + y^2)
    return(4 * sum(d < 1.0) / N)
}

And in C++ we can write almost exactly the same function thanks the Rcpp sugar vectorisation available via Rcpp:

#include <Rcpp.h>

using namespace Rcpp;

// [[Rcpp::export]]
double piSugar(const int N) {
    RNGScope scope;		// ensure RNG gets set/reset
    NumericVector x = runif(N);
    NumericVector y = runif(N);
    NumericVector d = sqrt(x*x + y*y);
    return 4.0 * sum(d < 1.0) / N;
}

Sure, there are small differences: C++ is statically typed, R is not. We need one include file for declaration, and we need one instantiation of the RNGScope object to ensure random number draws remain coordinated between the calling R process and the C++ function calling into its (compiled C-code based) random number generators. That way we even get the exact same draws for the same seed. But the basic approach is identical: draw a vector x and vector y, compute the distance to the origin and then obtain the proportion within the unit circle -- which we scale by four. Same idea, same vectorised implementation in C++.

But the real key here is the one short line with the [[Rcpp::export]] attribute. This is all it takes (along with sourceCpp() from Rcpp 0.10.0) to get the C++ code into R.

The full example (which assumes the C++ file is saved as piSugar.cpp in the same directory) is now:

#!/usr/bin/r

library(Rcpp)
library(rbenchmark)

piR <- function(N) {
    x <- runif(N)
    y <- runif(N)
    d <- sqrt(x^2 + y^2)
    return(4 * sum(d < 1.0) / N)
}

sourceCpp("piSugar.cpp")

N <- 1e6

set.seed(42)
resR <- piR(N)

set.seed(42)
resCpp <- piSugar(N)

## important: check results are identical with RNG seeded
stopifnot(identical(resR, resCpp))

res <- benchmark(piR(N), piSugar(N), order="relative")

print(res[,1:4])

and it does a few things: set up the R function, source the C++ function (and presto: we have a callable C++ function just like that), compute two simulations given the same seed and ensure they are in fact identical -- and proceed to compare the timing in a benchmarking exercise. That last aspect is not even that important -- we end up being almost-but-not-quite twice as fast on my machine for different values of N.

The real takeaway here is the ease with which we can get a C++ function into R --- and the new process completely takes care of passing parameters in, results out, and does the compilation, linking and loading.

More details about Rcpp attributes are in the new vignette. Now enjoy the π.

Update:One somewhat bad typo fixed.

Update:Corrected one background tag.

/code/snippets | permanent link

Wed, 14 Nov 2012

Rcpp and the new R:: namespace for Rmath.h

We released Rcpp 0.10.0 earlier today. This post will just provide a simple example for one of the smaller new features -- the new namespace for functions from Rmath.h -- and illustrate one of the key features (Rcpp attributes) in passing.

R, as a statistical language and environment, has very well written and tested statistical distribution functions providing probability density, cumulative distribution, quantiles and random number draws for dozens of common and not so common distribution functions. This code is used inside R, and available for use from standalone C or C++ programs via the standalone R math library which Debian / Ubuntu have as a package r-mathlib (and which can be built from R sources).

User sometimes write code against this interface, and then want to combine the code with other code, possibly even with Rcpp. We allowed for this, but it required a bit of an ugly interface. R provides a C interface; these have no namespaces. Identifiers can clash, and to be safe one can enable a generic prefix Rf_. So functions which could clash such as length or error become Rf_length and Rf_error and are less likely to conflict with symbols from other libraries. Unfortunately, the side-effect is that calling, say, the probability distribution function for the Normal distribution becomes Rf_pnorm5() (with the 5 denoting the five parameters: quantile, mean, std.deviation, lowerTail, logValue). Not pretty, and not obvious.

So one of the things we added was another layer of indirection by adding a namespace R with a bunch of inline'd wrapper functions (as well as several handful of unit tests to make sure we avoided typos and argument transposition and what not).

The short example below shows this for a simple function taking a vector, and returning its pnorm computed three different ways:

#include <Rcpp.h>

// [[Rcpp::export]]
Rcpp::DataFrame mypnorm(Rcpp::NumericVector x) {
    int n = x.size();
    Rcpp::NumericVector y1(n), y2(n), y3(n);

    for (int i=0; i<n; i++) {

        // the way we used to do this
        y1[i] = ::Rf_pnorm5(x[i], 0.0, 1.0, 1, 0);

        // the way we can do it now
        y2[i] = R::pnorm(x[i], 0.0, 1.0, 1, 0);

    }
    // or using Rcpp sugar in one go
    y3 = Rcpp::pnorm(x);

    return Rcpp::DataFrame::create(Rcpp::Named("Rold")  = y1,
                                   Rcpp::Named("Rnew")  = y2,
                                   Rcpp::Named("sugar") = y3);
}

This example also uses the new Rcpp attributes described briefly in the announcement blog post and of course in more detail in the corresponding vignette. Let us just state here that we simply provide a complete C++ function, using standard Rcpp types -- along with one 'attribute' declaration of an export via Rcpp. That's it -- even easier than using inline.

Now in R we simply do

R> sourceCpp("mypnorm.cpp")

to obtain a callable R function with the C++ code just shown behind it. No Makefile, no command-line tool invocation -- nothing but a single call to sourceCpp() which takes care of things --- and brings us a compiled C++ function to R just given the source file with its attribute declaration.

We can now use the new function to compute the probaility distribution both the old way, the new way with the 'cleaner' R::pnorm(), and of course the Rcpp sugar way in a single call. We build a data frame in C++, and assert that all three variants are the same:

R> x <- seq(0, 1, length=1e3)
R> res <- mypnorm(x)
R> head(res)
      Rold     Rnew    sugar
1 0.500000 0.500000 0.500000
2 0.500399 0.500399 0.500399
3 0.500799 0.500799 0.500799
4 0.501198 0.501198 0.501198
5 0.501597 0.501597 0.501597
6 0.501997 0.501997 0.501997
R> all.equal(res[,1], res[,2], res[,3])
[1] TRUE
R>

This example hopefully helped to illustrate how Rcpp 0.10.0 brings both something really powerful (Rcpp attributes -- more on this another time, hopefully) and convenient in the new namespace for statistical functions.

/code/snippets | permanent link

Thu, 25 Oct 2012

Accelerating R code: Computing Implied Volatilities Orders of Magnitude Faster

This blog, together with Romain's, is one of the main homes of stories about how Rcpp can help with getting code to run faster in the context of the R system for statistical programming and analysis. By making it easier to get already existing C or C++ code to R, or equally to extend R with new C++ code, Rcpp can help in getting stuff done. And it is often fairly straightforward to do so.

In this context, I have a nice new example. And for once, it is work-related. I generally cannot share too much of what we do there as this is, well, proprietary, but I have this nice new example. The other day, I was constructing (large) time series of implied volatilities. Implied volatilities can be thought of as the complement to an option's price: given a price (and all other observables which can be thought of as fixed), we compute an implied volatility price (typically via the standard Black-Scholes model). Given a changed implied volatility, we infer a new price -- see this Wikipedia page for more details. In essence, it opens the door to all sorts of arbitrage and relative value pricing adventures.

Now, we observe prices fairly frequently to create somewhat sizeable time series of option prices. And each price corresponds to one matching implied volatility, and for each such price we have to solve a small and straightforward optimization problem: to compute the implied volatility given the price. This is usually done with an iterative root finder.

The problem comes from the fact that we have to do this (i) over and over and over for large data sets, and (ii) that there are a number of callbacks from the (generic) solver to the (standard) option pricer.

So our first approach was to just call the corresponding function GBSVolatility from the fOption package from the trusted Rmetrics project by Diethelm Wuertz et al. This worked fine, but even with the usual tricks of splitting over multiple cores/machines, it simply took too long for the resolution and data amount we desired. One of the problems is that this function (which uses the proper uniroot optimizer in R) is not inefficient per se, but simply makes to many function call back to the option pricer as can be seen from a quick glance at the code. The helper function .fGBSVolatility gets called time and time again:

R> GBSVolatility
function (price, TypeFlag = c("c", "p"), S, X, Time, r, b, tol = .Machine$double.eps, 
    maxiter = 10000) 
{
    TypeFlag = TypeFlag[1]
    volatility = uniroot(.fGBSVolatility, interval = c(-10, 10), 
        price = price, TypeFlag = TypeFlag, S = S, X = X, Time = Time, 
        r = r, b = b, tol = tol, maxiter = maxiter)$root
    volatility
}
<environment: namespace:fOptions>
R> 
R> .fGBSVolatility
function (x, price, TypeFlag, S, X, Time, r, b, ...) 
{
    GBS = GBSOption(TypeFlag = TypeFlag, S = S, X = X, Time = Time, 
        r = r, b = b, sigma = x)@price
    price - GBS
}
<environment: namespace:fOptions>

So the next idea was to try the corresponding function from my RQuantLib package which brings (parts of) QuantLib to R. That was seen as been lots faster already. Now, QuantLib is pretty big and so is RQuantLib, and we felt it may not make sense to install it on a number of machines just for this simple problem. So one evening this week I noodled around for an hour or two and combined (i) a basic Black/Scholes calculation and (ii) a standard univariate zero finder (both of which can be found or described in numerous places) to minimize the difference between the observed price and the price given an implied volatility. With about one hundred lines in C++, I had something which felt fast enough. So today I hooked this into R via a two-line wrapper in quickly-created package using Rcpp.

I had one more advantage here. For our time series problem, the majority of the parameters (strike, time to maturity, rate, ...) are fixed, so we can structure the problem to be vectorised right from the start. I cannot share the code or more the details of my new implementation. However, both GBSVolatility and EuropeanOprionImpliedVolatility are on CRAN (and as I happen to maintain these for Debian, also just one sudo apt-get install r-cran-foptions r-cran-rquantlib away if you're on Debian or Ubuntu). And writing the other solver is really not that involved.

Anyway, here is the result, courtesy of a quick run via the rbenchmark package. We create a vector of length 500; the implied volatility computation will be performed at each point (and yes, our time series are much longer indeed). This is replicated 100 times (as is the default for rbenchmark) for each of the three approaches:

xyz@xxxxxxxx:~$ r xxxxR/packages/xxxxOptions/demo/timing.R
    test replications elapsed  relative user.self sys.self user.child sys.child
3 zzz(X)          100   0.038     1.000     0.040    0.000          0         0
2 RQL(X)          100   3.657    96.237     3.596    0.060          0         0
1 fOp(X)          100 448.060 11791.053   446.644    1.436          0         0
xyz@xxxxxxxx:~$

The new local solution is denoted by zzz(X). It is already orders of magnitude faster than the RQL(x) function using RQuantLib (which is, I presume, due to my custom solution internalising the loop). And the new approach is a laughable amount faster than the basic approach (shown as fOp) via fOptions. For one hundred replications of solving implied volatilities for all elements of a vector of size 500, the slow solution takes about 7.5 minutes --- while the fast solution takes 38 milliseconds. Which comes to a relative gain of over 11,000.

So sitting down with your C++ compiler to craft a quick one-hundred lines, combining two well-known and tested methods, can reap sizeable benefits. And Rcpp makes it trivial to call this from R.

/code/snippets | permanent link

Sun, 02 Sep 2012

Faster creation of binomial matrices

Scott Chamberlain blogged about faster creation of binomial matrices the other day, and even referred to our RcppArmadillo package as a possible solution (though claiming he didn't get it to work, tst tst -- that is what the rcpp-devel list is here to help with).

The post also fell short of a good aggregated timing comparison for which we love the rbenchmark package. So in order to rectify this, and to see what we can do here with Rcpp, a quick post revisiting the issue.

As preliminaries, we need to load three packages: inline to create compiled code on the fly (which, I should mention, is also used together with Rcpp by the Stan / RStan MCMC sampler which is creating some buzz this week), the compiler package included with R to create byte-compiled code and lastly the aforementioned rbenchmark package to do the timings. We also set row and column dimension, and set them a little higher than the original example to actually have something measurable:

library(inline)
library(compiler)
library(rbenchmark)

n <- 500
k <- 100

The first suggestion was the one by Scott himself. We will wrap this one, and all the following ones, in a function so that all approaches are comparable as being in a function of two dimension arguments:

scott <- function(N, K) {
    mm <- matrix(0, N, K)
    apply(mm, c(1, 2), function(x) sample(c(0, 1), 1))
}
scottComp <- cmpfun(scott)

We also immediatly compute a byte-compiled version (just because we now can) to see if this helps at all with the code. As there are no (explicit !) loops, we do not expect a big pickup. Scott's function works, but sweeps the sample() function across all rows and columns which is probably going to be (relatively) expensive.

Next is the first improvement suggested to Scott which came from Ted Hart.

ted <- function(N, K) {
    matrix(rbinom(N * K, 1, 0.5), ncol = K, nrow = N)
}

This is quite a bit smarter as it vectorises the approach, generating N times K elements at once which are then reshaped into a matrix.

Another suggestion came from David Smith as well as Rafael Maia. We rewrite it slightly to make it a function with two arguments for the desired dimensions:

david <- function(m, n) {
    matrix(sample(0:1, m * n, replace = TRUE), m, n)
}

This is very clever as it uses sample() over zero and one rather than making (expensive) draws from random number generator.

Next we have a version from Luis Apiolaza:

luis <- function(m, n) {
     round(matrix(runif(m * n), m, n))
}

It draws from a random uniform and rounds to zero and one, rather than deploying the binomial.

Then we have the version using RcppArmadillo hinted at by Scott, but with actual arguments and a correction for row/column dimensions. Thanks to inline we can write the C++ code as an R character string; inline takes care of everything and we end up with C++-based solution directly callable from R:

arma <- cxxfunction(signature(ns="integer", ks="integer"), plugin = "RcppArmadillo", body='
   int n = Rcpp::as<int>(ns);
   int k = Rcpp::as<int>(ks);
   return wrap(arma::randu(n, k));
')

This works, and is pretty fast. The only problem is that it answers the wrong question as it returns U(0,1) draws and not binomials. We need to truncate or round. So a corrected version is

armaFloor <- cxxfunction(signature(ns="integer", ks="integer"), plugin = "RcppArmadillo", body='
   int n = Rcpp::as<int>(ns);
   int k = Rcpp::as<int>(ks);
   return wrap(arma::floor(arma::randu(n, k) + 0.5));
')

which uses the the old rounding approximation of adding 1/2 before truncating.

With Armadillo in the picture, we do wonder how Rcpp sugar would do. Rcpp sugar, described in one of the eight vignettes of the Rcpp package, is using template meta-programming to provide R-like expressiveness (aka "syntactic sugar") at the C++ level. In particular, it gives access to R's RNG functions using the exact same RNGs as R making the results directly substitutable (whereas Armadillo uses its own RNG).

sugar <- cxxfunction(signature(ns="integer", ks="integer"), plugin = "Rcpp", body='
   int n = Rcpp::as<int>(ns);
   int k = Rcpp::as<int>(ks);
   Rcpp::RNGScope tmp;
   Rcpp::NumericVector draws = Rcpp::runif(n*k);
   return Rcpp::NumericMatrix(n, k, draws.begin());
')

Here Rcpp::RNGScope deals with setting/resetting the R RNG state. This draws a vector of N time K uniforms similar to Luis' function -- and just like Luis' R function does so without looping -- and then shapes a matrix of dimension N by K from it.

And it does of course have the same problem as the RcppArmadillo approach earlier and we can use the same solution:

sugarFloor <- cxxfunction(signature(ns="integer", ks="integer"), plugin = "Rcpp", body='
   int n = Rcpp::as<int>(ns);
   int k = Rcpp::as<int>(ks);
   Rcpp::RNGScope tmp;
   Rcpp::NumericVector draws = Rcpp::floor(Rcpp::runif(n*k)+0.5);
   return Rcpp::NumericMatrix(n, k, draws.begin());
')

Now that we have all the pieces in place, we can compare:

res <- benchmark(scott(n, k), scottComp(n,k),
                 ted(n, k), david(n, k), luis(n, k),
                 arma(n, k), sugar(n,k),
                 armaFloor(n, k), sugarFloor(n, k),
                 order="relative", replications=100)
print(res[,1:4])

With all the above code example in a small R script we call via littler, we get

edd@max:~/svn/rcpp/pkg$ r /tmp/scott.r 
Loading required package: methods
              test replications elapsed   relative
7      sugar(n, k)          100   0.072   1.000000
9 sugarFloor(n, k)          100   0.088   1.222222
6       arma(n, k)          100   0.126   1.750000
4      david(n, k)          100   0.136   1.888889
8  armaFloor(n, k)          100   0.138   1.916667
3        ted(n, k)          100   0.384   5.333333
5       luis(n, k)          100   0.410   5.694444
1      scott(n, k)          100  33.045 458.958333
2  scottComp(n, k)          100  33.767 468.986111

We can see several takeaways:

Rcpp sugar wins, which is something we have seen in previous posts on this blog. One hundred replication take only 72 milliseconds (or 88 in the corrected version) --- less than one millisecond per matrix creation.
RcppArmadillo does well too, and I presume that the small difference is due not to code in Armadillo but the fact that we need one more 'mapping' of data types on the way back to R
The sample() idea by David and Rafael is very, very fast too. This proves once again that well-written R code can be competitive. It also suggest how to make the C++ solution by foregoing (expensive) RNG draws in favour of sampling
The approaches by Ted and Luis are also pretty good. In practice, the are probably good enough.
Scott's function is not looking so hot (particularly as we increased the problem dimensions) and byte-compilation does not help at all.

Thanks to Scott and everybody for suggesting this interesting problem. Trying the rbinom() Rcpp sugar function, or implementing sample() at the C++ level is, as the saying goes, left as an exercise to the reader.

/code/snippets | permanent link

Thu, 16 Aug 2012

Follow-up to Counting CRAN Package Depends, Imports and LinkingTo

A few days ago, I blogged about visualizing CRAN dependency ranks which turned out to be a somewhat popular post. David Smith followed-up at the REvo blog suggesting to exclude packages already shipping with R (which is indicated by their 'Recommended' priority). Good idea!

So here is an updated version, where we limit the display to the top twenty packages counted by reverse 'Depends:', and excluding those already shipping with R such as MASS, lattice, survival, Matrix, or nlme.

CRAN package chart of Reverse Depends relations excluding Recommended packages

The mvtnorm package is still out by a wide margin, but we can note that (cough, cough) our Rcpp package for seamless R and C++ is now tied for second with the coda package for MCMC analysis. Also of note is the fact that CRAN keeps growing relentlessly and moved from 3969 packages to 3981 packages in the space of these few days...

Lastly, I have been asked about the code and/or data behind this. It is really pretty simply as the main data.frame can be had from CRAN (where I also found the initial few lines to load it). After that, one only needs a little bit of subsetting as shown below. I look forward to seeing other people riff on this data set.

#!/usr/bin/r
##
## Initial db downloand from http://developer.r-project.org/CRAN/Scripts/depends.R and adapted

require("tools")

## this function is essentially the same as R Core's from the URL
## http://developer.r-project.org/CRAN/Scripts/depends.R
getDB <- function() {
    contrib.url(getOption("repos")["CRAN"], "source") # trigger chooseCRANmirror() if required
    description <- sprintf("%s/web/packages/packages.rds", getOption("repos")["CRAN"])
    con <- if(substring(description, 1L, 7L) == "file://") {
        file(description, "rb")
    } else {
        url(description, "rb")
    }
    on.exit(close(con))
    db <- readRDS(gzcon(con))
    rownames(db) <- db[,"Package"]

    db
}

db <- getDB()

## count packages
getCounts <- function(db, col) {
    foo <- sapply(db[,col],
                  function(s) { if (is.na(s)) NA else length(strsplit(s, ",")[[1]]) } )
}

## build a data.frame with the number of entries for reverse depends, reverse imports,
## reverse linkingto and reverse suggests; also keep Recommended status
ddall <- data.frame(pkg=db[,1],
                    RDepends=getCounts(db, "Reverse depends"),
                    RImports=getCounts(db, "Reverse imports"),
                    RLinkingTo=getCounts(db, "Reverse linking to"),
                    RSuggests=getCounts(db, "Reverse suggests"),
                    Recommended=db[,"Priority"]=="recommended"
                    )

## Subset to non-Recommended packages as in David Smith's follow-up post
dd <- subset(ddall, is.na(ddall[,"Recommended"]) | ddall[,"Recommended"] != TRUE)

labeltxt <- paste("Analysis as of", format(Sys.Date(), "%d %b %Y"),
                  "covering", nrow(db), "total CRAN packages")

cutOff <- 20
doPNG <- TRUE

if (doPNG) png("/tmp/CRAN_ReverseDepends.png", width=600, heigh=600)
z <- dd[head(order(dd[,2], decreasing=TRUE), cutOff),c(1,2)]
dotchart(z[,2], labels=z[,1], cex=1, pch=19,
         main="CRAN Packages sorted by Reverse Depends:",
         sub=paste("Limited to top", cutOff, "packages, excluding 'Recommended' ones shipped with R"),
         xlab=labeltxt)
if (doPNG) dev.off()

if (doPNG) png("/tmp/CRAN_ReverseImports.png", width=600, heigh=600)
z <- dd[head(order(dd[,3], decreasing=TRUE), cutOff),c(1,3)]
dotchart(z[,2], labels=z[,1], cex=1, pch=19,
         main="CRAN Packages sorted by Reverse Imports:",
         sub=paste("Limited to top", cutOff, "packages, excluding 'Recommended' ones shipped with R"),
         xlab=labeltxt)
if (doPNG) dev.off()

# no cutOff but rather a na.omit
if (doPNG) png("/tmp/CRAN_ReverseLinkingTo.png", width=600, heigh=600)
z <- na.omit(dd[head(order(dd[,4], decreasing=TRUE), 30),c(1,4)])
dotchart(z[,2], labels=z[,1], pch=19,
         main="CRAN Packages sorted by Reverse LinkingTo:",
         xlab=labeltxt)
if (doPNG) dev.off()

/code/snippets | permanent link

Sun, 05 Aug 2012

Counting CRAN Package Depends, Imports and LinkingTo

The recent update by Søren Højsgaard's to his gRbase package for graphical models made it the 75th package to depend on our Rcpp package for R and C++ integration. So in a lighthearted weekend moment, I tweeted about gRbase being number 75 for Rcpp to which Hadley replied, asking if Rcpp was in fact number one among R package Depends. Far from it, and I immediately replied listing lattice and Matrix as packages with way more other packages depending upon them.

But as the question seemed deserving of a bit more analysis, I spent a few minutes on this and prepared three charts listing package in order of reverse Depends, reverse Imports and reverse LinkingTo.

First off, the reverse Depends:. This is the standard means of declaring a dependence of one package upon another.

CRAN package chart of Reverse Depends relations

Unsurprisingly, the MASS package from the classic Venables and Ripley book comes first, with Deepayan Sarkar's powerful lattice package (also covered in a book) coming second. These are both recommended packages which are commonly distributed with R itself. Next are mvtnorm and survival. Our Rcpp is up there in the top-ten, but not a frontrunner.

With the advent of namespaces a few R releases ago, it became possible to import functions from other packages. So the Imports: statement now provides an alternative to the (older) Depends:. The next chart displays the same relationship for Imports::

CRAN package chart of Reverse Imports relations

Now lattice still leads, but Hadleys's plyr package grabbed the second spot just before MASS and Matrix.

It is interesting to see that the sheer number of Imports: is still not where the Depends: are. On the other hand, we see a number of more recent packages popping up in the second chart. This may reflect more recent coding practices. It will be interesting to see how this stacks up over time when we revisit this chart.

Lastly, we can also look at LinkingTo:, a declaration used to provide a C/C++-level dependency at the source code level. We use this in the Rcpp family to provide automatic resolution of the header files needed to compile against our packages. And unsurprisingly, because packages using Rcpp actually use its API (rather than R functions), the package is a little ahead of others. In the package we find three more packages of the Rcpp family, but only a limited number of other packages as C/C++-level dependencies are still somewhat rare in the R universe. There are also fewer packages overall making use of this mechanism.

CRAN package chart of Reverse LinkingTo relations

One could of course take this one level further and sum up dependencies in a recursive manner, or visualize the relationship differently. But these dotchart graphs provide a first visual description of the magnitude of Depends, Imports and LinkingTo among CRAN packages for R.

/code/snippets | permanent link

Tue, 10 Jul 2012

Getting numpy data into R -- Take Two

A couple of days ago, I had posted a short Python script to convert numpy files into a simple binary format which R can read quickly. Nice, but still needing an extra file. Shortly thereafter, I found Carl Rogers cnpy library which makes reading and writing numpy files from C++ a breeze, and I quickly wrapped this up into a new package RcppCNPy which was released a few days ago.

This post will show a quick example, also summarized in the short pdf vignette describing the package, and provided as a demo within the package.

R> library(RcppCNPy)
Loading required package: Rcpp
R> library(rbenchmark)
R> 
R> n <- 1e5
R> k <- 50
R> 
R> M <- matrix(seq(1.0, n*k, by=1.0), n, k)
R> 
R> txtfile <- tempfile(fileext=".txt")
R> write.table(M, file=txtfile)
R> 
R> pyfile <- tempfile(fileext=".py")
R> npySave(pyfile, M)
R> 
R> pygzfile <- tempfile(fileext=".py")
R> npySave(pygzfile, M)
R> system(paste("gzip -9", pygzfile))
R> pygzfile <- paste(pygzfile, ".gz", sep="")
R> 
R>

We first load the new package (as well as the rbenchmark package used for the benchmarking example) into R. We then create a large matrix of 100,000 rows and 50 columns. Not quite big data by any stretch, but large enough for ascii reading to be painfully slow. We also write two npy files and compress the second one.

Next, we use the benchmark function to time the three approaches:

R> res <- benchmark(read.table(txtfile),
+                  npyLoad(pyfile),
+                  npyLoad(pygzfile),
+                  order="relative",
+                  columns=c("test", "replications", "elapsed", "relative"),
+                  replications=10)
R> print(res)
                 test replications elapsed relative
2     npyLoad(pyfile)           10   1.241  1.00000
3   npyLoad(pygzfile)           10   3.098  2.49637
1 read.table(txtfile)           10  96.744 77.95649
R>

As shown by this example, loading a numpy file directly beats the pants off reading the data from ascii: it is about 78 times faster. Reading a compressed file is somewhat slower as the data stream has to be passed through the uncompressor provide by the zlib library. So instead of reading a binary blob in one go (once the file header has been parsed) we have to operate piecemeal---which is bound to be slower. It does however save in storage space (and users can make this tradeoff between speed and size) and is still orders of magnitude faster than parsing the ascii file. Finally, and not shown here, we unlink the temporary files.

Summing up, this post demonstrated how the RcppCNPy package can be a useful to access data in numpy files (which may even be compressed). Data can also be written from R to be accessed later by numpy.

/code/snippets | permanent link

Sat, 30 Jun 2012

Getting numpy data into R

The other day, I found myself confronted with a large number of large files. Which were presented in (gzip-)compressed ascii format---which R reads directly via gzfile() connections---as well as (compressed) numpy files.

The numpy can be read very efficiently into Python. We can do the same in R via save() and load(), of course. But the trouble is that you need to read them first. And reading hundreds of megabytes from ascii is slow, no matter which language you use. Concerning R, I poked aound scan(), played with the colClasses argument and looked at the recent LaF package written just for this purpose. And all these solutions were still orders of magnitude slower than reading numpy. Which is no surprise as it is really hard to beat binary formats when you have to parse countless ascii tokens.

So the obvious next idea was to read the numpy file in Python, and to write a simple binary format. One helpful feature with this data set was that it contained only regular (rectangular) matrices of floats. So we could just store two integers for the dimensions, followed by the total data in either one large binary blob, or a sequence of column vectors.

But one minor trouble was that the Intertubes lead to no easy solution to unpack the numpy format. StackOverflow had plenty of question around this topic converned with, say, how to serialize in language-independent way. But no converters. And nobody local knew how to undo the "pickle" format underlying numpy.

But a remote friend did: Laurent, well-known for his Rpy2 package, pointed me towards using the struct module and steered me towards the solution shown below. So a shameless plug: if you need a very experienced Python or R consultant for sciece work, consider his consulting firm.

Finally, to round out this post, let's show the simple solution we crafted so that the next guy searching the Intertubes will have an easier. Let us start with a minimal Python program writing numpy data to disk:

#!/usr/bin/env python
#
# simple example for creating numpy data to demonstrate converter

import numpy as np

# simple float array
a = np.arange(15).reshape(3,5) * 1.1

outfile = "/tmp/data.npy"
np.save(outfile, a)

Next, the simple Python converter to create a binary file containing two integers for row and column dimension, followed by row times columns of floats:

#!/usr/bin/python
#
# read a numpy file, and write a simple binary file containing
#   two integers 'n' and 'k' for rows and columns
#   n times k floats with the actual matrix
# which can be read by any application or language that can read binary
 
import struct
import numpy as np

inputfile = "/tmp/data.npy"
outputfile = "/tmp/data.bin"

# load from the file
mat = np.load(inputfile)

# create a binary file
binfile = file(outputfile, 'wb')
# and write out two integers with the row and column dimension
header = struct.pack('2I', mat.shape[0], mat.shape[1])
binfile.write(header)
# then loop over columns and write each
for i in range(mat.shape[1]):
    data = struct.pack('%id' % mat.shape[0], *mat[:,i])
    binfile.write(data)
binfile.close()

Lastly, a quick littler script showing how R can read the data in a handful of lines:

#!/usr/bin/r

infile <- "/tmp/data.bin"
con <- file(infile, "rb")
dim <- readBin(con, "integer", 2)
Mat <- matrix( readBin(con, "numeric", prod(dim)), dim[1], dim[2])
close(con)

print(Mat)

That did the job---and I already used to converter to read a few weeks worth of data for further analysis in R. This obviously isn't the last word on possible solutions as the additional temporary file can be wasteful (unless it forms a cache for data read multiple times). If someone has nice solutions, please don't hold back and contact me. Thanks again to Laurent for the winning suggestion concerning struct, and help in getting the examples shown here to work.

/code/snippets | permanent link

Sat, 18 Feb 2012

Using Rcout with Rcpp / RcppArmadillo to coordinate output with R

The new RcppArmadillo release 0.2.35 now supports the Rcpp::Rcout output stream device. Based on a contributed Rcpp patch by Jelper Ypma, the Rcpp::Rcout output stream gets redirected to R's buffered output. In other words, R's own output and that eminating from C++ code using Rcpp::Rcout are now both in sync. This avoids a stern warning from Section 5.6 in the Writing R Extensions manual:

Using C++ iostreams, as in this example, is best avoided. There is no guarantee that the output will appear in the R console, and indeed it will not on the R for Windows console. Use R code or the C entry points (*note Printing) for all I/O if at all possible.

and does in fact provide exactly what is recommended: the same entry points R itself uses.

Below is a sample program, once again using the wonderful inline package to compile, load and link C++ code into R from a simple text variable submitted to the cxxfunction. What is shown in R code to load the package, the definition of the C++ code as assigned to a variable src and the creation of the dynamically-loadaded R function called fun which contains the code from we compiled, link and load via a single call to cxxfunction() given src.

library
library(inline)

src <- '

  Rcpp::Rcout << "Armadillo version: " << arma::arma_version::as_string() << std::endl;

  // directly specify the matrix size (elements are uninitialised)
  arma::mat A(2,3);

  // .n_rows = number of rows    (read only)
  // .n_cols = number of columns (read only)
  Rcpp::Rcout << "A.n_rows = " << A.n_rows << std::endl;
  Rcpp::Rcout << "A.n_cols = " << A.n_cols << std::endl;

  // directly access an element (indexing starts at 0)
  A(1,2) = 456.0;

  A.print("A:");

  // scalars are treated as a 1x1 matrix,
  // hence the code below will set A to have a size of 1x1
  A = 5.0;
  A.print("A:");

  // if you want a matrix with all elements set to a particular value
  // the .fill() member function can be used
  A.set_size(3,3);
  A.fill(5.0);
  A.print("A:");


  arma::mat B;

  // endr indicates "end of row"
  B << 0.555950 << 0.274690 << 0.540605 << 0.798938 << arma::endr
    << 0.108929 << 0.830123 << 0.891726 << 0.895283 << arma::endr
    << 0.948014 << 0.973234 << 0.216504 << 0.883152 << arma::endr
    << 0.023787 << 0.675382 << 0.231751 << 0.450332 << arma::endr;

  // print to the cout stream
  // with an optional string before the contents of the matrix
  B.print("B:");

  // the << operator can also be used to print the matrix
  // to an arbitrary stream (cout in this case)
  Rcpp::Rcout << "B:" << std::endl << B << std::endl;

  // save to disk
  B.save("B.txt", arma::raw_ascii);

  // load from disk
  arma::mat C;
  C.load("B.txt");

  C += 2.0 * B;
  C.print("C:");


  // submatrix types:
  //
  // .submat(first_row, first_column, last_row, last_column)
  // .row(row_number)
  // .col(column_number)
  // .cols(first_column, last_column)
  // .rows(first_row, last_row)

  Rcpp::Rcout << "C.submat(0,0,3,1) =" << std::endl;
  Rcpp::Rcout << C.submat(0,0,3,1) << std::endl;

  // generate the identity matrix
  arma::mat D = arma::eye<arma::mat>(4,4);

  D.submat(0,0,3,1) = C.cols(1,2);
  D.print("D:");

  // transpose
  Rcpp::Rcout << "trans(B) =" << std::endl;
  Rcpp::Rcout << trans(B) << std::endl;

  // maximum from each column (traverse along rows)
  Rcpp::Rcout << "max(B) =" << std::endl;
  Rcpp::Rcout << max(B) << std::endl;

  // maximum from each row (traverse along columns)
  Rcpp::Rcout << "max(B,1) =" << std::endl;
  Rcpp::Rcout << max(B,1) << std::endl;

  // maximum value in B
  Rcpp::Rcout << "max(max(B)) = " << max(max(B)) << std::endl;

  // sum of each column (traverse along rows)
  Rcpp::Rcout << "sum(B) =" << std::endl;
  Rcpp::Rcout << sum(B) << std::endl;

  // sum of each row (traverse along columns)
  Rcpp::Rcout << "sum(B,1) =" << std::endl;
  Rcpp::Rcout << sum(B,1) << std::endl;

  // sum of all elements
  Rcpp::Rcout << "sum(sum(B)) = " << sum(sum(B)) << std::endl;
  Rcpp::Rcout << "accu(B)     = " << accu(B) << std::endl;

  // trace = sum along diagonal
  Rcpp::Rcout << "trace(B)    = " << trace(B) << std::endl;

  Rcpp::Rcout << std::endl;
'

fun <- cxxfunction(signature(), body=src, plugin="RcppArmadillo")

setwd("/tmp")                           # adjust on other OSs

fun()                                   # output to stdout

sink("rcpparma.log.txt")                # start 'sink' to output to file
fun()                                   # no output to screen
sink()                                  # stop 'sink'

We then switch to a temporary directory (as the example code, taken from one of the two examples in Conrad's Armadillo sources, creates a temporary file) and run the new function. To demontrate how it does in fact now mesh perfectly with R, we create an output 'sink' (which catches all output) and re-run.

This simple example demonstrated how we can use the new Rcout output stream from Rcpp to have dynamically-loaded C++ code cooperate more cleanly with the (buffered) R output. It also demontrated some of the nice features in Armadillo which we bring to R via RcppArmadillo.

/code/snippets | permanent link

Wed, 30 Nov 2011

Wicked Webapps with R, err, Wt

A few months ago, I had blogged about using R inside of Qt. This used our RInside package for embedding the statistical programming environment and language R inside of a C++ application, and further relies on our Rcpp package for R and C++ integration and object mapping.

The example was simple yet powerful: a reimplementation of the standard GUI application of a standard density estimate. Here the user can pick a kernel density function from a selection, and also slide a bandwidth parameter. One nice addition was an entry field already populated with a simple expression for a mixture of Normals, allowing for arbitrary random distributions over which to estimate. The example is pretty (thanks to Qt), and was added to RInside with the last CRAN release 0.2.4. The blog post has a nice screenshot.

I had long wondered how to do something similar 'on the web'. Web integration and application frameworks are of course a dime a dozen: Just about any language offers this, with more or less ease. But I wanted something simple yet powerful and fast. And I did not like the idea of a multi-tier app, or of a multi-language mix. I remember having seen something about a web-application framework not unlike Qt, and studying the very useful Wikipedia web application framework comparison I re-discovered Wt (pronounced "Witty"). So there it is, using C++ which brings us ample performance, the ability to connect to a number of libraries and applications (which is important in my quantitatively-minded workd) and avoid the whole multi-tier, multi-language combination. The Wt website has a few more good reason why this may be a suitable idea; the toolkit also offers a very decent amount of features and is amply documented with a fair number of examples.

And after just a little bit poking around during two weekends, I now have the following webapp committed in the SVN repository of RInside, and it is all implemented in in less than two hundred (generously commented) lines of code.

Example of embedding R via RInside into a Wt C++ application: density estimation for a mixture

It is currently up and running at the address shown in the screenshot, so give it a go (though I may take it down at another point in time). I quite like it: The application is responsive: changes in the radio buttons (for the density), or the bandwidth, each trigger reestimation of the density, and a new and updated chart is displayed immediately with no noticeable delay---just like the desktop application did.

Best of all, the code logic is essentially unchanged from the Qt-based app. Signals and slots related events to actions, the layout is in terms of standard GUI boxen and containers. And best of all, I did not have to write a line of html, javascript, css or ajax: it is all handled by the Wt toolkit. I was able to drive the app from an Android phone, my tablet, various computers around the house, and had a few friends poke a stick at it from afar.

There is at one open issue. Wt launches new instances of the application object with each connection, which is a very clean model. That doesn't map perfectly with R (which is single-threaded) and RInside (which runs as a singleton for the same reason). So right now, each action sends its state back to the client. In other words, each clients own its parameters and well as vector of random numbers. Each new action sends these back to the app which passes it to R, launches the re-estimation and gets an updated chart back which is then shown by the the client. That is not perfect, and maybe a forking model as used by Simon's RServe would be better, though it would require a rewrite of RInside. Not sure if we get there anytime soon. And for simple applications not facing legions of concurrent users, the singleton should still work. It's a proof of concept at this point. Feedback welcome, and RInside and Rcpp questions should go to the rcpp-devel list as usual.

/code/snippets | permanent link

Sat, 23 Apr 2011

Another nice Rcpp example

While preparing my slides for the Rcpp workshop this Thursday, I had wondered about more nice examples motivating Rcpp. So I posed a quick question on the rcpp-devel list.

And I received a few friendly answers. My favourite, so far, was a suggestion by Lance Bachmeier who sent me a short script which used both R and C++ (via Rcpp) to simulate a first-order vector autoregressive process (and he ensured me that it worked well enough on his graduate students). It is indeed a great example as it involves (simple) matrix multiplication in an iterative fashion. Which makes it a great example not only for Rcpp but also for our RcppArmadillo package (which wraps Conrad Sanderson's wonderful Armadillo C++ templated library for linear algebra and more). And at the same time, we can also add another look at the new and shiny R compiler I also blogged about recently.

So Lance and I iterated over this a little more over email, and I now added this as a new (and initial) example file in the RcppArmadillo SVN repo. (As an aside: The newest version 0.2.19 of RcppArmadillo has been sitting in incoming at CRAN since earlier in the week while the archive maintainer takes a well-deserved vacation. It should hit the public archive within a few days, and is otherwise available too from my site.)

So let's walk through the example:

R> ## parameter and error terms used throughout
R> a <- matrix(c(0.5,0.1,0.1,0.5),nrow=2)
R> e <- matrix(rnorm(10000),ncol=2)
R> ## Let's start with the R version
R> rSim <- function(coeff, errors) {
+   simdata <- matrix(0, nrow(errors), ncol(errors))
+   for (row in 2:nrow(errors)) {
+     simdata[row,] = coeff %*% simdata[(row-1),] + errors[row,]
+   }
+   return(simdata)
+ }
R> rData <- rSim(a, e)                     # generated by R

This starts with a simple enough loop. After skipping the first row, each iteration multiplies the previous row with the parameters and adds error terms.

We can then turn to the R compiler:

R> ## Now let's load the R compiler (requires R 2.13 or later)
R> suppressMessages(require(compiler))
R> compRsim <- cmpfun(rSim)
R> compRData <- compRsim(a,e)              # generated by R 'compiled'
R> stopifnot(all.equal(rData, compRData))  # checking results

Nice and easy: We load the compiler package, create a compiled function and use it. We check the results and surely enough find them to be identical.

With that, time to turn to C++ using Armadillo via RcppArmadillo:

R> ## Now load 'inline' to compile C++ code on the fly
R> suppressMessages(require(inline))
R> code <- '
+   arma::mat coeff = Rcpp::as<arma::mat>(a);
+   arma::mat errors = Rcpp::as<arma::mat>(e);
+   int m = errors.n_rows; int n = errors.n_cols;
+   arma::mat simdata(m,n);
+   simdata.row(0) = arma::zeros<arma::mat>(1,n);
+   for (int row=1; row<m; row++) {
+     simdata.row(row) = simdata.row(row-1)*trans(coeff)+errors.row(row);
+   }
+   return Rcpp::wrap(simdata);
+ '
R> ## create the compiled function
R> rcppSim <- cxxfunction(signature(a="numeric",e="numeric"),
+                        code,plugin="RcppArmadillo")
R> rcppData <- rcppSim(a,e)                # generated by C++ code
R> stopifnot(all.equal(rData, rcppData))   # checking results

Here we load the inline package to compile, link and load C++ snippets. We define a short C++ function in the code variable, declare a signature taking a and e as before and ask cxxfunction() to deploy the plugin for RcppArmadillo so that it and Rcpp are found during build. With that, we have a compiled function to generate data, and we once again check the result. The C++ code is pretty straightforward as well. We can instatiate Armadillo matrices directly from the R objects we pass down; we then run a similar loop building the result row by row.

Now, with all the build-up, here is the final timing comparison, using the rbenchmark package:

R> ## now load the rbenchmark package and compare all three
R> suppressMessages(library(rbenchmark))
R> res <- benchmark(rcppSim(a,e),
+                  rSim(a,e),
+                  compRsim(a,e),
+                  columns=c("test", "replications", "elapsed",
+                            "relative", "user.self", "sys.self"),
+                  order="relative")
R> print(res)
            test replications elapsed relative user.self sys.self
1  rcppSim(a, e)          100   0.038   1.0000      0.04        0
3 compRsim(a, e)          100   2.011  52.9211      2.01        0
2     rSim(a, e)          100   4.148 109.1579      4.14        0

So in a real-world example involving looping and some algebra (which is of course already done by BLAS and LAPACK libraries), the new R compiler improves by more than a factor of two, cutting time from 4.14 seconds down to about 2 seconds. Yet, this still leaves the C++ solution, clocking in at a mere 38 milliseconds, ahead by a factor of over fifty relative to the new R compiler

And compared to just R itself, the simple solution involving Rcpp and RcppArmadillo is almost 110 times faster. As I mentioned, I quite like this example ;-).

/code/snippets | permanent link

Fri, 25 Mar 2011

R inside Qt: A simple RInside application

The RInside package makes it pretty simple and straightforward to embed R, the wonderful statistical programming environment and language, inside of a C++ application. This uses both the robust embedding API provided by R itself, and the higher-level abstractions from our Rcpp package. A number of examples are shown on this blog both here and here as well as on the RInside page; and the source package actually contains well over a dozen complete examples which cover anything from simple examples to parallel use via MPI for parallel computing.

Beginning users sometimes ask about how to use RInside inside larger projects. And as I had meant to experiment with embedding inside of the powerful Qt framework anyway, I started to dabble a little. A first result is now in the SVN sources of RInside.

My starting point was the classic tkdensity demo that comes with R itself. It is a good point of departure as Tcl/Tk makes it very portable---in fact it should run on every platform that runs R---and quite expressive. And having followed some of the GUI experiments around R over the years, I have also seen various re-implementations using different GUI frameworks. And so I am adding mine to this body of work:

Example of embedding R via RInside into a Qt C++ application: density estimation for a mixture

The problem I addressed first was actual buildability. For the RInside examples, Romain and I provide a Makefile that just works by making calls to R itself to learn about flags for R, Rcpp and RInside such that all required headers and libraries are found. That is actually relatively straightforward (and documented in our vignettes) but a little intimidating at first---which is why a ready-made Makefile is a good thing.

Qt of course uses qmake and the .pro files to encode / resolve dependencies. So task one was to map what our Makefile does into its variables. Turns out that wasn't all that hard:

## -*- mode: Makefile; c-indent-level: 4; c-basic-offset: 4;  tab-width: 8; -*-
##
## Qt usage example for RInside, inspired by the standard 'density
## sliders' example for other GUI toolkits
##
## Copyright (C) 2011  Dirk Eddelbuettel and Romain Francois

TEMPLATE =              app
HEADERS =               qtdensity.h 
SOURCES =               qtdensity.cpp main.cpp

QT +=                   svg

## comment this out if you need a different version of R, 
## and set set R_HOME accordingly as an environment variable
R_HOME =                $$system(R RHOME)

## include headers and libraries for R 
RCPPFLAGS =             $$system($$R_HOME/bin/R CMD config --cppflags)
RLDFLAGS =              $$system($$R_HOME/bin/R CMD config --ldflags)
RBLAS =                 $$system($$R_HOME/bin/R CMD config BLAS_LIBS)
RLAPACK =               $$system($$R_HOME/bin/R CMD config LAPACK_LIBS)

## if you need to set an rpath to R itself, also uncomment
#RRPATH =               -Wl,-rpath,$$R_HOME/lib

## include headers and libraries for Rcpp interface classes
RCPPINCL =              $$system($$R_HOME/bin/Rscript -e \'Rcpp:::CxxFlags\(\)\')
RCPPLIBS =              $$system($$R_HOME/bin/Rscript -e \'Rcpp:::LdFlags\(\)\')

## for some reason when building with Qt we get this each time
## so we turn unused parameter warnings off
RCPPWARNING =           -Wno-unused-parameter 
## include headers and libraries for RInside embedding classes
RINSIDEINCL =           $$system($$R_HOME/bin/Rscript -e \'RInside:::CxxFlags\(\)\')
RINSIDELIBS =           $$system($$R_HOME/bin/Rscript -e \'RInside:::LdFlags\(\)\')

## compiler etc settings used in default make rules
QMAKE_CXXFLAGS +=       $$RCPPWARNING $$RCPPFLAGS $$RCPPINCL $$RINSIDEINCL
QMAKE_LFLAGS +=         $$RLDFLAGS $$RBLAS $$RLAPACK $$RCPPLIBS $$RINSIDELIBS

## addition clean targets
QMAKE_CLEAN +=          qtdensity Makefile

The double dollar signs and escaping of parentheses are a little tedious, but hey it works and expands the compiler and linker flags such that everything .

The code itself is pretty straightforward too. We instantiate the RInside object as well as the main Qt application object. We then instantiate a new object of class QtDensity that will launch the main widget; it is given a reference to the RInside object.

// -*- mode: C++; c-indent-level: 4; c-basic-offset: 4;  tab-width: 8; -*-
//
// Qt usage example for RInside, inspired by the standard 'density
// sliders' example for other GUI toolkits
//
// Copyright (C) 2011  Dirk Eddelbuettel and Romain Francois


#include <QApplication>

#include "qtdensity.h"

int main(int argc, char *argv[])
{
    RInside R(argc, argv);  		// create an embedded R instance

    QApplication app(argc, argv);
    QtDensity qtdensity(R);
    return app.exec();
}

The definition of the main object is pretty simple: a few private variables, and a few functions to interact with the GUI and get values from the radio buttons, slider or input field---as well as functions to update the chart or re-draw the random variables.

// -*- mode: C++; c-indent-level: 4; c-basic-offset: 4;  tab-width: 8; -*-
//
// Qt usage example for RInside, inspired by the standard 'density
// sliders' example for other GUI toolkits
//
// Copyright (C) 2011  Dirk Eddelbuettel and Romain Francois

#ifndef QTDENSITY_H
#define QTDENSITY_H

#include <RInside.h>

#include <QMainWindow>
#include <QHBoxLayout>
#include <QSlider>
#include <QSpinBox>
#include <QLabel>
#include <QTemporaryFile>
#include <QSvgWidget>

class QtDensity : public QMainWindow
{
    Q_OBJECT

public:
    QtDensity(RInside & R);

private slots:
    void getBandwidth(int bw);
    void getKernel(int kernel);
    void getRandomDataCmd(QString txt);
    void runRandomDataCmd(void);

private:
    void setupDisplay(void);    // standard GUI boilderplate of arranging things
    void plot(void);            // run a density plot in R and update the
    void filterFile(void);      // modify the richer SVG produced by R

    QSvgWidget *m_svg;          // the SVG device

    RInside & m_R;              // reference to the R instance passed to constructor
    QString m_tempfile;         // name of file used by R for plots
    QString m_svgfile;          // another temp file, this time from Qt
    int m_bw, m_kernel;         // parameters used to estimate the density
    QString m_cmd;              // random draw command string
};

#endif

Lastly, no big magic in the code either (apart from the standard magic provided by RInside). A bit of standard GUI layouting, and then some functions to pick values from the inputs as well as to compute / update the output. One issue is worth mentioning. The screenshot and code show the second version of this little application. I built a first one using a standard portable network graphics (png) file. That was fine, but not crisp as png is a pixel format so I went back and experimented with scalable vector graphics (svg) instead. One can create svg output with R in a number of ways, one of which is the cairoDevice package by Michael Lawrence (who also wrote RGtk2 and good chunks of Ggobi). Now, it turns out that Qt displays the so-called SVG tiny standard whereas R creates a fuller SVG format. Some discussion with Michael reveals that one can modify the svg file suitably (which is what the function filterFile below does) and it all works. Well: almost. There is a bug (and Michael thinks it is the SVG rendering) in which the density estimate does not get clipped to the plotting region.

// -*- mode: C++; c-indent-level: 4; c-basic-offset: 4;  tab-width: 8; -*-
//
// Qt usage example for RInside, inspired by the standard 'density
// sliders' example for other GUI toolkits -- this time with SVG
//
// Copyright (C) 2011  Dirk Eddelbuettel and Romain Francois

#include <QtGui>

#include "qtdensity.h"

QtDensity::QtDensity(RInside & R) : m_R(R)
{
    m_bw = 100;                 // initial bandwidth, will be scaled by 100 so 1.0
    m_kernel = 0;               // initial kernel: gaussian
    m_cmd = "c(rnorm(100,0,1), rnorm(50,5,1))"; // simple mixture
    m_R["bw"] = m_bw;           // pass bandwidth to R, and have R compute a temp.file name
    m_tempfile = QString::fromStdString(Rcpp::as<std::string>(m_R.parseEval("tfile <- tempfile()")));
    m_svgfile = QString::fromStdString(Rcpp::as<std::string>(m_R.parseEval("sfile <- tempfile()")));
    m_R.parseEvalQ("library(cairoDevice)");

    setupDisplay();
}

void QtDensity::setupDisplay(void)  {
    QWidget *window = new QWidget;
    window->setWindowTitle("Qt and RInside demo: density estimation");

    QSpinBox *spinBox = new QSpinBox;
    QSlider *slider = new QSlider(Qt::Horizontal);
    spinBox->setRange(5, 200);
    slider->setRange(5, 200);
    QObject::connect(spinBox, SIGNAL(valueChanged(int)), slider, SLOT(setValue(int)));
    QObject::connect(slider, SIGNAL(valueChanged(int)), spinBox, SLOT(setValue(int)));
    spinBox->setValue(m_bw);
    QObject::connect(spinBox, SIGNAL(valueChanged(int)), this, SLOT(getBandwidth(int)));

    QLabel *cmdLabel = new QLabel("R command for random data creation");
    QLineEdit *cmdEntry = new QLineEdit(m_cmd);
    QObject::connect(cmdEntry,  SIGNAL(textEdited(QString)), this, SLOT(getRandomDataCmd(QString)));
    QObject::connect(cmdEntry,  SIGNAL(editingFinished()), this, SLOT(runRandomDataCmd()));

    QGroupBox *kernelRadioBox = new QGroupBox("Density Estimation kernel");
    QRadioButton *radio1 = new QRadioButton("&Gaussian");
    QRadioButton *radio2 = new QRadioButton("&Epanechnikov");
    QRadioButton *radio3 = new QRadioButton("&Rectangular");
    QRadioButton *radio4 = new QRadioButton("&Triangular");
    QRadioButton *radio5 = new QRadioButton("&Cosine");
    radio1->setChecked(true);
    QVBoxLayout *vbox = new QVBoxLayout;
    vbox->addWidget(radio1);
    vbox->addWidget(radio2);
    vbox->addWidget(radio3);
    vbox->addWidget(radio4);
    vbox->addWidget(radio5);
    kernelRadioBox->setMinimumSize(260,140);
    kernelRadioBox->setMaximumSize(260,140);
    kernelRadioBox->setSizePolicy(QSizePolicy::Fixed, QSizePolicy::Fixed);
    kernelRadioBox->setLayout(vbox);

    QButtonGroup *kernelGroup = new QButtonGroup;
    kernelGroup->addButton(radio1, 0);
    kernelGroup->addButton(radio2, 1);
    kernelGroup->addButton(radio3, 2);
    kernelGroup->addButton(radio4, 3);
    kernelGroup->addButton(radio5, 4);
    QObject::connect(kernelGroup, SIGNAL(buttonClicked(int)), this, SLOT(getKernel(int)));

    m_svg = new QSvgWidget();
    runRandomDataCmd();         // also calls plot()

    QGroupBox *estimationBox = new QGroupBox("Density estimation bandwidth (scaled by 100)");
    QHBoxLayout *spinners = new QHBoxLayout;
    spinners->addWidget(spinBox);
    spinners->addWidget(slider);
    QVBoxLayout *topright = new QVBoxLayout;
    topright->addLayout(spinners);
    topright->addWidget(cmdLabel);
    topright->addWidget(cmdEntry);
    estimationBox->setMinimumSize(360,140);
    estimationBox->setMaximumSize(360,140);
    estimationBox->setSizePolicy(QSizePolicy::Fixed, QSizePolicy::Fixed);
    estimationBox->setLayout(topright);
    QHBoxLayout *upperlayout = new QHBoxLayout;
    upperlayout->addWidget(kernelRadioBox);
    upperlayout->addWidget(estimationBox);

    QHBoxLayout *svglayout = new QHBoxLayout;
    svglayout->addWidget(m_svg);

    QVBoxLayout *outer = new QVBoxLayout;
    outer->addLayout(upperlayout);
    outer->addLayout(svglayout);
    window->setLayout(outer);
    window->show();
}

void QtDensity::plot(void) {
    const char *kernelstrings[] = { "gaussian", "epanechnikov", "rectangular", "triangular", "cosine" };
    m_R["bw"] = m_bw;
    m_R["kernel"] = kernelstrings[m_kernel]; // that passes the string to R
    std::string cmd1 = "Cairo(width=6,height=6,pointsize=10,surface='svg',filename=tfile); "
                       "plot(density(y, bw=bw/100, kernel=kernel), xlim=range(y)+c(-2,2), main=\"Kernel: ";
    std::string cmd2 = "\"); points(y, rep(0, length(y)), pch=16, col=rgb(0,0,0,1/4));  dev.off()";
    std::string cmd = cmd1 + kernelstrings[m_kernel] + cmd2; // stick the selected kernel in the middle
    m_R.parseEvalQ(cmd);
    filterFile();               // we need to simplify the svg file for display by Qt 
    m_svg->load(m_svgfile);
}

void QtDensity::getBandwidth(int bw) {
    if (bw != m_bw) {
        m_bw = bw;
        plot();
    }
}

void QtDensity::getKernel(int kernel) {
    if (kernel != m_kernel) {
        m_kernel = kernel;
        plot();
    }
}

void QtDensity::getRandomDataCmd(QString txt) {
    m_cmd = txt;
}

void QtDensity::runRandomDataCmd(void) {
    std::string cmd = "y <- " + m_cmd.toStdString();
    m_R.parseEvalQ(cmd);
    plot();                     // after each random draw, update plot with estimate
}

void QtDensity::filterFile() {
    // cairoDevice creates richer SVG than Qt can display
    // but per Michaele Lawrence, a simple trick is to s/symbol/g/ which we do here

    QFile infile(m_tempfile);
    infile.open(QFile::ReadOnly);
    QFile outfile(m_svgfile);
    outfile.open(QFile::WriteOnly | QFile::Truncate);
    
    QTextStream in(&infile);
    QTextStream out(&outfile);
    QRegExp rx1("<symbol"); 
    QRegExp rx2("</symbol");    
    while (!in.atEnd()) {
        QString line = in.readLine();
        line.replace(rx1, "<g"); // so '<symbol' becomes '<g ...'
        line.replace(rx2, "</g");// and '</symbol becomes '</g'
        out << line << "\n";
    }
    infile.close();
    outfile.close();
}

What the little application does is actually somewhat neat for the few lines. One key features is that the generated data can be specified directly by an R expression which allows for mixtures (as shown, and as is the default). With that it easy to see how many points are needed in the second hump to make the estimate multi-modal, and how much of a distance between both centers is needed and so on. Obviously, the effect of the chosen kernel and bandwidth can also be visualized. And with the chart the being a support vector graphics display, we can resize and scale at will and it still looks crisp.

The code (for both the simpler png variant and the svg version shown here) is in the SVN repository for RInside and will be in the next release. Special thanks to Michael Lawrence for patiently working through some svg woes with me over a few emails.

Update: Some typos fixed.

Update 2: Two URLs corrected.

/code/snippets | permanent link

Sun, 23 Jan 2011

CRANberries is now tweeting

The CRANberries service (which reports on new and updated CRAN packages for the R language and environment) is now tweeting about new packages. Simply follow @CRANberriesFeed to receive theses messages.

For the technically minded, adding this to the existing 200-line program which runs all of CRANberries was very easy. CRANberries relies only on R itself and a few Unix tools like diffstat as well as the simple blosxom txt-to-html/rss 'blog compiler'. The tweeting itself is now done by this new function

tweetNewBlogEntry <- function(curPkg, curVer, reposurl) {
    ## tests reveal that pipe(), cat(), close() is easiest
    ## to send multiple messages, may need --background option
    con <- pipe("bti --config bti.conf", "w")
    cat("New CRAN package", curPkg, "with initial version", curVer, " http://goo.gl/pgljT\n", file=con)
    close(con)
}

which simply pipes the message in Greg KH's bti program (which is now a new dependency). Special thanks to its Debian maintainer Gregor Herrmann for some helpful emails; I am using the newest release 0.29 which itself needs liboauth0. Once OAuth tokens are set-up (and see here for how to do that) all we need is the three-liner above.

At this point I am not too sure what to do about updated packages. One message per updated package seems too noisy. To be seen---comments or suggestions welcome.

/code/snippets | permanent link

Mon, 17 Jan 2011

Keeping simple things simple

My friend Jeff deserves a sincere congratulation for finally unveiling his rebranded R consultancy Lemnica. One notable feature of the new website is a section called esoteric R which discusses less frequently-visited corners of the R world. It even boasts its own CRAN package esotericR with the example sources. esoteric R currently holds two articles. Jeff had sent me the one about introducing closures a while back, and I like it and may comment at another time. What caught me by surprise when Lemnica finally opened was the other article: R calling C.

It is a fine article motivated by all the usual reasons that are e.g. mentioned in the Google Tech Talk which Romain and I gave last October about our work around Rcpp. But it is just not simple.

Allow me to explain. When Jeff showed this C language file

#include <R.h>
#include <Rinternals.h>

SEXP esoteric_rev (SEXP x) {
  SEXP res;
  int i, r, P=0;
  PROTECT(res = allocVector(REALSXP, length(x))); P++;

  for(i=length(x), r=0; i>0; i--, r++) {
     REAL(res)[r] = REAL(x)[i-1];
  }

  copyMostAttrib(x, res);
  UNPROTECT(P);
  return res;
}

and then needs several paragraphs to explain what is going on, what is needed to compile and then how to load it --- I simply could not resist. Almost immediately, I emailed back to him something as simple as this using both our Rcpp package as well as the wonderful inline package by Oleg which Romain and I more or less adopted:

library(inline)  ## for cxxfunction()
src <- 'Rcpp::NumericVector x = Rcpp::NumericVector(xs);
        std::reverse(x.begin(), x.end());
        return(x);'
fun <- cxxfunction(signature(xs="numeric"), body=src, plugin="Rcpp")
fun( seq(0, 1, 0.1) )

Here we load inline, and then define a three-line C++ program using facilities from our Rcpp package. All we need to revert a vector is to first access its R object in C++ by instantiating the R vector as a NumericVector. These C++ classes then provide iterators which are compatible with the Standard Template Library (STL). So we simply call the STL function reverse pointing the beginning and end of the vector, and are done! Rcpp then allows us the return the C++ vector which it turns into an R vector. Efficient in-place reversal, just like Jeff had motivated, in three lines. Best of all, we can execute this from within R itself:

R> library(inline)  ## for cxxfunction()
R> src <- 'Rcpp::NumericVector x = Rcpp::NumericVector(xs);
+         std::reverse(x.begin(), x.end());
+         return(x);'
R> fun <- cxxfunction(signature(xs="numeric"), body=src, plugin="Rcpp")
R> fun( seq(0, 1, 0.1) )
 [1] 1.0 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0.0
R>

Lastly, Jeff shows a more complete example wherein a new vector is created, and any potential attributes are copied as well. Naturally, we can do that too. First, we used clone() to make a deep copy (ie forcing creation of a new object rather than a mere proxy) and use the same R API function he accessed---but it our case both prefixed with ::Rf_ for R remapping (to protect clashed with other functions with identical names) and a global namespace identifier (as it is a global C function from R).

R> library(inline)
R> src <- 'Rcpp::NumericVector x = Rcpp::clone<Rcpp::NumericVector>(xs);
+         std::reverse(x.begin(), x.end());
+         ::Rf_copyMostAttrib(xs, x);
+         return(x);'
R> fun <- cxxfunction(signature(xs="numeric"), body=src, plugin="Rcpp")
R> obj <- structure(seq(0, 1, 0.1), obligatory="hello, world!")
R> fun(obj)
 [1] 1.0 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0.0
attr(,"obligatory")
[1] "hello, world!"
R> obj
 [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
attr(,"obligatory")
[1] "hello, world!"
R>

Both the obj variable and the new copy contain the desired data attribute, the new copy is reversed, the original is untouched---and all in four lines of C++ called via one inline call. I have now been going on for over one hundred lines yet I never had to mention memory management, pointers, PROTECT or other components of the R API for C. Hopefully, this short writeup provided an idea of why Romain and I think Rcpp is the way to go for creating C/C++ functions for extending and enhancing R.

/code/snippets | permanent link

Sun, 16 Jan 2011

Plotting overbought / oversold regions in R

The good folks at Bespoke Investment Group frequently show charts of so-called overbought or oversold levels; see e.g. here for the most recent global markets snapshot.

Classifying markets as overbought or oversold is a popular heuristic. It starts from computing a rolling smoothed estimate of the prices, usually via a (exponential or standard) moving average over a suitable number of days (where Bespoke uses 50 days, see here). This is typically coupled with a (simple) rolling standard deviation. Overbought and oversold regions are then constructed by taking the smoothed mean plus/minus one and two standard deviations.

Doing this is in R is pretty easy thanks to the combination of R's rich base functions and its add-on packages from CRAN. Below is a simply function I wrote a couple of months ago---and I figured I might as well release. It relies on the powerful packages quantmod and TTR by my pals Jeff Ryan and Josh Ulrich, respectively.

## plotOBOS -- displaying overbough/oversold as eg in Bespoke's plots
##
## Copyright (C) 2010 - 2011  Dirk Eddelbuettel
##
## This is free software: you can redistribute it and/or modify it
## under the terms of the GNU General Public License as published by
## the Free Software Foundation, either version 2 of the License, or
## (at your option) any later version.

suppressMessages(library(quantmod))     # for getSymbols(), brings in xts too
suppressMessages(library(TTR))          # for various moving averages

plotOBOS <- function(symbol, n=50, type=c("sma", "ema", "zlema"), years=1, blue=TRUE) {

    today <- Sys.Date()
    X <- getSymbols(symbol, src="yahoo", from=format(today-365*years-2*n), auto.assign=FALSE)
    x <- X[,6]                          # use Adjusted

    type <- match.arg(type)
    xd <- switch(type,                  # compute xd as the central location via selected MA smoother
                 sma = SMA(x,n),
                 ema = EMA(x,n),
                 zlema = ZLEMA(x,n))
    xv <- runSD(x, n)                   # compute xv as the rolling volatility

    strt <- paste(format(today-365*years), "::", sep="")
    x  <- x[strt]                       # subset plotting range using xts' nice functionality
    xd <- xd[strt]
    xv <- xv[strt]

    xyd <- xy.coords(.index(xd),xd[,1]) # xy coordinates for direct plot commands
    xyv <- xy.coords(.index(xv),xv[,1])

    n <- length(xyd$x)
    xx <- xyd$x[c(1,1:n,n:1)]           # for polygon(): from first point to last and back

    if (blue) {
        blues5 <- c("#EFF3FF", "#BDD7E7", "#6BAED6", "#3182BD", "#08519C") # cf brewer.pal(5, "Blues")
        fairlylight <- rgb(189/255, 215/255, 231/255, alpha=0.625) # aka blues5[2]
        verylight <- rgb(239/255, 243/255, 255/255, alpha=0.625) # aka blues5[1]
        dark <- rgb(8/255, 81/255, 156/255, alpha=0.625) # aka blues5[5]
    } else {
        fairlylight <- rgb(204/255, 204/255, 204/255, alpha=0.5)         # grays with alpha-blending at 50%
        verylight <- rgb(242/255, 242/255, 242/255, alpha=0.5)
        dark <- 'black'
    }

    plot(x, ylim=range(range(xd+2*xv, xd-2*xv, na.rm=TRUE)), main=symbol, col=fairlylight) 		# basic xts plot
    polygon(x=xx, y=c(xyd$y[1]+xyv$y[1], xyd$y+2*xyv$y, rev(xyd$y+xyv$y)), border=NA, col=fairlylight) 	# upper
    polygon(x=xx, y=c(xyd$y[1]-1*xyv$y[1], xyd$y+1*xyv$y, rev(xyd$y-1*xyv$y)), border=NA, col=verylight)# center
    polygon(x=xx, y=c(xyd$y[1]-xyv$y[1], xyd$y-2*xyv$y, rev(xyd$y-xyv$y)), border=NA, col=fairlylight) 	# lower
    lines(xd, lwd=2, col=fairlylight)   # central smooted location
    lines(x, lwd=3, col=dark)           # actual price, thicker
    invisible(NULL)
}

After downloading data and computing the rolling smoothed mean and standard deviation, it really is just a matter of plotting (appropriate) filled polygons. Here I used colors from the neat RColorBrewer package with some alpha blending. Colors can be turned off via an option to the function; ranges, data length and type of smoother can also be picked.

To call this in R, simply source the file and the call, say, plotOBOS("^GSPC", years=2) which creates a two-year plot of the SP500 as shown here:

Example chart of overbought/oversold levels from plotOBOS() function

This shows the market did indeed bounce off the oversold lows nicely on a few occassions in 2009 and 2010 --- but also continued to slide after hitting the condition. Nothing is foolproof, and certainly nothing as simple as this is, so buyer beware. But it may prove useful in conjunction with other tools.

The code for the script is here and of course available under GPL 2 or later. I'd be happy to help incorporate it into some other finance package. Lastly, if you read this post this far, also consider our R / Finance conference coming at the end of April.

Edit: Corrected several typos with thanks to Josh.

/code/snippets | permanent link