Tue, 21 Apr 2015

Introducing ghrr: GitHub Hosted R Repository

ghrr logo

Background

R relies on package repositories for initial installation of a package via install.packages(). A crucial second step is update.packages(): For all currently installed packages, a list of available updates is constructed or offered for either one-by-one or bulk updates. This keeps the local packages in sync with upstream, and provides for a very convenient way to obtain new features, bug fixes and other improvements. So by installing from a repository, we automatically have the ability to track the repository for updates.

Enter drat

Fairly recently, the drat package was added to the R ecosystem. It makes both aspects of package distribution easy: providing a package (if you are an author) as well as installing it (if you are a user). Now, because drat is at the same time source code (as it is also a package providing the functionality), and a repository (using what drat provides ib features), the "namespace" becomes a little cluttered.

But because a key feature of drat is the "one variable" unique identification via the GitHub, I opted to create a drat repository in the name of a new organisation: ghrr. This is a simple acronym for GitHub Hosted R Repository.

Use cases

We can outline several use case for packages in ghrr:

  • packages not published in a repo by their authors: I already use two like that:
    • fasttime, an impeccably fast parser for ISO datetimes by Simon Urbanek which was however never released into a repo by Simon, and
    • RcppR6, a very nice extension to both R6 (by Winston) and Rcpp, by Rich FitzJohn; similarly never released beyond GitHub;
  • packages possibly unsuitable for mainline repos:
    • Rblpapi is a great package by Whit Armstong and John Laing to which I have been contributing quite a bit of late. As it requires a free-to-use but not open source library and headers from Bloomberg, it will never make it to the mainline repository for R, but hosting it in ghrr is perfect as I can easily update several machines at work once I cut a new development release;
    • winsorize is a small package I needed a few weeks ago; it is spun out of robustHD but does not yet contain new code so Andreas and I are content to keep it in this drat for now;
  • packages in pre-relase mode:
    • RcppArmadillo where I announced both a release candidate before Armadillo 5.000 came out, as well as the actual RcppArmadillo 0.500.0.0 which is not (yet) on the mainline repository as two affected packages need a small update first. Users, however, can get RcppArmadillo already from the sibling Rcpp drat repo.
    • RcppToml is a new package I am currently working on implementing a toml parser based on cpptoml. It works, but it not quite ready for public announcements yet, and hence perfect for ghrr.

Going forward

ghrr is meant to be open. While anybody can open a drat repository, particularly on GitHub, it may be beneficial to somehow group packages. This is however not something that can be planned ex-ante: it may just happen if others who see similar benefits in this can in fact contribute. In that spirit, I strongly encourage pull requests.

Early on, I made my commit messages conform to a pattern of package version sha1 repourl to make code provenance of every commit very clear. Ideally, subsequent commits would conform to such a scheme, or replace it with a better one.

Some Resources

A few links to learn more about drat and ghrr:

Comments and questions via email or issue tickets are more than welcome. We hope that others find ghrr to be a useful tool for easy repository management and use via GitHub.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

/code/drat | permanent link