Sun, 20 Aug 2017

#10: Compacting your Shared Libraries, After The Build

Welcome to the tenth post in the rarely ranting R recommendations series, or R4 for short. A few days ago we showed how to tell the linker to strip shared libraries. As discussed in the post, there are two options. One can either set up ~/.R/Makevars by passing the strip-debug option to the linker. Alternatively, one can adjust src/Makevars in the package itself with a bit a Makefile magic.

Of course, there is a third way: just run strip --strip-debug over all the shared libraries after the build. As the path is standardized, and the shell does proper globbing, we can just do

$ strip --strip-debug /usr/local/lib/R/site-library/*/libs/*.so

using a double-wildcard to get all packages (in that R package directory) and all their shared libraries. Users on macOS probably want .dylib on the end, users on Windows want another computer as usual (just kidding: use .dll). Either may have to adjust the path which is left as an exercise to the reader.

The impact can be Yuge as illustrated in the following dotplot:

This illustration is in response to a mailing list post. Last week, someone claimed on r-help that tidyverse would not install on Ubuntu 17.04. And this is of course patently false as many of us build and test on Ubuntu and related Linux systems, Travis runs on it, CRAN tests them etc pp. That poor user had somehow messed up their default gcc version. Anyway: I fired up a Docker container, installed r-base-core plus three required -dev packages (for xml2, openssl, and curl) and ran a single install.packages("tidyverse"). In a nutshell, following the launch of Docker for an Ubuntu 17.04 container, it was just

$ apt-get update
$ apt-get install r-base libcurl4-openssl-dev libssl-dev libxml2-dev
$ apt-get install mg          # a tiny editor
$ mg /etc/R/Rprofile.site     # to add a default CRAN repo
$ R -e 'install.packages("tidyverse")'

which not only worked (as expected) but also installed a whopping fifty-one packages (!!) of which twenty-six contain a shared library. A useful little trick is to run du with proper options to total, summarize, and use human units which reveals that these libraries occupy seventy-eight megabytes:

root@de443801b3fc:/# du -csh /usr/local/lib/R/site-library/*/libs/*so
4.3M    /usr/local/lib/R/site-library/Rcpp/libs/Rcpp.so
2.3M    /usr/local/lib/R/site-library/bindrcpp/libs/bindrcpp.so
144K    /usr/local/lib/R/site-library/colorspace/libs/colorspace.so
204K    /usr/local/lib/R/site-library/curl/libs/curl.so
328K    /usr/local/lib/R/site-library/digest/libs/digest.so
33M     /usr/local/lib/R/site-library/dplyr/libs/dplyr.so
36K     /usr/local/lib/R/site-library/glue/libs/glue.so
3.2M    /usr/local/lib/R/site-library/haven/libs/haven.so
272K    /usr/local/lib/R/site-library/jsonlite/libs/jsonlite.so
52K     /usr/local/lib/R/site-library/lazyeval/libs/lazyeval.so
64K     /usr/local/lib/R/site-library/lubridate/libs/lubridate.so
16K     /usr/local/lib/R/site-library/mime/libs/mime.so
124K    /usr/local/lib/R/site-library/mnormt/libs/mnormt.so
372K    /usr/local/lib/R/site-library/openssl/libs/openssl.so
772K    /usr/local/lib/R/site-library/plyr/libs/plyr.so
92K     /usr/local/lib/R/site-library/purrr/libs/purrr.so
13M     /usr/local/lib/R/site-library/readr/libs/readr.so
4.7M    /usr/local/lib/R/site-library/readxl/libs/readxl.so
1.2M    /usr/local/lib/R/site-library/reshape2/libs/reshape2.so
160K    /usr/local/lib/R/site-library/rlang/libs/rlang.so
928K    /usr/local/lib/R/site-library/scales/libs/scales.so
4.9M    /usr/local/lib/R/site-library/stringi/libs/stringi.so
1.3M    /usr/local/lib/R/site-library/tibble/libs/tibble.so
2.0M    /usr/local/lib/R/site-library/tidyr/libs/tidyr.so
1.2M    /usr/local/lib/R/site-library/tidyselect/libs/tidyselect.so
4.7M    /usr/local/lib/R/site-library/xml2/libs/xml2.so
78M     total
root@de443801b3fc:/# 

Looks like dplyr wins this one at thirty-three megabytes just for its shared library.

But with a single stroke of strip we can reduce all this down a lot:

root@de443801b3fc:/# strip --strip-debug /usr/local/lib/R/site-library/*/libs/*so
root@de443801b3fc:/# du -csh /usr/local/lib/R/site-library/*/libs/*so
440K    /usr/local/lib/R/site-library/Rcpp/libs/Rcpp.so
220K    /usr/local/lib/R/site-library/bindrcpp/libs/bindrcpp.so
52K     /usr/local/lib/R/site-library/colorspace/libs/colorspace.so
56K     /usr/local/lib/R/site-library/curl/libs/curl.so
120K    /usr/local/lib/R/site-library/digest/libs/digest.so
2.5M    /usr/local/lib/R/site-library/dplyr/libs/dplyr.so
16K     /usr/local/lib/R/site-library/glue/libs/glue.so
404K    /usr/local/lib/R/site-library/haven/libs/haven.so
76K     /usr/local/lib/R/site-library/jsonlite/libs/jsonlite.so
20K     /usr/local/lib/R/site-library/lazyeval/libs/lazyeval.so
24K     /usr/local/lib/R/site-library/lubridate/libs/lubridate.so
8.0K    /usr/local/lib/R/site-library/mime/libs/mime.so
52K     /usr/local/lib/R/site-library/mnormt/libs/mnormt.so
84K     /usr/local/lib/R/site-library/openssl/libs/openssl.so
76K     /usr/local/lib/R/site-library/plyr/libs/plyr.so
32K     /usr/local/lib/R/site-library/purrr/libs/purrr.so
648K    /usr/local/lib/R/site-library/readr/libs/readr.so
400K    /usr/local/lib/R/site-library/readxl/libs/readxl.so
128K    /usr/local/lib/R/site-library/reshape2/libs/reshape2.so
56K     /usr/local/lib/R/site-library/rlang/libs/rlang.so
100K    /usr/local/lib/R/site-library/scales/libs/scales.so
496K    /usr/local/lib/R/site-library/stringi/libs/stringi.so
124K    /usr/local/lib/R/site-library/tibble/libs/tibble.so
164K    /usr/local/lib/R/site-library/tidyr/libs/tidyr.so
104K    /usr/local/lib/R/site-library/tidyselect/libs/tidyselect.so
344K    /usr/local/lib/R/site-library/xml2/libs/xml2.so
6.6M    total
root@de443801b3fc:/#

Down to six point six megabytes. Not bad for one command. The chart visualizes the respective reductions. Clearly, C++ packages (and their template use) lead to more debugging symbols than plain old C code. But once stripped, the size differences are not that large.

And just to be plain, what we showed previously in post #9 does the same, only already at installation stage. The effects are not cumulative.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

/code/r4 | permanent link

Mon, 14 Aug 2017

#9: Compacting your Shared Libraries

Welcome to the nineth post in the recognisably rancid R randomness series, or R4 for short. Following on the heels of last week's post, we aim to look into the shared libraries created by R.

We love the R build process. It is robust, cross-platform, reliable and rather predicatable. It. Just. Works.

One minor issue, though, which has come up once or twice in the past is the (in)ability to fully control all compilation options. R will always recall CFLAGS, CXXFLAGS, ... etc as used when it was compiled. Which often entails the -g flag for debugging which can seriously inflate the size of the generated object code. And once stored in ${RHOME}/etc/Makeconf we cannot on the fly override these values.

But there is always a way. Sometimes even two.

The first is local and can be used via the (personal) ~/.R/Makevars file (about which I will have to say more in another post). But something I have been using quite a bite lately uses the flags for the shared library linker. Given that we can have different code flavours and compilation choices---between C, Fortran and the different C++ standards---one can end up with a few lines. I currently use this which uses -Wl, to pass an the -S (or --strip-debug) option to the linker (and also reiterates the desire for a shared library, presumably superfluous):

SHLIB_CXXLDFLAGS = -Wl,-S -shared
SHLIB_CXX11LDFLAGS = -Wl,-S -shared
SHLIB_CXX14LDFLAGS = -Wl,-S -shared
SHLIB_FCLDFLAGS = -Wl,-S -shared
SHLIB_LDFLAGS = -Wl,-S -shared

Let's consider an example: my most recently uploaded package RProtoBuf. Built under a standard 64-bit Linux setup (Ubuntu 17.04, g++ 6.3) and not using the above, we end up with library containing 12 megabytes (!!) of object code:

edd@brad:~/git/rprotobuf(feature/fewer_warnings)$ ls -lh src/RProtoBuf.so
-rwxr-xr-x 1 edd edd 12M Aug 14 20:22 src/RProtoBuf.so
edd@brad:~/git/rprotobuf(feature/fewer_warnings)$ 

However, if we use the flags shown above in .R/Makevars, we end up with much less:

edd@brad:~/git/rprotobuf(feature/fewer_warnings)$ ls -lh src/RProtoBuf.so 
-rwxr-xr-x 1 edd edd 626K Aug 14 20:29 src/RProtoBuf.so
edd@brad:~/git/rprotobuf(feature/fewer_warnings)$ 

So we reduced the size from 12mb to 0.6mb, an 18-fold decrease. And the file tool still shows the file as 'not stripped' as it still contains the symbols. Only debugging information was removed.

What reduction in size can one expect, generally speaking? I have seen substantial reductions for C++ code, particularly when using tenmplated code. More old-fashioned C code will be less affected. It seems a little difficult to tell---but this method is my new build default as I continually find rather substantial reductions in size (as I tend to work mostly with C++-based packages).

The second option only occured to me this evening, and complements the first which is after all only applicable locally via the ~/.R/Makevars file. What if we wanted it affect each installation of a package? The following addition to its src/Makevars should do:

strippedLib: $(SHLIB)
        if test -e "/usr/bin/strip"; then /usr/bin/strip --strip-debug $(SHLIB); fi

.phony: strippedLib

We declare a new Makefile target strippedLib. But making it dependent on $(SHLIB), we ensure the standard target of this Makefile is built. And by making the target .phony we ensure it will always be executed. And it simply tests for the strip tool, and invokes it on the library after it has been built. Needless to say we get the same reduction is size. And this scheme may even pass muster with CRAN, but I have not yet tried.

Lastly, and acknowledgement. Everything in this post has benefited from discussion with my former colleague Dan Dillon who went as far as setting up tooling in his r-stripper repository. What we have here may be simpler, but it would not have happened with what Dan had put together earlier.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

/code/r4 | permanent link

Thu, 10 Aug 2017

#8: Customizing Spell Checks for R CMD check

Welcome to the eight post in the ramblingly random R rants series, or R4 for short. We took a short break over the last few weeks due to some conferencing followed by some vacationing and general chill.

But we're back now, and this post gets us back to initial spirit of (hopefully) quick and useful posts. Perusing yesterday's batch of CRANberries posts, I noticed a peculiar new directory shown the in the diffstat output we use to compare two subsequent source tarballs. It was entitled .aspell/, in the top-level directory, and in two new packages by R Core member Kurt Hornik himself.

The context is, of course, the not infrequently-expressed desire to customize the spell checking done on CRAN incoming packages, see e.g. this r-package-devel thread.

And now we can as I verified with (the upcoming next release of) RcppArmadillo, along with a recent-enough (i.e. last few days) version of r-devel. Just copying what Kurt did, i.e. adding a file .aspell/defaults.R, and in it pointing to rds file (named as the package) containing a character vector with words added to the spell checker's universe is all it takes. For my package, see here for the peculiars.

Or see here:

edd@bud:~/git/rcpparmadillo/.aspell(master)$ cat defaults.R 
Rd_files <- vignettes <- R_files <- description <-
    list(encoding = "UTF-8",
         language = "en",
         dictionaries = c("en_stats", "RcppArmadillo"))
edd@bud:~/git/rcpparmadillo/.aspell(master)$ r -p -e 'readRDS("RcppArmadillo.rds")'
[1] "MPL"            "Sanderson"      "Templated"
[4] "decompositions" "onwards"        "templated"
edd@bud:~/git/rcpparmadillo/.aspell(master)$     

And now R(-devel) CMD check --as-cran ... is silent about spelling. Yay!

But take this with a grain of salt as this does not yet seem to be "announced" as e.g. yesterday's change in the CRAN Policy did not mention it. So things may well change -- but hey, it worked for me.

And this all is about aspell, here is something topical about a spell to close the post:

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

/code/r4 | permanent link

Tue, 13 Jun 2017

#7: C++14, R and Travis -- A useful hack

Welcome to the seventh post in the rarely relevant R ramblings series, or R4 for short.

We took a short break as several conferences and other events interfered during the month of May, keeping us busy and away from this series. But we are back now with a short and useful hack I came up with this weekend.

The topic is C++14, i.e. the newest formally approved language standard for C++, and its support in R and on Travis CI. With release R 3.4.0 of a few weeks ago, R now formally supports C++14. Which is great.

But there be devils. A little known fact is that R hangs on to its configuration settings from its own compile time. That matters in cases such as the one we are looking at here: Travis CI. Travis is a tremendously useful and widely-deployed service, most commonly connected to GitHub driving "continuous integration" (the 'CI') testing after each commit. But Travis CI, for as useful as it is, is also maddingly conservative still forcing everybody to live and die by [Ubuntu 14.04]http://releases.ubuntu.com/14.04/). So while we all benefit from the fine work by Michael who faithfully provides Ubuntu binaries for distribution via CRAN (based on the Debian builds provided by yours truly), we are stuck with Ubuntu 14.04. Which means that while Michael can provide us with current R 3.4.0 it will be built on ancient Ubuntu 14.04.

Why does this matter, you ask? Well, if you just try to turn the very C++14 support added to R 3.4.0 on in the binary running on Travis, you get this error:

** libs
Error in .shlib_internal(args) : 
  C++14 standard requested but CXX14 is not defined

And you get it whether or not you define CXX14 in the session.

So R (in version 3.4.0) may want to use C++14 (because a package we submitted requests it), but having been built on the dreaded Ubuntu 14.04, it just can't oblige. Even when we supply a newer compiler. Because R hangs on to its compile-time settings rather than current environment variables. And that means no C++14 as its compile-time compiler was too ancient. Trust me, I tried: adding not only g++-6 (from a suitable repo) but also adding C++14 as the value for CXX_STD. Alas, no mas.

The trick to overcome this is twofold, and fairly straightforward. First off, we just rely on the fact that g++ version 6 defaults to C++14. So by supplying g++-6, we are in the green. We have C++14 by default without requiring extra options. Sweet.

The remainder is to tell R to not try to enable C++14 even though we are using it. How? By removing CXX_STD=C++14 on the fly and just for Travis. And this can be done easily with a small script configure which conditions on being on Travis by checking two environment variables:

#!/bin/bash

## Travis can let us run R 3.4.0 (from CRAN and the PPAs) but this R version
## does not know about C++14.  Even though we can select CXX_STD = C++14, R
## will fail as the version we use there was built in too old an environment,
## namely Ubuntu "trusty" 14.04.
##
## So we install g++-6 from another repo and rely on the fact that is
## defaults to C++14.  Sadly, we need R to not fail and hence, just on
## Travis, remove the C++14 instruction

if [[ "${CI}" == "true" ]]; then
    if [[ "${TRAVIS}" == "true" ]]; then 
        echo "** Overriding src/Makevars and removing C++14 on Travis only"
        sed -i 's|CXX_STD = CXX14||' src/Makevars
    fi
fi

I have deployed this now for two sets of builds in two distinct repositories for two "under-development" packages not yet on CRAN, and it just works. In case you turn on C++14 via SystemRequirements: in the file DESCRIPTION, you need to modify it here.

So to sum up, there it is: C++14 with R 3.4.0 on Travis. Only takes a quick Travis-only modification.

/code/r4 | permanent link

Sun, 30 Apr 2017

#6: Easiest package registration

Welcome to the sixth post in the really random R riffs series, or R4 for short.

Posts #1 and #2 discussed how to get the now de rigeur package registration information computed. In essence, we pointed to something which R 3.4.0 would have, and provided tricks for accessing it while R 3.3.3 was still R-released.

But now R 3.4.0 is out, and life is good! Or at least this is easier. For example, a few days ago I committed this short helper script pnrrs.r to littler:

#!/usr/bin/r

if (getRversion() < "3.4.0") stop("Not available for R (< 3.4.0). Please upgrade.", call.=FALSE)

tools::package_native_routine_registration_skeleton(".")

So with this example script pnrrs.r soft-linked to /usr/local/bin (or ~/bin) as I commonly do with littler helpers, all it takes is

cd some/R/package/source
pnrrs.r

and the desired file usable as src/init.c is on stdout. Editing NAMESPACE is quick too, and we're all done. See the other two posts for additional context. If you don't have littler, the above also works with Rscript.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

/code/r4 | permanent link

Fri, 14 Apr 2017

#5: Easy package information

Welcome to the fifth post in the recklessly rambling R rants series, or R4 for short.

The third post showed an easy way to follow R development by monitoring (curated) changes on the NEWS file for the development version r-devel. As a concrete example, I mentioned that it has shown a nice new function (tools::CRAN_package_db()) coming up in R 3.4.0. Today we will build on that.

Consider the following short snippet:

library(data.table)

getPkgInfo <- function() {
    if (exists("tools::CRAN_package_db")) {
        dat <- tools::CRAN_package_db()
    } else {
        tf <- tempfile()
        download.file("https://cloud.r-project.org/src/contrib/PACKAGES.rds", tf, quiet=TRUE)
        dat <- readRDS(tf)              # r-devel can now readRDS off a URL too
    }
    dat <- as.data.frame(dat)
    setDT(dat)
    dat
}

It defines a simple function getPkgInfo() as a wrapper around said new function from R 3.4.0, ie tools::CRAN_package_db(), and a fallback alternative using a tempfile (in the automagically cleaned R temp directory) and an explicit download and read of the underlying RDS file. As an aside, just this week the r-devel NEWS told us that such readRDS() operations can now read directly from URL connection. Very nice---as RDS is a fantastic file format when you are working in R.

Anyway, back to the RDS file! The snippet above returns a data.table object with as many rows as there are packages on CRAN, and basically all their (parsed !!) DESCRIPTION info and then some. A gold mine!

Consider this to see how many package have a dependency (in the sense of Depends, Imports or LinkingTo, but not Suggests because Suggests != Depends) on Rcpp:

R> dat <- getPkgInfo()
R> rcppRevDepInd <- as.integer(tools::dependsOnPkgs("Rcpp", recursive=FALSE, installed=dat))
R> length(rcppRevDepInd)
[1] 998
R>

So exciting---we will hit 1000 within days! But let's do some more analysis:

R> dat[ rcppRevDepInd, RcppRevDep := TRUE]  # set to TRUE for given set
R> dat[ RcppRevDep==TRUE, 1:2]
           Package Version
  1:      ABCoptim  0.14.0
  2: AbsFilterGSEA     1.5
  3:           acc   1.3.3
  4: accelerometry   2.2.5
  5:      acebayes   1.3.4
 ---                      
994:        yakmoR   0.1.1
995:  yCrypticRNAs  0.99.2
996:         yuima   1.5.9
997:           zic     0.9
998:       ziphsmm   1.0.4
R>

Here we index the reverse dependency using the vector we had just computed, and then that new variable to subset the data.table object. Given the aforementioned parsed information from all the DESCRIPTION files, we can learn more:

R> ## likely false entries
R> dat[ RcppRevDep==TRUE, ][NeedsCompilation!="yes", c(1:2,4)]
            Package Version                                                                         Depends
 1:         baitmet   1.0.0                                                           Rcpp, erah (>= 1.0.5)
 2:           bea.R   1.0.1                                                        R (>= 3.2.1), data.table
 3:            brms   1.6.0                     R (>= 3.2.0), Rcpp (>= 0.12.0), ggplot2 (>= 2.0.0), methods
 4: classifierplots   1.3.3                             R (>= 3.1), ggplot2 (>= 2.2), data.table (>= 1.10),
 5:           ctsem   2.3.1                                           R (>= 3.2.0), OpenMx (>= 2.3.0), Rcpp
 6:        DeLorean   1.2.4                                                  R (>= 3.0.2), Rcpp (>= 0.12.0)
 7:            erah   1.0.5                                                               R (>= 2.10), Rcpp
 8:             GxM     1.1                                                                              NA
 9:             hmi   0.6.3                                                                    R (>= 3.0.0)
10:        humarray     1.1 R (>= 3.2), NCmisc (>= 1.1.4), IRanges (>= 1.22.10),\nGenomicRanges (>= 1.16.4)
11:         iNextPD   0.3.2                                                                    R (>= 3.1.2)
12:          joinXL   1.0.1                                                                    R (>= 3.3.1)
13:            mafs   0.0.2                                                                              NA
14:            mlxR   3.1.0                                                           R (>= 3.0.1), ggplot2
15:    RmixmodCombi     1.0              R(>= 3.0.2), Rmixmod(>= 2.0.1), Rcpp(>= 0.8.0), methods,\ngraphics
16:             rrr   1.0.0                                                                    R (>= 3.2.0)
17:        UncerIn2     2.0                          R (>= 3.0.0), sp, RandomFields, automap, fields, gstat
R> 

There are a full seventeen packages which claim to depend on Rcpp while not having any compiled code of their own. That is likely false---but I keep them in my counts, however relunctantly. A CRAN-declared Depends: is a Depends:, after all.

Another nice thing to look at is the total number of package that declare that they need compilation:

R> ## number of packages with compiled code
R> dat[ , .(N=.N), by=NeedsCompilation]
   NeedsCompilation    N
1:               no 7625
2:              yes 2832
3:               No    1
R>

Isn't that awesome? It is 2832 out of (currently) 10458, or about 27.1%. Just over one in four. Now the 998 for Rcpp look even better as they are about 35% of all such packages. In order words, a little over one third of all packages with compiled code (which may be legacy C, Fortran or C++) use Rcpp. Wow.

Before closing, one shoutout to Dirk Schumacher whose thankr which I made the center of the last post is now on CRAN. As a mighty fine and slim micropackage without external dependencies. Neat.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

/code/r4 | permanent link

Sat, 08 Apr 2017

#4: Simpler shoulders()

Welcome to the fourth post in the repulsively random R ramblings series, or R4 for short.

My twitter feed was buzzing about a nice (and as yet unpublished, ie not-on-CRAN) package thankr by Dirk Schumacher which compiles a list of packages (ordered by maintainer count) for your current session (or installation or ...) with a view towards saying thank you to those whose packages we rely upon. Very nice indeed.

I had a quick look and run it twice ... and had a reaction of ewwww, really? as running it twice gave different results as on the second instance a boatload of tibblyverse packages appeared. Because apparently kids these day can only slice data that has been tidied or something.

So I had another quick look ... and put together an alternative version using just base R (as there was only one subfunction that needed reworking):

source(file="https://raw.githubusercontent.com/dirkschumacher/thankr/master/R/shoulders.R")
format_pkg_df <- function(df) { # non-tibblyverse variant
    tb <- table(df[,2])
    od <- order(tb, decreasing=TRUE)
    ndf <- data.frame(maint=names(tb)[od], npkgs=as.integer(tb[od]))
    colpkgs <- function(m, df) { paste(df[ df$maintainer == m, "pkg_name"], collapse=",") }
    ndf[, "pkg"] <- sapply(ndf$maint, colpkgs, df)
    ndf
}

A nice side benefit is that the function is now free of external dependencies (besides, of course, base R). Running this in the ESS session I had open gives:

R> shoulders()  ## by Dirk Schumacher, with small modifications
                               maint npkgs                                                                 pkg
1 R Core Team <R-core@r-project.org>     9 compiler,graphics,tools,utils,grDevices,stats,datasets,methods,base
2 Dirk Eddelbuettel <edd@debian.org>     4                                  RcppTOML,Rcpp,RApiDatetime,anytime
3  Matt Dowle <mattjdowle@gmail.com>     1                                                          data.table
R> 

and for good measure a screenshot is below:

I think we need a catchy moniker for R work using good old base R. SoberVerse? GrumbyOldFolksR? PlainOldR? Better suggestions welcome.

Edit on 2017-04-09: And by now Dirk Schumacher fixed that little bug in thankr which was at the start of this. His shoulders() function is now free of side effects, and thankr is now a clean micropackage free of external depends from any verse, be it tiddly or grumpy. I look forward to seeing it on CRAN soon!

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

/code/r4 | permanent link

Thu, 06 Apr 2017

#3: Follow R-devel

Welcome to the third post in the rarely relevant R recommendation series, or R4 for short.

Today will be brief, but of some importance. In order to know where R is going next, few places provide a better vantage point than the actual ongoing development.

A few years ago, I mentioned to Duncan Murdoch how straightforward the setup of my CRANberries feed (and site) was. After all, static blog compilers converting textual input to html, rss feed and whatnot have been around for fifteen years (though they keep getting reinvented). He took this to heart and built the (not too pretty) R-devel daily site (which also uses a fancy diff tool as it shows changes in NEWS) as well as a more general description of all available sub-feeds. I follow this mostly through blog aggregations -- Google Reader in its day, now Feedly. A screenshot is below just to show that it doesn't have to be ugly just because it is on them intertubes:

This shows a particularly useful day when R-devel folded into the new branch for what will be the R 3.4.0 release come April 21. The list of upcoming changes is truly impressive and quite comprehensive -- and the package registration helper, focus of posts #1 and #2 here, is but one of these many changes.

One function I learned about that day is tools::CRAN_package_db(), a helper to get a single (large) data.frame with all package DESCRIPTION information. Very handy. Others may have noticed that CRAN repos now have a new top-level file PACKAGES.rds and this function does indeed just fetch it--which you could do with a similar one-liner in R-release as well. Still very handy.

But do read about changes in R-devel and hence upcoming changes in R 3.4.0. Lots of good things coming our way.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

/code/r4 | permanent link

Thu, 30 Mar 2017

#2: Even Easier Package Registration

Welcome to the second post in rambling random R recommendation series, or R4 for short.

Two days ago I posted the initial (actual) post. It provided context for why we need package registration entries (tl;dr: because R CMD check now tests for it, and because it The Right Thing to do, see documentation in the posts). I also showed how generating such a file src/init.c was essentially free as all it took was single call to a new helper function added to R-devel by Brian Ripley and Kurt Hornik.

Now, to actually use R-devel you obviously need to have it accessible. There are a myriad of ways to achieve that: just compile it locally as I have done for years, use a Docker image as I showed in the post -- or be creative with eg Travis or win-builder both of which give you access to R-devel if you're clever about it.

But as no good deed goes unpunished, I was of course told off today for showing a Docker example as Docker was not "Easy". I think the formal answer to that is baloney. But we leave that aside, and promise to discuss setting up Docker at another time.

R is after all ... just R. So below please find a script you can save as, say, ~/bin/pnrrs.r. And calling it---even with R-release---will generate the same code snippet as I showed via Docker. Call it a one-off backport of the new helper function -- with a half-life of a few weeks at best as we will have R 3.4.0 as default in just a few weeks. The script will then reduce to just the final line as the code will be present with R 3.4.0.

#!/usr/bin/r

library(tools)

.find_calls_in_package_code <- tools:::.find_calls_in_package_code
.read_description <- tools:::.read_description

## all what follows is from R-devel aka R 3.4.0 to be

package_ff_call_db <- function(dir) {
    ## A few packages such as CDM use base::.Call
    ff_call_names <- c(".C", ".Call", ".Fortran", ".External",
                       "base::.C", "base::.Call",
                       "base::.Fortran", "base::.External")

    predicate <- function(e) {
        (length(e) > 1L) &&
            !is.na(match(deparse(e[[1L]]), ff_call_names))
    }

    calls <- .find_calls_in_package_code(dir,
                                         predicate = predicate,
                                         recursive = TRUE)
    calls <- unlist(Filter(length, calls))

    if(!length(calls)) return(NULL)

    attr(calls, "dir") <- dir
    calls
}

native_routine_registration_db_from_ff_call_db <- function(calls, dir = NULL, character_only = TRUE) {
    if(!length(calls)) return(NULL)

    ff_call_names <- c(".C", ".Call", ".Fortran", ".External")
    ff_call_args <- lapply(ff_call_names,
                           function(e) args(get(e, baseenv())))
    names(ff_call_args) <- ff_call_names
    ff_call_args_names <-
        lapply(lapply(ff_call_args,
                      function(e) names(formals(e))), setdiff,
               "...")

    if(is.null(dir))
        dir <- attr(calls, "dir")

    package <- # drop name
        as.vector(.read_description(file.path(dir, "DESCRIPTION"))["Package"])

    symbols <- character()
    nrdb <-
        lapply(calls,
               function(e) {
                   if (startsWith(deparse(e[[1L]]), "base::"))
                       e[[1L]] <- e[[1L]][3L]
                   ## First figure out whether ff calls had '...'.
                   pos <- which(unlist(Map(identical,
                                           lapply(e, as.character),
                                           "...")))
                   ## Then match the call with '...' dropped.
                   ## Note that only .NAME could be given by name or
                   ## positionally (the other ff interface named
                   ## arguments come after '...').
                   if(length(pos)) e <- e[-pos]
                   ## drop calls with only ...
                   if(length(e) < 2L) return(NULL)
                   cname <- as.character(e[[1L]])
                   ## The help says
                   ##
                   ## '.NAME' is always matched to the first argument
                   ## supplied (which should not be named).
                   ##
                   ## But some people do (Geneland ...).
                   nm <- names(e); nm[2L] <- ""; names(e) <- nm
                   e <- match.call(ff_call_args[[cname]], e)
                   ## Only keep ff calls where .NAME is character
                   ## or (optionally) a name.
                   s <- e[[".NAME"]]
                   if(is.name(s)) {
                       s <- deparse(s)[1L]
                       if(character_only) {
                           symbols <<- c(symbols, s)
                           return(NULL)
                       }
                   } else if(is.character(s)) {
                       s <- s[1L]
                   } else { ## expressions
                       symbols <<- c(symbols, deparse(s))
                       return(NULL)
                   }
                   ## Drop the ones where PACKAGE gives a different
                   ## package. Ignore those which are not char strings.
                   if(!is.null(p <- e[["PACKAGE"]]) &&
                      is.character(p) && !identical(p, package))
                       return(NULL)
                   n <- if(length(pos)) {
                            ## Cannot determine the number of args: use
                            ## -1 which might be ok for .External().
                            -1L
                        } else {
                            sum(is.na(match(names(e),
                                            ff_call_args_names[[cname]]))) - 1L
                        }
                   ## Could perhaps also record whether 's' was a symbol
                   ## or a character string ...
                   cbind(cname, s, n)
               })
    nrdb <- do.call(rbind, nrdb)
    nrdb <- as.data.frame(unique(nrdb), stringsAsFactors = FALSE)

    if(NROW(nrdb) == 0L || length(nrdb) != 3L)
        stop("no native symbols were extracted")
    nrdb[, 3L] <- as.numeric(nrdb[, 3L])
    nrdb <- nrdb[order(nrdb[, 1L], nrdb[, 2L], nrdb[, 3L]), ]
    nms <- nrdb[, "s"]
    dups <- unique(nms[duplicated(nms)])

    ## Now get the namespace info for the package.
    info <- parseNamespaceFile(basename(dir), dirname(dir))
    ## Could have ff calls with symbols imported from other packages:
    ## try dropping these eventually.
    imports <- info$imports
    imports <- imports[lengths(imports) == 2L]
    imports <- unlist(lapply(imports, `[[`, 2L))

    info <- info$nativeRoutines[[package]]
    ## Adjust native routine names for explicit remapping or
    ## namespace .fixes.
    if(length(symnames <- info$symbolNames)) {
        ind <- match(nrdb[, 2L], names(symnames), nomatch = 0L)
        nrdb[ind > 0L, 2L] <- symnames[ind]
    } else if(!character_only &&
              any((fixes <- info$registrationFixes) != "")) {
        ## There are packages which have not used the fixes, e.g. utf8latex
        ## fixes[1L] is a prefix, fixes[2L] is an undocumented suffix
        nrdb[, 2L] <- sub(paste0("^", fixes[1L]), "", nrdb[, 2L])
        if(nzchar(fixes[2L]))
            nrdb[, 2L] <- sub(paste0(fixes[2L]), "$", "", nrdb[, 2L])
    }
    ## See above.
    if(any(ind <- !is.na(match(nrdb[, 2L], imports))))
        nrdb <- nrdb[!ind, , drop = FALSE]

    ## Fortran entry points are mapped to l/case
    dotF <- nrdb$cname == ".Fortran"
    nrdb[dotF, "s"] <- tolower(nrdb[dotF, "s"])

    attr(nrdb, "package") <- package
    attr(nrdb, "duplicates") <- dups
    attr(nrdb, "symbols") <- unique(symbols)
    nrdb
}

format_native_routine_registration_db_for_skeleton <- function(nrdb, align = TRUE, include_declarations = FALSE) {
    if(!length(nrdb))
        return(character())

    fmt1 <- function(x, n) {
        c(if(align) {
              paste(format(sprintf("    {\"%s\",", x[, 1L])),
                    format(sprintf(if(n == "Fortran")
                                       "(DL_FUNC) &F77_NAME(%s),"
                                   else
                                       "(DL_FUNC) &%s,",
                                   x[, 1L])),
                    format(sprintf("%d},", x[, 2L]),
                           justify = "right"))
          } else {
              sprintf(if(n == "Fortran")
                          "    {\"%s\", (DL_FUNC) &F77_NAME(%s), %d},"
                      else
                          "    {\"%s\", (DL_FUNC) &%s, %d},",
                      x[, 1L],
                      x[, 1L],
                      x[, 2L])
          },
          "    {NULL, NULL, 0}")
    }

    package <- attr(nrdb, "package")
    dups <- attr(nrdb, "duplicates")
    symbols <- attr(nrdb, "symbols")

    nrdb <- split(nrdb[, -1L, drop = FALSE],
                  factor(nrdb[, 1L],
                         levels =
                             c(".C", ".Call", ".Fortran", ".External")))

    has <- vapply(nrdb, NROW, 0L) > 0L
    nms <- names(nrdb)
    entries <- substring(nms, 2L)
    blocks <- Map(function(x, n) {
                      c(sprintf("static const R_%sMethodDef %sEntries[] = {",
                                n, n),
                        fmt1(x, n),
                        "};",
                        "")
                  },
                  nrdb[has],
                  entries[has])

    decls <- c(
        "/* FIXME: ",
        "   Add declarations for the native routines registered below.",
        "*/")

    if(include_declarations) {
        decls <- c(
            "/* FIXME: ",
            "   Check these declarations against the C/Fortran source code.",
            "*/",
            if(NROW(y <- nrdb$.C)) {
                 args <- sapply(y$n, function(n) if(n >= 0)
                                paste(rep("void *", n), collapse=", ")
                                else "/* FIXME */")
                c("", "/* .C calls */",
                  paste0("extern void ", y$s, "(", args, ");"))
           },
            if(NROW(y <- nrdb$.Call)) {
                args <- sapply(y$n, function(n) if(n >= 0)
                               paste(rep("SEXP", n), collapse=", ")
                               else "/* FIXME */")
               c("", "/* .Call calls */",
                  paste0("extern SEXP ", y$s, "(", args, ");"))
            },
            if(NROW(y <- nrdb$.Fortran)) {
                 args <- sapply(y$n, function(n) if(n >= 0)
                                paste(rep("void *", n), collapse=", ")
                                else "/* FIXME */")
                c("", "/* .Fortran calls */",
                  paste0("extern void F77_NAME(", y$s, ")(", args, ");"))
            },
            if(NROW(y <- nrdb$.External))
                c("", "/* .External calls */",
                  paste0("extern SEXP ", y$s, "(SEXP);"))
            )
    }

    headers <- if(NROW(nrdb$.Call) || NROW(nrdb$.External))
        c("#include <R.h>", "#include <Rinternals.h>")
    else if(NROW(nrdb$.Fortran)) "#include <R_ext/RS.h>"
    else character()

    c(headers,
      "#include <stdlib.h> // for NULL",
      "#include <R_ext/Rdynload.h>",
      "",
      if(length(symbols)) {
          c("/*",
            "  The following symbols/expresssions for .NAME have been omitted",
            "", strwrap(symbols, indent = 4, exdent = 4), "",
            "  Most likely possible values need to be added below.",
            "*/", "")
      },
      if(length(dups)) {
          c("/*",
            "  The following name(s) appear with different usages",
            "  e.g., with different numbers of arguments:",
            "", strwrap(dups, indent = 4, exdent = 4), "",
            "  This needs to be resolved in the tables and any declarations.",
            "*/", "")
      },
      decls,
      "",
      unlist(blocks, use.names = FALSE),
      ## We cannot use names with '.' in: WRE mentions replacing with "_"
      sprintf("void R_init_%s(DllInfo *dll)",
              gsub(".", "_", package, fixed = TRUE)),
      "{",
      sprintf("    R_registerRoutines(dll, %s);",
              paste0(ifelse(has,
                            paste0(entries, "Entries"),
                            "NULL"),
                     collapse = ", ")),
      "    R_useDynamicSymbols(dll, FALSE);",
      "}")
}

package_native_routine_registration_db <- function(dir, character_only = TRUE) {
    calls <- package_ff_call_db(dir)
    native_routine_registration_db_from_ff_call_db(calls, dir, character_only)
}

package_native_routine_registration_db <- function(dir, character_only = TRUE) {
    calls <- package_ff_call_db(dir)
    native_routine_registration_db_from_ff_call_db(calls, dir, character_only)
}

package_native_routine_registration_skeleton <- function(dir, con = stdout(), align = TRUE,
                                                         character_only = TRUE, include_declarations = TRUE) {
    nrdb <- package_native_routine_registration_db(dir, character_only)
    writeLines(format_native_routine_registration_db_for_skeleton(nrdb,
                align, include_declarations),
               con)
}

package_native_routine_registration_skeleton(".")  ## when R 3.4.0 is out you only need this line

Here I use /usr/bin/r as I happen to like littler a lot, but you can use Rscript the same way.

Easy enough now?

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

/code/r4 | permanent link

Wed, 29 Mar 2017

#1: Easy Package Registration

Welcome to the first actual post in the R4 series, following the short announcement earlier this week.

Context

Last month, Brian Ripley announced on r-devel that registration of routines would now be tested for by R CMD check in r-devel (which by next month will become R 3.4.0). A NOTE will be issued now, this will presumably turn into a WARNING at some point. Writing R Extensions has an updated introduction) of the topic.

Package registration has long been available, and applies to all native (i.e. "compiled") function via the .C(), .Call(), .Fortran() or .External() interfaces. If you use any of those -- and .Call() may be the only truly relevant one here -- then is of interest to you.

Brian Ripley and Kurt Hornik also added a new helper function: tools::package_native_routine_registration_skeleton(). It parses the R code of your package and collects all native function entrypoints in order to autogenerate the registration. It is available in R-devel now, will be in R 3.4.0 and makes adding such registration truly trivial.

But as of today, it requires that you have R-devel. Once R 3.4.0 is out, you can call the helper directly.

As for R-devel, there have always been at least two ways to use it: build it locally (which we may cover in another R4 installment), or using Docker. Here will focus on the latter by relying on the Rocker project by Carl and myself.

Use the Rocker drd Image

We assume you can run Docker on your system. How to add it on Windows, macOS or Linux is beyond our scope here today, but also covered extensively elsewhere. So we assume you can execute docker and e.g. bring in the 'daily r-devel' image drd from our Rocker project via

~$ docker pull rocker/drd

With that, we can use R-devel to create the registration file very easily in a single call (which is a long command-line we have broken up with one line-break for the display below).

The following is real-life example when I needed to add registration to the RcppTOML package for this week's update:

~/git/rcpptoml(master)$ docker run --rm -ti -v $(pwd):/mnt rocker/drd \  ## line break
             RD --slave -e 'tools::package_native_routine_registration_skeleton("/mnt")'
#include <R.h>
#include <Rinternals.h>
#include <stdlib.h> // for NULL
#include <R_ext/Rdynload.h>

/* FIXME: 
   Check these declarations against the C/Fortran source code.
*/

/* .Call calls */
extern SEXP RcppTOML_tomlparseImpl(SEXP, SEXP, SEXP);

static const R_CallMethodDef CallEntries[] = {
    {"RcppTOML_tomlparseImpl", (DL_FUNC) &RcppTOML_tomlparseImpl, 3},
    {NULL, NULL, 0}
};

void R_init_RcppTOML(DllInfo *dll)
{
    R_registerRoutines(dll, NULL, CallEntries, NULL, NULL);
    R_useDynamicSymbols(dll, FALSE);
}
edd@max:~/git/rcpptoml(master)$ 

Decompose the Command

We can understand the docker command invocation above through its components:

  • docker run is the basic call to a container
  • --rm -ti does subsequent cleanup (--rm) and gives a terminal (-t) that is interactive (-i)
  • -v $(pwd):/mnt uses the -v a:b options to make local directory a available as b in the container; here $(pwd) calls print working directory to get the local directory which is now mapped to /mnt in the container
  • rocker/drd invokes the 'drd' container of the Rocker project
  • RD is a shorthand to the R-devel binary inside the container, and the main reason we use this container
  • -e 'tools::package_native_routine_registration_skeleton("/mnt") calls the helper function of R (currently in R-devel only, so we use RD) to compute a possible init.c file based on the current directory -- which is /mnt inside the container

That it. We get a call to the R function executed inside the Docker container, examining the package in the working directory and creating a registration file for it which is display to the console.

Copy the Output to src/init.c

We simply copy the output to a file src/init.c; I often fold one opening brace one line up.

Update NAMESPACE

We also change one line in NAMESPACE from (for this package) useDynLib("RcppTOML") to useDynLib("RcppTOML", .registration=TRUE). Adjust accordingly for other package names.

That's it!

And with that we a have a package which no longer provokes the NOTE as seen by the checks page. Calls to native routines are now safer (less of a chance for name clashing), get called quicker as we skip the symbol search (see the WRE discussion) and best of all this applies to all native routines whether written by hand or written via a generator such as Rcpp Attributes.

So give this a try to get your package up-to-date.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

/code/r4 | permanent link

Mon, 27 Mar 2017

#0: Introducing R^4

So I had been toying with the idea of getting back to the blog and more regularly writing / posting little tips and tricks. I even started taking some notes but because perfect is always the enemy of the good it never quite materialized.

But the relatively broad discussion spawned by last week's short rant on Suggests != Depends made a few things clear. There appears to be an audience. It doesn't have to be long. And it doesn't have to be too polished.

So with that, let's get the blogging back from micro-blogging.

This note forms post zero of what will a new segment I call R4 which is shorthand for relatively random R rambling.

Stay tuned.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

/code/r4 | permanent link