Sat, 30 Jun 2012

Getting numpy data into R

The other day, I found myself confronted with a large number of large files, which were presented both in (gzip-)compressed ascii format---which R reads directly via gzfile() connections---and as (compressed) numpy files.

The numpy files can be read very efficiently into Python. We can do the same in R via save() and load(), of course, but the trouble is that the data has to be read into R first. And reading hundreds of megabytes from ascii is slow, no matter which language you use. On the R side, I poked around scan(), played with the colClasses argument, and looked at the recent LaF package written just for this purpose. All of these were still orders of magnitude slower than reading the numpy files. That is no surprise, as it is really hard to beat a binary format when countless ascii tokens have to be parsed.

So the obvious next idea was to read the numpy file in Python, and to write the data back out in a simple binary format. One helpful feature of this data set was that it contained only regular (rectangular) matrices of floats, so we could simply store two integers for the dimensions, followed by the data itself as either one large binary blob or a sequence of column vectors.
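
To make that concrete with the 3 x 5 example used below: two 4-byte integers for the header, plus 15 doubles at 8 bytes each, comes to 128 bytes for the whole file, with the data stored column after column. Here is a tiny sketch of that arithmetic (assuming the usual 4-byte int and 8-byte double):

# layout sketch for an n x k matrix of doubles (here 3 x 5)
n, k = 3, 5
header_bytes = 2 * 4               # two 4-byte integers: n and k
data_bytes   = n * k * 8           # n*k doubles, written column by column
print(header_bytes + data_bytes)   # 128 bytes in total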

But one minor trouble was that the Intertubes led to no easy solution for unpacking the numpy format. StackOverflow had plenty of questions around this topic, concerned with, say, how to serialize in a language-independent way, but no converters. And nobody local knew how to undo the "pickle" format underlying numpy.

But a remote friend did: Laurent, well-known for his Rpy2 package, pointed me to the struct module and steered me towards the solution shown below. So a shameless plug: if you need a very experienced Python or R consultant for science work, consider his consulting firm.

Finally, to round out this post, let's show the simple solution we crafted so that the next guy searching the Intertubes will have an easier time. Let us start with a minimal Python program writing numpy data to disk:

#!/usr/bin/env python
#
# simple example for creating numpy data to demonstrate converter

import numpy as np

# simple float array
a = np.arange(15).reshape(3,5) * 1.1

outfile = "/tmp/data.npy"
np.save(outfile, a)

Next, the simple Python converter creating a binary file that contains two integers for the row and column dimensions, followed by rows times columns of floats:

#!/usr/bin/python
#
# read a numpy file, and write a simple binary file containing
#   two integers 'n' and 'k' for rows and columns
#   n times k floats with the actual matrix
# which can be read by any application or language that can read binary
 
import struct
import numpy as np

inputfile = "/tmp/data.npy"
outputfile = "/tmp/data.bin"

# load from the file
mat = np.load(inputfile)

# create a binary file
binfile = open(outputfile, 'wb')
# and write out two integers with the row and column dimension
header = struct.pack('2I', mat.shape[0], mat.shape[1])
binfile.write(header)
# then loop over columns and write each
for i in range(mat.shape[1]):
    data = struct.pack('%id' % mat.shape[0], *mat[:,i])
    binfile.write(data)
binfile.close()
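
Before switching to R, it does not hurt to sanity-check the binary file from Python itself. This is just a quick sketch (not part of the converter above) that reads the header and the data back with struct.unpack and prints the dimensions and the first column:

#!/usr/bin/env python
#
# quick sanity check of the binary file written by the converter:
# read the two dimension integers, then all the doubles

import struct

binfile = open("/tmp/data.bin", "rb")
# header: two native unsigned ints with the dimensions
n, k = struct.unpack('2I', binfile.read(struct.calcsize('2I')))
# body: n*k native doubles, stored column after column
vals = struct.unpack('%id' % (n * k), binfile.read(n * k * 8))
binfile.close()

print((n, k))       # dimensions, (3, 5) for the example above
print(vals[:n])     # first column of the matrix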

Lastly, a quick littler script showing how R can read the data in a handful of lines:

#!/usr/bin/r

infile <- "/tmp/data.bin"
con <- file(infile, "rb")
dim <- readBin(con, "integer", 2)
Mat <- matrix( readBin(con, "numeric", prod(dim)), dim[1], dim[2])
close(con)

print(Mat)

That did the job---and I already used the converter to read a few weeks' worth of data for further analysis in R. This obviously isn't the last word on possible solutions, as the additional temporary file can be wasteful (unless it forms a cache for data read multiple times). If someone has a nicer solution, please don't hold back and contact me. Thanks again to Laurent for the winning suggestion concerning struct, and for help in getting the examples shown here to work.
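
For completeness, one way around the temporary file would be to have the converter write the very same binary stream to standard output, which R could then consume through a pipe() connection and the unchanged readBin() calls. The following is only a sketch of that idea (the script name in the comment is hypothetical, and on Python 3 sys.stdout.buffer would be needed instead of sys.stdout):

#!/usr/bin/env python
#
# sketch: same converter, but streaming to stdout instead of a file,
# so R could read it via e.g. pipe("python npy2stream.py", "rb")

import struct
import sys
import numpy as np

mat = np.load("/tmp/data.npy")

out = sys.stdout    # on Python 3 use sys.stdout.buffer
out.write(struct.pack('2I', mat.shape[0], mat.shape[1]))
for i in range(mat.shape[1]):
    out.write(struct.pack('%id' % mat.shape[0], *mat[:,i]))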

/code/snippets | permanent link