||Thinking inside the box|
Sat, 30 Jun 2012
Getting numpy data into R
The numpy can be read very efficiently into Python. We can do the same in R
So the obvious next idea was to read the numpy file in Python, and to write a simple binary format. One helpful feature with this data set was that it contained only regular (rectangular) matrices of floats. So we could just store two integers for the dimensions, followed by the total data in either one large binary blob, or a sequence of column vectors.
But one minor trouble was that the Intertubes lead to no easy solution to unpack the numpy format. StackOverflow had plenty of question around this topic converned with, say, how to serialize in language-independent way. But no converters. And nobody local knew how to undo the "pickle" format underlying numpy.
But a remote friend did:
well-known for his Rpy2
package, pointed me towards using the
Finally, to round out this post, let's show the simple solution we crafted so that the next guy searching the Intertubes will have an easier. Let us start with a minimal Python program writing numpy data to disk:
#!/usr/bin/env python # # simple example for creating numpy data to demonstrate converter import numpy as np # simple float array a = np.arange(15).reshape(3,5) * 1.1 outfile = "/tmp/data.npy" np.save(outfile, a)
Next, the simple Python converter to create a binary file containing two integers for row and column dimension, followed by row times columns of floats:
#!/usr/bin/python # # read a numpy file, and write a simple binary file containing # two integers 'n' and 'k' for rows and columns # n times k floats with the actual matrix # which can be read by any application or language that can read binary import struct import numpy as np inputfile = "/tmp/data.npy" outputfile = "/tmp/data.bin" # load from the file mat = np.load(inputfile) # create a binary file binfile = file(outputfile, 'wb') # and write out two integers with the row and column dimension header = struct.pack('2I', mat.shape, mat.shape) binfile.write(header) # then loop over columns and write each for i in range(mat.shape): data = struct.pack('%id' % mat.shape, *mat[:,i]) binfile.write(data) binfile.close()
Lastly, a quick littler script showing how R can read the data in a handful of lines:
That did the job---and I already used to converter to read a few weeks worth of data for further analysis in R. This obviously isn't the last word on possible solutions as the additional temporary file can be wasteful (unless it forms a cache for data read multiple times). If someone has nice solutions, please don't hold back and contact me. Thanks again to Laurent for the winning suggestion concerning#!/usr/bin/r infile <- "/tmp/data.bin" con <- file(infile, "rb") dim <- readBin(con, "integer", 2) Mat <- matrix( readBin(con, "numeric", prod(dim)), dim, dim) close(con) print(Mat)