houba provides manipulation of large data through memory-mapped files, supporting vectors, matrices, and arrays. This makes it possible to work with large datasets by keeping them on disk.
houba defines three S4 classes:
mvector for memory-mapped vectors
mmatrix for memory-mapped matrices
marray for memory-mapped arrays
Currently, it supports float, double, integer and char data types.
houba lets you extract sub-vectors or sub-matrices and make assignments.
It also performs component-wise arithmetic operations (currently no matrix
arithmetic). In-place arithmetic operations are supported. rowSums, colSums,
rowMeans and colMeans methods are defined for memory-mapped matrices.
Minimal compatibility with the bigmemory package is provided through descriptor files.
NOTE 1 A current limitation of houba is that it relies on R integers for indices, so vectors of length larger than 2,147,483,647 can’t be manipulated. The same limitation applies to the dimensions of matrices and arrays.
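The bound is the largest value an R integer can hold. A minimal base R sketch (it does not call houba) shows that value and the rough size of a double vector of that length, assuming the usual 8 bytes per double:

```r
# Largest value an R integer can represent; indices cannot exceed it.
max_index <- .Machine$integer.max
print(max_index)

# Approximate size, in GiB, of a double vector of that length
# (8 bytes per element; computed in floating point to avoid integer overflow).
size_gib <- as.numeric(max_index) * 8 / 2^30
print(size_gib)
```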
NOTE 2 houba relies on the C++ header-only library mio by vimpunk, which is under the MIT Licence: https://github.com/vimpunk/mio.
To create zero-filled objects, associated with new files, use
mvector, mmatrix and marray.
Here we create a memory-mapped vector of length 100, associated with a temporary file:
A <- mvector(datatype = "double", length = 100)
A
## A mvector of length 100 
## data type:  double 
## File: /tmp/Rtmppw0bVM/mmatrix209eaecdac51d 
## --- excerpt
## [1] 0 0 0 0 0
We can specify the filename for the backing file. Here we create a memory-mapped matrix:
filename <- file.path(tempdir(), "integers120")
B <- mmatrix(datatype = "integer", nrow = 12, ncol = 10, filename = filename)
B
## A mmatrix with 12 rows and 10 cols
## data type:  integer 
## File: /tmp/Rtmppw0bVM/integers120 
## --- excerpt
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    0    0    0    0    0
## [2,]    0    0    0    0    0
## [3,]    0    0    0    0    0
## [4,]    0    0    0    0    0
## [5,]    0    0    0    0    0
Similarly, marray("float", c(10, 20, 3)) creates a 10 by 20 by 3 array.
The methods as.mvector, as.mmatrix and as.marray create a file
holding the content of an R object.
# Convert regular R objects to memory-mapped objects
a <- matrix(1:20, 4, 5)
A <- as.mmatrix(a, datatype = "float")
A
## A mmatrix with 4 rows and 5 cols
## data type:  float 
## File: /tmp/Rtmppw0bVM/mmatrix209eae526be3f1 
## --- excerpt
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    5    9   13   17
## [2,]    2    6   10   14   18
## [3,]    3    7   11   15   19
## [4,]    4    8   12   16   20
If datatype is not provided, the method will use integer or double,
depending on the type of the R object.
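Base R reports the storage type of an object with typeof, which is presumably what drives this default. A short illustration using only base R, on objects like those in this section:

```r
# Integer sequences and matrices built from them are stored as integers.
type_vec <- typeof(1:10)
type_mat <- typeof(matrix(1:20, 4, 5))
# A numeric vector written with decimals is stored as double.
type_dbl <- typeof(c(0.5, 1.5))
print(c(type_vec, type_mat, type_dbl))
```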
v <- 1:10
V <- as.mvector(v)
V
## A mvector of length 10 
## data type:  integer 
## File: /tmp/Rtmppw0bVM/mmatrix209eae70a74bd6 
## --- excerpt
## [1] 1 2 3 4 5
These methods also have an argument filename.
You can recover an R object using as.vector, as.matrix and as.array:
as.vector(V)
##  [1]  1  2  3  4  5  6  7  8  9 10
An existing file can be mapped, as long as it has the right size. Here
we use the file mapped in B created above.
C <- mvector("int", 120, filename)
C
## A read-only mvector of length 120 
## data type:  integer 
## File: /tmp/Rtmppw0bVM/integers120 
## --- excerpt
## [1] 0 0 0 0 0
Providing an incompatible size will raise an error.
D <- mvector("int", 100, filename)
## Error: The file size doesn't match the matrix size
The mvector C is read-only; this is the default when mapping an
existing file. You can change this by providing the argument
readonly = FALSE to mvector.
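The size check can be reasoned about with base R alone. The sketch below assumes that a backing file holds raw values with no header, using 4 bytes per integer or float, 8 per double and 1 per char; this layout is an assumption, not something stated above, and expected_bytes is a hypothetical helper rather than a houba function.

```r
# Hypothetical helper (not part of houba): expected size in bytes of a
# backing file, ASSUMING raw values with no header.
expected_bytes <- function(datatype, length) {
  width <- c(char = 1, integer = 4, float = 4, double = 8)
  unname(width[datatype]) * length
}

# Write 120 zero integers (4 bytes each) to a temporary file with base R.
path <- tempfile()
writeBin(integer(120), path, size = 4)

print(file.size(path))                                   # size on disk, in bytes
print(file.size(path) == expected_bytes("integer", 120)) # sizes agree
print(file.size(path) == expected_bytes("integer", 100)) # a length of 100 does not match
```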
Since C and B map the same file,
modifying one object should modify the other:
B[1:4] <- 1:4
C
## A read-only mvector of length 120 
## data type:  integer 
## File: /tmp/Rtmppw0bVM/integers120 
## --- excerpt
## [1] 1 2 3 4 0
However, this may not always work, depending on your system, or
when a file is mapped by several R sessions. The function
flush makes sure all changes are written to disk:
B[1:4] <- 2:5
flush(B)
C
## A read-only mvector of length 120 
## data type:  integer 
## File: /tmp/Rtmppw0bVM/integers120 
## --- excerpt
## [1] 2 3 4 5 0
Descriptor files aim to provide minimal compatibility with the bigmemory package.
To create a descriptor file associated with a mapped file, use
descriptor.file. We illustrate it here on the matrix
B created above.
B
## A mmatrix with 12 rows and 10 cols
## data type:  integer 
## File: /tmp/Rtmppw0bVM/integers120 
## --- excerpt
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    2    0    0    0    0
## [2,]    3    0    0    0    0
## [3,]    4    0    0    0    0
## [4,]    5    0    0    0    0
## [5,]    0    0    0    0    0
dsc <- descriptor.file(B)
## Warning in mk.descriptor.file(object@file, object@dim[1], object@dim[2], : Creating
## a descriptor file for an object stored in tmp directory
Descriptor files can be read with read.descriptor:
D <- read.descriptor(dsc)
D
## A read-only mmatrix with 12 rows and 10 cols
## data type:  integer 
## File: /tmp/Rtmppw0bVM//integers120 
## --- excerpt
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    2    0    0    0    0
## [2,]    3    0    0    0    0
## [3,]    4    0    0    0    0
## [4,]    5    0    0    0    0
## [5,]    0    0    0    0    0
The descriptor files created by houba can be read with the package bigmemory.
We first load the package and read the descriptor file:
library(bigmemory)
desc <- dget(dsc)
We then attach the file:
bm <- attach.big.matrix(desc)
The resulting object maps the same data file:
bm[,1]
##  [1] 2 3 4 5 0 0 0 0 0 0 0 0
Note that although houba can create descriptor files for marrays, these won’t be accepted by bigmemory, which doesn’t handle arrays.
When restoring data from a previous session, pointers to external
objects are broken, making the objects unusable. If the underlying data file
still exists, you can use restore to overcome the
problem.
Here we simulate this behaviour on the matrix B, using
save.image.
B
## A mmatrix with 12 rows and 10 cols
## data type:  integer 
## File: /tmp/Rtmppw0bVM/integers120 
## --- excerpt
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    2    0    0    0    0
## [2,]    3    0    0    0    0
## [3,]    4    0    0    0    0
## [4,]    5    0    0    0    0
## [5,]    0    0    0    0    0
rdata_file <- tempfile(fileext = ".rda")
save.image(rdata_file)
Now we erase B:
rm(B)
And we load the saved image:
load(rdata_file)
B
## A mmatrix with a broken external ptr ! Try using restore()
The pointer in B is broken, but can be restored as follows:
B <- restore(B)
B
## A mmatrix with 12 rows and 10 cols
## data type:  integer 
## File: /tmp/Rtmppw0bVM/integers120 
## --- excerpt
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    2    0    0    0    0
## [2,]    3    0    0    0    0
## [3,]    4    0    0    0    0
## [4,]    5    0    0    0    0
## [5,]    0    0    0    0    0
You can create a copy with copy. This will also create a
new file.
C <- copy(B)
C
## A mmatrix with 12 rows and 10 cols
## data type:  integer 
## File: /tmp/Rtmppw0bVM/mmatrix209eae4d8ad073 
## --- excerpt
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    2    0    0    0    0
## [2,]    3    0    0    0    0
## [3,]    4    0    0    0    0
## [4,]    5    0    0    0    0
## [5,]    0    0    0    0    0
This function has an argument filename. In particular, it can
be used to save data that are stored in a temporary file.
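For instance, data living in a temporary file could be copied to a chosen location as sketched below. The destination path is hypothetical (here still a temporary name, so that the snippet leaves nothing behind), and the houba calls are guarded so that the snippet does nothing when the package is not installed.

```r
# Hypothetical destination; use a path outside tempdir() to really keep the data.
dest <- tempfile("kept_vector_")

if (requireNamespace("houba", quietly = TRUE)) {
  library(houba)
  tmp  <- as.mvector(1:10)            # backed by a temporary file
  kept <- copy(tmp, filename = dest)  # copy backed by the file 'dest'
  print(kept)
}
```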
The dimensions of an object can be accessed through
dim.
a <- matrix(1:12, 3, 4)
A <- as.mmatrix(a)
A
## A mmatrix with 3 rows and 4 cols
## data type:  integer 
## File: /tmp/Rtmppw0bVM/mmatrix209eae3d4271a7 
## --- excerpt
##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12
dim(A)
## [1] 3 4
You can change the dimensions:
dim(A) <- c(4, 3)
A
## A mmatrix with 4 rows and 3 cols
## data type:  integer 
## File: /tmp/Rtmppw0bVM/mmatrix209eae3d4271a7 
## --- excerpt
##      [,1] [,2] [,3]
## [1,]    1    5    9
## [2,]    2    6   10
## [3,]    3    7   11
## [4,]    4    8   12
Setting the dimensions to NULL creates a mvector:
dim(A) <- NULL
A
## A mvector of length 12 
## data type:  integer 
## File: /tmp/Rtmppw0bVM/mmatrix209eae3d4271a7 
## --- excerpt
## [1] 1 2 3 4 5
Similarly, you can obtain an marray:
dim(A) <- c(2,2,3)
A
## A marray with dimensions 2 2 3 
## data type:  integer 
## File: /tmp/Rtmppw0bVM/mmatrix209eae3d4271a7
You can access elements of a memory-mapped object just as you would for regular objects.
Let us create a memory-mapped matrix:
a <- matrix( sample(0:99, 2500, TRUE), 50, 50)
A <- as.mmatrix(a)
Accessing a single element:
A[1,1]
## [1] 1
Accessing a row:
A[1,]
##  [1]  1 73 97 81 34 34 11  4 37 29  9 96 95  3 55 52 48 37  4 48 56 83 79  2 22 95 94
## [28] 81 91 55 58 90 11 88 89 75 40 77 68  8 53 10 70 33 88 19 52 67 98 99
The result here is an R object. This behaviour actually depends on its size! The default is to return an R object if the result’s size is less than one million, and otherwise to return a memory-mapped object.
This can be changed through the option max.size, as
follows:
houba(max.size = 20)
## $max.size
## [1] 20
And now, accessing the first row returns a new memory-mapped object:
A[1,]
## A mmatrix with 1 rows and 50 cols
## data type:  integer 
## File: /tmp/Rtmppw0bVM/mmatrix209eae7743c7a6 
## --- excerpt
## [1]  1 73 97 81 34
Again, you can use R syntax to assign values:
A[1,1] <- 0
A[2,] <- 10
A
## A mmatrix with 50 rows and 50 cols
## data type:  integer 
## File: /tmp/Rtmppw0bVM/mmatrix209eae5f63f5b3 
## --- excerpt
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    0   73   97   81   34
## [2,]   10   10   10   10   10
## [3,]    7   73   44    6    8
## [4,]   66   64   27    7   71
## [5,]   24   58   93   65    2
Assignment with another memory-mapped object is also possible:
V <- as.mvector(1:50, "int")
A[3,] <- V
A
## A mmatrix with 50 rows and 50 cols
## data type:  integer 
## File: /tmp/Rtmppw0bVM/mmatrix209eae5f63f5b3 
## --- excerpt
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    0   73   97   81   34
## [2,]   10   10   10   10   10
## [3,]    1    2    3    4    5
## [4,]   66   64   27    7   71
## [5,]   24   58   93   65    2
There is no type promotion. Assigning a floating point value to an integer object will cast it to integer:
A[1,1] <- pi
A[1,1]
## [1] 3
Arithmetic operations are available with the usual R syntax.
a <- matrix( sample.int(16), 4, 4)
A <- as.mmatrix(a, datatype = "float")
A <- 1 + 2*A
A
## A mmatrix with 4 rows and 4 cols
## data type:  float 
## File: /tmp/Rtmppw0bVM/mmatrix209eae52a8e8c4 
## --- excerpt
##      [,1] [,2] [,3] [,4]
## [1,]   33   19   27    5
## [2,]   31    9   25   23
## [3,]   29   11   21   15
## [4,]   17    3   13    7
Memory-mapped objects can be used for both operands:
B <- A + 2
C <- A / B
C
## A mmatrix with 4 rows and 4 cols
## data type:  float 
## File: /tmp/Rtmppw0bVM/mmatrix209eae7fcdb445 
## --- excerpt
##           [,1]      [,2]      [,3]      [,4]
## [1,] 0.9428571 0.9047619 0.9310345 0.7142857
## [2,] 0.9393939 0.8181818 0.9259259 0.9200000
## [3,] 0.9354839 0.8461539 0.9130435 0.8823529
## [4,] 0.8947368 0.6000000 0.8666667 0.7777778
There is no type promotion. If the two operands have different types, the type of the result is the type of the left operand.
Let’s create two vectors with type float and integer:
A <- as.mvector( seq(0, 1, length = 11), datatype = "float" )
B <- as.mvector( 0:10, datatype = "integer" )
Now A + B has type float:
A + B
## A mvector of length 11 
## data type:  float 
## File: /tmp/Rtmppw0bVM/mmatrix209eae6051408 
## --- excerpt
## [1] 0.0 1.1 2.2 3.3 4.4
and B + A has type integer:
B + A
## A mvector of length 11 
## data type:  integer 
## File: /tmp/Rtmppw0bVM/mmatrix209eae6795eb37 
## --- excerpt
## [1] 0 1 2 3 4
We can modify the data without creating copies:
V <- as.mvector(1:20, "float")
W <- as.mvector(sample.int(20))
inplace.sum(V, 1)          # Add 1 to all elements
inplace.prod(V, W)         # Multiply elements of V by elements of W
inplace.minus(V, c(1,2))   # Subtract c(1,2) from all elements (recycling)
inplace.div(V, 4)          # Divide all elements by 4
inplace.opposite(V)        # Take opposite of all elements
inplace.inverse(V)         # Take reciprocal of all elements
V
## A mvector of length 20 
## data type:  float 
## File: /tmp/Rtmppw0bVM/mmatrix209eae62424780 
## --- excerpt
## [1] -0.10810811 -0.08163265 -0.06779661 -0.30769232 -0.04210526
houba provides analogs to rowSums, rowMeans, colSums, colMeans, and
apply for memory-mapped matrices (but not for memory-mapped arrays).
a <- matrix( sample.int(100), 10, 10)
A <- as.mmatrix(a)
# Row sums and means
rowSums(A)
##  [1] 570 519 545 415 503 541 344 445 598 570
rowMeans(A)
##  [1] 57.0 51.9 54.5 41.5 50.3 54.1 34.4 44.5 59.8 57.0
Here the result is an R object, because its size does not exceed the
value of the option max.size. Otherwise, it will
be a memory-mapped object:
houba(max.size = 5)
## $max.size
## [1] 5
rowSums(A)
## A mvector of length 10 
## data type:  integer 
## File: /tmp/Rtmppw0bVM/mmatrix209eae253fd3be 
## --- excerpt
## [1] 570 519 545 415 503
The apply method extracts rows or columns as R objects.
Again, the type of the result depends on the max.size
option.
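The rule can be summarised by a small sketch. result_kind is a hypothetical helper, not a houba function, and it assumes that a result is memory-mapped exactly when its size is strictly greater than max.size; the text does not say what happens when the two are equal.

```r
# Hypothetical helper (not part of houba): kind of object returned for a
# result of 'size' elements, ASSUMING the rule is size > max.size.
result_kind <- function(size, max.size = 1e6) {
  if (size > max.size) "memory-mapped" else "R object"
}

print(result_kind(50, max.size = 20))  # a row of 50 values with max.size = 20
print(result_kind(10, max.size = 5))   # 10 row sums with max.size = 5
print(result_kind(10))                 # 10 values with the default of one million
```

These three cases correspond to the outcomes shown in this document for A[1,], rowSums(A) and apply(A, 1, sd).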
If the size of the result is larger than max.size, a
memory-mapped object is returned:
houba(max.size = 5)
## $max.size
## [1] 5
apply(A, 1, sd)
## A mvector of length 10 
## data type:  double 
## File: /tmp/Rtmppw0bVM/mmatrix209eae5adedfe8 
## --- excerpt
## [1] 26.91963 29.75623 27.64155 36.13324 22.04566
The data type of this object will be double or
integer, depending on the values returned by the function.
For example, the sum function will return integers:
apply(A, 1, sum)
## A mvector of length 10 
## data type:  integer 
## File: /tmp/Rtmppw0bVM/mmatrix209eae70957650 
## --- excerpt
## [1] 570 519 545 415 503
And if the size of the result is smaller than max.size,
an R object is returned:
houba(max.size = 1e6)
## $max.size
## [1] 1e+06
apply(A, 1, sd)
##  [1] 26.91963 29.75623 27.64155 36.13324 22.04566 30.06456 32.67415 27.80587 31.73081
## [10] 26.43230
You may e-mail the author for bug reports, feature requests, or contributions. The source of the package is on GitHub.
Houba, hop!