The R package {bigstatsr} provides functions for fast statistical
analysis of large-scale data encoded as matrices. The package can handle
matrices that are too large to fit in memory thanks to memory-mapping to
binary files on disk. This is very similar to the format big.matrix
provided by the R package {bigmemory}, which is no longer used by this
package (see the corresponding vignette). As inputs, package {bigstatsr}
uses Filebacked Big Matrices (FBM).
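As a minimal sketch of what working with an FBM looks like (the dimensions and values here are arbitrary, chosen only for illustration), the object can be read and written with matrix-style indexing while the data itself lives in a binary file on disk. When no `backingfile` is given, a temporary file is used.

```r
library(bigstatsr)

# Create a small 10 x 4 FBM of doubles, initialized to 0.
# Without a `backingfile` argument, the data is stored in a temporary file.
X <- FBM(10, 4, init = 0)

# Read and write with the usual matrix-style indexing
X[, 1] <- 1:10
X[1:3, 1:2]

dim(X)

# Path to the binary file that holds the data on disk
X$backingfile
file.exists(X$backingfile)
```

Only the parts of the matrix that are indexed are brought into memory, which is why the examples in this README access data by blocks of columns.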
Note that most of the algorithms of this package don’t handle missing values.
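Because of this, it can be worth checking for missing values before running an analysis. The following is one possible sketch, not a function provided by the package: it counts `NA`s per column, block by block, with `big_apply()`. The toy matrix and the positions of the missing values are made up for illustration.

```r
library(bigstatsr)

# Toy FBM with a few missing values (positions chosen arbitrarily)
X <- FBM(20, 6, init = 1)
X[2, 3] <- NA_real_
X[5, 3] <- NA_real_
X[7, 6] <- NA_real_

# Count missing values per column, one block of columns at a time.
# `drop = FALSE` keeps a matrix even if a block has a single column.
na_per_col <- big_apply(X, a.FUN = function(X, ind) {
  colSums(is.na(X[, ind, drop = FALSE]))
}, a.combine = 'c')

na_per_col
any(na_per_col > 0)  # TRUE means some columns need imputation or removal
```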
# For the CRAN version
install.packages("bigstatsr")
# For the latest version
remotes::install_github("privefl/bigstatsr")
library(bigstatsr)
# Create the data on disk
X <- FBM(5e3, 10e3, backingfile = "test")$save()
# If you open a new session you can do
X <- big_attach("test.rds")

# Fill it by chunks with random values
U <- matrix(0, nrow(X), 5); U[] <- rnorm(length(U))
V <- matrix(0, ncol(X), 5); V[] <- rnorm(length(V))
NCORES <- nb_cores()
# X = U V^T + E
big_apply(X, a.FUN = function(X, ind, U, V) {
  X[, ind] <- tcrossprod(U, V[ind, ]) + rnorm(nrow(X) * length(ind))
  NULL  ## you don't want to return anything here
}, a.combine = 'c', ncores = NCORES, U = U, V = V)
# Check some values
X[1:5, 1:5]

# Compute first 10 PCs
obj.svd <- big_randomSVD(X, fun.scaling = big_scale(),
                         k = 10, ncores = NCORES)
plot(obj.svd)
# Cleanup
unlink(paste0("test", c(".bk", ".rds")))
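To show what can be done with the result of `big_randomSVD()`, here is a small self-contained sketch on a toy matrix (the 100 x 50 size, the seed, and `k = 3` are arbitrary choices for illustration). It inspects the components of the returned object and extracts PC scores with `predict()`.

```r
library(bigstatsr)

set.seed(1)
# Small toy FBM filled with random values (temporary backing file)
X <- FBM(100, 50, init = rnorm(100 * 50))

# Partial SVD of the scaled matrix, keeping the first 3 components
obj.svd <- big_randomSVD(X, fun.scaling = big_scale(), k = 3)

obj.svd$d        # singular values
dim(obj.svd$u)   # left singular vectors, one row per row of X
dim(obj.svd$v)   # right singular vectors (loadings), one row per column of X

# PC scores for the rows of X
scores <- predict(obj.svd)
dim(scores)
```

The scores have one row per row of `X` and one column per component, so they can be passed directly to plotting or downstream modelling code.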
Learn more with this introduction to package {bigstatsr}.
If you want to use Rcpp code, look at this tutorial.
Package {bigstatsr} uses package {foreach} for its parallelization tasks. Learn more about parallelism with {foreach} in this tutorial.
Computing the null space of a big matrix (works if one dimension is not too large)
How to make a great R reproducible example?
Please open an issue if you find a bug.
If you want help using {bigstatsr}, please open an issue as well or post on Stack Overflow with the tag bigstatsr.
I will always redirect you to GitHub issues if you email me, so that others can benefit from our discussion.
Privé, Florian, et al. “Efficient analysis of large-scale genome-wide data with two R packages: bigstatsr and bigsnpr.” Bioinformatics 34.16 (2018): 2781-2787.
Privé, Florian, Hugues Aschard, and Michael GB Blum. “Efficient implementation of penalized regression for genetic risk prediction.” Genetics 212.1 (2019): 65-74.