1 Overview

This package implements several matrix operations on Matrix and DelayedMatrix objects as well as on HDF5 data files. It also provides basic algebra operations that are useful for implementing statistical analyses based on standard methodologies such as principal component analysis (PCA) or least squares estimation. The package also contains specific statistical methods mainly used in omic data analysis, such as lasso regression. All procedures for working with objects loaded in memory can be found in the BigDataStatMeth_memory vignette.

2 Prerequisites

The package requires other packages from CRAN and Bioconductor to be installed.

  • CRAN: Matrix, RcppEigen and RSpectra.
  • Bioconductor: HDF5Array, rhdf5.

Dealing with DelayedArray objects additionally requires beachmat and DelayedArray. The user can execute the following code to install the required packages:

# Install BiocManager (if not previously installed)
install.packages("BiocManager") 

# Install required packages
BiocManager::install(c("Matrix", "RcppEigen", "RSpectra",
                       "beachmat", "DelayedArray",
                       "HDF5Array", "rhdf5"))

Our package needs to be installed from source code. In that case, a collection of build tools (C, C++ and Fortran compilers, among others) is required, mainly for Windows users. These programs can be installed using Rtools.

3 Install package

Once the required packages and Rtools are installed, the BigDataStatMeth package can be installed from our GitHub repository as follows:

# Install devtools and load library (if not previously installed)
install.packages("devtools") 
library(devtools)

# Install BigDataStatMeth 
install_github("isglobal-brge/BigDataStatMeth")

4 Getting started

First, let us start by loading the required packages to describe the main capabilities of the package:

library(rhdf5)
library(BigDataStatMeth)

5 Previous knowledge

5.1 HDF5 data files

Hierarchical Data Format (HDF) is a set of file formats designed to store and organize large amounts of data. It is supported by The HDF Group, a non-profit corporation whose mission is to ensure continued development of HDF5 technologies and the continued accessibility of data stored in HDF.

HDF5 is a technology suite that makes it possible to manage extremely large and complex data collections. It can accommodate virtually every kind of data in a single file: sequences, images, SNP matrices, and any other type of data and metadata associated with an experiment.

There is no limit on the number or size of data objects in the collection, giving great flexibility for omic data. It provides high-performance I/O with a rich set of integrated features that allow optimizing access time and storage space.

The HDF5 file structure includes only two major types of objects:

  • Datasets, which are multidimensional arrays of a homogeneous type. For example, datasets for omics data could correspond to genomics, transcriptomics, epigenomics, proteomics and/or metabolomics experiments.

  • Groups, which are container structures that can hold datasets and other groups.

This results in a truly hierarchical, filesystem-like data format.
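In rhdf5 terms, the two object types can be illustrated with a small self-contained sketch (using a temporary file and hypothetical names):

```r
library(rhdf5)

# Temporary file just for this illustration
f <- tempfile(fileext = ".h5")
h5createFile(f)

# A group is a container...
h5createGroup(f, "experiment1")

# ...and a dataset is a homogeneous multidimensional array inside it
h5write(matrix(1:6, nrow = 2), f, "experiment1/counts")

# The listing shows the filesystem-like hierarchy
h5ls(f)
h5closeAll()
```

The `experiment1/counts` path works exactly like a filesystem path: the group plays the role of a directory, the dataset that of a file.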

Figure 1: HDF5 hierarchical structure

5.2 Basics in rhdf5

Create hdf5 file

We have implemented the bdCreate_hdf5_matrix_file() function to create an hdf5 file with a group and a dataset in a single step. This function allows creating datasets from DelayedMatrix objects (using the beachmat library) as well as from standard R matrices and Matrix objects.

library(rhdf5)

set.seed(5234)
n <- 500
m <- 600
A <- matrix(rnorm(n*m,mean=0,sd=1), n,m)

# We can also create a dataset from an R matrix object
bdCreate_hdf5_matrix_file(filename = "robject.hdf5", 
                          object = A, 
                          group = "INPUT", 
                          dataset = "A")

File and dataset has been created

We see 0 in the console, indicating that no errors were found when creating the hdf5 file. Notice that a file called “robject.hdf5” will be created in the working directory.

dir()
 [1] "BigDataStatMeth_hdf5.R"     "BigDataStatMeth_hdf5.Rmd"  
 [3] "BigDataStatMeth_hdf5_cache" "BigDataStatMeth_memory.Rmd"
 [5] "colesterol.csv"             "colesterol_new.csv"        
 [7] "img_applications"           "imgs"                      
 [9] "references.bib"             "references_application.bib"
[11] "robject.hdf5"              

Add datasets in hdf5 file

The function bdAdd_hdf5_matrix() adds a dataset to an existing hdf5 file. The dataset can be created in any group; if the group does not exist in the file, it is created before appending the dataset. The object can be either an R object or a DelayedArray object.

set.seed(5234)
n <- 100
m <- 10000
A <- matrix(rnorm(n*m,mean=0,sd=1), n, m)

set.seed(5234)
n <- 50
m <- 12000
B <- matrix(rnorm(n*m,mean=3,sd=0.5), n, m)

# We create another data file (delayed.hdf5) with a matrix A.
# The group is called INPUT
bdCreate_hdf5_matrix_file(filename = "delayed.hdf5", 
                        object = A, 
                        group = "INPUT", 
                        dataset = "A")

File and dataset has been created

# And then, we add another matrix B to the same group
bdAdd_hdf5_matrix(object = B, 
                filename = "delayed.hdf5", 
                group = "INPUT", 
                dataset = "B")

Dataset has been added to the file

Create transposed matrix in hdf5 file

The function bdAdd_hdf5_matrix() also allows creating a dataset with a transposed matrix through the parameter transp = TRUE. For example, we can add a dataset with the transposed B matrix to the hdf5 file with:

# Create dataset BTransposed with B transposed matrix in delayed.hdf5 file at INPUT group
bdAdd_hdf5_matrix(B, "delayed.hdf5", "INPUT", "BTransposed", transp = TRUE);

Dataset has been added to the file

Open and get hdf5 content file

We can open an existing file, show its contents and access the data using functions from the rhdf5 package, an R interface to HDF5. The file must always be opened before working with it.

With the h5ls() function we can list the contents of an hdf5 file without the need to open it.

# Examine hierarchy before opening the file
h5ls("delayed.hdf5")
   group        name       otype dclass         dim
0      /       INPUT   H5I_GROUP                   
1 /INPUT           A H5I_DATASET  FLOAT 100 x 10000
2 /INPUT           B H5I_DATASET  FLOAT  50 x 12000
3 /INPUT BTransposed H5I_DATASET  FLOAT  12000 x 50

Once the file is opened, its contents can be shown by just typing its name. In that case, we only get the current level of the hierarchical tree.

# Open file
h5fdelay <- H5Fopen("delayed.hdf5")
# Show hdf5 hierarchy (groups)
h5fdelay
HDF5 FILE 
        name /
    filename 

   name     otype dclass dim
0 INPUT H5I_GROUP           

NOTE: We can also browse the file with HDFView.

Access datasets data

The $ operator can be used to access the next group level; this operator reads the object from disk. We can assign the dataset contents to an R object in order to work with it.

Bdata <- h5fdelay$INPUT$B
Bdata[1:3,1:5]
         [,1]     [,2]     [,3]     [,4]     [,5]
[1,] 3.562285 2.699083 3.842990 3.400242 3.412452
[2,] 2.702681 2.313405 2.747726 3.044964 3.035956
[3,] 3.057644 3.251765 3.260262 3.322551 3.739247

Or we can get the data for the transposed version of this matrix created before

BdataTrans <- h5fdelay$INPUT$BTransposed
BdataTrans[1:5,1:3]
         [,1]     [,2]     [,3]
[1,] 3.562285 2.702681 3.057644
[2,] 2.699083 2.313405 3.251765
[3,] 3.842990 2.747726 3.260262
[4,] 3.400242 3.044964 3.322551
[5,] 3.412452 3.035956 3.739247

IMPORTANT NOTE: This is a very interesting feature. Imagine you want to compute the mean of each column in the matrix A. If you want to do this in R, you first need to load the file and then use any of the existing R functions. If A is big, loading the whole matrix into R can be inefficient, since you actually only need the information in one column at a time:

out <- sapply(1:100, function(i){
  mean(h5fdelay$INPUT$A[,i])
  })
head(out)
[1] -0.05107500  0.25916486  0.14013419  0.09320278 -0.13489573  0.10595091

There are efficient ways of reading a subset of data without the need to read the entire file; see the rhdf5 practical tips for details.
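For example, rhdf5's h5read() accepts an index argument so that only the requested block is read from disk. A self-contained sketch using a small temporary file (the file and dataset names are hypothetical):

```r
library(rhdf5)

# Create a small 4 x 5 dataset in a temporary file
f <- tempfile(fileext = ".h5")
h5createFile(f)
h5write(matrix(1:20, nrow = 4), f, "A")

# Read only column 2 (NULL selects all rows); the rest of the
# dataset is never loaded into memory
col2 <- h5read(f, "A", index = list(NULL, 2))

# Read a 2 x 2 block from the top-left corner
blk <- h5read(f, "A", index = list(1:2, 1:2))
h5closeAll()
```

On a large on-disk matrix, this kind of indexed read is what makes column-by-column processing cheap.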

Close hdf5 file

After working with hdf5 files, the user must always close them. We can close a single file with the H5Fclose() function, or close all opened hdf5 files with the h5closeAll() function.

# Close delayed.hdf5 file
H5Fclose(h5fdelay)

# Open 2 files and close all
h5fdelay <- H5Fopen("delayed.hdf5")
h5fr <- H5Fopen("robject.hdf5")

h5closeAll()

More information about working with hdf5 files in R can be found in the rhdf5 package vignettes.

6 Convert text file to HDF5

Usually we have files with a large amount of data to be loaded into R. For big data problems, R cannot hold all these data in memory, so we cannot work with them efficiently. To solve this issue, you can work directly from disk with datasets stored in an HDF5 data file.

You can easily convert your text files to HDF5 using the function bdImport_text_to_hdf5(). To do that, we only have to define the input file with the filename parameter and the HDF5 destination with the parameters outputfile for the output file name, outGroup for the group in which to store the dataset, and outDataset for the dataset name. It should be considered that:

  • This function only allows importing numeric data, except for row names and column names, where characters are allowed.

  • Data can be delimited by different characters. By default we use tabs (“\t”), but this can easily be changed with the sep parameter.

In this example we convert the “colesterol.csv” file and save the data to a file ‘colesterol_file.hdf5’ under the group ‘COLESTEROL’ in the dataset ‘COLESTEROLDATA’. The text file contains column names, so we set the parameter header = TRUE.

bdImport_text_to_hdf5(filename = "colesterol.csv", 
                      sep=',', 
                      outputfile = "colesterol_file.hdf5", 
                      outGroup = "COLESTEROL", 
                      outDataset = "COLESTEROLDATA", 
                      header = TRUE)

COLESTEROLDATA dataset has been imported successfully.

We can now inspect the new file colesterol_file.hdf5 and its contents:


h5ls("colesterol_file.hdf5")
                                 group                     name       otype
0                                    /               COLESTEROL   H5I_GROUP
1                          /COLESTEROL .COLESTEROLDATA_dimnames   H5I_GROUP
2 /COLESTEROL/.COLESTEROLDATA_dimnames                        2 H5I_DATASET
3                          /COLESTEROL           COLESTEROLDATA H5I_DATASET
    dclass       dim
0                   
1                   
2 COMPOUND        10
3    FLOAT 1000 x 10

# We can open the file and have access to the data
h5Colesterol <- H5Fopen("colesterol_file.hdf5")

# Show hdf5 content dataset
h5Colesterol$COLESTEROL$COLESTEROLDATA[1:5, 1:6]
         [,1]     [,2]     [,3]      [,4]     [,5]     [,6]
[1,] 223.7348 55.25039 14.90246 0.9026095 10.22067 1.450970
[2,] 248.1820 53.47404 25.12592 0.8710345 17.10225 1.002655
[3,] 180.2071 58.54268 12.95197 0.8882435 11.21530 1.073924
[4,] 200.1522 62.58284 28.27819 0.9361357 12.79399 1.129373
[5,] 234.3281 68.70254 13.21404 0.7775238 11.39689 1.235655

# Show colnames 
head(h5Colesterol$COLESTEROL$.COLESTEROLDATA_dimnames$`2`)
             chr
1 "TCholesterol"
2          "Age"
3      "Insulin"
4   "Creatinine"
5          "BUN"
6         "LLDR"

H5Fclose(h5Colesterol)

NOTE: We can overwrite an existing hdf5 file with new data by setting overwrite = TRUE.

7 Operations with HDF5 files

Once our data are available as hdf5 files, we can operate directly on them without loading the data into memory (i.e., we access them from disk). In BigDataStatMeth we have implemented most of the common matrix operations and algebra functions to help users who are not familiar with this type of implementation. The next sections provide several examples.

7.1 HDF5 Matrix Multiplication

In this section, different products of matrices and vectors are introduced. The methods implement different strategies including block multiplication algorithms and the use of parallel implementations with hdf5 files datasets.

BigDataStatMeth allows performing an hdf5 matrix multiplication with:

  • two in-memory matrices, previously saving the data to a file
  • two matrices inside an hdf5 file.

7.1.1 Hdf5 Matrix multiplication from data in memory

To work with big matrices, bdblockmult() saves the matrices in hdf5 file format in order to operate on them in blocks without overloading memory. By default, matrices are considered large if the number of rows or columns is greater than 5000, but this threshold can be changed with the bigmatrix argument.

In that case, the function bdblockmult() creates, by default, the tmp_blockmult.hdf5 file with two groups: the INPUT group with datasets A and B containing the original matrices, and the OUTPUT group with dataset C containing the results. The file name can be set with the outputfile parameter.

n <- 1000
p <- 10000
Abig <- matrix(rnorm(n*n), nrow=n, ncol=n)
Bbig <- matrix(rnorm(n*p), nrow=n, ncol=p)


# We want to force it to run in memory
AxBNOBig <- bdblockmult(Abig, Bbig, bigmatrix = 100000)

# We consider a big matrix if the number of rows or columns is > 500
AxBBig3000 <- bdblockmult(Abig, Bbig, bigmatrix = 500)

Depending on whether the calculation has been performed directly in memory or from an hdf5 file, the returned object type is different.

If we work in memory, results are returned as a standard R matrix object:

class(AxBNOBig)
[1] "matrix" "array" 
AxBNOBig[1:5,1:5]
          [,1]      [,2]       [,3]       [,4]      [,5]
[1,] -20.57999 -20.01243 -32.953955  -8.575015 -37.36579
[2,]  34.33171  69.94140  35.517190 -34.578343   5.33914
[3,]  -6.24148 -18.50118  13.384615  29.195395  20.34906
[4,] -30.25641  21.88857   6.350232  47.716476 -15.44486
[5,] -25.54031 -28.25632   8.399138  -3.773078 -14.57807

If we work on disk, the function returns a list with the file name and the dataset that contains the results inside the hdf5 file. We can then open the hdf5 file and get the result matrix as well as the original matrices.

To work with hdf5 data objects in R we can use the rhdf5 library from Bioconductor; we can open the file and read its contents:

h5data <- H5Fopen(AxBBig3000$file)

# We can get where data is stored
AxBBig3000$dataset
[1] "OUTPUT//C"

# We observe that the data is in group OUTPUT, dataset C
reshdf5 <- h5data$OUTPUT$C

reshdf5[1:5,1:5]
          [,1]      [,2]       [,3]       [,4]      [,5]
[1,] -20.57999 -20.01243 -32.953955  -8.575015 -37.36579
[2,]  34.33171  69.94140  35.517190 -34.578343   5.33914
[3,]  -6.24148 -18.50118  13.384615  29.195395  20.34906
[4,] -30.25641  21.88857   6.350232  47.716476 -15.44486
[5,] -25.54031 -28.25632   8.399138  -3.773078 -14.57807

all.equal(reshdf5, AxBNOBig)
[1] TRUE

Remember that it is important to close all dataset, group, and file handles when they are no longer used:

# Close file
H5Fclose(h5data)

7.1.2 Hdf5 Matrix multiplication from data stored in a file

If we have the matrices stored in an hdf5 data file, we can use the function bdblockmult_hdf5(). This function performs a matrix multiplication of two matrices stored in a file.

To use bdblockmult_hdf5(), we have to set the parameter filename with the file that contains the matrices, group to indicate under which group the matrices are stored, and the dataset names with the parameters a and b. The results group can be set with the outgroup parameter; by default, the output group is OUTPUT.

To show how this function works, we will use the matrices previously created with the bdblockmult function, and we will store the results under the ‘HDF5_RES’ group:

res <- bdblockmult_hdf5(filename = 'tmp_blockmult.hdf5', group="INPUT",
                        a="A", b="B", outgroup = 'HDF5_RES')

# We show the hdf5 content
 h5ls(res$file)
      group     name       otype dclass          dim
0         / HDF5_RES   H5I_GROUP                    
1 /HDF5_RES    A_x_B H5I_DATASET  FLOAT 1000 x 10000
2         /    INPUT   H5I_GROUP                    
3    /INPUT        A H5I_DATASET  FLOAT  1000 x 1000
4    /INPUT        B H5I_DATASET  FLOAT 1000 x 10000
5         /   OUTPUT   H5I_GROUP                    
6   /OUTPUT        C H5I_DATASET  FLOAT 1000 x 10000

Now, we compare the results obtained with the bdblockmult_hdf5 and bdblockmult functions.

# Get content
h5data <- H5Fopen(res$file)

# We can get where data is stored
res$dataset
[1] "HDF5_RES/A_x_B"

# We get the results with bdblockmult (previous step)
resblockmult <- h5data$OUTPUT$C

# and the results obtained from bdblockmult_hdf5
resblockmult_hdf5 <- h5data$HDF5_RES$A_x_B
resblockmult_hdf5[1:5,1:5]
          [,1]      [,2]       [,3]       [,4]      [,5]
[1,] -20.57999 -20.01243 -32.953955  -8.575015 -37.36579
[2,]  34.33171  69.94140  35.517190 -34.578343   5.33914
[3,]  -6.24148 -18.50118  13.384615  29.195395  20.34906
[4,] -30.25641  21.88857   6.350232  47.716476 -15.44486
[5,] -25.54031 -28.25632   8.399138  -3.773078 -14.57807

# Close the file
h5closeAll()

all.equal(resblockmult, resblockmult_hdf5)
[1] TRUE

We obtain the same result using bdblockmult and bdblockmult_hdf5.

7.2 Cross-product and Transposed Cross-product with hdf5 files

To perform a cross-product \(C = A^t A\) you can call the bdCrossprod_hdf5() function. We just have to set the parameters filename, group and A with the file, group and dataset that we want to use, and optionally the outgroup parameter to set the output group where we want to store the results.

The result dataset is stored following the CrossProd_* naming pattern.

res <- bdCrossprod_hdf5(filename = 'robject.hdf5', 
                        group="INPUT", 
                        A="A", 
                        outgroup = "RESULTS")

We obtain the expected values as computed with the crossprod() function.


# Get content
h5data <- H5Fopen(res$file)

# We can get where data is stored
res$dataset
[1] "RESULTS//CrossProd_A"

# We get the Crossprod Results and the original matrix
resCProd <- h5data$RESULTS$CrossProd_AxA
A <- h5data$INPUT$A

# Close the file
h5closeAll()

# Show results
resCProd[1:5,1:5]
            [,1]       [,2]        [,3]      [,4]       [,5]
[1,] 505.6616915  10.611020  -0.3666995  13.24150   8.001664
[2,]  10.6110197 489.229418  -9.4790008 -10.69272 -22.537412
[3,]  -0.3666995  -9.479001 497.0000736 -16.47190  -1.484056
[4,]  13.2415011 -10.692717 -16.4719016 583.94688  -4.379890
[5,]   8.0016642 -22.537412  -1.4840559  -4.37989 525.367876

all.equal(resCProd, crossprod(A))
[1] TRUE

You may also compute the cross-product of two different matrices, \(C = A^t B\):


set.seed(5234)
n <- 500
m <- 600
B <- matrix(rnorm(n*m,mean=0,sd=1), n,m)

bdAdd_hdf5_matrix(B, filename = 'robject.hdf5', group="INPUT", dataset = "B2")

Dataset has been added to the file

# Get Crossprod with two matrix
res <- bdCrossprod_hdf5(filename = 'robject.hdf5', group="INPUT", A="A", 
                        groupB = "INPUT", B = "B2", outgroup = "RESULTS")

We obtain the expected values as computed with the crossprod() function.


# Get content
h5data <- H5Fopen(res$file)

# We can get where data is stored
res$dataset
[1] "RESULTS//CrossProd_AxB2"

# We get the Crossprod Results and the original matrix
resCProd <- h5data$RESULTS$CrossProd_AxB2
A <- h5data$INPUT$A
B <- h5data$INPUT$B2

# Close the file
h5closeAll()

# Show results
resCProd[1:5,1:5]
            [,1]       [,2]        [,3]      [,4]       [,5]
[1,] 505.6616915  10.611020  -0.3666995  13.24150   8.001664
[2,]  10.6110197 489.229418  -9.4790008 -10.69272 -22.537412
[3,]  -0.3666995  -9.479001 497.0000736 -16.47190  -1.484056
[4,]  13.2415011 -10.692717 -16.4719016 583.94688  -4.379890
[5,]   8.0016642 -22.537412  -1.4840559  -4.37989 525.367876

all.equal(resCProd, crossprod(A,B))
[1] TRUE

You may also use the bdtCrossprod_hdf5() function to get a transposed cross-product \(C = A A^t\). In that case the result dataset is stored following the tCrossProd_* naming pattern.

res <- bdtCrossprod_hdf5(filename = 'robject.hdf5', group="INPUT", A="A", outgroup = "RESULTS")

We obtain the expected values as computed with the tcrossprod() function.


# Get content
h5data <- H5Fopen(res$file)

# We can get where data is stored
res$dataset
[1] "RESULTS//tCrossProd_A"

# We get the Crossprod Results and the original matrix
restCProd <- h5data$RESULTS$tCrossProd_AxA
A <- h5data$INPUT$A

# Close the file
h5closeAll()

# Show results
restCProd[1:5,1:5]
          [,1]      [,2]      [,3]      [,4]      [,5]
[1,] 578.58545  17.23237 -13.95637 -27.63535  56.70321
[2,]  17.23237 603.99446 -43.22671  29.50159 -38.76707
[3,] -13.95637 -43.22671 558.64887  19.51781  21.23500
[4,] -27.63535  29.50159  19.51781 645.86835 -40.12507
[5,]  56.70321 -38.76707  21.23500 -40.12507 542.25119

all.equal(restCProd, tcrossprod(A))
[1] TRUE

You may also compute the transposed cross-product of two different matrices, \(C = A B^t\):


set.seed(5234)
n <- 500
m <- 600
B <- matrix(rnorm(n*m,mean=0,sd=1), n,m)

bdAdd_hdf5_matrix(B, filename = 'robject.hdf5', group="INPUT", dataset = "B3")

Dataset has been added to the file

# Get Crossprod with two matrix
res <- bdtCrossprod_hdf5(filename = 'robject.hdf5', group="INPUT", A="A", 
                         groupB = "INPUT", B = "B3", outgroup = "RESULTS")

We obtain the expected values as computed with the tcrossprod() function.


# Get content
h5data <- H5Fopen(res$file)

# We can get where data is stored
res$dataset
[1] "RESULTS//tCrossProd_AxB3"

# We get the Crossprod Results and the original matrix
restCProd <- h5data$RESULTS$tCrossProd_AxB3
A <- h5data$INPUT$A
B <- h5data$INPUT$B3

# Close the file
h5closeAll()

# Show results
restCProd[1:5,1:5]
          [,1]      [,2]      [,3]      [,4]      [,5]
[1,] 578.58545  17.23237 -13.95637 -27.63535  56.70321
[2,]  17.23237 603.99446 -43.22671  29.50159 -38.76707
[3,] -13.95637 -43.22671 558.64887  19.51781  21.23500
[4,] -27.63535  29.50159  19.51781 645.86835 -40.12507
[5,]  56.70321 -38.76707  21.23500 -40.12507 542.25119

all.equal(restCProd, tcrossprod(A,B))
[1] TRUE

7.3 Singular Value Decomposition (SVD)

The SVD of an \(m \times n\) real or complex matrix \(A\) is a factorization of the form:

\[A = U\Sigma V^{T}\]

where:

  • \(U\) is a \(m \times m\) real or complex unitary matrix
  • \(\Sigma\) is a \(m \times n\) rectangular diagonal matrix with non-negative real numbers on the diagonal
  • \(V\) is a \(n \times n\) real or complex unitary matrix.

Notice that:

  • The diagonal entries \(\sigma_i\) of \(\Sigma\) are known as the singular values of \(A\).
  • The columns of \(U\) are called the left-singular vectors of \(A\).
  • The columns of \(V\) are called the right-singular vectors of \(A\).
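The factorization above can be verified numerically with base R's svd(), which returns the components u, d and v:

```r
set.seed(1)
A <- matrix(rnorm(12), nrow = 4, ncol = 3)

s <- svd(A)

# Reconstruct A from its factors: A = U diag(d) t(V)
A_rec <- s$u %*% diag(s$d) %*% t(s$v)
all.equal(A, A_rec)   # TRUE (up to numerical tolerance)

# Singular values are non-negative and in decreasing order
all(s$d >= 0) && all(diff(s$d) <= 0)   # TRUE
```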

7.3.1 Block Singular Values Decomposition

This method was developed by M. A. Iwen and B. W. Ong (2016). The authors introduced a distributed and incremental SVD algorithm that is useful for agglomerative data analysis on large networks. The algorithm calculates the singular values and left singular vectors of a matrix A by first partitioning it by columns. This creates a set of submatrices of A with the same number of rows but only some of its columns. After that, the SVD of each submatrix is computed. The final step consists of merging the results and computing the SVD of the resulting matrix.

Figure 2: Flowchart for a two-level hierarchical Block SVD algorithm

This method is implemented in the bdSVD_hdf5() function, which works directly on the hdf5 data format, loading into memory only the data needed to perform calculations and saving the results back to the hdf5 file for later use.

We have to indicate the file to work with, the dataset name and the group where the dataset is located:

# Create dataframe data with 'odata' matrix in delayed hdf5 file at OMIC group
set.seed(5234)
n <- 150000
m <- 50
odata <- matrix(rnorm(n*m, mean=0, sd=1), n,m)

bdAdd_hdf5_matrix(odata, "delayed.hdf5", "OMICS", "data")

Dataset has been added to the file

# Direct from hdf5 data file
svdh5 <- bdSVD_hdf5("delayed.hdf5", "OMICS", "data")

# with R implementation from data in memory
test <- H5Fopen("delayed.hdf5")
# get results svd (d)
svd_hdf5_d <- test$SVD$data$d[1:7]
# get data
omdata <- test$OMICS$data
h5closeAll()

# Results in hdf5 file for d
svd_hdf5_d[1:7]
[1] 393.7274 393.3720 393.0192 392.6845 392.4629 391.7279 391.5869

svd <- svd(scale(omdata))
svd$d[1:7]
[1] 393.7274 393.3720 393.0192 392.6845 392.4629 391.7279 391.5869

As in the simple singular value decomposition, we can center or scale the data before the SVD decomposition with the bcenter and bscale parameters. By default both parameters are TRUE, so the data are normalized before decomposition. To proceed with SVD without normalization:

# Direct from hdf5 data file
svdh5 <- bdSVD_hdf5("delayed.hdf5", "OMICS", "data", 
                    bcenter = FALSE, bscale = FALSE)
SVD - old dataset have been REMOVED 

# get results svd (d)
test <- H5Fopen("delayed.hdf5")
svd_hdf5_d <- test$SVD$data$d[1:7] 
h5closeAll()

# SVD (d) from file - data not normalized
svd_hdf5_d
[1] 393.8092 393.3840 393.1712 392.8305 392.7400 391.7171 391.6472

# with R implementation from data in memory
svd <- svd(omdata)
svd$d[1:7]
[1] 393.8092 393.3840 393.1712 392.8305 392.7400 391.7171 391.6472

In the SVD decomposition by blocks we can indicate the number of decomposition levels and the number of local SVDs to concatenate at each level with the parameters q and k respectively; by default q = 1 (one level) and k = 2.

# Block decomposition with 1 level and 4 local SVDs at each level
svdh5 <- bdSVD_hdf5("delayed.hdf5", "OMICS", "data", q=1, k=4)
SVD - old dataset have been REMOVED 

# get results svd (d)
fprova <- H5Fopen("delayed.hdf5")
  svd_hdf5_d <- fprova$SVD$data$d[1:7]
h5closeAll()

# SVD (d) from file - normalized data (default)
svd_hdf5_d
[1] 393.7274 393.3720 393.0192 392.6845 392.4629 391.7279 391.5869

# with R implementation from data in memory
svd <- svd(scale(omdata))
svd$d[1:7]
[1] 393.7274 393.3720 393.0192 392.6845 392.4629 391.7279 391.5869

8 Utilities for data analysis working directly on hdf5 files with genomic data

We have also implemented a set of functions to deal with genomic data, a setting where big data is normally present. These are described in the next sections.

8.1 Genomic data imputation

Imputation in genetics refers to the statistical inference of unobserved genotypes. In genetic epidemiology and quantitative genetics, researchers aim to identify genomic locations where variation between individuals is associated with variation in traits of interest.

At the moment, BigDataStatMeth implements SNP imputation following the encoding used in the Genomic Data Structure (GDS) file format, where SNPs are coded as 0: two B alleles, 1: one A allele and one B allele, 2: two A alleles, and 3: missing data. BigDataStatMeth imputes values where SNPs are coded ‘3’ (missing data).

The implemented imputation method uses the information of the SNP that contains the missing data: a random value is generated following the observed discrete distribution. If in the SNP we find 70% of ‘0’, 25% of ‘1’ and 5% of ‘2’, a missing value will be imputed as ‘0’ with probability 70%, as ‘1’ with probability 25% and as ‘2’ with probability 5%.
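This sampling scheme can be sketched in plain R (an illustrative toy example, not the package's internal implementation):

```r
set.seed(1)

# One SNP across ten samples, GDS-coded; '3' marks missing values
snp <- c(0, 0, 0, 1, 1, 2, 3, 0, 3, 1)

# Empirical genotype frequencies among the observed values
observed <- snp[snp != 3]
probs <- table(factor(observed, levels = 0:2)) / length(observed)

# Draw each missing value from the observed discrete distribution
snp[snp == 3] <- sample(0:2, sum(snp == 3), replace = TRUE, prob = probs)
snp
```

After imputation, every entry lies in 0:2 and the genotype frequencies of the SNP are approximately preserved.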

We first simulate a genotype matrix with values 0, 1, 2 and 3, store it in the hdf5 file, and show the stored data:

set.seed(108432)
geno.sim <- matrix(sample(0:3, 10000, replace = TRUE), byrow = TRUE, ncol = 10)
bdAdd_hdf5_matrix(geno.sim, "delayed.hdf5", "OMICS", "geno")

Dataset has been added to the file

# Get data and show the first 5 rows
h5fsvd = H5Fopen("delayed.hdf5")
geno <- h5fsvd$OMICS$geno
h5closeAll()

geno[1:5,1:10]
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    0    2    0    2    3    0    2    3    1     0
[2,]    2    3    0    0    0    1    1    3    3     1
[3,]    1    2    1    1    1    0    0    3    1     2
[4,]    3    2    2    1    3    1    2    2    1     2
[5,]    3    1    2    3    3    1    0    3    1     1

Remember, we always have to close the hdf5 data file. Now we impute values where genotype = ‘3’ (missing data in GDS format). To apply the imputation we only have to indicate the file where the dataset is, the group and the dataset name. Optionally, we can indicate where to store the results with outgroup and outdataset; by default the results are stored in the input dataset. An important parameter is bycols: it indicates whether we want to impute data by columns (bycols = TRUE) or by rows (bycols = FALSE); the default implemented method imputes by columns.

In the next example we perform the imputation into the “imputedgeno” dataset inside the group “OMICS”, and the imputation method takes into account the probabilities in columns (the default).

bdImpute_snps_hdf5("delayed.hdf5", group="OMICS", dataset="geno",
                outgroup="OMICS", outdataset="imputedgeno")
SNPs with missing values has been imputed

Now we get the imputation results

# Get imputed data and show the first 5 rows
h5fsvd = H5Fopen("delayed.hdf5")
imputedgeno <- h5fsvd$OMICS$imputedgeno
h5closeAll()

imputedgeno[1:5,1:10]
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    0    2    0    2    0    0    2    2    1     0
[2,]    2    1    0    0    0    1    1    2    1     1
[3,]    1    2    1    1    1    0    0    0    1     2
[4,]    2    2    2    1    2    1    2    2    1     2
[5,]    0    1    2    1    0    1    0    1    1     1

8.2 Normalization (center and scale) of genomic data

Normalization is one of the most important procedures in genomics data analysis. A typical dataset contains more than one sample, and we are almost always interested in making comparisons between them; to make such comparisons we need to normalize the data.

In BigDataStatMeth we implemented a normalization method that works directly with datasets stored in hdf5 file format. As with other functions in BigDataStatMeth that work with hdf5 data files, we have to indicate where the data is: the file name, the group containing the dataset, and the dataset name. In the normalization function, we can indicate whether we want to center the data, scale it, or both (the default). The formula applied by default to normalize the data is: \[ \frac{X - \mu}{\sigma}\]

The function implemented to normalize data directly from an hdf5 data file is bdNormalize_hdf5().
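For comparison, the same default formula applied to an in-memory matrix corresponds to base R's scale(); the following check is illustrative only and does not use the hdf5 implementation:

```r
set.seed(1)
X <- matrix(rnorm(20, mean = 5, sd = 2), nrow = 5)

# Center and scale each column: (X - mu) / sigma
Xn <- scale(X, center = TRUE, scale = TRUE)

# Each column now has mean 0 and standard deviation 1
round(colMeans(Xn), 10)
apply(Xn, 2, sd)
```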

To show an example, we will use the imputedgeno dataset created before in the imputation example. To normalize the data we have to indicate where the dataset is stored; there are two optional parameters, bcenter and bscale, with default value TRUE. To normalize the data we do not need to modify these optional parameters. If we only want to center the data, we have to set bscale = FALSE; if we only want to scale the data, the bcenter parameter must be FALSE.

Important: The normalization results are stored under the “NORMALIZED” group; under this group, we find all normalized datasets with the same structure as under the root. For example, if we normalize dataset imputedgeno under the OMICS group, the normalized results will be in “NORMALIZED/OMICS/imputedgeno”.

Let us show an example normalizing the imputedgeno dataset under group “OMICS” in the delayed.hdf5 file:


bdNormalize_hdf5("delayed.hdf5", group="OMICS", dataset="imputedgeno")
Normalization has been computed

The results will be under the groups “NORMALIZED” - “OMICS” in dataset imputedgeno:

# Geno after normalization
h5fsvd = H5Fopen("delayed.hdf5")
genonormalized <- h5fsvd$NORMALIZED$OMICS$imputedgeno
h5closeAll()

genonormalized[1:5,1:10]

8.3 QC - Remove low data

Missing data in meta-analyses of genetic association studies are unavoidable. In BigDataStatMeth we implemented a function to remove those SNPs with a high percentage of missing data: the bdRemovelowdata() function.

As with the imputation method, we have to indicate the input data with the filename, group and dataset name, and we also have to indicate where the data will be stored after removing the SNPs, with the parameters outgroup and outdataset.

To remove low-quality data we have to indicate whether we want to remove data by columns (bycols = TRUE) or by rows (bycols = FALSE), and the maximum fraction of missing data with the pcent parameter. pcent refers to the missing fraction in columns if bycols = TRUE, or in rows if bycols = FALSE.

To show how removing low-quality data works, we use the previous dataset stored in the delayed.hdf5 file. In this case we assume that SNPs are in columns, and we remove those SNPs with more than 39% missing data.

bdRemovelowdata("delayed.hdf5", group="OMICS", dataset="geno",
                outgroup="OMICS", outdataset="removedgeno",
                bycols = TRUE, pcent = 0.39)
Warning in evalq((function (..., call. = TRUE, immediate. = FALSE, noBreaks. =
FALSE, : 0 Columns have been removed
[1] 0

After removing the SNPs with more than 39% missing data, the results are saved in dataset “removedgeno” under group “OMICS”. We observe the resulting dataset:

# Get filtered data and show the first 5 rows
h5fsvd = H5Fopen("delayed.hdf5")
removedgeno <- h5fsvd$OMICS$removedgeno
h5closeAll()

removedgeno[1:5,1:10]
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    0    2    0    2    3    0    2    3    1     0
[2,]    2    3    0    0    0    1    1    3    3     1
[3,]    1    2    1    1    1    0    0    3    1     2
[4,]    3    2    2    1    3    1    2    2    1     2
[5,]    3    1    2    3    3    1    0    3    1     1

8.4 QC - Get MAF

8.5 QC - Remove MAF

9 Session information

sessionInfo()
R version 4.1.0 Patched (2021-06-26 r80567)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Mojave 10.14.6

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] C/ca_ES.UTF-8/ca_ES.UTF-8/C/ca_ES.UTF-8/ca_ES.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] BigDataStatMeth_0.99.14 rhdf5_2.36.0            knitr_1.36             
[4] BiocStyle_2.20.2       

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.7          magrittr_2.0.1      R6_2.5.1           
 [4] rlang_0.4.12        fastmap_1.1.0       highr_0.9          
 [7] stringr_1.4.0       tools_4.1.0         data.table_1.14.2  
[10] xfun_0.28.7         jquerylib_0.1.4     htmltools_0.5.2    
[13] RcppParallel_5.1.4  yaml_2.2.1          digest_0.6.28      
[16] bookdown_0.24       BiocManager_1.30.16 Rhdf5lib_1.14.2    
[19] sass_0.4.0          rhdf5filters_1.4.0  bitops_1.0-7       
[22] codetools_0.2-18    RCurl_1.98-1.5      evaluate_0.14      
[25] rmarkdown_2.11      stringi_1.7.5       compiler_4.1.0     
[28] bslib_0.3.1         jsonlite_1.7.2