# Tutorial 1: Using the ondisc_matrix class

This tutorial shows how to use ondisc_matrix, the core class implemented by ondisc. An ondisc_matrix is an R object that represents an expression matrix stored on disk rather than in memory. We cover initialization, querying basic information, subsetting, pulling submatrices into memory, and saving and loading. We begin by loading the ondisc package.

library(ondisc)

# Initialization

ondisc ships with several example datasets, stored in the “extdata” subdirectory of the package.

raw_data_dir <- system.file("extdata", package = "ondisc")
list.files(raw_data_dir)
#> [1] "cell_barcodes.tsv"   "gene_expression.mtx" "genes.tsv"
#> [4] "guides.tsv"          "perturbation.mtx"

The files “gene_expression.mtx”, “cell_barcodes.tsv”, and “genes.tsv” together define a gene-by-cell expression matrix. We save the full paths to these files in the variables mtx_fp, barcodes_fp, and features_fp.

mtx_fp <- paste0(raw_data_dir, "/gene_expression.mtx")
barcodes_fp <- paste0(raw_data_dir, "/cell_barcodes.tsv")
features_fp <- paste0(raw_data_dir, "/genes.tsv")
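
For readers who have not worked with these file types before, the following minimal sketch writes and re-reads a toy trio of files using the Matrix package. The file names, gene IDs, barcodes, and counts are all made up for illustration, and the two-column ID/name layout of the features file is an assumption rather than a statement of what ondisc requires.

```r
library(Matrix)

# Toy 3-gene by 4-cell count matrix; all values are made up.
toy_counts <- sparseMatrix(i = c(1, 2, 3, 1),
                           j = c(1, 2, 2, 4),
                           x = c(3, 2, 8, 5),
                           dims = c(3, 4))

# Hypothetical output directory and file names.
toy_dir <- file.path(tempdir(), "toy_mtx_example")
dir.create(toy_dir, showWarnings = FALSE)
toy_mtx_fp <- file.path(toy_dir, "toy_expression.mtx")
toy_barcodes_fp <- file.path(toy_dir, "toy_barcodes.tsv")
toy_features_fp <- file.path(toy_dir, "toy_genes.tsv")

# Matrix Market file: a header, then one "row column value" line
# per nonzero entry.
writeMM(toy_counts, file = toy_mtx_fp)

# Barcodes file: one cell barcode per line, in column order.
writeLines(c("CELL_A-1", "CELL_B-1", "CELL_C-1", "CELL_D-1"), toy_barcodes_fp)

# Features file (assumed layout): feature ID, then feature name,
# separated by a tab, in row order.
toy_features <- data.frame(id = c("GENE_ID_1", "GENE_ID_2", "GENE_ID_3"),
                           name = c("geneA", "geneB", "geneC"))
write.table(toy_features, file = toy_features_fp, sep = "\t",
            quote = FALSE, row.names = FALSE, col.names = FALSE)

# Read the matrix back to confirm that the round trip preserves it.
toy_roundtrip <- readMM(toy_mtx_fp)
dim(toy_roundtrip)
```

The row order of the features file and the line order of the barcodes file are what tie the names to the row and column indices in the .mtx file.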

An ondisc_matrix consists of two parts: an HDF5 (i.e., .h5) file that stores the expression data on-disk in a novel format, and an in-memory object that allows us to interact with the expression data from within R. The easiest way to initialize an ondisc_matrix is by calling the function create_ondisc_matrix_from_mtx. We pass to this function (i) a file path to the .mtx file storing the expression data, (ii) a file path to the .tsv file storing the cell barcodes, and (iii) a file path to the .tsv file storing the feature IDs and human-readable feature names. We optionally can specify the directory in which to store the initialized .h5 file, which in this tutorial we will take to be the temporary directory.

temp_dir <- tempdir()
exp_mat_list <- create_ondisc_matrix_from_mtx(mtx_fp = mtx_fp,
barcodes_fp = barcodes_fp,
features_fp = features_fp,
on_disk_dir = temp_dir)
#>
|========                                                                 | 11%
|=================                                                        | 23%
|==========================                                               | 36%
|====================================                                     | 48%
|=============================================                            | 61%
|======================================================                   | 73%
|===============================================================          | 86%
|=========================================================================| 98%
|=========================================================================| 100%
#> Writing CSC data.
#> Writing CSR data.

By default, create_ondisc_matrix_from_mtx returns a list of three elements: (i) an ondisc_matrix representing the expression data, (ii) a cell-wise covariate matrix, and (iii) a feature-wise covariate matrix. The exact cell-wise and feature-wise covariate matrices that are computed depend on the inputs to create_ondisc_matrix_from_mtx (see the documentation via ?create_ondisc_matrix_from_mtx for full details). The advantage of computing the cell-wise and feature-wise covariates at initialization is that we do not need to read through the entire dataset a second time to obtain them.

expression_mat <- exp_mat_list$ondisc_matrix
head(expression_mat)
#> Showing 5 of 300 featuress and 6 of 900 cells:
#> Loading required package: Matrix
#>      [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,]    3    0    0    0    0    5
#> [2,]    0    2    0    0    0    0
#> [3,]    0    8    0    0    0    0
#> [4,]    0    0    0    0    0    0
#> [5,]    0    0    0    0    0    0
cell_covariates <- exp_mat_list$cell_covariates
head(cell_covariates)
#>   n_nonzero n_umis     p_mito
#> 1        43    214 0.04672897
#> 2        26    169 0.00000000
#> 3        22    116 0.05172414
#> 4        37    258 0.08139535
#> 5        36    224 0.08035714
#> 6        31    147 0.07482993
feature_covariates <- exp_mat_list$feature_covariates
head(feature_covariates)
#>   mean_expression coef_of_variation n_nonzero
#> 1       0.7577778          2.981871       114
#> 2       0.5977778          3.302883        96
#> 3       0.5788889          3.539932        85
#> 4       0.6533333          3.341677        91
#> 5       0.5522222          3.578487        82
#> 6       0.5455556          3.541223        84

The initialized HDF5 file is named ondisc_matrix_1.h5 and is located in the temporary directory.

"ondisc_matrix_1.h5" %in% list.files(temp_dir)
#> [1] TRUE

A strength of create_ondisc_matrix_from_mtx is that it does not assume that the entire expression matrix fits into memory. The optional argument n_lines_per_chunk specifies the number of lines to read from the .mtx file at a time. Additionally, create_ondisc_matrix_from_mtx is fast: the novel algorithm that underlies this function is implemented in C++ for speed. Typically, create_ondisc_matrix_from_mtx takes about 4-8 minutes per GB to run. Finally, for a given dataset, create_ondisc_matrix_from_mtx needs to be run only once, even after closing and opening new R sessions.
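
To make the idea of chunked reading concrete, here is a minimal base R sketch that accumulates per-feature totals from a triplet file while holding only a few lines in memory at a time. It illustrates the general technique only; it is not the C++ algorithm used by ondisc, and the file contents and the local variable n_lines_per_chunk are made up for the example.

```r
# Toy triplet file: each line is "feature_index cell_index count".
triplet_fp <- tempfile(fileext = ".txt")
writeLines(c("1 1 3", "2 2 2", "3 2 8", "1 4 5", "2 3 1"), triplet_fp)

n_features <- 3
n_lines_per_chunk <- 2              # lines held in memory at once
umis_per_feature <- numeric(n_features)

con <- file(triplet_fp, open = "r")
repeat {
  # Read the next chunk; an empty result means end of file.
  chunk <- readLines(con, n = n_lines_per_chunk)
  if (length(chunk) == 0) break

  # Parse only this chunk into a small data frame.
  entries <- read.table(text = chunk,
                        col.names = c("feature", "cell", "count"))

  # Sum counts per feature within the chunk, then add to the running totals.
  chunk_sums <- tapply(entries$count, entries$feature, sum)
  idx <- as.integer(names(chunk_sums))
  umis_per_feature[idx] <- umis_per_feature[idx] + as.numeric(chunk_sums)
}
close(con)

umis_per_feature
#> [1] 8 3 8
```

A smaller chunk size lowers peak memory use at the cost of more read calls, which is the usual trade-off behind a parameter of this kind.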

# Querying basic information

We can use the functions get_feature_ids, get_feature_names, and get_cell_barcodes to obtain the feature IDs, feature names (if applicable), and cell barcodes, respectively, of an ondisc_matrix.

feature_ids <- get_feature_ids(expression_mat)
feature_names <- get_feature_names(expression_mat)
cell_barcodes <- get_cell_barcodes(expression_mat)

head(feature_ids)
#> [1] "ENSG00000198060" "ENSG00000237832" "ENSG00000267543" "ENSG00000103460"
#> [5] "ENSG00000229637" "ENSG00000174990"
head(feature_names)
#> [1] "MARCH5"     "AL138808.1" "AC015802.3" "TOX3"       "PRAC2"
#> [6] "CA5A"
head(cell_barcodes)
#> [1] "GCTTTCGTCTAGACCA-1" "ACGGTCGTCGTTAGAC-1" "TTTACGTTCACCTCGT-1"
#> [4] "TGGATCATCCTTCAGC-1" "ACAGGGAAGACGCCCT-1" "ACCTACCAGTGTTCCA-1"

Additionally, we can use dim, nrow, and ncol to obtain the dimension, number of rows (i.e., number of features), and number of columns (i.e., number of cells) of an ondisc_matrix.

dim(expression_mat)
#> [1] 300 900
nrow(expression_mat)
#> [1] 300
ncol(expression_mat)
#> [1] 900

# Subsetting

We can subset an ondisc_matrix to obtain a new ondisc_matrix that is a submatrix of the original. To subset an ondisc_matrix, apply the [ operator and pass a numeric, logical, or character vector indicating the cells or features to keep. Character vectors are assumed to refer to feature IDs (for rows) and cell barcodes (for columns).

# numeric vector examples
# keep genes 100-110
x <- expression_mat[100:110,]
# keep all cells except 10 and 20
x <- expression_mat[,-c(10,20)]
# keep genes 50-100 and 200-250 and cells 300-500
x <- expression_mat[c(50:100, 200:250), 300:500]

# character vector examples
# keep genes ENSG00000107581, ENSG00000286857, and ENSG00000266371
x <- expression_mat[c("ENSG00000107581", "ENSG00000286857", "ENSG00000266371"),]
# keep cells CGTTGGGCATGGCTGC-1 and GTAACCAGTACAGTTC-1
x <- expression_mat[,c("CGTTGGGCATGGCTGC-1", "GTAACCAGTACAGTTC-1")]

# logical vector example
# keep all genes except ENSG00000237832 and ENSG00000229637
x <- expression_mat[!(get_feature_ids(expression_mat)
%in% c("ENSG00000237832", "ENSG00000229637")),]

Subsetting an ondisc_matrix leaves the original object unchanged.

expression_mat
#> An ondisc_matrix with 300 features and 900 cells.

This important property, called object persistence, makes programming with ondisc_matrices intuitive. The underlying HDF5 file is not copied upon subset; instead, information is shared across ondisc_matrix objects, making subsets fast.
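
One way to picture how subsetting can be cheap and leave the original untouched is a lightweight handle that stores a file path plus the indices currently in view. The sketch below is purely illustrative: the functions new_handle and subset_handle and the field names are hypothetical and are not the internals of ondisc_matrix.

```r
# Hypothetical handle: a path to the backing file plus the row and
# column indices that are currently "in view".
new_handle <- function(h5_path, n_features, n_cells) {
  list(h5_path = h5_path,
       feature_idx = seq_len(n_features),
       cell_idx = seq_len(n_cells))
}

# Subsetting only narrows the index vectors; the backing file is never
# read or copied. R's copy-on-modify semantics leave the input unchanged.
subset_handle <- function(handle, features = NULL, cells = NULL) {
  if (!is.null(features)) handle$feature_idx <- handle$feature_idx[features]
  if (!is.null(cells)) handle$cell_idx <- handle$cell_idx[cells]
  handle
}

full <- new_handle("ondisc_matrix_1.h5", n_features = 300, n_cells = 900)
sub <- subset_handle(full, features = 100:110, cells = -c(10, 20))

length(full$feature_idx)  # 300: the original handle is unchanged
length(sub$feature_idx)   # 11
length(sub$cell_idx)      # 898
```

Both handles point to the same file, so creating a subset costs only the memory needed for the index vectors.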

# Pulling a submatrix into memory

We can pull a submatrix of an ondisc_matrix into memory, allowing us to perform computations on a subset of the data. To pull a submatrix into memory, use the [[ operator, passing a numeric, character, or logical vector indicating the cells or features to access. The data structure that underlies an ondisc_matrix enables fast access to both rows and columns of the matrix.

# numeric vector examples
# pull gene 6
m <- expression_mat[[6,]]
# pull cells 200 - 250
m <- expression_mat[[,200:250]]
# pull genes 50 - 100 and cells 200 - 250
m <- expression_mat[[50:100, 200:250]]

# character vector examples
# pull genes ENSG00000107581 and ENSG00000286857
m <- expression_mat[[c("ENSG00000107581", "ENSG00000286857"),]]
# pull cells CGTTGGGCATGGCTGC-1 and GTAACCAGTACAGTTC-1
m <- expression_mat[[,c("CGTTGGGCATGGCTGC-1", "GTAACCAGTACAGTTC-1")]]

# logical vector examples
# subset the matrix, keeping genes ENSG00000107581, ENSG00000286857, and ENSG00000266371
x <- expression_mat[c("ENSG00000107581", "ENSG00000286857", "ENSG00000266371"),]
# pull all genes except ENSG00000107581
m <- x[[get_feature_ids(x) != "ENSG00000107581",]]

The last example demonstrates that we can pull a submatrix of an ondisc_matrix into memory after having subset the matrix.

One can remember the difference between [ and [[ by recalling R lists: [ is used to subset a list, and [[ is used to access elements stored within a list. Similarly, [ is used to subset an ondisc_matrix, and [[ is used to access a submatrix stored within an ondisc_matrix.
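
The list analogy can be checked directly in base R. The list contents below are arbitrary.

```r
l <- list(a = 1:3, b = c("x", "y"))

# `[` returns a smaller container of the same kind: still a list.
sub_list <- l["a"]
is.list(sub_list)
#> [1] TRUE

# `[[` reaches inside the container and returns the stored element itself.
elem <- l[["a"]]
is.list(elem)
#> [1] FALSE
elem
#> [1] 1 2 3
```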

# Saving and loading an ondisc_matrix

As discussed previously, there are two components to an ondisc_matrix: the HDF5 file stored on-disk, and the R object stored in memory. The latter contains a file path to the former, allowing us to interact with the expression data from within R.

To save an ondisc_matrix, simply call saveRDS on the ondisc_matrix R object to create an .rds file.

saveRDS(object = expression_mat, file = paste0(temp_dir, "/expression_matrix.rds"))
rm(expression_mat)

We then can load the ondisc_matrix by calling readRDS on the .rds file.

expression_mat <- readRDS(paste0(temp_dir, "/expression_matrix.rds"))

We also can use the constructor of the ondisc_matrix class to create an ondisc_matrix from an already-initialized HDF5 file.

h5_file <- paste0(temp_dir, "/ondisc_matrix_1.h5")
expression_mat <- ondisc_matrix(h5_file)
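
Because the in-memory object holds a file path to the HDF5 file, a saved .rds file is presumably only as useful as that path is valid. The base R sketch below illustrates the point with a stand-in object: a plain list holding a path to an empty placeholder file. It does not use ondisc, and the field name h5_path is hypothetical.

```r
# Stand-in for the on-disk component: an empty placeholder file
# (not a real HDF5 file).
backing_fp <- tempfile(fileext = ".h5")
file.create(backing_fp)

# Stand-in for the in-memory component: an object that stores the path.
handle <- list(h5_path = backing_fp)

# Save and reload the in-memory component only.
rds_fp <- tempfile(fileext = ".rds")
saveRDS(handle, rds_fp)
restored <- readRDS(rds_fp)

# The reloaded object still points to the backing file.
file.exists(restored$h5_path)
#> [1] TRUE

# If the backing file is moved, the stored path goes stale.
moved_fp <- tempfile(fileext = ".h5")
file.rename(backing_fp, moved_fp)
file.exists(restored$h5_path)
#> [1] FALSE
```

Under this assumption, a check such as file.exists on the expected .h5 location before calling readRDS gives an early, readable failure if the file has been moved or deleted.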