Differential Expression Enrichment Tool (DEET)

Dustin Sokolowski

2022-10-09

Install and load DEET

DEET relies on the following packages. Since they are all CRAN in origin, they should download and install automatically with devtools::install_github or utils::install.packages. The required dependencies are listed below.

Installation

  1. Github (Development Version)
  1. CRAN (Stable Release)

Downloading files

All processed DEGs, metadata, and enriched pathways in formats compatible with this package as well as other methods such as gene set enrichment analysis are stored here: http://wilsonlab.org/public/DEET_data/

No functions within DEET automatically load data for the user, so the data either needs to be downloaded directly from the ftp, or using the downloader function.

The DEET_data_download function, with possible inputs “ALL”, “metadata”, “enrich”, and “feature_extract” automatically downloads the data required to run DEET_enrich and/or DEET_feature_extract.

We reccomended using:

downloaded <- DEET_data_download("ALL")
metadata <- downloaded$metadata
DEET_feature_extract_input <- downloaded$DEET_feature_extract
DEET_enrich_input <- downloaded$DEET_enrich

Here: DEET_enrich_input replaces DEET_example_data for DEET_enrich(). DEET_feature_extract_input replaces DEET_feature_extract_example_matrix for DEET_feature_extract() Lastly, metadata is not directly used in any of the function, but summarizes all of the pairwise comparisons using the following columns.

Once download, save these data and DEET can be used offline.

Structure of required datatypes

metadata

A comparison - by - explanatory piece of data dataframe providing important details to contextualize each study. For every pairwise comparison, the study name, source (SRA, TCGA, GTEx and SRA-manual), description from the DRA compendium, the number of samples (total, up-condition, and down-condition), samples (total ,up-condition, down-condition), tissue (including tumour from TCGA), number of DEs (total, up-condition, down-condition), age (mean +- sd), sex, top 15 DEGs - up, top 15 DEGs - down, top 5 enriched pathways, and top 5 enriched TFs. PMID are also available for studies selected from SRA. Lastly, each pairwise comparison was given an overall category based on those decided in Crow et al., 2019.

DEET_enrich_input

This is the meat and potatoes of the DEET dataset. Here, you can find all of the significant DE genes computed within DEET (padj < 0.05), DEGs, pathways, and TFs sorted into *gmt files compatible with traditional pathway enrichment tools (e.g., GSEA, gprofiler etc.), respective metadata, and the pathway enrichment and TF enrichment files used to generate the internal pathway enrichments of DEET_enrich. A more specific breakdown of these objects are below:

  • DEET_enrich_input: A named list of seven objects containing the data frames summarizing the DEGs from comparisons within DEET, GMT objects of comparisons within DEET for enrichment through ActivePathways, GMT objects for basic pathway and TF enrichment, and a dataframe for the metadata of each study.
  • DEET_DE: A list of data frames containing the significant DE genes, mean expression, log2fold-change, and padj from DESeq (padj < 0.05).
  • DEET_gmt_BP: A list of class GMT, which is a list of studies where each study is populated by comparison id (internal DEET identifier), comparison name (interpretable comparison name), and a gene set. In this case the gene-set is the pathways that are enriched within that study.
  • DEET_gmt_TF: A list of class GMT, which is a list of studies where each study is populated by comparison id (internal DEET identifier), comparison name (interpretable comparison name), and a gene set. In this case the gene-set is the TFs that are enriched within that study.
  • DEET_gmt_DE: A list of class GMT, which is a list of studies where each study is populated by comparison id (internal DEET identifier), comparison name (interpretable comparison name), and a gene set. In this case the gene-set is the DEGs that are enriched within that study.
  • gmt_BP: A list of class GMT, which is a curated human gene ontology gene sets from the Bader Lab http://download.baderlab.org/EM_Genesets/.
  • gmt_TF: A list of class GMT, which is a curated human transcription factor gene sets from the Bader Lab http://download.baderlab.org/EM_Genesets/.
  • DEET_metadata: the same as the metadata dataframe (see above).

DEET_feature_extract_input

A gene by comparison matrix populated by the log2Fold-change of genes that are significantly DE in the comparison (padj < 0.05). The file is the input to the mat variable in DEET_feature_extract.

Implementation summary

The primary function of the DEET R package is to allow users to query their own list of DEGs against the consistently computed DEGs within DEET by using the function DEET_enrich(). The optimal input into DEET_enrich() is a data frame of genes (human gene symbols) with an associated p-value and coefficient (e.g., Fold-change) in conjunction with a list of genes designating the statistical background. DEET_enrich() first identifies enriched biological pathways and TF targets using the *.gmt files used for all of the DEET comparisons (i.e., “Human_GO_AllPathways_with_GO_iea_June_01_2021_symbo.gmt” for pathways and “Human_TranscriptionFactors_MSigdb_June_01_2021_symbol.gmt” for TFs), allowing us to not only compare overlapping genes between the user-inputted genes and the DEGs in DEET but also overlapping pathways and TFs. All gene-set enrichment within the DEET_enrich() functions use ActivePathways with all detected genes as the background, Brown’s p-value fusion method, a false-discovery rate for p-value correction, and a cutoff of 0.05. Then, DEET_enrich() enriches the users’ inputted genes, pathways, and TF targets against the DEGs, pathways, and TF targets stored within DEET. Enrichment of the user’s inputted gene lists against the DE comparisons within DEET are also completed with ActivePathways, with a minimum geneset filter of 15 and a maximum of 10000. Then, DEET_enrich() computes the Spearman’s and Pearson’s correlation between the coefficients within the user’s imputed list of DEGs that overlap with the log2(Fold-change) of DEGs within enriched pairwise comparisons.P-values of these correlations are corrected with an FDR-adjustment. Together, DEET_enrich() returns significantly enriched studies based on overlapping DEGs, pathways, and TFs. Similarly, DEET_enrich() returns the traditional pathway and TF motif enrichment of the inputted gene list. All enrichment outputs are in the format of the output of ActivePathways (study, FDR-adjusted p-value, input length, DE comparison length, overlapping genes). DEET_enrich() also returns a dataframe of the Spearman’s and Pearson’s correlation (with associated FDR-adjusted p-values) between the inputted DE list with the DEGs found in DEET as well as the intersecting DEGs within those studies. Optionally, DEET_enrich() may be used with a generic gene list (i.e. without P-values or coefficients). If the inputted gene list is ordered, then the p-value is artificially generated as equation 1 and the coefficient is artificially generated as equation 2. We assume an inputted list in decreasing order of significance, so the FDR and coef in equations 1 and 2 are reversed. DEET_enrich() then runs normally but Pearson’s correlation between the inputted gene list and the DEGs within DEET are excluded. If the inputted gene list is unordered, then all of the p-values are set to 0.049 and both Spearman’s and Pearon’s correlations between the users inputted genes and the DEGs within DEET are excluded. If users do not provide a background set of genes, then we assume the background set is all genes detected within DEET.

For a sorted list of genes without a p-value or coefficient: Note, this happens internally, you do not have to do it.

DEG_list <- c("a", "b", "c", "d") # list of genes user inputs

DEG_processed <- data.frame(gene_symbol = DEG_list)
# DEG list is the list of genes that the user inputs

      padj <- 0.049
      for(i in 2:nrow(DEG_processed)) {
        padj[i] <- padj[i-1] * 0.95
      }
      padj <- rev(padj)
      log2fc <- rev(seq(1, 1 + 0.1*(nrow(DEG_processed) - 1), 0.1))

      DEG_processed$padj <- padj
      DEG_processed$coef <- log2fc
      colnames(DEG_processed) <- c("gene_symbol", "padj", "coef")

The DEET R package also contains plotting functions to summarize the most significant studies based on each enrichment test and correlation within DEET_enrich(). The proccess_and_plot_DEET_enrich() function plots barplots of the most enriched studies based on gene set enrichment (ActivePathways) of studies enriched based on overlapping DEGs, pathways, and TF targets. The DEET_plot_correlation() function generates scatterplots of the most enriched studies based on Spearman’s correlation analysis. All plots are generated using ggplot2, and the functions return the ggplot2 objects, allowing researchers to further modify and/or save the plots.

Lastly, the DEET R package contains a function called DEET_feature_extract(), which allows researchers to identify genes that are associated with metadata. If the response variable are continuous (e.g., number of DEGs in study, Fold-change of TP53 etc.) then features are extracted by calculating the coefficients from a Gaussian family elastic net regression using the glmnet R package, as well as Spearman’s correlation between every gene and the response variable. If the response variable is categorical (e.g., Source, Category etc.), then features are extracted by calculating the coefficients from a multinomial family elastic net regression, as well as an ANOVA between each category within the response variable. Lastly, in the response variable is ordinal (e.g., enriches for TNFa pathway, Cancer study yes/no etc.), then features are extracted using a binomial family elastic net regression, as well as a Wilcoxon’s test between the two categories within the response variable.

Breaking down each core function within DEET

DEET_enrich: querying your own list again the DEGs stored within DEET.

Inputs

  • DEET_enrich() expects a list of human gene symbols as either a character vector or as a data frame with the columns c("gene_symbol", "padj", "coef"). If you are inputted DEGs, the log2Fold-change is the most appropriate coefficient, but I left this general for other inputted data types (e.g., effect size from GWAS). Theexample_DEET_enrich_inputhas the structure required for a gene list with p-values and coefficients. An example of the unordered list isexample_DEET_enrich_input$gene_symbol`.

  • DEET_dataset should either be the example file in DEET_example_data or DEET_enrich_input downloaded by the downloader function or manually from the FTP. other options will likely cause the tool to crash.

  • ordered: This parameter only applies when the DEG_list variable is a character vector (not a dataframe). It determins if the inputted list is ordered or not.

  • background: a character vector of genes within the universe. For example, if your input is a list of DEGs in RNA-seq, then “background” should be all of your detected genes. Like in traditional pathway enrichment, not seeting a background may bias the enriched studies by tissue and/or cell-type as more highly expressed genes have more power to be detected as DE.

Examples

Differences in output between the inputted gene list types

The output of these three comparisons will be comparable, however the correlation variable is of note when the input gene set is just a list of genes.

When the gene set is ordered, a Spearman’s correlation is interpretable, as it is simply the rank-order of genes, however a Pearson’s correlation is not interpretable as we do not know the relative difference in coefficient size of your inputted genes

If the gene list is unordered, correlation analysis is entirely uninterpretable and is not run. You are given this message: Input gene list is considered UNORDERED: Correlation analysis will not be run and pathway enrichment will be unordered.

Since it not run the output is No variance in coefs. Cannot proceed with correlation.

Outputs

Named list where each element contains 6 objects. Each object will contain the results (enrichment or correlation) and corresponding metadata.

  • AP_INPUT_BP_output - Enriched BPs of input gene list. This is the output of ActivePathways. Namely, it is a data table giving the ontology ID, name, FDR-adjusted p-value, ontology size, and a character vector of overlapping genes.
  • AP_INPUT_TF_output - Enriched TF of input gene list. The format is identical to AP_INPUT_BP_output but instead of enriched gene ontologies, there is the enriched TFs.
  • AP_DEET_DE_output - Enriched pairwise comparisons within DEET based on overlap the researchers inputted gene list and DE genes. This object contains two elements, the first one is results, which is the same data.table as in AP_INPUT_BP_output but instead of pathways, we are enriching against DE comparisons found within DEET. The second is metadata which is a subset of the larger metadata dataframe that corresponds with the gene sets in results.
  • AP_DEET_BP_output - Enriched pairwise comparisons within DEET based on the overlap of the pathways enriched by the inputted gene list and the pathways enriched from the pairwise comparisons withing DEET. The format is the same as AP_DEET_DE_output where pathway gene ontology term names are in the place of DE gene names.
  • AP_DEET_TF_output - Enriched pairwise comparisons within DEET based on the overlap of the transcription factors enriched by the inputted gene list and the pathways enriched from the pairwise comparisons withing DEET. The format is the same as AP_DEET_DE_output where pathway gene transcription factor term names are in the place of DE gene names.
  • DE_correlations - A list containing three objects. The first objects, results is a data frame where the rows are different pairwise comparisons and the columns are the ouput of the correlational analysis between overlapping genes. Columns are the DEET ID and DEET name, the pearson/spearman’s correlation coefficients, as well as their raw and FDR adjusted p-values. The metadata object are the subsetted rows from the metadata data frame that align with the studies containing significant correlations. Lastly, the distributions object is a list where each element is a diffierent significantly associated pairwise comparisons. Each element is populated by a dataframe where rows are genes significant in at least one study, columns are the coefficients from the input study and DEET, and colour designates whether the gene is DE in one study (grey), both studies in the same direction (purple) and both studies in the opposite direction (orange).

DEET_feature_extract

Inputs

  • DEET_feature_extract expects three variables: a gene-by-comparison matrix populated with a statistic related to differential expresion (e.g., p-value, fold-change), a response variable (i.e., an output dependent variable), and a category.

  • mat - a gene-bycomparison matrix populated with a statistic related to differential expression. The matrix in DEET_data_download() is populated with the log2FC of genes if they were deemed as significant (padj < 0.05). Other matrices, namely those populated by p-value, fold-change, and t-statitic (including the matrices for all DEGs) can be found in http://wilsonlab.org/public/DEET_data/feature_matrices and can be downloaded separately.

  • respeonse: a character vector identifying each study, this can be categorical, binomial (e.g. 0/1, T/F), or continuous

  • datatype: a character indicating if the response is “binomial”, “categorical”, or “continuous”. This is case sensitive.

Outputs

DEET feature extract outputs a list of three objects.

  • elastic_net_coefficients - a named list where each element is a different option of the categorical or binomial variable. Each element is populated with a named vector where the name is a gene and the vector is the coefficient of an eleastic net regression. If the datatype is “continuous”, then this will just be a one-element list.
  • elastic_net - The direct output of the glmnet elastic net. The parameters of glmnet are default aside from the family, which is determined by datatype and alpha which is set to 0.5 (selecting an elastic net instead of L1 or L2 regularized).
  • basic_features - more basic statistical analysis to find genes associated with features. It is a list with “test”, a character vector stating “ANOVA”, “Wilcoxon”, or “correlation” depending on if the datatype input was categorical, binomial or continuous respsectively. The other object is a dataframe showing the significance of each gene, where rows are genes, and columns are stastistics (e.g., F-statistic for ANOVA), P-value, and FDR-adjusted p-value.

Plotting the outputs of DEET_enrich.

Barplots of enrichement

The proccess_and_plot_DEET_enrich() function is a wrapper that generates barplots and a dot plot of enrichment for all of the individual outputs of DEET_enrich() (not the correlations). The outputs are in ggplot2 objects, allowing users to further modify the plots or print however they like.

Inputs

  • DEET_output - The direct output of DEET_enrich() a named list with the ActivePathways enrichment from each data type. If there wasn’t enrichment of a certain datatype (e.g. AP_DEET_BP_output) the function runs properly but it is not plotted.

The remaining varables are for graphical parameters that are passed into the DEET_enrichment_plot() function.

  • width - The number of inches in the barplot or dotplot.
  • text_angle - The angle of the enriched studies.
  • horizontal - Whether the output barplot is vertical or horizontal
  • topn - the top number of studies (by p-value) to be plotted.
  • ol_size - the minimum number of overlapping genes (or paths) in an enriched study.
  • exclude_domain - Exclude studies enriched based on DEGs, Paths, or TF if the user happened to aggregate the results into a single DF, generally unused.
  • cluster_order - Factor to group studies based on the researchers custom annotation.
  • colors - Type of color pallete to input into ‘scale_fill_brewer’ of ggplot.

Outputs

There are up to four outputs assuming everything is significant, each output is a list or a ggplot object.

As shown above, from here you can also just use DEET_enrichment_plot directly to have some more control over these plots.

Scatterplots of correlations for DEET enrichment

This function also takes the direct output from DEET_enrich and generates scatterplots of the correlations of studies whose log2FCs are significantly correlated with the input DE list.

Input

correlation_input - The DE_correlations object that is the output of the DEET_enrich function. It only works if there was at least one study that was significantly correlated.

Outputs

  • A list of ggplot objects showing the scatterplot between the coefs of the inputted DE list and the correlated study within DEET. The X axis shows the correlation coefficient, the Y axis shows the name of the enriched study, and the coloured points are genes that are DE in both studies.

Using DEET gene lists with other studies and enriching two input lists simultaneously.

As mentioned previously, the genesets within DEET are easily transferrable to other gene set enrichment datasets.

Saving DEET gene set for GSEA, gprofiler, etc.

One option is to download the *gmt files diretly from our ftp. http://wilsonlab.org/public/DEET_data/DEET_DE.gmt Is directly compatible with these tools.

The other option would be to save the downloaded DEET gmt as a gmt file. This is completed using the ActivePathways R package. Instead of using the example data as shown below, please use the full dataset.

Instead of saving to a temporary directory like in this vignette, save the file wherever you want the directory to be saved.

Outputs

The outputs of using ActivePathways are the same as DEET_enrich() but with a couple extra columns. * evidence: Whether the DEET comparison is enriched because of one gene list, both gene lists, or an integrated version of these gene lists * Genes_colname: The genes that contributed to enrichment from each inputted gene list.