Surprisal Analysis, an R package for information theoretic analysis of gene expression data

library(SurprisalAnalysis)
library(ggplot2)

Read data and apply Surprisal analysis

# Load the example dataset shipped with the package
data <- read.csv(system.file("extdata", "helper_T_cell_0_test.csv", package = "SurprisalAnalysis"), header = TRUE)
# Run Surprisal analysis; the second element of the result holds the transcript weights
results <- surprisal_analysis(data)
transcript_weights <- results[[2]]
percentile_GO <- 0.95 # change based on your preference
lambda_no <- 2 # change based on your preference; lambda #1 is the baseline state

Run GO analysis

GO.results <- GO_analysis_surprisal_analysis(transcript_weights, percentile_GO, lambda_no, key_type = "SYMBOL", flip = FALSE, species.db.str =  "org.Mm.eg.db", top_GO_terms=15)

The function GO_analysis_surprisal_analysis() runs Gene Ontology (GO) enrichment on the most influential transcripts from a chosen Surprisal pattern. Below are the input arguments:

transcript_weights

A matrix of transcript weights, typically the second element ([[2]]) of the result returned by surprisal_analysis().

percentile_GO

A numeric value between 0 and 1 specifying the quantile cutoff for transcript selection. Example: 0.95 means only the top 5% of transcripts (by absolute weight) in the chosen \(\lambda\) pattern are used.
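To make the cutoff concrete, here is a minimal sketch on made-up weights. It is not the package's internal code: it assumes the selection keeps transcripts whose absolute weight in the chosen column is at or above the percentile_GO quantile, and toy_weights and selected are hypothetical names.

```r
# Toy weight matrix: 100 hypothetical transcripts x 3 lambda patterns.
set.seed(1)
toy_weights <- matrix(rnorm(300), nrow = 100,
                      dimnames = list(paste0("gene", 1:100),
                                      paste0("lambda", 1:3)))

percentile_GO <- 0.95
lambda_no <- 2

# Absolute weights in the chosen pattern and the quantile cutoff (assumed rule).
w <- abs(toy_weights[, lambda_no])
cutoff <- quantile(w, probs = percentile_GO)

# Transcripts at or above the cutoff: the top 5% of 100, i.e. 5 genes.
selected <- names(w)[w >= cutoff]
length(selected)
```

Raising percentile_GO towards 1 shrinks the selected set, which usually gives fewer but more pattern-specific genes for the enrichment step.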

lambda_no

An integer specifying which \(\lambda\) pattern to analyze. Note: \(\lambda_1\) represents the baseline (balanced) state, while higher-order \(\lambda\) patterns capture additional constraints.

key_type

The type of transcript identifiers used in your data. Options include:

"SYMBOL" (gene symbols, e.g. TP53),

"ENTREZID" (Entrez gene IDs),

"ENSEMBL" (Ensembl IDs),

"PROBEID" (microarray probe IDs).

This must match the ID format in your input dataset.
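Because a mismatched key_type is an easy mistake to make, a quick check of the identifier format can help before running the enrichment. The helper below, guess_key_type(), is hypothetical and not part of SurprisalAnalysis; it relies on common naming patterns only (Ensembl gene IDs starting with "ENS", purely numeric Entrez IDs, Affymetrix-style probe IDs ending in "_at") and falls back to "SYMBOL" otherwise.

```r
# Hypothetical helper (not part of SurprisalAnalysis): guess the identifier
# type from the ID strings so that key_type can be set consistently.
guess_key_type <- function(ids) {
  ids <- as.character(ids)
  stopifnot(length(ids) > 0)
  if (all(grepl("^ENS[A-Z]*G[0-9]{11}", ids))) {
    "ENSEMBL"      # Ensembl gene IDs, e.g. ENSG... or ENSMUSG...
  } else if (all(grepl("^[0-9]+$", ids))) {
    "ENTREZID"     # purely numeric IDs
  } else if (all(grepl("_at$", ids))) {
    "PROBEID"      # Affymetrix-style probe set IDs (assumed naming pattern)
  } else {
    "SYMBOL"       # fall back to gene symbols
  }
}

guess_key_type(c("Trp53", "Cd4", "Il2"))                      # "SYMBOL"
guess_key_type(c("ENSMUSG00000059552", "ENSMUSG00000023274")) # "ENSEMBL"
guess_key_type(c("22059", "12504"))                           # "ENTREZID"
```

This is a heuristic only; mixed or unusual identifiers should be checked by hand.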

flip

Logical (TRUE/FALSE). If TRUE, multiplies the transcript weights for the selected \(\lambda\) by -1 before selecting the top quantile. This is useful for keeping the sign of the weights consistent with the direction shown in the \(\lambda\) plots.
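The effect of a sign flip can be illustrated on a toy vector. This sketch is an assumption-laden illustration, not the package's code: it ranks on signed weights so that the flip visibly changes which transcripts sit in the upper tail. Under a purely absolute-value ranking the selected set would not change. The gene names and values are made up.

```r
# Toy signed weights for one lambda pattern (hypothetical values).
w <- c(geneA = 2.5, geneB = -3.0, geneC = 0.4, geneD = -0.1, geneE = 1.2)

flip <- TRUE
w_used <- if (flip) -w else w   # flip = TRUE multiplies the weights by -1

# Upper tail of the signed weights before and after flipping
# (assumption: ranking on signed values, for illustration only).
top_before <- names(w)[w >= quantile(w, 0.8)]
top_after  <- names(w_used)[w_used >= quantile(w_used, 0.8)]

top_before  # "geneA"
top_after   # "geneB"
```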

species.db.str

The organism database to use for gene mapping. Current options:

"org.Hs.eg.db" for Homo sapiens (human),

"org.Mm.eg.db" for Mus musculus (mouse).

ont

The GO ontology branch for enrichment analysis. Options:

"BP" (Biological Process; default),

"MF" (Molecular Function),

"CC" (Cellular Component).

pAdjustMethod

The multiple testing correction method. Options include: "BH" (default), "bonferroni", "holm", "hochberg", "hommel", "BY", "none".
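The method names listed here are the same ones accepted by base R's p.adjust(). As a small illustration of how the choice changes the adjusted values, the sketch below applies two of them to made-up p-values; it assumes, without confirming from the package source, that the argument is passed on to this kind of correction.

```r
# Five hypothetical raw p-values from an enrichment test.
p_raw <- c(0.01, 0.02, 0.03, 0.04, 0.05)

# Benjamini-Hochberg controls the false discovery rate; Bonferroni is stricter.
p_bh   <- p.adjust(p_raw, method = "BH")
p_bonf <- p.adjust(p_raw, method = "bonferroni")

# BH gives 0.05 for every entry; Bonferroni gives 0.05, 0.10, 0.15, 0.20, 0.25.
data.frame(p_raw, p_bh, p_bonf)
```

With a 0.05 threshold, all five terms survive BH here, while only the first survives Bonferroni.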

top_GO_terms

An integer specifying the number of top enriched GO terms to return (default: 15).
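As a rough picture of what keeping the top terms means, the sketch below sorts a made-up enrichment table and keeps the first rows. It assumes that terms are ranked by adjusted p-value, which is not confirmed here; toy_GO and its values are hypothetical, with column names taken from the plotting code in this document.

```r
# Hypothetical enrichment table with Description, Count, and p.adjust columns.
toy_GO <- data.frame(
  Description = c("T cell activation", "cell cycle", "translation", "DNA repair"),
  Count       = c(12, 30, 8, 15),
  p.adjust    = c(0.001, 0.04, 0.0002, 0.01)
)

top_GO_terms <- 2

# Sort by adjusted p-value (most significant first) and keep the first rows.
top_terms <- head(toy_GO[order(toy_GO$p.adjust), ], top_GO_terms)
top_terms$Description  # "translation" "T cell activation"
```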

ggplot(GO.results, aes(x = Description, y = Count, fill = p.adjust)) +
  geom_bar(stat = "identity") +
  scale_fill_gradient(low = "#790915", high = "#062c5c") +
  theme_minimal() +
  theme(
    # Remove panel border, background, and grid lines
    panel.border = element_blank(),
    panel.background = element_blank(),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    # Add axis line
    axis.line = element_line(colour = "black"),
    # axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    # axis.text = element_blank(),
    # legend.position = "none",
    plot.title = element_text(hjust = 0.5, size = 20),
    # axis.text = element_text(size = 15),
    text = element_text(size = 18)
  ) +
  coord_flip() +
  labs(tag = "A", title = "GO analysis")
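By default ggplot2 orders a character axis alphabetically, so the bars in the plot above follow the term names rather than their size. One optional way to sort them by gene count is to turn Description into a factor ordered by Count before plotting. The sketch uses a made-up table (toy_GO is hypothetical); the same reorder() line could be applied to GO.results, assuming it has these columns.

```r
# Hypothetical enrichment table with the columns used in the plot.
toy_GO <- data.frame(
  Description = c("T cell activation", "cell cycle", "translation", "DNA repair"),
  Count       = c(12, 30, 8, 15),
  p.adjust    = c(0.001, 0.04, 0.0002, 0.01)
)

# reorder() sets the factor levels by increasing Count, and ggplot2 draws
# discrete axes in level order.
toy_GO$Description <- reorder(toy_GO$Description, toy_GO$Count)
levels(toy_GO$Description)
# "translation" "T cell activation" "DNA repair" "cell cycle"
```

With coord_flip(), the first level is drawn at the bottom, so the term with the largest count ends up at the top of the chart.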

