Current Version: 0.2.3
UCSCXenaTools is an R package for downloading and exploring data from UCSC Xena data hubs, which are a collection of UCSC-hosted public databases such as TCGA, ICGC, TARGET, GTEx, CCLE, and others. The databases are normalized so they can be combined, linked, filtered, explored and downloaded.
Installation
You can install UCSCXenaTools from GitHub with:
# install.packages("devtools")
devtools::install_github("ShixiangWang/UCSCXenaTools", build_vignettes = TRUE)
Read the vignettes:
browseVignettes("UCSCXenaTools")
Data Hub List
All datasets are available at https://xenabrowser.net/datapages/.
Currently, UCSCXenaTools supports all 7 data hubs of UCSC Xena.
- UCSC Public Hub: https://ucscpublic.xenahubs.net
- TCGA Hub: https://tcga.xenahubs.net
- GDC Xena Hub: https://gdc.xenahubs.net
- ICGC Xena Hub: https://icgc.xenahubs.net
- Pan-Cancer Atlas Hub: https://pancanatlas.xenahubs.net
- GA4GH (TOIL) Hub: https://toil.xenahubs.net
- Treehouse Hub: https://xena.treehouse.gi.ucsc.edu
If the API changes, please let me know by email at w_shixiang@163.com or open an issue on GitHub.
Usage
Downloading UCSC Xena datasets and loading them into R with UCSCXenaTools is a workflow of 5 steps: generate, filter, query, download and prepare. These steps are implemented as XenaGenerate, XenaFilter, XenaQuery, XenaDownload and XenaPrepare, respectively. The functions are straightforward to use and combine well with other packages such as dplyr.
The following example downloads clinical data for LUNG, LUAD and LUSC from TCGA (hg19 version).
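As an overview, the five steps can be chained as in the sketch below. It uses only the functions and arguments that appear in the sections that follow; the wrapper name fetch_lung_clinical is hypothetical and not part of the package. Because it needs the package and network access, the workflow is wrapped in a function and is not run when the code is loaded.

```r
# Sketch only: chains generate -> filter -> query -> download -> prepare.
# `fetch_lung_clinical` is a hypothetical helper, not a package function.
fetch_lung_clinical <- function(destdir = tempdir()) {
  # Requires the UCSCXenaTools package and an internet connection.
  stopifnot(requireNamespace("UCSCXenaTools", quietly = TRUE))
  library(UCSCXenaTools)

  xe <- XenaGenerate(subset = XenaHostNames == "TCGA")     # 1. generate
  xe <- XenaFilter(xe, filterDatasets = "clinical")        # 2. filter
  xe <- XenaFilter(xe, filterDatasets = "LUAD|LUSC|LUNG")
  xq <- XenaQuery(xe)                                      # 3. query
  xd <- XenaDownload(xq, destdir = destdir)                # 4. download
  XenaPrepare(xd)                                          # 5. prepare
}

# Example call (downloads files, so it is left commented out):
# cli <- fetch_lung_clinical()
# names(cli)
```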
XenaData data.frame
Beginning with version 0.2.0, UCSCXenaTools uses a data.frame object, XenaData (built into the package), to generate an instance of the XenaHub class, which communicates with the API of the UCSC Xena data hubs.
You can load XenaData after loading UCSCXenaTools into R.
library(UCSCXenaTools)
data(XenaData)
head(XenaData)
#> XenaHosts XenaHostNames
#> 1 https://ucscpublic.xenahubs.net UCSC_Public
#> 2 https://ucscpublic.xenahubs.net UCSC_Public
#> 3 https://ucscpublic.xenahubs.net UCSC_Public
#> 4 https://ucscpublic.xenahubs.net UCSC_Public
#> 5 https://ucscpublic.xenahubs.net UCSC_Public
#> 6 https://ucscpublic.xenahubs.net UCSC_Public
#> XenaCohorts
#> 1 1000_genomes
#> 2 1000_genomes
#> 3 Acute lymphoblastic leukemia (Mullighan 2008)
#> 4 Acute lymphoblastic leukemia (Mullighan 2008)
#> 5 Acute lymphoblastic leukemia (Mullighan 2008)
#> 6 B cells (Basso 2005)
#> XenaDatasets
#> 1 1000_genomes/BRCA2
#> 2 1000_genomes/BRCA1
#> 3 mullighan2008_public/mullighan2008_500K_genomicMatrix
#> 4 mullighan2008_public/mullighan2008_public_clinicalMatrix
#> 5 mullighan2008_public/mullighan2008_SNP6_genomicMatrix
#> 6 basso2005_public/basso2005_public_clinicalMatrix
Generate a XenaHub object
This is done with the XenaGenerate function, which generates a XenaHub object from the XenaData data frame.
XenaGenerate()
#> class: XenaHub
#> hosts():
#> https://ucscpublic.xenahubs.net
#> https://tcga.xenahubs.net
#> https://gdc.xenahubs.net
#> https://icgc.xenahubs.net
#> https://toil.xenahubs.net
#> https://pancanatlas.xenahubs.net
#> https://xena.treehouse.gi.ucsc.edu
#> cohorts() (134 total):
#> 1000_genomes
#> Acute lymphoblastic leukemia (Mullighan 2008)
#> B cells (Basso 2005)
#> ...
#> Treehouse PED v8
#> Treehouse public expression dataset (July 2017)
#> datasets() (1549 total):
#> 1000_genomes/BRCA2
#> 1000_genomes/BRCA1
#> mullighan2008_public/mullighan2008_500K_genomicMatrix
#> ...
#> treehouse_public_samples_unique_ensembl_expected_count.2017-09-11.tsv
#> treehouse_public_samples_unique_hugo_log2_tpm_plus_1.2017-09-11.tsv
We can set the subset argument to narrow down the datasets.
XenaGenerate(subset = XenaHostNames=="TCGA")
#> class: XenaHub
#> hosts():
#> https://tcga.xenahubs.net
#> cohorts() (38 total):
#> TCGA Acute Myeloid Leukemia (LAML)
#> TCGA Adrenocortical Cancer (ACC)
#> TCGA Bile Duct Cancer (CHOL)
#> ...
#> TCGA Thyroid Cancer (THCA)
#> TCGA Uterine Carcinosarcoma (UCS)
#> datasets() (879 total):
#> TCGA.LAML.sampleMap/HumanMethylation27
#> TCGA.LAML.sampleMap/HumanMethylation450
#> TCGA.LAML.sampleMap/Gistic2_CopyNumber_Gistic2_all_data_by_genes
#> ...
#> TCGA.UCS.sampleMap/Pathway_Paradigm_RNASeq_And_Copy_Number
#> TCGA.UCS.sampleMap/mutation_curated_broad
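The subset argument is written as a condition on the columns of XenaData. As a rough illustration of the row selection this implies (not the package's internal code), the same condition can be applied to a small hand-made data frame with base R's subset(). The toy data frame below is an assumption for illustration; it only reuses column names and a few values printed above.

```r
# Toy stand-in for XenaData with the same column names (4 hand-picked rows).
toy <- data.frame(
  XenaHostNames = c("UCSC_Public", "UCSC_Public", "TCGA", "TCGA"),
  XenaCohorts   = c("1000_genomes",
                    "B cells (Basso 2005)",
                    "TCGA Acute Myeloid Leukemia (LAML)",
                    "TCGA Uterine Carcinosarcoma (UCS)"),
  XenaDatasets  = c("1000_genomes/BRCA2",
                    "basso2005_public/basso2005_public_clinicalMatrix",
                    "TCGA.LAML.sampleMap/HumanMethylation27",
                    "TCGA.UCS.sampleMap/mutation_curated_broad"),
  stringsAsFactors = FALSE
)

# The condition is evaluated against the columns of the data frame,
# so only rows from the TCGA hub are kept.
tcga_rows <- subset(toy, XenaHostNames == "TCGA")
unique(tcga_rows$XenaCohorts)
```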
You can use XenaHub() to generate a XenaHub object for API communication, but it is not recommended.
You can explore the object with hosts(), cohorts() and datasets().
xe = XenaGenerate(subset = XenaHostNames=="TCGA")
# get hosts
hosts(xe)
#> [1] "https://tcga.xenahubs.net"
# get cohorts
head(cohorts(xe))
#> [1] "TCGA Acute Myeloid Leukemia (LAML)"
#> [2] "TCGA Adrenocortical Cancer (ACC)"
#> [3] "TCGA Bile Duct Cancer (CHOL)"
#> [4] "TCGA Bladder Cancer (BLCA)"
#> [5] "TCGA Breast Cancer (BRCA)"
#> [6] "TCGA Cervical Cancer (CESC)"
# get datasets
head(datasets(xe))
#> [1] "TCGA.LAML.sampleMap/HumanMethylation27"
#> [2] "TCGA.LAML.sampleMap/HumanMethylation450"
#> [3] "TCGA.LAML.sampleMap/Gistic2_CopyNumber_Gistic2_all_data_by_genes"
#> [4] "TCGA.LAML.sampleMap/mutation_wustl_hiseq"
#> [5] "TCGA.LAML.sampleMap/GA"
#> [6] "TCGA.LAML.sampleMap/HiSeqV2_percentile"
The pipe operator %>% can also be used here.
library(tidyverse)
XenaData %>%
  filter(XenaHostNames == "TCGA", grepl("BRCA", XenaCohorts), grepl("Path", XenaDatasets)) %>%
  XenaGenerate()
#> class: XenaHub
#> hosts():
#> https://tcga.xenahubs.net
#> cohorts() (1 total):
#> TCGA Breast Cancer (BRCA)
#> datasets() (4 total):
#> TCGA.BRCA.sampleMap/Pathway_Paradigm_mRNA_And_Copy_Number
#> TCGA.BRCA.sampleMap/Pathway_Paradigm_RNASeq
#> TCGA.BRCA.sampleMap/Pathway_Paradigm_RNASeq_And_Copy_Number
#> TCGA.BRCA.sampleMap/Pathway_Paradigm_mRNA
Filter
There are too many datasets to work with at once, so we filter them with the XenaFilter function.
Regular expressions can be used to narrow a XenaHub object down to the datasets we want.
(XenaFilter(xe, filterDatasets = "clinical") -> xe2)
#> class: XenaHub
#> hosts():
#> https://tcga.xenahubs.net
#> cohorts() (39 total):
#> (unassigned)
#> TCGA Acute Myeloid Leukemia (LAML)
#> TCGA Adrenocortical Cancer (ACC)
#> ...
#> TCGA Thyroid Cancer (THCA)
#> TCGA Uterine Carcinosarcoma (UCS)
#> datasets() (37 total):
#> TCGA.OV.sampleMap/OV_clinicalMatrix
#> TCGA.DLBC.sampleMap/DLBC_clinicalMatrix
#> TCGA.KIRC.sampleMap/KIRC_clinicalMatrix
#> ...
#> TCGA.READ.sampleMap/READ_clinicalMatrix
#> TCGA.MESO.sampleMap/MESO_clinicalMatrix
Then select the 3 datasets for LUAD, LUSC and LUNG.
XenaFilter(xe2, filterDatasets = "LUAD|LUSC|LUNG") -> xe2
The pipe can be used here as well.
suppressMessages(require(dplyr))
xe %>%
  filterXena(filterDatasets = "clinical") %>%
  filterXena(filterDatasets = "luad|lusc|lung")
## class: XenaHub
## hosts():
## https://tcga.xenahubs.net
## cohorts() (39 total):
## (unassigned)
## TCGA Acute Myeloid Leukemia (LAML)
## TCGA Adrenocortical Cancer (ACC)
## ...
## TCGA Thyroid Cancer (THCA)
## TCGA Uterine Carcinosarcoma (UCS)
## datasets() (3 total):
## TCGA.LUSC.sampleMap/LUSC_clinicalMatrix
## TCGA.LUNG.sampleMap/LUNG_clinicalMatrix
## TCGA.LUAD.sampleMap/LUAD_clinicalMatrix
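To see what the two patterns select, they can be tried on a plain character vector with base R's grepl(). This is only an illustration of the regular expressions on dataset names taken from the output above; how XenaFilter applies them internally is not shown here.

```r
# A few dataset names copied from the printed output above.
datasets <- c(
  "TCGA.LUSC.sampleMap/LUSC_clinicalMatrix",
  "TCGA.LUNG.sampleMap/LUNG_clinicalMatrix",
  "TCGA.LUAD.sampleMap/LUAD_clinicalMatrix",
  "TCGA.OV.sampleMap/OV_clinicalMatrix",
  "TCGA.LAML.sampleMap/HumanMethylation27"
)

# Step 1: keep names containing "clinical" (matches "..._clinicalMatrix").
clinical <- datasets[grepl("clinical", datasets)]

# Step 2: of those, keep names matching any of the three alternatives.
lung <- clinical[grepl("LUAD|LUSC|LUNG", clinical)]
lung
```

Filtering in two steps keeps each pattern simple; a single combined regular expression would also work but is harder to read.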
Query
Create a query before downloading the data.
xe2_query = XenaQuery(xe2)
xe2_query
#> hosts datasets
#> 1 https://tcga.xenahubs.net TCGA.LUSC.sampleMap/LUSC_clinicalMatrix
#> 2 https://tcga.xenahubs.net TCGA.LUNG.sampleMap/LUNG_clinicalMatrix
#> 3 https://tcga.xenahubs.net TCGA.LUAD.sampleMap/LUAD_clinicalMatrix
#> url
#> 1 https://tcga.xenahubs.net/download/TCGA.LUSC.sampleMap/LUSC_clinicalMatrix.gz
#> 2 https://tcga.xenahubs.net/download/TCGA.LUNG.sampleMap/LUNG_clinicalMatrix.gz
#> 3 https://tcga.xenahubs.net/download/TCGA.LUAD.sampleMap/LUAD_clinicalMatrix.gz
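Judging from the table above, each download URL appears to be the host, followed by /download/, the dataset name and a .gz suffix. The base R sketch below reproduces that pattern; it is inferred from the printed output rather than taken from the package source, and build_url is a hypothetical helper name.

```r
# Hypothetical helper: rebuilds the URL pattern seen in the query output.
build_url <- function(host, dataset) {
  paste0(host, "/download/", dataset, ".gz")
}

build_url("https://tcga.xenahubs.net",
          "TCGA.LUSC.sampleMap/LUSC_clinicalMatrix")
```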
Download
By default, data are downloaded to the XenaData directory under the system temp directory. You can specify a different path.
If the data already exist, they are not downloaded again, but you can force a new download with the force option.
xe2_download = XenaDownload(xe2_query)
#> We will download files to directory /var/folders/mx/rfkl27z90c96wbmn3_kjk8c80000gn/T//Rtmp9Iq0Y6.
#> Downloading TCGA.LUSC.sampleMap__LUSC_clinicalMatrix.gz
#> Downloading TCGA.LUNG.sampleMap__LUNG_clinicalMatrix.gz
#> Downloading TCGA.LUAD.sampleMap__LUAD_clinicalMatrix.gz
#> Note fileNames transfromed from datasets name and / chracter all changed to __ character.
## not run
#xe2_download = XenaDownload(xe2_query, destdir = "E:/Github/XenaData/test/")
Note that file names are derived from dataset names, with every / character replaced by __.
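The renaming can be reproduced in base R as follows. This is a minimal sketch of the rule as described and as seen in the download messages, not the package's own code; to_file_name is a hypothetical helper.

```r
# Hypothetical helper: replace every "/" with "__" and append ".gz".
to_file_name <- function(dataset) {
  paste0(gsub("/", "__", dataset, fixed = TRUE), ".gz")
}

to_file_name("TCGA.LUSC.sampleMap/LUSC_clinicalMatrix")
```

Flattening the name this way lets all files sit in one directory without creating a subdirectory per cohort.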
Prepare
There are 4 ways to load the data into R.
# way1: directory
cli1 = XenaPrepare("E:/Github/XenaData/test/")
names(cli1)
## [1] "TCGA.LUAD.sampleMap__LUAD_clinicalMatrix.gz"
## [2] "TCGA.LUNG.sampleMap__LUNG_clinicalMatrix.gz"
## [3] "TCGA.LUSC.sampleMap__LUSC_clinicalMatrix.gz"
# way2: local files
cli2 = XenaPrepare("E:/Github/XenaData/test/TCGA.LUAD.sampleMap__LUAD_clinicalMatrix.gz")
class(cli2)
## [1] "tbl_df" "tbl" "data.frame"
cli2 = XenaPrepare(c("E:/Github/XenaData/test/TCGA.LUAD.sampleMap__LUAD_clinicalMatrix.gz",
"E:/Github/XenaData/test/TCGA.LUNG.sampleMap__LUNG_clinicalMatrix.gz"))
class(cli2)
## [1] "list"
names(cli2)
## [1] "TCGA.LUAD.sampleMap__LUAD_clinicalMatrix.gz"
## [2] "TCGA.LUNG.sampleMap__LUNG_clinicalMatrix.gz"
# way3: urls
cli3 = XenaPrepare(xe2_download$url[1:2])
names(cli3)
## [1] "LUSC_clinicalMatrix.gz" "LUNG_clinicalMatrix.gz"
# way4: XenaDownload object
cli4 = XenaPrepare(xe2_download)
names(cli4)
#> [1] "TCGA.LUSC.sampleMap__LUSC_clinicalMatrix.gz"
#> [2] "TCGA.LUNG.sampleMap__LUNG_clinicalMatrix.gz"
#> [3] "TCGA.LUAD.sampleMap__LUAD_clinicalMatrix.gz"
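If you want to inspect a single downloaded file without the package, a gzip-compressed tab-delimited file can be read with base R alone. The sketch below assumes the files are tab-delimited text; it writes a tiny made-up table to a temporary .gz file and reads it back, so the column names and values are placeholders, not real Xena data. Note that XenaPrepare returns a tibble (see the class() output above), so it evidently uses a different reader.

```r
# Made-up two-row table standing in for a clinical matrix (placeholder values).
toy <- data.frame(
  sampleID = c("sample_1", "sample_2"),
  value    = c(1.5, 2.5),
  stringsAsFactors = FALSE
)

# Write it as gzip-compressed, tab-delimited text in the temp directory.
path <- tempfile(fileext = ".gz")
write.table(toy, gzfile(path), sep = "\t", quote = FALSE, row.names = FALSE)

# read.delim() opens the file via file(), which detects gzip compression.
back <- read.delim(path, stringsAsFactors = FALSE)
str(back)
```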
SessionInfo
sessionInfo()
#> R version 3.5.1 (2018-07-02)
#> Platform: x86_64-apple-darwin15.6.0 (64-bit)
#> Running under: macOS High Sierra 10.13.6
#>
#> Matrix products: default
#> BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
#>
#> locale:
#> [1] zh_CN.UTF-8/zh_CN.UTF-8/zh_CN.UTF-8/C/zh_CN.UTF-8/zh_CN.UTF-8
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] UCSCXenaTools_0.2.3 devtools_1.13.6 pacman_0.4.6
#>
#> loaded via a namespace (and not attached):
#> [1] Rcpp_0.12.18 pillar_1.3.0 compiler_3.5.1
#> [4] later_0.7.3 bindr_0.1.1 prettydoc_0.2.1
#> [7] tools_3.5.1 digest_0.6.16 jsonlite_1.5
#> [10] evaluate_0.11 memoise_1.1.0 tibble_1.4.2
#> [13] pkgconfig_2.0.2 rlang_0.2.2 shiny_1.1.0
#> [16] rstudioapi_0.7 commonmark_1.5 curl_3.2
#> [19] yaml_2.2.0 bindrcpp_0.2.2 knitr_1.20
#> [22] withr_2.1.2 httr_1.3.1 stringr_1.3.1
#> [25] dplyr_0.7.6 roxygen2_6.1.0 xml2_1.2.0
#> [28] desc_1.2.0 htmlwidgets_1.2 hms_0.4.2
#> [31] rprojroot_1.3-2 DT_0.4 shinydashboard_0.7.0
#> [34] tidyselect_0.2.4 glue_1.3.0 R6_2.2.2
#> [37] rmarkdown_1.10 readr_1.1.1 purrr_0.2.5
#> [40] magrittr_1.5 backports_1.1.2 promises_1.0.1
#> [43] htmltools_0.3.6 assertthat_0.2.0 xtable_1.8-2
#> [46] mime_0.5 httpuv_1.4.5 stringi_1.2.4
#> [49] crayon_1.3.4
New feature
- Added an easy download function and Xena information for TCGA data
# download RNASeq data (use UVM as example)
tcgaEasyDownload(project = "UVM",
data_type = "Gene Expression RNASeq",
file_type = "IlluminaHiSeq RNASeqV2")
Run the Shiny app with:
UCSCXenaTools::XenaShiny()
Downloading through Shiny is under consideration; I am still learning how to work with Shiny.
Acknowledgement
This package is based on XenaR; thanks to Martin Morgan for his work.
LICENSE
GPL-3
Please note that code from the XenaR package is under the Apache 2.0 license.
ToDo
- Shiny
- An easier download workflow