Gene ID Mapping for Genotype-Tissue Expression (GTEx) Data

Nan Xiao <nanx@uchicago.edu>
Gao Wang <gaow@uchicago.edu>

2016-06-12

1 Introduction

The Genotype-Tissue Expression (GTEx) project (Lonsdale et al. 2013) aims to measure human tissue-specific gene expression levels. The collected data make it possible to explore the landscape of gene expression, gene regulation, and their connections with genetic variation.

Raw GTEx data contain expression measurements from various types of elements (such as genes, pseudogenes, and noncoding DNA sequences) covering the whole genome. For some analyses, it might be desirable to keep only a subset of the data, for example, data from protein-coding genes. In such cases, mapping the original Ensembl gene IDs to Entrez gene IDs or HGNC symbols becomes an essential step in the analysis pipeline.

The grex package offers a minimal-dependency solution for such ID mappings. Currently, an Ensembl ID from GTEx can be mapped to its Entrez gene ID, HGNC gene symbol, HGNC gene name, cytogenetic location, and UniProt ID. We limit our scope to the Ensembl IDs that appear in the gene read count data. Ensembl IDs from transcript data will be considered in future versions.

2 Mapping Table

To facilitate such ID conversion tasks, the grex package has a built-in mapping table derived from the well-known annotation data package org.Hs.eg.db (Carlson 2015). The mapping data integrates information from both Ensembl and NCBI, to maximize the chance of finding a matching Entrez ID. The R code for creating the table is located here.

Not surprisingly, when creating such a table, there were hundreds of cases where a single Ensembl ID mapped to multiple Entrez gene IDs. To create a one-to-one mapping, we took a simple approach: for each such Ensembl ID, we kept only the first Entrez ID encountered in the original database and removed the rest. Therefore, there might be cases where the mapping is not 100% accurate. If you have doubts about particular results, please search for the original ID on the Ensembl website and check whether the mapped ID is correct.
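The "keep the first match" rule described above can be sketched in base R. This is only an illustration under stated assumptions, not the code used to build the package's table: the data frame mapping and all IDs in it are made up, and the real table was derived from org.Hs.eg.db.

```r
# Toy many-to-many mapping; the IDs are hypothetical, for illustration only.
mapping = data.frame(
  ensembl_id = c("ENSG_A", "ENSG_A", "ENSG_B", "ENSG_C", "ENSG_C"),
  entrez_id  = c("101", "102", "201", "301", "302"),
  stringsAsFactors = FALSE
)

# duplicated() flags the second and later rows of each Ensembl ID,
# so negating it keeps only the first row encountered per ID.
one_to_one = mapping[!duplicated(mapping$ensembl_id), ]
rownames(one_to_one) = NULL
one_to_one
```

In this toy example each Ensembl ID keeps its first Entrez ID (101, 201, and 301), and the alternatives 102 and 302 are dropped. This is why a retained ID is not guaranteed to be the best match.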

3 Code Example

As an example, we take the Ensembl IDs from the GTEx V6 gene count data, select 100 of them, and map them:

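library("grex")
# Load the Ensembl IDs from the GTEx V6 gene count data
data("gtexv6")
# Select 100 IDs (positions 101 to 200)
id = gtexv6[101:200]
# Map the Ensembl IDs; the result has one row per input ID
df = grex(id)
tail(df)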
##          ensembl_id entrez_id hgnc_symbol                                   hgnc_name
## 95  ENSG00000272153      <NA>        <NA>                                        <NA>
## 96  ENSG00000116198      9731      CEP104                  centrosomal protein 104kDa
## 97  ENSG00000169598      1677        DFFB       DNA fragmentation factor subunit beta
## 98  ENSG00000264428      <NA>        <NA>                                        <NA>
## 99  ENSG00000198912    339448    C1orf174         chromosome 1 open reading frame 174
## 100 ENSG00000236423 100133612   LINC01134 long intergenic non-protein coding RNA 1134
##     cyto_loc uniprot_id
## 95      <NA>       <NA>
## 96   1p36.32 A0A024R4G3
## 97    1p36.3     B4DZS0
## 98      <NA>       <NA>
## 99   1p36.32     Q8IYL3
## 100  1p36.32       <NA>

Elements that cannot be mapped are returned as NA.

To keep only the genes with a mapped Entrez ID:

filtered_genes = df[!is.na(df$entrez_id), c('ensembl_id', 'entrez_id')]
head(filtered_genes)
##         ensembl_id entrez_id
## 1  ENSG00000175756     54998
## 3  ENSG00000221978     81669
## 4  ENSG00000224870    148413
## 5  ENSG00000242485     55052
## 8  ENSG00000235098    441869
## 10 ENSG00000205116    643965

If you want to start from the raw GENCODE gene IDs provided by GTEx (e.g. ENSG00000227232.4), the function cleanid() can remove the trailing .version part to produce plain Ensembl IDs.
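To make the effect of removing the version suffix concrete, here is a minimal base R sketch. The helper strip_version() is hypothetical and is not the package's implementation of cleanid(). It only assumes that the version is a trailing dot followed by digits.

```r
# One versioned GENCODE-style ID and one ID that has no version suffix.
raw_id = c("ENSG00000227232.4", "ENSG00000116198")

# Hypothetical helper: remove a trailing ".<digits>" suffix when present.
strip_version = function(x) sub("\\.[0-9]+$", "", x)

ensembl_id = strip_version(raw_id)
ensembl_id
```

IDs without a version suffix pass through unchanged, so the helper is safe to apply to a mixed vector.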

4 What’s Next?

Usually, the next step is removing (or imputing) the genes with NA IDs, and then selecting the genes to keep. Notably, in the complete gene read count data there are about 100 cases where multiple Ensembl IDs map to a single Entrez ID. Such genes may need additional post-processing.
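These post-processing steps might look like the following sketch. It assumes a data frame with the same ensembl_id and entrez_id columns as the grex() output, and the rows are made up for illustration. How to resolve the shared Entrez IDs (keep one gene, merge counts, or drop them) depends on the analysis and is left open here.

```r
# Toy mapping result; the IDs are hypothetical, for illustration only.
df = data.frame(
  ensembl_id = c("ENSG_A", "ENSG_B", "ENSG_C", "ENSG_D", "ENSG_E"),
  entrez_id  = c("101", NA, "202", "101", "303"),
  stringsAsFactors = FALSE
)

# Step 1: drop genes without a mapped Entrez ID.
mapped = df[!is.na(df$entrez_id), ]

# Step 2: find Entrez IDs that are shared by more than one Ensembl ID.
n_per_entrez = table(mapped$entrez_id)
shared = names(n_per_entrez)[n_per_entrez > 1]

# Rows that still need a decision, and rows that are already one-to-one.
ambiguous  = mapped[mapped$entrez_id %in% shared, ]
unique_map = mapped[!mapped$entrez_id %in% shared, ]
```

Here ENSG_A and ENSG_D both map to Entrez ID 101 and end up in ambiguous, while ENSG_C and ENSG_E remain in unique_map.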

5 Acknowledgements

We thank the members of the Stephens lab (Kushal K Dey, Lei Sun, Michael Turchin) for their valuable suggestions and helpful discussions on this problem. Project website: http://nanx.me/grex

References

Carlson, Marc. 2015. org.Hs.eg.db: Genome Wide Annotation for Human.

Lonsdale, John, Jeffrey Thomas, Mike Salvatore, Rebecca Phillips, Edmund Lo, Saboor Shad, Richard Hasz, et al. 2013. “The Genotype-Tissue Expression (GTEx) Project.” Nature Genetics 45 (6). Nature Publishing Group: 580–85.