Data Integration Summary


Data Collection

We identified lncRNAs databases by conducting a broadly cast literature search of the PubMed database through February 28th, 2019 with the following keyword algorithm: (lncrna or long noncoding or long non-coding rna or noncoding) and (annotation or function or database).
We downloaded lncRNAs annotation files in gene transfer format (GTF) or gene feature format (GFF) from all six annotation databases (links in Table 1) identified by the comprehensive analytical review above. To streamline the data integration step all the GTF or GFF annotations were parsed to the same format using the following steps: (i) if necessary, we updated the coordinates of annotation using the UCSC liftOver tool to hg38, and (ii) for each chromosome, we split the gene and transcript records into individual files named by chromosome, strand, start and stop base pair locations. Each gene block file contained the transcripts information and the transcript block file contained the exons information. In cases where the annotation file did not have any genes information (only containing transcripts or exons records) we used the gene ids in the transcripts or exons records to get the first and last exon, then manually created a gene entry using the base pair locations of the first exon (as gene start), of the last exon (as gene stop), and transcript strand to represent the gene strand. We also removed redundant records from all annotation files. Using CHESS2.1 as the reference annotation database (containing both protein-coding and lncRNAs genes) we used a cumulative stepwise intersection method to merge it with the rest of the five lncRNAs annotation databases in this order: (i) FANTOM5.0.v3, (ii) LNCipedia5.2,  (iii) NONCODEv5.0, (iv) MiTranscriptomev2 and (v) BIGTranscriptomev1 at the genes and transcripts levels.



  CHESS2.1 LNCipedia5.2 NONCODEv5.0 FANTOM5.0.v3 MiTranscriptomev2 BIGTranscriptomev1 lncRNAKB
Annotation source file name chess2.1.gff lncipedia 5_2_hc_hg38.gtf NONCODEv5 human hg38_lncRNA.gtf FANTOM_CAT.lv3 robust.only_lncRNA.gtf mitranscriptome hg19.v2.gtf BIGTranscriptome lncRNA catalog.hg19.gtf lncRNAKB hg38_v6.gtf
Website URL URL URL URL URL URL URL
Reference [PMID] Pertea et al. , 2018 [30486838] Volders et al. , 2013, 2015, 2019 [30371849] Fang et al. , 2018 [29140524] Hon et al. , 2017 [28241135] Iyer et al. , 2015 [25599403] You et al. , 2017 [28396519] Seifuddin et al.
Genome reference build/version hg38 hg19, hg38 hg19, hg38 hg19 hg19 hg19 hg38
Annotation method/source GENCODE (release 25 and 27), RefSeq, FANTOM5, RNA-seq (GTEx-phs000424.v6.p1 in May of 2016), transcript assembly, mass spectrometry (validation) Ensembl, RNA-seq (Human BodyMap 2.0 lincRNAs), LncRNAdb, GENCODE (release 13), RefSeq-Dec2014, RefSeq-NCBI (release 106), Nielsen et al., Hangauer et al., NONCODE, Sun and Gadad et al., 2015, FANTOM CAT Ensembl, RefSeq, lncRNAdb, LNCipedia, RNA-seq (Human BodyMap 2.0 lincRNAs), Exosome Expression Profile, old versions of NONCODE GENCODE (release 19), Human BodyMap 2.0, miTranscriptome, ENCODE and an RNA-seq assembly from 70 FANTOM5 samples, cap analysis of gene expression (CAGE) data 7,256 RNA sequencing (RNA-seq) libraries from tumors, normal tissues and cell lines comprising over 43 Tb of sequence from 25 independent studies including ENCODE, TCGA, Human BodyMap 2.0, proteomics (validation) ENCODE, TCGA, GTEx, Human BodyMap 2.0, the Human Protein Atlas, GENCODE (release 19), RefSeq, PacBio CHESS2.1, LNCipedia5.2, NONCODEv5.0, FANTOM5.0.V3, MiTranscriptomev2, BIGTranscriptomev1
Number of lncRNAs (genes) 18,887 56,946 96,308 27,871 63,505 14,090 77,199
Number of lncRNAs (transcripts) 56,927 127,802 172,216 89,833 175,259 26,591 224,286
Number of lncRNAs (exons) 159,891 357,620 429,240 251,201 539,840 87,316 611,340
Number of protein-coding genes 22,518 - - - - - 22,518
Tissue-specific Expression/score yes no yes yes yes yes yes
Tissue-specific Expression Quantitative Trait Loci (eQTLs) no no no no no no yes
UCSC genome browser/Custom Genome Browser Track no yes yes yes yes no yes
External Gene information/links (Gene Cards or RefSeq or Ensembl or UCSC) yes yes yes yes yes yes yes
Coding potential prediction/score no yes no yes yes yes yes
Conservation information no yes yes yes yes no yes
Gene-level functional annotation, mRNA co-expression, pathway enrichment analysis no no yes yes yes no yes



Table 1: Summary of lncRNA annotation databases that have been integrated into the lncRNAKB and the types of data included in each resource.



Architecture of the database


The 3-tier server architecture model containing data, logic and presentation tiers has been implemented as shown in Figure 2. The popular MySQL open source relational database management system (RDBMS) has been employed for the data tier, expanded with a NoSQL document storage. NoSQL document storage is a JSON-based (JavaScript Object Notation) data structure format and as such has a flexible dynamic structure with no schema constraints which makes it suitable for literature and document storage. The MySQL RDBMS (version 8.0) is ideal for data indexing and a powerful query system for relational data. The logic tier is responsible for the communication between the user queries from the presentation tier and fetching the outcome from the data tier, as well as data integration from MySQL and NoSQL data sources.  The presentation tier contains several modules based on AJAX (Asynchronous JavaScript and XML), jQuery (JavaScript Query system version 3.3.1 - https://jquery.com/), and the PHP server-side scripting language (version 7.1.18.), as well as the CSS (Cascading Style Sheets) code to describe how HTML elements are to be displayed on user side web interface. JQuery and AJAX have the advantage of asynchronous background calls to the logic tier, native JSON parsing, and dynamic rendering of the browser display, which makes the data retrieval system perform more efficiently.




Classification/Annotation coding potential of lncRNAs using Random Forest


We used FEELnc (FlExible Extraction of LncRNAs) (Wucher et al., 2017) to classify/annotate and calculate the coding potential of lncRNAs in the lncRNAKB. FEELnc annotates lncRNAs based on a machine learning method, Random Forest (RF) (Breiman, 2001), trained with general features such as multi k-mer frequencies, RNA sequence length and open reading frames (ORFs) size. It is comprised of three modules: (i) filter, (ii) coding potential, and (iii) classifier.



Tissue-specific expression profiling and expression quantitative trait loci (eQTLs)


Expression profiling

We analyzed the largest and most comprehensive tissue-specific RNA-seq data available through the Genotype Tissue Expression (GTEx) project (GTEx Consortium et al., 2017) to create a tissue-specific expression body map of human lncRNAs across all the genes in the GTF annotation file from lncRNAKB. We downloaded raw paired-end RNA-seq data (FASTQ files – GTEx Release v7) from the dbGap portal (study_id=phs000424.v7.p2) of 31 solid human normal tissues. For each solid tissue, quality control of paired-end reads were assessed using FastQC tools (Andrews), adapter sequences and low-quality bases were trimmed using Trimmomatic (Bolger et al., 2014) and aligned to latest version of the human reference genome (H. sapiens, GRCh38) using the latest version of HISAT2 (Kim et al., 2015), which is a splice-aware aligner that maps reads to the reference. Using uniquely aligned reads to the human genome, gene-level expression (raw read counts) were generated with the featureCounts software (Liao et al., 2014), which assigns reads to features in a fast and parallelizable framework. After visualizing the distribution of uniquely mapped paired-end reads assigned to genes across all the GTEx samples we chose to exclude samples with < 106 reads assigned to genes. In addition, there were samples with data that we could not map or download. We normalized the raw read counts to Transcripts Per Kilobase Million (TPM) (Wagner et al., 2012) (https://www.rna-seqblog.com/rpkm-fpkm-and-tpm-clearly-explained/). For each gene in the lncRNAKB annotation database, we created a box plot distribution to visualize its tissue-specific expression pattern across all tissues.

Principal Component Analysis

PCA To explore gene expression similarity between tissues and across GTEx samples as well as summarize lncRNAs tissue-specific expression we performed a principal component analysis (PCA) (Son et al., 2018). We used the normalized TPM expression values, transformed by taking the 〖log〗_2 (TPM+1), across all lncRNAs (n = 77,199) and tissues (n = 31) (no filters applied). We used the prcomp package in R (Team, 2012).

Tissue-specificity scores

In addition to visualization, we calculated two tissue-specificity metrics (Tau and Preferential Expression Measure (PEM)) (Kryuchkova-Mostacci and Robinson-Rechavi, 2017; Russ and Futschik, 2010)using the normalized TPM expression values. Tau summarizes in a single number whether a gene is tissue-specific or ubiquitously expressed and PEM shows for each tissue separately how specific the gene is to that tissue. The PEM scores the expression of a gene in a given tissue in relation to its average expression in all tissues. To compute Tau and PEM, we calculated and used the average expression across all replicates for each gene by tissue. All genes that were not expressed in at least one tissue were removed from the analysis. For comparison purposes with Tau, we used the maximum specificity value of PEM (normalized between 0 to 1) for a certain gene among all tissues.



Genotype file processing

We also downloaded whole genome sequence (WGS) data (Variant Call Format (VCF) file – GTEx Release v7) from the dbGap portal (study_id=phs000424.v7.p2) to conduct tissue-specific expression quantitative trait loci (eQTL) analysis. We created an eQTL body map of human lncRNAs, across all the genes in the GTF annotation file from lncRNAKB. We preprocessed the VCF file using the following steps with a combination of PLINKv1.9 (Chang et al., 2015; Purcell and Chang) vcfv0.1.15 (Danecek et al., 2011) and bcfv1.9 tools (Narasimhan et al., 2016): (i) removed indels; (ii) excluded missing and multi-allelic variants; (iii) selected "FILTER == 'PASS'" variants; (iv) excluded variants with minor allele frequency (MAF) < 5%; (v) updated the coordinates of single nucleotide polymorphisms (SNPs) using the UCSC liftOver tool (Casper et al., 2018) from hg19 to hg38 (latest genome build); (vi) changed the SNPs IDs to dbSNP (Sherry et al., 2001) rsID using dbSNP Build 151; (vii) converted to bed, bim and fam format. For each solid tissue, we only selected subjects that had both WGS data and gene expression data. We generated a subset of the VCF files by tissue and re-calculated the MAF to exclude variants with MAF < 5%. After converting to ped and map format, we ran principal component analysis (PCA) on each tissue to get a set of genotype covariates using eigensoftv6.1.4 (Price et al., 2006; Patterson et al., 2006).

eQTL analysis

For each solid tissue, we implemented a two-step filtering approach using the raw gene expression counts matrix quantified across all the genes in the lncRNAKB GTF annotation file. First, we filtered genes based on the normalized TPM expression values, keeping genes with TPM > 0.50 in at least 20% of the samples. Second, we filtered genes based on the raw counts, keeping genes with counts > 2 in at least 20% of the samples. The edgeR (Robinson et al., 2010) package in R (Team, 2012) was used to process the filtered read counts into log2 counts per million (log2CPM) and the limma-voom R package (Ritchie et al., 2015) (Law et al., 2014) was used to normalize the data between samples using trimmed mean of M-values (TMM) (Robinson and Oshlack, 2010). The expression files were then sorted by gene start and stop, compressed with BGZIP and indexed with TABIX (Li, 2011). Only tissues with > 80 samples were included in the cis-eQTL analysis. In the eQTL analysis, we included the first five principal components (PCs) that explained the most variation in the genotype data by looking at their scree plots by tissue. Sex was also included as a covariate. Within each tissue, cis-eQTLs were identified by linear regression, as implemented in FastQTLv2.0 (threaded option) (Ongen et al., 2016), adjusting for the five PCs and sex. We restricted our search to variants within 1 Megabase (Mb) of the transcription start site (TSS) of each gene and in the tissue of analysis. To evaluate the significance of the most highly associated variant per gene we used the adaptive permutations option in FastQTL between 1000 and 10000 permutations. Once we obtained the permutation p-values for all the genes, we accounted for multiple testing to determine the significant cis-eQTLs. We used the Benjamini and Hochberg correction method (Haynes, 2013) to calculate the false discovery rate (FDR) in R statistical programming language (R) (Team, 2012). For each tissue, all cis-eQTL results were visualized using a manhattan plot created using the qqman package in R (Turner, 2014).



Conservation


Conservation of exons between protein-coding genes and lncRNAs in the lncRNAKB annotation database was analyzed using the bigWigAverageOverBed (Pohl and Beato, 2014) and the cons30way (hg38) track (Siepel et al., 2005) both downloaded from the UCSC genome browser. This track shows multiple alignments of 30 vertebrate species and measurements of evolutionary conservation using two methods (phastCons and phyloP (Cooper et al., 2005)) from the PHAST package (Hubisz et al., 2011) for all thirty species. The multiple alignments were generated using multiz (Blanchette et al., 2004) and other tools in the UCSC/Penn State Bioinformatics comparative genomics alignment pipeline. An exon-level BED file was created using the lncRNAKB GTF annotation file separately for protein-coding genes and lncRNAs. We merged overlapping exons within transcripts to avoid counting conservation scores of overlapping base pairs more than once. For each exon, the bigWigAverageOverBed function calculates the average conservation score across all base pairs. Using boxplots we visualized and compared the average conservation score differences between lncRNAs and protein-coding exons.



The output columns in the conservation table are:
name - name field from bed, which should be unique
size - size of bed (sum of exon sizes
covered - number of bases within exons covered by bigWig
sum - sum of values over all bases covered
mean0 - average over bases with non-covered bases counting as zeroes
mean - average over just covered bases




Functional characterization of lncRNAs using a network-based approach


We used the “guilt by association” principle to functionally characterize lncRNAs in the lncRNAKB across all 31 solid human normal tissues in the GTEx data (Ehsani and Drabløs, 2018). This method is widely used to identify well-annotated genes that seem to be involved in some of the same processes as a given un-annotated gene. It is based on comparison of gene expression profiles between lncRNAs and mRNAs using metrics such as Pearson or Spearman correlation, applying a specific cut-off, and performing enrichment analysis of annotation terms in the most highly ranked mRNAs. The significantly enriched terms can be used as an estimate of annotation for the lncRNAs.

Using the filtered log2CPM and TMM normalized gene expression data (see Methods: Tissue-specific expression profiling and expression quantitative trait loci (eQTLs), subsection: eQTL analysis), we used the weighted gene co-expression network analysis (WGCNA) approach (Langfelder and Horvath, 2008) as implemented in the Co-Expression Modules identification Tool (CEMiTool) package in R (Russo et al., 2018) to identify modules of lncRNA-mRNA pairs that are co-expressed and therefore likely work in concert to carry out various biological functions.