Tissue-specific expression profiling and expression quantitative trait loci (eQTLs)
We analyzed the largest and most comprehensive tissue-specific RNA-seq data available through the Genotype Tissue Expression (GTEx) project
(GTEx Consortium et al., 2017) to create a tissue-specific expression body map of human lncRNAs across all the genes in the GTF annotation file from lncRNAKB. We downloaded raw paired-end RNA-seq data (FASTQ files – GTEx Release v7) from the dbGap
portal (study_id=phs000424.v7.p2) of 31 solid human normal tissues. For each solid tissue, quality control of paired-end reads were assessed using FastQC tools (Andrews), adapter sequences and low-quality bases were trimmed using Trimmomatic (Bolger
et al., 2014) and aligned to latest version of the human reference genome (H. sapiens, GRCh38) using the latest version of HISAT2 (Kim et al., 2015), which is a splice-aware aligner that maps reads to the reference. Using uniquely aligned reads to
the human genome, gene-level expression (raw read counts) were generated with the featureCounts software (Liao et al., 2014), which assigns reads to features in a fast and parallelizable framework. After visualizing the distribution of uniquely mapped
paired-end reads assigned to genes across all the GTEx samples we chose to exclude samples with < 106 reads assigned to genes. In addition, there were samples with data that we could not map or download. We normalized the raw read counts to Transcripts
Per Kilobase Million (TPM) (Wagner et al., 2012) (https://www.rna-seqblog.com/rpkm-fpkm-and-tpm-clearly-explained/). For each gene in the lncRNAKB annotation database, we created a box plot distribution to visualize its tissue-specific expression
pattern across all tissues.
Principal Component Analysis
PCA To explore gene expression similarity between tissues and across GTEx samples as well as summarize lncRNAs tissue-specific expression we performed a principal component analysis (PCA) (Son et al., 2018). We used the normalized TPM expression values, transformed by taking the 〖log〗_2 (TPM+1), across all lncRNAs (n = 77,199) and tissues (n = 31) (no filters applied). We used the prcomp package in R (Team, 2012).
In addition to visualization, we calculated two tissue-specificity metrics (Tau and Preferential Expression Measure (PEM)) (Kryuchkova-Mostacci and Robinson-Rechavi, 2017; Russ and Futschik,
2010)using the normalized TPM expression values. Tau summarizes in a single number whether a gene is tissue-specific or ubiquitously expressed and PEM shows for each tissue separately how specific the gene is to that tissue. The PEM scores the expression
of a gene in a given tissue in relation to its average expression in all tissues. To compute Tau and PEM, we calculated and used the average expression across all replicates for each gene by tissue. All genes that were not expressed in at least one
tissue were removed from the analysis. For comparison purposes with Tau, we used the maximum specificity value of PEM (normalized between 0 to 1) for a certain gene among all tissues.
Genotype file processing
We also downloaded whole genome
sequence (WGS) data (Variant Call Format (VCF) file – GTEx Release v7) from the dbGap portal (study_id=phs000424.v7.p2) to conduct tissue-specific expression quantitative trait loci (eQTL) analysis. We created an eQTL body map of human lncRNAs, across
all the genes in the GTF annotation file from lncRNAKB. We preprocessed the VCF file using the following steps with a combination of PLINKv1.9 (Chang et al., 2015; Purcell and Chang) vcfv0.1.15 (Danecek et al., 2011) and bcfv1.9 tools (Narasimhan
et al., 2016): (i) removed indels; (ii) excluded missing and multi-allelic variants; (iii) selected "FILTER == 'PASS'" variants; (iv) excluded variants with minor allele frequency (MAF) < 5%; (v) updated the coordinates of single nucleotide polymorphisms
(SNPs) using the UCSC liftOver tool (Casper et al., 2018) from hg19 to hg38 (latest genome build); (vi) changed the SNPs IDs to dbSNP (Sherry et al., 2001) rsID using dbSNP Build 151; (vii) converted to bed, bim and fam format. For each solid tissue,
we only selected subjects that had both WGS data and gene expression data. We generated a subset of the VCF files by tissue and re-calculated the MAF to exclude variants with MAF < 5%. After converting to ped and map format, we ran principal component
analysis (PCA) on each tissue to get a set of genotype covariates using eigensoftv6.1.4 (Price et al., 2006; Patterson et al., 2006).
For each solid tissue, we implemented a two-step filtering approach using the raw gene expression
counts matrix quantified across all the genes in the lncRNAKB GTF annotation file. First, we filtered genes based on the normalized TPM expression values, keeping genes with TPM > 0.50 in at least 20% of the samples. Second, we filtered genes based
on the raw counts, keeping genes with counts > 2 in at least 20% of the samples. The edgeR (Robinson et al., 2010) package in R (Team, 2012) was used to process the filtered read counts into log2 counts per million (log2CPM) and the limma-voom
R package (Ritchie et al., 2015) (Law et al., 2014) was used to normalize the data between samples using trimmed mean of M-values (TMM) (Robinson and Oshlack, 2010). The expression files were then sorted by gene start and stop, compressed with BGZIP
and indexed with TABIX (Li, 2011). Only tissues with > 80 samples were included in the cis-eQTL analysis. In the eQTL analysis, we included the first five principal components (PCs) that explained the most variation in the genotype data by looking
at their scree plots by tissue. Sex was also included as a covariate. Within each tissue, cis-eQTLs were identified by linear regression, as implemented in FastQTLv2.0 (threaded option) (Ongen et al., 2016), adjusting for the five PCs and sex. We
restricted our search to variants within 1 Megabase (Mb) of the transcription start site (TSS) of each gene and in the tissue of analysis. To evaluate the significance of the most highly associated variant per gene we used the adaptive permutations
option in FastQTL between 1000 and 10000 permutations. Once we obtained the permutation p-values for all the genes, we accounted for multiple testing to determine the significant cis-eQTLs. We used the Benjamini and Hochberg correction method (Haynes,
2013) to calculate the false discovery rate (FDR) in R statistical programming language (R) (Team, 2012). For each tissue, all cis-eQTL results were visualized using a manhattan plot created using the qqman package in R (Turner, 2014).