Created by Yanyu Liang, with some material adapted from Charles Washington III and Jiamao Zheng’s lab 6 from 2018
As we know that eQTL analysis needs to test association for each gene in transcriptome against a set of variants. Typically, for cis-eQTL analysis, one gene could have thousands of variants to test against. Matrix eQTL @matrixqtl was developed to create a computationally less burdensome analysis for eQTL identification. Compared to other existing tools for QTL analysis, Matrix eQTL is orders of magnitude faster in analysis due to specific preprocessing and its use of large matrix operations for the computationally difficult aspects of the system.
To compare between genes which habor different LD structures and different number of testing variants, the statistical test underlying cis-eQTL requires permutation. So it is computationally intensive even with matrixQTL. To overcome the computational burden of permutation, FastQTL @fastqtl proposed an approximate simulation scheme which is efficient for large-scale transcriptome-wide analysis, i.e. Geuvadis & GTEx.
In the past few years, GPU has been widely adpated for many computations.
TensorQTL @tensorqtl was recently developed which is capable of running on both CPU and GPU.
And when GPU is enabled, it runs 100 times faster than CPU.
Today, we will learn to use tensorqtl
with CPU. (And to run on GPU needs only few more setup steps with the same command!)
PrediXcan @predixcan is a gene-based association method that directly tests the molecular mechanisms through which genetic variation affects a phenotype. It can also run with GWAS summary statistics along with reference LD, which is proposed in @metaxcan with software available at MetaXcan repo. In the second part of the lab, we will see how to run S-PrediXcan analysis using GWAS summary statistics.
By the end of the lab you should be able to:
- Understand the types of files required for tensorQTL and MetaXcan
- Interpret the results of tensorQTL and S-PrediXcan analysis
tensorQTL
Input files for eQTL analysis
- Phenotype: a matrix representing gene expression levels in each individual (gene x individual)
- Covariate: a matrix representing value of covariate values in each individual (covariate x individual)
- Genotype: a matrix representing genotype dosage (effect allele) for each variant and individual (in plink format)
Note that we need to know the genomic position of the gene (say the position of transcription start site, TSS) since we test cis-eQTL for nearby variant only. The example data for this lab is at /project2/hgen47100/data/lab6
:
- Phenotype:
GEUVADIS.chr22.expression.bed.gz
- Covariate:
GEUVADIS.445_samples.covariates.txt
- Genotype:
GEUVADIS.hg38.chr22.*
Problem 1: How many covariates in the example data?
Compute nomial p-value in cis-eQTL analysis
Nominal p-value is the observed p-value under linear model \(\tilde{Y} \sim X\), where \(\tilde{Y}\) is residual expression level after regressing out covariates and \(X\) is the genotype dosage of a variant of interest.
Let’s compute nomimal p-value for all cis-eQTL candidates. Here we define cis-window as 10kb surrounding TSS (both side).
sinteractive # request a computing node and work interactively
module load Anaconda3 # load conda
source activate /project2/hgen47100/software2/conda_env/tensorqtl # load dependencies for tensorqtl
cd [working-directory]
mkdir output
python /project2/hgen47100/software2/tensorqtl/tensorqtl/tensorqtl.py \
--covariates /project2/hgen47100/data/lab6/GEUVADIS.445_samples.covariates.txt \
--window 10000 \
--mode cis_nominal \
/project2/hgen47100/data/lab6/GEUVADIS.hg38.chr22 \
/project2/hgen47100/data/lab6/GEUVADIS.chr22.expression.bed.gz \
output/cis_nominal
Problem 2
From the logging message of tensorqtl run, how many genes are being analyzed?
The output contains all variant/gene pairs being test regardless of significance.
So, it will be huge amount of data in practice.
The output file is in parquet
format, which is a binary format but it gives better I/O performance as comparing to human-readable text file.
We’ve provided a tiny python script to convert parquet
file to text table in txt.gz
.
python /project2/hgen47100/software2/parquet2table.py \
--parquet output/cis_nominal.cis_qtl_pairs.chr22.parquet \
--output output/cis_nominal.cis_qtl_pairs.chr22.txt.gz
Problem 3
How many variant/gene pairs are being tested and reported?
Problem 4
Which genes has the strongest association?
Perform cis-eQTL analysis with adaptive permutation
If we’d like to identify eGene (gene that is significantly regulated by genetic variation), like we’ve mentioned above, we need to perform permutation to obtain gene-level p-value.
Here is how it can be done using tensorqtl
.
python /project2/hgen47100/software2/tensorqtl/tensorqtl/tensorqtl.py \
--covariates /project2/hgen47100/data/lab6/GEUVADIS.445_samples.covariates.txt \
--window 10000 \
--mode cis \
/project2/hgen47100/data/lab6/GEUVADIS.hg38.chr22 \
/project2/hgen47100/data/lab6/GEUVADIS.chr22.expression.bed.gz \
output/cis
The output is the gene-level statistics obtained from adaptive permutation where each row is for one gene (in txt.gz format).
To obtain eGene as FDR 10%, we can collect all genes with qval
smaller than 0.1.
To obtain cis-eQTL for these eGenes, we can collect all variant/gene pairs with pval_nominal
(reported in cis_nominal
run) smaller than pval_nominal_threshold
.
Problem 5
Which gene has the most significant q-value?
Problem 6
Select a gene with q-value < 0.05, visualize its cis-eQTL results by plotting \(-\log(p)\)
on y-axis and distance to TSS on x-axis. And put a harizontal line indicating the corresponing pval_nominal_threshold
of the gene.