LDSC regression
In this exercise, you will use LD score regression method to calculate the chip heritability of two GWAS phenotypes from the UK Biobank data and look for evidence of population stratification.
LDSC documentation
To get started on using the software for this problem, you may find this document helpful (https://github.com/bulik/ldsc/wiki/Heritability-and-Genetic-Correlation).
For this problem, the precalculated LD scores can be found in /project2/bios25328/Data/LDSC/eur_w_ld_chr/
.
Questions
Write down the relationship between the expected value of the \(\chi^2\) statistic and heritability, number of markers, sample size, population stratification effect (the LD score regression formula)
Pick two phenotypes from (https://uchicago.box.com/s/tkoya2h769hyvs8e2texokpsw8qqoop3). Additional information about the phenotypes can be found in this spreadsheet https://docs.google.com/spreadsheets/d/1KrZsK5ayUXkK93Ib4frI9ZHk64_4MTIrVa553mY76w0.
Calculate heritability. What is the intercept for each trait? How do you interpret the values?
Calculate genetic correlation between the two UKB traits. Did you expect that? Comment.
Solution
Below are the steps provided by a Duhman with Natasha’s help (2/2021) on how to run LDSC on Midway.
# Use the computation node to make the process faster
sinteractive --account=bios25328 --ntasks=1 --mem-per-cpu=8000
Below change the directory name and phenotypes accordingly.
Create the working direectory /project2/bios25328/users/username/lab_ldsc
# define environmental variables for convenience
WORK=/project2/bios25328/users/username/lab_ldsc
DATA=/project2/bios25328/Data/LDSC
SOFTWARE=/project2/bios25328/software
## create the working directory if you don't have it
mkdir $WORK
cd $WORK
# load python
module load python
# load conda
module load python/anaconda-2020.02
Load the necessary software with the right version by activating the conda environment
# activate the conda environment (this environment has the necessary dependencies, versions, and software)
conda activate $SOFTWARE/ldsc/env/ldsc
Data formatting
The following step can be skipped because the files are already in the right format in $DATA.
# munge the first phenotype file. Note that I included the file paths so I can do this either from my work directory or from the conda directory natasha created
python $SOFTWARE/ldsc/munge_sumstats.py \
--sumstats /project2/bios25328/users/username/lab_ldsc/UKB_20127_Neuroticism_score.txt.gz \
--snp variant_id \
--a1 effect_allele \
--a2 non_effect_allele \
--p pvalue \
--frq frequency \
--N-col sample_size \
--chunksize 500000 \
--merge-alleles /project2/bios25328/users/username/lab_ldsc/w_hm3.snplist \
--out /project2/bios25328/users/username/lab_ldsc/neuroticism
# The output will look something like this
*********************************************************************
* LD Score Regression (LDSC)
* Version 1.0.1
* (C) 2014-2019 Brendan Bulik-Sullivan and Hilary Finucane
* Broad Institute of MIT and Harvard / MIT Department of Mathematics
* GNU General Public License v3
*********************************************************************
Call:
./munge_sumstats.py \
--out /project2/bios25328/users/username/lab_ldsc/neuroticism \
--merge-alleles /project2/bios25328/users/username/lab_ldsc/w_hm3.snplist \
--frq frequency \
--N-col sample_size \
--chunksize 500000 \
--a1 effect_allele \
--a2 non_effect_allele \
--snp variant_id \
--sumstats /project2/bios25328/users/username/lab_ldsc/UKB_20127_Neuroticism_score.txt.gz \
--p pvalue
Interpreting column names as follows:
zscore: Z-score (0 --> no effect; above 0 --> A1 is trait/risk increasing)
non_effect_allele: Allele 2, interpreted as non-ref allele for signed sumstat.
sample_size: Sample size
frequency: Allele frequency
variant_id: Variant ID (e.g., rs number)
effect_allele: Allele 1, interpreted as ref allele for signed sumstat.
pvalue: p-Value
Reading list of SNPs for allele merge from /project2/bios25328/users/username/lab_ldsc/w_hm3.snplist
Read 1217311 SNPs for allele merge.
Reading sumstats from /project2/bios25328/users/username/lab_ldsc/UKB_20127_Neuroticism_score.txt.gz into memory 500000 SNPs at a time.
.................. done
Read 8856162 SNPs from --sumstats file.
Removed 7427968 SNPs not in --merge-alleles.
Removed 251724 SNPs with missing values.
Removed 0 SNPs with INFO <= 0.9.
Removed 0 SNPs with MAF <= 0.01.
Removed 0 SNPs with out-of-bounds p-values.
Removed 1 variants that were not SNPs or were strand-ambiguous.
1176469 SNPs remain.
Removed 2 SNPs with duplicated rs numbers (1176467 SNPs remain).
Removed 0 SNPs with N < 182738.666667 (1176467 SNPs remain).
Median value of zscore was 0.00744685390964, which seems sensible.
Removed 42 SNPs whose alleles did not match --merge-alleles (1176425 SNPs remain).
Writing summary statistics for 1217311 SNPs (1176425 with nonmissing beta) to /project2/bios25328/users/username/lab_ldsc/neuroticism.sumstats.gz.
Metadata:
Mean chi^2 = 1.674
Lambda GC = 1.486
Max chi^2 = 85.543
1605 Genome-wide significant SNPs (some may have been removed by filtering).
Conversion finished at Fri Feb 26 15:55:08 2021
Total time elapsed: 1.0m:7.62s
# This is the command for the second phenotype. I only changed the file name
(/project2/bios25328/software/ldsc/env/ldsc) [muhtaseb@midway2-0046 lab_6]$ python $SOFTWARE/ldsc/munge_sumstats.py \
--sumstats /project2/bios25328/users/username/lab_ldsc/UKB_20016_Fluid_intelligence_score.txt.gz \
--snp variant_id \
--a1 effect_allele \
--a2 non_effect_allele \
--p pvalue \
--frq frequency \
--N-col sample_size \
--chunksize 500000 \
--merge-alleles /project2/bios25328/users/username/lab_ldsc/w_hm3.snplist \
--out /project2/bios25328/users/username/lab_ldsc/intelligence
# The outcome of this command will look similar
*********************************************************************
* LD Score Regression (LDSC)
* Version 1.0.1
* (C) 2014-2019 Brendan Bulik-Sullivan and Hilary Finucane
* Broad Institute of MIT and Harvard / MIT Department of Mathematics
* GNU General Public License v3
*********************************************************************
Call:
./munge_sumstats.py \
--out /project2/bios25328/users/username/lab_ldsc/intelligence \
--merge-alleles /project2/bios25328/users/username/lab_ldsc/w_hm3.snplist \
--frq frequency \
--N-col sample_size \
--chunksize 500000 \
--a1 effect_allele \
--a2 non_effect_allele \
--snp variant_id \
--sumstats /project2/bios25328/users/username/lab_ldsc/UKB_20016_Fluid_intelligence_score.txt.gz \
--p pvalue
Interpreting column names as follows:
zscore: Z-score (0 --> no effect; above 0 --> A1 is trait/risk increasing)
non_effect_allele: Allele 2, interpreted as non-ref allele for signed sumstat.
sample_size: Sample size
frequency: Allele frequency
variant_id: Variant ID (e.g., rs number)
effect_allele: Allele 1, interpreted as ref allele for signed sumstat.
pvalue: p-Value
Reading list of SNPs for allele merge from /project2/bios25328/users/username/lab_ldsc/w_hm3.snplist
Read 1217311 SNPs for allele merge.
Reading sumstats from /project2/bios25328/users/username/lab_ldsc/UKB_20016_Fluid_intelligence_score.txt.gz into memory 500000 SNPs at a time.
.................. done
Read 8856162 SNPs from --sumstats file.
Removed 7427968 SNPs not in --merge-alleles.
Removed 251724 SNPs with missing values.
Removed 0 SNPs with INFO <= 0.9.
Removed 0 SNPs with MAF <= 0.01.
Removed 0 SNPs with out-of-bounds p-values.
Removed 1 variants that were not SNPs or were strand-ambiguous.
1176469 SNPs remain.
Removed 2 SNPs with duplicated rs numbers (1176467 SNPs remain).
Removed 0 SNPs with N < 72545.3333333 (1176467 SNPs remain).
Median value of zscore was -0.00853457115591, which seems sensible.
Removed 42 SNPs whose alleles did not match --merge-alleles (1176425 SNPs remain).
Writing summary statistics for 1217311 SNPs (1176425 with nonmissing beta) to /project2/bios25328/users/username/lab_ldsc/intelligence.sumstats.gz.
Metadata:
Mean chi^2 = 1.556
Lambda GC = 1.445
Max chi^2 = 68.204
551 Genome-wide significant SNPs (some may have been removed by filtering).
Conversion finished at Fri Feb 26 16:00:15 2021
Total time elapsed: 1.0m:5.66s
Run LDSC regression
# LDSC regression of the two phenotypes:
python $SOFTWARE/ldsc/ldsc.py \
--ref-ld-chr $DATA/eur_w_ld_chr/ \
--out $WORK/neuroticism_intelligence \
--rg $DATA/UKB_20127_Neuroticism_score.sumstats.gz,$DATA/UKB_20016_Fluid_intelligence_score.sumstats.gz \
--w-ld-chr $DATA/eur_w_ld_chr/
The outcome of running the code is below
*********************************************************************
* LD Score Regression (LDSC)
* Version 1.0.1
* (C) 2014-2019 Brendan Bulik-Sullivan and Hilary Finucane
* Broad Institute of MIT and Harvard / MIT Department of Mathematics
* GNU General Public License v3
*********************************************************************
Call:
./ldsc.py \
--ref-ld-chr eur_w_ld_chr/ \
--out /project2/bios25328/users/username/lab_ldsc/neuroticism_intelligence \
--rg /project2/bios25328/users/username/lab_ldsc/neuroticism.sumstats.gz,/project2/bios25328/users/username/lab_ldsc/intelligence.sumstats.gz \
--w-ld-chr eur_w_ld_chr/
Beginning analysis at Fri Feb 26 17:49:29 2021
Reading summary statistics from /project2/bios25328/users/username/lab_ldsc/neuroticism.sumstats.gz ...
Read summary statistics for 1176425 SNPs.
Reading reference panel LD Score from eur_w_ld_chr/[1-22] ... (ldscore_fromlist)
Read reference panel LD Scores for 1290028 SNPs.
Removing partitioned LD Scores with zero variance.
Reading regression weight LD Score from eur_w_ld_chr/[1-22] ... (ldscore_fromlist)
Read regression weight LD Scores for 1290028 SNPs.
After merging with reference panel LD, 1168320 SNPs remain.
After merging with regression SNP LD, 1168320 SNPs remain.
Computing rg for phenotype 2/2
Reading summary statistics from /project2/bios25328/users/username/lab_ldsc/intelligence.sumstats.gz ...
Read summary statistics for 1217311 SNPs.
After merging with summary statistics, 1168320 SNPs remain.
1168320 SNPs with valid alleles.
/project2/bios25328/software/ldsc/ldscore/sumstats.py:532: FutureWarning: Method .as_matrix will be removed in a future version. Use .values instead.
ref_ld = sumstats.as_matrix(columns=ref_ld_cnames)
/project2/bios25328/software/ldsc/ldscore/irwls.py:161: FutureWarning: `rcond` parameter will change to the default of machine precision times ``max(M, N)`` where M and N are the input matrix dimensions.
To use the future default and silence this warning we advise to pass `rcond=None`, to keep using the old, explicitly pass `rcond=-1`.
coef = np.linalg.lstsq(x, y)
Heritability of phenotype 1
---------------------------
Total Observed scale h2: 0.1183 (0.006)
Lambda GC: 1.4853
Mean Chi^2: 1.6707
Intercept: 1.0133 (0.0123)
Ratio: 0.0199 (0.0183)
# Heritability
Heritability of phenotype 2/2
-----------------------------
Total Observed scale h2: 0.2365 (0.0105)
Lambda GC: 1.4423
Mean Chi^2: 1.5503
Intercept: 1.0402 (0.0093)
Ratio: 0.073 (0.0168)
Genetic Covariance
------------------
Total Observed scale gencov: -0.0276 (0.0045)
Mean z1*z2: -0.1348
Intercept: -0.0441 (0.007)
# Genetic Correlation
Genetic Correlation
-------------------
Genetic Correlation: -0.1647 (0.0268)
Z-score: -6.147
P: 7.8983e-10
Summary of Genetic Correlation Results
p1 p2 rg se z p h2_obs h2_obs_se h2_int h2_int_se gcov_int gcov_int_se
/project2/bios25328/users/username/lab_ldsc/neuroticism.sumstats.gz /project2/bios25328/users/username/lab_ldsc/intelligence.sumstats.gz -0.1647 0.0268 -6.147 7.8983e-10 0.2365 0.0105 1.0402 0.0093 -0.0441 0.007
Analysis finished at Fri Feb 26 17:49:51 2021
Total time elapsed: 21.75s
To run locally
To install LDSC regression software, go to GitHub repository (https://github.com/bulik/ldsc) and follow the installation instructions in (https://github.com/bulik/ldsc#getting-started).
The installation requires conda
being pre-installed.
conda
is package manager for command line tool, Python, and R.
We highly encourage you to try conda
if you haven’t done so.
If you have trouble installing the software, please consult with your TA and/or instructors.