LD Score Regression

Yanyu Liang

2021-04-16

Categories: tutorial

\[ \newcommand{\E}{\text{E}} \]

Overview

This is a minial tutorial on using ldsc for LDSC regression. Let’s start by reviewing what LDSC regression does and what kind of input files are required for a minimal LDSC regression analysis.

Under polygenic model, B. K. Bulik-Sullivan et al. (2015) proposed

\[\begin{aligned} \E[\chi_j^2 | l_j] = N\frac{h^2}{M} l_j + Na + 1 \end{aligned}\] , where \(\chi^2\) is GWAS summary statistics, \(N\) is sample size, \(h^2\) is heritability, and \(M\) is the number of common variants we are considering for contributing to heritability. More importantly, this relationship holds for each SNP \(j\). And \(l_j\) is called LD score for SNP \(j\), which is defined as \(l_j := \sum_k r_{jk}^2\) where \(r_{jk}^2\) is the LD between SNP \(j\) and \(k\). So, we could notice that \(l_j\) is population-specific and it does not depend on the trait.

OK, essentially, to regress \(\chi_j^2\) against \(l_j\), we need to know:

\(\chi_j^2\) which is available from GWAS summary statistics
\(N\) GWAS sample size
\(l_j\) which is typically shared by LDSC developers and you can also compute on your own if you are working with a special population

From here we can conclude a general workflow for LDSC regression:

Download or prepare LDSC files
Format your GWAS so that it uses the same SNP ID system as LDSC files
Run LDSC regression software

As minimal tutorial, we won’t touch on how to prepare LDSC files by yourselves. For your interest, you can take a look at link for this task.

Instal LDSC regression software

If you don’t have conda installed on your machine, please install it from here before proceeding. Then, go to ldsc GitHub repository and follow the instruction.

# assuming you've installed conda
cd [your-working-dir]
mkdir software 
cd software 
git clone https://github.com/bulik/ldsc.git
cd ldsc
conda env create --file environment.yml
source activate ldsc

Download LDSC files

LDSC developers pre-computed LD scores for Europeans and East Asians. They are available at here. For this tutorial, we will work with European GWAS. So, we can simply use the one suggested here. These LDSC files are obtained from 1000G Europeans with HapMap3 SNPs.

So, we can do

cd [your-working-dir]
mkdir data
cd data
wget https://data.broadinstitute.org/alkesgroup/LDSCORE/eur_w_ld_chr.tar.bz2
tar -jxvf eur_w_ld_chr.tar.bz2

Prepare GWAS files

For this tutorial, we select GWAS summary statistics from link. For illustration, I downloaded UKB_20022_Birth_weight.txt.gz and UKB_21001_Body_mass_index_BMI.txt.gz. Let’s move the downloaded files to [your-working-dir]/gwas/.

ldsc requires the GWAS summary statistics files follow some format. It is described in details here.

In short, it requires your file to have the following columns:

SNP – SNP identifier (e.g., rs number or other SNP ID matching the LDSC files)
N – sample size (which may vary from SNP to SNP).
Z – z-score. Sign with respect to A1 (warning, possible gotcha)
A1 – first allele (effect allele)
A2 – second allele (other allele)

The recommended practice is to use munge_sumstats.py script shared with the software for this formatting step. For our dataset, we can run

cd [your-working-dir]
mkdir output
python software/ldsc/munge_sumstats.py \
  --sumstats data/gwas/UKB_20022_Birth_weight.txt.gz \
  --N-col sample_size \
  --snp variant_id \
  --a1 effect_allele \
  --a2 non_effect_allele \
  --signed-sumstats zscore,0 \
  --p pvalue \
  --merge-alleles data/eur_w_ld_chr/w_hm3.snplist \
  --out output/UKB_20022_Birth_weight

And run similar command for UKB_21001_Body_mass_index_BMI.txt.gz.

You may notice that the formatting script also does some quality control and sanity check. So, if possible, we’d prefer perform formatting with this script rather than writing our own.

Run LDSC regression

OK, it’s time to run LDSC regression.

cd [your-working-dir]
python software/ldsc/ldsc.py \
  --h2 output/UKB_20022_Birth_weight.sumstats.gz \
  --ref-ld-chr data/eur_w_ld_chr/ \
  --w-ld-chr data/eur_w_ld_chr/ \
  --out output/ldsc_UKB_20022_Birth_weight

--ref-ld-chr is specifying LDSC for the run. --w-ld-chr is specifying the weighting scheme in the linear regression (it is more about technical details).

And we can estimate genetic correlation under similar framework B. Bulik-Sullivan et al. (2015).

cd [your-working-dir]
python software/ldsc/ldsc.py \
  --rg output/UKB_20022_Birth_weight.sumstats.gz,output/UKB_21001_Body_mass_index_BMI.sumstats.gz \
  --ref-ld-chr data/eur_w_ld_chr/ \
  --w-ld-chr data/eur_w_ld_chr/ \
  --out output/rg_UKB_20022_Birth_weight_x_UKB_21001_Body_mass_index_BMI

References

Bulik-Sullivan, Brendan, Hilary K Finucane, Verneri Anttila, Alexander Gusev, Felix R Day, Po-Ru Loh, ReproGen Consortium, et al. 2015. “An Atlas of Genetic Correlations Across Human Diseases and Traits.” Nat. Genet. 47 (11): 1236–41.

Bulik-Sullivan, Brendan K, Po-Ru Loh, Hilary K Finucane, Stephan Ripke, Jian Yang, Nick Patterson, Mark J Daly, Alkes L Price, and Benjamin M Neale. 2015. “LD Score Regression Distinguishes Confounding from Polygenicity in Genome-Wide Association Studies.” Nat. Genet. 47 (3): 291–95.