Today, we will follow the applied guide to creating and validating polygenic risk scores (PRS) by Mills, Barban, and Tropf.
Mills, Melinda C.; Barban, Nicola; Tropf, Felix C.. An Introduction to Statistical Genetic Data Analysis (p. 243). MIT Press. Kindle Edition.
Mills et al’s learning objectives
- Recall the basics of polygenic scores (PGSs)
- Comprehend and engage in the seven steps of the pruning and threshold method
- Understand and calculate a PGS using PRSice 2.0
- Validate PGSs
Recall that a polygenic risk score is a weighted average of SNP dosages using the weights estimated from the GWAS of the trait of interest. Given a person’s genotype data, we can calculate the PRS as follows:
\[ \text{PRS} = \hat{Y} = \sum_k w_k \cdot X_{k}\] Different PRS wiil differ by the
- SNPs to included in the sum (variable selection)
- weights (based on GWAS results)
Setup
BIN="/Users/haekyungim/bin"
DATA="/Users/haekyungim/Box/LargeFiles/imlab-data/data-Github/web-data/2021-04-23-lab-polygenic-risk-scores/data_chapter10"
WORK=$DATA
mkdir $WORK/output
Get the data
-[ ] download the ch 10 data from https://www.intro-statistical-genetics.com/data-code
Install PRSice
-[ ] Download PRSice from http://www.prsice.info/
wget https://github.com/choishingwan/PRSice/releases/download/2.3.3/PRSice_mac.zip
-[ ] move the executable to the directory where you keep your software (optional)
** the first time you run PRSice, it will install additional components **
PRSice=/Users/haekyungim/bin/PRSice_mac
Calculating PRS
- Obtain GWAS summary statistics from a large discovery sample
- Obtain independent sample(s) that has genome-wide data
- Align and select SNPs in common between the two samples
- Select independent SNPs by pruning based on LD.
- Restrict to SNPs with a p-value that is lower than a certain threshold.
- Construct PGS by summing up risk alleles weighted by betas from the summary statistics.
- Evaluate strength of this association by regressing the phenotype on the PGS.
Pruning vs. clumping
- Pruning is a statistical procedure that selects one random SNP per LD block.
- Clumping selects the SNP with the lowest p-value association in each LD block.
Run PRSice
Rscript $PRSice/PRSice.R --dir $PRSice \
--prsice $PRSice/PRSice_mac \
--base $WORK/BMI.txt \
--target $WORK/1kg_hm3_qc \
--snp MarkerName \
--A1 A1 \
--A2 A2 \
--stat Beta \
--pvalue Pval \
--pheno-file $WORK/BMI_pheno.txt \
--bar-levels 5e-08,5e-07,5e-06,5e-05,5e-04,5e-03,5e-02,5e-01 \
--fastscore \
--binary-target F \
--extract /Users/haekyungim/Box/LargeFiles/imlab-data/data-Github/web-data/2021-04-23-lab-polygenic-risk-scores/data_chapter10/BMI_score_all.valid \
--out $WORK/BMI_score_all
- $PRSice is the location where we downloaded the PRSice software package
- PRSice_mac is the name of the PRSice main executable file
- $WORK/BMI.txt has the GWAS summary results
- $WORK/1kg_hm3_qc is the header of the plink formatted genotype files to be used
- $WORK/BMI_pheno.txt has the actual phenotype file to be used for validataion
- bar-levels indicate the thresholds to be used, i.e. only SNPs with p-values smaller than the specified values are used in the calculation of the PRS
- fastcore specifies that only the thresholds listed be calculated
- binary-target F: meand that we the trait is quantitative, not binary
- some extra QC that was needed, some SNPs were repeated in the provided genotyep file
- out is where we save the results
Mill et al original chapter 10 code
Rscript PRSice.R --dir . \
--prsice ./PRSice \
--base BMI.txt \
--target 1kg_hm3_qc \
--snp MarkerName \
--A1 A1 \
--A2 A2 \
--stat Beta \
--pvalue Pval \
--pheno-file BMI_pheno.txt \
--bar-levels 1 \
--fastscore \
--binary-target F \
--out BMI_score_all
Rscript PRSice.R --dir . \
--prsice ./PRSice \
--base BMI.txt \
--target 1kg_hm3_qc \
--thread 1 \
--snp MarkerName \
--A1 A1 \
--A2 A2 \
--stat Beta \
--pvalue Pval \
--bar-levels 5e-08,5e-07,5e-06,5e-05,5e-04,5e-03,5e-02,5e-01 \
--fastscore \
--all-score \
--no-regress \
--binary-target F \
--out BMIscore_thresholds
head BMIscore_thresholds.all.score
Rscript PRSice.R --dir . \
--prsice ./PRSice \
--base BMI.txt \
--target 1kg_hm3_qc \
--thread 1 \
--snp MarkerName \
--A1 A1 \
--A2 A2 \
--stat Beta \
--pvalue Pval \
--pheno-file BMI_pheno.txt \
--interval 0.00005 \
--lower 0.0001 \
--quantile 20 \
--all-score \
--binary-target F \
--out BMIscore_graphics
Rscript PRSice.R --dir . \
--prsice ./PRSice \
--base BMI.txt \
--target 1kg_hm3_qc \
--thread 1 \
--snp MarkerName \
--A1 A1 \
--A2 A2 \
--stat Beta \
--pvalue Pval \
--no-clump F \
--pheno-file Obesity_pheno.txt \
--interval 0.00005 \
--lower 0.0001 \
--quantile 5 \
--all-score \
--binary-target T \
--out Obesity_score_graphics
#import external data
data<-read.table("BMIscore_thresholds.all.score", header=T)
# show first rows of data
head(data)
# calculate new standardised variable
data$PGS=(data$X1-mean(data$X1))/sd(data$X1, na.rm=T)
#plot histogram of the polygenic score
hist(data$PGS)
# import external data with phenotype
pheno_BMI<-read.table("BMI_pheno.txt", header=T)
# merge the two datasets
data.with.pheno<-merge(data,pheno_BMI, by="IID")
# run a linear regression model
mod1<-(lm(BMI~PGS, data=data.with.pheno))
summary(mod1)
summary(mod1)$r.square
# create a vector with column names
columns=c("FID", "IID", "pca1", "pca2", "pca3", "pca4", "pca5", "pca6", "pca7", "pca8", "pca9", "pca10")
# read external data with PCAs
pca<- read.table(file="pca.eigenvec", sep = "", header=F, col.names=columns)[,c(2:12)]
# merge file with covariates with the rest of the file
data.with.covars<-merge(data.with.pheno,pca, by="IID")
# Estimate new linear regression model
mod2<-lm(BMI~PGS+pca1+pca2+pca3+pca3+pca5+pca6+pca7+pca8+pca9+pca10, data=data.with.covars)
# Calculate R-Square
summary(mod2)$r.square
# Estimate linear regression model
mod2.no.pgs<-update(mod2, ~ . - PGS)
# Estimate difference in R-squared
print(summary(mod2)$r.square-summary(mod2.no.pgs)$r.square)
# install new package in R to perform bootstrap
# install new package in R to perform bootstrap
install.packages("boot")
library(boot)
set.seed(12345)
# define new function to calculate differential r-squared.
# The following commands define the new function
rsq <- function(formula, PGS=PGS, data, indices) {
d <- data[indices,]
fit1 <- lm(formula, data=d)
fit2 <- update(fit1, ~ . + PGS)
return(summary(fit2)$r.square-summary(fit1)$r.square)
}
# bootstrapping with 1000 replications
results <- boot(data=data.with.covars, statistic=rsq,
R=1000, formula=BMI~pca1+pca2+pca3+pca3+pca5+pca6+pca7+pca8+pca9+pca10, PGS=PGS)
boot.ci(results, type="norm")
# import external data with multi-ancestry information
data<-read.table("BMIscore_MULTIANCESTRY.best", header=T)
head(data)
# Calculate standardized PGS
data$PGS=(data$PRS-mean(data$PRS))/sd(data$PRS, na.rm=T)
data.with.pheno<-merge(data,pheno_BMI, by="IID")
#import dataset with geographical information (ancestral populations)
geo<-read.table(file="1kg_samples.txt", sep = "\t", header=T)[, c(1,4,5,6,7)]
names(geo)[1]<-"IID"
# Create a vector with new columns names
columns=c("FID", "IID", "pca1", "pca2", "pca3", "pca4", "pca5", "pca6", "pca7", "pca8", "pca9", "pca10")
# import exteral data with PCAs
pca<- read.table(file="pca.eigenvec", sep = "", header=F, col.names=columns)[,c(2:12)]
# merge information with covariates and ancestral information
data.with.covars<-merge(data.with.pheno,pca, by="IID")
data.with.geo<-merge(data.with.covars,geo, by="IID")
# fit a linear regression model
mod.AFR<-lm(BMI~PGS, data=subset(data.with.geo, Superpopulation.name=="African"))
summary(mod.AFR)$r.square
# run bootstrap function to estimate increased R squared
results <- boot(data=subset(data.with.geo, Superpopulation.name=="African"), statistic=rsq,
R=1000, formula=BMI~pca1+pca2+pca3+pca3+pca5+pca6+pca7+pca8+pca9+pca10, PGS=PGS)
boot.ci(results, type="norm")