Today, we will follow the applied guide to creating and validating polygenic risk scores (PRS) by Mills, Barban, and Tropf.

Mills, Melinda C.; Barban, Nicola; Tropf, Felix C.. An Introduction to Statistical Genetic Data Analysis (p. 243). MIT Press. Kindle Edition.

Mills et al’s learning objectives

  • Recall the basics of polygenic scores (PGSs)
  • Comprehend and engage in the seven steps of the pruning and threshold method
  • Understand and calculate a PGS using PRSice 2.0
  • Validate PGSs

Recall that a polygenic risk score is a weighted average of SNP dosages using the weights estimated from the GWAS of the trait of interest. Given a person’s genotype data, we can calculate the PRS as follows:

\[ \text{PRS} = \hat{Y} = \sum_k w_k \cdot X_{k}\] Different PRS wiil differ by the

  • SNPs to included in the sum (variable selection)
  • weights (based on GWAS results)

Setup

BIN="/Users/haekyungim/bin"
DATA="/Users/haekyungim/Box/LargeFiles/imlab-data/data-Github/web-data/2021-04-23-lab-polygenic-risk-scores/data_chapter10"
WORK=$DATA
mkdir $WORK/output

Get the data

-[ ] download the ch 10 data from https://www.intro-statistical-genetics.com/data-code

Install PRSice

-[ ] Download PRSice from http://www.prsice.info/

wget https://github.com/choishingwan/PRSice/releases/download/2.3.3/PRSice_mac.zip

-[ ] move the executable to the directory where you keep your software (optional)

** the first time you run PRSice, it will install additional components **

PRSice=/Users/haekyungim/bin/PRSice_mac

Calculating PRS

  1. Obtain GWAS summary statistics from a large discovery sample
  2. Obtain independent sample(s) that has genome-wide data
  3. Align and select SNPs in common between the two samples
  4. Select independent SNPs by pruning based on LD.
  5. Restrict to SNPs with a p-value that is lower than a certain threshold.
  6. Construct PGS by summing up risk alleles weighted by betas from the summary statistics.
  7. Evaluate strength of this association by regressing the phenotype on the PGS.

Pruning vs. clumping

  1. Pruning is a statistical procedure that selects one random SNP per LD block.
  2. Clumping selects the SNP with the lowest p-value association in each LD block.

Run PRSice


Rscript $PRSice/PRSice.R --dir $PRSice \
    --prsice $PRSice/PRSice_mac \
    --base $WORK/BMI.txt \
    --target $WORK/1kg_hm3_qc \
    --snp MarkerName \
    --A1 A1 \
    --A2 A2 \
    --stat Beta \
    --pvalue Pval \
    --pheno-file $WORK/BMI_pheno.txt \
    --bar-levels 5e-08,5e-07,5e-06,5e-05,5e-04,5e-03,5e-02,5e-01 \
    --fastscore \
    --binary-target F \
    --extract /Users/haekyungim/Box/LargeFiles/imlab-data/data-Github/web-data/2021-04-23-lab-polygenic-risk-scores/data_chapter10/BMI_score_all.valid \
    --out $WORK/BMI_score_all
  • $PRSice is the location where we downloaded the PRSice software package
  • PRSice_mac is the name of the PRSice main executable file
  • $WORK/BMI.txt has the GWAS summary results
  • $WORK/1kg_hm3_qc is the header of the plink formatted genotype files to be used
  • $WORK/BMI_pheno.txt has the actual phenotype file to be used for validataion
  • bar-levels indicate the thresholds to be used, i.e. only SNPs with p-values smaller than the specified values are used in the calculation of the PRS
  • fastcore specifies that only the thresholds listed be calculated
  • binary-target F: meand that we the trait is quantitative, not binary
  • some extra QC that was needed, some SNPs were repeated in the provided genotyep file
  • out is where we save the results

Mill et al original chapter 10 code


Rscript PRSice.R --dir . \
    --prsice ./PRSice \
    --base BMI.txt \
    --target 1kg_hm3_qc \
    --snp MarkerName \
    --A1 A1 \
    --A2 A2 \
    --stat Beta \
    --pvalue Pval \
    --pheno-file BMI_pheno.txt \
    --bar-levels 1 \
    --fastscore \
    --binary-target F \
    --out BMI_score_all



Rscript PRSice.R --dir . \
    --prsice ./PRSice \
    --base BMI.txt \
    --target 1kg_hm3_qc \
    --thread 1 \
    --snp MarkerName \
    --A1 A1 \
    --A2 A2 \
    --stat Beta \
    --pvalue Pval \
    --bar-levels 5e-08,5e-07,5e-06,5e-05,5e-04,5e-03,5e-02,5e-01 \
    --fastscore \
    --all-score \
    --no-regress \
    --binary-target F \
    --out BMIscore_thresholds  


head BMIscore_thresholds.all.score

Rscript PRSice.R --dir . \
    --prsice ./PRSice \
    --base BMI.txt \
    --target 1kg_hm3_qc \
    --thread 1 \
    --snp MarkerName \
    --A1 A1 \
    --A2 A2 \
    --stat Beta \
    --pvalue Pval \
    --pheno-file BMI_pheno.txt \
    --interval 0.00005 \
    --lower 0.0001 \
    --quantile 20 \
    --all-score \
    --binary-target F \
    --out BMIscore_graphics

Rscript PRSice.R --dir . \
    --prsice ./PRSice \
    --base BMI.txt \
    --target 1kg_hm3_qc \
    --thread 1 \
    --snp MarkerName \
    --A1 A1 \
    --A2 A2 \
    --stat Beta \
    --pvalue Pval \
    --no-clump F \
    --pheno-file Obesity_pheno.txt \
    --interval 0.00005 \
    --lower 0.0001 \
    --quantile 5 \
    --all-score \
    --binary-target T \
    --out Obesity_score_graphics
#import external data
data<-read.table("BMIscore_thresholds.all.score", header=T)

# show first rows of data
head(data)

# calculate new standardised variable
data$PGS=(data$X1-mean(data$X1))/sd(data$X1, na.rm=T)

#plot histogram of the polygenic score 
 hist(data$PGS)

# import external data with phenotype
pheno_BMI<-read.table("BMI_pheno.txt", header=T)

# merge the two datasets
data.with.pheno<-merge(data,pheno_BMI, by="IID")



# run a linear regression model
mod1<-(lm(BMI~PGS, data=data.with.pheno))
summary(mod1)


summary(mod1)$r.square




# create a vector with column names
columns=c("FID", "IID", "pca1", "pca2", "pca3", "pca4", "pca5", "pca6", "pca7", "pca8", "pca9", "pca10")

# read external data with PCAs
pca<- read.table(file="pca.eigenvec", sep = "", header=F, col.names=columns)[,c(2:12)]

# merge file with covariates with the rest of the file
data.with.covars<-merge(data.with.pheno,pca, by="IID")

# Estimate new linear regression model
mod2<-lm(BMI~PGS+pca1+pca2+pca3+pca3+pca5+pca6+pca7+pca8+pca9+pca10, data=data.with.covars)

# Calculate R-Square
summary(mod2)$r.square

# Estimate linear regression model
mod2.no.pgs<-update(mod2,  ~ . - PGS)

# Estimate difference in R-squared
print(summary(mod2)$r.square-summary(mod2.no.pgs)$r.square)
# install new package in R to perform bootstrap

# install new package in R to perform bootstrap

install.packages("boot")
library(boot)
set.seed(12345)

# define new function to calculate differential r-squared.
# The following commands define the new function
rsq <- function(formula, PGS=PGS, data, indices) {
  d <- data[indices,] 
  fit1 <- lm(formula,  data=d)
  fit2 <- update(fit1,  ~ . + PGS)
  
  return(summary(fit2)$r.square-summary(fit1)$r.square)
} 


# bootstrapping with 1000 replications 
results <- boot(data=data.with.covars, statistic=rsq, 
    R=1000, formula=BMI~pca1+pca2+pca3+pca3+pca5+pca6+pca7+pca8+pca9+pca10, PGS=PGS)
boot.ci(results, type="norm")      


# import external data with multi-ancestry information

data<-read.table("BMIscore_MULTIANCESTRY.best", header=T)
head(data)

# Calculate  standardized PGS
data$PGS=(data$PRS-mean(data$PRS))/sd(data$PRS, na.rm=T)

data.with.pheno<-merge(data,pheno_BMI, by="IID")

#import dataset with geographical information (ancestral populations)
geo<-read.table(file="1kg_samples.txt", sep = "\t", header=T)[, c(1,4,5,6,7)]
names(geo)[1]<-"IID"

# Create a vector with new columns names
columns=c("FID", "IID", "pca1", "pca2", "pca3", "pca4", "pca5", "pca6", "pca7", "pca8", "pca9", "pca10")

# import exteral data with PCAs
pca<- read.table(file="pca.eigenvec", sep = "", header=F, col.names=columns)[,c(2:12)]

# merge information with covariates and ancestral information
data.with.covars<-merge(data.with.pheno,pca, by="IID")
data.with.geo<-merge(data.with.covars,geo, by="IID")

# fit a linear regression model
mod.AFR<-lm(BMI~PGS, data=subset(data.with.geo, Superpopulation.name=="African"))

summary(mod.AFR)$r.square


# run bootstrap function to  estimate increased R squared
results <- boot(data=subset(data.with.geo, Superpopulation.name=="African"), statistic=rsq, 
    R=1000, formula=BMI~pca1+pca2+pca3+pca3+pca5+pca6+pca7+pca8+pca9+pca10, PGS=PGS)

boot.ci(results, type="norm")      

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The source code is licensed under MIT.

Suggest changes

If you find any mistakes (including typos) or want to suggest changes, please feel free to edit the source file of this page on Github and create a pull request.

Citation

For attribution, please cite this work as

Haky Im (2021). Lab Polygenic Risk Scores. BIOS 25328 Cancer Genomics Class Notes. /post/2021/04/23/lab-polygenic-risk-scores/

BibTeX citation

@misc{
  title = "Lab Polygenic Risk Scores",
  author = "Haky Im",
  year = "2021",
  journal = "BIOS 25328 Cancer Genomics Class Notes",
  note = "/post/2021/04/23/lab-polygenic-risk-scores/"
}