Created by Max Winston, modified by Charles Washington III, Ankeeta Shah, and Yanyu Liang
(from HGEN47100 course material)

Before getting started, please download data from the Box link here. For your convenience, we have a copy of the data folder at /project2/hgen47100/data/lab2 on midway.

wget https://uchicago.box.com/shared/static/30h5kju8ggui94whfbz0prjn3x4dvtag.gz -O lab2.tar.gz
tar xvf lab2.tar.gz

Tranferring data between midway and local machine. Unix users are recommended to use scp or rsync. Windows users can use WinSCP or equivalent. For our purpose, our files are not sensitive and small in general. An alternative approach is to login to midway Desktop using ThinLinc and send the data as an email.

This lab section is dedicated to learning how to manipulate the appropriate files in a command-line program named PLINK (to run whole-genome association and population-based linkage analysis). By the end of this lab, you should be able to:

  • Know about PED and BED files
  • Run GWAS using PLINK for both discrete and continuous phenotypes
  • Account for covariates in GWAS run
  • Visualize GWAS outputs generated by PLINK

Running a Case-Control GWAS

One of the most common uses of GWAS is to explore putative associations for disease. In this section we will practice our skills with PLINK using a dataset with SNPs in a 20Mb region from a case-control GWAS of a binary phenotype (more than 5000 individuals). There is a little readme (readme_of_GAME_data.txt) which you may find useful. The input files have already been downloaded inside the data folder. It includes small GWAS file set (bed/bim/fam files), as well as the file containing covariates obtained from principal components analysis (PCA). For your convenience, you can find these files at /project2/hgen47100/data/lab2.

Throughout this section, you will be asked to run multiple PLINK commands. You will be asked to record them all in one bash script as part of the submission.

Preliminary Quality Control

Your first task is to use PLINK to remove SNPs with low call rates (>5% missingness), deviation from Hardy-Weinberg equilibrium (p < 0.001), and low minor allele frequency (MAF < 0.05). Please record the command in the bash script.

Problem 10
Provide the script you used. How many SNPs, in total, were removed and how many are left?

Logistic Regression

Test associations for this chromosome using logistic regression (–logistic) using the “–adjust” flag to conduct adjustments for multiple testing. No phenotype file needs to be specified because case-control status is in the .fam file, Please record the command in the bash script

Problem 11
Look through the most significant SNPs in the logistic association results file. What is the most significant SNP? What is the p-value? (Make sure you look at the .assoc.logistic.adjusted file’s Bonferroni-corrected value (BONF) as this gives you the corrected significance).

Now run the same command adjusting for five principal components (PCs) representing ancestry using the “–covar” flag for the 5 ancestral PC file and the “hide-covar” flag. Don’t worry too much about correcting for ancestry using PCs, we will get into much more detail on this in a later lab. Please record the command in the bash script

Problem 12
What is the most significant SNP now? What is the p-value?

Problem 13
Are the results the same from Problem 11 and Problem 12? Did correcting for ancestry do anything?

Problem 14 Make two Manhattan plots by importing the two .assoc.logistic files (not the adjusted versions) into R and using some of the code above (you’ll need to move them from the cluster to your local environment). Keep in mind that the length of the file is different from the previous Manhattan plot, so you may need to change the significance threshold as well as the data used. What are the main differences between the two plots?

Creating a Locuszoom Plot

Locuszoom plots have the same basic structure as Manhattan plots, however, they are useful in that they include additional information such as gene location, r-squared values (LD), and estimated recombination rates. Once you have generated your logistic association file (the one using the 5 PCs) from Section 2.2 and identified the top SNP from this association, it is very easy to navigate to the Locuszoom website and create the plot.

Create a locuszoom plot using the web server here. Use the significance reported in the .logistic file (not the .logistic.adjusted file) from section 2.2. (once again, you’ll need to move the file from the cluster to your local environment).

The website recognizes a SNP by genomic position along with the reference/alternative alleles. But the current .logistic file does not includes non effect allele. To fix this issue, please refer to example_rscript_to_convert_plink_assoc_to_locuszoom_input.R in the data folder to see one possible work-around.

After uploading the GWAS results, you will hit button “submit” which will submit a job to server to generate the locus zoom plots. Once the job is finished, you will be noticed by email with the link to report. In the report page, the link “region plot” directs you to the locuszoom plot. The color of each point indicates the pairwise correlation of that SNP with the focal SNP (which we entered as the most highly associated SNP). Additionally, the blue line in the background demonstrates the estimated recombination rate (cM/Mb) from the much larger HapMap dataset external to our Case-Control dataset.

Problem 15
Why do you think many of the SNPs surrounding the focal SNP are colored more red-ish (i.e. are more correlated)? Do you notice any pattern between the color of these points and the background recombination rate?

Assessing multiple association signals in this region

Often, you may want to know whether there are other significant association signals conditional upon identification of a highly-associated SNP (such as the one you identified in Section 2.2). To determine if there are multiple association signals in this region, run the logistic with covariates again using the “–condition rsXXXX”" flag to adjust for the top SNP. Please record the command in the bash script

Problem 16
Use locuszoom to plot the -log10 P-values from the new analysis. Show the plots for both the unconditioned (the very first locuszoom plot) and the conditioned files (for region 13:73666628-74166628).

Problem 17
Interpret the results of the two plots by writing 1-2 sentences for each.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The source code is licensed under MIT.

Suggest changes

If you find any mistakes (including typos) or want to suggest changes, please feel free to edit the source file of this page on Github and create a pull request.