Created by Max Winston & Brandon Pierce; modifications by Charles Washington III, Ankeeta Shah, and Yanyu Liang.
Genome-wide Complex Trait Analysis (GCTA, see documentation here) was originally designed to estimate the heritability of complex traits using genome-wide SNPs, but has now been extended for numerous other functionalities to better understand the genetic architecture of these traits (Yang et al., 2011). Generally, GCTA estimates heritability using the proportion of phenotypic variance explained by the a genetic relationship matrix (GRM), which is calculated using the genome-wide SNP data. In today’s lab we will become familiar with the GCTA software and some of its capabilities, as well as explore some of the conceptual issues dealt with in class with a large SNP dataset. By the end of the lab you should be able to:
- Construct GRMs given BED, FAM, and BIM files
- Run standard analysis in GCTA on large datasets (univariate REML)
- Run bivariate REML analysis in GCTA
- Relate how increasing density of markers affects heritability estimate
- Relate how subsetting markers may affect heritability estimate
You may want to check the tutorial here
Basics of GCTA
Standard GCTA options
option | Description |
---|---|
make-grm | Generates GRM from SNP data (BED, FAM, BIM files). |
make-grm-bin | Generates binary GRM from SNP data (BED, FAM, BIM files). |
bfile | Specifies stem name from BED file for analysis. |
out | Specifies output stem name. |
pheno | Specifies file containing phenotypic information. |
mpheno | Gives the column number within phenotype file to use for analysis. |
reml | Runs univariate restricted maximum likelihood (REML) analysis. |
grm | Specifies GRM file for analysis. |
grm-bin | Specifies binary GRM file for analysis. |
reml-bivar | Runs bivariate restricted maximum likelihood (REML) analysis. |
reml-maxit | Sets the maximum number of iterations to run (Default: 100). |
There is a lot of overlap between PLINK options and GCTA options. For example, the maf option does the same thing in GCTA as it does in PLINK (Refer to Lab 2).
Some setup
Let’s run this on Midway. It is recommended not to run large processes on the head node (where you land when you log in). We start an interactive session:
sinteractive --account=bios25328
We start by defining where the data will be and loading the gcta command.
DATA=/project2/bios25328/Data/Lab_GCTA_Winston
module load gcta
Let’s create a working directory for this lab
cd /project2/bios25328/users/haky/ ## you should replace haky with your user directory name
mkdir labgcta
cd labgcta
mkdir output # so that the output files go into a separate place
WORK=/project2/bios25328/users/haky/labgcta
Running GCTA is nice because it prints lots of useful information to the screen as it runs, and when it concludes. However, due to the computation required for creating a GRM, and the large size of the some of the input and output files, running it can take a bit of time. Depending on what you’re doing for this lab, you can expect some processes to take up to 3 minutes, and of course, with bigger files, it would take even more time. The syntax used to run GCTA is similar to other pipelines we have used on the command line: GCTA is called and modified with option flags (see Section 1.1). For example, one of the first things you will usually need to do is take your SNP data (BED, FAM, BIM formats-just like in PLINK!) and make a GRM.
Compute GRM
Try this with the following command and the “test” files:
gcta --bfile $DATA/test --autosome --maf 0.01 --make-grm --out $WORK/output/test
Problem 1
How many individuals are there in the test dataset?
Perform REML
GCTA employs a restricted maximum likelihood (REML) method to estimate the proportion of phenotypic data explained by SNP data (Yang et al., 2010).
Run a basic REML analysis on the “test” GRM you created with the following command:
gcta --grm $WORK/output/test --pheno $DATA/test.phen --reml --out $WORK/output/test
Results can be found in the test.hsq file. Open this file in your preferred text editor. Recall from lecture that narrow-sense heritability is additive genetic variance over phenotypic variance.
Problem 2
Is this phenotypic trait heritable (is it statistically significant)?
Problem 3
What is the heritability estimate? What is the standard error of this estimate?
Calculate principal components with GCTA
As we mentioned in class, we can calculate the principal components of the genotype matrix Using the genetic relatedness matrix. This is using the eigenvector decomposition rather than the singular value decomposition, but they yield the same PCs.
gcta --grm $WORK/output/test --pca 20 --out $WORK/output/test
Manipulating GRMs for a Robust Assessment of Heritability
Effect of SNP Density on Heritability Estimation
The density at which SNPs are sampled throughout the genome can have an important effect on heritability estimates of complex traits. In order to illustrate this point, you have been provided two large SNP datasets (250k and 500k markers) of 1,000 individuals and a file with two phenotypes (named “two_phenotypes.txt”).
Problem 4
For phenotype 1, create a GRM based on the 250,000 whole-genome SNPs (250k.bed, 250k.fam, 250k.bim) and estimate the heritability (Hints: Making a binary GRM may save you major time, and don’t forget to specify you want column 1 in the phenotype file to be used for REML analysis. See --mpheno
). Provide the command.
Problem 5
Now, create a GRM based on the 500,000 whole-genome SNPs (500k.bed, 500k.fam, 500k.bim) and estimate the heritability. Provide the command.
Problem 6
How did the estimate and/or SE change and why? What might this say about SNP density and estimating heritability more generally?
Effect of Subsetting a GRM to Causal Variants on Heritability Estimation
Some time-traveling scientists who have conducted GWAS of all humans on earth from all times deliver you a dataset containing only causal variants (causal.bed, causal.fam, causal.bim). Let’s explore how using this set of causal variants may change our estimation of heritability.
Problem 7
Create a GRM based only on the causal variants and estimate the heritability. Provide the command.
Problem 8
How did the estimate and/or SE change compared to the REML analysis with the 500K GRM? Why might this be?
Estimating Genetic Correlation of Multiple Phenotypes
Problem 9
Using the GRM generated from 500k SNPs, estimate the genetic correlation between phenotype 1 and 2 using a bivariate, rather than a univariate, REML run. Provide the command. (hint: --reml-bivar
and documentation on bivariate REML analysis)
Problem 10
Which line in the output file *.hsq
is genetic correlation? What does this “genetic correlation” mean?