Lab GCTA

Max Winston, et al

2021-04-16

Categories: lab

Created by Max Winston & Brandon Pierce; modifications by Charles Washington III, Ankeeta Shah, and Yanyu Liang.

Genome-wide Complex Trait Analysis (GCTA, see documentation here) was originally designed to estimate the heritability of complex traits using genome-wide SNPs, but has now been extended for numerous other functionalities to better understand the genetic architecture of these traits (Yang et al., 2011). Generally, GCTA estimates heritability using the proportion of phenotypic variance explained by the a genetic relationship matrix (GRM), which is calculated using the genome-wide SNP data. In today’s lab we will become familiar with the GCTA software and some of its capabilities, as well as explore some of the conceptual issues dealt with in class with a large SNP dataset. By the end of the lab you should be able to:

Construct GRMs given BED, FAM, and BIM files
Run standard analysis in GCTA on large datasets (univariate REML)
Run bivariate REML analysis in GCTA
Relate how increasing density of markers affects heritability estimate
Relate how subsetting markers may affect heritability estimate

You may want to check the tutorial here

Basics of GCTA

Standard GCTA options

option	Description
make-grm	Generates GRM from SNP data (BED, FAM, BIM files).
make-grm-bin	Generates binary GRM from SNP data (BED, FAM, BIM files).
bfile	Specifies stem name from BED file for analysis.
out	Specifies output stem name.
pheno	Specifies file containing phenotypic information.
mpheno	Gives the column number within phenotype file to use for analysis.
reml	Runs univariate restricted maximum likelihood (REML) analysis.
grm	Specifies GRM file for analysis.
grm-bin	Specifies binary GRM file for analysis.
reml-bivar	Runs bivariate restricted maximum likelihood (REML) analysis.
reml-maxit	Sets the maximum number of iterations to run (Default: 100).

There is a lot of overlap between PLINK options and GCTA options. For example, the maf option does the same thing in GCTA as it does in PLINK (Refer to Lab 2).

Some setup

Let’s run this on Midway. It is recommended not to run large processes on the head node (where you land when you log in). We start an interactive session:

sinteractive --account=bios25328

We start by defining where the data will be and loading the gcta command.


DATA=/project2/bios25328/Data/Lab_GCTA_Winston

module load gcta

Let’s create a working directory for this lab

cd /project2/bios25328/users/haky/ ## you should replace haky with your user directory name
mkdir labgcta
cd labgcta
mkdir output  # so that the output files go into a separate place
WORK=/project2/bios25328/users/haky/labgcta

Running GCTA is nice because it prints lots of useful information to the screen as it runs, and when it concludes. However, due to the computation required for creating a GRM, and the large size of the some of the input and output files, running it can take a bit of time. Depending on what you’re doing for this lab, you can expect some processes to take up to 3 minutes, and of course, with bigger files, it would take even more time. The syntax used to run GCTA is similar to other pipelines we have used on the command line: GCTA is called and modified with option flags (see Section 1.1). For example, one of the first things you will usually need to do is take your SNP data (BED, FAM, BIM formats-just like in PLINK!) and make a GRM.

Compute GRM

Try this with the following command and the “test” files:

gcta --bfile $DATA/test --autosome --maf 0.01 --make-grm --out $WORK/output/test

Problem 1

How many individuals are there in the test dataset?

Perform REML

GCTA employs a restricted maximum likelihood (REML) method to estimate the proportion of phenotypic data explained by SNP data (Yang et al., 2010).

Run a basic REML analysis on the “test” GRM you created with the following command:

gcta --grm $WORK/output/test --pheno $DATA/test.phen --reml --out $WORK/output/test

Results can be found in the test.hsq file. Open this file in your preferred text editor. Recall from lecture that narrow-sense heritability is additive genetic variance over phenotypic variance.

Problem 2

Is this phenotypic trait heritable (is it statistically significant)?

Problem 3

What is the heritability estimate? What is the standard error of this estimate?

Calculate principal components with GCTA

As we mentioned in class, we can calculate the principal components of the genotype matrix Using the genetic relatedness matrix. This is using the eigenvector decomposition rather than the singular value decomposition, but they yield the same PCs.

gcta --grm $WORK/output/test --pca 20 --out $WORK/output/test

Manipulating GRMs for a Robust Assessment of Heritability

Effect of SNP Density on Heritability Estimation

The density at which SNPs are sampled throughout the genome can have an important effect on heritability estimates of complex traits. In order to illustrate this point, you have been provided two large SNP datasets (250k and 500k markers) of 1,000 individuals and a file with two phenotypes (named “two_phenotypes.txt”).

Problem 4

For phenotype 1, create a GRM based on the 250,000 whole-genome SNPs (250k.bed, 250k.fam, 250k.bim) and estimate the heritability (Hints: Making a binary GRM may save you major time, and don’t forget to specify you want column 1 in the phenotype file to be used for REML analysis. See --mpheno). Provide the command.

Problem 5

Now, create a GRM based on the 500,000 whole-genome SNPs (500k.bed, 500k.fam, 500k.bim) and estimate the heritability. Provide the command.

Problem 6

How did the estimate and/or SE change and why? What might this say about SNP density and estimating heritability more generally?

Effect of Subsetting a GRM to Causal Variants on Heritability Estimation

Some time-traveling scientists who have conducted GWAS of all humans on earth from all times deliver you a dataset containing only causal variants (causal.bed, causal.fam, causal.bim). Let’s explore how using this set of causal variants may change our estimation of heritability.

Problem 7

Create a GRM based only on the causal variants and estimate the heritability. Provide the command.

Problem 8

How did the estimate and/or SE change compared to the REML analysis with the 500K GRM? Why might this be?

Estimating Genetic Correlation of Multiple Phenotypes

Problem 9

Using the GRM generated from 500k SNPs, estimate the genetic correlation between phenotype 1 and 2 using a bivariate, rather than a univariate, REML run. Provide the command. (hint: --reml-bivar and documentation on bivariate REML analysis)

Problem 10

Which line in the output file *.hsq is genetic correlation? What does this “genetic correlation” mean?