Plink Tutorial

Laura Vairus, et al

2021-04-09

Categories: tutorial

Basic Commands

here’s a list of some of the basic commands you will learn about
We will go more in depth for each one below

Command	Function
–file	loads a file in ascii format
–bfile	loads a file in binary format
–make-bed	converts an ascii file to binary
–out	specifies what you want to name your output
–freq	generates minor allele frequencies

For a more detailed tutorial for GWAS analysis check out “A tutorial on conducting genome‐wide association studies: Quality control and statistical analysis” by Mareen et al here.

Introduction

We will learn to run a GWAS using plink.
This tutorial follows a plink tutorial you can find here

First, download plink into a new file /Users//demo/

DEMO="/Users/<yourusername>/demo"

Download the example data we will be using.
Note that this data is not real and is only intended to be used as an example for using plink.
Whenever running plink commands, you first have to indicate you’re using plink by typing the full path to it.
To avoid writing out the same path over and over again, we will define a variable that contains the path to plink.

plink="Users/<username>/demo/plink"

PED and MAP files

PED and MAP files are plain text files that contain a table of part of your data
Together, they make up all of the data of the study you’re doing
Look at the first few lines of your PED file by copy and pasting this command

cut -d ' ' -f  1-6  hapmap1.ped | head

Each row represents a different individual.
PED files start with 6 informative columns about each individual.

Family ID
Individual ID
Paternal ID
Maternal ID
Sex
- 1 = male, 2 = female, other = unknown
Phenotype
- Can be quantitative or an affection status
- The default values for affection status is: 1 = unaffected, 2 = affected, 0 = missing, -9 = missing

Every column after these is the data for the individual’s genotypes
- Each marker is biallelic, meaning there’s two genotypes for each column, paternal and maternal.
Now look at the beginning of your MAP file

head hapmap1.map

Each row represents a different SNP.
there are 4 informative columns.

Chromosome #
SNP identifier
Genetic distance (in centimorgans)
Base-pair position

every row here corresponds to a genotype column in the ped file to indicate the position of those genotypes
- the first row here corresponds to the first genotype column in the .ped (after the first ID-ing 6), the second row here to the next column there, and so on ## Binary Files It’s helpful to compress your data to speed up analysis and data processing
  One way to do this is to convert your files to binary format
Convert your .ped and .map files to binary

plink --file hapmap1 --make-bed --out hapmap1

These are the three plink commands we used:
- –file hapmap1: specifies that we want to use the hapmap1 files
- –make-bed: converts that file to binary
- –out hapmap1: names our output files hapmap1
- if you don’t specify a name for your output, it defaults to “plink” - If you look at your hapmap folder, you’ll see four new files - .bed: binary .ped - .bim: binary .map - .fam: copy of the first 6 column of .ped - .log: a log of what commands were run and what it printed

Minor Allele Frequencies

Find the minor allele frequencies of your data and look at the head

plink --file hapmap1 --freq --out maf
head maf.freq

There are 6 columns

Chromosome #
SNP ID
Minor Allele
Major Allele
Minor Allele Frequencies
Number of Allele Observations

Run a GWAS

When you “run a GWAS,” it just means you’re finding the association between your SNPs and the phenotypes.
In plink, you use the –assoc command to do this
Run a GWAS and look at the head

plink --file hapmap1 --assoc --out as
head as.assoc

There are 10 columns

Chromosome #
SNP ID
Base pair
Minor Allele
Frequency of the minor allele in affected individuals
Frequency of the minor allele in unaffected idividuals
Major Allele
Chi-squared statistic
- The difference between the data you observed and the data you expected under the null hypothesis
P-value
Odds ratio

Manhattan Plot

Now we will graph a Manhattan plot of our association data We will do this in R using the “qqman” package - [ ] First, download qqman

library(qqman)

Read your association table and set it to a variable

mydata <- read_table("/Users/<username>/...", header = TRUE)

Now if you try to plot mydata, R will give you an error because there are NA values in it.
To get rid of the NA values, use the command na.omit
Take out the NA values

mydata2 <- na.omit(mydata)

Now you’re ready to graph your manhattan plot!

manhattan(mydata2)

QQ Plot

the qq() function takes in a vector instead of a table.
When graphing our data in qq plots we use the p-values

Graph your p values

qqplot(mydata2$P)

We will read the hapmap1.ped file. typically genotype files are too large to load into memory but in this example it takes 28MB, which is fine with current computers. As a rule of thumb (in a laptop with 16Gb of ram, try not to load files that are more than 1GB)

hapmap1_ped = read_table("/Users/lvairus/Desktop/hapmap1/hapmap1.ped")

TODO add Mareen et al workflow here

“A tutorial on conducting genome‐wide association studies: Quality control and statistical analysis” by Mareen et al here