Created by Max Winston, modified by Charles Washington III, Ankeeta Shah, Yanyu Liang, and Erik McIntire

This lab section is dedicated to learning how to download HapMap data and manipulate the appropriate files in a command-line program named PLINK. Additionally, in R we will import data files and generate our own data natively. By the end of this lab, you should be able to:

  • Organize files and folders in Unix
  • Know about PED and BED files
  • Use PLINK to generate statistics on HapMap data
  • Import data files into R
  • Simulate genotypes using R

Logging in and moving files

For the remainder of the document, please notice that steps requiring actions will generally be in bold. For example, Open Terminal if you are on a Mac.

Each of you should have been assigned a username on the server. You will use a Secure Shell client to log in to your home directory on the server. You can do this by:

ssh <username>@midway2.rcc.uchicago.edu

Note that you will be prompted to enter your password. One somewhat counterintuitive facet of password entry in Unix is that there is no on-screen indication of the characters being typed. This is an intentional security feature that prevents onlookers from inferring the length of your password. Type in your password when prompted. The midway2 server has recently implemented two-factor authentication, so you will next be prompted to log in with DUO, enter a passcode, or receive a call. After you complete one of these three options, you should be logged in to the server.

Using the commands listed above, do the following in your home directory:

  • 1) Make your own directory.
  • 2) Make a second directory.
  • 3) Look where your directories live.
  • 4) Move one of your directories into the other.
  • 5) Change directory to your new directory.
  • 6) Look inside your new directory.
  • 7) Move to the original directory.
  • 8) Read more about the rm command.
  • 9) Delete both directories you’ve made.
  • 10) Create a directory called Lab_1 for this lab session and move into it.

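mkdir your_directory/         ## Step 1: make a directory
mkdir dir_2/                  ## Step 2: make a second directory
pwd                           ## Step 3: print the directory that contains them
mv your_directory/ dir_2/     ## Step 4: move the first directory into the second
cd dir_2/                     ## Step 5: change into dir_2
ls                            ## Step 6: list its contents
cd ../                        ## Step 7: go back up to the original directory
man rm                        ## Step 8: read the manual page for rm (press q to quit)
rm -r dir_2/                  ## Step 9: recursively delete dir_2 and your_directory inside it
mkdir Lab_1/                  ## Step 10: make the lab directory ...
cd Lab_1/                     ## Step 10 (continued): ... and move into it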

Working with data in R

R is widely used for scientific computing; however, its syntax differs notably from that of the Unix shell. Two of the most visible differences are that variables are usually assigned with the assignment operator (<-) and that functions are applied using parentheses. While R can be accessed and R scripts can be run within the Unix environment, it is often preferable to work in the graphical user interface (GUI) known as RStudio. Unlike many poorly made GUIs that offer little advantage over a Unix environment, RStudio offers many organizational and functional benefits. Open RStudio.

Reading in data files

Although data can be simulated, generated, and analyzed all within R, often we may want to visualize or analyze data from some other source. We can learn R syntax by learning how to import new data. Use the following commands to set the directory and import the downloaded text file lab1_r_example_data.txt as the data object “new_data”. Don’t forget to set your directory to where you saved it:

# To change the working directory, modify the path in the following line
setwd("data")
# Read the whitespace-delimited text file; header=TRUE uses its first line as column names
new_data <- read.table("lab1_r_example_data.txt", header=TRUE)
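
If you want to see what read.table expects before loading the real file, the round trip below is a minimal sketch. It writes a tiny made-up table to a temporary file and reads it back. The object names (toy, toy_path, toy_in), the column names, and the values are all hypothetical; they are not the contents of lab1_r_example_data.txt.

```r
# Hypothetical toy table; NOT the contents of lab1_r_example_data.txt
toy <- data.frame(
  sample_id  = c("s1", "s2", "s3"),
  genotype   = c(0, 1, 2),
  expression = c(5.1, 6.3, 7.8)
)

# Write it as a whitespace-delimited text file with a header row
toy_path <- tempfile(fileext = ".txt")
write.table(toy, file = toy_path, quote = FALSE, row.names = FALSE)

# header = TRUE tells read.table to use the first line as column names
toy_in <- read.table(toy_path, header = TRUE)

head(toy_in)   # first rows of the imported table
str(toy_in)    # column names and types
```

If header = TRUE were left out here, the first line would be read as data and the columns would get the default names V1, V2, and V3.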

Exploring objects

We can use functions such as “dim” (prints object dimensions) and “class” (prints object class) to investigate attributes of an object. An object’s class is important to know because many functions behave differently depending on the class of their input, and passing an object of the wrong class may cause errors. Let’s look at a few attributes of new_data, using the following commands:

dim(new_data)

Problem 5
What are the dimensions of new_data?

class(new_data)

Problem 6
What class is new_data?
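
To illustrate why class matters, here is a small sketch using two made-up vectors (counts_num and counts_chr are hypothetical names, unrelated to new_data). The same function succeeds on one class and fails on the other, and dim only returns something for objects that have dimensions.

```r
# A numeric vector and a character vector that merely looks numeric
counts_num <- c(0, 1, 2, 1)
counts_chr <- c("0", "1", "2", "1")

class(counts_num)   # "numeric"
class(counts_chr)   # "character"

# sum() works on the numeric vector ...
sum(counts_num)

# ... but stops with an error on the character vector; tryCatch captures the message
result <- tryCatch(sum(counts_chr), error = function(e) conditionMessage(e))
result

# A plain vector has no dimensions, so dim() returns NULL
dim(counts_num)

# A matrix built from the same values has rows and columns
m <- matrix(counts_num, nrow = 2)
dim(m)              # 2 2
```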

While dim and class can be used to explore objects, you may also want to learn more about these functions themselves. You can look up their documentation with the following syntax:

# One question mark before a phrase opens its specific page of R documentation
?dim

# Two question marks before a phrase searches R documentation
??class

Simulating genotype data

For a simple simulation of genotype data, we’ll assume the SNP is biallelic and therefore use a binomial distribution. By passing our desired values to the arguments of the rbinom function, we can randomly generate genotypes. We set \(n\) equal to 1000, the number of individuals in our simulation. Assuming the individuals are diploid, the number of trials, \(size\), is 2 per individual. Finally, the probability of “success” for each trial, \(p\), is the minor allele frequency. Run the following simulation of genotype data:

# Simulate random SNP genotypes for 1000 diploid individuals, given a minor allele frequency of 0.2
num_individuals <- 1000
ploidy_level <- 2
maf <- 0.2
geno <- rbinom(n=num_individuals, size=ploidy_level, prob=maf)
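
As a quick sanity check on a simulation like this one, you can estimate the allele frequency back from the simulated counts. The sketch below re-runs the simulation under a different, hypothetical name (geno_counts) so that it does not overwrite geno; the fixed seed is an assumption added only to make the sketch reproducible.

```r
set.seed(1)  # assumption: fixed seed only so the sketch is reproducible

num_individuals <- 1000
ploidy_level <- 2
maf <- 0.2
geno_counts <- rbinom(n = num_individuals, size = ploidy_level, prob = maf)

# Each entry is the number of minor alleles (0, 1, or 2) carried by one individual
range(geno_counts)

# Estimated allele frequency = minor alleles observed / total alleles sampled
est_maf <- sum(geno_counts) / (ploidy_level * num_individuals)
est_maf  # should be close to 0.2, but not exactly equal
```

Note that the denominator counts alleles (two per diploid individual), not individuals.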

We now have the object “geno” containing genotypes for 1000 individuals. With the minor allele coded as 1 and the major allele as 0, each genotype is homozygous for the major allele (0+0=0), heterozygous (0+1=1), or homozygous for the minor allele (1+1=2). Let’s rename these as AA, Aa, and aa to make this more intuitive:

# The gsub function acts as a find and replace (it also converts geno to character); type ?gsub in the console for more info
geno <- gsub("0", "AA", geno)
geno <- gsub("1", "Aa", geno)
geno <- gsub("2", "aa", geno)

Check the number of times each genotype occurs by using the “table” function:

table(geno)
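
An alternative way to label and tabulate genotypes, sketched here as one option rather than the lab's required approach, is to convert the numeric counts to a factor. The names geno_counts, geno_labeled, and geno_table are hypothetical, and the block re-simulates its own data (with an assumed fixed seed) so that it runs on its own.

```r
set.seed(1)  # assumption: fixed seed only so the sketch is reproducible

# Re-simulate numeric genotype counts so this block is self-contained
geno_counts <- rbinom(n = 1000, size = 2, prob = 0.2)

# Map 0/1/2 to labels in one step; levels fixes the order AA, Aa, aa
geno_labeled <- factor(geno_counts, levels = 0:2, labels = c("AA", "Aa", "aa"))

# Counts per genotype (a level with no observations would still appear, with count 0)
geno_table <- table(geno_labeled)
geno_table

# Convert counts to proportions that sum to 1
prop.table(geno_table)
```

Using a factor keeps the genotype columns in a fixed order, which can make tables from different simulations easier to compare.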

Problem 7
We set our minor allele frequency as 0.2, but the proportion of genotypes containing the minor allele is roughly double that. Why is this the case?

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The source code is licensed under MIT.

Citation

For attribution, please cite this work as

Max Winston, others, Erik McIntire (2021). Lab 1. BIOS 25328 Cancer Genomics Class Notes. /post/2021/01/12/lab-1/

BibTeX citation

@misc{
  title = "Lab 1",
  author = "Max Winston, others, Erik McIntire",
  year = "2021",
  journal = "BIOS 25328 Cancer Genomics Class Notes",
  note = "/post/2021/01/12/lab-1/"
}