HGEN47100 Lab 1

2021-04-02

Categories: lab

Created by Max Winston; modified by Charles Washington III, Ankeeta Shah, and Yanyu Liang

This class is designed to introduce a variety of concepts, methods, and tools utilized in statistical genetics. The majority of tools used to employ these methods and concepts require basic knowledge of computing languages, specifically Unix and R. In today’s lab, we will cover the basic syntax of these languages and some basic commands necessary for operating statistical genetics programs. By the end of this lab, you should understand how to perform the following:

Navigate in R and Unix
Download and install software packages in Unix
Write looped bash commands to run in Unix
Manipulate large text files in Unix
Import data files into R
Manipulate and visualize files in R

Commonly used commands in Unix

List of Basic Commands

Command	Description
pwd	Print working directory (tells you where you are)
cd	Change directory
mkdir	Make directory
ls	List information about files
mv	Moves or renames files or directories
rm	Deletes files (can remove directories with -rf)
cp	Copy a file to a new location (can copy directories with -r)
scp	Secure copy a file (for transferring files between a remote server and local system)
cat	Concatenate and print the content of files to screen
sed	Stream editor (for searching, finding, and replacing text)
“pipe”	Uses the output of one command as the input for another
>	Writes output into file (must enter file name after, will overwrite)
>>	Writes output into file (must enter file name, will append, not overwrite)
curl	Transfer a URL (downloads something from the Internet)
wget	Transfer a URL (downloads something from the Internet)
tar	Creates tape archives and adds or extracts files
ssh	Remote log-in command (Secure Shell client)
emacs	Opens emacs text editor
vim	Opens vim text editor
nano	Opens nano text editor
man	Access manual for any command (exit = q)
top	Observe computer’s tasks
sort	Sorts a file (extensive options to direct this command)
head	Look at the first ten lines of file (useful for large files)
tail	Look at the last ten lines of file (useful for large files)
./	Existing directory (not a command)
../	Directory contaiing existing directory (not a command)
grep	Prints lines matching a pattern (helpful for manipulating large files)

SLURM specifics	Description
sbatch	Submits a job to the server
sinterative	Submits an interactive job to the server (Our labs will use it a lot)
squeue	Status of a job you’ve submitted to the server

Logging in and moving files

For the remainder of the document, please notice that steps requiring actions will generally be in bold. For example, Open Terminal if you are on a Mac.

Each of you should have been assigned a username on the server. You will use a Secure Shell client to log in to your home directory on the server. You can do this by:

ssh <username>@midway2.rcc.uchicago.edu

Note that you will be prompted to enter your password. One somewhat counter-intuitive facet of password entry in Unix is that there is no indiciation on-screen of characters being typed in. This is an intentional security feature, in order to prevent onlookers from inferring the length of your password. Type in your password when prompted. The midway2 server has recently implemented two-factor authentication, so you will next be prompted to log in with DUO, enter a passcode, or receive a call. After you do one of these three things, you should be on the server.

Using the commands listed above, do the following in your home directory:

1) Make your own directory.
2) Make a second directory.
3) Look where your directories live.
4) Move one of your directories into the other.
5) Change directory to your new directory.
6) Look inside your new directory.
7) Move to the original directory.
8) Read more about the rm command.
9) Delete both directories you’ve made.
10) Create a directory called for this lab session and move into it.

mkdir your_directory/         ##Step 1
mkdir dir_2/                  ##Step 2
pwd                           ##Step 3
mv your_directory/ dir_2/     ##Step 4
cd dir_2/                      ##Step 5
ls                            ##Step 6
cd ../                         ##Step 7
man rm                        ##Step 8
rm -r dir_2/                  ##Step 9
mkdir Lab_1/                  ##Step 10
cd Lab_1/                     ##Step 10

Tab-Completion

Tab completion is helpful for both speed and accuracy. The following exercise will demonstrate this:

1. Make a directory with a really long name.
1. Start typing the name of the directory.
1. Hit tab. The directory should complete itself.
1. Move back into the previous directory.
1. Start typing the name of the directory, intentionally introducing an error.
1. Hit tab. Notice nothing happens, this indicates you’ve made an error.
1. Remove this directory using tab completion.

mkdir only_an_idiot_would_ever_create_a_directory_like_this/    ##Step 1
cd only_a                                                       ##Step 2
[tab, then enter]                                               ##Step 3
cd ..                                                          ##Step 4
cd olny_a                                                       ##Step 5
[tab]                                                           ##Step 6
rm -r only_a                                                    ##Step 7
[tab, then enter]                                               ##Step 8

Using the manual command to explore other commands

The Unix language has an extensive number of commands, and most of these commands have numerous “options” which can modify these commands. Here we will use the man command to understand some of the helpful options for the previously introduced ls command.

Generally, the ls command will list everything in the directory you are in, but most of the time you will probably want a lot more information then just the name of the files and directories. Furthermore, you may want ls to order the contents in a particular way, especially if you have many files in a directory. This can be done by using the options or flags, which are letters added after a dash which will be typed after the command itself.

1. Use the manual command to explore the options for listing.

man ls            ##Step 1

Problem 1: What command would give you a detailed but human-readable list of the files in a directory in reverse-order of when they have last been modified?

Downloading and unpacking files and programs

Frequently, you may need files or programs that are not already installed on your computer or server. Often these files are compressed, and often these files are in ‘tarballs’, which have the extension .tgz. IMPUTE2 is a program we’ll use later in the quarter, so let’s download and unpack it:

1. Download the IMPUTE2 tarball from the internet. Here we download the tarball and write it into the impute2.tgz file
1. Unpack the tarball.

curl https://mathgen.stats.ox.ac.uk/impute/impute_v2.3.2_x86_64_static.tgz > impute2.tgz              ##Step 1
tar -zxvf impute2.tgz                                                                                 ##Step 2

Once a program is unpacked, generally there is a README file that you can open with a text editor command (i.e. vim), however, every program is different, and reading the manual is the only sure way to install the program correctly.

Manipulating text files

Often you may have a large text file that you want to manipulate in some way. Using a variety of commands in combination is usually the most effective way to do this, and the “pipe” command allows for a more condense syntax where you use the output of one command as the input for the next.

As a quick example, we will use a file from the IMPUTE2 directory. Specifically, the mapping file provides a fine-scale recombination map with three columns: (1) physical position (in base pairs), (2) recombination rate between current position and next position in map (in cM/Mb), (3) and genetic map position (in cM). Since you may be interested in the physical positions with the greatest recombination rates, we can easily get a list of the top ten using the following:

1. Move into the Example directory.
1. Manipulate the mapping file to create top ten list. We will name this “top_ten.map”.

cd impute_v2.3.2_x86_64_static/Example/                       ##Step 1
cat example.chr22.map | sort -k2nr | head > top_ten.map       ##Step 2

Problem 2: Step 2 is composed of multiple commands. Use the descriptions of the commands (Section 1.1) and the man command to describe Step 2 command-by-command.

Problem 3: How would you modify this command if you only wanted the top three?

Writing bash scripts

Having the capability to write bash scripts which can be run in the Unix environment is critical, especially for running replicates or collating results from a command-line program. We will write a bash script using the vim text editor, and then run it. Alternatively, feel free to use your own text editor (emacs, textwrangler, etc.) to write the script below. Please start with the following:

1. Create a new scipt file. We will name ours test.sh
1. Add the bash tag at the top of the file.
1. Write a for-loop to create ten copies of the mapping file with an additional line in each.
1. Save and exit.
1. Run script.

vim test.sh                                                       ##Step 1  
[now in vim, type "i" to allow you to insert text]  
#!/bin/bash                                                       ##Step 2
for i in {1..10}; do                                              ##Step 3
    echo "golden nugget" {$i} > top_ten_{$i}.map                  ##Step 3
    cat top_ten.map >> top_ten_{$i}.map                           ##Step 3
done                                                              ##Step 3
[exit vim by first pressing "Esc" and then ":wq" to save and quit (use :q to quit or :q! to quit without saving any changes you made)]    ##Step 4
bash test.sh                                                      ##Step 5

Problem 4: Find the golden nugget! Write a new script named extract.sh where the first line of each of the top_ten replicate files is written into a single file named extract.txt. Note: you may want to refer to Section 1.1 to find a command that can select single lines.

Storing files on the cluster

The workspace you’ve been using thus far for this lab is your ‘home’ directory. This space is quickly accessed and serves as a good workspace to enter your commands and edit files. However, there is limited storage space here. Long term storage space can be found /project2/hgen47100/. Go to this path and create a directory with your name to store any files that you want to keep for the long term.

cd /project2/hgen47100/
mkdir <your_folder_name>_Lab/
chmod 700 <your_folder_name>_Lab/
[to return to your home directory, enter cd /home/<username>]

Basic commands in R

R is used widely for scientific computing, however, the syntax is notably different than Unix. Specifically, two of the most visible differences are that variables are usually assigned with the assignment operator (<-) operator, and functions are applied using parentheses. While R can be accessed and R scripts can be run within the Unix environment, it is often preferable to operate R within the graphical user interface (GUI) known as RStudio. Unlike many poorly made GUIs that offer little advantage to a Unix environment, RStudio offers many organizational and functional benefits. Open RStudio.

Reading in data files

Although data can be simulated, generated, and analyzed all within R, often we may want to visualize or analyze data from some other source. We can learn R syntax by learning how to import new data. Use the following commands to set the directory and import the downloaded text file lab1_r_example_data.txt as the data object “new_data”. Don’t forget to set your directory to where you saved it:

# To change the working directory, you may want to modify and comment/uncomment the following line
# setwd("data")
# read table
new_data <- read.table("data/lab1_r_example_data.txt", header=TRUE)

Exploring objects, functions, and packages

dim(new_data)

## [1] 16  2

class(new_data)

## [1] "data.frame"

Here we can see that the dimensions of new_data are 16 rows and 2 columns, and that it is in data frame format. This is important to know since different functions apply differently to different classes, which may cause errors. While dim and class can be used to explore objects, you may want to explore these functions themselves. This is done by this simple syntax:

?dim
??class

Problem 5: What is the difference between using one versus two question marks?

Let’s stick with the output from a simple query (one question mark). Notice that both dim and class are relatively simple functions with few arguments. You probably won’t have to read the documentation much for either function, but as you develop your skills using statistical genetics, you will spend a lot of time reading documentation for more complicated functions with several arguments (i.e. input) and many values (i.e. output).

List of basic classes and functions

Class	Description
numeric	Object is numeric (interger or double)
integer	Object is integer (not double)
double	Object is double
character	Object is not numeric, composed of characters
vector	One-dimensional, can contain numeric or characters
matrix	Two-dimensional, can contain numeric or characters
array	Multi-dimensional, can contain numeric or characters
data frame	A list of vectors of equal lenght, each vector contains same class
list	A list of various generic vectors

Functions are how you execute your commands and write your scripts. It is critical to ensure the arguments are of the correct class (read documentation), as most user-errors originate from providing data in the inappropriate class or dimension.

Function	Description
c(1,2,3)	Concatenate numbers into a vector
c(1:10	Concatenate from number 1 to number 10 in a vector
c(“a”,“b”,“c”)	Concatenate characters into a vector
cbind(x,y,z)	Binds vectors x,y,z into matrix by columns
rbind(x,y,z)	Binds vectors x,y,z into matrix by rows
matrix(x,nrow,ncol)	Creates matrix from vector x with nrow rows and ncol columns
array(x,dim=y)	Creates array from vector x with dimensions of vector y
rep(x,y)	Creates vector repeats vector x interger y times
length(x)	Returns length of object x
sample(x,size=n)	Samples number of items (n) from elemetns of object x
subset(x,subset)	Subsets object x by conditions of subset
mean(x)	Returns arithmetic mena of numeric object x
sd(x)	Returns standard deviation of numeric object x
max(x)	Returns maximum value of numeric object x
min(x)	Returns minimum value of numeric object x
t(x)	Transposes matrix x
plot(x,y)	Plots elements of vector x agains elements of vector y
boxplot(x~y)	Generates boxplots for values of numeric vector x by grouping vector y
lm(x~y)	Generates a linear model for response vector x by predictor vector y
density(x)	Estimates density distributions of univariate observations
table(x)	Returns summary of frequencies of all values in R object

Simple Plotting

Sometimes the first thing you want to do is generate a really quick plot of your data before anything else. This is one of the advantages of R, in that visualizing data is very intuitive and easy because it is object-oriented. For example, if you want to plot the first column versus the second column of data from your object, type the following command:

plot(new_data$Bob,new_data$Steve)

While we do get to see the data quickly, it’s kind of messy. To give the same plot a title and label the axes, you can do this easily by adding the following arguments to the command:

plot(new_data$Bob,new_data$Steve,main="Cheeseburgers Eaten",xlab="Bob",ylab="Steve")

Problem 6: At first glance the current plot may be a bit misleading because of the differences in scale between the axes. Use the query function learned in Section 2.2 to fix the scale on the axes so that both axis range from 0 to 100. Note that the necessary arguments may take vectors.

Subsetting and targeting elements of R objects using indices

Although the function subset is useful to subset R objects by particular values, objects of classes vector,matrix,array, and data frame are designed for quick manipulation using indices. To illustrate this clearly, type the following commands to create a 3x4 matrix and look at it:

?matrix
mat <- matrix(1:12,nrow = 3,ncol = 4,byrow=TRUE)
mat

##      [,1] [,2] [,3] [,4]
## [1,]    1    2    3    4
## [2,]    5    6    7    8
## [3,]    9   10   11   12

Notice that above each column R displays the index for that column, and likewise for each row. Follow the steps below to create a different matrix based on the matrix “mat”:

1. Make column 1 a vector named “A”
1. Make column 3 a vector named “C”
1. Add columns 2 and 4 together to make a vector “BD”
1. Make a matrix named “new_mat” by binding vectors “A”,“B”,“BD” by column
1. Look at matrix “new_mat”
1. Multiply the entire matrix “new_mat” by the element in the second row and second column, name “final_mat”
1. Look at matrix “final_mat”

A <- mat[,1]                       # Step 1
C <- mat[,3]                       # Step 2
BD <- mat[,2] + mat[,4]            # Step 3
new_mat <- cbind(A,C,BD)           # Step 4
new_mat                            # Step 5

##      A  C BD
## [1,] 1  3  6
## [2,] 5  7 14
## [3,] 9 11 22

final_mat <- new_mat[2,2]*new_mat
final_mat

##       A  C  BD
## [1,]  7 21  42
## [2,] 35 49  98
## [3,] 63 77 154

One important note for plotting is that the syntax is different between data frames and matrices. Where data frames use the $ to extract vectors to plot against each other, matrices use the indices used above to create vectors. For example, to plot the first and second columns of our new matrix “final_mat” against each other, type the following command:

plot(final_mat[,1],final_mat[,2])

Problem 7: You want a quick estimate on how often you will win the dice game craps on the first roll. You do this by constructing 500 random rolls (2 dice/roll) and looking at the results. Use the sample function to sample the numbers 1 to 6 with replacement 1000 times, naming this vector “dice”. Take this vector and make it into a matrix named “pairs” with 2 columns and 500 rows, simulating the 500 rolls. Make a new vector summing the first and second columns named “rolls”. If 7 and 11 are the two rolls for victory on the first roll, estimate the probability of winning from your 500-roll sample. Hint: The function table might be useful in this problem.

For-loops

Often in statistical genetics you may need to repeat something several times. For example, collating results over several replicates may require a for-loop that stores the results in an array.

To illustrate the point, let’s consider a simple example. Suppose we’d like to learn how the sample mean ($\bar{X}$) of 50 iid standard normal random variables, $X_1, \cdots, X_{50} \sim N(0, 1)$, behaves by simulation. To obtain one sample mean, we need to first draw 50 samples, $X_1, \cdots, X_{50}$, from standard normal, and then calculate the sample mean as $\bar{X} = \frac{1}{n} \sum_i X_i$. So, to obtain multiple copies of the sample mean, we need to repeat the above procedure multiple times. Let’s look at how we could implement it in R using for-loop (here we intent to obtain 1000 copies of sample mean).

1. Create an empty vector of length 1000 named “sample_means”. We will fill this with sample means obtained in each replication.
1. For each replication
- 2.1) Draw 50 samples
- 2.2) And calculate sample mean
1. Look at “sample_means” using summary function

nrepeat = 1000
nsample = 50
sample_means <- rep(NA, nrepeat)                # Step 1
for (i in 1:nrepeat){
  samples_i = rnorm(nsample, mean = 0, sd = 1)  # Step 2.1
  sample_means[i] = mean(samples_i)             # Step 2.2
}
summary(sample_means)                           # Step 3

##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## -0.3990160 -0.1014323  0.0005884 -0.0008194  0.0935761  0.5137038

Problem 8: How would you write a command to take the standard deviation of all elements in “sample_means” vector? Is it close to what you would expect?

Installing and loading packages

Many of the functions you may need in R may not be pre-installed. Luckily, it is very easy to install and load them, especially in RStudio. One package that is suggested for more aesthetic plotting is “ggplot2”. It is easy install and load this package, first install the package by navigating to the packages tab in the bottom right window of RStudio and click the “Install Packages” button. Then, follow the code below to load the package. Alternately, you can use the install.packages function with the name of the desired package:

library(ggplot2)

You can do this with any package and can also look through all of your installed packages in RStudio to see what is loaded and not loaded already. Occasionally, you may need to install “dependencies”, which are required packages to run a specific package. Checking the box next to a package in RStudio is equivalent to using the library command.

Introduction to ggplot2

Although the standard plotting introduced in Section 2.4 will accomplish all of your plotting needs, many people prefer the aesthetics and functionality of plotting in ggplot2. We will do a very basic introduction here, just to demonstrate the syntax.

Let’s use dataset called iris available in R.

head(iris, 2)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa

In short, it is giving you four measurements in three types of iris. To learn more about the dataset, you can type ?iris in R console.

Now that we have our dataset “iris”, we can use this for our basic introduction to ggplot2. First, let’s try to just plot the first two columns (x and y coordinates) using qplot, a function made for very basic plotting in ggplot2. Follow the following steps to generate plots:

plot1 <- ggplot(iris) + geom_point(aes(x = Sepal.Length, y = Petal.Length))
plot1

plot2 <- ggplot(iris) + geom_point(aes(x = Sepal.Length, y = Petal.Length, color = Species))
plot2

plot3 <- ggplot(iris) + geom_point(aes(x = Sepal.Length, y = Petal.Length, color = Species), alpha = 0.5)
plot3

Notice a couple things here.

To color the points by the “third”Species" column, that is simply added to the function via the “color” argument.
When there are many dots, sometimes it is helpful to make them semi-transparent so that density is easier to see. This is done via the “alpha” argument, and ranges from 0 (transparent) to 1 (opaque).
Notice the coloring could be on a spectrum if the variable is numeric. For example, follow the commands below to color by Sepal.Width:

plot4 <- ggplot(iris) + geom_point(aes(x = Sepal.Length, y = Petal.Length, color = Sepal.Width))
plot4

Problem 9: Take a look at the help page of geom_boxplot and make a boxplot where “Species” is on x-axis and “Petal.Length” is on y-axis