Advanced Data Analysis - Introduction to NGS data analysis

Department of Animal and Plant Sciences, University of Sheffield

Nicola Nadeau, Alison Wright, Helen Hipperson

The aim of this course is to introduce handling NGS sequence data on the UoS HPC cluster (ShARC) and some of the analyses you might want to do, including investigating gene expression and nucleotide variation (SNPs).

Schedule 2019/2020

| Content | Date | Session | Venue | Lead | TAs |
|---|---|---|---|---|---|
| Introduction to the HPC and NGS data | Mon 02/03/2020 | Morning (9-12) | ADB - A04 (Perak) | Nicola Nadeau | Thea Rogers, Jake Pepper, Naomi Cox |
| Sequence data formats and assessing sequence quality | Mon 02/03/2020 | Morning (9-12) | ADB - A04 (Perak) | Nicola Nadeau | Thea Rogers, Jake Pepper, Naomi Cox |
| Aligning Illumina RNA-seq data | Mon 02/03/2020 | Afternoon (2-5) | ADB - A04 (Perak) | Alison Wright | Thea Rogers, Jake Pepper, Naomi Cox |
| Differential gene expression analyses | Wed 04/03/2020 | Afternoon (2-5pm) | ADB - A04 (Perak) | Alison Wright | Thea Rogers, Jake Pepper, Naomi Cox |
| SNP and genotype calling | Fri 06/03/2020 | Afternoon (2-5pm) | ADB - A04 (Perak) | Helen Hipperson | Thea Rogers, Jake Pepper, Naomi Cox |

General notes

Here are some websites that are useful to have on hand (you might want to bookmark them so you can easily go back to them):

Linux and Shell cheatsheet (it’s not cheating!): http://rcg.group.shef.ac.uk/courses/linux/shell-cheatsheet.html

CiCS page on using the ShARC cluster: https://www.sheffield.ac.uk/cics/research/hpc/sharc

CiCS page on interactive usage of the cluster: https://www.sheffield.ac.uk/cics/research/hpc/using/interactive

CiCS page on submitting jobs to the cluster (more on this later): https://www.sheffield.ac.uk/cics/research/hpc/sharc/batch

Documentation on file storage on ShARC: http://docs.iceberg.shef.ac.uk/en/latest/hpc/filestore.html

The genomics software repository: http://soria-carrasco.staff.shef.ac.uk/softrepo/

Working off-campus

To access ShARC from off-campus you may need to connect to the University’s VPN.

If you haven’t used the VPN before, you will first need to get a remote access password (not the same as your usual account password), which you can do from here: https://www.sheffield.ac.uk/it-services/password/

You should then follow the instructions to set up the VPN: https://www.sheffield.ac.uk/it-services/vpn

Logging in and getting started

We will be working on Windows machines, which means that you need to use a program (an SSH client) to access the cluster. We will be using MobaXterm. Start by opening the program. If you have used it before to connect to ShARC, you may find “sharc.shef.ac.uk” under “User sessions”, in which case you can just double-click on it to launch an SSH session on ShARC. If not, click on “Session” > “SSH” and enter

sharc.sheffield.ac.uk

in the “Remote host” box and specify your username (port should always be 22).

Request an interactive session:

qrsh

You should always start by doing this. No work should ever be done on the head node! If you are on a head node you will see something like this in your command line prompt:

[bo1nn@sharc-login1 ~]$

This node is just a gateway to the worker nodes. If you are on a worker node you will see the name of the node, e.g.

[bo1nn@sharc-node004 ~]$
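As a quick sanity check, something like the following sketch can tell you whether your current shell is on a head node. It assumes that head nodes always have "login" in their hostname, as in the prompt shown above; the function name node_type is hypothetical and not part of the cluster setup.

```shell
#!/bin/sh
# Classify a hostname as "login" (head node) or "other".
# Assumption: head node names contain the string "login".
node_type() {
    case "$1" in
        *login*) echo "login" ;;
        *)       echo "other" ;;
    esac
}

# Check the machine this shell is running on (uname -n prints its hostname).
if [ "$(node_type "$(uname -n)")" = "login" ]; then
    echo "You appear to be on a head node: run qrsh before doing any work." >&2
else
    echo "This does not look like a head node: $(uname -n)"
fi
```

If the first message appears, request an interactive session with qrsh and check again.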

Important note


This tutorial relies on having access to a number of programs. The easiest way is to have your account configured to use the Genomics Software Repository. If that is the case you should see the following message when you get an interactive session with qrsh:

  Your account is set up to use the Genomics Software Repository
     More info: http://soria-carrasco.staff.shef.ac.uk/softrepo

If you don’t get that message, follow the instructions here to set up your account.

In addition, if you want the nano text editor to have syntax highlighting and line numbering, you can configure it this way:

cat /usr/local/extras/Genomics/workshops/January2018/.nanorc >> /home/$USER/.nanorc
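Because >> appends, running the command above more than once adds the same settings to your .nanorc again each time. A minimal sketch of one way to avoid that is shown here. The helper append_once and the marker line are hypothetical names introduced for illustration, not part of the workshop files.

```shell
#!/bin/sh
# append_once SRC DEST MARKER
# Append SRC to DEST, preceded by the MARKER line, unless DEST already
# contains that exact MARKER line (i.e. the settings were added before).
append_once() {
    local src="$1" dest="$2" marker="$3"
    if [ ! -r "$src" ]; then
        echo "cannot read: $src" >&2
        return 1
    fi
    if [ -f "$dest" ] && grep -qxF "$marker" "$dest"; then
        echo "already configured: $dest"
        return 0
    fi
    { echo "$marker"; cat "$src"; } >> "$dest"
    echo "updated: $dest"
}

# Example call, using the workshop file from the command above:
# append_once /usr/local/extras/Genomics/workshops/January2018/.nanorc \
#     "$HOME/.nanorc" "# workshop nanorc settings"
```

The marker starts with # so that nano treats it as a comment in the configuration file.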

Note on transferring output files to your local computer for visualization


You will probably want to transfer files to your own computer for visualization (especially the images). If you are working on a Windows machine and using MobaXterm, the easiest option is to use the graphical SFTP panel on the left, using the icons or dragging and dropping from your computer.

Another possibility is to email the files, for example:

echo "Text body" | mail -s "Subject: gemma - hyperparameter plot" -a /data/myuser/gwas_gemma/output/hyperparameters.pdf your@email

On Linux and Mac, you can use rsync in the terminal. For example, to transfer one of the PDF files or all the results that are generated in this practical, the command would be:

# transfer PDF file
rsync myuser@sharc.sheffield.ac.uk:/data/myuser/gwas_gemma/output/hyperparameters.pdf ./
# transfer all results
rsync -av myuser@sharc.sheffield.ac.uk:/data/myuser/gwas_gemma/output ./
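Before copying a whole directory it can help to preview the transfer. rsync's -n (--dry-run) option lists what would be copied without transferring anything. The sketch below only prints such a command so you can check the username, host and path before running it yourself. The function name dry_run_command is hypothetical, the username and path are the placeholders used above, and the host is assumed to be the ShARC login host given earlier.

```shell
#!/bin/sh
# Print (but do not run) an rsync dry-run command for a remote directory.
# $1 = username, $2 = host, $3 = remote directory
dry_run_command() {
    printf 'rsync -avn %s@%s:%s ./\n' "$1" "$2" "$3"
}

# Placeholder values: replace them with your own username and path.
dry_run_command myuser sharc.sheffield.ac.uk /data/myuser/gwas_gemma/output
```

Once the printed command looks right, run it as is to preview the file list, then drop the n (leaving -av) to do the actual transfer.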

Other graphical alternatives are WinSCP, FileZilla or Cyberduck. You can find more detailed information here.