Introduction to the FAS Research Computing Resources

John Brunelle

ComputeFest, January 14, 2014

About us

FAS Research Computing

Research Computing: http://rc.fas.harvard.edu/

High Performance Computing: the old way

Old Way: Individual groups maintain their own resources

High Performance Computing: the new way

New Way: RC and HUIT provide resources

Value to researchers

When dealing with clusters, fully loaded machines are a good thing!

Odyssey

Our premier resource is the Odyssey cluster

A node

The typical hardware unit is called a node

It has all the same stuff that's in a desktop or laptop:

But more powerful and/or more of them compared to a typical desktop

Nodes are individual hosts with names like rclogin03 or holy2a18208

A core

The basic computational unit in a cluster is a CPU core

Most of our nodes have 64 cores; thus, they typically run up to 64 batch job processes at a time

Inside a node

Each node has multiple CPU cores:

A chassis

Each chassis has 16 nodes:

Our original Odyssey cluster

Connect a bunch of chassis together and you have a cluster:

More aspects of the cluster

Plus there are all the important things that make Odyssey much more than just a collection of individual computers:

More hardware

InfiniBand (RDMA) networking

Network storage

Accessing Odyssey

Using a Mac or Linux

Run ssh from the command line (on a Mac, use Applications | Utilities | Terminal). Both login.rc.fas.harvard.edu and odyssey.fas.harvard.edu work:

ssh myusername@login.rc.fas.harvard.edu

Answer yes at this prompt:

The authenticity of host 'login.rc.fas.harvard.edu (140.247.232.235)' can't be established.
RSA key fingerprint is da:bb:90:7b:6b:a8:73:2a:83:db:89:19:da:4a:66:16.
Are you sure you want to continue connecting (yes/no)? yes

Enter your Odyssey account password:

Password: 

And then the six-digit number displayed by your JAuth / Google Authenticator app:

Verification code:

Start a terminal -- Windows (PuTTY)

Use PuTTY (putty.exe), a free ssh client:

Start a terminal -- Windows (PuTTY) (2)

Accept any warnings:

Enter your Password, and then the Verification code -- the six digits displayed in the JAuth or Google Authenticator app, no spaces:

Start a terminal -- Windows (SecureCRT)

You can alternatively use SecureCRT, but you must make the following change:

On the FAS computers, this is under Start | Programs | Internet and Multimedia | SecureCRT

More about logging in

Ask us about how to connect to Odyssey for git, svn, etc. without having to login for every operation
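
One common approach -- a sketch only; check with us for the recommended setup -- is SSH connection multiplexing, configured in ~/.ssh/config on your own machine:

Host login.rc.fas.harvard.edu
    User myusername
    ControlMaster auto
    ControlPath ~/.ssh/cm-%r@%h-%p
    ControlPersist 4h

With this in place, the first connection prompts for your password and verification code as usual, and subsequent ssh, scp, git, or svn connections to that host reuse the open connection for up to four hours without prompting again.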

Exercises

  1. Log into Odyssey
  2. Make a directory named workshop in your home directory, and cd into it
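
For reference, exercise 2 boils down to:

cd               # start from your home directory
mkdir workshop
cd workshop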

Break

Where are we?

The prompt tells you a few things -- your username (cfest350), the host you're logged into (rclogin03), and your current working directory (~, your home directory):

[cfest350@rclogin03 ~]$

Filesystems and storage

The virtual filesystem:

The distinctions between individual filesystems matter:

memory is not storage

Odyssey filesystems

...

Odyssey filesystems (2)

...

You may hear of our legacy filesystems like /n/nss2b/, /n/scratch2/, etc.

Full details of storage options are at http://rc.fas.harvard.edu/faq/storage

File recovery

If you accidentally delete a file from your home directory, you can often recover it from the checkpoint directory

This is a directory named .snapshot that's not listed by ls -a!

[cfest350@rclogin03 ~]$ cd .snapshot
[cfest350@rclogin03 .snapshot]$ ls
rc_homes_daily_2014-01-07-_12-00   rc_homes_hourly_2014-01-14-_00-00
rc_homes_daily_2014-01-08-_12-00   rc_homes_hourly_2014-01-14-_01-00
rc_homes_daily_2014-01-09-_12-00   rc_homes_hourly_2014-01-14-_02-00
rc_homes_daily_2014-01-10-_12-00   rc_homes_hourly_2014-01-14-_03-00
rc_homes_daily_2014-01-11-_12-00   rc_homes_hourly_2014-01-14-_04-00
rc_homes_daily_2014-01-12-_12-00   rc_homes_hourly_2014-01-14-_05-00
rc_homes_daily_2014-01-13-_12-00   rc_homes_monthly_2013-11-01-_00-00
rc_homes_hourly_2014-01-13-_18-00  rc_homes_monthly_2013-12-01-_00-00
rc_homes_hourly_2014-01-13-_19-00  rc_homes_monthly_2014-01-01-_00-00
rc_homes_hourly_2014-01-13-_20-00  rc_homes_weekly_2013-12-22-_12-00
rc_homes_hourly_2014-01-13-_21-00  rc_homes_weekly_2013-12-29-_12-00
rc_homes_hourly_2014-01-13-_22-00  rc_homes_weekly_2014-01-05-_12-00
rc_homes_hourly_2014-01-13-_23-00  rc_homes_weekly_2014-01-12-_12-00
...
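
To restore a file, just copy it back out of the snapshot you want (myfile.txt below is a hypothetical filename):

cp ~/.snapshot/rc_homes_hourly_2014-01-14-_05-00/myfile.txt ~/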

Transferring files to/from Odyssey
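
From a Mac or Linux terminal on your own machine, scp and rsync both work; for example (the local file and directory names here are hypothetical):

scp mydata.txt myusername@login.rc.fas.harvard.edu:workshop/
scp myusername@login.rc.fas.harvard.edu:workshop/results.txt .
rsync -av mydatadir/ myusername@login.rc.fas.harvard.edu:workshop/mydatadir/

Each transfer prompts for your password and verification code, just like ssh.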

Using software: modules

Modules how-to

[cfest350@rclogin03 workshop]$ module load centos6/R-3.0.2 
Loading module centos6/R-3.0.2.
[cfest350@rclogin03 workshop]$ which R
/n/sw/centos6/R-3.0.2/bin/R

R is a popular application used for statistical computing and graphics

It can run in interactive or non-interactive mode, has a command-line interface, can create graphics, etc.

We'll use it to demonstrate how to use the batch system -- the specific R-related details are not important here

Modules how-to (2)
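
A few other commonly used module commands (standard commands for the modules system; the module name below is the one loaded above):

module avail                    # list all available modules
module list                     # show the modules loaded in this shell
module unload centos6/R-3.0.2   # remove a module from the environment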

fasrcsw and lmod

Our module set has grown explosively and looks rather chaotic

This winter and spring we're switching to a new software management system we call fasrcsw

As part of this, we're switching the modules implementation to lmod

Exercises

  1. Create a file on the local computer and upload it to your home directory on Odyssey
  2. Download a file from your home directory on Odyssey to the local computer
  3. Find all the modules for the different versions of Matlab (a hint follows this list)
  4. See what's in your ~/.snapshot directory
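
One way to approach exercise 3 (module avail prints its listing to stderr, hence the redirection -- more on redirection below):

module avail 2>&1 | grep -i matlab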

Break

Don't abuse the access nodes

The slurm batch job system

We now use a batch job system named slurm (as opposed to alternatives such as LSF, PBS, SGE, etc.)

Partitions

A unit of computational work is called a job

Jobs are submitted to a batch queue, a.k.a. partition in slurm

Partitions are designed to:

Jobs are managed from the command line; see man slurm

Slurm documentation

http://www.schedmd.com/slurmdocs/
(https://computing.llnl.gov/linux/slurm/, though usually higher in google page rank, is older)

Many commands, overlapping functionality, not all with coherent options:

Suggestion: build yourself a crib-sheet of what works for you
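
As a starting point, these are the commands used in this workshop:

sbatch     submit a batch job script
srun       run a command as a job (e.g. an interactive shell)
squeue     list pending and running jobs
sacct      report accounting information on running and finished jobs
scancel    kill a job
scontrol   show detailed information about jobs, partitions, and nodes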

What partitions exist?

The scontrol show partitions command shows just the partitions to which you're able to submit:

[cfest350@rclogin03 workshop]$ scontrol show partitions
PartitionName=general
   AllocNodes=ALL AllowGroups=rc_admin,cluster_users Default=YES
   DefaultTime=NONE DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=7-00:00:00 MinNodes=1 MaxCPUsPerNode=UNLIMITED
   ...

PartitionName=interact
   AllocNodes=ALL AllowGroups=cluster_users,rc_admin Default=NO
   DefaultTime=NONE DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=3-00:00:00 MinNodes=1 MaxCPUsPerNode=UNLIMITED
   ...
 
...

What type of job do I have?

Note that running many serial jobs at the same time does not mean you're running parallel jobs

Starting an interactive job

Use srun to submit interactive jobs

[cfest350@rclogin03 workshop]$ srun -p interact --pty bash
srun: job 5038070 queued and waiting for resources
srun: job 5038070 has been allocated resources
[cfest350@holy2a18208 workshop]$

Where are we now?

We now have a shell on a compute node

Notice the new hostname in the prompt:

[cfest350@holy2a18208 workshop]$

Note that our environment customizations have transferred to the new shell:

[cfest350@holy2a18208 workshop]$ pwd
/n/home00/cfest350/workshop
[cfest350@holy2a18208 workshop]$ module list
Currently Loaded Modulefiles:
  1) hpc/intel-mkl-11.0.0.079         6) centos6/gsl-1.16_gcc-4.4.7
  2) hpc/jdk-1.6                      7) centos6/hdf5-1.8.11_gcc-4.4.7
  3) centos6/tcl-8.5.14               8) centos6/netcdf-4.3.0_gcc-4.4.7
  4) centos6/tk-8.5.14                9) centos6/R-3.0.2
  5) centos6/fftw-3.3_gcc-4.4.7

If your goal is to run R you could just tell slurm to run that instead of bash, but it's often more flexible this way
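
For example, something like this (a sketch, assuming the R module is already loaded) would drop you straight into an interactive R session on a compute node:

srun -p interact --pty R --vanilla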

Background: I/O redirection

Shell programs usually do I/O with the terminal using three streams:

standard input (stdin) -- normally the keyboard
standard output (stdout) -- normal results, normally the screen
standard error (stderr) -- error and status messages, also normally the screen

Background: I/O redirection (2)

Shell commands can be redirected to write to files instead of the screen, or to read from files instead of the keyboard
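
A quick reference for bash (myprogram and the filenames are hypothetical):

myprogram < input.txt                   # read stdin from a file instead of the keyboard
myprogram > output.txt                  # write stdout to a file (overwriting it)
myprogram >> output.txt                 # append stdout to a file
myprogram 2> errors.txt                 # write stderr to a file
myprogram > output.txt 2> errors.txt    # send stdout and stderr to separate files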

(Note: the redirection syntax shown here is for bash; tcsh handles some of this, particularly stderr, differently)

Exercises

An R program that computes the average of a collection of numbers:

x <- c(3, 5, 11)
mean(x)

  1. Run the command R and enter these commands
  2. Use quit() to exit R, and answer n to not save the workspace
  3. Use nano to save this program to a file named commands.R
  4. Use input redirection to run it (hint: R needs an option like --vanilla when it's run non-interactively)
  5. Use output redirection to have all the output and errors go to files instead of the screen

Break

Exit from the interactive job

As always, use the exit command to leave the shell:

[cfest350@holy2a18208 workshop]$ exit
exit
[cfest350@rclogin03 workshop]$

Note that any files created in the interactive session are still there (since home directories are shared network storage):

[cfest350@rclogin03 workshop]$ ls
commands.R  myRjob.err  myRjob.out

However, if we had modified the environment, changed directories, etc., those differences are not propagated back up to the original shell session

Non-Interactive jobs

The main purpose of slurm, however, is to run jobs non-interactively

Non-interactive jobs are submitted with sbatch, but usually you want to specify more options

We recommend gathering those options and the program(s) to be run in a job submission script

The job submission script

First, let's create a new directory named myRdir, cd into it, and copy our commands.R file there:

[cfest350@rclogin03 workshop]$ mkdir myRdir
[cfest350@rclogin03 workshop]$ cd myRdir
[cfest350@rclogin03 myRdir]$ cp ../commands.R .

Let's create a text file, using nano, named myRjob.sbatch and with the following contents:

#!/usr/bin/env bash
#SBATCH -J myRjob
#SBATCH -o myRjob_slurm.out
#SBATCH -e myRjob_slurm.err
#SBATCH -p computefest
#SBATCH -n 1
#SBATCH -t 5
#SBATCH --mem=100

R --vanilla < commands.R > myRjob.out 2> myRjob.err

sbatch options

Jobs get the same shell environment as the submitting shell session, so relative paths like this mean the files will be written to the current working directory at the time that you run sbatch

Any number of programs or amount of shell script can follow the #SBATCH options

Memory and time limits

Slurm imposes a memory limit on every job -- 100 MB by default

Use --mem to set a higher one:
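
For example, to request 4 GB (the value is interpreted as megabytes; 4000 here is just an illustration):

#SBATCH --mem=4000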

More details here: SLURM: memory limits

Likewise, -t is the time limit (in minutes, if just a number)
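
It also accepts hours:minutes:seconds and days-hours:minutes:seconds formats, e.g.:

#SBATCH -t 30           # 30 minutes
#SBATCH -t 2:00:00      # 2 hours
#SBATCH -t 7-00:00:00   # 7 days (the maximum on the general partition)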

Submitting the job

Submit it:

[cfest350@rclogin03 workshop]$ sbatch myRjob.sbatch 
Submitted batch job 5056069

Confirm that it's in the partition:

[cfest350@rclogin03 myRdir]$ squeue -j 5056130
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
5056130 computefe   myRjo  cfest350 PD       0:00      1 (Priority)

PD means pending (add -l for less abbreviated output)

Eventually it will run:

[cfest350@rclogin03 myRdir]$ squeue -j 5056130
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
5056130 computefe   myRjob cfest350  R       0:06      1 holy2a18208

Waiting for the job

When the job has finished, it will disappear from squeue output.

[cfest350@rclogin03 myRdir]$ squeue -j 5056130
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

Add -t ALL to see finished jobs, too:

[cfest350@rclogin03 myRdir]$ squeue -t ALL -j 5056130
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
5056130 computefe   myRjob cfest350 CD       0:09      1 holy2a18208

See also the --mail-type options for sending email upon certain job events

Job states
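
The ST column in squeue output is the job state; the codes you'll see most often are:

PD  pending (waiting for resources or priority)
R   running
CD  completed (exited with code 0)
F   failed (non-zero exit code)
CA  cancelled (e.g. by scancel)
TO  timed out (hit its time limit)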

Getting job info

Getting job info (2)

squeue
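
A couple of common squeue invocations:

squeue -u $USER          # all of your own jobs
squeue -u $USER -t PD    # just your pending jobs
squeue -p computefest    # everything in one partition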

Getting job info (3)

sacct
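
sacct reports accounting data even after a job has finished; for example (using the job ID from above; output not shown):

sacct -j 5056069
sacct -j 5056069 --format=JobID,JobName,Partition,State,Elapsed,MaxRSS,ExitCode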

Listing jobs (2)

For the full job details, use scontrol -dd show job JOBID:

[cfest350@rclogin03 workshop]$ scontrol -dd show job 5056069
JobId=5056069 Name=myRjob
   UserId=cfest350(34905) GroupId=computefest_group(34727)
   Priority=199308664 Account=cluster_users QOS=normal
   JobState=PENDING Reason=Priority Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:05:00 TimeMin=N/A
   SubmitTime=2014-01-14T03:35:34 EligibleTime=2014-01-14T03:35:34
   StartTime=Unknown EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=computefest AllocNode:Sid=rclogin03:11417
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqS:C:T=*:*:*
   MinCPUsNode=1 MinMemoryNode=100M MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/n/home00/cfest350/workshop/myRjob.sbatch
   WorkDir=/n/home00/cfest350/workshop
   BatchScript=
#!/usr/bin/env bash
#SBATCH -J myRjob
#SBATCH -o myRjob_slurm.out
#SBATCH -e myRjob_slurm.err
#SBATCH -p computefest
#SBATCH -n 1
#SBATCH -t 5
#SBATCH --mem=100

Getting the output of jobs (2)

And the files will show up in the directory from which the job was submitted:

[cfest350@rclogin03 myRdir]$ ls -l
total 81
-rw-r--r-- 1 cfest350 computefest_group  25 Jan 14 07:52 commands.R
-rw-r--r-- 1 cfest350 computefest_group   0 Jan 14 07:56 myRjob.err
-rw-r--r-- 1 cfest350 computefest_group 741 Jan 14 07:56 myRjob.out
-rw-r--r-- 1 cfest350 computefest_group 214 Jan 14 07:53 myRjob.sbatch
-rw-r--r-- 1 cfest350 computefest_group   0 Jan 14 07:56 myRjob_slurm.err
-rw-r--r-- 1 cfest350 computefest_group   0 Jan 14 07:56 myRjob_slurm.out
[cfest350@rclogin03 myRdir]$ tail -n 3 myRjob.out 
> mean(x)
[1] 6.333333
> 

Killing jobs

Sometimes you realize after submitting a job that it's not what you wanted to do

You can kill jobs with the scancel command:
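
For example:

scancel 5056130              # kill one job by ID
scancel -u $USER             # kill all of your jobs
scancel -u $USER -t PENDING  # kill only your pending jobs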

Exercises

  1. Use squeue and sacct to look at the details of your jobs
  2. Introduce some typos in either the myRjob.sbatch file or the commands.R file and resubmit the job in order to see how the job fails and what the error messages look like (you'll probably want to rm the output files from the previous run before resubmitting)
  3. Resubmit your job and use scancel to abort it before it completes (you'll need to add a command like sleep 300 after the R command in myRjob.sbatch so the job runs long enough for you to catch it; a sketch follows this list)
  4. Use scp, rsync, or some other file transfer utility to download the output files from Odyssey
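
For exercise 3, the end of myRjob.sbatch might then look like this:

R --vanilla < commands.R > myRjob.out 2> myRjob.err
sleep 300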

Break

Parallel jobs

Writing true parallel jobs is an advanced topic, but you often don't have to know how they work in order to run them

But things can go wrong very fast and very badly, so please be careful!

You must know whether your job supports such parallelization and how it's implemented -- just adding options to sbatch will not make it run in parallel

Different approaches to parallel programming

You can also just run many serial jobs at the same time...

Job arrays

Job arrays let you run many individual jobs (almost always serial) at the same time

Appropriate for when your analysis runs over sequentially numbered inputs

To submit a job array...

Job arrays (2)

Example:

#!/usr/bin/env bash
#SBATCH -J myRjob
#SBATCH -o myRjob_slurm.%a.out
#SBATCH -e myRjob_slurm.%a.err
#SBATCH -p serial_requeue
#SBATCH --array=1-4
#SBATCH -n 1
#SBATCH -t 5
#SBATCH --mem=100

myprogram mydata_${SLURM_ARRAY_TASK_ID} \
  > myjob.${SLURM_ARRAY_TASK_ID}.out \
  2> myjob.${SLURM_ARRAY_TASK_ID}.err

Note that this submits to the serial_requeue partition -- the partition applies to the individual job, regardless of the fact that those jobs are running in "parallel"

http://slurm.schedmd.com/job_array.html

Parallel jobs: MPI

You'll have to load the appropriate MPI module (specific to the software you're using), such as:

module load centos6/openmpi-1.6.4_gcc-4.8.0

and tweak the job submission:

Parallel jobs: MPI (2)

Example: an 8-way parallel job split evenly across two nodes:

#!/usr/bin/env bash
#SBATCH -J myRjob
#SBATCH -o myRjob_slurm.out
#SBATCH -e myRjob_slurm.err
#SBATCH -p computefest
#SBATCH -n 8
#SBATCH --ntasks-per-node=4
#SBATCH -t 5
#SBATCH --mem=100

mpirun -np 8 MYPROGRAM > MYJOB.out 2> MYJOB.err

Again, your program must support MPI in order for this to work!

The specific values of -n and --ntasks-per-node will depend on your algorithms

Job submission best practices

Job submission best practices (2)

Again... a shared system

If your group owns hardware, many of these issues are alleviated

Getting help

Our FAQ at http://rc.fas.harvard.edu/faq/ is always growing

We're more than just sysadmins

We're here to help you with all computing aspects of your research

Just ask, we're happy to help!

Thanks

And don't forget to exit:

[cfest350@rclogin03 workshop]$ exit
exit

Tomorrow's workshop

Best Practices in Using FAS HPC System (Odyssey)

1:30 p.m.
Science Center Hall E