Job Scripts

A SLURM job script usually consists of the following steps (a minimal skeleton combining them is sketched after the list):

  1. SLURM Header
  2. Create temporary folder on local disk
  3. Copy input data to temporary folder
  4. Load required modules
  5. Perform actual calculation
  6. Copy output file back to global file system
  7. Tidy up local storage
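
The sketch below combines these steps into a single bash script. It is only a skeleton: the partition, account, program name (my_program), and folder names (inputdata, outputs) are placeholders that need to be adapted to your own setup.

#!/bin/bash
#SBATCH --nodes=1                     # Request 1 node
#SBATCH --partition=cpu               # Run in the cpu partition
#SBATCH --job-name=skeleton           # Name of the job in squeue
#SBATCH --time=00-01:00:00            # Estimated runtime (dd-hh:mm:ss)
#SBATCH --account=<project_account>   # Project ID (check with rub-acclist)

# create a temporary folder on the local disk and switch to it
HDIR=$(pwd)
WDIR=/tmp/${SLURM_JOB_ID}
mkdir -p ${WDIR}
cd ${WDIR}

# copy the input data to the temporary folder
cp ${HDIR}/inputdata/* ${WDIR}

# load the required modules
module purge
module load the_modules_name_and_version

# perform the actual calculation (input is assumed to be among the copied files)
${HDIR}/my_program input > output

# copy the output back to the global file system
cp output ${HDIR}/outputs/

# tidy up the local storage
cd ${HDIR}
rm -rf ${WDIR}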

SLURM Header

The SLURM header is the section of the script directly after the shebang. Every line in it begins with #SBATCH.

#SBATCH --nodes=1                 # Request 1 node
#SBATCH --partition=gpu           # Run in the gpu partition
#SBATCH --job-name=minimal_gpu    # Name of the job in squeue
#SBATCH --gpus=1                  # Number of GPUs to reserve
#SBATCH --time=00:05:00           # Estimated runtime (dd-hh:mm:ss)
#SBATCH --account=lambem64_0000   # Project ID (check with rub-acclist)

Because each of these lines starts with #, the bash interpreter ignores it, while SLURM picks the lines out and parses their contents. Each line contains one of the sbatch flags. On Elysium the flags --partition, --time, and --account are required. For GPU jobs the --gpus flag additionally needs to be specified and set to at least 1.

Mandatory Flags

Flag                    | Example                 | Note
------------------------|-------------------------|---------------------------------------------------------
--partition=<partition> | --partition=cpu         | List of partitions with sinfo
--time=<dd-hh:mm:ss>    | --time=00-02:30:00      | Maximum time the job will run
--account=<account>     | --account=snublaew_0001 | Project the used computing time is billed to; list of project accounts with rub-acclist
--gpus=<n>              | --gpus=1                | Number of GPUs; must be at least 1 for GPU partitions

Optional Flags

Flag                   | Example                       | Note
-----------------------|-------------------------------|---------------------------------------------------------
--job-name=<name>      | --job-name="mysim"            | Job name that is shown in squeue for the job
--exclusive            | --exclusive                   | Nodes are not shared with other jobs (default on cpu, fat_cpu, gpu)
--output=<filename>    | --output=%x-%j.out            | File to contain stdout (%x = job name, %j = job ID)
--error=<filename>     | --error=%x-%j.err             | File to contain stderr (%x = job name, %j = job ID)
--mail-type=<TYPE>     | --mail-type=ALL               | Notify the user by email when certain event types occur; requires --mail-user to be set
--mail-user=<rub-mail> | --mail-user=max.muster@rub.de | Address to which job notifications of type --mail-type are sent
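
As a sketch, a few of these optional flags could be added to the header like this (the job name is a placeholder, and the mail address is the example from the table above):

#SBATCH --job-name=mysim                 # job name shown in squeue
#SBATCH --output=%x-%j.out               # stdout goes to mysim-<jobid>.out
#SBATCH --error=%x-%j.err                # stderr goes to mysim-<jobid>.err
#SBATCH --mail-type=ALL                  # email notifications for all job events
#SBATCH --mail-user=max.muster@rub.de    # address the notifications are sent to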

Temporary Folder

If your code reads input or writes output, performance can depend strongly on where the data is located. If the data resides in your home directory or on the Lustre file system, read/write performance is limited by the bandwidth of the interconnect. In addition, a parallel file system by design handles many small read/write operations poorly; its performance shines when reading or writing big chunks. It is therefore advisable to create a folder on the local disk under /tmp/ and perform all read/write operations there. At the beginning of the job the input data is copied there in one go, and at the end all output data is copied from the /tmp/ directory to its final location in one go.

# obtain the current location
HDIR=$(pwd)

# create a temporary working directory on the node's local disk
WDIR=/tmp/${SLURM_JOB_ID}
mkdir -p ${WDIR}
cd ${WDIR}

# copy the set of input files to the working directory
cp ${HDIR}/inputdata/* ${WDIR}

...

# copy the set of output files back to the original folder
cp outputdata ${HDIR}/outputs/

# tidy up the local files
cd ${HDIR}
rm -rf ${WDIR}

Loading Modules

If your program was built against certain versions of libraries, the same libraries may have to be provided at runtime. Since everybody's needs regarding library versions differ, Elysium utilizes environment modules to manage software versions.

# unload all previously loaded modules
module purge

# show all modules that are available
module avail

# load a specific module
module load the_modules_name_and_version

# list all loaded modules
module list

Perform Calculation

How to perform your calculation strongly depends on your specific software and inputs. In general there are a few typical ways to run HPC jobs, described below: farming, shared memory, distributed memory, hybrid (shared and distributed), and GPU offloading.

Farming

Farming jobs are used if the program is not parallelized, or only scales efficiently to a few CPU cores. In that case multiple instances of the same program are started, each with a different input. This works best if the instances have roughly the same runtime.
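
The loop below assumes that the variables ncores and myexe are set beforehand. One possible way to set them, assuming one instance per allocated core and a placeholder path to the serial executable:

# number of instances: one per core allocated on this node
ncores=${SLURM_CPUS_ON_NODE}
# path to the serial executable (placeholder)
myexe=${HOME}/bin/my_program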

for irun in $(seq 1 ${ncores})
do
    # pin each instance to its own core (cores are numbered 0 to ncores-1)
    taskset -c $((irun-1)) ${myexe} inp.${irun} > out.${irun} &
done
# wait for all background instances to finish
wait

Shared Memory

Programs that incorporate thread spawning (usually via OpenMP) can make use of multiple cores.

# spawn one OpenMP thread per allocated task slot on the node
export OMP_NUM_THREADS=${SLURM_TASKS_PER_NODE}
${myexe} input
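
The snippet above relies on SLURM_TASKS_PER_NODE matching the core count of the node. As a sketch only, assuming 48-core nodes (as used in the examples below), the corresponding header lines could look like this:

#SBATCH --nodes=1                 # shared-memory jobs are limited to a single node
#SBATCH --ntasks-per-node=48      # one task slot per core, so SLURM_TASKS_PER_NODE=48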

Distributed Memory

If a program requires more resources than one node can provide, information has to be passed between the processes running on different nodes. This is usually done via MPI, and a program must be specifically written to utilize it.

# total number of MPI processes = cores per node x number of nodes
ncorespernode=48
nnodes=${SLURM_JOB_NUM_NODES}
ncorestotal=$(bc <<< "${ncorespernode}*${nnodes}")
# start ncorestotal MPI processes, ncorespernode of them per node
mpirun -np ${ncorestotal} -ppn ${ncorespernode} ${myexe} input
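
The number of nodes is not hard-coded above; SLURM_JOB_NUM_NODES is set by SLURM according to the resource request. As a sketch, a two-node request could be added to the header like this (the node count is only an example):

#SBATCH --nodes=2                 # run on two nodes; sets SLURM_JOB_NUM_NODES=2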

Hybrid Memory (Shared and Distributed Memory)

In programs that use distributed-memory parallelization via MPI, it is additionally possible to spawn threads within each MPI process to make use of shared-memory parallelization.

# OpenMP threads spawned by each MPI process
nthreadsperproc=2
# MPI processes per node = 48 cores per node / threads per process
ncorespernode=$(bc <<< "48/${nthreadsperproc}")
nnodes=${SLURM_JOB_NUM_NODES}
ncorestotal=$(bc <<< "${ncorespernode}*${nnodes}")
export OMP_NUM_THREADS=${nthreadsperproc}
mpirun -np ${ncorestotal} -ppn ${ncorespernode} ${myexe} input

GPU

Support for offloading tasks to GPUs needs to be incorporated into the program.

# make GPUs 0-7 visible to the program
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
${myexe} input
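
To verify inside a job which GPUs are actually visible, standard tools can be used; a small sketch, assuming NVIDIA GPUs and that nvidia-smi is in the PATH:

# print the device list the job can see
echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}"
nvidia-smi -L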

Examples

The following example scripts are ready to use on the Elysium cluster. The only change you need to make is to specify a valid account for the --account flag; the rub-acclist command lists your available project accounts. The executed programs do not produce any load and finish in a few seconds. The generated output shows where each process or thread ran and whether it had access to a GPU.
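
As a short usage sketch (the script file name is a placeholder), an example script is submitted with sbatch and its status can be checked with squeue:

# submit the job script to SLURM
sbatch minimal_cpu.sh
# list your own queued and running jobs
squeue -u ${USER}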

Minimal CPU Job Script Example

Farming Job Script Example

Shared Memory Job Script Example

Distributed Memory Job Script Example

Hybrid Memory Job Script Example

Minimal GPU Job Script Example

GPU Job Script Example

Distributed Memory with GPU Job Script Example