SLURM
The Elysium HPC system uses SLURM as resource manager, scheduler, and accounting system to guarantee a fair share of the computing resources.
If you are looking for technical details regarding the usage and underlying mechanisms of SLURM, we recommend participating in the Introduction to HPC training course.
Examples of job scripts for different job types that are tailored to the Elysium cluster can be found in the Training Section.
List of Partitions
All nodes in the Elysium cluster are grouped by hardware type and job submission type. This way users can request specific computing hardware, and multi-node jobs are guaranteed to run on nodes with the same setup.
To get a list of the available partitions, their current state, and the available nodes, the sinfo command can be used.
[login_id@login001 ~]$ sinfo
PARTITION      AVAIL TIMELIMIT   NODES STATE  NODELIST
cpu            up    7-00:00:00      4 alloc  cpu[033-034,037-038]
cpu            up    7-00:00:00    280 idle   cpu[001-032,035-036,039-284]
cpu_filler     up    3:00:00         4 alloc  cpu[033-034,037-038]
cpu_filler     up    3:00:00       280 idle   cpu[001-032,035-036,039-284]
fat_cpu        up    2-00:00:00     13 idle   fatcpu[001-013]
fat_cpu_filler up    3:00:00        13 idle   fatcpu[001-013]
gpu            up    2-00:00:00     20 idle   gpu[001-020]
gpu_filler     up    1:00:00        20 idle   gpu[001-020]
fat_gpu        up    2-00:00:00      1 drain* fatgpu005
fat_gpu        up    2-00:00:00      5 mix    fatgpu[001,003-004,006-007]
fat_gpu        up    2-00:00:00      1 idle   fatgpu002
fat_gpu_filler up    1:00:00         1 drain* fatgpu005
fat_gpu_filler up    1:00:00         5 mix    fatgpu[001,003-004,006-007]
fat_gpu_filler up    1:00:00         1 idle   fatgpu002
vis            up    1-00:00:00      3 idle   vis[001-003]
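To limit the output to a single partition, sinfo accepts the standard --partition filter; for example, for the gpu partition shown above:
[login_id@login001 ~]$ sinfo --partition=gpu
PARTITION AVAIL TIMELIMIT  NODES STATE NODELIST
gpu       up    2-00:00:00    20 idle  gpu[001-020]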
Requesting Nodes of a Partition
SLURM provides two commands to request resources. srun is used to start an interactive session.
[login_id@login001 ~]$ srun --partition=cpu --job-name=test --time=00:05:00 --account=testproj_0000 --pty bash
[login_id@cpu001 ~]$
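An interactive session on a GPU node is requested the same way by choosing a GPU partition. The following is a sketch that assumes the GPUs are exposed as a generic resource (GRES) named gpu; the exact GRES name on Elysium may differ:
[login_id@login001 ~]$ srun --partition=gpu --gres=gpu:1 --job-name=test --time=00:05:00 --account=testproj_0000 --pty bash
[login_id@gpu001 ~]$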
sbatch is used to request resources that will execute a job script.
[login_id@login001 ~]$ sbatch --partition=cpu --job-name=test --time=00:05:00 --account=testproj_0000 myscript.sh
Submitted batch job 10290
For sbatch the submission flags can also be incorporated into the job script itself, as sketched below. More information about job scripts, including the required and some optional flags, can be found in the Training/SLURM Header section.
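A minimal sketch of what myscript.sh could look like, using the flags from the example above as #SBATCH directives plus the standard --nodes and --ntasks resource flags:
#!/bin/bash
#SBATCH --partition=cpu           # partition from the sinfo list
#SBATCH --job-name=test           # name shown in squeue
#SBATCH --time=00:05:00           # walltime limit
#SBATCH --account=testproj_0000   # project account (see rub-acclist below)
#SBATCH --nodes=1                 # number of nodes
#SBATCH --ntasks=1                # number of tasks (MPI ranks)

# commands to run on the allocated resources
srun hostname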
Use spredict myscript.sh to estimate the start time of your job.
List of Currently Running and Pending Jobs
If the requested resources are currently not available, jobs are queued and will start as soon as the resources become available again.
To check which jobs are currently running, and which ones are pending and for what reason, the squeue command can be used.
For privacy reasons only the user's own jobs are displayed.
[login_id@login001 ~]$ squeue
 JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
 10290       cpu     test login_id  R  2:51     1 cpu001
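To list only the jobs that are still waiting, together with the scheduler's reason in the NODELIST(REASON) column, the standard state filter of squeue can be used:
[login_id@login001 ~]$ squeue --states=PENDING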
List of Computing Resource Shares
Users/Projects/Groups/Institutes are billed for the computing resources they use.
To check how many resources a user is entitled to and how many they have already used, the sshare command is used.
For privacy reasons only the user's own shares are displayed.
[login_id@login001 ~]$ sshare
Account                    User  RawShares  NormShares    RawUsage  EffectvUsage  FairShare
-------------------- ---------- ---------- ----------- ----------- ------------- ----------
testproj_0000          login_id       1000    0.166667    20450435      0.163985   0.681818
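If a user is a member of several projects, the output can be restricted to a single project with sshare's standard --accounts filter (shown here for the example project from above):
[login_id@login001 ~]$ sshare --accounts=testproj_0000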
List of Project Accounts
For technical reasons the projects on Elysium have rather cryptic names, based on the loginID of the project manager and a number.
To make it easier to select a project account for the --account flag of srun or sbatch, and to check the share and usage of projects, the RUB-exclusive rub-acclist command can be used.
[login_id@login001 ~]$ rub-acclist
Project ID    | Project Description
--------------+--------------------------------------------------
testproj_0000 | The fundamental interconnectedness of all things
testproj_0001 | The translated quaternion for optimal pivoting
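The value in the Project ID column is exactly what the --account flag of srun and sbatch expects, for example:
[login_id@login001 ~]$ sbatch --partition=cpu --job-name=test --time=00:05:00 --account=testproj_0001 myscript.sh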