Basics
Elysium provides four login nodes: login1.elysium.hpc.rub.de, …, login4.elysium.hpc.rub.de.
These are your entry points to the cluster.
After login, you typically use them to prepare your software
and copy your data to the appropriate locations.
You can then allocate resources on the
cluster using the Slurm workload manager.
After submitting your request, Slurm will grant you the resources as soon as
they are free and your priority is higher than the priority of other jobs that
might be waiting for some of the same resources.
Your priority depends on your waiting time and your remaining FairShare.
Login
Login to Elysium combines the common SSH key-based authentication with web-based two-factor authentication.
You need to enable two-factor authentication for your RUB LoginID at rub.de/login.
The additional web-based authentication is cached for 14 hours so that you
typically only have to do it once per work day, per login node, and per IP address you connect from.
After that, your normal key-based SSH workflow will work as expected.
Follow these steps:
1. Start ssh with the correct private key, your RUB LoginID, and one of the four login hosts, e.g.
   ssh -i ~/.ssh/elysium LOGINID@login1.elysium.hpc.ruhr-uni-bochum.de
   Available login nodes are login1 to login4.
2. Open the URL in a browser (or scan the QR code with your smartphone) to start web-based two-factor authentication.
3. Enter the second factor for two-factor authentication.
4. After successful login, you get a four-digit verification code.
5. Enter this code at your ssh prompt to finish login.
For the next 14 hours, only step 1 (classic key-based authentication) will
be necessary on the chosen login node for the IP address you connected from.
Login will fail if:
- You use the wrong private key (“Permission denied (publickey)”)
- You are not a member of an active HPC project (“Permission denied (publickey)”)
- You did not enable two-factor authentication for your LoginID (“Two-factor authentication is required”)
- Web-based login fails
- You enter the wrong verification code (“Verification failed”)
- A timeout happens between starting the SSH session and finalizing web-based login (“session_id not found”);
just start the process again to get a new session ID.
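If you do not want to type the full ssh command every time, you can put the connection details into your ~/.ssh/config. A minimal sketch, using the key path from step 1; the host alias elysium1 is arbitrary, and LOGINID stands for your RUB LoginID:

# alias "elysium1" is an arbitrary name; replace LOGINID with your RUB LoginID
Host elysium1
    HostName login1.elysium.hpc.ruhr-uni-bochum.de
    User LOGINID
    IdentityFile ~/.ssh/elysium

With such an entry, step 1 reduces to ssh elysium1.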
Software
Modules
We use the Lmod module system:
- module available (shortcut ml av) lists available modules
- module load loads selected modules
- module list shows modules currently loaded in your environment

There are also hidden modules that are generally less relevant to users but can be viewed with ml --show_hidden av.
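A typical session might look like the following sketch; the module name gcc is only an example, use whatever ml av shows on Elysium:

[login_id@login001 ~]$ ml av              # same as: module available
[login_id@login001 ~]$ module load gcc    # example module name; pick one from the listing
[login_id@login001 ~]$ module list        # verify which modules are now loaded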
We are committed to providing the tools you need for your research and
development efforts. If you require modules that are not listed here or need
different versions, please contact our support team, and we will be happy to
assist you.
Compilers
- GCC 11.4.1 (system default)
- GCC 13.2.0
- AOCC 4.2.0 (AMD Optimizing C/C++ Compiler)
- Intel oneAPI Compilers 2024.1.0
- Intel Classic Compilers 2021.10.0
- NVHPC 24.7 (NVIDIA HPC Compilers)
MPI Libraries
- OpenMPI 4.1.6
- OpenMPI 5.0.3 (default)
- MPICH 4.2.1
- Intel oneAPI MPI 2021.12.1
Mathematical Libraries
- AMD Math Libraries:
  - AMD BLIS, a BLAS-like library
  - AMD FFTW, a fast Fourier transform library
  - AMD libFLAME, a library for dense matrix computations (LAPACK)
  - AMD ScaLAPACK
- HDF5 1.14.3 (built with MPI)
- Boost 1.85.0
Programming Languages
- Julia 1.10.2, a high-level, high-performance dynamic programming language
- R 4.4.0, for statistical computing
- Python 3.11.7
- CUDA Toolkit 12.6.1
- GDB (GNU Debugger)
- Apptainer
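As a rough sketch of how the compilers and MPI libraries listed above are combined: load them as modules, build your code, and run it through Slurm. The module names and the source file hello_mpi.c are placeholders; check ml av for the exact names on Elysium.

[login_id@login001 ~]$ module load gcc openmpi    # placeholder module names, see ml av
[login_id@login001 ~]$ mpicc hello_mpi.c -o hello_mpi
[login_id@login001 ~]$ srun --partition=cpu --ntasks=4 --time=00:05:00 --account=testproj_0000 ./hello_mpi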
Spack
We use the Spack package manager to build a set of common HPC software packages.
This section describes how to use and extend the central installation.
Alternatively, you can use a full independent Spack installation in your home directory, or use EasyBuild.
Using the Central Installation
Activate the central Spack installation with source /cluster/spack/0.22.2/share/spack/setup-env.sh.
You can use this as a
starting point for your own software builds, without the need to rebuild
everything from scratch.
For this purpose, add the following three files to ~/.spack:
~/.spack/upstreams.yaml
upstreams:
  central-spack:
    install_tree: /cluster/spack/0.22.2/opt/spack
~/.spack/config.yaml
config:
  install_tree:
    root: $HOME/spack/opt/spack
  source_cache: $HOME/spack/cache
~/.spack/modules.yaml
modules:
  default:
    roots:
      lmod: $HOME/spack/share/spack/lmod
    enable: [lmod]
    lmod:
      all:
        autoload: direct
      hide_implicits: true
      hierarchy: []
Also, put these two lines into your ~/.bashrc:
export MODULEPATH=$MODULEPATH:$HOME/spack/share/spack/lmod/linux-almalinux9-x86_64/Core
. /cluster/spack/0.22.2/share/spack/setup-env.sh
You can then use the central Spack installation, with local additions added in ~/spack.
Run spack compiler find to add the system compiler to your compiler list. Also run it after loading other compilers via module load to add those, too.
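For example (the gcc module name is an assumption; use the name ml av reports):

[login_id@login001 ~]$ spack compiler find    # registers the system GCC 11.4.1
[login_id@login001 ~]$ module load gcc        # assumed module name for a newer GCC
[login_id@login001 ~]$ spack compiler find    # registers the just-loaded compiler as well
[login_id@login001 ~]$ spack compilers        # lists all compilers Spack now knows about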
Overriding Package Definitions
If you need to override selected package definitions, create an additional file ~/.spack/repos.yaml:
repos:
- $HOME/spack/var/spack/repos
and create a description for your local repo in ~/spack/var/spack/repos/repo.yaml:
repo:
  namespace: overrides
You can then copy the package definition you need to override, and edit it locally. Example for ffmpeg:
$ cd ~/spack/var/spack/repos/
$ mkdir -p packages/ffmpeg
$ cp /cluster/spack/0.22.2/var/spack/repos/builtin/packages/ffmpeg/package.py packages/ffmpeg
$ vim packages/ffmpeg/package.py
... edit as necessary (e.g. disable the patch for version 6.1.1) ...
When running spack install ffmpeg, your local override will take precedence over the central version.
SLURM
The Elysium HPC system utilizes SLURM as a resource manager, scheduler, and accountant in order to guarantee a fair share of the computing resources.
If you are looking for technical details regarding the usage and underlying mechanisms of SLURM, we recommend participating in the Introduction to HPC training course.
Examples of job scripts for different job types that are
tailored to the Elysium cluster can be found in the
Training Section.
List of Partitions
All nodes in the Elysium cluster are grouped by their hardware type and job submission type.
This way users can request specific computing hardware, and multi-node jobs are guaranteed to run on nodes with the same setup.
To get a list of the available partitions, their current state, and their available nodes, use the sinfo command.
[login_id@login001 ~]$ sinfo
PARTITION       AVAIL  TIMELIMIT   NODES  STATE   NODELIST
cpu             up     7-00:00:00      4  alloc   cpu[033-034,037-038]
cpu             up     7-00:00:00    280  idle    cpu[001-032,035-036,039-284]
cpu_filler      up     3:00:00         4  alloc   cpu[033-034,037-038]
cpu_filler      up     3:00:00       280  idle    cpu[001-032,035-036,039-284]
fat_cpu         up     2-00:00:00     13  idle    fatcpu[001-013]
fat_cpu_filler  up     3:00:00        13  idle    fatcpu[001-013]
gpu             up     2-00:00:00     20  idle    gpu[001-020]
gpu_filler      up     1:00:00        20  idle    gpu[001-020]
fat_gpu         up     2-00:00:00      1  drain*  fatgpu005
fat_gpu         up     2-00:00:00      5  mix     fatgpu[001,003-004,006-007]
fat_gpu         up     2-00:00:00      1  idle    fatgpu002
fat_gpu_filler  up     1:00:00         1  drain*  fatgpu005
fat_gpu_filler  up     1:00:00         5  mix     fatgpu[001,003-004,006-007]
fat_gpu_filler  up     1:00:00         1  idle    fatgpu002
vis             up     1-00:00:00      3  idle    vis[001-003]
Requesting Nodes of a Partition
SLURM provides two commands to request resources.
srun is used to start an interactive session.
[login_id@login001 ~]$ srun --partition=cpu --job-name=test --time=00:05:00 --account=testproj_0000 --pty bash
[login_id@cpu001 ~]$
sbatch is used to request resources that will execute a job script.
[login_id@login001 ~]$ sbatch --partition=cpu --job-name=test --time=00:05:00 --account=testproj_0000 myscript.sh
Submitted batch job 10290
For sbatch, the submission flags can also be incorporated into the job script itself.
More information about job scripts, the required flags, and some optional flags can be found in the Training/SLURM Header section.
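A minimal job script might look like the following sketch; the values mirror the flags from the examples above, and my_program is a placeholder for your own application:

#!/bin/bash
#SBATCH --partition=cpu
#SBATCH --job-name=test
#SBATCH --time=00:05:00
#SBATCH --account=testproj_0000

# load required modules here, then start the application
srun ./my_program

Submit it with sbatch myscript.sh; the #SBATCH lines are read by sbatch as if they had been given on the command line.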
Use spredict myscript.sh
to estimate the start time of your job.
List of Currently Running and Pending Jobs
If requested resources are currently not available, jobs are queued
and will start as soon as the resources are available again.
To check which jobs are currently running, and which ones are pending and for what reason, use the squeue command.
For privacy reasons only the user’s own jobs are displayed.
[login_id@login001 ~]$ squeue
  JOBID PARTITION     NAME     USER ST   TIME  NODES NODELIST(REASON)
  10290       cpu     test login_id  R   2:51      1 cpu001
List of Computing Resource Shares
Users/Projects/Groups/Institutes are billed for the computing resources they use.
To check how many resources a user is entitled to and how many they have already used, use the sshare command.
For privacy reasons only the user’s own shares are displayed.
[login_id@login001 ~]$ sshare
Account              User       RawShares  NormShares     RawUsage  EffectvUsage   FairShare
-------------------- ---------- ---------- ----------- ----------- ------------- ----------
testproj_0000        login_id         1000    0.166667    20450435      0.163985    0.681818
List of Project Accounts
For technical reasons, the projects on Elysium have rather cryptic names, based on the LoginID of the project manager and a number.
To make it easier to select a project account for the --account flag of srun or sbatch, and to check the share and usage of projects, the RUB-exclusive rub-acclist command can be used.
[login_id@login001 ~]$ rub-acclist
Project ID    | Project Description
--------------+--------------------------------------------------
testproj_0000 | The fundamental interconnectedness of all things
testproj_0001 | The translated quaternion for optimal pivoting
Visualization
We provide visualization via VirtualGL on the visualization nodes of Elysium (elysium.hpc.ruhr-uni-bochum.de).
Requirements:
- X11 server with 24-bit or 32-bit visuals
- VirtualGL version > 3.0.2 installed

You can check support for your operating system at https://virtualgl.org/Documentation/OSSupport and download VirtualGL at https://github.com/VirtualGL/virtualgl/releases.
To use VirtualGL on Elysium, you only need the VirtualGL client; it is not necessary to configure a VirtualGL server.
Resource allocation:
Allocate resources in the vis partition:
salloc -p vis -N1 --time=02:00:00 --account=$ACCOUNT
This will allocate one entire vis node for 2 hours.
Wait until a slot in the vis partition is available.
You can check whether your resources have been allocated using the squeue command.
Establish the VirtualGL connection:
Connect directly from your computer to the visualization node via ssh with vglconnect -s.
Use one of the login servers login[001-004] as a jump host:
vglconnect -s $HPCUSER@vis001.elysium.hpc.rub.de -J $HPCUSER@login001.elysium.hpc.rub.de
If you don’t like long commands, you can configure one of the login nodes as a jump host in your ~/.ssh/config for the vis[001-003] hosts, as sketched below.
The command vglconnect -s accepts nearly the same syntax as ssh.
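A sketch of such a ~/.ssh/config entry; replace LOGINID with your RUB LoginID and adjust the key path to your setup:

# use login001 as a jump host for all three visualization nodes
Host vis001.elysium.hpc.rub.de vis002.elysium.hpc.rub.de vis003.elysium.hpc.rub.de
    User LOGINID
    IdentityFile ~/.ssh/elysium
    ProxyJump LOGINID@login001.elysium.hpc.rub.de

Since vglconnect calls ssh under the hood, vglconnect -s vis001.elysium.hpc.rub.de should then pick up these settings without the -J option.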
Run your Software:
Load a module if required.
Start your application using vglrun; remember to use useful command line options like -fps.
module load vmd
vglrun +pr -fps 60 vmd
Please remember to cancel the resource allocation once you are done with your interactive session.