Documentation

Elysium is the central HPC Cluster at the Ruhr-Universität Bochum. See the overview of its resources.

To use Elysium, you need to be a member of an active HPC project, submit your public SSH key via the User Access Application Form, and enable two-factor authentication for your RUB LoginID.

Please read about the basic concept of using Elysium first.

The login process combines SSH key-based authentication with web-based two-factor authentication.

After login, you can use available software modules or build your own software.

Read about submitting jobs and allocating resources in the SLURM section.


Basics

Elysium provides four login nodes: login001.elysium.hpc.rub.de, …, login004.elysium.hpc.rub.de. These are your entry points to the cluster.

After login, you typically use them to prepare your software and copy your data to the appropriate locations.

You can then allocate resources on the cluster using the Slurm workload manager.

After submitting your request, Slurm will grant you the resources as soon as they are free and your priority is higher than the priority of other jobs that might be waiting for some of the same resources.

Your priority depends on your waiting time and your remaining FairShare.

Login

Login to Elysium combines the common SSH key-based authentication with web-based two-factor authentication. In order to be able to authenticate during login, you need to submit your public SSH key via the User Access Application Form as well as enable two-factor authentication for your RUB LoginID at rub.de/login.

The additional web-based authentication is cached for 14 hours so that you typically only have to do it once per work day, per login node, and per IP address you connect from. After that, your normal key-based SSH workflow will work as expected.

In order to simplify the use of SSH keys, we recommend specifying the key as an identity file in your SSH config. This can be done by adding the following lines to your ~/.ssh/config file:

Host login00*.elysium.hpc.rub.de login00*.elysium.hpc.ruhr-uni-bochum.de
    IdentityFile ~/.ssh/elysium
    User <loginID>

where <loginID> has to be replaced with your RUB LoginID. If your SSH key is located in a different file, the IdentityFile path needs to be adjusted accordingly.

Follow these steps:

  1. Start ssh with the correct private key, your RUB LoginID, and one of the four login hosts, e.g.
    ssh -i ~/.ssh/elysium LOGINID@login001.elysium.hpc.ruhr-uni-bochum.de, or ssh login001.elysium.hpc.rub.de if you want to use the SSH config specified above. Available login nodes are login001 to login004. Login step 1: start ssh

  2. Open the URL in a browser (or scan the QR code with your smartphone) to start web-based two-factor authentication. Login step 2 part 1: start web-based authentication

  3. Enter the second factor for two-factor authentication. Login step 2 part 2: web-based LoginID / password authentication

  4. After successful login, you get a four-digit verification code. Login step 2 part 3: get the verification code

  5. Enter this code at your ssh prompt to finish login. Login step 3: verify the SSH session

For the next 14 hours, only step 1 (classic key-based authentication) will be necessary on the chosen login node for the IP address you connected from.

Login will fail if:

  • You use the wrong private key (“Permission denied (publickey)”)
  • You are not a member of an active HPC project (“Permission denied (publickey)”)
  • You did not enable two-factor authentication for your LoginID (“Two-factor authentication is required”)
  • Web-based login fails
  • You enter the wrong verification code (“Verification failed”)
  • A timeout happens between starting the SSH session and finalizing web-based login (“session_id not found”); just start the process again to get a new session ID.

Software

We provide a basic set of toolchains and some common libraries via modules.

To build common HPC software packages, we provide a central installation of the Spack package manager. A detailed guide on how to use this installation can be found in the Spack Usage Guide.


Modules

We use the Lmod module system:

  • module avail (shortcut ml av) lists available modules
  • module load loads selected modules
  • module list shows modules currently loaded in your environment

There are also hidden modules that are generally less relevant to users but can be viewed with ml --show_hidden av.
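
A typical session could look like this (the module names are illustrative; check the output of ml av for what is currently available on Elysium):

ml av                      # list available modules
module load gcc openmpi    # load a compiler toolchain and an MPI library (names are examples)
module list                # show the modules now loaded in your environment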

We are committed to providing the tools you need for your research and development efforts. If you require modules that are not listed here or need different versions, please contact our support team, and we will be happy to assist you.

Compilers

  • GCC 11.4.1, default on the system.
  • GCC 13.2.0.
  • AOCC 4.2.0, AMD Optimizing C/C++ Compiler
  • Intel OneAPI Compilers: 2024.1.0
  • Intel Classic Compilers: 2021.10.0
  • NVHPC 24.7, NVIDIA HPC Compilers

MPI Libraries

  • OpenMPI 4.1.6
  • OpenMPI 5.0.3 (default).
  • MPICH 4.2.1
  • Intel OneAPI MPI 2021.12.1.

Mathematical Libraries

  • AMD Math Libraries:
      • AMD BLIS, a BLAS-like library.
      • AMD FFTW, a fast Fourier transform library.
      • AMD libFLAME, a library for dense matrix computations (LAPACK).
      • AMD ScaLAPACK.
  • HDF5: Version 1.14.3 (built with MPI)
  • Boost 1.85.0

Programming Languages

  • Julia 1.10.2, a high-level, high-performance dynamic programming language.
  • R 4.4.0, for statistical computing.
  • Python 3.11.7

Tools and Utilities

  • CUDA Toolkit 12.6.1
  • GDB (GNU Debugger)
  • Apptainer

Spack

We use the Spack package manager to provide a collection of common HPC software packages. This page explains how to use the central Spack installation to build your own modulefiles.

Table of Contents

  1. Quick Setup
  2. Guide to Using Spack
  3. Central Spack Installation
  4. Overriding Package Definitions

Quick Setup with rub-deploy-spack-configs

You can directly copy the configuration files described in Central Spack Installation (upstreams.yaml, config.yaml, modules.yaml, compilers.yaml) to your home directory using the rub-deploy-spack-configs command:

rub-deploy-spack-configs

Add these lines to your ~/.bashrc to activate spack with every login:

export MODULEPATH=$MODULEPATH:$HOME/spack/share/spack/lmod/linux-almalinux9-x86_64/Core
. /cluster/spack/0.23.0/share/spack/setup-env.sh
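
After opening a new shell (or sourcing your ~/.bashrc), a quick sanity check that the central installation is active could look like this:

spack --version    # should report the version of the central installation (0.23.0)
spack find         # lists installed packages, including those provided by the upstream installation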

Guide to Using Spack

Below is a detailed guide on how to effectively use Spack.

  1. Searching for Packages
  2. Viewing Package Variants
  3. Enabling/Disabling Variants
  4. Specifying Compilers
  5. Specifying Dependencies
  6. Putting It All Together
  7. Building and Adding a New Compiler
  8. Comparing Installed Package Variants
  9. Removing Packages

Searching for Packages

To find available packages, use:

spack list <keyword>  # Search for packages by name
# Example:
spack list openfoam
openfoam  openfoam-org
==> 2 packages

For detailed information about a package:

spack info <package>  # Show versions, variants, and dependencies
# Example:
spack info hdf5

For a quick search for all available packages in spack, visit https://packages.spack.io/.


Viewing Package Variants

Variants are build options that enable or disable features. List them with spack info <package>:

spack info hdf5

Output includes:

Preferred version:  
    1.14.3           https://support.hdfgroup.org/ftp/HDF5/releases/hdf5-1.14/hdf5-1.14.3/src/hdf5-1.14.3.tar.gz

Safe versions:  
    1.14.3           https://support.hdfgroup.org/ftp/HDF5/releases/hdf5-1.14/hdf5-1.14.3/src/hdf5-1.14.3.tar.gz
    1.14.2           https://support.hdfgroup.org/ftp/HDF5/releases/hdf5-1.14/hdf5-1.14.2/src/hdf5-1.14.2.tar.gz
    1.14.1-2         https://support.hdfgroup.org/ftp/HDF5/releases/hdf5-1.14/hdf5-1.14.1-2/src/hdf5-1.14.1-2.tar.gz
    1.14.0           https://support.hdfgroup.org/ftp/HDF5/releases/hdf5-1.14/hdf5-1.14.0/src/hdf5-1.14.0.tar.gz
    1.12.3           https://support.hdfgroup.org/ftp/HDF5/releases/hdf5-1.12/hdf5-1.12.3/src/hdf5-1.12.3.tar.gz
    1.12.2           https://support.hdfgroup.org/ftp/HDF5/releases/hdf5-1.12/hdf5-1.12.2/src/hdf5-1.12.2.tar.gz
    1.12.1           https://support.hdfgroup.org/ftp/HDF5/releases/hdf5-1.12/hdf5-1.12.1/src/hdf5-1.12.1.tar.gz
    1.12.0           https://support.hdfgroup.org/ftp/HDF5/releases/hdf5-1.12/hdf5-1.12.0/src/hdf5-1.12.0.tar.gz
    1.10.11          https://support.hdfgroup.org/ftp/HDF5/releases/hdf5-1.10/hdf5-1.10.11/src/hdf5-1.10.11.tar.gz
    1.10.10          https://support.hdfgroup.org/ftp/HDF5/releases/hdf5-1.10/hdf5-1.10.10/src/hdf5-1.10.10.tar.gz

Variants:
    api [default]               default, v110, v112, v114, v116, v16, v18
        Choose api compatibility for earlier version
    cxx [false]                 false, true
        Enable C++ support
    fortran [false]             false, true
        Enable Fortran support
    hl [false]                  false, true
        Enable the high-level library
    mpi [true]                  false, true
        Enable MPI support

Defaults are shown in square brackets, possible values to the right.


Checking the Installation

To see which dependencies will be installed, use:

spack spec hdf5

Output includes:

Input spec
--------------------------------
 -   hdf5

Concretized
--------------------------------
[+]  hdf5@1.14.3%gcc@11.4.1~cxx~fortran+hl~ipo~java~map+mpi+shared~subfiling~szip~threadsafe+tools api=default build_system=cmake build_type=Release generator=make patches=82088c8 arch=linux-almalinux9-zen4
[+]      ^cmake@3.27.9%gcc@11.4.1~doc+ncurses+ownlibs build_system=generic build_type=Release arch=linux-almalinux9-zen4
[+]          ^curl@8.7.1%gcc@11.4.1~gssapi~ldap~libidn2~librtmp~libssh+libssh2+nghttp2 build_system=autotools libs=shared,static tls=mbedtls,openssl arch=linux-almalinux9-zen4
[+]              ^libssh2@1.11.0%gcc@11.4.1+shared build_system=autotools crypto=mbedtls patches=011d926 arch=linux-almalinux9-zen4
[+]                  ^xz@5.4.6%gcc@11.4.1~pic build_system=autotools libs=shared,static arch=linux-almalinux9-zen4
[+]              ^mbedtls@2.28.2%gcc@11.4.1+pic build_system=makefile build_type=Release libs=shared,static arch=linux-almalinux9-zen4
[+]              ^nghttp2@1.52.0%gcc@11.4.1 build_system=autotools arch=linux-almalinux9-zen4
[+]                  ^diffutils@3.10%gcc@11.4.1 build_system=autotools arch=linux-almalinux9-zen4
[+]              ^openssl@3.3.0%gcc@11.4.1~docs+shared build_system=generic certs=mozilla arch=linux-almalinux9-zen4
[+]                  ^ca-certificates-mozilla@2023-05-30%gcc@11.4.1 build_system=generic arch=linux-almalinux9-zen4
[+]          ^ncurses@6.5%gcc@11.4.1~symlinks+termlib abi=none build_system=autotools patches=7a351bc arch=linux-almalinux9-zen4
[+]      ^gcc-runtime@11.4.1%gcc@11.4.1 build_system=generic arch=linux-almalinux9-zen4
[e]      ^glibc@2.34%gcc@11.4.1 build_system=autotools arch=linux-almalinux9-zen4
[+]      ^gmake@4.4.1%gcc@11.4.1~guile build_system=generic arch=linux-almalinux9-zen4
[+]      ^openmpi@5.0.3%gcc@11.4.1~atomics~cuda~gpfs~internal-hwloc~internal-libevent~internal-pmix~java+legacylaunchers~lustre~memchecker~openshmem~orterunprefix~romio+rsh~static+vt+wrapper-rpath build_system=autotools fabrics=ofi romio-filesystem=none schedulers=slurm arch=linux-almalinux9-zen4

It’s always a good idea to check the specs before installing.


Enabling/Disabling Variants

Control variants with + (enable) or ~ (disable):

spack install hdf5 +mpi +cxx ~hl  # Enable MPI and C++, disable high-level API

For packages with CUDA, use compute capabilities 8.0 (for GPU nodes) and 9.0 (for FatGPU nodes):

spack install openmpi +cuda cuda_arch=80,90

Specifying Compilers

Use % to specify a compiler. Check available compilers with:

spack compilers
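
The output looks roughly like this (illustrative; the exact list depends on which compilers have been added):

==> Available compilers
-- gcc almalinux9-x86_64 ----------------------------------------
gcc@11.4.1  gcc@13.2.0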

Example:

spack install hdf5 %gcc@11.4.1

When using compilers other than GCC 11.4.1, dependencies must also be built with that compiler:

spack install --fresh hdf5 %gcc@13.2.0

Specifying Dependencies

Use ^ to specify dependencies with versions or variants:

spack install hdf5 ^openmpi@4.1.5

Dependencies can also have variants:

spack install hdf5 +mpi ^openmpi@4.1.5 +threads_multiple

Make sure to place the variants for the package and its dependencies in the right position, or the installation will fail.


Putting It All Together

Combine options for customized installations:

spack install hdf5@1.14.3 +mpi ~hl ^openmpi@4.1.5 +cuda cuda_arch=80,90 %gcc@11.4.1
  • Compiles hdf5 1.14.3 with GCC 11.4.1.
  • Enables MPI support and disables the high-level API.
  • Uses OpenMPI 4.1.5 with CUDA support as a dependency.

Building and Adding a New Compiler

Install a new compiler (e.g., GCC 13.2.0) with:

spack install gcc@13.2.0

Add it to Spack’s compiler list:

spack compiler add $(spack location -i gcc@13.2.0)

Verify it’s recognized:

spack compilers

Use it to build packages:

spack install --fresh hdf5 %gcc@13.2.0

Comparing Installed Package Variants

If you have multiple installations of the same package with different variants, you can inspect their configurations using Spack’s spec command or the find tool.


List installed packages with variants

Use spack find -vl to show all installed variants and their hashes:

spack find -vl hdf5

-- linux-almalinux9-zen4 / gcc@11.4.1 ---------------------------
amrsck6 hdf5@1.14.3~cxx~fortran+hl~ipo~java~map+mpi+shared~subfiling~szip~threadsafe+tools api=default build_system=cmake build_type=Release generator=make patches=82088c8

2dsgtoe hdf5@1.14.3+cxx+fortran~hl~ipo~java~map+mpi+shared~subfiling~szip~threadsafe+tools api=default build_system=cmake build_type=Release generator=make patches=82088c8

==> 2 installed packages
  • amrsck6 and 2dsgtoe are the unique hashes for each installation.
  • You can see one package uses +hl while the other does not.

Inspect specific installations

Use spack spec /<hash> to view details of a specific installation:

spack spec /amrsck6  
spack spec /2dsgtoe 

Compare two installations

To compare variants between two installations, use spack diff with both hashes:

spack diff /amrsck6 /2dsgtoe

You will see a diff in the style of git:

--- hdf5@1.14.3/amrsck6mml43sfv4bhvvniwdydaxfgne
+++ hdf5@1.14.3/2dsgtoevoypx7dr45l5ke2dlb56agvz4
@@ virtual_on_incoming_edges @@
-  openmpi mpi
+  mpich mpi

So one version depends on OpenMPI while the other depends on MPICH.


Removing Packages

For multiple variants of a package, specify the hash:

spack uninstall /amrsck6
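
If other packages in your own installation depend on the one you want to remove, Spack refuses to uninstall it by default; the --dependents flag removes those packages as well (use with care):

spack uninstall --dependents /amrsck6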

Central Spack Installation

Activate the central Spack installation with:

source /cluster/spack/0.23.0/share/spack/setup-env.sh

Use it as a starting point for your own builds without rebuilding everything from scratch.

Add these files to ~/.spack:

  • ~/.spack/upstreams.yaml:
    upstreams:
      central-spack:
        install_tree: /cluster/spack/opt
  • ~/.spack/config.yaml:
    config:
      install_tree:
        root: $HOME/spack/opt/spack
      source_cache: $HOME/spack/cache
      license_dir: $HOME/spack/etc/spack/licenses
  • ~/.spack/modules.yaml:
    modules:
      default:
        roots:
          lmod: $HOME/spack/share/spack/lmod
        enable: [lmod]
        lmod:
          all:
            autoload: direct
          hide_implicits: true
          hierarchy: []

Add these lines to your ~/.bashrc:

export MODULEPATH=$MODULEPATH:$HOME/spack/share/spack/lmod/linux-almalinux9-x86_64/Core
. /cluster/spack/0.23.0/share/spack/setup-env.sh

Then run:

spack compiler find

Overriding Package Definitions

Create ~/.spack/repos.yaml:

repos:
  - $HOME/spack/var/spack/repos

And a local repo description in ~/spack/var/spack/repos/repo.yaml:

repo:
  namespace: overrides

Copy and edit a package definition, e.g., for ffmpeg:

cd ~/spack/var/spack/repos/
mkdir -p packages/ffmpeg
cp /cluster/spack/0.23.0/var/spack/repos/builtin/packages/ffmpeg/package.py packages/ffmpeg
vim packages/ffmpeg/package.py
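
To check that the local repository is registered and that your modified recipe will be used for the next build, the following commands can help (the overrides namespace comes from repo.yaml above):

spack repo list     # the local 'overrides' repo should be listed alongside 'builtin'
spack spec ffmpeg   # concretize to confirm the modified package definition resolves as expected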

Alternatively, you can use a fully independent Spack installation in your home directory or opt for EasyBuild.

SLURM

The Elysium HPC system utilizes SLURM as a resource manager, scheduler, and accountant in order to guarantee a fair share of the computing resources.

If you are looking for technical details regarding the usage and underlying mechanisms of SLURM, we recommend participating in the Introduction to HPC training course.

Examples of job scripts for different job types that are tailored to the Elysium cluster can be found in the Training Section.

List of Partitions

All nodes in the Elysium cluster are grouped by hardware type and job submission type. This way, users can request specific computing hardware, and multi-node jobs are guaranteed to run on nodes with the same setup.

In order to get a list of the available partitions, their current state, and available nodes, the sinfo command can be used.

[login_id@login001 ~]$ sinfo
PARTITION      AVAIL  TIMELIMIT  NODES  STATE NODELIST
cpu               up 7-00:00:00      4  alloc cpu[033-034,037-038]
cpu               up 7-00:00:00    280   idle cpu[001-032,035-036,039-284]
cpu_filler        up    3:00:00      4  alloc cpu[033-034,037-038]
cpu_filler        up    3:00:00    280   idle cpu[001-032,035-036,039-284]
fat_cpu           up 2-00:00:00     13   idle fatcpu[001-013]
fat_cpu_filler    up    3:00:00     13   idle fatcpu[001-013]
gpu               up 2-00:00:00     20   idle gpu[001-020]
gpu_filler        up    1:00:00     20   idle gpu[001-020]
fat_gpu           up 2-00:00:00      1 drain* fatgpu005
fat_gpu           up 2-00:00:00      5    mix fatgpu[001,003-004,006-007]
fat_gpu           up 2-00:00:00      1   idle fatgpu002
fat_gpu_filler    up    1:00:00      1 drain* fatgpu005
fat_gpu_filler    up    1:00:00      5    mix fatgpu[001,003-004,006-007]
fat_gpu_filler    up    1:00:00      1   idle fatgpu002
vis               up 1-00:00:00      3   idle vis[001-003]

Requesting Nodes of a Partition

SLURM provides two commands to request resources. srun is used to start an interactive session.

[login_id@login001 ~]$ srun -N 1 --partition=cpu --job-name=test --time=00:05:00 --account=testproj_0000 --pty bash
[login_id@cpu001 ~]$

sbatch is used to request resources that will execute a job script.

[login_id@login001 ~]$ sbatch -N 1 --partition=cpu --job-name=test --time=00:05:00 --account=testproj_0000 myscript.sh
Submitted batch job 10290

For sbatch the submission flags can also be incorporated into the job script itself. More information about job scripts, and the required and some optional flags can be found in the Training/SLURM Header section.

On Elysium several flags are mandatory. sbatch and srun will refuse to queue the job and give a detailed explanation of which flag is missing and how to incorporate it into your command or script.
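
As a minimal sketch, the flags from the sbatch example above can be embedded in the script header as #SBATCH directives (account, module names, and program name are placeholders):

#!/bin/bash
#SBATCH --partition=cpu
#SBATCH --nodes=1
#SBATCH --job-name=test
#SBATCH --time=00:05:00
#SBATCH --account=testproj_0000

# load the toolchain your program was built with (module name is a placeholder)
module load gcc

# start the program on the allocated resources
srun ./my_program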

Use spredict myscript.sh to estimate the start time of your job.

Shared Nodes

All nodes are shared by default. If a user requests fewer CPU cores than a node provides, other users may use the remaining resources at the same time. To ensure that the requested nodes are not shared, use the --exclusive flag. If more than one node is requested, the --exclusive flag is mandatory.
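
For example, a two-node job that must not share its nodes with other users could be submitted like this (account and script name are placeholders):

sbatch -N 2 --exclusive --partition=cpu --job-name=test --time=01:00:00 --account=testproj_0000 myscript.sh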

GPU Nodes

For requesting resources on a GPU node, the --gpus=<number of GPUs> flag is required. In order to allow for fairly shared resources, the number of CPUs per GPU is limited; thus the --cpus-per-gpu=<number of CPU cores per GPU> flag is required as well. For multi-node jobs, the --gpus-per-node=<number of GPUs per node> option needs to be set.
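
For example, an interactive session with a single GPU could be requested like this (the values for --gpus and --cpus-per-gpu are placeholders; choose them within the per-GPU CPU limit of the partition):

srun -N 1 --partition=gpu --gpus=1 --cpus-per-gpu=8 --time=01:00:00 --account=testproj_0000 --pty bash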

Visualization Nodes

For requesting resources on a visualization node, no --gpus parameter is needed. The available GPU will automatically be shared between all jobs on the node.

List of Currently Running and Pending Jobs

If requested resources are currently not available, jobs are queued and will start as soon as the resources become available again. To check which jobs are currently running, and which ones are pending and for what reason, the squeue command can be used. For privacy reasons, only the user’s own jobs are displayed.

[login_id@login001 ~]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             10290       cpu     test login_id  R       2:51      1 cpu001

List of Computing Resources Share

Users/Projects/Groups/Institutes are billed for the computing resources they use. To check how many resources a user is entitled to and how many they have already used, the sshare command can be used. For privacy reasons, only the user’s own shares are displayed.

[login_id@login001 ~]$ sshare
Account                    User  RawShares  NormShares    RawUsage  EffectvUsage  FairShare
-------------------- ---------- ---------- ----------- ----------- ------------- ----------
testproj_0000          login_id       1000    0.166667    20450435      0.163985   0.681818

List of Project Accounts

For technical reasons, project accounts on Elysium have rather cryptic names, based on the LoginID of the project manager and a number. To make it easier to select a project account for the --account flag of srun or sbatch, and to check the share and usage of projects, the RUB-exclusive rub-acclist command can be used.

[login_id@login001 ~]$ rub-acclist
Project ID    | Project Description
--------------+--------------------------------------------------
testproj_0000 | The fundamental interconnectedness of all things
testproj_0001 | The translated quaternion for optimal pivoting

Visualization

We provide visualization via VirtualGL on the visualization nodes of Elysium (elysium.hpc.ruhr-uni-bochum.de).

Requirements:

  • An X11 server with 24-bit or 32-bit visuals
  • VirtualGL version > 3.0.2 installed

You can check support for your operating system at https://virtualgl.org/Documentation/OSSupport and download VirtualGL at https://github.com/VirtualGL/virtualgl/releases.

To use VirtualGL on Elysium, you only need the VirtualGL client; it is not necessary to configure a VirtualGL server.

Resource allocation:

Allocate resources in the vis partition.

salloc -p vis -N1 --time=02:00:00 --account=$ACCOUNT

This will allocate a share of one vis node for 2 hours (for more options on node allocations see SLURM). Wait until a slot in the vis partition is available. You can check whether your resources have been allocated using the squeue command.

Establish a VirtualGL connection:

Connect directly from your computer to the visualization node via ssh with vglconnect -s. Use one of the login servers login[001-004] as a jump host.

vglconnect -s $HPCUSER@vis001.elysium.hpc.rub.de -J $HPCUSER@login001.elysium.hpc.rub.de

If you don’t like long commands, you can configure one of the login nodes as a jump host for the vis[001-003] hosts in your ~/.ssh/config. The command vglconnect -s accepts nearly the same syntax as ssh.
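
A possible ~/.ssh/config entry for this is sketched below (it reuses the key and the <loginID> placeholder from the Login section):

Host vis*.elysium.hpc.rub.de vis*.elysium.hpc.ruhr-uni-bochum.de
    ProxyJump login001.elysium.hpc.rub.de
    IdentityFile ~/.ssh/elysium
    User <loginID>

With this entry in place, vglconnect -s vis001.elysium.hpc.rub.de should work without the explicit -J option.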

Run your Software:

Load a module if required. Start your application using vglrun; remember to use useful command-line options like -fps.

module load vmd
vglrun +pr -fps 60 vmd

Please remember to cancel the resource allocation once you are done with your interactive session.

scancel $jobID

Job Monitoring

With our web-based job monitoring system (ClusterCockpit), you can easily monitor and analyze the performance of your jobs on the Elysium HPC system. For a quick performance check, see Metrics to Check; for an in-depth analysis, refer to the HPC-Wiki. For details on the web interface, consult the official documentation.

Login

To access the job monitoring system, use your RUB LoginID and corresponding password as credentials. JobMon Login Page

Overview

After logging in successfully, you will see the “Clusters” overview, which displays the total number of jobs you have run and the current number of jobs running on the cluster. At present, this information includes only the Elysium cluster. You can continue from here, either by going to the total jobs overview, or the running jobs overview. Alternatively, you can click on “My Jobs” in the top left of the page, or search for job names/IDs in the top right of the page. JobMon Landing Page

My Jobs

The “My Jobs” page displays a list of your jobs, fully customizable to your requirements. Use the menus in the top left corner to sort or filter the list, and select the metrics you want to display for your jobs. Below, you’ll find a detailed table with job IDs, names, and your selected metrics.

MyJobs Page

Job Details

This page is split into three sections. The first one shows general information: JobInfo, a footprint, and a roofline diagram that shows how efficiently the job utilized the hardware. Note that the footprint is only updated every 10 minutes and the energy footprint is generated after the job has finished.

In the next section, some metrics are shown as diagrams. For some of the diagrams you can choose the scope, i.e. core, socket, or node. The displayed metrics and their order can be customized with the “Select Metrics” menu; this selection is saved per partition. Double-click a graph to zoom out if the scale is too small.

The last section displays selected metrics numerically, lets you inspect your job script, and shows more detail about the job allocation and runtime parameters.

Job Page

Metrics

The following table shows the metrics which are available for jobs on Elysium:

Metric name | Meaning | Meaningful for shared jobs

CPU:
cpu_load | Load on the node (processes/threads requesting CPU time) | No
cpu_load_core | Load on CPU cores of a job (processes/threads per core) | Yes
cpu_user | Percentage of CPU time spent as user time for each CPU core | Yes
clock | Frequency of the CPU cores of the job | Yes (affected by other jobs)
ipc | Instructions per cycle | Yes
flops_any | Floating-point operations performed by CPU cores | Yes
core_power | Power consumption of individual CPU cores | Yes

Memory:
mem_bw | Memory bandwidth | No (full socket only)
mem_used | Main memory used on the node | No
disk_free | Free disk space on the node | No

GPU:
nv_compute_processes | Number of processes using the GPU | Yes
acc_mem_used | Accelerator (GPU) memory usage | Yes
acc_mem_util | Accelerator (GPU) memory utilization | Yes
acc_power | Accelerator (GPU) power usage | Yes
acc_utilization | Accelerator (GPU) compute utilization | Yes

Filesystem:
lustre_write_bw | /lustre write bandwidth | No
lustre_read_bw | /lustre read bandwidth | No
lustre_close | /lustre file close requests | No
lustre_open | /lustre file open requests | No
lustre_statfs | /lustre file stat requests | No
io_reads | Local Disk I/O read operations | No
io_writes | Local Disk I/O write operations | No
nfs4_close | /home + /cluster file close requests | No
nfs4_open | /home + /cluster file open requests | No
nfsio_nread | /home + /cluster I/O read bandwidth | No
nfsio_nwrite | /home + /cluster I/O write bandwidth | No

Network:
ib_recv | Omnipath receive bandwidth | No
ib_xmit | Omnipath transmit bandwidth | No
ib_recv_pkts | Omnipath received packets/s | No
ib_xmit_pkts | Omnipath transmitted packets/s | No
net_bytes_in | Ethernet incoming bandwidth | No
net_bytes_out | Ethernet outgoing bandwidth | No
net_pkts_in | Ethernet incoming packets/s | No
net_pkts_out | Ethernet outgoing packets/s | No

NUMA Nodes:
numastats_numa_hit | NUMA hits | No
numastats_numa_miss | NUMA misses | No
numastats_interleave_hit | NUMA interleave hits | No
numastats_local_node | NUMA local node accesses | No
numastats_numa_foreign | NUMA foreign node accesses | No
numastats_other_node | NUMA other node accesses | No

Node metrics:
node_total_power | Power consumption of the whole node | No

Metrics to Check

For a quick performance analysis, here are some key metrics to review:

  • cpu_user: Should be close to 100%. Lower values indicate system processes are using some of your resources.
  • flops_any: Measures calculations per second. On Elysium, a typical CPU node averages around 400 GFLOPS.
  • cpu_load_core: Should be 1 at most for non-OpenMP jobs. Higher values suggest oversubscription.
  • ipc: Instructions executed per cycle. Higher values indicate better efficiency.
  • mem_bw: Memory bandwidth, maxing out at 350 GByte/s. Only meaningful if the node isn’t shared or your job uses a full socket.
  • acc_utilization: GPU compute utilization. Aim for high percentages (e.g., above 80%) to ensure efficient GPU usage.

Known Problems

Occasionally, an orange box labeled “No dataset returned for <metric>” may be shown instead of the graph. This occurs when the ClusterCockpit service was unable to collect the metrics during your job. Note that jobs that ran before March 12th 2025 may report missing or incorrect data in some cases.

The measurements for ipc and clock are sometimes too high. This is related to power saving features of the CPU. We are currently investigating how to solve this issue.

For jobs that ran before March 7th 2025, a bug triggered an overflow in the power usage metric, resulting in unrealistically high power consumption values. This bug is fixed, but the fix cannot be applied to older jobs that were affected by it.