
News

HPC@RUB News - also available via RSS Feed or the hpc-news mailing list.

New Feature: Job Monitoring with ClusterCockpit

We are happy to announce that job monitoring via ClusterCockpit is now available at jobmon.hpc.rub.de!

ClusterCockpit is a new web-based system for job-specific performance and energy monitoring for HPC clusters. It is developed under the MIT open source license at the NHR center NHR@FAU, with contributions from several other HPC centers, including HPC@RUB. Development is funded by the BMBF as part of the Energy-Efficient HPC project.

HPC@RUB is the fourth HPC center to provide ClusterCockpit, and the first Tier-3 center.

Documentation is available at rub.de/hpc/documentation/jobmonitoring.

An online tutorial will be held on March 31 at 10:00 via Zoom.

User Community Matrix Room

We created the HPC User room in the RUB Matrix Server. Please join! This room is for exchange and collaboration between HPC users at RUB.

The HPC team is also present, but please note that you should still contact us at hpc-helpdesk@ruhr-uni-bochum.de if you need a problem solved.

Elysium Launch Event

The launch event for the Elysium cluster took place yesterday; see the IT.SERVICES News.

Cluster Elysium is now open!

We are happy to announce that the Elysium cluster started operations and is now open for all researchers at RUB!

Upcoming HPC course

IT.SERVICES is offering the course “Introduction to High Performance Computing” again.

It will take place on December 11th 2024 in IA 0/65 in a hybrid format. There will be a lecture with interactive exercises on an HPC system. For the exercises we will provide 15 accounts. The accounts will be given to the first 15 in-person participants. Note that we cannot give any accounts to remote participants. Please register in the Moodle course to participate. If you are unsure whether you need the course, you can take the quiz provided in the Moodle course to assess your knowledge of HPC.

For successful participation in the course, solid basic knowledge of Linux systems, especially the BASH shell, is required. We also offer courses for this (Introduction to Linux: November 27th, 9:00-12:30, Registration Linux).

Introduction to HPC:

  • Prerequisite: Solid basic knowledge of Linux
  • When and Where: December 11th, 2024, 9:00-13:00 in IA 0/65
  • Registration HPC
  • Zoom Link: Will be provided one day in advance to all registered participants.

Upcoming Introduction to the Tier-3 System

IT.SERVICES is offering two introductions to the Tier-3 HPC system via Zoom, providing an overview of the system and access possibilities.

Topics:

  • Introduction to the Advisory Board and HPC Team
  • Overview of node types and storage systems
  • Partitions/Queues
  • Accounting structure and Fair-Share
  • Computing time and access requests
  • Login procedure

Dates:

  • Tue, 05.11.2024, 14:00
  • Fri, 08.11.2024, 14:00

If you are interested, please contact us to get the Zoom link.

If you are unable to attend the offered dates, please let us know so that we can offer an alternative date.

Upcoming Linux and HPC courses

IT.SERVICES is offering two training courses with limited participant numbers. Please register for the Moodle courses. If you’re unsure whether you need the course, you can take the quiz provided in the Moodle course to assess your knowledge of Linux or HPC.

If you find that you don’t need the course or can’t attend, please deregister to allow others to take one of the limited spots.

  • Introduction to Linux:

    • Next dates:
      • 13.11.2024 9:00-12:30 in IA 1/83
      • 27.11.2024 9:00-12:30 in IA 1/83
    • Registration
  • Introduction to HPC:

    • Prerequisite: Solid basic knowledge of Linux
    • Next dates:
      • 06.11.2024 9:00-15:00 in IA 1/83
      • 21.11.2024 9:00-15:00 in IA 1/83
    • Registration


Subsections of Overview

Regulations

Governing Structure

The HPC@RUB cluster Elysium is operated by the HPC team at IT.SERVICES.

The governing structure of HPC@RUB is defined in the Terms of Use.

The HPC Advisory Board consists of five elected RUB scientists and two IT.SERVICES employees.

The five members of the current HPC Advisory Board, elected on April 18 2024, are:

  • Prof. Ralf Drautz (speaker)
  • Prof. Jörg Behler
  • Prof. Sen Cheng
  • Prof. Markus Stricker
  • Prof. Andreas Vogel

FairShare

One of the main tasks of the HPC Advisory Board is to allocate a so-called FairShare of the HPC resources to Faculties, Research Centres, and Research Departments. Part of the FairShare is always reserved for scientists whose facility does not have its own allocated FairShare, so that the HPC resources are open to every scientist at RUB.

The FairShare is a percentage that determines how much of the resources is available to a given facility on average. A facility with a 10% FairShare can use 10% of the cluster 24/7 on average. If it uses less, others can make use of the free resources, and the priority of the facility to get the next job to run on the cluster will grow. If it uses more (because others don’t make full use of their FairShare), its priority will shrink accordingly. FairShare usage tracking decays over time, so that it is not possible to save up FairShare for nine months and then occupy the full cluster for a full month.

Within a given facility, all scientists that are HPC project managers share its FairShare. All HPC projects share the FairShare of their manager. Finally, all HPC users share the FairShare of their assigned project. This results in the FairShare tree that has become the standard way of managing HPC resources.

Project Management

HPC resources are managed based on projects to which individual users are assigned. The purpose of the projects is to keep an account of resource usage based on the FairShare of project managers within the FairShare of their facility.

Professors and group leaders can apply to become project managers; see the Terms of Use for details.

A project manager may apply for projects, and is responsible for compliance with all rules and regulations. Projects will be granted after a basic plausibility check; there is no review process, and access to resources is granted solely based on the FairShare principle, not based on competing project applications.

Users need to apply for access to the system, but access is only active if the user is currently assigned to at least one active project by a project manager.

Resources at RUB

HPC Cluster Elysium

Node Specifications

Type     | Count | CPU                        | Memory  | Local NVMe Storage | GPU
---------|-------|----------------------------|---------|--------------------|----------------------------------------------
Thin-CPU | 284   | 2x AMD EPYC 9254 (24 core) | 384 GB  | 960 GB             | -
Fat-CPU  | 13    | 2x AMD EPYC 9454 (48 core) | 2304 GB | 1.92 TB            | -
Thin-GPU | 20    | 2x AMD EPYC 9254 (24 core) | 384 GB  | 1.92 TB            | 3x NVIDIA A30 Tensor Core GPU 24 GB, 933 GB/s
Fat-GPU  | 7     | 2x AMD EPYC 9454 (48 core) | 1152 GB | 1.92 TB + 15.36 TB | 8x NVIDIA H100 SXM5 GPU 80 GB, 3.35 TB/s

File Systems

The following file systems are available:

  • /home: For your software and scripts. High availability, but no backup. Quota: 50 GB per user.
  • /lustre: Parallel file system to use for your jobs. High availability, but no backup. Not for long term storage. Quotas: 1 TB and 1,000,000 files per user.
  • /tmp: Fast storage on each node for temporary data. Limited in space, except for FatGPU nodes where multiple TB are available. Data is removed when the job ends.

Partition Overview

Two partitions are available for each type of compute node: the filler partitions are designed for short jobs, while the standard partitions support longer-running tasks.

Jobs in a filler partition have a lower priority and will only start if no job from the regular partition requests the resources. Running jobs in a filler partition costs only a fraction of the FairShare that the corresponding regular partition would charge.

The vis partition is special since the visualization nodes are intended for interactive use.

Partition      | Timelimit   | Nodelist        | Max Tasks per Node | Max Memory per CPU³ | Share-Cost²
---------------|-------------|-----------------|--------------------|---------------------|---------------
cpu            | 2-00:00:00¹ | cpu[001-284]    | 48                 | 8 GB                | 1.000 / core
cpu_filler     | 3:00:00     | cpu[001-336]    | 48                 | 8 GB                | 0.050 / core
fat_cpu        | 2-00:00:00  | fatcpu[001-013] | 96                 | 24 GB               | 1.347 / core
fat_cpu_filler | 3:00:00     | fatcpu[001-013] | 96                 | 24 GB               | 0.067 / core
gpu            | 2-00:00:00  | gpu[001-020]    | 48                 | 8 GB                | 49.374 / GPU
gpu_filler     | 1:00:00     | gpu[001-020]    | 48                 | 8 GB                | 12.344 / GPU
fat_gpu        | 2-00:00:00  | fatgpu[001-007] | 96                 | 12 GB               | 196.867 / GPU
fat_gpu_filler | 1:00:00     | fatgpu[001-007] | 96                 | 12 GB               | 49.217 / GPU
vis            | 1-00:00:00  | vis[001-003]    | 48                 | 24 GB               | 5.000 / core

¹ Times of up to 7 days are possible on this partition but not recommended. Only 2 days are guaranteed; jobs running longer than that may be cancelled if that becomes necessary for important maintenance work.

² Cost does not refer to money, but to the factor of computing time that is added to a project's used share in order to compute job priorities. The costs are based on the relative monetary costs of the underlying hardware. For example, one hour on 48 cores in the cpu partition adds 48 × 1.000 = 48 core-hours to the used share, while the same hour in cpu_filler adds only 48 × 0.050 = 2.4.

³ Some of the memory is reserved for system services. Use the scontrol show partition <partition_name> command to check how much memory is available for your job via the --mem-per-cpu=<mem> submission flag.
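For example, to see the memory limits of the cpu partition (a sketch; scontrol is standard Slurm, and depending on the configuration the limits may be reported per CPU or per node):

scontrol show partition cpu

# print only the memory-related fields from the space-separated key=value output
scontrol show partition cpu | tr ' ' '\n' | grep -i mem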

Resources Elsewhere

HPC Pyramid

The HPC resources in Germany are arranged hierarchically in the so-called HPC pyramid.

HPC pyramid

If suitable for your needs, use the local resources provided by the tier-3 centre. If you need more resources than the local centre can provide, or your project requires specialized hardware, you are welcome to contact another HPC centre or request computing time at a higher tier (tier-2 or tier-1).

State-wide Computing Resources (Tier-2, Tier-3)

Several state-wide tier-2 centres (NHR centres) are available to cater for specialized computing and/or storage requirements. In North Rhine-Westphalia, the RWTH Aachen, the University of Cologne and the University of Paderborn offer structured access to HPC users from NRW institutes.

National and EU-wide HPC-Resources (Tier-1 and Tier-0)

For extremely complex and data-intensive requirements, HPC resources of the highest tier are available in Germany and the EU. Computing time is only allocated after a technical and scientific peer review process (GCS, PRACE).

Access to HPC Resources elsewhere

We would be happy to advise you on the suitability of your projects as well as provide help with the application process for all levels of the HPC pyramid. Please contact us.

HPC.NRW

HPC NRW Logo

The Ruhr-University of Bochum is part of the North Rhine-Westphalian Competence Network for High Performance Computing HPC.NRW. This network offers a first point of contact and central advisory hub with a broad knowledge base for HPC users in NRW. In addition, the tier-2 centres offer uniform, structured access for HPC users of all universities in NRW, ensuring basic services are provided for locations without tier-3 centres and for Universities of Applied Sciences.

A network of thematic clusters for low-threshold training, consulting and coaching services has been created within the framework of HPC.NRW. The aim is to make effective and efficient use of high-performance computing and storage facilities and to support scientific researchers of all levels. The existing resources and services that the state has to offer are also presented in a transparent way.

Access

Access to Elysium is granted based on HPC project applications. If you do scientific work at RUB, you are eligible for access; see the Terms of Use for details.

If you need a user account to log in to Elysium, go here: Get User Access. Note that access will only be active if you are assigned to at least one HPC project by a project manager!

If you are already an HPC project manager and would like to apply for a new project, go here: Apply for a HPC Project.

If you are a professor or research group leader looking to apply for computing resources on Elysium, go here: Become Project Manager.

Subsections of Access

Get User Access

In order to get a user account on Elysium you need to download and fill out the user access application form. Note that you can only log in to the cluster after you have been assigned to at least one project by a project manager.

HPC user access application screenshot

RUB LoginID: This is your RUB-LoginID. You need to activate two-factor authentication for it.

SSH Key (Pub): Your public (not private!) SSH key. This key must be your own; sharing keys with others, e.g. members of your work group, is not allowed. You can generate an SSH key pair with the following command:

ssh-keygen -t ed25519 -f ~/.ssh/elysium -N "passphrase"

Then you enter the contents of the file ~/.ssh/elysium.pub in the field. The “passphrase” should of course be changed to an appropriately complex password that prevents malicious usage of your key. Note that you should only do this once: running the command again will overwrite your key and block you from accessing the cluster.

Note that RSA keys must have at least 3000 bits in accordance with BSI regulations. We recommend ED25519 keys, as shown in the example above.

After you have correctly filled out the form, save the PDF as User-Access-Application_<loginID>.pdf, e.g. User-Access-Application_mamuster.pdf, and send it via email to hpc+applications@ruhr-uni-bochum.de.

Updating the SSH-Key

If you want to register another SSH key or you lost your already registered one, simply fill out the User-Access-Application again and send it in with the exact same filename. If your SSH key was compromised or stolen, please notify us immediately so that we can invalidate it.
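When generating a replacement key, it is a good idea to write it to a new file so you do not overwrite a key you are still using (the filename below is only an example):

ssh-keygen -t ed25519 -f ~/.ssh/elysium_new -N "passphrase"

Remember to adjust the IdentityFile path in your ~/.ssh/config accordingly.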

Apply for a HPC Project

After you have been approved as a project manager you can apply for research projects on Elysium by downloading and filling out the project application form. The application is required for managing the FairShare of computing resources and for reporting to the HPC Advisory Board, funding organizations, etc. There is no peer-review process and your project is automatically accepted. Note that more projects do not mean more computing resources: your personal resources are shared between all your projects.

HPC project application screenshot

RUB LoginID: This is your RUB-LoginID.

Project Name: Name under which your project should be listed in any report.

Abstract: A short 2-3 line abstract outlining the contents of your research project.

Field of Science: Identification number according to DFG subject classification system.

Third-party funded: Optional field for research projects with third-party funding, e.g. funding institution or project number.

RUB LoginIDs: Comma-separated list of the RUB-LoginIDs of the people you want to participate in this project. (Note that the project manager's LoginID is not automatically included in the participant list. If you, as a project manager, want to participate in your own project, you must enter your LoginID as well.)

Contingent: From the dropdown menu, select the computing resources contingent of your project. If you did not participate in the HPC cluster application, you need to select “Miscellaneous User Groups”, or get permission from the members of the other contingents to use their resources.

After you have correctly filled out the form, save the PDF as Project-Application_<loginID>_<number>.pdf, e.g. Project-Application_mamuster_5.pdf, and send it via email to hpc+applications@ruhr-uni-bochum.de. For technical reasons the numbering of projects starts at 0: your first project has number 0, your second number 1, and so on. In the example above, the number 5 refers to the sixth project application sent in by you.

Updating a Project

If you need to add or remove users, simply modify the user list in the PDF and send it in again with the exact same filename.

Become Project Manager

In order to get computing resources on Elysium and be able to manage projects you need to download and fill out the project manager application form and sign the compliance to export control regulations form.

Please note that only professors and independent group leaders within the Ruhr-University are eligible to become project managers! See the regulations for details.

HPC project manager application screenshot

Group Name: The name of your work group. E.g. “Chair for constructive demolition techniques”, or “Computational analysis of Sumerian poetry”, …

Faculty/Institute: The name of the faculty your group is located in. E.g. “Faculty of Mathematics”, “ICAMS”, or “Universitätsklinikum Josefs-Hospital”

RUB LoginID: This is your RUB-LoginID.

Email: Your RUB email address (not your institute email address!), e.g. max.muster@ruhr-uni-bochum.de

Signed export regulations: Check this box after you have read, understood, and signed the compliance to export control regulations form linked above.

After you have correctly filled out the form, save the PDF as Projectmanager-Application_<loginID>.pdf, e.g. Projectmanager-Application_mamuster.pdf. Save a scan of the signed export control regulations form as Compliance-Export-Control-Regulations_<loginID>.pdf, e.g. Compliance-Export-Control-Regulations_mamuster.pdf. Send the application form and the scan of your signed export control regulations form via email to hpc+applications@ruhr-uni-bochum.de.

Documentation

Elysium is the central HPC Cluster at the Ruhr-Universität Bochum. See the overview of its resources.

To use Elysium, you need to get access first; see the Access section.

Please read about the basic concept of using Elysium first.

The login process combines SSH key-based authentication with web-based two-factor authentication.

After login, you can use available software modules or build your own software.

Read about submitting jobs and allocating resources in the SLURM section.

Subsections of Documentation

Basics

Elysium provides four login nodes: login1.elysium.hpc.rub.de, …, login4.elysium.hpc.rub.de. These are your entry points to the cluster.

After login, you typically use them to prepare your software and copy your data to the appropriate locations.

You can then allocate resources on the cluster using the Slurm workload manager.

After submitting your request, Slurm will grant you the resources as soon as they are free and your priority is higher than the priority of other jobs that might be waiting for some of the same resources.

Your priority depends on your waiting time and your remaining FairShare.
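If you want to see how waiting time (age) and FairShare contribute to the priority of your own pending jobs, the standard Slurm sprio command can be used (a sketch, assuming it is enabled on the login nodes):

# show the priority components of your own pending jobs
sprio -u $USER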

Login

Login to Elysium combines common SSH key-based authentication with web-based two-factor authentication. In order to be able to authenticate during login, you need to submit your public SSH key via the User Access Application Form as well as enable two-factor authentication for your RUB LoginID at rub.de/login.

The additional web-based authentication is cached for 14 hours so that you typically only have to do it once per work day, per login node, and per IP address you connect from. After that, your normal key-based SSH workflow will work as expected.

In order to simplify the use of SSH keys we recommend specifying your key as the identity file in your SSH config. This can be done by adding the following lines to your ~/.ssh/config file:

Host login00*.elysium.hpc.rub.de login00*.elysium.hpc.ruhr-uni-bochum.de
    IdentityFile ~/.ssh/elysium
    User <loginID>

where <loginID> has to be replaced by your RUB LoginID. If your SSH key is located in a different file, the IdentityFile path needs to be adjusted accordingly.

Follow these steps:

  1. Start ssh with the correct private key, your RUB LoginID, and one of the four login hosts, e.g.
    ssh -i ~/.ssh/elysium LOGINID@login001.elysium.hpc.ruhr-uni-bochum.de, or ssh login001.elysium.hpc.rub.de if you want to use the SSH config specified above. Available login nodes are login001 to login004. Login step 1: start ssh

  2. Open the URL in a browser (or scan the QR code with your smartphone) to start web-based two-factor authentication. Login step 2 part 1: start web-based authentication

  3. Enter the second factor for two-factor authentication. Login step 2 part 2: web-based LoginID / password authentication

  4. After successful login, you get a four-digit verification code. Login step 2 part 3: get the verification code

  5. Enter this code at your ssh prompt to finish login. Login step 3: verify the SSH session

For the next 14 hours, only step 1 (classic key-based authentication) will be necessary on the chosen login node for the IP address you connected from.

Login will fail if:

  • You use the wrong private key (“Permission denied (publickey)”)
  • You are not a member of an active HPC project (“Permission denied (publickey)”)
  • You did not enable two-factor authentication for your LoginID (“Two-factor authentication is required”)
  • Web-based login fails
  • You enter the wrong verification code (“Verification failed”)
  • A timeout happens between starting the SSH session and finalizing web-based login (“session_id not found”); just start the process again to get a new session ID.

Software

We provide a basic set of toolchains and some common libraries via modules.

To build common HPC software packages, we provide a central installation of the Spack package manager. A detailed guide on how to use this installation can be found in the Spack Usage Guide.

Subsections of Software

Modules

We use the Lmod module system:

  • module available (shortcut ml av) lists available modules
  • module load loads selected modules
  • module list shows modules currently loaded in your environment

There are also hidden modules that are generally less relevant to users but can be viewed with ml --show_hidden av.
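A typical session might look like this (the module name and version are only illustrative; check module avail for what is actually installed):

# list the available modules (shortcut: ml av)
module avail

# load one of them, e.g. a newer GCC toolchain (illustrative name)
module load gcc/13.2.0

# verify what is loaded in the current environment
module list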

We are committed to providing the tools you need for your research and development efforts. If you require modules that are not listed here or need different versions, please contact our support team, and we will be happy to assist you.

Compilers

  • GCC 11.4.1, default on the system.
  • GCC 13.2.0.
  • AOCC 4.2.0, AMD Optimizing C/C++ Compiler
  • Intel OneAPI Compilers: 2024.1.0
  • Intel Classic Compilers: 2021.10.0
  • NVHPC 24.7, NVIDIA HPC Compilers

MPI Libraries

  • OpenMPI 4.1.6
  • OpenMPI 5.0.3 (default).
  • MPICH 4.2.1
  • Intel OneAPI MPI 2021.12.1.

Mathematical Libraries

  • AMD Math Libraries:
    • AMD BLIS, BLAS-like libraries.
    • AMD FFTW, a fast Fourier transform library.
    • AMD libFLAME, a library for dense matrix computations. (LAPACK)
    • AMD ScaLAPACK.
  • HDF5: Version 1.14.3 (built with MPI)
  • Boost 1.85.0

Programming Languages

  • Julia 1.10.2, a high-level, high-performance dynamic programming language.
  • R 4.4.0, for statistical computing.
  • Python 3.11.7

Tools and Utilities

  • CUDA Toolkit 12.6.1
  • GDB (GNU Debugger)
  • Apptainer

Spack

We use the Spack package manager to provide a collection of common HPC software packages. This page explains how to use the central Spack installation to build your own modulefiles.

Table of Contents

  1. Quick Setup
  2. Guide to Using Spack
  3. Central Spack Installation
  4. Overriding Package Definitions

Quick Setup with rub-deploy-spack-configs

You can directly copy the configuration files described in Central Spack Installation (upstreams.yaml, config.yaml, modules.yaml, compilers.yaml) to your home directory using the rub-deploy-spack-configs command:

rub-deploy-spack-configs

Add these lines to your ~/.bashrc to activate spack with every login:

export MODULEPATH=$MODULEPATH:$HOME/spack/share/spack/lmod/linux-almalinux9-x86_64/Core
. /cluster/spack/0.23.0/share/spack/setup-env.sh

Guide to Using Spack

Below is a detailed guide on how to effectively use Spack.

  1. Searching for Packages
  2. Viewing Package Variants
  3. Enabling/Disabling Variants
  4. Specifying Compilers
  5. Specifying Dependencies
  6. Putting It All Together
  7. Building and Adding a New Compiler
  8. Comparing Installed Package Variants
  9. Removing Packages

Searching for Packages

To find available packages, use:

spack list <keyword>  # Search for packages by name
# Example:
spack list openfoam
openfoam  openfoam-org
==> 2 packages

For detailed information about a package:

spack info <package>  # Show versions, variants, and dependencies
# Example:
spack info hdf5

For a quick search for all available packages in spack, visit https://packages.spack.io/.


Viewing Package Variants

Variants are build options that enable or disable features. List them with spack info <package>:

spack info hdf5

Output includes:

Preferred version:  
    1.14.3           https://support.hdfgroup.org/ftp/HDF5/releases/hdf5-1.14/hdf5-1.14.3/src/hdf5-1.14.3.tar.gz

Safe versions:  
    1.14.3           https://support.hdfgroup.org/ftp/HDF5/releases/hdf5-1.14/hdf5-1.14.3/src/hdf5-1.14.3.tar.gz
    1.14.2           https://support.hdfgroup.org/ftp/HDF5/releases/hdf5-1.14/hdf5-1.14.2/src/hdf5-1.14.2.tar.gz
    1.14.1-2         https://support.hdfgroup.org/ftp/HDF5/releases/hdf5-1.14/hdf5-1.14.1-2/src/hdf5-1.14.1-2.tar.gz
    1.14.0           https://support.hdfgroup.org/ftp/HDF5/releases/hdf5-1.14/hdf5-1.14.0/src/hdf5-1.14.0.tar.gz
    1.12.3           https://support.hdfgroup.org/ftp/HDF5/releases/hdf5-1.12/hdf5-1.12.3/src/hdf5-1.12.3.tar.gz
    1.12.2           https://support.hdfgroup.org/ftp/HDF5/releases/hdf5-1.12/hdf5-1.12.2/src/hdf5-1.12.2.tar.gz
    1.12.1           https://support.hdfgroup.org/ftp/HDF5/releases/hdf5-1.12/hdf5-1.12.1/src/hdf5-1.12.1.tar.gz
    1.12.0           https://support.hdfgroup.org/ftp/HDF5/releases/hdf5-1.12/hdf5-1.12.0/src/hdf5-1.12.0.tar.gz
    1.10.11          https://support.hdfgroup.org/ftp/HDF5/releases/hdf5-1.10/hdf5-1.10.11/src/hdf5-1.10.11.tar.gz
    1.10.10          https://support.hdfgroup.org/ftp/HDF5/releases/hdf5-1.10/hdf5-1.10.10/src/hdf5-1.10.10.tar.gz

Variants:
    api [default]               default, v110, v112, v114, v116, v16, v18
        Choose api compatibility for earlier version
    cxx [false]                 false, true
        Enable C++ support
    fortran [false]             false, true
        Enable Fortran support
    hl [false]                  false, true
        Enable the high-level library
    mpi [true]                  false, true
        Enable MPI support

Defaults are shown in square brackets, possible values to the right.


Checking the Installation

To see which dependencies will be installed, use:

spack spec hdf5

Output includes:

Input spec
--------------------------------
 -   hdf5

Concretized
--------------------------------
[+]  hdf5@1.14.3%gcc@11.4.1~cxx~fortran+hl~ipo~java~map+mpi+shared~subfiling~szip~threadsafe+tools api=default build_system=cmake build_type=Release generator=make patches=82088c8 arch=linux-almalinux9-zen4
[+]      ^cmake@3.27.9%gcc@11.4.1~doc+ncurses+ownlibs build_system=generic build_type=Release arch=linux-almalinux9-zen4
[+]          ^curl@8.7.1%gcc@11.4.1~gssapi~ldap~libidn2~librtmp~libssh+libssh2+nghttp2 build_system=autotools libs=shared,static tls=mbedtls,openssl arch=linux-almalinux9-zen4
[+]              ^libssh2@1.11.0%gcc@11.4.1+shared build_system=autotools crypto=mbedtls patches=011d926 arch=linux-almalinux9-zen4
[+]                  ^xz@5.4.6%gcc@11.4.1~pic build_system=autotools libs=shared,static arch=linux-almalinux9-zen4
[+]              ^mbedtls@2.28.2%gcc@11.4.1+pic build_system=makefile build_type=Release libs=shared,static arch=linux-almalinux9-zen4
[+]              ^nghttp2@1.52.0%gcc@11.4.1 build_system=autotools arch=linux-almalinux9-zen4
[+]                  ^diffutils@3.10%gcc@11.4.1 build_system=autotools arch=linux-almalinux9-zen4
[+]              ^openssl@3.3.0%gcc@11.4.1~docs+shared build_system=generic certs=mozilla arch=linux-almalinux9-zen4
[+]                  ^ca-certificates-mozilla@2023-05-30%gcc@11.4.1 build_system=generic arch=linux-almalinux9-zen4
[+]          ^ncurses@6.5%gcc@11.4.1~symlinks+termlib abi=none build_system=autotools patches=7a351bc arch=linux-almalinux9-zen4
[+]      ^gcc-runtime@11.4.1%gcc@11.4.1 build_system=generic arch=linux-almalinux9-zen4
[e]      ^glibc@2.34%gcc@11.4.1 build_system=autotools arch=linux-almalinux9-zen4
[+]      ^gmake@4.4.1%gcc@11.4.1~guile build_system=generic arch=linux-almalinux9-zen4
[+]      ^openmpi@5.0.3%gcc@11.4.1~atomics~cuda~gpfs~internal-hwloc~internal-libevent~internal-pmix~java+legacylaunchers~lustre~memchecker~openshmem~orterunprefix~romio+rsh~static+vt+wrapper-rpath build_system=autotools fabrics=ofi romio-filesystem=none schedulers=slurm arch=linux-almalinux9-zen4

It’s always a good idea to check the specs before installing.


Enabling/Disabling Variants

Control variants with + (enable) or ~ (disable):

spack install hdf5 +mpi +cxx ~hl  # Enable MPI and C++, disable high-level API

For packages with CUDA, use compute capabilities 8.0 (for GPU nodes) and 9.0 (for FatGPU nodes):

spack install openmpi +cuda cuda_arch=80,90

Specifying Compilers

Use % to specify a compiler. Check available compilers with:

spack compilers

Example:

spack install hdf5 %gcc@11.4.1

When using compilers other than GCC 11.4.1, dependencies must also be built with that compiler:

spack install --fresh hdf5 %gcc@13.2.0

Specifying Dependencies

Use ^ to specify dependencies with versions or variants:

spack install hdf5 ^openmpi@4.1.5

Dependencies can also have variants:

spack install hdf5 +mpi ^openmpi@4.1.5 +threads_multiple

Make sure to set the variants for the package and the dependencies in the right position, or the installation will fail.


Putting It All Together

Combine options for customized installations:

spack install hdf5@1.14.3 +mpi ~hl ^openmpi@4.1.5 +cuda cuda_arch=80,90 %gcc@11.4.1

This command:

  • Compiles hdf5 1.14.3 with GCC 11.4.1.
  • Enables MPI support and disables the high-level API.
  • Uses OpenMPI 4.1.5 with CUDA support as a dependency.

Building and Adding a New Compiler

Install a new compiler (e.g., GCC 13.2.0) with:

spack install gcc@13.2.0

Add it to Spack’s compiler list:

spack compiler add $(spack location -i gcc@13.2.0)

Verify it’s recognized:

spack compilers

Use it to build packages:

spack install --fresh hdf5 %gcc@13.2.0

Comparing installed package variants

If you have multiple installations of the same package with different variants, you can inspect their configurations using Spack’s spec command or the find tool.


List installed packages with variants

Use spack find -vl to show all installed variants and their hashes:

spack find -vl hdf5

-- linux-almalinux9-zen4 / gcc@11.4.1 ---------------------------
amrsck6 hdf5@1.14.3~cxx~fortran+hl~ipo~java~map+mpi+shared~subfiling~szip~threadsafe+tools api=default build_system=cmake build_type=Release generator=make patches=82088c8

2dsgtoe hdf5@1.14.3+cxx+fortran~hl~ipo~java~map+mpi+shared~subfiling~szip~threadsafe+tools api=default build_system=cmake build_type=Release generator=make patches=82088c8

==> 2 installed packages

  • amrsck6 and 2dsgtoe are the unique hashes for each installation.
  • You can see one package uses +hl while the other does not.

Inspect specific installations

Use spack spec /<hash> to view details of a specific installation:

spack spec /amrsck6  
spack spec /2dsgtoe 

Compare two installations

To compare variants between two installations, use spack diff with both hashes:

spack diff /amrsck6 /2dsgtoe

You will see a diff in the style of git:

--- hdf5@1.14.3/amrsck6mml43sfv4bhvvniwdydaxfgne
+++ hdf5@1.14.3/2dsgtoevoypx7dr45l5ke2dlb56agvz4
@@ virtual_on_incoming_edges @@
-  openmpi mpi
+  mpich mpi

So one version depends on OpenMPI while the other depends on MPICH.


Removing Packages

For multiple variants of a package, specify the hash:

spack uninstall /amrsck6

Central Spack Installation

Activate the central Spack installation with:

source /cluster/spack/0.23.0/share/spack/setup-env.sh

Use it as a starting point for your own builds without rebuilding everything from scratch.

Add these files to ~/.spack:

  • ~/.spack/upstreams.yaml:
    upstreams:
      central-spack:
        install_tree: /cluster/spack/opt
  • ~/.spack/config.yaml:
    config:
      install_tree:
        root: $HOME/spack/opt/spack
      source_cache: $HOME/spack/cache
      license_dir: $HOME/spack/etc/spack/licenses
  • ~/.spack/modules.yaml:
    modules:
      default:
        roots:
          lmod: $HOME/spack/share/spack/lmod
        enable: [lmod]
        lmod:
          all:
            autoload: direct
          hide_implicits: true
          hierarchy: []

Add these lines to your ~/.bashrc:

export MODULEPATH=$MODULEPATH:$HOME/spack/share/spack/lmod/linux-almalinux9-x86_64/Core
. /cluster/spack/0.23.0/share/spack/setup-env.sh

Then run:

spack compiler find

Overriding Package Definitions

Create ~/.spack/repos.yaml:

repos:
  - $HOME/spack/var/spack/repos

And a local repo description in ~/spack/var/spack/repos/repo.yaml:

repo:
  namespace: overrides

Copy and edit a package definition, e.g., for ffmpeg:

cd ~/spack/var/spack/repos/
mkdir -p packages/ffmpeg
cp /cluster/spack/0.23.0/var/spack/repos/builtin/packages/ffmpeg/package.py packages/ffmpeg
vim packages/ffmpeg/package.py

Alternatively, you can use a fully independent Spack installation in your home directory or opt for EasyBuild.

SLURM

The Elysium HPC system utilizes SLURM as its resource manager, scheduler, and accounting system in order to guarantee a fair share of the computing resources.

If you are looking for technical details regarding the usage and underlying mechanisms of SLURM we recommend participating in the Introduction to HPC training course.

Examples of job scripts for different job types that are tailored to the Elysium cluster can be found in the Training Section.

List of Partitions

All nodes in the Elysium cluster are grouped by hardware type and job submission type. This way users can request specific computing hardware, and multi-node jobs are guaranteed to run on nodes with the same setup.

In order to get a list of the available partitions, their current state, and available nodes, the sinfo command can be used.

[login_id@login001 ~]$ sinfo
PARTITION      AVAIL  TIMELIMIT  NODES  STATE NODELIST
cpu               up 7-00:00:00      4  alloc cpu[033-034,037-038]
cpu               up 7-00:00:00    280   idle cpu[001-032,035-036,039-284]
cpu_filler        up    3:00:00      4  alloc cpu[033-034,037-038]
cpu_filler        up    3:00:00    280   idle cpu[001-032,035-036,039-284]
fat_cpu           up 2-00:00:00     13   idle fatcpu[001-013]
fat_cpu_filler    up    3:00:00     13   idle fatcpu[001-013]
gpu               up 2-00:00:00     20   idle gpu[001-020]
gpu_filler        up    1:00:00     20   idle gpu[001-020]
fat_gpu           up 2-00:00:00      1 drain* fatgpu005
fat_gpu           up 2-00:00:00      5    mix fatgpu[001,003-004,006-007]
fat_gpu           up 2-00:00:00      1   idle fatgpu002
fat_gpu_filler    up    1:00:00      1 drain* fatgpu005
fat_gpu_filler    up    1:00:00      5    mix fatgpu[001,003-004,006-007]
fat_gpu_filler    up    1:00:00      1   idle fatgpu002
vis               up 1-00:00:00      3   idle vis[001-003]

Requesting Nodes of a Partition

SLURM provides two commands to request resources. srun is used to start an interactive session.

[login_id@login001 ~]$ srun -N 1 --partition=cpu --job-name=test --time=00:05:00 --account=testproj_0000 --pty bash
[login_id@cpu001 ~]$

sbatch is used to request resources that will execute a job script.

[login_id@login001 ~]$ sbatch -N 1 --partition=cpu --job-name=test --time=00:05:00 --account=testproj_0000 myscript.sh
Submitted batch job 10290

For sbatch the submission flags can also be incorporated into the job script itself. More information about job scripts, and the required and some optional flags can be found in the Training/SLURM Header section.

On Elysium several flags are mandatory. sbatch and srun will refuse to queue the job and give a detailed explanation of which flag is missing and how to incorporate it into your command or script.

Use spredict myscript.sh to estimate the start time of your job.

Shared Nodes

All nodes are shared by default. If a user requests fewer CPU cores than a node provides, other users may use the remaining resources at the same time. To ensure that the requested nodes are not shared, use the --exclusive flag. If more than one node is requested, the --exclusive flag is mandatory.
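For example, a two-node job in the cpu partition must therefore be submitted with --exclusive (account name and script are placeholders taken from the examples above):

sbatch -N 2 --exclusive --partition=cpu --time=01:00:00 --account=testproj_0000 myscript.sh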

GPU Nodes

To request resources on a GPU node, the --gpus=<number of GPUs> flag is required. In order to share resources fairly, the number of CPUs per GPU is limited; thus the --cpus-per-gpu=<number of CPU cores per GPU> flag is required as well. For multi-node jobs, the --gpus-per-node=<number of GPUs per node> option needs to be set.
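A sketch of typical GPU submissions (the CPU counts per GPU are illustrative, the enforced limits are partition-specific, and account/script names are placeholders):

# single GPU on a gpu node
sbatch --partition=gpu --gpus=1 --cpus-per-gpu=12 --time=02:00:00 --account=testproj_0000 gpu_job.sh

# multi-node GPU job: --gpus-per-node and --exclusive are required
sbatch -N 2 --exclusive --partition=gpu --gpus-per-node=3 --cpus-per-gpu=16 --time=02:00:00 --account=testproj_0000 multi_gpu_job.sh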

Visualization Nodes

To request resources on a visualization node, no --gpus parameter is needed. The available GPU will automatically be shared between all jobs on the node.

List of Currently Running and Pending Jobs

If requested resources are currently not available, jobs are queued and will start as soon as the resources become available again. To check which jobs are currently running, which ones are pending, and for what reason, the squeue command can be used. For privacy reasons only the user's own jobs are displayed.

[login_id@login001 ~]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             10290       cpu     test login_id  R       2:51      1 cpu001

List of Computing Resources Share

Users/Projects/Groups/Institutes are billed for the computing resources they use. To check how many resources a user is entitled to and how many they have already used, the sshare command is used. For privacy reasons only the user's own shares are displayed.

[login_id@login001 ~]$ sshare
Account                    User  RawShares  NormShares    RawUsage  EffectvUsage  FairShare
-------------------- ---------- ---------- ----------- ----------- ------------- ----------
testproj_0000          login_id       1000    0.166667    20450435      0.163985   0.681818

List of Project Accounts

For technical reasons, the project accounts on Elysium have rather cryptic names, based on the LoginID of the project manager and a number. To make it easier to select a project account for the --account flag of srun or sbatch, and to check the share and usage of projects, the RUB-exclusive rub-acclist command can be used.

[login_id@login001 ~]$ rub-acclist
Project ID    | Project Description
--------------+--------------------------------------------------
testproj_0000 | The fundamental interconnectedness of all things
testproj_0001 | The translated quaternion for optimal pivoting

Visualization

We provide visualization via VirtualGL on the visualization nodes of Elysium (elysium.hpc.ruhr-uni-bochum.de).

Requirements:

An X11 server with 24-bit or 32-bit visuals, and VirtualGL version > 3.0.2 installed.

You can check support for your operating system at https://virtualgl.org/Documentation/OSSupport and download VirtualGL at https://github.com/VirtualGL/virtualgl/releases.

To use VirtualGL on Elysium you only need the VirtualGL client; it is not necessary to configure a VirtualGL server.

Resource allocation:

Allocate resources in the vis partition.

salloc -p vis -N1 --time=02:00:00 --account=$ACCOUNT

This will allocate a share of one vis node for 2 hours (for more options on node allocation see SLURM). Wait until a slot in the vis partition is available. You can check whether your resources have been allocated using the squeue command.

Establish Virtual GL connection:

Connect directly from your computer to the visualization node via SSH with vglconnect -s. Use one of the login servers login[001-004] as a jump host.

vglconnect -s $HPCUSER@vis001.elysium.hpc.rub.de -J $HPCUSER@login001.elysium.hpc.rub.de

If you don’t like long commands, you can configure one of the login nodes as a jump host for the vis[001-003] hosts in your ~/.ssh/config. The command vglconnect -s accepts nearly the same syntax as ssh.
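A minimal sketch of such a configuration, matching the SSH config from the Login section (<loginID> is a placeholder; vglconnect will pick this up as long as it uses your ssh defaults):

Host vis*.elysium.hpc.rub.de vis*.elysium.hpc.ruhr-uni-bochum.de
    ProxyJump login001.elysium.hpc.rub.de
    IdentityFile ~/.ssh/elysium
    User <loginID>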

Run your Software:

Load a module if required. Start your application using vglrun, and remember to use useful command line options like -fps.

module load vmd
vglrun +pr -fps 60 vmd

Please remember to cancel the resource allocation once you are done with your interactive session.

scancel $jobID

Job Monitoring

With our web-based job monitoring system (ClusterCockpit), you can easily monitor and analyze the performance of your jobs on the Elysium HPC system. For a quick performance check, see Metrics to Check; for an in-depth analysis, refer to the HPC-Wiki. For details on the web interface, consult the official documentation.

Login

To access the job monitoring system, use your RUB LoginID and corresponding password as credentials. JobMon Login Page

Overview

After logging in successfully, you will see the “Clusters” overview, which displays the total number of jobs you have run and the current number of jobs running on the cluster. At present, this information includes only the Elysium cluster. You can continue from here, either by going to the total jobs overview, or the running jobs overview. Alternatively, you can click on “My Jobs” in the top left of the page, or search for job names/ids in the top right of the page. JobMon Landing Page

My Jobs

The “My Jobs” page displays a list of your jobs, fully customizable to your requirements. Use the menus in the top left corner to sort or filter the list, and select the metrics you want to display for your jobs. Below, you’ll find a detailed table with job IDs, names, and your selected metrics.

MyJobs Page

Job Details

This page is split into three sections. The first one shows general information: JobInfo, a footprint, and a roofline diagram that shows how efficiently the job utilized the hardware. Note that the footprint is only updated every 10 minutes and the energy footprint is generated after the job has finished.

In the next section some metrics are shown as diagrams. For some of the diagrams you can choose the scope, i.e. core, socket or node. The shown metrics and their order can be customized with the “Select Metrics” menu. This selection is saved per partition. Double-click the graph to zoom out if the scale is too small.

The last section displays selected metrics numerically, lets you inspect your job script, and shows more detail about the job allocation and runtime parameters.

Job Page

Metrics

The following table shows the metrics which are available for jobs on Elysium:

Metric name              | Meaning                                                      | Meaningful for shared jobs
-------------------------|--------------------------------------------------------------|----------------------------
CPU                      |                                                              |
cpu_load                 | Load on the node (processes/threads requesting CPU time)    | No
cpu_load_core            | Load on CPU cores of a job (processes/threads per core)     | Yes
cpu_user                 | Percentage of CPU time spent as user time for each CPU core | Yes
clock                    | Frequency of the CPU cores of the job                        | Yes (affected by other jobs)
ipc                      | Instructions per cycle                                       | Yes
flops_any                | Floating-point operations performed by CPU cores            | Yes
core_power               | Power consumption of individual CPU cores                   | Yes
Memory                   |                                                              |
mem_bw                   | Memory bandwidth                                             | No (full socket only)
mem_used                 | Main memory used on the node                                 | No
disk_free                | Free disk space on the node                                  | No
GPU                      |                                                              |
nv_compute_processes     | Number of processes using the GPU                            | Yes
acc_mem_used             | Accelerator (GPU) memory usage                               | Yes
acc_mem_util             | Accelerator (GPU) memory utilization                         | Yes
acc_power                | Accelerator (GPU) power usage                                | Yes
acc_utilization          | Accelerator (GPU) compute utilization                        | Yes
Filesystem               |                                                              |
lustre_write_bw          | /lustre write bandwidth                                      | No
lustre_read_bw           | /lustre read bandwidth                                       | No
lustre_close             | /lustre file close requests                                  | No
lustre_open              | /lustre file open requests                                   | No
lustre_statfs            | /lustre file stat requests                                   | No
io_reads                 | Local disk I/O read operations                               | No
io_writes                | Local disk I/O write operations                              | No
nfs4_close               | /home + /cluster file close requests                         | No
nfs4_open                | /home + /cluster file open requests                          | No
nfsio_nread              | /home + /cluster I/O read bandwidth                          | No
nfsio_nwrite             | /home + /cluster I/O write bandwidth                         | No
Network                  |                                                              |
ib_recv                  | Omnipath receive bandwidth                                   | No
ib_xmit                  | Omnipath transmit bandwidth                                  | No
ib_recv_pkts             | Omnipath received packets/s                                  | No
ib_xmit_pkts             | Omnipath transmitted packets/s                               | No
net_bytes_in             | Ethernet incoming bandwidth                                  | No
net_bytes_out            | Ethernet outgoing bandwidth                                  | No
net_pkts_in              | Ethernet incoming packets/s                                  | No
net_pkts_out             | Ethernet outgoing packets/s                                  | No
NUMA Nodes               |                                                              |
numastats_numa_hit       | NUMA hits                                                    | No
numastats_numa_miss      | NUMA misses                                                  | No
numastats_interleave_hit | NUMA interleave hits                                         | No
numastats_local_node     | NUMA local node accesses                                     | No
numastats_numa_foreign   | NUMA foreign node accesses                                   | No
numastats_other_node     | NUMA other node accesses                                     | No
Node metrics             |                                                              |
node_total_power         | Power consumption of the whole node                          | No

Metrics to Check

For a quick performance analysis, here are some key metrics to review:

  • cpu_user: Should be close to 100%. Lower values indicate system processes are using some of your resources.
  • flops_any: Measures calculations per second. On Elysium, a typical CPU node averages around 400 GFLOPS.
  • cpu_load_core: Should be 1 at most for non-OpenMP jobs. Higher values suggest oversubscription.
  • ipc: Instructions executed per cycle. Higher values indicate better efficiency.
  • mem_bw: Memory bandwidth, maxing out at 350 GByte/s. Only meaningful if the node isn’t shared or your job uses a full socket.
  • acc_utilization: GPU compute utilization. Aim for high percentages (e.g., above 80%) to ensure efficient GPU usage.

Known Problems

Occasionally, an orange box labeled “No dataset returned for <metric>” may be shown instead of the graph. This occurs when the ClusterCockpit service was unable to collect the metrics during your job. Note that jobs that ran before March 12th 2025 may report missing or incorrect data in some cases.

The measurements for ipc and clock are sometimes too high. This is related to power saving features of the CPU. We are currently investigating how to solve this issue.

For jobs that ran before March 7th 2025, a bug triggered an overflow in the power usage metric, resulting in unrealistically high power consumption. This bug is fixed, but the fix cannot be applied to older jobs that were affected by it.

Training

Usage of HPC resources differs significantly from handling a regular desktop computer. In order to help people get started we provide two training courses.

In addition, we recommend online resources. We also provide a variety of example job scripts tailored to Elysium to get you started with your research.

Subsections of Training

Introduction to Linux

Why Linux?

Linux based operating systems are the de facto standard for HPC systems. Thus it is vital to have a solid understanding of how to work with Linux.

Linux Introductory Training

We offer an in-person course that combines a lecture and interactive exercises. The course covers the following topics:

  1. Why Linux?
  2. Directory Structure
  3. The Terminal
  4. Navigating the Directory Structure
  5. Modifying the Directory Structure
  6. Handling Files
  7. Permission Denied
  8. Editing Files in the Terminal
  9. Workflow and Pipelines
  10. Automation and Scripting
  11. Environment Variables
  12. Monitoring System Resources

Registration

Dates for the courses are announced via the tier3-hpc mailing list. Once the next course date has been announced, registration can be done via Moodle. We expect everyone who registered to participate in the next course. Note that the number of participants is limited to 20 people per course. If you change your mind about participating, please deregister from the course to free up one of the limited spots for others.

Do I Need This Course?

If you are already proficient in the topics listed above you may skip the course. It is not a requirement to get access to the cluster. In the Moodle course we provide a quiz where you can check your proficiency with Linux.

Slides

Here you may download the slides for the course: Introduction to Linux.

Introduction to HPC

Why HPC Training?

Usage of HPC resources differs significantly from handling a regular desktop computer. Thus it is vital to have a solid understanding of how to work with HPC systems.

HPC Introductory Training

We offer an in-person course that combines a lecture and interactive exercises. The course covers the following topics:

  1. What is High Performance Computing
  2. HPC-Cluster Components
  3. How to Access a Cluster?
  4. SLURM - Requesting Resources
  5. SLURM - How Resources are Scheduled
  6. SLURM - Accounting and Sharing of Compute Time
  7. Environment Modules
  8. Parallelization Models
  9. Scaling of Parallel Applications
  10. Code of Conduct

Registration

Dates for the courses are announced via the tier3-hpc mailing list. Once the next course date has been announced, registration can be done via Moodle. We expect everyone who registered to participate in the next course. Note that the number of participants is limited to 20 people per course. If you change your mind about participating, please deregister from the course to free up one of the limited spots for others.

Do I Need This Course?

If you are already proficient in the topics listed above you may skip the course. It is not a requirement to get access to the cluster. In the Moodle course we provide a quiz where you can check your proficiency with HPC systems.

Slides

Here you may download the slides for the course: Introduction to HPC.

Online Resources

If you do not want to or cannot participate in the training courses, but still want to learn about Linux and HPC, here we provide a list of a few online resources. Note that those materials might not reflect specifics regarding the hardware or environment of the RUB cluster Elysium.

Job Scripts

Jump to Example Scripts

A SLURM job script usually consists of the following steps:

  1. SLURM Header
  2. Create temporary folder on local disk
  3. Copy input data to temporary folder
  4. Load required modules
  5. Perform actual calculation
  6. Copy output file back to global file system
  7. Tidy up local storage

SLURM Header

The SLURM header is a section of the script directly after the shebang. Every line of it begins with #SBATCH.

#SBATCH --nodes=1                 # Request 1 node
#SBATCH --partition=gpu           # Run in partition gpu
#SBATCH --job-name=minimal_gpu    # Name of the job in squeue
#SBATCH --gpus=1                  # Number of GPUs to reserve
#SBATCH --time=00:05:00           # Estimated runtime (dd-hh:mm:ss)
#SBATCH --account=lambem64_0000   # Project ID (check with rub-acclist)

This way the bash interpreter ignores these lines, but SLURM can pick them out and parse their contents. Each line contains one of the sbatch flags. On Elysium the flags --partition, --time, and --account are required. For GPU jobs the additional --gpus flag needs to be specified and must be at least 1.

Mandatory Flags

Flag                     | Example                  | Note
-------------------------|--------------------------|------
--partition=<partition>  | --partition=cpu          | list of partitions with sinfo
--time=<dd-hh:mm:ss>     | --time=00-02:30:00       | maximum time the job will run
--account=<account>      | --account=snublaew_0001  | project the used computing time shall be billed to; list of project accounts with rub-acclist
--gpus=<n>               | --gpus=1                 | number of GPUs; must be at least 1 for GPU partitions

Optional Flags

Flag                   | Example                       | Note
-----------------------|-------------------------------|------
--job-name=<name>      | --job-name="mysim"            | job name that is shown in squeue for the job
--exclusive            | --exclusive                   | nodes are not shared with other jobs (default on cpu, fat_cpu, gpu)
--output=<filename>    | --output=%x-%j.out            | filename to contain stdout (%x=job name, %j=job ID)
--error=<filename>     | --error=%x-%j.err             | filename to contain stderr (%x=job name, %j=job ID)
--mail-type=<TYPE>     | --mail-type=ALL               | notify user by email when certain event types occur; if specified, --mail-user needs to be set
--mail-user=<rub-mail> | --mail-user=max.muster@rub.de | address to which job notifications of type --mail-type are sent

Temporary Folder

If your code reads input or writes output, performance can strongly depend on where the data is located. If the data is in your home directory or on the Lustre file system, read/write performance is limited by the bandwidth of the interconnect. In addition, a parallel file system by design has problems with many small read/write operations; its performance shines when reading/writing big chunks. Thus it is advisable to create a folder on the local disk in the /tmp/ directory and perform all read/write operations there. At the beginning of the job, any input data is copied there in one go, and at the end all output data is copied from the /tmp/ directory to its final location in one go.

# obtain the current location
HDIR=$(pwd)

# create a temporary working directory on the node-local disk
WDIR=$(mktemp -d -p /tmp)
cd ${WDIR}

# copy the set of input files to the working directory
cp ${HDIR}/inputdata/* ${WDIR}

...

# copy the set of output files back to the original folder
cp outputdata ${HDIR}/outputs/

# tidy up the local working directory
cd ${HDIR}
rm -rf ${WDIR}

Loading Modules

If your program was built with certain versions of libraries, it may be required to provide the same libraries at runtime. Since everybody's needs regarding library versions are different, Elysium utilizes environment modules to manage software versions.

# unload all previously loaded modules
module purge

# show all modules that are available
module avail

# load a specific module
module load the_modules_name_and_version

# list all loaded modules
module list

Perform Calculation

How to perform your calculation strongly depends on your specific software and inputs. In general, there are a few typical ways to run HPC jobs, described below.

Farming

Farming jobs are used if the program is not parallelized, or scales in a way that it can only utilize a few CPU cores efficiently. In that case multiple instances of the same program are started, each with a different input; this works well as long as the instances have roughly the same runtime.

for irun in $(seq 1 ${stride} ${ncores})
do
    # The core count needs to start at 0 and go to ncores-1
    taskset -c $(bc <<< "${irun}-1") ${myexe} inp.${irun} > out.${irun} &
done
wait

Shared Memory

Programs that incorporate thread spawning (usually via OpenMP) can make use of multiple cores.

export OMP_NUM_THREADS=${SLURM_TASKS_PER_NODE}
${myexe} input

Distributed Memory

If programs require more resources than one node can provide, it is necessary to pass information between the processes running on different nodes. This is usually done via MPI. A program must be specifically written to utilize MPI.

ncorespernode=48
nnodes=${SLURM_JOB_NUM_NODES}
ncorestotal=$(bc <<< "${ncorespernode}*${nnodes}")
mpirun -np ${ncorestotal} -ppn ${ncorespernode} ${myexe} input

Hybrid Memory (Shared and Distributed Memory)

In programs that utilize distributed memory parallelization via MPI it is possible to spawn threads within each process to make use of shared memory parallelization.

nthreadsperproc=2
ncorespernode=$(bc <<< "48/${nthreadsperproc}")
nnodes=${SLURM_JOB_NUM_NODES}
ncorestotal=$(bc <<< "${ncorespernode}*${nnodes}")
export OMP_NUM_THREADS=${nthreadsperproc}
mpirun -np ${ncorestotal} -ppn ${ncorespernode} ${myexe} input

GPU

Support for offloading tasks to GPUs needs to be incorporated into the program.

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
${myexe} input

Examples

The following example scripts are ready to use on the Elysium cluster. The only change you need to make is to specify a valid account for the --account flag. You can use the rub-acclist command to get a list of your available project accounts. The executed programs do not produce any load and will finish in a few seconds. The generated output shows where each process/thread ran, and if it had access to a GPU.

Minimal CPU Job Script Example

Farming Job Script Example

Shared Memory Job Script Example

Distributed Memory Job Script Example

Hybrid Memory Job Script Example

Minimal GPU Job Script Example

GPU Job Script Example

Distributed Memory with GPU Job Script Example

FAQ

How Do I …?

Simply fill out the user access application form with your new SSH key and send it to hpc+applications@ruhr-uni-bochum.de. Already existing keys are not invalidated if you send in multiple keys. If your old key needs to be invalidated, please inform us immediately.

Simply edit the user list in your project application, save it under the exact same filename as before, and send it to hpc+applications@ruhr-uni-bochum.de.

Unfortunately it is not possible to delegate any of the applications to other people. However, you may fill out the application proposal, but the actual applicant has to send it in from their RUB email address, to prevent fraud.

If your job is the only job on the node (e.g. you specified the --exclusive flag), you may simply use ssh <nodename>. SSH connections to compute nodes are only permitted while you have a running job on them.

If your job shares the node with other jobs, you should use srun --pty --overlap --jobid=<jobid> /bin/bash, which will connect your terminal to the already running job. You will have access to exactly the same resources that your job allocated; thus it is not possible to confuse your resources with those of other jobs.

When you try to connect to the cluster the following error might occur:

$ ssh <username>@login1.elysium.hpc.ruhr-uni-bochum.de -i ~/.ssh/elysium 
<username>@login1.elysium.hpc.ruhr-uni-bochum.de: Permission denied (publickey,hostbased)

One possibility is that you are not a member of an HPC project. Please verify that your supervisor added you to one of their projects.

It might be that you are using the wrong key. Please verify that the specified key file (the one after the -i flag) contains the key you supplied with your user application. If you are using an SSH config entry please make sure that the IdentityFile path is set correctly.

If you verified that you are using the correct key please add the -vvv flag to your ssh command and send the output to hpc-helpdesk@ruhr-uni-bochum.de.

The following commands expect an SSH config entry as it is shown here.

Data can be copied to and from the cluster using scp or rsync. We strongly recommend rsync due to its many quality-of-life features.

# Copy local to cluster
rsync -r --progress --compress --bwlimit=10240 <local_source_path> login001.elysium.hpc.rub.de:<remote_destination_path>

# Copy cluster to local
rsync -r --progress --compress --bwlimit=10240 login001.elysium.hpc.rub.de:<remote_source_path> <local_destination_path>

The paths to the data to be copied and to the destination, as well as the username, need to be adjusted. Note that there is no trailing “/” at the end of the source path. If there were one, the directory's contents, not the directory itself, would be copied.

Flags:

  • -r enables recursive copies (directories and their content)
  • --progress gives you a live update about the amount that has been copied already and an estimate of the remaining time
  • --compress attempts to compress the data on the fly to speed up the data transfer even more
  • --bwlimit limits the data transfer rate in order to leave some bandwidth to other people who want to copy data, or work interactively.

If multiple files are to be copied to/from the cluster, the data should be packed into a tar archive before sending:

# create a tar archive
tar -cvf myfiles.tar dir_or_files.*

# extract a tar archive
tar -xvf myfiles.tar

Note that running multiple instances of rsync, or scp will not speed up the copy process, but slow it down even more!

Compute nodes can only connect to hosts in the university network by default, and for good reasons. Only the login nodes have internet access.

Please organize your computations in such a way that internet access is only required for preparation and postprocessing, i.e. before your computations start or after they end. For these purposes, internet access from the login nodes is sufficient.

If you absolutely must access hosts outside of the university network from a compute node, you can use the RUB WWW Proxy Cache: export https_proxy=https://www-cache.rub.de:443. However, make sure to use the cache responsibly, and keep in mind the following drawbacks:

  • Your computations depend on the availability of external network resources, which introduces the risk of job failure and therefore waste of resources.
  • The proxy cache may be bandwidth limited.
  • Network transfer times on compute nodes are fully billed in the FairShare system.

The rub-quota tool reports disk usage on both /home and /lustre.

According to the Terms of Use, publications must contain an acknowledgement if HPC resources were used. For example: “Calculations (or parts of them) for this publication were performed on the HPC cluster Elysium of the Ruhr University Bochum, subsidised by the DFG (INST 213/1055-1).”

More answers coming soon.

In the meantime, please see Help.

Help

Stay Informed

The following news channels are available:

  • The tier3-hpc mailing list is for general news about HPC@RUB. Subscribe if you are interested.
  • News for Elysium users and course announcements are available in the News section. You can subscribe to the RSS Feed or to the hpc-news mailing list to get these automatically.
  • Urgent messages regarding operations of Elysium will be sent via direct mail to all active users.

User Community

The HPC User room in the RUB Matrix Server is for exchange and collaboration between HPC users at RUB. Please join!

The HPC team is also present, but please note that you should still contact us at hpc-helpdesk@ruhr-uni-bochum.de if you need a problem solved.

Contact

The HPC@RUB cluster Elysium is operated by the HPC team at IT.SERVICES.

Contact us at hpc-helpdesk@ruhr-uni-bochum.de to ask questions or report problems.

Please include as much information as possible to help us identify the problem, e.g. LoginID, JobID, submission script, directory and file names etc., if applicable.

Courses at RUB

The HPC@RUB team offers two courses:

External Resources