News

HPC@RUB News - also available via RSS Feed or the hpc-news mailing list.

User Community Matrix Room

We created the HPC User room in the RUB Matrix Server. Please join! This room is for exchange and collaboration between HPC users at RUB.

The HPC team is also present, but please note that you should still contact us at hpc-helpdesk@ruhr-uni-bochum.de if you need a problem solved.

Elysium Launch Event

The launch event for the Elysium cluster took place yesterday, see the IT.SERVICES News.

Cluster Elysium is now open!

We are happy to announce that the Elysium cluster started operations and is now open for all researchers at RUB!

Upcoming HPC course

IT.SERVICES is offering the course “Introduction to High Performance Computing” again.

It will take place on December 11th, 2024 in IA 0/65 in a hybrid format. There will be a lecture with interactive exercises on an HPC system. For the exercises we will provide 15 accounts. The accounts will be given to the first 15 in-person participants. Note that we cannot give any accounts to remote participants. Please register in the Moodle course to participate. If you are unsure whether you need the course, you can take the quiz provided in the Moodle course to assess your knowledge of HPC.

For successful participation in the course, solid basic knowledge of Linux systems, especially the BASH shell, is required. We also offer courses for this (Introduction to Linux: November 27th, 9:00-12:30, Registration Linux).

Introduction to HPC:

  • Prerequisite: Solid basic knowledge of Linux
  • When and Where: December 11th, 2024, 9:00-13:00 in IA 0/65
  • Registration HPC
  • Zoom Link: Will be provided one day in advance to all registered participants.

Upcoming Introduction to the Tier-3 System

IT.SERVICES is offering two introductions to the Tier-3 HPC system via Zoom, providing an overview of the system and access possibilities.

Topics:

  • Introduction to the Advisory Board and HPC Team
  • Overview of node types and storage systems
  • Partitions/Queues
  • Accounting structure and Fair-Share
  • Computing time and access requests
  • Login procedure

Dates:

  • Tue, 05.11.2024, 14:00
  • Fri, 08.11.2024, 14:00

If you are interested, please contact us to get the Zoom link.

If you are unable to attend the offered dates, please let us know so that we can offer an alternative date.

Upcoming Linux and HPC courses

IT.SERVICES is offering two training courses with limited participant numbers. Please register for the Moodle courses. If you’re unsure whether you need the course, you can take the quiz provided in the Moodle course to assess your knowledge of Linux or HPC.

If you find that you don’t need the course or can’t attend, please deregister to allow others to take one of the limited spots.

  • Introduction to Linux:

    • Next dates:
      • 13.11.2024 9:00-12:30 in IA 1/83
      • 27.11.2024 9:00-12:30 in IA 1/83
    • Registration
  • Introduction to HPC:

    • Prerequisite: Solid basic knowledge of Linux
    • Next dates:
      • 06.11.2024 9:00-15:00 in IA 1/83
      • 21.11.2024 9:00-15:00 in IA 1/83
    • Registration

Regulations

Governing Structure

The HPC@RUB cluster Elysium is operated by the HPC team at IT.SERVICES.

The governing structure of HPC@RUB is defined in the Terms of Use.

The HPC Advisory Board consists of five elected RUB scientists and two IT.SERVICES employees.

The five members of the current HPC Advisory Board, elected on April 18 2024, are:

  • Prof. Ralf Drautz (speaker)
  • Prof. Jörg Behler
  • Prof. Sen Cheng
  • Prof. Markus Stricker
  • Prof. Andreas Vogel

FairShare

One of the main tasks of the HPC Advisory Board is to allocate a so-called FairShare of the HPC resources to Faculties, Research Centres, and Research Departments. Part of the FairShare is always reserved for scientists whose facility does not have its own allocated FairShare, so that the HPC resources are open to every scientist at RUB.

The FairShare is a percentage that determines how much of the resources is available to a given facility on average. A facility with a 10% FairShare can use 10% of the cluster 24/7 on average. If it uses less, others can make use of the free resources, and the priority of the facility to get the next job to run on the cluster will grow. If it uses more (because others don’t make full use of their FairShare), its priority will shrink accordingly. FairShare usage tracking decays over time, so that it is not possible to save up FairShare for nine months and then occupy the full cluster for a full month.

Within a given facility, all scientists that are HPC project managers share its FairShare. All HPC projects share the FairShare of their manager. Finally, all HPC users share the FairShare of their assigned project. This results in the FairShare tree that has become the standard way of managing HPC resources.

Project Management

HPC resources are managed based on projects to which individual users are assigned. The purpose of the projects is to keep an account of resource usage based on the FairShare of project managers within the FairShare of their facility.

Professors and group leaders can apply to become project managers; see the Terms of Use for details.

A project manager may apply for projects, and is responsible for compliance with all rules and regulations. Projects will be granted after a basic plausibility check; there is no review process, and access to resources is granted solely based on the FairShare principle, not based on competing project applications.

Users need to apply for access to the system, but access is only active if the user is currently assigned to at least one active project by a project manager.

Resources at RUB

HPC Cluster Elysium

Node Specifications

Type      Count  CPU                         Memory   Local NVMe Storage  GPU
Thin-CPU  284    2x AMD EPYC 9254 (24 core)  384 GB   960 GB              -
Fat-CPU   13     2x AMD EPYC 9454 (48 core)  2304 GB  1.92 TB             -
Thin-GPU  20     2x AMD EPYC 9254 (24 core)  384 GB   1.92 TB             3x NVIDIA A30 Tensor Core GPU, 24 GB, 933 GB/s
Fat-GPU   7      2x AMD EPYC 9454 (48 core)  1152 GB  1.92 TB + 15.36 TB  8x NVIDIA H100 SXM5 GPU, 80 GB, 3.35 TB/s

File Systems

The following file systems are available:

  • /home: For your software and scripts. High availability, but no backup. Quota: 50 GB per user.
  • /lustre: Parallel file system to use for your jobs. High availability, but no backup. Not for long-term storage. Quotas: 1 TB and 1,000,000 files per user.
  • /tmp: Fast storage on each node for temporary data. Limited in space, except for Fat-GPU nodes where multiple TB are available. Data is removed when the job ends.
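
To check how close you are to these quotas, you can use the following commands; this is a sketch assuming the standard Lustre client tool lfs is available on the login nodes and that the quota reporting on Elysium follows the usual Lustre setup:

# show the current size of your home directory (quota: 50 GB)
du -sh ~

# show your usage and quotas on the Lustre file system
lfs quota -h -u $USER /lustre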

Partition Overview

Two partitions are available for each type of compute node: the filler partitions are designed for short jobs, while the standard partitions support longer-running tasks.

Jobs in a filler partition have a lower priority and will only start if no job from the corresponding standard partition is requesting the resources. Running jobs in a filler partition costs only a fraction of the FairShare cost of the standard partition.

The vis partition is special since the visualization nodes are intended for interactive use.

Partition       Timelimit    Nodelist         Max Tasks per Node  Share-Cost²
cpu             2-00:00:00¹  cpu[001-284]     48                  1.000 / core
cpu_filler      3:00:00      cpu[001-284]     48                  0.050 / core
fat_cpu         2-00:00:00   fatcpu[001-013]  96                  1.347 / core
fat_cpu_filler  3:00:00      fatcpu[001-013]  96                  0.067 / core
gpu             2-00:00:00   gpu[001-020]     48                  1.000 / core, 49.374 / GPU
gpu_filler      1:00:00      gpu[001-020]     48                  1.000 / core, 12.344 / GPU
fat_gpu         2-00:00:00   fatgpu[001-007]  96                  1.000 / core, 169.867 / GPU
fat_gpu_filler  1:00:00      fatgpu[001-007]  96                  1.000 / core, 49.217 / GPU
vis             1-00:00:00   vis[001-003]     -                   2.000 / core, 29.401 / GPU

¹ Times of up to 7 days are possible on this partition but not recommended. Only 2 days are guaranteed, jobs running longer than that may get cancelled if that becomes necessary for important maintenance work.

² Cost does not refer to money, but to the factor of computing time that is added to a project's used share in order to compute job priorities. The costs are based on the relative monetary costs of the underlying hardware.

Resources Elsewhere

HPC Pyramid

The HPC resources in Germany are arranged hierarchically in the so-called HPC pyramid.

[Figure: The HPC pyramid]

If suitable for your needs, use the local resources provided by the tier-3 centre. If you need more resources than the local centre can provide, or your project requires specialized hardware, you are welcome to contact another HPC centre or request computing time at a higher tier (tier-2 or tier-1).

State-wide Computing Resources (Tier-2, Tier-3)

Several state-wide tier-2 centres (NHR centres) are available to cater for specialized computing and/or storage requirements. In North Rhine-Westphalia, the RWTH Aachen, the University of Cologne and the University of Paderborn offer structured access to HPC users from NRW institutes.

National and EU-wide HPC-Resources (Tier-1 and Tier-0)

For extremely complex and data-intensive requirements, HPC resources of the highest tier are available in Germany and the EU. Computing time is only allocated after a technical and scientific peer review process (GCS, PRACE).

Access to HPC Resources elsewhere

We would be happy to advise you on the suitability of your projects as well as provide help with the application process for all levels of the HPC pyramid. Please contact us.

HPC.NRW

The Ruhr-Universität Bochum is part of the North Rhine-Westphalian Competence Network for High Performance Computing HPC.NRW. This network offers a first point of contact and a central advisory hub with a broad knowledge base for HPC users in NRW. In addition, the tier-2 centres offer uniform, structured access for HPC users of all universities in NRW, ensuring that basic services are provided for locations without tier-3 centres and for Universities of Applied Sciences.

Within the framework of HPC.NRW, a network of thematic clusters has been created to provide low-threshold training, consulting, and coaching services. The aim is to enable effective and efficient use of high-performance computing and storage facilities, to support researchers at all levels, and to present the existing resources and services of the state in a transparent way.

Access

Access to Elysium is granted based on HPC project applications. If you do scientific work at RUB, you are eligible for access; see the Terms of Use for details.

If you need a user account to log in to Elysium, go here: Get User Access. Note that access will only be active if you are assigned to at least one HPC project by a project manager!

If you are already an HPC project manager and would like to apply for a new project, go here: Apply for a HPC Project.

If you are a professor or research group leader looking to apply for computing resources on Elysium, go here: Become Project Manager.

Get User Access

In order to get a user account on Elysium, you need to download and fill out the user access application form. Note that you can only log in to the cluster after you have been assigned to at least one project by a project manager.

[Screenshot: HPC user access application form]

RUB LoginID: This is your RUB-LoginID. You need to activate two-factor authentication for it.

SSH Key (Pub): Your public (not private!) SSH key. This key must be your own; sharing keys with others, e.g. members of your work group, is not allowed. You can generate an SSH key pair with the following command:

ssh-keygen -t ed25519 -f ~/.ssh/elysium -N "passphrase"

Then enter the contents of the file ~/.ssh/elysium.pub in the field. The “passphrase” should of course be changed to an appropriately complex password that prevents malicious usage of your key.

Note that RSA keys must have at least 3000 bits in accordance with BSI regulations. We recommend ED25519 keys, as shown in the example above.
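
To paste the key into the form, you can print the public key file in the terminal and copy its single line:

# display the public key (this, not the private key, goes into the form)
cat ~/.ssh/elysium.pub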

After you have filled out the form correctly, save the PDF as User-Access-Application_<loginID>.pdf, e.g. User-Access-Application_mamuster.pdf, and send it via email to hpc+applications@ruhr-uni-bochum.de.

Apply for a HPC Project

After you have been approved as a project manager, you can apply for research projects on Elysium by downloading and filling out the project application form. The application is required for managing the fair share of computing resources and for reporting to the HPC Advisory Board, funding organizations, etc. There is no peer-review process and your project is accepted automatically. Note that more projects do not mean more computing resources: your personal resources are shared between all your projects.

[Screenshot: HPC project application form]

RUB LoginID: This is your RUB-LoginID.

Project Name: Name under which your project should be listed in any report.

Abstract: A short 2-3 line abstract outlining the contents of your research project.

Field of Science: Identification number according to DFG subject classification system.

Third-party funded: Optional field for research projects with third-party funding, e.g. the funding institution or project number.

RUB LoginIDs: Comma-separated list of the RUB-LoginIDs of the people you want to participate in this project. (Note that the project manager's LoginID is not automatically included in the participant list. If you, as a project manager, want to participate in your own project, you must enter your LoginID as well.)

Contingent: From the dropdown menu, select the computing resources contingent of your project. If you did not participate in the HPC cluster application, you need to select “Miscellaneous User Groups” or get permission from the members of the other contingents to use their resources.

After you have filled out the form correctly, save the PDF as Project-Application_<loginID>_<number>.pdf, e.g. Project-Application_mamuster_5.pdf, where the number 5 refers to the fifth project application sent in by you, and send it via email to hpc+applications@ruhr-uni-bochum.de.

Become Project Manager

In order to get computing resources on Elysium and be able to manage projects you need to download and fill out the project manager application form and sign the compliance to export control regulations form.

Please note that only professors and independent group leaders at the Ruhr-University are eligible to become project managers! See the regulations for details.

[Screenshot: HPC project manager application form]

Group Name: The name of your work group. E.g. “Chair for constructive demolition techniques”, or “Computational analysis of Sumerian poetry”, …

Faculty/Institute: The name of the faculty your group is located in. E.g. “Faculty of Mathematics”, “ICAMS”, or “Universitätsklinikum Josefs-Hospital”

RUB LoginID: This is your RUB-LoginID.

Email: Your RUB email address (not your institute email address!), e.g. max.muster@ruhr-uni-bochum.de

Signed export regulations: After you have read, understood, and signed the compliance to export control regulations form linked above, check this box.

After you have filled out the form correctly, save the PDF as Projectmanager-Application_<loginID>.pdf, e.g. Projectmanager-Application_mamuster.pdf. Save a scan of the signed export control regulations form as Compliance-Export-Control-Regulations_<loginID>.pdf, e.g. Compliance-Export-Control-Regulations_mamuster.pdf. Send the application form and the scan of your signed export control regulations form via email to hpc+applications@ruhr-uni-bochum.de.

Documentation

Elysium is the central HPC Cluster at the Ruhr-Universität Bochum. See the overview of its resources.

To use Elysium, you need to apply for access; see the Access section for the necessary steps.

Please read about the basic concept of using Elysium first.

The login process combines SSH key-based authentication with web-based two-factor authentication.

After login, you can use available software modules or build your own software.

Read about submitting jobs and allocating resources in the SLURM section.

Subsections of Documentation

Basics

Elysium provides four login nodes: login1.elysium.hpc.rub.de, …, login4.elysium.hpc.rub.de. These are your entry points to the cluster.

After login, you typically use them to prepare your software and copy your data to the appropriate locations.

You can then allocate resources on the cluster using the Slurm workload manager.

After submitting your request, Slurm will grant you the resources as soon as they are free and your priority is higher than the priority of other jobs that might be waiting for some of the same resources.

Your priority depends on your waiting time and your remaining FairShare.
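
If you want to see how these factors add up for your own pending jobs, Slurm's sprio command lists the individual priority components; note that the exact weighting of the factors is part of the cluster configuration:

# show the priority factors of your own pending jobs
sprio -u $USER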

Login

Login to Elysium combines the common SSH key-based authentication with web-based two-factor authentication. You need to enable two-factor authentication for your RUB LoginID at rub.de/login.

The additional web-based authentication is cached for 14 hours so that you typically only have to do it once per work day, per login node, and per IP address you connect from. After that, your normal key-based SSH workflow will work as expected.

Follow these steps:

  1. Start ssh with the correct private key, your RUB LoginID, and one of the four login hosts, e.g.
    ssh -i ~/.ssh/elysium LOGINID@login1.elysium.hpc.ruhr-uni-bochum.de.
    Available login nodes are login1 to login4. [Screenshot: Login step 1: start ssh]

  2. Open the URL in a browser (or scan the QR code with your smartphone) to start web-based two-factor authentication. [Screenshot: Login step 2, part 1: start web-based authentication]

  3. Enter the second factor for two-factor authentication. [Screenshot: Login step 2, part 2: web-based LoginID / password authentication]

  4. After successful login, you get a four-digit verification code. [Screenshot: Login step 2, part 3: get the verification code]

  5. Enter this code at your ssh prompt to finish the login. [Screenshot: Login step 3: verify the SSH session]

For the next 14 hours, only step 1 (classic key-based authentication) will be necessary on the chosen login node for the IP address you connected from.
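
If you connect often, you can store the connection details in your ~/.ssh/config so that a short ssh elysium1 is sufficient. The host alias below is just an example; adjust the LoginID and key path to your setup:

# ~/.ssh/config (the alias "elysium1" and LOGINID are placeholders)
Host elysium1
    HostName login1.elysium.hpc.ruhr-uni-bochum.de
    User LOGINID
    IdentityFile ~/.ssh/elysium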

Login will fail if:

  • You use the wrong private key (“Permission denied (publickey)”)
  • You are not a member of an active HPC project (“Permission denied (publickey)”)
  • You did not enable two-factor authentication for your LoginID (“Two-factor authentication is required”)
  • Web-based login fails
  • You enter the wrong verification code (“Verification failed”)
  • A timeout happens between starting the SSH session and finalizing web-based login (“session_id not found”); just start the process again to get a new session ID.

Software

We provide a basic set of toolchains and some common libraries via modules.

To build common HPC software packages, we provide a central installation of the Spack package manager.

Subsections of Software

Modules

We use the Lmod module system:

  • module available (shortcut ml av) lists available modules
  • module load loads selected modules
  • module list shows modules currently loaded in your environment

There are also hidden modules that are generally less relevant to users but can be viewed with ml --show_hidden av.
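
A typical session might look like the following; the module name and version are only illustrative, check ml av to see what is actually installed:

# list available modules
ml av

# load a specific module (name/version are examples)
module load gcc/13.2.0

# show what is currently loaded
module list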

We are committed to providing the tools you need for your research and development efforts. If you require modules that are not listed here or need different versions, please contact our support team, and we will be happy to assist you.

Compilers

  • GCC 11.4.1, default on the system.
  • GCC 13.2.0.
  • AOCC 4.2.0, AMD Optimizing C/C++ Compiler
  • Intel OneAPI Compilers: 2024.1.0
  • Intel Classic Compilers: 2021.10.0
  • NVHPC 24.7, NVIDIA HPC Compilers

MPI Libraries

  • OpenMPI 4.1.6
  • OpenMPI 5.0.3 (default).
  • MPICH 4.2.1
  • Intel OneAPI MPI 2021.12.1.

Mathematical Libraries

  • AMD Math Libraries:
    • AMD BLIS, BLAS-like libraries.
    • AMD FFTW, a fast Fourier transform library.
    • AMD libFLAME, a library for dense matrix computations (LAPACK).
    • AMD ScaLAPACK.
  • HDF5: Version 1.14.3 (built with MPI)
  • Boost 1.85.0

Programming Languages

  • Julia 1.10.2, a high-level, high-performance dynamic programming language.
  • R 4.4.0, for statistical computing.
  • Python 3.11.7

Tools and Utilities

  • CUDA Toolkit 12.6.1
  • GDB (GNU Debugger)
  • Apptainer

Spack

We use the Spack package manager to build a set of common HPC software packages.

This section describes how to use and extend the central installation.

Alternatively, you can use a full independent Spack installation in your home directory, or use EasyBuild.

Using the Central Installation

Activate the central Spack installation with source /cluster/spack/0.22.2/share/spack/setup-env.sh.

You can use this as a starting point for your own software builds, without the need to rebuild everything from scratch.

For this purpose, add the following three files to ~/.spack:

  • ~/.spack/upstreams.yaml
    upstreams:
      central-spack:
        install_tree: /cluster/spack/0.22.2/opt/spack
  • ~/.spack/config.yaml
    config:
      install_tree:
        root: $HOME/spack/opt/spack
      source_cache: $HOME/spack/cache
  • ~/.spack/modules.yaml
    modules:
      default:
        roots:
          lmod: $HOME/spack/share/spack/lmod
        enable: [lmod]
        lmod:
          all:
            autoload: direct
          hide_implicits: true
          hierarchy: []

Also, put these two lines into your ~/.bashrc:

export MODULEPATH=$MODULEPATH:$HOME/spack/share/spack/lmod/linux-almalinux9-x86_64/Core
. /cluster/spack/0.22.2/share/spack/setup-env.sh

You can then use the central Spack installation, with local additions added in ~/spack.

Run spack compiler find to add the system compiler to your compiler list. Also run it after loading other compilers via module load to add those, too.
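
A typical workflow with the central installation could then look like the following sketch; the package and compiler spec are only illustrative, and the available versions depend on the central installation:

# make the system compiler (and any module-loaded compilers) known to Spack
spack compiler find

# inspect what would be built before installing
spack spec fftw %gcc@13.2.0

# build into $HOME/spack, reusing packages from the central installation
spack install fftw %gcc@13.2.0

# regenerate your personal Lmod module files
spack module lmod refresh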

Overriding Package Definitions

If you need to override selected package definitions, create an additional file ~/.spack/repos.yaml:

repos:
  - $HOME/spack/var/spack/repos

and create a description for your local repo in ~/spack/var/spack/repos/repo.yaml:

repo:
  namespace: overrides

You can then copy the package definition you need to override, and edit it locally. Example for ffmpeg:

$ cd ~/spack/var/spack/repos/
$ mkdir -p packages/ffmpeg
$ cp /cluster/spack/0.22.2/var/spack/repos/builtin/packages/ffmpeg/package.py packages/ffmpeg
$ vim packages/ffmpeg/package.py
... edit as necessary (e.g. disable the patch for version 6.1.1) ...

When running spack install ffmpeg, your local override will take precedence over the central version.

SLURM

The Elysium HPC system utilizes SLURM as its resource manager, scheduler, and accounting system in order to guarantee a fair share of the computing resources.

If you are looking for technical details regarding the usage and underlying mechanisms of SLURM we recommend participating in the Introduction to HPC training course.

Examples of job scripts for different job types that are tailored to the Elysium cluster can be found in the Training Section.

List of Partitions

All nodes in the Elysium cluster are grouped by their hardware type and job submission type. This way, users can request specific computing hardware, and multi-node jobs are guaranteed to run on nodes with the same setup.

In order to get a list of the available partitions, their current state, and available nodes, the sinfo command can be used.

[login_id@login001 ~]$ sinfo
PARTITION      AVAIL  TIMELIMIT  NODES  STATE NODELIST
cpu               up 7-00:00:00      4  alloc cpu[033-034,037-038]
cpu               up 7-00:00:00    280   idle cpu[001-032,035-036,039-284]
cpu_filler        up    3:00:00      4  alloc cpu[033-034,037-038]
cpu_filler        up    3:00:00    280   idle cpu[001-032,035-036,039-284]
fat_cpu           up 2-00:00:00     13   idle fatcpu[001-013]
fat_cpu_filler    up    3:00:00     13   idle fatcpu[001-013]
gpu               up 2-00:00:00     20   idle gpu[001-020]
gpu_filler        up    1:00:00     20   idle gpu[001-020]
fat_gpu           up 2-00:00:00      1 drain* fatgpu005
fat_gpu           up 2-00:00:00      5    mix fatgpu[001,003-004,006-007]
fat_gpu           up 2-00:00:00      1   idle fatgpu002
fat_gpu_filler    up    1:00:00      1 drain* fatgpu005
fat_gpu_filler    up    1:00:00      5    mix fatgpu[001,003-004,006-007]
fat_gpu_filler    up    1:00:00      1   idle fatgpu002
vis               up 1-00:00:00      3   idle vis[001-003]

Requesting Nodes of a Partition

SLURM provides two commands to request resources. srun is used to start an interactive session.

[login_id@login001 ~]$ srun --partition=cpu --job-name=test --time=00:05:00 --account=testproj_0000 --pty bash
[login_id@cpu001 ~]$

sbatch is used to request resources that will execute a job script.

[login_id@login001 ~]$ sbatch --partition=cpu --job-name=test --time=00:05:00 --account=testproj_0000 myscript.sh
Submitted batch job 10290

For sbatch, the submission flags can also be incorporated into the job script itself, as shown below. More information about job scripts and the required and optional flags can be found in the Training/SLURM Header section.
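
A minimal sketch of such a script could look like this; partition, account, and the actual workload are placeholders that need to be adjusted to your project:

#!/bin/bash
#SBATCH --partition=cpu           # partition to run in
#SBATCH --time=00:05:00           # estimated runtime
#SBATCH --account=testproj_0000   # your project account (see rub-acclist)
#SBATCH --job-name=test           # name shown in squeue

# the actual workload goes here
srun hostname

It is then submitted with sbatch myscript.sh; flags given on the sbatch command line override the values in the header.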

Use spredict myscript.sh to estimate the start time of your job.

List of Currently Running and Pending Jobs

If requested resources are currently not available, jobs are queued and will start as soon as the resources become available again. To check which jobs are currently running and which ones are pending, and for what reason, the squeue command can be used. For privacy reasons, only the user's own jobs are displayed.

[login_id@login001 ~]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             10290       cpu     test login_id  R       2:51      1 cpu001

List of Computing Resource Shares

Users, projects, groups, and institutes are billed for the computing resources they use. To check how many resources a user is entitled to and how many they have already used, the sshare command is used. For privacy reasons, only the user's own shares are displayed.

[login_id@login001 ~]$ sshare
Account                    User  RawShares  NormShares    RawUsage  EffectvUsage  FairShare
-------------------- ---------- ---------- ----------- ----------- ------------- ----------
testproj_0000          login_id       1000    0.166667    20450435      0.163985   0.681818

List of Project Accounts

For technical reasons, the projects on Elysium have rather cryptic names based on the LoginID of the project manager and a number. To make it easier to select a project account for the --account flag of srun or sbatch, and to check the share and usage of projects, the RUB-exclusive rub-acclist command can be used.

[login_id@login001 ~]$ rub-acclist
Project ID    | Project Description
--------------+--------------------------------------------------
testproj_0000 | The fundamental interconnectedness of all things
testproj_0001 | The translated quaternion for optimal pivoting

Visualization

We provide visualization via VirtualGL on the visualization nodes of elysium.hpc.ruhr-uni-bochum.de.

Requirements:

An X11 server with 24-bit or 32-bit visuals and VirtualGL version > 3.0.2 installed.

You can check support for your operating system at https://virtualgl.org/Documentation/OSSupport and download VirtualGL at https://github.com/VirtualGL/virtualgl/releases.

To use VirtualGL on Elysium, you only need the VirtualGL client; it is not necessary to configure a VirtualGL server.

Resource allocation:

Allocate resources in the vis partition.

salloc -p vis -N1 --time=02:00:00 --account=$ACCOUNT

This will allocate one entire vis node for 2 hours. Wait until a slot in the vis partition is available. You can check whether your resources have been allocated using the squeue command.

Establish the VirtualGL connection:

Connect directly from your computer to the visualization node via ssh with vglconnect -s, using one of the login servers login[001-004] as a jump host.

vglconnect -s $HPCUSER@vis001.elysium.hpc.rub.de -J $HPCUSER@login001.elysium.hpc.rub.de

If you don’t like long commands, you can configure one of the login nodes as jump host in your ~/.ssh/config for the vis[001-003] hosts. The command vglconnect -s accepts nearly the same syntax as ssh.
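
A minimal sketch for such a ~/.ssh/config entry; the LoginID and key path are placeholders:

# ~/.ssh/config (LOGINID and the key path are placeholders)
Host vis*.elysium.hpc.rub.de
    User LOGINID
    IdentityFile ~/.ssh/elysium
    ProxyJump LOGINID@login001.elysium.hpc.rub.de

Since vglconnect wraps ssh, it should pick up the jump host from this configuration automatically, so the -J option can be omitted.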

Run your Software:

Load a module if required. Start your application using vglrun, and remember to use useful command line options like -fps.

module load vmd
vglrun +pr -fps 60 vmd

Please remember to cancel the resource allocation once you are done with your interactive session.

scancel $jobID

Training

Usage of HPC resources differs significantly from handling a regular desktop computer. In order to help people get started we provide two training courses.

In addition, we recommend online resources. We also provide a variety of example job scripts tailored to Elysium to get you started with your research.

Introduction to Linux

Why Linux?

Linux-based operating systems are the de facto standard for HPC systems. Thus, it is vital to have a solid understanding of how to work with Linux.

Linux Introductory Training

We offer an in-person course that combines a lecture and interactive exercises. The course covers the following topics:

  1. Why Linux?
  2. Directory Structure
  3. The Terminal
  4. Navigating the Directory Structure
  5. Modifying the Directory Structure
  6. Handling Files
  7. Permission Denied
  8. Editing Files in the Terminal
  9. Workflow and Pipelines
  10. Automation and Scripting
  11. Environment Variables
  12. Monitoring System Resources

Registration

Dates for the courses are announced via the tier3-hpc mailing list. Registration for the next course can be done via Moodle. We expect everyone who registered to participate in the next course. If you change your mind about participating, please deregister from the course to free one of the limited spots for others.

Do I Need This Course?

If you are already proficient in the topics listed above you may skip the course. It is not a requirement to get access to the cluster. In the Moodle course we provide a quiz where you can check your proficiency with Linux.

Slides

Here you may download the slides for the course: Introduction to Linux.

Introduction to HPC

Why HPC Training?

Usage of HPC resources differs significantly from handling a regular desktop computer. Thus it is vital to have a solid understanding of how to work with HPC systems.

HPC Introductory Training

We offer an in-person course that combines a lecture and interactive exercises. The course covers the following topics:

  1. What is High Performance Computing
  2. HPC-Cluster Components
  3. How to Access a Cluster?
  4. SLURM - Requesting Resources
  5. SLURM - How Resources are Scheduled
  6. SLURM - Accounting and Sharing of Compute Time
  7. Environment Modules
  8. Parallelization Models
  9. Scaling of Parallel Applications
  10. Code of Conduct

Registration

Dates for the courses are announced via the tier3-hpc mailing list. Registration for the next course can be done via Moodle. We expect everyone who registered to participate in the next course. If you change your mind about participating, please deregister from the course to free one of the limited spots for others.

Do I Need This Course?

If you are already proficient in the topics listed above you may skip the course. It is not a requirement to get access to the cluster. In the Moodle course we provide a quiz where you can check your proficiency with HPC systems.

Slides

Here you may download the slides for the course: Introduction to HPC.

Online Resources

Online Resources

If you do not want to or cannot participate in the training courses, but still want to learn about Linux and HPC, here we provide a list of a few online resources. Note that those materials might not reflect specifics regarding the hardware or environment of the RUB cluster Elysium.

Job Scripts

Jump to Example Scripts

A SLURM job script usually consists of the following steps:

  1. SLURM Header
  2. Create temporary folder on local disk
  3. Copy input data to temporary folder
  4. Load required modules
  5. Perform actual calculation
  6. Copy output file back to global file system
  7. Tidy up local storage

SLURM Header

The SLURM header is the section of the script directly after the shebang. Every line begins with #SBATCH.

#SBATCH --nodes=1                 # Request 1 node
#SBATCH --partition=gpu           # Run in partition gpu
#SBATCH --job-name=minimal_gpu    # name of the job in squeue
#SBATCH --gpus=1                  # number of GPUs to reserve
#SBATCH --time=00:05:00           # estimated runtime (dd-hh:mm:ss)
#SBATCH --account=lambem64_0000   # Project ID (check with rub-acclist)

This way, the bash interpreter ignores these lines, but SLURM can pick them out and parse their contents. Each line contains one of the sbatch flags. On Elysium, the flags --partition, --time, and --account are required. For GPU jobs, the additional --gpus flag needs to be specified and must be at least 1.

Mandatory Flags

Flag                     Example                  Note
--partition=<partition>  --partition=cpu          list of partitions with sinfo
--time=<dd-hh:mm:ss>     --time=00-02:30:00       maximum time the job will run
--account=<account>      --account=snublaew_0001  project the used computing time is billed to; list of project accounts with rub-acclist
--gpus=<n>               --gpus=1                 number of GPUs; must be at least 1 for GPU partitions

Optional Flags

Flag                    Example                        Note
--job-name=<name>       --job-name="mysim"             job name that is shown in squeue for the job
--exclusive             --exclusive                    nodes are not shared with other jobs (default on cpu, fat_cpu, gpu)
--output=<filename>     --output=%x-%j.out             file to contain stdout (%x = job name, %j = job ID)
--error=<filename>      --error=%x-%j.err              file to contain stderr (%x = job name, %j = job ID)
--mail-type=<TYPE>      --mail-type=ALL                notify the user by email when certain event types occur; if specified, --mail-user needs to be set
--mail-user=<rub-mail>  --mail-user=max.muster@rub.de  address to which job notifications of type --mail-type are sent

Temporary Folder

If your code reads input or writes output, the performance can strongly depend on where the data is located. If the data is in your home directory or on the Lustre file system, the read/write performance is limited by the bandwidth of the interconnect. In addition, a parallel file system by design has problems with many small read/write operations; its performance shines when reading/writing big chunks. It is therefore advisable to create a folder on the local disks in the /tmp/ directory and perform all read/write operations there. At the beginning of the job, any input data is copied there in one go, and at the end all output data is copied from the /tmp/ directory to its final location in one go.

# obtain the current location
HDIR=$(pwd)

# create a job-specific temporary working directory on the node
WDIR=$(mktemp -d /tmp/${SLURM_JOB_ID}.XXXXXX)
cd ${WDIR}

# copy the set of input files to the working directory
cp ${HDIR}/inputdata/* ${WDIR}

...

# copy the set of output files back to the original folder
cp outputdata ${HDIR}/outputs/

# tidy up local files
rm -rf ${WDIR}

Loading Modules

If your program was built with certain versions of libraries, it may be necessary to provide the same libraries at runtime. Since everybody's needs regarding library versions differ, Elysium utilizes environment modules to manage software versions.

# unload all previously loaded modules
module purge

# show all modules that are available
module avail

# load a specific module
module load the_modules_name_and_version

# list all loaded modules
module list

Perform Calculation

How you perform your calculation strongly depends on your specific software and inputs. In general, there are a few typical ways to run HPC jobs, described below.

Farming

Farming jobs are used if the program is not parallelized, or scales in a way that it can only utilize a few CPU cores efficiently. In that case, multiple instances of the same program are started, each with a different input; this works best as long as the instances have roughly the same runtime.

for irun in $(seq 1 ${stride} ${ncores})
do
    # core numbering starts at 0 and goes to ncores-1
    taskset -c $((irun-1)) ${myexe} inp.${irun} > out.${irun} &
done
# wait for all background instances to finish
wait

Shared Memory

Programs that incorporate thread spawning (usually via OpenMP) can make use of multiple cores.

export OMP_NUM_THREADS=${SLURM_TASKS_PER_NODE}
${myexe} input

Distributed Memory

If programs require more resources than can be provided by one node it is necessary to pass information between the processes running on different nodes. This is usually done via the MPI protocol. A program must be specifically programmed to utilize MPI.

ncorespernode=48
nnodes=${SLURM_JOB_NUM_NODES}
ncorestotal=$(bc <<< "${ncorespernode}*${nnodes}")
mpirun -np ${ncorestotal} -ppn ${ncorespernode} ${myexe} input

Hybrid Memory (Shared and Distributed Memory)

In programs that utilize distributed memory parallelization via MPI it is possible to spawn threads within each process to make use of shared memory parallelization.

nthreadsperproc=2
ncorespernode=$(bc <<< "48/${nthreadsperproc}")
nnodes=${SLURM_JOB_NUM_NODES}
ncorestotal=$(bc <<< "${ncorespernode}*${nnodes}")
export OMP_NUM_THREADS=${nthreadsperproc}
mpirun -np ${ncorestotal} -ppn ${ncorespernode} ${myexe} input

GPU

Support for offloading tasks to GPUs needs to be incorporated into the program.

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
${myexe} input

Examples

The following example scripts are ready to use on the Elysium cluster. The only change you need to make is to specify a valid account for the --account flag. You can use the rub-acclist command to get a list of your available project accounts. The executed programs do not produce any load and will finish in a few seconds. The generated output shows where each process/thread ran, and if it had access to a GPU.

Minimal CPU Job Script Example

Farming Job Script Example

Shared Memory Job Script Example

Distributed Memory Job Script Example

Hybrid Memory Job Script Example

Minimal GPU Job Script Example

GPU Job Script Example

Distributed Memory with GPU Job Script Example

FAQ

How Do I …?

… Transfer Data To/From The Cluster?

Data can be copied to and from the cluster using scp or rsync. We strongly recommend rsync due to its many quality-of-life features.

# Copy local to cluster
rsync -r --progress --compress --bwlimit=10240 <local_source_path> <username>@login001.elysium.hpc.rub.de:<remote_destination_path>

# Copy cluster to local
rsync -r --progress --compress --bwlimit=10240 <username>@login001.elysium.hpc.rub.de:<remote_source_path> <local_destination_path>

The paths to the data that shall be copied and to the destination, as well as the username, need to be adjusted. Note that there is no trailing “/” at the end of the source path. If there were one, the directory's contents, not the directory itself, would be copied.

Flags:

  • -r enables recursive copies (directories and their content)
  • --progress gives you a live update about the amount that has been copied already and an estimate of the remaining time
  • --compress attempts to compress the data on the fly to speed up the data transfer even more
  • --bwlimit limits the data transfer rate in order to leave some bandwidth to other people who want to copy data, or work interactively.

If multiple files are to be copied to/from the cluster, the data should be packed into a tar archive before transferring:

# create a tar archive
tar -cvf myfiles.tar dir_or_files.*

# extract a tar archive
tar -xvf myfiles.tar

Note that running multiple instances of rsync or scp will not speed up the copy process, but slow it down even further!

More answers coming soon.

In the meantime, please see Help.

Help

Stay Informed

The following news channels are available:

  • The tier3-hpc mailing list is for general news about HPC@RUB. Subscribe if you are interested.
  • News for Elysium users and course announcements are available in the News section. You can subscribe to the RSS Feed or to the hpc-news mailing list to get these automatically.
  • Urgent messages regarding operations of Elysium will be sent via direct mail to all active users.

User Community

The HPC User room in the RUB Matrix Server is for exchange and collaboration between HPC users at RUB. Please join!

The HPC team is also present, but please note that you should still contact us at hpc-helpdesk@ruhr-uni-bochum.de if you need a problem solved.

Contact

The HPC@RUB cluster Elysium is operated by the HPC team at IT.SERVICES.

Contact us at hpc-helpdesk@ruhr-uni-bochum.de to ask questions or report problems.

Please include as much information as possible to help us identify the problem, e.g. LoginID, JobID, submission script, directory and file names etc., if applicable.

Courses at RUB

The HPC@RUB team offers two courses: Introduction to Linux and Introduction to HPC. See the Training section for details and registration.

External Resources