Welcome

This guide provides basic information for using the AI4S Supercomputer during early access, including login instructions and simple Slurm job examples.

For hardware information about this system, refer to https://github.com/RIKEN-RCCS/AI-for-Science-Supercomputer.

To request an account for the AI4S Supercomputer early access program, submit the account registration form.

Account registration form

SSH Login

Replace USERNAME below with your own username.

$ ssh USERNAME@login01.ai.r-ccs.riken.jp

Usage

Submit jobs to the system using Slurm. The following partitions are available.

Partition  Max Nodes  Max GPUs/Node  Max Wall Time  Max Memory/Node
1n1gpu     1          1              96 hours       400GB
1n2gpu     1          2              96 hours       800GB
1n4gpu     1          4              96 hours       1600GB
2n4gpu     2          4              96 hours       1600GB
4n4gpu     4          4              96 hours       1600GB
4n4gpu-p   4          4              unlimited      unlimited
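As a rough illustration of how to read the table, the following sketch maps a request (number of nodes, GPUs per node) to the smallest partition that fits. The function name pick_partition is hypothetical, and the sketch assumes that a partition also accepts requests below its listed maximums. The 4n4gpu-p partition is left out on purpose because this guide does not describe when it should be used.

```shell
#!/bin/bash
# Hypothetical helper: choose the smallest partition from the table
# that fits the requested resources.
# Usage: pick_partition NODES GPUS_PER_NODE
pick_partition() {
    local nodes="$1" gpus="$2"
    case "${nodes}:${gpus}" in
        1:1)        echo "1n1gpu" ;;
        1:2)        echo "1n2gpu" ;;
        1:3|1:4)    echo "1n4gpu" ;;
        2:[1-4])    echo "2n4gpu" ;;
        [34]:[1-4]) echo "4n4gpu" ;;
        *)
            # Nothing in the table covers this request.
            echo "no partition fits ${nodes} node(s) x ${gpus} GPU(s)" >&2
            return 1
            ;;
    esac
}

pick_partition 1 2   # prints 1n2gpu
pick_partition 2 4   # prints 2n4gpu
```

The result can be passed straight to --partition when submitting a job.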

Module environment

The system provides several module environments for development and execution with the NVIDIA HPC Software Development Kit.

Module              Description
nvhpc               Standard NVHPC environment
nvhpc-nompi         NVHPC without MPI; use this when you manage MPI yourself
nvhpc-hpcx          NVHPC with InfiniBand (IB) and HPC-X
nvhpc-hpcx-cuda13   NVHPC with HPC-X, pinned to CUDA 13
nvhpc-byo-compiler  Uses the system GCC (BYO: bring your own compiler)

List available modules

Use module avail to list every module available on the system.

$ module avail

List loaded modules

Use module list to check which modules are currently loaded.

$ module list

Load a module

Use module load to load the environment you want to use.

$ module load nvhpc
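To confirm that a module actually changed your environment, you can check that the tools you expect are on PATH. The sketch below uses a hypothetical helper, require_cmds. The compiler names nvc, nvc++, and nvfortran are the usual NVIDIA HPC SDK compiler drivers; that the nvhpc module on this system provides them is an assumption, so adjust the list to what module show reports.

```shell
#!/bin/bash
# Hypothetical helper: report which of the given commands are missing
# from PATH. Returns non-zero if at least one is missing.
require_cmds() {
    local missing=0 cmd
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "not found on PATH: $cmd" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example, to be run after `module load nvhpc` (compiler names are assumed):
# require_cmds nvc nvc++ nvfortran || echo "nvhpc does not appear to be loaded"
```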

Switch modules

Unload the current module before loading another one.

$ module unload nvhpc
$ module load nvhpc-hpcx

Inspect a module

Use module show to check what a module changes in your environment.

$ module show nvhpc-hpcx

Submit a batch job

Create a job script such as job.sh. The following example requests one node and one GPU on the 1n1gpu partition.

#!/bin/bash
#SBATCH --job-name=test-job
#SBATCH --partition=1n1gpu
#SBATCH --nodes=1
#SBATCH --gpus-per-node=1
#SBATCH --time=00:10:00

module load nvhpc

hostname
nvidia-smi

Submit the job with sbatch.

$ sbatch job.sh

Check your jobs with squeue.

$ squeue -u $USER
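The same job script pattern extends to the multi-node partitions. The sketch below is one possible way to generate such a script: make_job_script is a hypothetical helper that prints a job script for a given partition, node count, GPUs per node, and wall time. The srun --ntasks-per-node=1 line is an illustrative choice so that each allocated node reports its hostname once.

```shell
#!/bin/bash
# Hypothetical helper: print a job script for the given resources.
# Usage: make_job_script PARTITION NODES GPUS_PER_NODE WALLTIME
make_job_script() {
    local partition="$1" nodes="$2" gpus="$3" walltime="$4"
    cat <<EOF
#!/bin/bash
#SBATCH --job-name=test-job
#SBATCH --partition=${partition}
#SBATCH --nodes=${nodes}
#SBATCH --gpus-per-node=${gpus}
#SBATCH --time=${walltime}

module load nvhpc

# Launch one task per node so every allocated node prints its hostname.
srun --ntasks-per-node=1 hostname
EOF
}

# Print a two-node, four-GPU-per-node script for the 2n4gpu partition.
# Redirect the output to a file (for example job-2n4gpu.sh) to submit it.
make_job_script 2n4gpu 2 4 00:10:00
```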

Run an interactive job

With salloc

Use salloc to allocate resources for an interactive session. The following example requests one node and one GPU on the 1n1gpu partition for 10 minutes.

$ salloc --partition=1n1gpu --nodes=1 --gpus-per-node=1 --time=00:10:00

After the allocation starts, use srun to run commands on the allocated node.

$ srun hostname
$ srun nvidia-smi
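Because commands only reach the allocated node when they are launched through srun, it can help to check whether the current shell belongs to an allocation at all. The following sketch relies on the SLURM_JOB_ID environment variable, which Slurm sets inside an allocation; the function name in_allocation is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: succeed only if this shell is inside a Slurm allocation.
in_allocation() {
    [ -n "${SLURM_JOB_ID:-}" ]
}

if in_allocation; then
    echo "Inside Slurm job ${SLURM_JOB_ID}; use srun to run on the allocated node."
else
    echo "No allocation in this shell; run salloc first."
fi
```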

When you are finished, run exit to release the interactive allocation.

$ exit
exit
salloc: Relinquishing job allocation 1066

With srun

Use srun --pty bash to start an interactive shell on a compute node. The following example requests one node and one GPU on the 1n1gpu partition for 10 minutes.

$ srun --partition=1n1gpu --nodes=1 --gpus-per-node=1 --time=00:10:00 --pty bash

When you are finished, run exit to leave the shell and end the srun job.

$ exit
exit

Cancel a job

Cancel a submitted or running job with scancel. Replace JOBID with the job ID shown by sbatch or squeue.

$ scancel JOBID
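In scripts, the job ID can be captured at submission time instead of being copied by hand. The sketch below parses the "Submitted batch job <id>" line that sbatch prints; parse_job_id is a hypothetical helper, and job.sh is the example script from this guide. The submission is guarded so that the snippet does nothing on a machine without Slurm.

```shell
#!/bin/bash
# Hypothetical helper: extract the numeric job ID from sbatch's
# "Submitted batch job <id>" message read on standard input.
parse_job_id() {
    awk '/^Submitted batch job/ { print $4 }'
}

# Only attempt a submission where sbatch exists and job.sh is present.
if command -v sbatch >/dev/null 2>&1 && [ -f job.sh ]; then
    jobid="$(sbatch job.sh | parse_job_id)"
    echo "Submitted job ${jobid}"
    # scancel "${jobid}"   # uncomment to cancel the job again
fi
```

sbatch also has a --parsable option that prints only the job ID (followed by a semicolon and the cluster name when one applies), which avoids the parsing step.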

Check partition status

Use sinfo to check the status of partitions and nodes.

$ sinfo

To check a specific partition, use -p.

$ sinfo -p 1n1gpu
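For a quick scripted check, sinfo output can be reduced to a single number. The sketch below counts idle nodes from the two-column output of sinfo -h -o "%D %t" (node count, compact state). The helper name count_idle_nodes is hypothetical, and only the exact state idle is counted, so variants with a suffix such as idle~ are ignored.

```shell
#!/bin/bash
# Hypothetical helper: sum the node counts of all lines whose state is "idle".
# Expects input as produced by: sinfo -h -o "%D %t"
count_idle_nodes() {
    awk '$2 == "idle" { n += $1 } END { print n + 0 }'
}

# Query the 1n1gpu partition only where sinfo is available.
if command -v sinfo >/dev/null 2>&1; then
    sinfo -h -p 1n1gpu -o "%D %t" | count_idle_nodes
fi
```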

Local scratch storage

Each compute node provides a local scratch area on an approximately 7TB NVMe SSD. Use the USER_SCRATCH_DIR environment variable to access this scratch area from your job.

This storage is useful when you need fast local SSD access for temporary files, datasets, checkpoints, or intermediate results during computation. Files in this area are automatically deleted after the job finishes, so copy any results that you need to keep to your persistent storage before the job ends.

#!/bin/bash
#SBATCH --job-name=scratch-example
#SBATCH --partition=1n1gpu
#SBATCH --nodes=1
#SBATCH --gpus-per-node=1
#SBATCH --time=00:10:00

module load nvhpc

echo "Scratch directory: ${USER_SCRATCH_DIR}"

# Quote the paths so they still work if a directory name contains spaces.
./my_application > "${USER_SCRATCH_DIR}/output.log"

# Copy results that must be kept before the job finishes.
# SLURM_SUBMIT_DIR is the directory where you ran the sbatch command.
cp "${USER_SCRATCH_DIR}/output.log" "${SLURM_SUBMIT_DIR}/"
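Since scratch files disappear when the job ends, one option is to register the copy step with an EXIT trap so that it also runs when the script stops early because a command failed. This is a sketch, not a site-recommended pattern: stage_out and the results subdirectory are hypothetical names, and the fallbacks to a temporary directory and the current directory exist only so the snippet can be tried outside Slurm.

```shell
#!/bin/bash
# Fallbacks for running outside a Slurm job (illustration only).
scratch="${USER_SCRATCH_DIR:-$(mktemp -d)}"
dest="${SLURM_SUBMIT_DIR:-$PWD}"

# Hypothetical helper: copy the scratch results directory to persistent storage.
stage_out() {
    if [ -d "${scratch}/results" ]; then
        mkdir -p "${dest}/results"
        cp -r "${scratch}/results/." "${dest}/results/"
    fi
}

# Run stage_out whenever the script exits, whether it succeeded or not.
trap stage_out EXIT

mkdir -p "${scratch}/results"
# Stand-in for the real application writing to local scratch.
echo "example output" > "${scratch}/results/output.log"
```

Whether the trap also runs when a job is terminated for exceeding its time limit depends on how the job is signalled, so leave enough wall time for the copy to finish.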