Welcome
This guide provides basic information for using the AI4S Supercomputer during early access, including login instructions and simple Slurm job examples.
For hardware information about this system, refer to https://github.com/RIKEN-RCCS/AI-for-Science-Supercomputer.
To request an account for the AI4S Supercomputer early access program, submit the account registration form.
SSH Login
Replace USERNAME below with your own username.
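The login command has the following general shape. This is only a sketch: `LOGIN_HOST` is a hypothetical placeholder, because the login node address is not stated in this section, so use the address issued with your account. The snippet prints the command rather than connecting.

```shell
# USERNAME and LOGIN_HOST are placeholders, not real values.
# Use your own account name and the login node address issued with it.
USERNAME="USERNAME"
LOGIN_HOST="LOGIN_HOST"

# Print the command for review; run the printed line to connect.
echo "ssh ${USERNAME}@${LOGIN_HOST}"
```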
Usage
Submit jobs to the system using Slurm. The following partitions are available.
| Partition | Max Nodes | Max GPUs/Node | Max Wall Time | Max Memory/Node |
|---|---|---|---|---|
| 1n1gpu | 1 | 1 | 96 hours | 400GB |
| 1n2gpu | 1 | 2 | 96 hours | 800GB |
| 1n4gpu | 1 | 4 | 96 hours | 1600GB |
| 2n4gpu | 2 | 4 | 96 hours | 1600GB |
| 4n4gpu | 4 | 4 | 96 hours | 1600GB |
| 4n4gpu-p | 4 | 4 | unlimited | unlimited |
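The table columns map directly onto `#SBATCH` directives. As an illustration, the sketch below writes a job script for the 2n4gpu partition that uses both nodes with four GPUs each. The file name `job-2n4gpu.sh`, the job name, and the one-task-per-node layout are assumptions chosen for this example, not site requirements.

```shell
# Write a two-node job script; the file name is only an example.
cat > job-2n4gpu.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=multi-node-test
#SBATCH --partition=2n4gpu
#SBATCH --nodes=2
#SBATCH --gpus-per-node=4
#SBATCH --ntasks-per-node=1
#SBATCH --time=00:10:00

# One task per node: each task prints the name of the node it runs on.
srun hostname
EOF

# Submit only where Slurm is installed.
if command -v sbatch >/dev/null 2>&1; then
  sbatch job-2n4gpu.sh
else
  echo "sbatch not found; submit job-2n4gpu.sh from the AI4S login node"
fi
```

Keep the requested nodes, GPUs per node, and wall time within the limits listed for the partition you choose.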
Module environment
The system provides several module environments for development and execution with the NVIDIA HPC SDK (NVHPC).
| Module | Description |
|---|---|
| nvhpc | Standard NVHPC environment |
| nvhpc-nompi | NVHPC without MPI; use this when you manage MPI yourself |
| nvhpc-hpcx | NVHPC with InfiniBand (IB) and HPC-X |
| nvhpc-hpcx-cuda13 | NVHPC with HPC-X, with the CUDA version fixed at 13 |
| nvhpc-byo-compiler | NVHPC with the system GCC as the compiler (BYO: bring your own) |
List available modules
Use module avail to see the detailed list of available modules.
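A short sketch of listing modules and narrowing the list to the NVHPC environments. The helper name `modules_matching` is made up for this example. Module tools often print their listings to stderr, which is why the helper merges stderr into stdout before filtering.

```shell
# Print only the available modules whose name contains the given text.
modules_matching() {
  module avail 2>&1 | grep -i -- "$1"
}

if command -v module >/dev/null 2>&1; then
  module avail              # full list
  modules_matching nvhpc    # only the NVHPC environments
else
  echo "module not found; run this on the AI4S system"
fi
```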
List loaded modules
Use module list to check which modules are currently loaded.
Load a module
Use module load to load the environment you want to use.
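For example, loading the standard environment and confirming the result might look like this. `MODULE_NAME` is a variable introduced for the sketch; any module name from the table in the Module environment section can be substituted.

```shell
MODULE_NAME="nvhpc"    # any name from the module table

if command -v module >/dev/null 2>&1; then
  module load "${MODULE_NAME}"
  module list    # confirm that the module is now loaded
else
  echo "module not found; run this on the AI4S system"
fi
```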
Switch modules
Unload the current module before loading another one.
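The unload-then-load sequence can be sketched as a small helper. `switch_module` is a hypothetical function name, and moving from nvhpc to nvhpc-hpcx is just one possible pair.

```shell
# Unload one module, then load another; stops if the unload reports an error.
switch_module() {
  module unload "$1" && module load "$2"
}

if command -v module >/dev/null 2>&1; then
  switch_module nvhpc nvhpc-hpcx
  module list    # confirm the change
else
  echo "module not found; run this on the AI4S system"
fi
```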
Inspect a module
Use module show to check what a module changes in your environment.
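As an illustration, the following inspects nvhpc-hpcx without loading it. The module chosen here is arbitrary. The output typically lists the environment variables, such as search paths, that loading the module would set or extend.

```shell
MODULE_NAME="nvhpc-hpcx"    # module to inspect

if command -v module >/dev/null 2>&1; then
  module show "${MODULE_NAME}"
else
  echo "module not found; run this on the AI4S system"
fi
```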
Submit a batch job
Create a job script such as job.sh. The following example requests one node and one GPU on the 1n1gpu partition.
#!/bin/bash
#SBATCH --job-name=test-job
#SBATCH --partition=1n1gpu
#SBATCH --nodes=1
#SBATCH --gpus-per-node=1
#SBATCH --time=00:10:00
# Load the NVHPC environment.
module load nvhpc
# Print the compute node name and the GPU allocated to the job.
hostname
nvidia-smi
Submit the job with sbatch.
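A minimal submission sketch, assuming the script was saved as job.sh in the current directory:

```shell
JOB_SCRIPT="job.sh"

if command -v sbatch >/dev/null 2>&1; then
  # On success sbatch prints "Submitted batch job <JOBID>".
  sbatch "${JOB_SCRIPT}"
else
  echo "sbatch not found; run this on the AI4S login node"
fi
```

Note the job ID that sbatch prints; you need it to monitor or cancel the job.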
Check your jobs with squeue.
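Without options, squeue lists jobs from all users, so it usually helps to restrict the output to your own. In this sketch, `ME` is a variable introduced for the example.

```shell
# Your login name; falls back to id -un when USER is not set.
ME="${USER:-$(id -un)}"

if command -v squeue >/dev/null 2>&1; then
  # The ST column shows the job state, for example PD (pending) or R (running).
  squeue -u "${ME}"
else
  echo "squeue not found; run this on the AI4S login node"
fi
```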
Run an interactive job
With salloc
Use salloc to allocate resources for an interactive session. The following example requests one node and one GPU on the 1n1gpu partition for 10 minutes.
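A sketch of that request, reusing the options from the batch example. `PARTITION` and `TIME_LIMIT` are variables introduced here for readability.

```shell
PARTITION="1n1gpu"
TIME_LIMIT="00:10:00"

if command -v salloc >/dev/null 2>&1; then
  # Waits until the resources are granted, then opens a shell that holds them.
  salloc --partition="${PARTITION}" --nodes=1 --gpus-per-node=1 --time="${TIME_LIMIT}"
else
  echo "salloc not found; run this on the AI4S login node"
fi
```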
After the allocation starts, use srun to run commands on the allocated node.
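Inside the shell that salloc opens, Slurm sets `SLURM_JOB_ID`, and commands launched through srun run on the allocated node. The sketch below uses that variable to check for an active allocation first. The helper name `in_allocation` is made up for this example.

```shell
# True when the current shell belongs to a Slurm allocation.
in_allocation() {
  [ -n "${SLURM_JOB_ID:-}" ]
}

if in_allocation; then
  srun hostname      # prints the compute node name
  srun nvidia-smi    # shows the allocated GPU
else
  echo "No active allocation; run salloc first"
fi
```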
When you are finished, run exit to release the interactive allocation.
With srun
Use srun --pty bash to start an interactive shell on a compute node. The following example requests one node and one GPU on the 1n1gpu partition for 10 minutes.
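A sketch of that command. As before, `PARTITION` and `TIME_LIMIT` are variables introduced only for this example.

```shell
PARTITION="1n1gpu"
TIME_LIMIT="00:10:00"

if command -v srun >/dev/null 2>&1; then
  # --pty attaches your terminal to the shell started on the compute node.
  srun --partition="${PARTITION}" --nodes=1 --gpus-per-node=1 \
       --time="${TIME_LIMIT}" --pty bash
else
  echo "srun not found; run this on the AI4S login node"
fi
```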
When you are finished, run exit to leave the shell and end the srun job.
Cancel a job
Cancel a submitted or running job with scancel. Replace JOBID with the job ID shown by sbatch or squeue.
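A cautious sketch that refuses to call scancel until the placeholder has been replaced. The helper `is_numeric_job_id` is hypothetical and accepts only plain numeric IDs; other ID forms that scancel understands, such as job array elements, are not handled here.

```shell
# Accept only a non-empty string of digits.
is_numeric_job_id() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;
    *) return 0 ;;
  esac
}

JOBID="JOBID"    # placeholder: replace with the ID shown by sbatch or squeue

if is_numeric_job_id "${JOBID}" && command -v scancel >/dev/null 2>&1; then
  scancel "${JOBID}"
else
  echo "Set JOBID to a numeric job ID and run this on the AI4S login node"
fi
```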
Check partition status
Use sinfo to check the status of partitions and nodes.
To check a specific partition, use -p.
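Both forms can be sketched as follows; the 1n1gpu partition is used only as an example.

```shell
PARTITION="1n1gpu"

if command -v sinfo >/dev/null 2>&1; then
  sinfo                      # all partitions and their node states
  sinfo -p "${PARTITION}"    # only the named partition
else
  echo "sinfo not found; run this on the AI4S login node"
fi
```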
Local scratch storage
Each compute node provides a local scratch area on an NVMe SSD of approximately 7 TB. Use the USER_SCRATCH_DIR environment variable to access this scratch area from your job.
This storage is useful when you need fast local SSD access for temporary files, datasets, checkpoints, or intermediate results during computation. Files in this area are automatically deleted after the job finishes, so copy any results that you need to keep to your persistent storage before the job ends.
#!/bin/bash
#SBATCH --job-name=scratch-example
#SBATCH --partition=1n1gpu
#SBATCH --nodes=1
#SBATCH --gpus-per-node=1
#SBATCH --time=00:10:00
module load nvhpc
echo "Scratch directory: ${USER_SCRATCH_DIR}"
# Write the application output to the node-local scratch area.
./my_application > "${USER_SCRATCH_DIR}/output.log"
# Copy results that must be kept before the job finishes.
# SLURM_SUBMIT_DIR is the directory where you ran the sbatch command.
cp "${USER_SCRATCH_DIR}/output.log" "${SLURM_SUBMIT_DIR}/"
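The same idea extends to input data: stage files into scratch, work on the local copy, then copy the results back. The sketch below is illustrative only. The file names `input.txt` and `output.txt` are hypothetical, `tr` stands in for a real application, and the fallbacks to a temporary directory and to the current directory exist only so the snippet also runs outside a job.

```shell
# Inside a job these variables are set; the fallbacks are for trying the
# sketch elsewhere.
SCRATCH="${USER_SCRATCH_DIR:-$(mktemp -d)}"
WORK_DIR="${SLURM_SUBMIT_DIR:-$PWD}"

# Stage in: create a stand-in input file if none exists, then copy it.
[ -f "${WORK_DIR}/input.txt" ] || echo "sample input" > "${WORK_DIR}/input.txt"
cp "${WORK_DIR}/input.txt" "${SCRATCH}/"

# Compute on the local copy; tr is a placeholder for your application.
tr 'a-z' 'A-Z' < "${SCRATCH}/input.txt" > "${SCRATCH}/output.txt"

# Stage out before the job ends, because scratch is deleted afterwards.
cp "${SCRATCH}/output.txt" "${WORK_DIR}/"
```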