6. Runtime options and behavior¶
This section introduces options that are frequently used when running Singularity, along with some behaviors that are useful to know.
6.1. Incorporating the environment on the host side¶
6.1.1. Handling of environment variables¶
We have seen that access to files in your home directory is enabled by default. What about environment variables? If you display the environment variables inside the container as follows, you can see that the host-side variables present at runtime are inherited.
$ singularity exec ubuntu2004.sif env
The content of %environment in the definition file is actually stored as a script in the image and is sourced when the container starts, so it can pick up, manipulate, or overwrite variables coming from the host side. In addition, if you set a variable named SINGULARITYENV_*** (where *** is any string usable as a variable name) on the host side before running the container, the variable *** will be set inside the container. This lets you preset variables in the container without interfering with the host environment, and lets you temporarily set or override what %environment describes without recreating the image.
$ cat example.def
<<..snip..>>
%environment
export WORKFILE=/tmp/work_defined_in_env
%runscript
echo 'WORKFILE in Container' = ${WORKFILE}
$ cat job.sh
<<..snip..>>
export WORKFILE=/workdir/workfile
export SINGULARITYENV_WORKFILE=/workdir/work_defined_by_SINGULARITYENV
~/example.sif
The execution result in this case is as follows.
WORKFILE in Container = /workdir/work_defined_by_SINGULARITYENV
# without setting SINGULARITYENV_WORKFILE
WORKFILE in Container = /tmp/work_defined_in_env
# without setting either SINGULARITYENV_WORKFILE or %environment
WORKFILE in Container = /workdir/workfile
This makes it easy to change conditions in different environments without having to recreate the image.
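As a quick sketch reusing example.sif from above, the same override can also be supplied inline for a single run (the path /scratch/run42 here is just an illustration); recent Singularity versions additionally offer an --env option for the same purpose.
$ SINGULARITYENV_WORKFILE=/scratch/run42 singularity run example.sif
WORKFILE in Container = /scratch/run42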
6.1.2. Directory binding¶
When you start a Singularity container, your home directory, /tmp, /var/tmp, /dev, /proc, and /sys are bound inside the container by default. The current directory at startup is also bound, but only if a directory with the same path exists in the container. If you start the image from a directory that does not exist in the container, or if you want to access other host directories from within the container, additional bindings are needed. For example, applications and data already installed on the host may be too large or redundant to import into the image, or you may want to use the home directory of another user who shares files with you.
In such cases, you can specify additional directories to bind with the -B option at startup. You can map a host directory to a different path by separating the host-side directory and the mount point in the container with ':'; if you omit the mount destination, the directory is bound to the same path in the container.
$ singularity exec -B /worktmp:/scratch ubuntu2004.sif ls -al /scratch
The directory is remounted inside the container as a bind mount, and you can append further mount options after another ':'. For example, to make it read-only inside the container, write the following.
$ singularity exec -B /system/reference:/opt/data:ro ubuntu2004.sif ls -al /opt/data
Also, if you want to specify multiple directories, you can repeat -B or separate the specifications with ',', as shown below.
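For example, reusing the directories from the commands above, the following two forms are equivalent:
$ singularity exec -B /worktmp:/scratch -B /system/reference:/opt/data:ro ubuntu2004.sif ls /scratch /opt/data
$ singularity exec -B /worktmp:/scratch,/system/reference:/opt/data:ro ubuntu2004.sif ls /scratch /opt/data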
There are some caveats to this binding behavior. Because the current directory is bound by default, starting a container from a directory such as /usr/bin or /usr/lib replaces that directory in the container with the host-side one, resulting in runtime errors. Also, if you start from another user's home directory, it is not bound automatically because that directory does not exist in the container; it must be specified explicitly with the -B option.
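For instance, assuming the other user's home directory is /home/uYYYYY (a placeholder path), an explicit bind looks like this:
$ singularity exec -B /home/uYYYYY ubuntu2004.sif ls -al /home/uYYYYY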
6.2. Instantiation and interactive processing¶
So far, we have combined launching the container with launching the application, so that users can start their application directly without being aware of the container. With this usage, however, the container terminates as soon as the application it launched finishes. You can of course start it again, but even with a lightweight container implementation, repeating this many times wastes overhead. It also raises the question of whether Singularity can be used for services such as web servers. To accommodate such uses, Singularity has a feature called instantiation that keeps a container running.
Start (instantiate) only the container as shown below. The last string is the instance name, which must not collide with the name of an instance that is already running. Confirm that it started correctly with the list subcommand.
$ singularity instance start ubuntu2004.sif Focal
$ singularity instance list
INSTANCE NAME    PID      IP    IMAGE
Focal            12016          /home/uXXXXX/ubuntu2004.sif
The image is now unpacked and mounted, that is, the container is up and waiting. To run your application within this instance, specify the instance name instead of the image name, as follows:
$ singularity exec instance://Focal cat /etc/os-release
Multiple applications can be submitted and executed on a single instance. For example, suppose you need to perform the same process on a huge number of files in a certain directory.
for data in data.*
do
singularity exec jammy.sif appl ${data}
done
While it is tempting to write such a script, the following will generally reduce the container launch overhead and thus increase throughput.
singularity instance start jammy.sif jammy
for data in data.*
do
singularity exec instance://jammy appl ${data}
done
Since you can start multiple instances simultaneously, you can also perform different operations in different environments within a single job, as sketched below.
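A minimal sketch, assuming both the ubuntu2004.sif and jammy.sif images used above are available:
$ singularity instance start ubuntu2004.sif focal
$ singularity instance start jammy.sif jammy
$ singularity exec instance://focal cat /etc/os-release
$ singularity exec instance://jammy cat /etc/os-release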
To terminate a running instance, use the stop subcommand.
$ singularity instance stop Focal
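If several instances are running, recent Singularity versions also accept the --all flag to stop them all at once:
$ singularity instance stop --all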
6.3. Direct execution with specified repository¶
Recall that we obtained a docker image when we created the SIF image. During pull and build, the layer files that make up the docker image are cached under ~/.singularity. This caching behavior does not change when you specify oras or library as the source instead. Singularity can also run directly from such a repository URI, as follows.
$ singularity exec docker://ubuntu:latest head -n 5 /etc/os-release
INFO: Converting OCI blobs to SIF format
WARNING: 'nodev' mount option set on /worktmp, it could be a source of failure during build process
INFO: Starting build...
Getting image source signatures
Copying blob a39c84e173f0 done
Copying config 4db8450a4d done
Writing manifest to image destination
Storing signatures
2021/03/21 12:14:15 info unpack layer: sha256:a39c84e173f038958d338f55a9e8ee64bb6643e8ac6ae98e08ca65146e668d86
2021/03/21 12:14:15 warn xattr{etc/gshadow} ignoring ENOTSUP on setxattr "user.rootlesscontainers"
2021/03/21 12:14:15 warn xattr{/worktmp/build-temp-063655636/rootfs/etc/gshadow} destination filesystem does not support xattrs, further warnings will be suppressed
INFO: Creating SIF file...
NAME="Ubuntu"
VERSION="20.04.3 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.3 LTS"
As you can see from the messages, the image is internally converted to SIF and stored under ~/.singularity. When you re-execute, Singularity checks the repository for updates, making the best use of the cache, and synchronizes. If there is no update, the cached SIF image is launched directly, as follows:
$ singularity exec docker://ubuntu:latest head -n 5 /etc/os-release
INFO: Using cached SIF image
NAME="Ubuntu"
VERSION="20.04.3 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.3 LTS"
If there is an update on the repository side, it is fetched, the SIF image is recreated, and then executed. This execution method is therefore not suitable when you need to preserve a proven execution environment; it is recommended only when you always want to import and run the latest version from the repository.
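If you instead need a reproducible environment, one approach is to pull the image once into a local SIF file and keep running that fixed file:
$ singularity pull ubuntu_20.04.sif docker://ubuntu:20.04
$ singularity exec ubuntu_20.04.sif head -n 5 /etc/os-release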
You can also view cache information and clear cache files.
$ singularity cache list -v
NAME                     DATE CREATED          SIZE        TYPE
061be60b872e3a155a2364   2021-10-28 22:14:15   0.34 KiB    blob
2d5d59c100e7ad4ac35a22   2021-11-03 13:54:19   78.57 MiB   blob
<<..snip..>>
sha256.9a6ee1f8fdecb21   2023-02-22 17:14:02   2.65 MiB    library
626ffe58f6e7566e00254b   2021-10-28 22:14:18   25.15 MiB   oci-tmp
There are 2 container file(s) using 27.81 MiB and 25 oci blob file(s) using 692.86 MiB of space
Total space used: 720.67 MiB
$ singularity cache clean
These caches are used not only at runtime, but also at build/pull time.
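By default, cache clean removes the whole cache. Assuming a Singularity 3.x series, the removal can also be narrowed down, for example by cache type or age (check singularity cache clean --help for the options available in your version):
$ singularity cache clean --type blob
$ singularity cache clean --days 30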
6.4. Overlay option¶
The body of a SIF image is a SquashFS filesystem (roughly tar+gzip), so it is read-only at runtime. However, there are cases where you want to place files in the image temporarily, need an image that differs only partially from another, or want to keep only the differences because holding multiple self-contained images wastes capacity. Singularity provides two solutions.
The --writable-tmpfs option, which makes the container temporarily writable by superimposing a memory-backed tmpfs on the read-only image.
The --overlay option, which prepares a separate persistent writable file and overlays it at runtime.
The --writable-tmpfs option superimposes tmpfs on the existing image, so new files and directories can be created anywhere in the container. However, files originally in the image cannot be modified; attempting to do so results in Permission denied. Also, the changes exist only in tmpfs, so they disappear when the container terminates. If you produce data or files you need, save them elsewhere before exiting the container.
$ singularity shell --writable-tmpfs something.sif
Singularity> echo hogefuga > /etc/testfile
Singularity> ls -al /etc/testfile
-rw-rw-r-- 1 uXXXXX fugaku 9 Feb 14 04:50 /etc/testfile
Singularity> echo hogefuga > /etc/os-release
bash: /etc/os-release: Permission denied
Singularity> exit
$ singularity shell something.sif
Singularity> ls -al /etc/testfile
ls: cannot access '/etc/testfile': No such file or directory
The --overlay option overlays a separately prepared, persistent writable file at runtime; instead of tmpfs, it requires a separate image file to be created. If the same file exists in both the SIF and the overlay, the overlay side takes priority, which means partial modification of the image is possible.
# Create a 1GB overlay image file with the specified directory created inside it in advance.
$ singularity overlay create --size 1024 --create-dir /usr/share/apps overlay.img
# Extract data into the overlay image.
$ singularity exec --overlay overlay.img core.sif tar xzf reference-data.tar.gz -C /usr/share/apps
# Execute the container with the overlay image superimposed.
$ singularity exec --overlay overlay.img core.sif ls -al /usr/share/apps
The written data persists in overlay.img, which can be used to carry around and segregate your data. Note, however, that the image file is formatted as ext3 and also retains owner information, so be careful when using it on a different system. On the other hand, Singularity can use the same image file on both AArch64 and x86_64. It is also possible to manipulate the image directly by mounting it as a loopback device, or to resize it like a normal partition, as shown below.
$ e2fsck -f overlay.img && resize2fs overlay.img 4096M
In addition, you can integrate the overlay image into a SIF image to obtain a single file by means of the singularity sif command.
$ singularity sif list ubuarm20.sif
------------------------------------------------------------------------------
ID |GROUP |LINK |SIF POSITION (start-end) |TYPE
------------------------------------------------------------------------------
1 |1 |NONE |65536-65574 |Def.FILE
2 |1 |NONE |131072-131169 |JSON.Generic
3 |1 |NONE |196608-26320896 |FS (Squashfs/*System/arm64)
$ singularity sif add --datatype 4 --partfs 2 --parttype 4 --partarch 4 ubuarm20.sif overlay.img
$ singularity sif list ubuarm20.sif
------------------------------------------------------------------------------
ID |GROUP |LINK |SIF POSITION (start-end) |TYPE
------------------------------------------------------------------------------
1 |1 |NONE |65536-65574 |Def.FILE
2 |1 |NONE |131072-131169 |JSON.Generic
3 |1 |NONE |196608-26320896 |FS (Squashfs/*System/arm64)
4 |NONE |NONE |26320896-1100062720 |FS (Ext3/Overlay/arm64)
For the meaning of each parameter, refer to the help shown by singularity sif add --help. This function makes it possible to operate with images whose contents can be modified. The added partition can later be extracted or deleted.
$ singularity sif dump 4 ubuarm20.sif > ovl.img
$ singularity sif del 4 ubuarm20.sif
These operations can also be performed on the login node, so there is no need to submit a job for this purpose. Overlay also contributes to reducing the metadata access load on the shared file system, especially when there are many small files, since I/O is completed within a single file.
6.5. MPI parallelism¶
As for MPI parallelism, there are no major problems up to parallel execution within a node, but multi-node parallelism involves several issues to resolve, so additional work and configuration are required both at image creation and at runtime.
6.5.1. Parallel within a node¶
For parallel execution within a node, the process manager launches the application directly, so processing does not extend outside the container. Launch the container with Singularity and start the MPI application inside it with mpirun (mpiexec) or similar; of course, this assumes that the required MPI runtime is in the container. An example command looks like this:
$ singularity exec mpi-apps.sif mpiexec -np 8 ~/myapps/hoge inputfile
Only one container is started, and multiple processes run in it.
6.5.2. Overview of multi-node parallel¶
If a multi-node parallel application is started in the same way as node-internal parallelism, the process manager is launched on the other nodes via ssh or PMI. However, since no container is started on those destination nodes, the application almost always fails to start with an error. Therefore, we have the MPI process manager launch the container as well, not just the application. An example looks like this:
$ mpiexec -np 32 --machinefile nodefile singularity exec mpi-apps.sif ~/myapps/hoge inputfile
Compared to intra-node parallel execution, the order of the mpiexec and singularity commands is reversed. In this case, mpiexec runs outside the Singularity container, which means you must either ensure that the same MPI is installed both on the host side and inside the container, or use the -B option to bind the host directory where MPI is installed into the container. Consequently, full separation of the user environment via containers is not achievable. In addition, launching one container per process increases the overhead.
Furthermore, compatibility with the interconnect used for RDMA communication is necessary. In Singularity, /dev is shared with the host, so the devices themselves remain visible, but the software stack and sockets required to use them must be prepared within the container. Finally, various information about the nodes the scheduler allocated to the job must be retrieved from environment variables and passed to the actual MPI; only then does multi-node MPI parallelism work.
6.5.3. Multi-node parallel on FUGAKU¶
The steps needed to create and run an image that incorporates the MPI included in Fujitsu TCS are as follows. In the image creation section, we installed TCSDS from a mounted directory; however, since a network-accessible repository is also available, we will use it here. The part of the definition file that prepares the execution environment looks like this:
Bootstrap: docker
From: redhat/ubi8:8.9
%files
%post
echo 'timeout=14400' >> /etc/yum.conf
dnf --noplugins --releasever=8.9 -y install https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.noarch.rpm
# BEGIN Fujitsu TCS Part
cat <<EOF > /etc/yum.repos.d/fugaku.repo
[FUGAKU-AppStream]
name=FUGAKU-AppStream
baseurl=http://10.4.38.1/pxinst/repos/FUGAKU/AppStream
enabled=1
gpgcheck=0
[FUGAKU-BaseOS]
name=FUGAKU-BaseOS
baseurl=http://10.4.38.1/pxinst/repos/FUGAKU/BaseOS
enabled=1
gpgcheck=0
EOF
dnf --noplugins --releasever=8.9 -y install xpmem libevent tcl less hwloc openssh-clients gcc-c++ elfutils-libelf-devel FJSVpxtof FJSVpxple FJSVpxpsm FJSVpxkrm FJSVxoslibmpg papi-devel
dnf --noplugins clean all
# END Fujitsu TCS Part
This installs the necessary packages. Let's try running the binary of an MPI sample code inside the container with inter-node parallel execution. In the following example, we submit it as an interactive job with two nodes to verify that it works.
pjsub --interact --sparam wait-time=30 -L "rscunit=rscunit_ft01,rscgrp=int,node=2,elapse=0:30:00"
First, MPI binaries built in the Fugaku environment require the runtime libraries included in TCS, so we bind the host-side directory /opt/FJSVxtclanga with the -B option or via the environment variable SINGULARITY_BIND. The sockets needed for communication are created in /run and /var/opt/FJSVtcs, so we bind those as well. Furthermore, information about CPU core affinity (assignment) resides in /sys/devices/system/cpu; although Singularity shares /sys by default, it is remounted, which loses this configuration information. The cgroup information is in a similar situation, so we include all of these together.
export SINGULARITY_BIND='/opt/FJSVxtclanga,/var/opt/FJSVtcs,/run,/sys/devices/system/cpu,/sys/fs/cgroup'
or
mpiexec -np 2 singularity exec -B '/opt/FJSVxtclanga,/var/opt/FJSVtcs,/run,/sys/devices/system/cpu,/sys/fs/cgroup' sample.sif ./appbin
Singularity imports almost all host-side environment variables at runtime, but there are a few exceptions, the most notable being PATH and LD_LIBRARY_PATH. As a result, the libraries in the /opt/FJSVxtclanga directory you bound at runtime will not be found. To address this, you have two options at image creation time: embed these environment variables directly in %environment, or derive them from other variables within %runscript. Here is an example of each:
%environment
export LD_LIBRARY_PATH=/opt/FJSVxtclanga/tcsds-1.2.39/lib64:$LD_LIBRARY_PATH
or
%runscript
export LD_LIBRARY_PATH=$FJSVXTCLANGA/lib64:$LD_LIBRARY_PATH
export PATH=$FJSVXTCLANGA/bin:$PATH
application
Alternatively, you can pass them with the --env option, like this.
mpiexec -np 2 singularity exec --env LD_LIBRARY_PATH=$LD_LIBRARY_PATH sample.sif ./appbin
The standard output and error output are not displayed on the screen, but are saved in the directory output.<JOBID> in your home directory. For this testing purpose, we used a sample code that outputs information for each rank.
[a0XXXX@f34-0008c ~] mpifcc -o sample sample.c
[a0XXXX@f34-0008c ~] export SINGULARITY_BIND='/opt/FJSVxtclanga,/var/opt/FJSVtcs,/run,/sys/devices/system/cpu,/sys/fs/cgroup'
[a0XXXX@f34-0008c ~] mpiexec -n 2 singularity exec --env LD_LIBRARY_PATH=$LD_LIBRARY_PATH sample.sif ./sample
[a0XXXX@f34-0008c ~] logout
login1$ cd output.30999999/0/1/
login1$ ls -l
total 8
-rw-r--r-- 1 a0XXXX rccs-aot 65 Feb 1 14:59 stdout.1.0
-rw-r--r-- 1 a0XXXX rccs-aot 65 Feb 1 14:59 stdout.1.1
login1$ cat stdout.1.0
Hello world from processor f34-0008c, rank 0 out of 2 processors
login1$ cat stdout.1.1
Hello world from processor f34-0000c, rank 1 out of 2 processors
In this section, we covered only the minimum requirements for multi-node MPI parallel execution in the FUGAKU environment, in order to verify that it works.