1. Introduction

1.1. What is This Document

This document explains tuning techniques that are broadly applicable to other programs, based on real cases presented in “Meetings for application code tuning on A64FX computer systems”. The techniques are practical ones, drawn from experience with the real application programs listed below:

Application area          Program name
------------------------  ----------------------------------
Electromagnetic           WumingPIC2D
Fluid dynamics            FFVHC-ACE, FrontFlow/X, Nek5000/RS
Molecular dynamics        GENESIS, GROMACS, LAMMPS
Quantum chromodynamics    LQCD
Weather, climate          SCALE

The techniques are grouped by objective, i.e., by the tuning effect they target, so that readers can identify candidate techniques from their programs’ profiling data, such as CPU performance reports.

1.2. Structure of This Document

This document explains eight techniques, grouped by the following three objectives:

  1. Promoting Vectorization

  2. Reduction of Waiting Time for Calculation

  3. Reduction of Waiting Time for Cache Access

The explanation of each technique consists of the following parts:

  • Motivation to apply the technique

  • Applied example showing performance improvement

  • Reference links to real cases presented in “Meetings for application code tuning on A64FX computer systems”

  • Reference links to related information such as compiler user’s guides and programming guides

Readers who have already profiled their program are encouraged to look for techniques that match their program’s behavior in terms of the above objectives.

Interested readers can learn more by following each technique’s reference links to related information, such as documents published in “Meetings for application code tuning on A64FX computer systems” and tuning advice in programming guides.

1.3. Environment for Performance Measurement

The performance data shown in this document was measured under the following conditions. Although the C/C++ compilers were used in trad mode, the ideas behind the techniques explained in this document also apply in clang mode.

Measured date                                  November 2023
Machine                                        Supercomputer Fugaku
Language environment                           Fujitsu Fortran/C/C++ Compiler 4.9.0 (tcsds-1.2.37)
Compiler optimization flags                    -Kfast,openmp,ocl
Number of processes and threads at run time    4 processes, 12 threads per process

CPU performance reports were used to observe the performance improvements achieved by the explained techniques. For how to use these reports, please refer to the following documents, such as the profiler user’s guide.

Notice: Access rights for Fugaku User Portal are required to read the above documents.