Introduction

This article is a tuning case study for Fugaku, based on a reproduction of a work by Dr. Tsuji at RIKEN R-CCS. For the original literature, see [1] (in Japanese) and [2].

What it computes

We examine the “axhelm” kernel of a spectral element method (a family of finite element methods). It has 8³ points per hexahedral element. To be precise, the order is variable, but in this tuning we fix it to 8, which matches the double-precision SIMD length of the 512-bit SVE.

Multiplying an 8×8 differentiation matrix along each axis gives the gradients. This is a quadruple loop over 8⁴ iterations (loop-2a).
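
Written out in index notation (matching loop-2a in the baseline code below, with \(u_{kji}\) the value of \(\vec u\) at point \((i,j,k)\) of an element and \(D\) the 8×8 differentiation matrix), the gradient computation is

\[(q_r)_{kji} = \sum_{m} D_{im}\, u_{kjm}, \qquad (q_s)_{kji} = \sum_{m} D_{jm}\, u_{kmi}, \qquad (q_t)_{kji} = \sum_{m} D_{km}\, u_{mji}.\]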

Next, we multiply by a 3×3 symmetric matrix of geometry factors. It connects the internal coordinates of the element with the external ones, including the weighting by volume and density. This part is a triple loop over 8³ points, with a relatively large data fetch of 6×8³ words (loop-2b).
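
Point by point, loop-2b applies the symmetric matrix of geometry factors (written with the same \(G\) notation as the matrix \(A\) below; only the six independent entries are stored per point):

\[\begin{split}\begin{pmatrix} Gq_r \\ Gq_s \\ Gq_t \end{pmatrix} = \begin{pmatrix} G_{11} & G_{12} & G_{13} \\ G_{12} & G_{22} & G_{23} \\ G_{13} & G_{23} & G_{33} \end{pmatrix} \begin{pmatrix} q_r \\ q_s \\ q_t \end{pmatrix}.\end{split}\]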

Loop-2a and loop-2b are fused to save space for intermediate variables. Between them and loop-3, we need storage of 3×8³ words (3 × 512 × 8 bytes = 12 KiB), which fits into the 64 KiB L1 data cache.

Finally, we multiply by the transpose of the gradient matrices (loop-3).
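
In the same index notation, this last step (loop-3 in the code below) is

\[(A u)_{kji} = \sum_{m} D_{mi}\, (Gq_r)_{kjm} + \sum_{m} D_{mj}\, (Gq_s)_{kmi} + \sum_{m} D_{mk}\, (Gq_t)_{mji}.\]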

Mathematically, this kernel applies a matrix

\[\begin{split}A= \begin{pmatrix} D_1 \\ D_2 \\ D_3 \end{pmatrix}^T \begin{pmatrix} G_{11} & G_{12} & G_{13} \\ G_{12} & G_{22} & G_{23} \\ G_{13} & G_{23} & G_{33} \end{pmatrix} \begin{pmatrix} D_1 \\ D_2 \\ D_3 \end{pmatrix},\end{split}\]

from the left to compute \(A \vec{u}\).

To better clarify the role of multiplying the transposed gradients, consider the following inner product

\[(\nabla \vec v, \nabla \vec u) = \vec v^T A \vec u.\]

The kernel omits the final step of multiplying by \(\vec v^T\). For more detailed formulations, see [3].

Organization

This article starts with the “as-is” tune-0; tune-1 improves the SIMD conversion. Tune-2 avoids gather instructions by making accesses contiguous, and tune-3 promotes software pipelining by collapsing loops. Finally, tune-4 improves memory access performance by enabling prefetch instruction insertion. Tune-1a introduces a SIMD conversion equivalent to tune-1 without using directives.

As-is performance (tune-0)

Below is the baseline code.

axhelm-0.cpp
 49extern "C" void axhelm_bk_v0(
 50        dlong Nelements,                   // dlong=int, dfloat=double
 51        const dlong & offset,              // not used    
 52        const dfloat* __restrict__ ggeo,   // [Nelements][7][8][8][8]
 53        const dfloat* __restrict__ D,      // [8][8]
 54        const dfloat* __restrict__ lambda, // not used
 55        const dfloat* __restrict__ q,      // [Nelements][8][8][8]
 56        dfloat* __restrict__ Aq )          // [Nelements][8][8][8]
 57{
 58    dfloat s_q  [p_Nq][p_Nq][p_Nq]; // #define p_Nq 8
 59    dfloat s_Gqr[p_Nq][p_Nq][p_Nq];
 60    dfloat s_Gqs[p_Nq][p_Nq][p_Nq];
 61    dfloat s_Gqt[p_Nq][p_Nq][p_Nq];
 62
 63    dfloat s_D[p_Nq][p_Nq];
 64
 65    for(int j = 0; j < p_Nq; ++j)
 66        for(int i = 0; i < p_Nq; ++i)
 67            s_D[j][i] = D[j * p_Nq + i];
 68
 69#pragma omp parallel for private(s_q, s_Gqr, s_Gqs, s_Gqt) firstprivate(s_D)
 70    for(dlong e = 0; e < Nelements; ++e) {
 71        const dlong element = e;
 72        for(int k = 0; k < p_Nq; ++k) for(int j = 0; j < p_Nq; ++j) // loop-1
 73            for(int i = 0; i < p_Nq; ++i) {
 74                const dlong base = i + j * p_Nq + k * p_Nq * p_Nq + element * p_Np;
 75                const dfloat qbase = q[base];
 76                s_q[k][j][i] = qbase;
 77            }
 78
 79        for(int k = 0; k < p_Nq; ++k) for(int j = 0; j < p_Nq; ++j) // loop-2
 80        {
 81            for(int i = 0; i < p_Nq; ++i) {
 82                // const dlong id = i + j * p_Nq + k * p_Nq * p_Nq + element * p_Np;
 83                const dlong gbase = element * p_Nggeo * p_Np + k * p_Nq * p_Nq + j * p_Nq + i;
 84                const dfloat r_G00 = ggeo[gbase + p_G00ID * p_Np]; // #define p_G00ID 1
 85                const dfloat r_G01 = ggeo[gbase + p_G01ID * p_Np]; // #define p_G01ID 2
 86                const dfloat r_G11 = ggeo[gbase + p_G11ID * p_Np]; // #define p_G11ID 3
 87                const dfloat r_G12 = ggeo[gbase + p_G12ID * p_Np]; // #define p_G12ID 4
 88                const dfloat r_G02 = ggeo[gbase + p_G02ID * p_Np]; // #define p_G02ID 5
 89                const dfloat r_G22 = ggeo[gbase + p_G22ID * p_Np]; // #define p_G22ID 6
 90
 91                dfloat qr = 0.f;
 92                dfloat qs = 0.f;
 93                dfloat qt = 0.f;
 94                // loop-2a (gradient)
 95                for(int m = 0; m < p_Nq; ++m) {
 96                    qr += s_D[i][m] * s_q[k][j][m];
 97                    qs += s_D[j][m] * s_q[k][m][i];
 98                    qt += s_D[k][m] * s_q[m][j][i];
 99                }
100                // loop-2b (geometry factor)
101                s_Gqr[k][j][i] = r_G00 * qr + r_G01 * qs + r_G02 * qt;
102                s_Gqs[k][j][i] = r_G01 * qr + r_G11 * qs + r_G12 * qt;
103                s_Gqt[k][j][i] = r_G02 * qr + r_G12 * qs + r_G22 * qt;
104            }
105        }
106
107        for(int k = 0; k < p_Nq; ++k) for(int j = 0; j < p_Nq; ++j) // loop-3 (grad^T)
108        {
109            for(int i = 0; i < p_Nq; ++i) {
110                const dlong id = element * p_Np + k * p_Nq * p_Nq + j * p_Nq + i;
111                dfloat r_Aqr = 0, r_Aqs = 0, r_Aqt = 0;
112
113                for(int m = 0; m < p_Nq; ++m) {
114                    r_Aqr += s_D[m][i] * s_Gqr[k][j][m];
115                    r_Aqs += s_D[m][j] * s_Gqs[k][m][i];
116                    r_Aqt += s_D[m][k] * s_Gqt[m][j][i];
117                }
118
119                Aq[id] = r_Aqr + r_Aqs + r_Aqt; 
120            }
121        }
122    }
123}

Compiling with the Fujitsu compiler 4.8.0 (tcsds-1.2.35) in trad mode with the options -Kfast,openmp,ocl,optmsg=2, and running at a 2.0 GHz frequency, results in a performance of 126 GFLOP/s per node.

We use the standard configuration of 4 MPI processes per node and 12 threads per process. The problem size is 7680 elements. The GFLOP/s number is calculated by the program from the theoretical operation count and can differ from what the profiler reports.
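
For reference, a minimal sketch of how such a theoretical count can be derived by hand from the baseline loops above (an illustration only; the benchmark's own counter may use a different formula):

// Per grid point:
//   loop-2a: 3 dot products of length 8      -> 3 * 8 * 2 = 48 flops
//   loop-2b: 3 rows of (3 mul + 2 add)       -> 15 flops
//   loop-3 : 3 dot products + 2 final adds   -> 50 flops
static double axhelm_gflops(long nelements, double elapsed_seconds)
{
    const double flops_per_point = 48.0 + 15.0 + 50.0;    // = 113
    const double points = (double)nelements * 8 * 8 * 8;  // p_Np = 512
    return points * flops_per_point / elapsed_seconds * 1e-9;
}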

Let us first examine SIMD conversion.

SIMD conversion (tune-1)

For the code above, the compiler outputs the following optimization messages (excerpt):

       jwd8222o-i  "axhelm-0.cpp", line 78: The prefetch instructions were generated because the number of prefetch required in this loop is greater than the allowable number of hardware-prefetch.
       jwd8662o-i  "axhelm-0.cpp", line 78: This loop is not software pipelined because no schedule is obtained.
       jwd8203o-i  "axhelm-0.cpp", line 80: Loop full unrolling is applied to this loop.
       jwd6004s-i  "axhelm-0.cpp", line 94: SIMD conversion is applied to this loop with the loop variable 'm'. The loop contains a reduction operation.
       jwd8209o-i  "axhelm-0.cpp", line 100: Evaluation order of polynomial expression is changed according to commutative law of addition and multiplication.
       jwd8668o-i  "axhelm-0.cpp", line 106: This loop is not software pipelined because the loop contains too many instructions.
       jwd8203o-i  "axhelm-0.cpp", line 106: Loop full unrolling is applied to this loop.
       jwd8203o-i  "axhelm-0.cpp", line 108: Loop full unrolling is applied to this loop.
       jwd6004s-i  "axhelm-0.cpp", line 112: SIMD conversion is applied to this loop with the loop variable 'm'. The loop contains a reduction operation.

We can see that the compiler applies SIMD conversion to the innermost loop index m. Although vectorizing the innermost loop is standard behavior for most compilers, this is not an optimal SIMD conversion for the following reasons.

  • A dot product of two length-8 vectors needs one SIMD multiplication and one horizontal summation instruction, which is not efficient (if the vector length were sufficiently long, a single horizontal summation after many SIMD multiply-add operations would be acceptable); see the sketch after this list.

  • SIMD conversion is applied only to the gradient calculation part; no SIMD instructions are used for the geometry factor part.
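
To make the first point concrete, here is roughly what the tune-0 vectorization does for the contiguous dot product (a sketch with ACLE SVE intrinsics, assuming a 512-bit vector length as on A64FX; this is not the compiler's actual output):

#include <arm_sve.h>

// one length-8 dot product: one SIMD multiply, then one horizontal summation
static double dot8(const double *a, const double *b)
{
    svbool_t    pg = svptrue_b64();              // all 8 double lanes active
    svfloat64_t va = svld1_f64(pg, a);
    svfloat64_t vb = svld1_f64(pg, b);
    svfloat64_t prod = svmul_f64_x(pg, va, vb);  // one SIMD multiplication
    return svaddv_f64(pg, prod);                 // one horizontal (serial) sum
}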

For this reason, we focus on applying SIMD conversion not to the innermost loop index m but to the next outer loop index i. A simple way to do this is to unroll the innermost m-loop. Fortunately, a compiler directive is available for this purpose, so we do not need to hand-unroll it. The modified loops read:

axhelm-1.cpp
94#pragma loop fullunroll_pre_simd
95                for(int m = 0; m < p_Nq; ++m) {
96                    qr += s_D[i][m] * s_q[k][j][m];
97                    qs += s_D[j][m] * s_q[k][m][i];
98                    qt += s_D[k][m] * s_q[m][j][i];
99                }
113#pragma loop fullunroll_pre_simd
114                for(int m = 0; m < p_Nq; ++m) {
115                    r_Aqr += s_D[m][i] * s_Gqr[k][j][m];
116                    r_Aqs += s_D[m][j] * s_Gqs[k][m][i];
117                    r_Aqt += s_D[m][k] * s_Gqt[m][j][i];
118                }

For these two parts, the compiler first unrolls the innermost m-loops and then applies SIMD conversion to the outer i-loops. The compiler messages are as follows.

  jwd8203o-i  "axhelm-1.cpp", line 71: Loop full unrolling is applied to this loop.
  jwd6001s-i  "axhelm-1.cpp", line 72: SIMD conversion is applied to this loop with the loop variable 'i'.
  jwd8668o-i  "axhelm-1.cpp", line 78: This loop is not software pipelined because the loop contains too many instructions.
  jwd8203o-i  "axhelm-1.cpp", line 78: Loop full unrolling is applied to this loop.
  jwd6001s-i  "axhelm-1.cpp", line 80: SIMD conversion is applied to this loop with the loop variable 'i'.
  jwd8203o-i  "axhelm-1.cpp", line 95: Loop full unrolling is applied to this loop.
  jwd8662o-i  "axhelm-1.cpp", line 107: This loop is not software pipelined because no schedule is obtained.
  jwd8203o-i  "axhelm-1.cpp", line 107: Loop full unrolling is applied to this loop.
  jwd6001s-i  "axhelm-1.cpp", line 109: SIMD conversion is applied to this loop with the loop variable 'i'.
  jwd8203o-i  "axhelm-1.cpp", line 114: Loop full unrolling is applied to this loop.

This change yields a performance of 329 GFLOP/s.

An equivalent SIMD conversion without directives is described in Alternative solution (tune-1a).
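
To give a rough idea, a directive-free version of loop-2a could look like the following sketch (an assumption for illustration; the actual tune-1a code may differ): the m-loop moves outside, and whole i-vectors are accumulated, so the compiler vectorizes the contiguous inner i-loop by itself.

                dfloat qr[p_Nq] = {0}, qs[p_Nq] = {0}, qt[p_Nq] = {0};
                for(int m = 0; m < p_Nq; ++m) {
                    for(int i = 0; i < p_Nq; ++i) {
                        qr[i] += s_D[i][m] * s_q[k][j][m];
                        qs[i] += s_D[j][m] * s_q[k][m][i];
                        qt[i] += s_D[k][m] * s_q[m][j][i];
                    }
                }
                // loop-2b then uses qr[i], qs[i], qt[i] inside its own i-loop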

Here is a comparison of the statistics and time breakdowns of tune-0 and tune-1 from the Fujitsu Advanced Performance Profiler (FAPP). The statistics per CMG (1 MPI process, 12 threads) are as follows.

Statistics (per CMG)                                  Tune-0      Tune-1
Execution time (s)                                    3.13E-01    1.09E-01
Gflops                                                65.71       96.30
Floating-point operation peak ratio (%)               8.56        12.54
Memory throughput (GB/s)                              24.95       70.31
Memory throughput peak ratio (%)                      9.75        27.42
Effective instruction                                 7.97E+09    2.08E+09
Floating-point operation                              2.05E+10    1.05E+10
SIMD instruction rate (%) (/Effective instruction)    57.06       75.41
SVE operation rate (%)                                91.87       100.0
Floating-point pipeline active element rate (%)       92.29       91.77
IPC                                                   1.06        0.79
GIPS                                                  25.50       19.06

Execution time is the most objective indicator and is reduced to about one-third. The increase in the Gflops number is relatively moderate because tune-0 counts more floating-point operations than it should, due to the horizontal summation instructions. The number of floating-point operations does not change in the subsequent tunings below. Overall, reducing the instruction count by SIMD-converting almost all floating-point operations in tune-1 is what decreases the total execution time.

Breakdowns of execution times are as follows.

_images/tune0_035.png

Breakdown of tune-0

_images/tune1_035.png

Breakdown of tune-1

So far, we have seen that the reduction in the total instruction count has reduced execution time. However, there remains room for improvement in tune-1:

  • Light purple-colored “floating-point operation wait” time occupies a significant fraction of the total time. Compiler-based scheduling, such as software pipelining, can improve this (tune-3).

  • Red-colored “floating-point load memory access wait” also takes a certain fraction of the total execution time. Ultimately, this kernel will end up memory-bound. However, the current memory busy rate is only 24.42%. Inserting software prefetch may improve this part (tune-4).

Avoiding gathers by contiguous access (tune-2)

In the previous SIMD conversion, some inefficient memory accesses remain.

axhelm-1.cpp
95                for(int m = 0; m < p_Nq; ++m) {
96                    qr += s_D[i][m] * s_q[k][j][m];
97                    qs += s_D[j][m] * s_q[k][m][i];
98                    qt += s_D[k][m] * s_q[m][j][i];
99                }

This is a vectorization on the index [i], so the first line is a vector-scalar multiplication, and the other two lines are scalar-vector multiplications. The array reference s_D[i][m] in the first line includes the index [i] and can be vectorized; however, [i] is not the innermost index, resulting in strided or gather memory accesses. The gather load instruction can be found in the compiler's assembly output:

/*     96 */    index   z17.d, 0, 8
/*     96 */    ld1d    {z16.d}, p0/z, [x16, z17.d, lsl #3]     //  "s_D"

Even without reading the assembly output, one can identify the gather-instruction issue from a FAPP report.

_images/tune1_xls.png

FAPP report for tune-1

Fortunately, this matrix is tiny, 8×8, and shared by all elements. Thus, we can prepare a transposed copy at the function's entry. The following code changes implement this tuning.

axhelm-2.cpp
62    dfloat s_D[p_Nq][p_Nq], s_T[p_Nq][p_Nq];
63
64    for(int j = 0; j < p_Nq; ++j)
65        for(int i = 0; i < p_Nq; ++i){
66            s_D[j][i] = D[j * p_Nq + i];
67            s_T[i][j] = D[j * p_Nq + i];
68        }
69
70#pragma omp parallel for private(s_q, s_Gqr, s_Gqs, s_Gqt) firstprivate(s_D, s_T)
 97                for(int m = 0; m < p_Nq; ++m) {
 98                    qr += s_T[m][i] * s_q[k][j][m];
 99                    qs += s_D[j][m] * s_q[k][m][i];
100                    qt += s_D[k][m] * s_q[m][j][i];
101                }

A new array s_T[i][j] contains the transpose of the original matrix s_D[j][i]. This change yielded a performance of 358 GFLOP/s (+8.8% from tune-1).

Software pipelining by loop collapsing (tune-3)

The inner i-loop with the SIMD length of 8 is efficiently vectorized. Now we focus on the next outer j-loop for better instruction scheduling, such as software pipelining. The j-loop has a fixed length of 8, known at compile time, so in the current version the compiler fully unrolls it. The compiler leaves the k-loop unmodified because its loop body is already large.

Here, we collapse the j-loop and k-loop into a single loop of length 64 to improve the scheduling. OpenMP supports loop collapsing for cases where the outermost loop is not long enough for thread parallelization, but there seems to be no directive for collapsing loops without thread parallelization, so we do it by hand.

axhelm-3.cpp
84        for(int kj = 0; kj < p_Nq*p_Nq; ++kj) {
85            const int j = kj % p_Nq;
86            const int k = kj / p_Nq;

The same change is applied to the three loop nests, and the compiler messages are:

  jwd8204o-i  "axhelm-3.cpp", line 73: This loop is software pipelined.
  jwd8205o-i  "axhelm-3.cpp", line 73: The software-pipelined loop is chosen at run time when the iteration count is greater than or equal to 64.
  jwd6001s-i  "axhelm-3.cpp", line 77: SIMD conversion is applied to this loop with the loop variable 'i'.
  jwd8204o-i  "axhelm-3.cpp", line 84: This loop is software pipelined.
  jwd8205o-i  "axhelm-3.cpp", line 84: The software-pipelined loop is chosen at run time when the iteration count is greater than or equal to 48.
  jwd6001s-i  "axhelm-3.cpp", line 88: SIMD conversion is applied to this loop with the loop variable 'i'.
  jwd8203o-i  "axhelm-3.cpp", line 103: Loop full unrolling is applied to this loop.
  jwd8204o-i  "axhelm-3.cpp", line 115: This loop is software pipelined.
  jwd8205o-i  "axhelm-3.cpp", line 115: The software-pipelined loop is chosen at run time when the iteration count is greater than or equal to 48.
  jwd6001s-i  "axhelm-3.cpp", line 119: SIMD conversion is applied to this loop with the loop variable 'i'.
  jwd8203o-i  "axhelm-3.cpp", line 124: Loop full unrolling is applied to this loop.

This modification resulted in 374 GFLOP/s (+4.5% from tune-2).

Let us compare the FAPP report with tune-2.

_images/tune2_09.png

Breakdown of tune-2

_images/tune3_09.png

Breakdown of tune-3

At a glance, the light purple-colored “floating-point operation wait” is much reduced, but we get more blue-colored “4 instruction commit” instead, resulting in only a slight improvement in the total time. This is due to the increased integer instructions in the software-pipelined version: the IPC changes from 1.01 to 2.01 and the GIPS per core from 2.03 to 4.02, that is, one more instruction per CPU cycle. This does not mean that the additional integer instructions cancel the improvement from software pipelining; rather, the density of floating-point operations increases slightly because the extra integer instructions execute concurrently.

The scheduling improvement is limited because the j-loop, with its fixed length of 8, was already fully unrolled in tune-1, so moderate scheduling was already available. Although the performance gain of this tuning is small, the loop collapse introduced here is essential for the subsequent tuning.

Enabling software prefetch (tune-4)

A redundant address calculation remains in the code up to this point.

axhelm-3.cpp
84        for(int kj = 0; kj < p_Nq*p_Nq; ++kj) {
85            const int j = kj % p_Nq;
86            const int k = kj / p_Nq;
87
88            for(int i = 0; i < p_Nq; ++i) {
89                // const dlong id = i + j * p_Nq + k * p_Nq * p_Nq + element * p_Np;
90                const dlong gbase = element * p_Nggeo * p_Np + k * p_Nq * p_Nq + j * p_Nq + i;
91                const dfloat r_G00 = ggeo[gbase + p_G00ID * p_Np];
92                const dfloat r_G01 = ggeo[gbase + p_G01ID * p_Np];
93                const dfloat r_G11 = ggeo[gbase + p_G11ID * p_Np];
94                const dfloat r_G12 = ggeo[gbase + p_G12ID * p_Np];
95                const dfloat r_G02 = ggeo[gbase + p_G02ID * p_Np];
96                const dfloat r_G22 = ggeo[gbase + p_G22ID * p_Np];

Here, ggeo corresponds to a multidimensional array double ggeo[Nelements][7][512] (p_Np=512). After the loops are collapsed, the address can be directly calculated from the loop variable kj (since k * p_Nq * p_Nq + j * p_Nq == kj * p_Nq).

axhelm-4.cpp
84        for(int kj = 0; kj < p_Nq*p_Nq; ++kj) {
85            const int j = kj % p_Nq;
86            const int k = kj / p_Nq;
87
88            for(int i = 0; i < p_Nq; ++i) {
89                // const dlong id = i + j * p_Nq + k * p_Nq * p_Nq + element * p_Np;
90                const dlong gbase = element * p_Nggeo * p_Np + kj * p_Nq + i;

Although it appears to be only a tiny saving in integer operations, the performance gain is significant, reaching 606 GFLOP/s (+62% from tune-3). This is because the address calculation becomes linear in the loop variable kj, enabling the compiler to emit software prefetch instructions. This time, we could not find any report of prefetch-instruction insertion in the compiler's optimization messages, but it can be confirmed from the assembly output.

Excerpt from “$ grep -n prfm axhelm-4.s”
     1313:/*     91 */       prfm    2, [x30, 824]  // PLDL2KEEP for 00010
     1320:/*     91 */       prfm    0, [x30, 312]  // PLDL1KEEP for 00000
     4324:/*    130 */       prfm    18, [x12, 760] // PSTL2KEEP for 10010
     4325:/*    130 */       prfm    16, [x12, 248] // PSTL1KEEP for 10000

For the details of these instructions, see [4].

In the original literature [1] [2], prefetch instructions were inserted manually in the final stage of tuning, while in this version the compiler inserts them.
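
For reference only, such manual insertion could be written, for example, with GCC-style __builtin_prefetch (a sketch under that assumption, not the actual code from [1] [2]; the prefetch distance of four kj-iterations is also just a guess):

        for(int kj = 0; kj < p_Nq*p_Nq; ++kj) {
            const dlong gbase = element * p_Nggeo * p_Np + kj * p_Nq;
            const dlong ahead = gbase + 4 * p_Nq;  // a few iterations ahead
            __builtin_prefetch(&ggeo[ahead + p_G00ID * p_Np], 0 /*read*/, 3 /*keep*/);
            __builtin_prefetch(&ggeo[ahead + p_G22ID * p_Np], 0 /*read*/, 3 /*keep*/);
            // ... the rest of the loop body is unchanged ...
        }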

The FAPP report for this version is as follows.

_images/tune4_09.png

Breakdown of tune-4

The red-colored memory-access wait time is improved. Also, the memory busy time (red vertical bar on the left) is reduced to below 3.0E-02 s, indicating that time spent on memory accesses is reduced by prefetching.


Summary

Below are the kernels, tunings used, and their performances.

Kernel      Tuning                  GFLOP/s
tune-0      as-is                   126
tune-1      SIMD conversion         329
tune-1a     SIMD (alternative)      321
tune-2      Continuous access       358
tune-3      Software pipelining     374
tune-4      Prefetch                606

The most significant improvement was achieved by the first tuning: SIMD conversion on an appropriate loop index. The final tuning, which enables compiler-generated prefetching, is also highly effective, and it owes its applicability to the loop collapsing introduced in tune-3.

References