3. Tuning details and results

This chapter describes the details and results of the tuning performed according to the procedure in Section 2.2 (Procedure of tuning).

3.1. Evaluation of the performance

To provide a baseline for comparison with the tuned version, the execution time of the Application was measured before tuning. In this tuning work, the Application was divided into measurement regions by reference to the log files output by the Application, and the execution time of each measurement region was also measured to make the effects of tuning easier to evaluate. Accordingly, whenever this document discusses the execution time of the Application, it reports both the execution time of the entire Application and the execution time of each measurement region.

The Application is divided into three measurement regions: “Solver”, “Limiter functions processes”, and “Remnants of the entire Application”. The execution time of “Remnants of the entire Application” is defined as the execution time of the entire Application minus the execution times of “Solver” and “Limiter functions processes”.

“Solver” was further divided into four portions: “Solving the system of equations”, “Making the system of equations”, “Processing the equations other than system of equations”, and “Remnants of Solver”. The execution time of “Remnants of Solver” is defined as the execution time of “Solver” minus the execution times of the other three portions.
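The two "Remnants" definitions above are simple subtractions, which can be sketched as follows. The timings are hypothetical placeholders chosen for illustration, not values measured in this tuning work, and the helper name remnant_time is an assumption of this sketch.

```python
def remnant_time(total: float, measured_parts: dict[str, float]) -> float:
    """Return the part of `total` not covered by the separately measured parts."""
    remnant = total - sum(measured_parts.values())
    if remnant < 0:
        raise ValueError("measured parts exceed the total time")
    return remnant


# Hypothetical timings in seconds (NOT values from this tuning work).
entire_application = 100.0
solver = 60.0
limiter = 10.0
solver_parts = {
    "Solving the system of equations": 40.0,
    "Making the system of equations": 10.0,
    "Processing the equations other than system of equations": 5.0,
}

# "Remnants of the entire Application" = entire Application - Solver - Limiter.
remnants_of_application = remnant_time(
    entire_application,
    {"Solver": solver, "Limiter functions processes": limiter},
)
# "Remnants of Solver" = Solver - the other three Solver portions.
remnants_of_solver = remnant_time(solver, solver_parts)

print(remnants_of_application, remnants_of_solver)
```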

The following table shows the execution time of the initial version for the entire Application and for each measurement region. As seen in the table, “Solving the system of equations” is the largest measurement region, accounting for 43% of the execution time of the entire Application.

../_images/table1.png

3.2. Cost of each function

To identify the target functions for tuning, the cost of each function in the initial version, which is proportional to the execution time of the function, was measured by sampling analysis using fipp. Under the execution conditions used for this tuning work, fipp reported costs for 1645 functions in the initial version.

The following table lists, as samples, the ten functions with the highest cost reported by fipp. Each column in the table represents the following:

  • Function: the name of the function

  • Measurement region: the name of the measurement region that includes the function in the column “Function”

  • Cost: the cost of each function output by fipp

  • Percentage of the cost: the percentage of each function’s cost in relation to the total cost of the Application

  • Cumulative percentage of the cost: the cumulative percentage of the cost for each function from the first one

# | Function | Measurement region | Cost | Percentage of the cost [%] | Cumulative percentage of the cost [%]
--|----------|--------------------|------|----------------------------|--------------------------------------
0 | (the entire Application) | | 10145395 | 100.00 |
1 | calc_function_1 | Solving the system of equations | 2359630 | 23.26 | 23.26
2 | function_of_MPI_1 | (Related to wait time of MPI communications) | 1818309 | 17.92 | 41.18
3 | function_of_MPI_2 | (Related to wait time of MPI communications) | 1090237 | 10.75 | 51.93
4 | make_function_1 | Making the system of equations | 359086 | 3.54 | 55.47
5 | make_function_2 | Making the system of equations | 323755 | 3.19 | 58.66
6 | limiter_function_1 | Limiter functions processes | 290204 | 2.86 | 61.52
7 | calc_function_2 | Solving the system of equations | 187961 | 1.85 | 63.37
8 | make_function_3 | Making the system of equations | 176418 | 1.74 | 65.11
9 | calc_function_3 | Solving the system of equations | 165032 | 1.63 | 66.74
10 | make_function_4 | Making the system of equations | 156562 | 1.54 | 68.28

As seen in the table, each of the top three functions accounts for more than 10% of the total cost, and together they account for more than 50% of the cost of the entire Application.

The function with the highest cost was “calc_function_1”, which belongs to the measurement region “Solving the system of equations” and accounts for 23% of the total cost. The functions “function_of_MPI_1” and “function_of_MPI_2” followed “calc_function_1”. However, because their cost reflects the wait time of MPI communications, these functions could not be tuned directly.

Beyond the top three functions, the cost of the fourth function was 3.54% of the total, and that of the tenth function was only 1.54%. In other words, most functions in the initial version each accounted for no more than a few percent of the total cost.
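The percentage and cumulative-percentage columns of the table above can be recomputed from the raw costs. The following is a minimal sketch of that arithmetic; the cost values are copied from the table, while the function name cost_shares and the rounding to two decimals are assumptions of this sketch and not part of fipp.

```python
from itertools import accumulate

TOTAL_COST = 10145395  # cost of the entire Application (row 0 of the table)

# (function, cost) pairs copied from rows 1 to 10 of the table.
top_costs = [
    ("calc_function_1", 2359630),
    ("function_of_MPI_1", 1818309),
    ("function_of_MPI_2", 1090237),
    ("make_function_1", 359086),
    ("make_function_2", 323755),
    ("limiter_function_1", 290204),
    ("calc_function_2", 187961),
    ("make_function_3", 176418),
    ("calc_function_3", 165032),
    ("make_function_4", 156562),
]


def cost_shares(costs, total):
    """Return (name, percent, cumulative percent) rows, rounded to two decimals."""
    running = accumulate(cost for _, cost in costs)
    return [
        (name, round(100 * cost / total, 2), round(100 * cumulative / total, 2))
        for (name, cost), cumulative in zip(costs, running)
    ]


rows = cost_shares(top_costs, TOTAL_COST)
for name, percent, cumulative in rows:
    print(f"{name:20s} {percent:6.2f} {cumulative:6.2f}")
```

Computing the cumulative column from the running sum of raw costs, rather than by adding rounded percentages, avoids accumulating rounding error.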

3.3. Tuning of the Application

This section describes the tuning items and the Application performance measured after performing the tuning.

3.3.1. Tuning items

The following table represents the tuning details and target functions of all tuning items. Each column in the table represents the following:

  • Tuning #: item number for tuning items (Tuning items are assigned numbers in the order in which they were performed.)

  • Tuning outline: outline of each tuning item

  • Tuning method: the method for performing the tuning, such as specifying OCL(s) (Optimization Control Line) or changing compiler options

  • Classification of tuning: classification by reference to the “Programming Guide (Tuning)” that is posted on the user portal site

  • Target function: the name of target function of each tuning item

  • Measurement region: the name of the measurement region that includes the function in the column “Target function”

  • Section #: section number where the details of the tuning item are described

Tuning # | Tuning outline | Tuning method | Classification of tuning | Target function | Measurement region | Section #
---------|----------------|---------------|--------------------------|-----------------|--------------------|----------
1 | Loop collapse, and unrolling for loops with small iteration counts | Change the source code | Reduction in the number of instructions | calc_function_1 | Solving the system of equations |
2 | Specifying the prefetch instructions | Specify the OCL only | Improved data access waiting by hiding latency | calc_function_1 | Solving the system of equations |
3 | Sequential access of addition operations in loops | Change the source code | Improved data access waiting by hiding latency | calc_function_1 | Solving the system of equations |
4 | Loop unrolling | Change the source code | Reduction in the number of instructions, and Improved instruction scheduling with loop optimization | calc_function_1 | Solving the system of equations |
5 | Loop unrolling | Change the source code | Reduction in the number of instructions, and Improved instruction scheduling with loop optimization | calc_function_1 | Solving the system of equations |
6 | Suppressing the faddv instructions | Add extra compile options | Reduction in the number of instructions | calc_function_1 | Solving the system of equations |
7 | Loop unrolling | Specify the OCL only | Reduction in the number of instructions, and Improved instruction scheduling with loop optimization | make_function_2 | Making the system of equations |
 | | | | make_function_3 | Making the system of equations |
8 | Reordering off-diagonal elements in matrixes | Change the source code | Sequential access | calc_function_1 | Solving the system of equations |
9 | Suppression of SIMDization for loops with small iteration counts | Specify the OCL only | Reduction in the number of instructions | calc_function_3 | Solving the system of equations |
 | | | | make_function_1 | Making the system of equations |
 | | | | make_function_4 | Making the system of equations |
 | | | | make_function_5 | Making the system of equations |
 | | | | make_function_6 | Making the system of equations |
10 | SIMDization of division operations and suppression of SIMDization for loops with small iteration counts | Change the source code | Improved operation wait for facilitation of SIMDization | calc_function_3 | Solving the system of equations | Section 4.1
11 | Reducing load and store operations of data by loop unrolling | Change the source code | Reduction in the number of instructions, and Improved instruction scheduling with loop optimization | calc_function_1 | Solving the system of equations | Section 4.2
12 | Loop unswitching | Specify the OCL only | Improved data access waiting by hiding latency | make_function_5 | Making the system of equations |
13 | Loop unrolling | Specify the OCL only | Reduction in the number of instructions, and Improved instruction scheduling with loop optimization | make_function_2 | Making the system of equations |
14 | Changing the parameters of domain decomposition of input models | Change settings at execution | Improved the load balance between MPI processes | | |
15 | Removing extra type conversion instructions | Change the source code | Reduction in the number of instructions | make_function_2 | Making the system of equations |
16 | SIMDization by loop collapse | Change the source code | Improved operation wait for facilitation of SIMDization | make_function_6 | Making the system of equations | Section 4.3
17 | SIMDization by loop fission | Change the source code | Improved operation wait for facilitation of SIMDization | make_function_2 | Making the system of equations |
18 | Inline expansion | Change the source code | Reduction in the number of instructions | limiter_function_1 | Limiter functions processes |
 | | | | make_function_1 | Making the system of equations |
 | | | | make_function_4 | Making the system of equations |
 | | | | make_function_9 | Making the system of equations |
 | | | | calc_function_2 | Solving the system of equations |
 | | | | calc_function_3 | Solving the system of equations |
 | | | | calc_function_5 | Solving the system of equations |
19 | Changing the access direction of arrays | Change the source code | Improved data access waiting by hiding latency | othSolv_function_3 | Processing the equations other than system of equations | Section 4.4
20 | Movement of invariant expressions | Change the source code | Improve waiting for operations by hiding latency | make_function_8 | Making the system of equations |
21 | Loop fission | Change the source code | Reduction in the number of instructions, and Improved instruction scheduling with loop optimization | make_function_8 | Making the system of equations |
22 | Allocating some arrays in loops to register | Change the source code | Improved data access waiting by hiding latency | make_function_8 | Making the system of equations |
23 | Reducing the extra cost in calculation operations | Change the source code | Reduction in the number of instructions | make_function_8 | Making the system of equations |
24 | SIMDization by SVE ACLE | Change the source code | Improved operation wait for facilitation of SIMDization | calc_function_4 | Solving the system of equations | Section 4.5
25 | Built-in prefetch | Change the source code | Improved data access waiting by hiding latency | make_function_2 | Making the system of equations | Section 4.6
 | | | | make_function_3 | Making the system of equations |
 | | | | make_function_7 | Making the system of equations |
26 | Allocating some arrays to static arrays for SIMDization | Change the source code | Improved operation wait for facilitation of SIMDization | calc_function_1 | Solving the system of equations |
27 | Using CLONE specifier, and loop unrolling | Change the source code | Improved data access waiting by hiding latency | make_function_6 | Making the system of equations |
 | | | | make_function_12 | Making the system of equations |
28 | Moving division operations to outside of the loop, and applying SIMDization to the division operations | Change the source code | Improved operation wait for facilitation of SIMDization | make_function_7 | Making the system of equations | Section 4.7
29 | Built-in prefetch | Change the source code | Improved data access waiting by hiding latency | othSolv_function_1 | Processing the equations other than system of equations |
 | | | | othSolv_function_2 | Processing the equations other than system of equations |
 | | | | othSolv_function_5 | Processing the equations other than system of equations |
30 | SIMDization by inline expansion | Change the source code | Improved operation wait for facilitation of SIMDization | function_1 | Processing the equations other than system of equations |
 | | | | function_2 | Processing the equations other than system of equations |
31 | Thread parallelization | Change the source code | Thread parallelization | calc_function_1 | Solving the system of equations |
 | | | | make_function_7 | Making the system of equations |
32 | Improving load instructions scheduling | Change the source code | Improved instruction scheduling with loop optimization | othSolv_function_4 | Processing the equations other than system of equations |
33 | Moving invariant expressions to outside of the loop | Change the source code | Improved instruction scheduling with loop optimization | calc_function_2 | Solving the system of equations | Section 4.8
34 | Loop unrolling manually instead of OCLs | Change the source code | Reduction in the number of instructions, and Improved instruction scheduling with loop optimization | calc_function_4 | Solving the system of equations | Section 4.9
35 | Allocating some arrays in loops to register | Change the source code | Improved data access waiting by hiding latency | calc_function_4 | Solving the system of equations |
 | | | | calc_function_5 | Solving the system of equations |
36 | Loop interleaving | Change the source code | Improve waiting for operations by hiding latency | calc_function_1 | Solving the system of equations |
37 | Removing extra type conversion instruction | Change the source code | Reduction in the number of instructions | calc_function_1 | Solving the system of equations |
38 | Specifying the prefetch instructions | Change the source code | Improved data access waiting by hiding latency | calc_function_1 | Solving the system of equations |
39 | Loop collapse | Change the source code | Improved instruction scheduling with loop optimization | make_function_6 | Making the system of equations |
 | | | | make_function_5 | Making the system of equations |
40 | Using CLONE specifier, and inline expansion | Change the source code | Reduction in the number of instructions, and Improved instruction scheduling with loop optimization | make_function_6 | Making the system of equations |
 | | | | make_function_5 | Making the system of equations |
 | | | | make_function_7 | Making the system of equations |
41 | Improving the memory placement of two-dimensional arrays for sequential access | Change the source code | Sequential access | allocate_array | (the entire Application) | Section 4.10
 | | | | clear_array | (the entire Application) |
 | | | | deallocate_array | (the entire Application) |
 | | | | reallocate_array | (the entire Application) |
 | | | | allocate_array_2 | (the entire Application) |
 | | | | deallocate_array_2 | (the entire Application) |
 | | | | reallocate_array_2 | (the entire Application) |
42 | SIMDization based on the Tuning #41 | Change the source code | Improved operation wait for facilitation of SIMDization | calc_function_1 | Solving the system of equations |
43 | Using CLONE specifier | Change the source code | Improved instruction scheduling with loop optimization | make_function_7 | Making the system of equations |
44 | Suppression of SIMDization for loops which has built-in prefetch functions | Specify the OCL only | Reduction in the number of instructions | make_function_7 | Making the system of equations |

Of the forty-four tuning items, ten are described in detail as samples in Chapter 4 (Tuning items). These ten items are the ones that have an entry in the “Section #” column of the table above.

3.3.2. Tuning results

The following table shows the execution time of the initial version and of the tuned version (after tuning items #1 to #44 were applied), together with the performance improvement rate of the tuned version relative to the initial version. As seen in the table, the performance improvement rate of the entire Application is 58%. In other words, the execution time of the tuned version was less than half that of the initial version, so the target performance was achieved.

../_images/table4.png

In this tuning work, the performance of the Application was measured 13 times while the 44 tuning items were being applied, and the following graphs (Figure 1 and Figure 2) show the results. In Figure 1, the horizontal axis represents the n-th tuning item as listed in the table in Section 3.3.1 (Tuning items), and the vertical axis represents the execution time of the Application measured just after the n-th tuning item. Note that the height of each vertical bar and the number at its top indicate the execution time of the entire Application. In Figure 2, the horizontal axis is the same as in Figure 1, and the vertical axis represents the performance improvement rate of the version after the n-th tuning item relative to the initial version.

For example, the number “8” on the horizontal axis indicates that the Application was measured just after the 8th tuning item listed in the table in Section 3.3.1 (Tuning items). Hence, the data at position “8” on the horizontal axis in Figure 1 shows the execution time of the Application after tuning items #1 to #8 were applied. Note that the data at position 0 on the horizontal axis shows the data of the initial version.

Figure 1: The execution time of the entire Application measured just after performing the n-th tuning.

../_images/image1_1.png

Figure 2: The performance improvement rate, comparing the initial version and after performing the n-th tuning.

../_images/image1_2.png

As seen in Figure 2, over the 1st to the 13th tuning items the performance of the entire Application improved by about 34% and that of “Solving the system of equations” improved by about 68%. The first 13 tuning items targeted the top eight highest-cost functions; in particular, 7 of them targeted “calc_function_1”, the function with the highest cost in the initial version.

In Figure 2, the performance improvement rate of “the entire Application” rises steeply, from about 34% to about 40%, between the 13th and the 14th tuning items. The 14th tuning item reduced the load imbalance between processes by changing the execution parameters for the domain decomposition of the Application; it was performed at the suggestion of the ISV who developed the Application.

Additionally, the performance improvement rate of the entire Application rose by about a further 18 percentage points (from about 40% to 58%) through the 15th to the 44th tuning items, which targeted lower-cost functions (functions other than the top three in the table in Section 3.2). Consequently, the contribution of each of these tuning items to the performance improvement of the entire Application was smaller than that of each of the 1st to the 14th tuning items.

Focusing on the performance improvement rate for each measurement region in Figure 2, tuning items #1 to #13 contribute significantly to the performance improvement of “Solving the system of equations”. Similarly, #14 contributes to “Remnants of Solver”, #15 to #26 contribute to “Limiter functions processes”, and #28 to #30 contribute to “Processing the equations other than system of equations”.

In summary, 44 tuning items were performed, which led to the reduction of the execution time of the entire Application from 202.9 seconds to 85.0 seconds (about 2.4 times faster) and the achievement of the target performance (reduction of the execution time to less than half). The details are as follows:

  • 40 items were targeted for the top 30 functions that account for about 52% (except functions related to the MPI communications) of the entire Application.

  • 2 items were targeted for the low-cost functions that are called from various parts of this Application (one of which is described in Section 4.10 (Improving the memory placement of two-dimensional arrays for sequential access)).

  • 1 item: improvement of load balance between processes

  • 1 item: implementation of thread parallelization
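The headline figures in this summary can be cross-checked with a few lines of arithmetic. The sketch below assumes that the performance improvement rate is defined as the relative reduction in execution time, (t_initial - t_tuned) / t_initial; this document does not state the formula explicitly, but the assumption reproduces the reported 58%. The function names are hypothetical.

```python
def improvement_rate(t_initial: float, t_tuned: float) -> float:
    """Relative reduction in execution time, in percent (assumed definition)."""
    return 100.0 * (t_initial - t_tuned) / t_initial


def speedup(t_initial: float, t_tuned: float) -> float:
    """How many times faster the tuned version runs."""
    return t_initial / t_tuned


T_INITIAL = 202.9  # seconds, entire Application, initial version
T_TUNED = 85.0     # seconds, entire Application, after tuning items #1 to #44

rate = improvement_rate(T_INITIAL, T_TUNED)
factor = speedup(T_INITIAL, T_TUNED)
target_met = T_TUNED < T_INITIAL / 2  # target: less than half the initial time

print(f"improvement rate: {rate:.1f}%")
print(f"speed-up: {factor:.2f}x")
print(f"target achieved: {target_met}")
```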

Column: For large-scale simulations at Fugaku

Tuning items #14 and #31 are especially important for executing the large-scale simulations using hundreds of thousands of CPU cores that users require.

#14: Changing the parameters of domain decomposition of input models

In this tuning item, the parameters of the domain decomposition were changed to improve the load balance between MPI processes. Improving the load balance reduces the communication latency between MPI processes, and the impact of load imbalance on this latency grows as the number of processes increases. Therefore, it is important to balance the amount of computation performed by each process.
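The effect of load balance can be illustrated with a simplified model in which every process must wait for the slowest one before communicating. The sketch below is an assumption-laden illustration: the per-process work values are hypothetical, and the two metrics (maximum over mean, and idle fraction) are common conventions rather than quantities reported in this tuning work.

```python
from statistics import mean


def imbalance_factor(work_per_process: list[float]) -> float:
    """Maximum work divided by mean work; 1.0 means perfectly balanced."""
    return max(work_per_process) / mean(work_per_process)


def idle_fraction(work_per_process: list[float]) -> float:
    """Average share of a step spent waiting if all processes wait for the slowest."""
    return 1.0 - mean(work_per_process) / max(work_per_process)


# Hypothetical work per MPI process (arbitrary units), NOT measured data.
before = [120.0, 80.0, 100.0, 100.0]  # uneven domain decomposition
after = [102.0, 98.0, 100.0, 100.0]   # decomposition parameters adjusted

for label, work in (("before", before), ("after", after)):
    print(
        f"{label}: imbalance factor {imbalance_factor(work):.2f}, "
        f"idle fraction {idle_fraction(work):.1%}"
    )
```

In this model the step time is set by the most heavily loaded process, so lowering the maximum toward the mean shortens every step even though the total amount of work is unchanged.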

#31: Thread parallelization

The initial version of the Application did not support thread parallelization. However, thread parallelism is crucial for executing large-scale simulations using hundreds of thousands of CPU cores more efficiently. Therefore, thread parallelization was introduced to the Application for the first time. In this tuning item, it was applied only to the functions “calc_function_1” and “make_function_7”, which contain no factors that inhibit thread parallelization, such as data conflicts, and were therefore easy to parallelize. Together, the two functions accounted for about 25% of the cost in the initial version.
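As a rough guide to what threading only these two functions can deliver, Amdahl's law gives an upper bound on the speed-up when a fraction of the run is parallelized. The sketch below uses the 25% cost share stated above as the parallel fraction, which is an assumption: that share was measured on the initial version, it may differ by the time tuning item #31 was applied, and real threading overheads are ignored.

```python
def amdahl_speedup(parallel_fraction: float, threads: int) -> float:
    """Ideal speed-up when only `parallel_fraction` of the run uses `threads` threads."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if threads < 1:
        raise ValueError("threads must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / threads)


PARALLEL_FRACTION = 0.25  # assumed: cost share of the two threaded functions

for threads in (1, 2, 4, 12):
    bound = amdahl_speedup(PARALLEL_FRACTION, threads)
    print(f"{threads:2d} threads: at most {bound:.2f}x")

# Limit for an unbounded number of threads: 1 / (1 - parallel_fraction).
upper_limit = 1.0 / (1.0 - PARALLEL_FRACTION)
print(f"limit: {upper_limit:.2f}x")
```

Under these assumptions the bound stays below about 1.33 times regardless of thread count, which is one way to see why thread parallelization of further loops matters for larger runs.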

The execution conditions of this tuning work, such as the model and the degree of parallelism, are not large enough to evaluate the performance of large-scale simulations using hundreds of thousands of CPU cores. Therefore, after the 44 tuning items were performed, a simulation of a larger-scale model with about 800 million elements was carried out on Fugaku to evaluate the effect of the tuning. The simulation was executed on over 4000 compute nodes with hybrid MPI-OpenMP parallelism (with 4 threads). As a result, the simulation ran to completion on up to about 220,000 CPU cores, and speed-up was observed up to about 200,000 CPU cores. Further improvements, such as thread parallelization of the other loops, are expected to enable much larger-scale simulations.