5. Summary

This document described tuning techniques that are generally applicable to other programs, based on real cases presented at the “Meetings for application code tuning on A64FX computer systems”. In those cases, the techniques achieved the following speedups for the loops to which they were applied.

Objective                                  | Technique                                                             | Speedup for applied loop
-------------------------------------------|-----------------------------------------------------------------------|-------------------------
Promoting Vectorization                    | Interchange of Innermost Loop with Data Dependency                    | 3.04x
                                           | Interchange of Innermost Loop with Small Iteration Count              | 2.02x
                                           | Fission of Imperfectly Nested Loops                                   | 1.19x
Reduction of Waiting Time for Calculation  | Fission of Loop with Large Loop Body                                  | 1.78x
                                           | Striping of Innermost Loop with Small Iteration Count                 | 1.37x
Reduction of Waiting Time for Cache Access | Full-Unrolling of Innermost Loop with Non-Contiguous Data Access      | 3.35x
                                           | Interchange of Array Dimension for AoS Layout                         | 1.51x
                                           | Specifying CONTIGUOUS Attribute to Array Pointer                      | 1.79x
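As a reminder of what the first technique, "Interchange of Innermost Loop with Data Dependency", looks like in practice, the following is a minimal sketch in C. It is not taken from the cases above: the function names `recurrence_before` and `recurrence_after` and the sizes `NJ` and `NI` are hypothetical, and the speedups in the summary apply to the original loops, not to this sketch. In the first version the innermost loop carries a dependency (each element needs the one computed in the previous iteration), which typically prevents vectorization. In the second version the loops are interchanged so that the innermost iterations are independent and access memory contiguously.

```c
#include <stddef.h>

/* Hypothetical problem sizes, chosen only for illustration. */
enum { NJ = 64, NI = 256 };

/* Before: the innermost loop runs over j, and a[j][i] depends on
 * a[j-1][i] from the previous innermost iteration. The loop-carried
 * dependency typically blocks vectorization, and the accesses are
 * strided by NI elements. */
void recurrence_before(double a[NJ][NI], double b[NJ][NI])
{
    for (size_t i = 0; i < NI; i++) {
        for (size_t j = 1; j < NJ; j++) {
            a[j][i] = a[j - 1][i] + b[j][i];
        }
    }
}

/* After: the loops are interchanged. The dependency is now carried by
 * the outer j loop, so the innermost i iterations are independent of
 * each other and touch contiguous memory. */
void recurrence_after(double a[NJ][NI], double b[NJ][NI])
{
    for (size_t j = 1; j < NJ; j++) {
        for (size_t i = 0; i < NI; i++) {
            a[j][i] = a[j - 1][i] + b[j][i];
        }
    }
}
```

Both versions perform the same additions in the same order for every column `i`, so they produce identical results. Whether the compiler actually vectorizes the interchanged loop should be confirmed from its optimization messages rather than assumed.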

Readers who want to speed up their own programs are encouraged to consult the program’s profiling data, such as CPU performance reports, and to look among these techniques for the ones that match the program.