4.2. Interchange of Array Dimension for AoS Layout

4.2.1. Motivation

Array of Structures is a type of data layout, where structures (in C language) are allocated as an array, as shown in the example below.

Declaration of an AoS array (in C language)
struct Particle {double x; double y; double z;};
struct Particle particles[N];
_images/cachewait-aos.png

Order of memory allocation for an AoS array

Similar data structure can be declared with a multi-dimension array as follows:

Declaration of an AoS-type 2D array (in Fortran)
real, dimension(3,N) :: particles_aos
_images/cachewait-2d-aos.png

Order of memory allocation for an AoS-type 2D array

For the above example, if N is much larger than 3, vectorizing a loop, whose do-variable corresponds to the array dimension of size N, is expected to promote compiler optimizations and show better execution performance as a result. However, since the array dimension of size N is not the first dimension (for Fortran programs) of the array, as explained for the previous example, the array accesses become non-contiguous and lead to more busy time for cache access.

In such cases, interchanging the first and the second dimensions in this example realizes Structure of Arrays (SoA) layout shown below and reduces busy time for cache access when vectorizing the loop corresponding to the array dimension of size N.

Declaration of an SoA-type 2D array (in Fortran)
real, dimension(N,3) :: particles_soa
_images/cachewait-2d-soa.png

Order of memory allocation for an SoA-type 2D array

As a result, while compiler optimizations are promoted, busy time for cache access is also reduced and it might lead to reduction of execution time.

4.2.2. Applied Example

Referring to an example presented in “Meetings for application code tuning on A64FX computer systems”, performance improvement by applying this technique is shown below. In this example, temporary arrays x, y, z and m were introduced to replace non-contiguous accesses to an AoS-type array “body” with their contiguous accesses in a loop for for-variable j.

Original
  for(int i=0; i<n; i++){
    const float xi=body[i].x, yi=body[i].y, zi=body[i].z;
    float ax=0, ay=0, az=0;
    for(int j=0; j<n; j++){
      float dx = xi - body[j].x;
      float dy = yi - body[j].y;
      float dz = zi - body[j].z;
      float r2 = eps2 + dx*dx;
      r2 += dy*dy;
      r2 += dz*dz;
      float ri = 1.f / sqrtf(r2);
      float mri = body[j].m * ri;
      float ri2 = ri * ri;
      float mri3 = mri * ri2;
      ax -= mri3 * dx;
      ay -= mri3 * dy;
      az -= mri3 * dz;
    }
    acc[i] = {ax, ay, az};
  }
Technique applied
  float x[n], y[n], z[n], m[n];
#pragma loop norecurrence
  for(int i=0; i<n; i++){
    x[i] = body[i].x;
    y[i] = body[i].y;
    z[i] = body[i].z;
    m[i] = body[i].m;
  }
  for(int i=0; i<n; i++){
    const float xi=body[i].x, yi=body[i].y, zi=body[i].z;
    float ax=0, ay=0, az=0;
    for(int j=0; j<n; j++){
      float dx = xi - x[j];
      float dy = yi - y[j];
      float dz = zi - z[j];
      float r2 = eps2 + dx*dx;
      r2 += dy*dy;
      r2 += dz*dz;
      float ri = 1.f / sqrtf(r2);
      float mri = m[j] * ri;
      float ri2 = ri * ri;
      float mri3 = mri * ri2;
      ax -= mri3 * dx;
      ay -= mri3 * dy;
      az -= mri3 * dz;
    }
    acc[i] = {ax, ay, az};
  }

Results of cycle accounting for executions before/after applying the technique are shown in graphs below. A parameter for the loop execution is as follows:

n = 2048

Comparing the right graph for the technique applied to the left graph for the original, busy time for L1D cache access was reduced dramatically and execution time was reduced by 34%. In this example, an appearance of waiting time for floating-point calculation is considered to be due to many calculations chained after an array reference within a loop.

_images/nbody.29503716.0.png _images/nbody.29503716.2.png

4.2.3. Real Cases

Real cases related to this technique are presented in “Meetings for application code tuning on A64FX computer systems” as follows:

4.2.4. References

Notice: Access rights for Fugaku User Portal are required to read the above documents.