2.3. Fission of Imperfectly Nested Loops

2.3.1. Motivation

Fujitsu Fortran/C/C++ compilers vectorize innermost loops. Therefore, if nested loops are not perfectly nested, i.e. there is an executed statement outside the innermost loop, calculations outside the innermost loop are not vectorized and executed sequentially.

Perfectly nested loops
DO j = 1, n
  DO i = 1, n
    y(j) = y(j) + a(i,j) * x(i)
  END DO
END DO
Imperfectly nested loops
DO j = 1, n
  DO i = 1, n
    y(j) = y(j) + a(i,j) * x(i)
  END DO
  y(j) = y(j) / a(j,j)
END DO

When ratio of calculations outside the innermost loop is not ignorable, fission of imperfectly nested loops, which makes multiple perfectly nested loops, might improve ratio of vectorized calculations.

After loop fission
DO j = 1, n
  DO i = 1, n
    y(j) = y(j) + a(i,j) * x(i)
  END DO
END DO
DO j = 1, n
  y(j) = y(j) / a(j,j)
END DO

As a result, more calculations are executed simultaneously and it might lead to reduction of execution time.

2.3.2. Applied Example

Referring to an example presented in “Meetings for application code tuning on A64FX computer systems”, performance improvement by applying this technique is shown below. In this example, a loop for for-variable i, which is the outer loop for imperfectly nested loops, is distributed into a loop containing one for for-variable m and another for the remainder part.

Original
  for(int k = 0; k < p_Nq; ++k) {
    for(int j = 0; j < p_Nq; ++j) {
      for(int i = 0; i < p_Nq; ++i) {
	const int gbase
	  = element * p_Nggeo * p_Np + k * p_Nq * p_Nq + j * p_Nq + i;
	const double r_G00 = ggeo[gbase + p_G00ID * p_Np];
	const double r_G01 = ggeo[gbase + p_G01ID * p_Np];
	const double r_G11 = ggeo[gbase + p_G11ID * p_Np];
	const double r_G12 = ggeo[gbase + p_G12ID * p_Np];
	const double r_G02 = ggeo[gbase + p_G02ID * p_Np];
	const double r_G22 = ggeo[gbase + p_G22ID * p_Np];
	double qr = 0.f;
	double qs = 0.f;
	double qt = 0.f;
	for(int m = 0; m < p_Nq; m++) {
	  qr += s_D[i][m] * s_q[k][j][m];
	  qs += s_D[j][m] * s_q[k][m][i];
	  qt += s_D[k][m] * s_q[m][j][i];
	}
	s_Gqr[k][j][i] = r_G00 * qr + r_G01 * qs + r_G02 * qt;
	s_Gqs[k][j][i] = r_G01 * qr + r_G11 * qs + r_G12 * qt;
	s_Gqt[k][j][i] = r_G02 * qr + r_G12 * qs + r_G22 * qt;
      }
    }
  }
Technique applied
  double	qr[p_Nq];
  double	qs[p_Nq];
  double	qt[p_Nq];

  for(int k = 0; k < p_Nq; ++k) {
    for(int j = 0; j < p_Nq; ++j) {
      for(int i = 0; i < p_Nq; ++i) {
	qr[i] = 0.f;
	qs[i] = 0.f;
	qt[i] = 0.f;
	for(int m = 0; m < p_Nq; m++) {
	  qr[i] += s_D[i][m] * s_q[k][j][m];
	  qs[i] += s_D[j][m] * s_q[k][m][i];
	  qt[i] += s_D[k][m] * s_q[m][j][i];
	}
      }
#pragma loop norecurrence
      for(int i = 0; i < p_Nq; ++i) {
	const int gbase
	  = element * p_Nggeo * p_Np + k * p_Nq * p_Nq + j * p_Nq + i;
	const double r_G00 = ggeo[gbase + p_G00ID * p_Np];
	const double r_G01 = ggeo[gbase + p_G01ID * p_Np];
	const double r_G11 = ggeo[gbase + p_G11ID * p_Np];
	const double r_G12 = ggeo[gbase + p_G12ID * p_Np];
	const double r_G02 = ggeo[gbase + p_G02ID * p_Np];
	const double r_G22 = ggeo[gbase + p_G22ID * p_Np];
	s_Gqr[k][j][i] = r_G00 * qr[i] + r_G01 * qs[i] + r_G02 * qt[i];
	s_Gqs[k][j][i] = r_G01 * qr[i] + r_G11 * qs[i] + r_G12 * qt[i];
	s_Gqt[k][j][i] = r_G02 * qr[i] + r_G12 * qs[i] + r_G22 * qt[i];
      }
    }
  }

Ratios of SIMD instructions and results of cycle accounting for executions before/after applying the technique are shown in graphs below. A parameter for the loop execution is as follows:

p_Nq = 8

Comparing the lower graph for the technique applied to the upper graph for the original, ratio of SIMD instructions was improved from 53% to 76% and execution time was reduced by 16%.

_images/axhelm2.29503716.0.simd.png _images/axhelm2.29503716.0.png _images/axhelm2.29503716.1.simd.png _images/axhelm2.29503716.1.png

2.3.3. Real Cases

A real case related to this technique is presented in “Meetings for application code tuning on A64FX computer systems” as follows:

2.3.4. References