
The Chronicles of Phi - part 5 - Plesiochronous phasing barrier – tiled_HT3


For the next optimization, I knew what I wanted to do; I just didn’t know what to call it. In looking for a word that describes loosely synchronous, I came across plesiochronous:

In telecommunications, a plesiochronous system is one where different parts of the system are almost, but not quite, perfectly synchronized.

In any large threaded application, synchronizing all threads by way of thread-pool-wide barriers results in a significant waste of processing power. In the diffusion simulation program, as well as possibly in your application, the simulation integrates over multiple time intervals before doing something with the results. In the example program, that span is the entire simulation run time. In a typical simulation program, the simulator would be configured to advance N intervals, then log or display the current state, then proceed for another N intervals, and so on.

In the tiled_HT2 code, with the small model (256x256x256) and 4 threads per phalanx, we have 64 squad slices across z by 256 steps along y. This represents 16384 squad “drill holes” along x. These are partitioned across the 60 cores: 16384/60 = 273.0666… squad “drill holes” per core. Since this is not a whole number, not all cores are assigned the same amount of work: 56 of the cores are scheduled 273 “drill holes” and 4 cores are scheduled 274. The work disparity is small, only 1/273. “The cost of doing business”, so to say.
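As a quick sanity check, that split can be reproduced with a few lines of plain C (nothing here is Phi-specific; the numbers are the small-model figures quoted above):

#include <stdio.h>

int main(void) {
  int nz = 256, ny = 256, nHTs = 4, nCores = 60;
  int squadSlices = nz / nHTs;            // 64 squad slices across z
  int drillHoles  = squadSlices * ny;     // 16384 squad "drill holes" along x
  int base  = drillHoles / nCores;        // 273 drill holes per core
  int extra = drillHoles % nCores;        // 4 cores receive one extra
  printf("%d drill holes: %d cores x %d, %d cores x %d\n",
         drillHoles, nCores - extra, base, extra, base + 1);
  return 0;
}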

If only “the cost of doing business” were that simple. In practice, with 240 threads on 60 cores, you will not get all threads to start a parallel region at the same time, nor will all threads finish the region at the same time, even if they are notionally performing the same amount of work. As mentioned earlier in this article, adding instrumentation to measure each thread’s total time in the do-work section and the at-barrier section showed that, for the small problem (256x256x256), approximately 25% of the computation time was spent at the barrier. To a conscientious programmer, that “cost of doing business” is too high.
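A minimal sketch of that kind of measurement, using OpenMP timing routines and a hypothetical do_work() routine standing in for the real kernel (this is illustrative only, not the instrumentation code actually used):

#include <omp.h>

#define MAX_THREADS 240                 // assumes at most 240 OpenMP threads
static double workTime[MAX_THREADS];    // per-thread time spent computing
static double barrierTime[MAX_THREADS]; // per-thread time burned at the barrier

static void do_work(int tid) { (void)tid; }  // hypothetical stand-in for the kernel

void timed_iteration(void) {
#pragma omp parallel
  {
    int tid = omp_get_thread_num();
    double t0 = omp_get_wtime();
    do_work(tid);                       // do-work section
    double t1 = omp_get_wtime();
    // at-barrier section begins here
#pragma omp barrier
    double t2 = omp_get_wtime();
    workTime[tid]    += t1 - t0;
    barrierTime[tid] += t2 - t1;
  }
}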

Now then, what to do about lost time at barrier?

Let’s construct a simplified diagram for a two-core, 4-threads/core, 16x16x16 array (ignoring vectors for the purpose of this illustration). In the tiled_HT2 perfect world, with zero skew at the start of a parallel region and identical compute times (resulting in zero time at the barrier), the state at the moment the first thread reaches the barrier would look like this:

Blue is core 0 (threads 0:3 across the width of each stripe) and green is core 1 (threads 4:7 across the width of each stripe), ideally with both cores starting at the same time and finishing at the same time.

The real picture is more like this:

Where the white cells are the yet to be processed cells.

In the above depiction, the first thread to reach the barrier (green, lower right corner) waits for the remaining threads. The second thread to reach the barrier additionally wastes time waiting for the other threads, and so on. The more out of sync the threads get, the longer the wasted burn time at the barrier.

In an extreme case, the situation can even look like:

Where one of the threads (one of the blue core’s threads) has yet to begin performing its work when the first thread (one of the green core’s) reaches the barrier. This situation is usually due to some other process running on that hardware thread; the O/S may be doing some housekeeping too.

In either the normal case, or unusual case, significant time is wasted at the barrier.

Let’s do something to recover this wasted time.

The technique used here is to assign each core a slice that is one count of z (in phalanx-wide groupings of 4, in this case) by all ny counts of y (as well as all nx of x).

The blue and green represent the work areas assigned for a first pass to each core (of the simplified 2-core, 4-HT/core setup).

When the first thread finishes its work, the state may look like:

Now, by using a core barrier in place of the thread-pool barrier, when the sibling threads of the core whose thread finished first also finish, the state may look like:

At this point, instead of waiting for all the threads of the thread pool to reach the barrier (all 4 threads of blue in this simplified diagram), the first core completing the core barrier can pick the next available z and begin processing before the other core(s) reach the barrier. At the time core 0 completes the core barrier, the state may be:

The process, in which a core that finishes picks the next z, continues in a modulus-counter fashion (0 follows nz-1), with the counter’s overflows counting the completed passes over the array. In actuality, each pick claims the next nHTs-wide group of z values for the core.
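The arithmetic behind the pick, as used in the code later in this article, maps a single monotonically increasing counter onto an iteration (frame) number plus a home z for the squad. A small standalone illustration, using the simplified 16x16x16, 4-HT example:

#include <stdio.h>

int main(void) {
  int nz = 16, nHTs = 4;                    // simplified example dimensions
  int nSquadsZ = (nz + nHTs - 1) / nHTs;    // 4 squad columns across z
  for (int pick = 0; pick < 10; ++pick) {
    int myCount = pick / nSquadsZ;          // which frame this pick belongs to
    int z0 = (pick % nSquadsZ) * nHTs;      // home z of the picked squad column
    printf("pick %2d -> frame %d, z0 %2d\n", pick, myCount, z0);
  }
  return 0;
}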

The main benefit of the plesiochronous barrier is realized at the moment when, and the place in the code where, the traditional thread-pool barrier appears. This is at the end of the whole-array iteration (often called a frame).

In the original OpenMP program this was the implicit barrier at the end of the #pragma omp parallel for, and in our revised tiled_HT1 and tiled_HT2 programs it was the explicit #pragma omp barrier.
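For contrast, here is a schematic reconstruction (not the book’s exact code) of the simple OpenMP structure, showing where that per-frame barrier sits; REAL is assumed to be float and stencil_row() is a hypothetical stand-in for the real x-row kernel:

typedef float REAL;

// hypothetical stand-in for the 7-point stencil along one x row
static void stencil_row(REAL *f2, const REAL *f1,
                        int nx, int ny, int nz, int y, int z) {
  (void)f2; (void)f1; (void)nx; (void)ny; (void)nz; (void)y; (void)z;
}

void diffusion_simple(REAL *f1, REAL *f2, int nx, int ny, int nz, int count) {
  for (int i = 0; i < count; ++i) {     // one frame per count
#pragma omp parallel for
    for (int z = 0; z < nz; ++z)
      for (int y = 0; y < ny; ++y)
        stencil_row(f2, f1, nx, ny, nz, y, z);
    // implicit barrier here: no thread may begin frame i+1 until all finish frame i
    REAL *t = f1; f1 = f2; f2 = t;      // swap input/output buffers
  }
}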

The plesiochronous barrier technique permits cores completing one frame to enter the next frame, provided that the dependencies are satisfied:

In the above, grey indicates that the dependencies are fulfilled (the prior iteration has completed, from the perspective of the picking thread). The rightmost stripe of blue illustrates a core completing a frame (no more un-chosen stripes to the right of green), then the blue core passing through the frame barrier while the green core has yet to reach it. Because the next column (being the first column) has reached the state frame-1 (for that core), and the column to its right (the second column) has reached frame-1 or the blue core’s current frame (the left column’s time state), the blue core was permitted to enter the column.

CAUTION: A caveat to this is that a thread cannot progress unless the picked z and its adjacent z’s (z-1, z, and z+1) have all completed at least the current phase minus 1. If (when) this is not true, progressing might use data that is not up to date. This is where the plesiochronous barrier is placed.

Had the core processing the second column in the prior chart not finished, then the core (upper left blue) would have had to wait at the plesiochronous barrier.

In the above, the blue core has allocated the first column for the next frame, but has to wait at the plesiochronous barrier for the magenta core to finish its column. The magenta core is a hypothetical third core injected into the diagram in order to present the plesiochronous-barrier wait scenario.

The plesiochronous barrier technique should exhibit its best improvement when the number of core columns (Hyper-Thread Phalanx columns) exceeds the number of cores (phalanxes) by a factor of two or more. In the large model it does (512 / 4 = 128 phalanx columns; 128 / 60 cores = 2.1333). This suggests that the large model will benefit more than the small model. In observations of test runs, however, substantial benefit is observed under all circumstances.

Code exemplifying the plesiochronous barrier technique:

#if defined(__MIC__)
#define WAIT_A_BIT _mm_delay_32(10)
#else
#define WAIT_A_BIT _mm_pause()
#endif
...
void diffusion_tiled_aligned(
... (same as for tiled_HT2)
}

void diffusion_tiled(REAL *restrict f1, REAL *restrict f2, int nx, int ny, int nz,
              REAL ce, REAL cw, REAL cn, REAL cs, REAL ct,
              REAL cb, REAL cc, REAL dt, int count) {

  // zCountCompleted[nz] is a shared array indicating the iteration counts
  // completed for the z index. N.B. Each thread processes all [x,y]'s for a given z
  volatile int zCountCompleted[nz];
  for(int i = 0; i < nz; ++i)
    zCountCompleted[i] = -1;  // "completed" one before first (0)

  // shared next Phalanx number
  volatile int NextPick = 0;

  // CorePick[nCores] stores the NextPicked'd Phalanx number for core
  volatile int CorePick[nCores];
  for(int i = 0; i < nCores; ++i)
    CorePick[i] = -1;  // initialize to a value known to be less than our next pick

#pragma omp parallel
  {
    REAL *f1_t;
    REAL *f2_t;

    int priorCount = -1;
    int myCount = -1;
    int myPick = -1; // initialize myPick (prior pick for 1st iteration of loop)
    int nSquadsZ = (nz + nHTs - 1) / nHTs; // place squads across z dimension
    for(;;) {
      if(myHT == 0) {
        // team member 0 picks the next Squad
        CorePick[myCore] = myPick = __sync_fetch_and_add(&NextPick, 1);
      } else {
        // other team members wait until pick made by member 0
        while(CorePick[myCore] == myPick)
          WAIT_A_BIT;
        myPick = CorePick[myCore]; // pick up new pick
      } // myHT != 0

      myCount = myPick / nSquadsZ; // determine count interval for myPick
      // see if iteration count reached
      if(myCount >= count)
        break; // exit for(;;) loop

      // determine which buffers are in and out
      if(myCount & 1)
      {
        f1_t = f2;
        f2_t = f1;
      }
      else
      {
        f1_t = f1;
        f2_t = f2;
      }

      int z0 = (myPick % nSquadsZ) * nHTs;// home z for 0'th team member for next squad
      int z = z0 + myHT;          // z for this team member
      int y = 0;
      // assure we are within z
      if(z < nz)
      {
        // perform plesiochronous barrier
        priorCount = myCount - 1;
        if(z) // then there is a z-1
          while(zCountCompleted[z-1] < priorCount) // wait for z-1
            WAIT_A_BIT;
        while(zCountCompleted[z] < priorCount)     // wait for z
          WAIT_A_BIT;
        if(z + 1 < nz) // then there is a z+1
          while(zCountCompleted[z+1] < priorCount) // wait for z+1
            WAIT_A_BIT;
        int x = 0;
        int c, n, s, b, t;
        c =  x + y * nx + z * nx * ny;
        n = (y == 0)    ? c : c - nx;
        s = (y == ny-1) ? c : c + nx;
        b = (z == 0)    ? c : c - nx * ny;
        t = (z == nz-1) ? c : c + nx * ny;
        // compute nx by ny cells for picked z
        diffusion_tiled_aligned(
          &f2_t[c],   // aligned
          &f1_t[c],   // aligned
          &f1_t[c-1], // unaligned
          &f1_t[c+1], // unaligned
          &f1_t[s],   // aligned
          &f1_t[n],   // aligned
          &f1_t[b],   // aligned
          &f1_t[t],   // aligned
          ce, cw, cn, cs, ct, cb, cc, nx, ny);
        // Inform other threads that this [z] column is complete
        zCountCompleted[z] = myCount;

        // perform equivalent of core barrier
        int zEnd = (z0 + nHTs < nz) ? z0 + nHTs : nz;
        for(int i = z0; i < zEnd; ++i)
          while(zCountCompleted[i] < myCount)
            WAIT_A_BIT;

      } // if(z < nz)
    } // for(;;)
  } // parallel
  return;
}
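A note on the shared team variables used above: nHTs, nCores, myHT, and myCore come from the Hyper-Thread Phalanx setup in the earlier parts of this series and are not repeated here. Purely as an illustrative sketch (not the series’ actual setup code), assuming KMP_AFFINITY=compact so that consecutive OpenMP thread numbers land on sibling hardware threads of the same core, they could be derived like this:

#include <omp.h>

static int nHTs;               // hardware threads per core (phalanx width)
static int nCores;             // number of cores (phalanxes)
static __thread int myHT;      // this thread's position within its core
static __thread int myCore;    // this thread's core (phalanx) number

void establish_phalanx(int threadsPerCore) {
  nHTs = threadsPerCore;
#pragma omp parallel
  {
    int tid = omp_get_thread_num();
    myHT   = tid % nHTs;       // sibling index within the core
    myCore = tid / nHTs;       // core (phalanx) index
#pragma omp single
    nCores = (omp_get_num_threads() + nHTs - 1) / nHTs;
  }
}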

Now let’s see what happened:

This shows an additional 20% gained for the small solution. For the large solution, it recovered the expected gains that were missing from the HT1 and HT2 code changes.

At this point of optimization, we are now attaining 14x to 15x the performance of a simplified OpenMP approach to the problem, and 45% faster than that attained in the example from chapter 4 of Intel® Xeon Phi™ Coprocessor High-Performance Programming, by Jim Jeffers and James Reinders (Morgan Kaufmann).

What else can be done to improve performance?

This article series does not have a further part covering the next phase of programming enhancements, such as guided prefetch and clevict #pragmas, or inserting compiler directives to hand-align code. I will leave those to the reader.

At this point, it would be appropriate to consider what happens when problem size is increased to the point where the memory capacity of a single Xeon Phi is insufficient. The plesiochronous barrier technique you learned here can equally apply to larger problems that do not fit within the memory constraints of a single coprocessor.

If there is sufficient interest in having me continue this column to include dual coprocessors (in the same system), I will do so. Keep in mind that this would be a fair amount of effort.

Please post your comments.

The source code and make file can be found attached.

Jim Dempsey
Consultant
QuickThread Programming, LLC

 

 


Attachments:

http://software.intel.com/sites/default/files/%5Bnode%3Acontent-type%3Amachine-name%5D/%5Bnode%3Anid%5D/ChroniclesOfPhi.tar
