Shared memory

Title: Efficient Shared-Memory Implementation of HPCG and Its Application to Unstructured Matrices

HPL: bound by floating-point compute capability
HPCG: bound by memory bandwidth
HPCG is bandwidth-bound because its kernels all operate on large sparse matrices, and these matrices do not fit well in cache.
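
A rough way to see why a sparse kernel ends up bandwidth-bound is to count flops per byte of matrix data streamed from memory. The sketch below assumes CSR-style storage with an 8-byte value and a 4-byte column index per nonzero, and ignores vector and row-pointer traffic, so it gives an optimistic upper bound. The function names and the 100 GB/s figure are hypothetical, chosen only for illustration.

```python
# Assumption: each nonzero costs an 8-byte double plus a 4-byte column index.
BYTES_PER_NONZERO = 8 + 4
# One multiply and one add per nonzero in a sparse matrix-vector product.
FLOPS_PER_NONZERO = 2


def spmv_intensity(bytes_per_nonzero=BYTES_PER_NONZERO,
                   flops_per_nonzero=FLOPS_PER_NONZERO):
    """Flops per byte of matrix data when the matrix is streamed from memory."""
    return flops_per_nonzero / bytes_per_nonzero


def bandwidth_bound_gflops(bandwidth_gb_s, intensity):
    """Upper bound on GFLOP/s when memory bandwidth is the only limit."""
    return bandwidth_gb_s * intensity


# Hypothetical machine with 100 GB/s of memory bandwidth.
intensity = spmv_intensity()
bound = bandwidth_bound_gflops(100.0, intensity)
print(f"intensity = {intensity:.3f} flop/byte, bound = {bound:.1f} GFLOP/s")
```

Under these assumptions the kernel performs 1/6 flop per byte, so even a fast memory system caps it far below the floating-point peak that HPL can approach.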

SYMGS

  1. Task Scheduling with P2P Synchronization:
    j -> i: smoothing variable (row) i depends on smoothing variable (row) j

    data dependency: every nonzero (i, j) in the lower or upper triangular part obtained by splitting the matrix makes the corresponding rows depend on each other. We can run a and c in parallel even with a transitive dependency a → b → c, as long as a and c are not directly connected

  2. Block Multi-Color Reordering

  3. Running Multiple MPI Ranks per Node
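
The dependency rule in item 1 and the coloring idea in item 2 can be sketched on a toy matrix. This is a minimal illustration, not the paper's implementation: the sparsity pattern is a hypothetical 4-row chain, the pattern is assumed symmetric, the coloring is a plain per-row greedy pass (not the block variant), and all function names are made up for this note.

```python
def lower_deps(pattern):
    """Row i depends on every row j < i that has a nonzero at (i, j)."""
    return {i: sorted(j for j in cols if j < i) for i, cols in pattern.items()}


def earliest_steps(deps):
    """Earliest step at which each row can start if it waits only on its
    direct parents (the point-to-point view of the forward sweep)."""
    step = {}
    for i in sorted(deps):  # parents have smaller indices, so they come first
        step[i] = 1 + max((step[j] for j in deps[i]), default=-1)
    return step


def greedy_colors(pattern):
    """Give each row the smallest color not used by a directly connected row.
    Assumes a symmetric pattern, so neighbors can be read off the row itself."""
    color = {}
    for i in sorted(pattern):
        used = {color[j] for j in pattern[i] if j != i and j in color}
        c = 0
        while c in used:
            c += 1
        color[i] = c
    return color


# Hypothetical tridiagonal sparsity pattern: a chain 0 - 1 - 2 - 3.
pattern = {0: [0, 1], 1: [0, 1, 2], 2: [1, 2, 3], 3: [2, 3]}

deps = lower_deps(pattern)
steps = earliest_steps(deps)
colors = greedy_colors(pattern)
print("direct parents:", deps)
print("earliest steps:", steps)
print("colors:        ", colors)
```

In the natural ordering the chain is fully sequential, since each row waits on the one before it. After coloring, rows 0 and 2 share a color because they are not directly connected, which is the a → b → c case from item 1: they can be smoothed in parallel once the sweep is reordered by color. Reordering changes the sweep order, so the result is a different Gauss-Seidel iteration than the natural-order one.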