MPI Introduction


Message Passing Interface (MPI) is a standardized and portable message-passing standard designed by a group of researchers from academia and industry to function on a wide variety of parallel computing architectures.

  • OpenMPI

  • MPICH

  • Diff of OpenMPI and MPICH

    It is important to recognize how MPICH and Open-MPI are different, i.e. that they are designed to meet different needs. MPICH is supposed to be a high-quality reference implementation of the latest MPI standard and the basis for derivative implementations that meet special-purpose needs. Open-MPI targets the common case, both in terms of usage and network conduits.

  • Others

    • MPICH 3.0.4 and later on Mac, Linux SMPs and SGI SMPs.
    • MVAPICH2 2.0a and later on Linux InfiniBand clusters.
    • CrayMPI 6.1.0 and later on Cray XC30.
    • SGI MPT 2.09 on SGI SMPs.
    • OpenMPI development version on Mac.

ARMCI

ARMCI: Aggregate Remote Memory Copy Interface
The purpose of the Aggregate Remote Memory Copy Interface (ARMCI) library is to provide general-purpose, efficient, and widely portable remote memory access (RMA) operations (one-sided communication), optimized for contiguous and noncontiguous (strided, scatter/gather, I/O vector) data transfers.
ARMCI is compatible with MPI.
ARMCI is a standalone system that can be used to support user-level libraries and applications that use MPI or PVM (Parallel Virtual Machine).

MPI (Message Passing Interface) is a description of an interface that can be used to communicate between computational nodes to coordinate calculations. MPI alone is not a framework you can use directly; you need an implementation of this interface, and implementations are available for many systems and programming languages.
PVM is a complete system that can be used to distribute a calculation across multiple computers. From what I know about PVM, it uses a messaging protocol similar to MPI internally. I think an advantage of PVM is that it handles a lot of overhead that you would otherwise have to implement manually with MPI.

UCX

MPI Communication

Point-to-point communication

  • MPI_Send

    MPI_Send(
        void* data,
        int count,
        MPI_Datatype datatype,
        int destination,
        int tag,
        MPI_Comm communicator)
    • tag: identifies a specific message

      Sometimes there are cases when A might have to send many different types of messages to B. Instead of B having to go through extra measures to differentiate all these messages, MPI allows senders and receivers to also specify message IDs with the message (known as tags). When process B only requests a message with a certain tag number, messages with different tags will be buffered by the network until B is ready for them.

  • MPI_Recv

    MPI_Recv(
        void* data,
        int count,
        MPI_Datatype datatype,
        int source,
        int tag,
        MPI_Comm communicator,
        MPI_Status* status)

Even though the message is routed to B, process B still has to acknowledge that it wants to receive A’s data. Once it does, the data is transmitted; process A is then notified that the transmission has completed and may go back to work.
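
A minimal point-to-point sketch (my own illustration, not from the original text): rank 0 sends a single integer to rank 1 with tag 42, and rank 1 receives it. It assumes the program is compiled with an MPI C wrapper such as mpicc and launched with at least two processes, e.g. mpirun -np 2 ./a.out.

    /* Illustrative sketch: blocking send/receive of one int with tag 42. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            int payload = 123;
            /* destination = 1, tag = 42 */
            MPI_Send(&payload, 1, MPI_INT, 1, 42, MPI_COMM_WORLD);
        } else if (rank == 1) {
            int payload;
            MPI_Status status;
            /* source = 0, tag = 42; blocks until the matching send arrives */
            MPI_Recv(&payload, 1, MPI_INT, 0, 42, MPI_COMM_WORLD, &status);
            printf("rank 1 received %d (tag %d)\n", payload, status.MPI_TAG);
        }

        MPI_Finalize();
        return 0;
    }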

Broadcast

  • MPI_Bcast: one process (the root) sends the same data to all processes in a communicator (example below).
    • One of the main uses of broadcasting is to send out user input to a parallel program, or send out configuration parameters to all processes.
                +-> 0 (A)
                |
                +-> 1 (A)
      0 (A) ==> |
                +-> 2 (A)
                |
                +-> 3 (A)

      MPI_Bcast(
          void* data,
          int count,
          MPI_Datatype datatype,
          int root,
          MPI_Comm communicator)
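
As a rough illustration (not from the original), the snippet below broadcasts one configuration integer from rank 0 to every rank; the variable name config is just an assumption for the example.

    /* Illustrative sketch: rank 0 broadcasts a configuration value to all ranks. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int config = 0;
        if (rank == 0)
            config = 7;   /* only the root holds the value before the broadcast */

        MPI_Bcast(&config, 1, MPI_INT, 0, MPI_COMM_WORLD);
        printf("rank %d sees config = %d\n", rank, config);

        MPI_Finalize();
        return 0;
    }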

Scatter

  • MPI_Scatter: similar to MPI_Bcast, but sends chunks of an array to different processes in order of process rank (example below).
                    +-> 0 (A)
                    |
                    +-> 1 (B)
    0 (A,B,C,D) ==> |
                    +-> 2 (C)
                    |
                    +-> 3 (D)

    MPI_Scatter(
        void* send_data,
        int send_count,
        MPI_Datatype send_datatype,
        void* recv_data,
        int recv_count,
        MPI_Datatype recv_datatype,
        int root,
        MPI_Comm communicator)
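
A hedged sketch of MPI_Scatter (my own example, not from the source): rank 0 owns an array with one element per process and hands each rank its own chunk.

    /* Illustrative sketch: scatter one int per rank from the root's array. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int *send_buf = NULL;
        if (rank == 0) {                  /* only the root needs the full array */
            send_buf = malloc(size * sizeof(int));
            for (int i = 0; i < size; i++)
                send_buf[i] = 10 * i;
        }

        int my_chunk;                     /* each rank receives one element */
        MPI_Scatter(send_buf, 1, MPI_INT, &my_chunk, 1, MPI_INT, 0, MPI_COMM_WORLD);
        printf("rank %d received %d\n", rank, my_chunk);

        if (rank == 0)
            free(send_buf);
        MPI_Finalize();
        return 0;
    }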

Gather

  • MPI_Gather: the inverse of MPI_Scatter; gathers data from all processes to a single process, the root (example below).
    0 (A) --+
            |
    1 (B) --+
            | ==> 0 (A,B,C,D)
    2 (C) --+
            |
    3 (D) --+

    MPI_Gather(
        void* send_data,
        int send_count,
        MPI_Datatype send_datatype,
        void* recv_data,
        int recv_count,
        MPI_Datatype recv_datatype,
        int root,
        MPI_Comm communicator)
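
A sketch of the inverse direction (again my own example): every rank contributes one value and rank 0 collects them in rank order.

    /* Illustrative sketch: gather one int from every rank into the root's buffer. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int my_value = rank * rank;       /* each rank contributes one value */

        int *recv_buf = NULL;
        if (rank == 0)                    /* only the root needs the receive buffer */
            recv_buf = malloc(size * sizeof(int));

        MPI_Gather(&my_value, 1, MPI_INT, recv_buf, 1, MPI_INT, 0, MPI_COMM_WORLD);

        if (rank == 0) {
            for (int i = 0; i < size; i++)
                printf("value from rank %d: %d\n", i, recv_buf[i]);
            free(recv_buf);
        }

        MPI_Finalize();
        return 0;
    }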

Allgather

  • MPI_Allgather: like MPI_Gather followed by MPI_Bcast; similar to MPI_Gather, except that every process receives the gathered data rather than only the root (example below).
    0 (A) --+          +-- 0 (A,B,C,D)
            |          |
    1 (B) --+          +-- 1 (A,B,C,D)
            |   ==>    |
    2 (C) --+          +-- 2 (A,B,C,D)
            |          |
    3 (D) --+          +-- 3 (A,B,C,D)

    MPI_Allgather(
        void* send_data,
        int send_count,
        MPI_Datatype send_datatype,
        void* recv_data,
        int recv_count,
        MPI_Datatype recv_datatype,
        MPI_Comm communicator)
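
A sketch along the same lines (not taken from the original): each rank contributes one value and every rank ends up with the full array.

    /* Illustrative sketch: allgather so every rank holds all contributed values. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int my_value = rank + 1;
        int *all_values = malloc(size * sizeof(int));   /* every rank allocates the full array */

        MPI_Allgather(&my_value, 1, MPI_INT, all_values, 1, MPI_INT, MPI_COMM_WORLD);

        printf("rank %d sees:", rank);
        for (int i = 0; i < size; i++)
            printf(" %d", all_values[i]);
        printf("\n");

        free(all_values);
        MPI_Finalize();
        return 0;
    }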

Reduce

  • MPI_Reduce: similar to MPI_Gather, but the root receives the result of reducing all the data with a given operation rather than the individual elements (example below).
    0 (3) --+
            |
    1 (2) --+  (MPI_SUM)
            | ==========> 0 (12)
    2 (5) --+
            |
    3 (2) --+

    0 (3, 1) --+
               |
    1 (2, 3) --+  (MPI_SUM)
               | ==========> 0 (12, 9)
    2 (5, 4) --+
               |
    3 (2, 1) --+

    MPI_Reduce(
        void* send_data,
        void* recv_data,
        int count,
        MPI_Datatype datatype,
        MPI_Op op,
        int root,
        MPI_Comm communicator)
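
A rough sketch matching the element-wise picture above (my own example): each rank contributes a two-element array and rank 0 receives the per-element sums.

    /* Illustrative sketch: element-wise MPI_SUM of a two-element array onto rank 0. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int local[2] = { rank, 2 * rank };   /* per-rank contributions */
        int sums[2]  = { 0, 0 };

        /* Element-wise sum across all ranks, delivered only to the root (rank 0). */
        MPI_Reduce(local, sums, 2, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sums: %d %d\n", sums[0], sums[1]);

        MPI_Finalize();
        return 0;
    }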

Allreduce

  • MPI_Allreduce: reduces the values and distributes the result to all processes (example below).
    0 (3, 1) --+               +-- 0 (12, 9)
               |               |
    1 (2, 3) --+  (MPI_SUM)    +-- 1 (12, 9)
               | ==========>   |
    2 (5, 4) --+               +-- 2 (12, 9)
               |               |
    3 (2, 1) --+               +-- 3 (12, 9)

    MPI_Allreduce(
        void* send_data,
        void* recv_data,
        int count,
        MPI_Datatype datatype,
        MPI_Op op,
        MPI_Comm communicator)
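
One last sketch (illustrative only, not from the source): every rank contributes a value and every rank receives the same global sum.

    /* Illustrative sketch: global sum via MPI_Allreduce; every rank gets the result. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int local = rank + 1;
        int global_sum = 0;

        MPI_Allreduce(&local, &global_sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
        printf("rank %d: global sum = %d\n", rank, global_sum);

        MPI_Finalize();
        return 0;
    }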

NCCL

Ref