Partitioned Communication

Ali Farazdaghi

ELEC 873

What is Partitioned Communication?

A Solution to a Problem

The Problem:

How to fit threads into the MPI communication model?

Background

In the Beginning There Were Clusters with Single-core Nodes

And then there was a wall
Multi-core CPUs
MPI+X!

Endpoints!

*Spoiler Alert*

Didn't make it into the MPI standard
Has severe performance issues
Gives each thread the ability to communicate
→ Each thread can send/recv messages
→ Each thread has an Endpoint

Expands Rank-space

Before Endpoints

With Endpoints
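
For example (an assumed configuration, not one from the slides): with 2 MPI processes each running 4 threads, MPI_COMM_WORLD still has only 2 ranks, but an endpoints communicator created with 4 endpoints per process exposes 2 × 4 = 8 ranks, so every thread can be addressed directly as a send/receive target.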

Allreduce Example

#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv) {
  int world_rank, tl;
  int max_threads = omp_get_max_threads();
  MPI_Comm ep_comm[max_threads];
  MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &tl);
  MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
  #pragma omp parallel
  {
    int nt = omp_get_num_threads();
    int tn = omp_get_thread_num();
    int ep_rank;
    #pragma omp master
    MPI_Comm_create_endpoints(          /* proposed API, never standardized */
      MPI_COMM_WORLD, nt, MPI_INFO_NULL, ep_comm);
    #pragma omp barrier
    MPI_Comm_rank(ep_comm[tn], &ep_rank);
    ... // divide up work based on 'ep_rank'
    MPI_Allreduce(..., ep_comm[tn]);    /* each thread communicates on its own endpoint */
    MPI_Comm_free(&ep_comm[tn]);
  }
  MPI_Finalize();
  return 0;
}

More Threads → Lower Message Rate → Use Fewer Threads!

*Spoiler Alert*

Message Matching is the Bottleneck
Can't Do Efficient Message Matching for Threads

How Message Matching Works in MPICH

Two Queues On Receiver

  • Posted Receive Queue (PRQ)
  • Unexpected Message Queue (UMQ)

Without Threads

Recv() called first

Msg arrived first
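
A simplified, illustrative sketch of this two-queue logic (hypothetical helper names and data structures, not MPICH's actual code):

#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified structures; MPICH's real queues and match keys
 * (source, tag, communicator context id) are more involved. */
typedef struct entry {
    int source, tag;        /* match key (communicator omitted for brevity) */
    void *buf;              /* posted receive buffer, or buffered payload   */
    size_t len;
    struct entry *next;
} entry_t;

static entry_t *prq = NULL; /* Posted Receive Queue     */
static entry_t *umq = NULL; /* Unexpected Message Queue */

/* Remove and return the first entry matching (source, tag), or NULL. */
static entry_t *find_remove(entry_t **q, int source, int tag) {
    for (entry_t **p = q; *p; p = &(*p)->next) {
        if ((*p)->source == source && (*p)->tag == tag) {
            entry_t *e = *p;
            *p = e->next;
            return e;
        }
    }
    return NULL;
}

static void append(entry_t **q, entry_t *e) {
    e->next = NULL;
    while (*q) q = &(*q)->next;
    *q = e;
}

/* Recv() called first: check the UMQ, otherwise post to the PRQ. */
void post_recv(int source, int tag, void *buf, size_t len) {
    entry_t *msg = find_remove(&umq, source, tag);
    if (msg) {                              /* message already arrived: complete now */
        memcpy(buf, msg->buf, len < msg->len ? len : msg->len);
        free(msg->buf); free(msg);
    } else {                                /* otherwise wait for it in the PRQ      */
        entry_t *req = malloc(sizeof *req);
        req->source = source; req->tag = tag; req->buf = buf; req->len = len;
        append(&prq, req);
    }
}

/* Msg arrived first: check the PRQ, otherwise buffer it in the UMQ. */
void handle_incoming(int source, int tag, const void *payload, size_t len) {
    entry_t *req = find_remove(&prq, source, tag);
    if (req) {                              /* a matching receive was already posted */
        memcpy(req->buf, payload, len < req->len ? len : req->len);
        free(req);
    } else {                                /* no receive yet: copy into the UMQ     */
        entry_t *msg = malloc(sizeof *msg);
        msg->source = source; msg->tag = tag; msg->len = len;
        msg->buf = malloc(len);
        memcpy(msg->buf, payload, len);
        append(&umq, msg);
    }
}

Every search and append must see a consistent view of both queues, which is what makes this hard to do efficiently once many threads match concurrently.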

With Threads

Instead of giving each thread communication capabilities...

Give each thread ownership of a chunk (partition) of the message buffer

From: Implementation and Evaluation of MPI 4.0 Partitioned Communication Libraries
Early-bird Communication!
You can use Partitioned Communication in
  • Open MPI 5
  • MPICH 4
              
MPI_Psend_init(&buffer, partitions, count,
        datatype, dest, tag, comm, info, &request);
for (int i = 0; i < num_iterations; i++) {
  MPI_Start(&request);
  /* Parallel loop with some number of threads */
  #pragma omp parallel for
  for (int partition = 0; partition < partitions; partition++) {
    /* Do work to fill this partition's portion of buffer */
    MPI_Pready(partition, request);   /* this partition may be sent immediately */
  }
  MPI_Wait(&request, MPI_STATUS_IGNORE);
}
MPI_Request_free(&request);

From: Implementation and Evaluation of MPI 4.0 Partitioned Communication Libraries
              
MPI_Precv_init(&buffer, partitions, count,
         datatype, source, tag, comm, info, &request); // lazy receiver-side init
for (int i = 0; i < num_iterations; i++) {
  MPI_Start(&request);
  #pragma omp parallel for
  for (int partition = 0; partition < partitions; partition++) {
    /* do compute work */
    MPI_Parrived(request, partition, &flag);   /* has this partition arrived yet? */
    /* do work on early arrivals if available;
       if not, go to the next partition or call MPI_Parrived() again */
  }
  MPI_Test(&request, &flag, MPI_STATUS_IGNORE);  // completion of the whole buffer
}
MPI_Request_free(&request);

From: Implementation and Evaluation of MPI 4.0 Partitioned Communication Libraries
Difference between MPI Partitioned Communication and Finepoints

Partitioned Communication has lazy initialization on the receiver side

Benefits of this method (see the usage sketch below)

  • Persistent Communication
  • No Matching
  • RDMA put
  • Synchronization happens at the end
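
A minimal end-to-end usage sketch of these points, under assumed parameters (2 ranks, 8 partitions of 1024 doubles, 10 iterations, OpenMP threads on the sender); this is illustrative code written for this presentation, not code from the cited paper. The request is set up once and reused (persistent), matching effectively happens once per MPI_Start rather than per message, an implementation is free to move each ready partition into the already-known receive buffer (e.g. with an RDMA put), and the only synchronization over the whole buffer is the MPI_Wait at the end. Requires an MPI 4.0 library such as Open MPI 5 or MPICH 4.

#include <mpi.h>
#include <omp.h>

#define ITERS      10
#define PARTITIONS 8
#define COUNT      1024          /* doubles per partition (assumed sizes) */

int main(int argc, char **argv) {
    int rank, tl;
    double buf[PARTITIONS * COUNT];
    MPI_Request req;

    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &tl);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Persistent: initialized once, started every iteration. */
        MPI_Psend_init(buf, PARTITIONS, COUNT, MPI_DOUBLE, 1, 0,
                       MPI_COMM_WORLD, MPI_INFO_NULL, &req);
        for (int it = 0; it < ITERS; it++) {
            MPI_Start(&req);
            #pragma omp parallel for
            for (int p = 0; p < PARTITIONS; p++) {
                for (int i = 0; i < COUNT; i++)
                    buf[p * COUNT + i] = it + p;   /* fill this thread's partition */
                MPI_Pready(p, req);  /* early-bird: this partition can leave now */
            }
            MPI_Wait(&req, MPI_STATUS_IGNORE);     /* single sync at the end */
        }
        MPI_Request_free(&req);
    } else if (rank == 1) {
        MPI_Precv_init(buf, PARTITIONS, COUNT, MPI_DOUBLE, 0, 0,
                       MPI_COMM_WORLD, MPI_INFO_NULL, &req);
        for (int it = 0; it < ITERS; it++) {
            MPI_Start(&req);
            MPI_Wait(&req, MPI_STATUS_IGNORE);     /* whole buffer is complete here */
        }
        MPI_Request_free(&req);
    }
    MPI_Finalize();
    return 0;
}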

Other papers

GPUs?

Q/A

Refs are hyperlinked

Link to presentation repo