Notes on MPI
This section presents notes from various tests performed to investigate the possibilities of RDMA (Remote Direct Memory Access) and one-sided communication using MPI on Stallo.
A lot of information can be found at http://www.open-mpi.org , but here we focus on the end-user point of view.
Test: send while the receiving process is not in an MPI call (i.e. eager send)
We have first tested how large chunks of data can be sent eagerly with MPI_Send (which in that case effectively does not block), even if the receiving process is busy (not inside an MPI call).
In order to obtain meaningful results, the code must include one successful communication between the actual pair of processes before the eager send. This is necessary to initialize MPI for eager send.
Otherwise the send cannot be eager. It is also only after this initialization that resources are reserved in memory (MPI_Init is not sufficient). A barrier is not enough: assume, for example, that ranks 0-7 are on one node and ranks 8-15 on another node. A call to MPI_Barrier will initialize the connection between rank 0 and 8, or between rank 1 and 9, but not between rank 0 and 9.
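A minimal sketch of this warm-up pattern is given below (the rank numbers and the tiny message are arbitrary assumptions for illustration, not taken from the actual test code); the key point is that the exchange is between the exact pair of ranks that will later communicate eagerly:

program warmup_pair
  use mpi
  implicit none
  integer :: rank, ierr, status(MPI_STATUS_SIZE)
  double precision :: dummy(1)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  dummy = 0.0d0

  ! Warm-up exchange between the exact pair (here ranks 0 and 9, assumed to
  ! sit on different nodes); a collective such as MPI_Barrier would not
  ! necessarily initialize this particular connection.
  if (rank == 0) then
     call MPI_Send(dummy, 1, MPI_DOUBLE_PRECISION, 9, 0, MPI_COMM_WORLD, ierr)
  else if (rank == 9) then
     call MPI_Recv(dummy, 1, MPI_DOUBLE_PRECISION, 0, 0, MPI_COMM_WORLD, status, ierr)
  end if

  ! From here on, messages from rank 0 to rank 9 that fit below
  ! btl_openib_eager_limit can be sent eagerly, even while rank 9 is busy
  ! outside MPI.

  call MPI_Finalize(ierr)
end program warmup_pair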
For sends between InfiniBand nodes, the default eager-send limit is 12 kB and is determined by the MCA parameter btl_openib_eager_limit. This value can be changed by the user, for example:
mpirun --mca btl_openib_eager_limit 65000 -pernode a.out
btl_openib_eager_limit must be <= btl_openib_max_send_size (64 kB by default), but this value can also be changed, for example:
mpirun --mca btl_openib_eager_limit 1000000 --mca btl_openib_max_send_size 1000000 -pernode a.out
For eager sends within a node using MPI shared memory (sm), the default limit is 4 kB; it can be changed with, for example:
mpirun --mca btl_sm_eager_limit 1000000 -pernode a.out
For a send to itself, the default limit is 128 kB; it can be changed with, for example:
mpirun --mca btl_self_eager_limit 2000000000 -np 2 a.out
Open MPI will reserve about 600 times more memory than the value of btl_openib_eager_limit. That means that for sending 10 MB eagerly between nodes, about 6 GB must be reserved exclusively for Open MPI buffers (this does, however, not seem to increase proportionally with the number of nodes).
btl_openib_free_list_max must be large (>260), and roughly this number times btl_openib_eager_limit times 2 is reserved, which is consistent with the factor of about 600 above (260 × 2 ≈ 520).
In the case of a non-blocking send (MPI_Isend), the data will not necessarily be sent before the next MPI call is reached in the sending process. It is sent properly between nodes, but not necessarily within a node. The connection also has to be initialized first (see above).
The limit on how much can be sent within a node with MPI_Isend is determined by btl_sm_eager_limit. This parameter can be increased without a significant memory penalty.
The best way to take advantage of RDMA is to use non-blocking sends (MPI_Isend) in conjunction with a sufficiently high value of btl_sm_eager_limit.
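A minimal sketch of that pattern (the buffer size is an assumption; btl_sm_eager_limit would have to be raised above it on the mpirun command line, as shown earlier):

program isend_within_node
  use mpi
  implicit none
  integer :: rank, ierr, request, status(MPI_STATUS_SIZE)
  double precision :: buf(50000)   ! 400 kB; assumed below the raised btl_sm_eager_limit

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  buf = dble(rank)

  if (rank == 0) then
     call MPI_Isend(buf, size(buf), MPI_DOUBLE_PRECISION, 1, 0, MPI_COMM_WORLD, request, ierr)
     ! ... rank 0 can continue computing while the data is transferred ...
     call MPI_Wait(request, status, ierr)
  else if (rank == 1) then
     ! ... rank 1 may stay busy outside MPI before posting the receive ...
     call MPI_Recv(buf, size(buf), MPI_DOUBLE_PRECISION, 0, 0, MPI_COMM_WORLD, status, ierr)
  end if

  call MPI_Finalize(ierr)
end program isend_within_node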
A simple Fortran program was used to test these eager send limits.
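A minimal sketch of such a test (an illustrative reconstruction, not the original program; the message size, the 5 second delay and running with one process per node are assumptions): rank 1 deliberately stays outside MPI for a few seconds before posting its receive, so if MPI_Send on rank 0 returns much sooner than that, the message was sent eagerly.

program eager_limit_test
  use mpi
  implicit none
  integer, parameter :: n = 1000          ! 8 kB message; vary this to probe the limit
  integer :: rank, ierr, status(MPI_STATUS_SIZE)
  integer :: count0, count1, count_rate
  double precision :: buf(n), t0, t1

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  buf = 1.0d0

  ! Initialize the connection between the pair first (see above), then
  ! synchronize before timing.
  if (rank == 0) call MPI_Send(buf, 1, MPI_DOUBLE_PRECISION, 1, 0, MPI_COMM_WORLD, ierr)
  if (rank == 1) call MPI_Recv(buf, 1, MPI_DOUBLE_PRECISION, 0, 0, MPI_COMM_WORLD, status, ierr)
  call MPI_Barrier(MPI_COMM_WORLD, ierr)

  if (rank == 0) then
     t0 = MPI_Wtime()
     call MPI_Send(buf, n, MPI_DOUBLE_PRECISION, 1, 1, MPI_COMM_WORLD, ierr)
     t1 = MPI_Wtime()
     print *, 'MPI_Send returned after', t1 - t0, 'seconds'
  else if (rank == 1) then
     ! Busy-wait outside MPI for about 5 seconds before receiving.
     call system_clock(count0, count_rate)
     do
        call system_clock(count1)
        if (real(count1 - count0) / real(count_rate) > 5.0) exit
     end do
     call MPI_Recv(buf, n, MPI_DOUBLE_PRECISION, 0, 1, MPI_COMM_WORLD, status, ierr)
  end if

  call MPI_Finalize(ierr)
end program eager_limit_test

Run with, for example, mpirun -pernode a.out to place the two ranks on different nodes and exercise the InfiniBand path.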
The Intel MPI library has different rules. The default eager send limit is 64 kB. This can be increased, but with an important limitation, see below.
Syntax for increasing limits:
module unload openmpi/1.4
module unload intel-compiler/11.1
module load intel-compiler/12.1.0
module load intel-mpi
mpirun -np 16 -env I_MPI_EAGER_THRESHOLD 25600 -env I_MPI_INTRANODE_EAGER_THRESHOLD 2560 a.out
There are fundamental differences compared to Open MPI:
Eager send does not work as long as the receiving process is not in an MPI call (not necessarily the matching receive call).
Increasing the threshold does not carry the large memory penalty seen with Open MPI.