Ubuntu Manpage: lamssi_rpi - overview of LAM's RPI SSI modules

NAME

       lamssi_rpi - overview of LAM's RPI SSI modules

DESCRIPTION

       The  "kind"  for  RPI  SSI  modules is "rpi".  Specifically, the string "rpi" (without the
       quotes) should be used to specify which RPI should be used on the mpirun command line with
       the -ssi switch.  For example:

       mpirun -ssi rpi tcp C my_mpi_program
           Specifies to use the tcp RPI (and to launch a single copy of the executable
           "my_mpi_program" on each node).

       The "rpi" string is also used as a prefix to send parameters to specific RPI modules.  For
       example:

       mpirun -ssi rpi tcp -ssi rpi_tcp_short 131072 C my_mpi_program
           Specifies  to  use the tcp RPI, and to pass in the value of 131072 (128K) as the short
           message length for TCP messages.  See each RPI section below for a full description of
           parameters that are accepted by each RPI.

       LAM currently supports six different RPI SSI modules: crtcp, gm, lamd, tcp, sysv, usysv.

SELECTING AN RPI MODULE

       Only one RPI module may be selected per command execution.  The module is selected during
       MPI_INIT and is used for the duration of the MPI process.  It is erroneous to select
       different RPI modules for different processes.

       The kind for selecting an RPI is "rpi".  For example:

       mpirun -ssi rpi tcp C my_mpi_program
           Selects the tcp RPI and runs a single copy of the my_mpi_program executable on each
           node.

AVAILABLE MODULES

As with all SSI modules, it is possible to pass parameters at run time. This section
discusses the built-in LAM RPI modules, as well as the run-time parameters that they
accept.

In the discussion below, the parameters are discussed in terms of kind and name. The kind
and name may be specified as command line arguments to the mpirun command with the -ssi
switch, or they may be set in environment variables of the form LAM_MPI_SSI_name=value.
Note that using the -ssi command line switch will take precedence over any environment
variables.
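
As a rough illustration of the two mechanisms described above, the following shell sketch
selects the tcp RPI and sets its short message length through environment variables, then
builds the equivalent mpirun command line. The variable names are assumed to follow the
LAM_MPI_SSI_name form given above, and my_mpi_program is a placeholder. The command is only
printed, not executed, so the sketch runs without a LAM installation.

```shell
# Select the tcp RPI and set its short message length via the environment.
# Names are assumed to follow the LAM_MPI_SSI_<name> form described above.
LAM_MPI_SSI_rpi=tcp
LAM_MPI_SSI_rpi_tcp_short=131072
export LAM_MPI_SSI_rpi LAM_MPI_SSI_rpi_tcp_short

# The same request made with -ssi switches; these take precedence over the
# environment when both are given.  Printed only: drop "echo" to run it.
cmd="mpirun -ssi rpi tcp -ssi rpi_tcp_short 131072 C my_mpi_program"
echo "$cmd"
```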

If the RPI that is selected is unable to run (e.g., attempting to use the gm RPI when gm
support was not compiled into LAM, or if no gm hardware is available on the nodes), an
appropriate error message will be printed and execution will abort.

crtcp RPI
The crtcp RPI is a checkpoint/restart-able version of the tcp RPI (see below). It is
separate from the tcp RPI because the current implementation imposes a slight performance
penalty to enable the ability to checkpoint and restart MPI jobs. Its tunable parameters
are the same as the tcp RPI. This RPI probably only needs to be used when the ability to
checkpoint and restart MPI jobs is required.

See the LAM/MPI User's Guide for more details on the crtcp RPI as well as the
checkpoint/restart capabilities of LAM/MPI. The lamssi_cr(7) manual page also contains
additional information.

gm RPI
The gm RPI is used with native Myrinet networks. Please note that the gm RPI exists, but
has not yet been optimized. It gives significantly better performance than TCP over
Myrinet networks, but has not yet been properly tuned and instrumented in LAM.

That being said, there are several tunable parameters in the gm RPI:

rpi_gm_maxport N
If rpi_gm_port is not specified, LAM will attempt to find an open GM port to use for
MPI communications starting with port 1 and ending with the N value specified by the
rpi_gm_maxport parameter. If unspecified, LAM will try all existing GM ports.

rpi_gm_port N
LAM will attempt to use gm port N for MPI communications.

rpi_gm_tinymsglen N
Specifies the maximum message size (in bytes) for "tiny" messages (i.e., messages that
are sent entirely in one gm message). Tiny messages are memcpy'ed into the header
before it is sent to the destination, and memcpy'ed out of the header into the
destination buffer on the receiver. Hence, it is not advisable to make this value too
large.

rpi_gm_fast 1
Specifies to use the "fast" protocol for sending short gm messages. This protocol is
unreliable in the presence of GM errors or timeouts, so this parameter is not advised for
MPI applications that do not make continual progress within MPI.

rpi_gm_cr 1
Enable checkpoint/restart behavior for gm. This can only be enabled if the gm rpi
module was compiled with support for the gm_get() function, which is disabled by
default. See the LAM Installation and User's Guides for more information on this
parameter before you use it.
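
The port search described for rpi_gm_maxport above can be pictured with the following shell
sketch. It is only a model of the stated behavior (try ports 1 through N in order), not
LAM's actual code. The functions port_is_free and find_gm_port are hypothetical, and
port_is_free simply pretends that port 3 is the only open port, since a real check would
have to go through the GM library.

```shell
# Hypothetical stand-in for "is GM port N open?".  For illustration it
# pretends that only port 3 is free; a real check would use the GM library.
port_is_free() {
    [ "$1" -eq 3 ]
}

# Try ports 1..MAXPORT in order and print the first free one.
# Returns nonzero if no port in that range is free.
find_gm_port() {
    maxport=$1
    port=1
    while [ "$port" -le "$maxport" ]; do
        if port_is_free "$port"; then
            echo "$port"
            return 0
        fi
        port=$((port + 1))
    done
    return 1
}

find_gm_port 8    # prints 3
```

With a maxport value of 2 the search in this sketch fails, because the only free port lies
outside the range that is tried.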

lamd RPI
The lamd RPI uses LAM's "out-of-band" communication mechanism for passing MPI messages.
Specifically, MPI messages are sent from the user process to the local LAM daemon, then to
the remote LAM daemon (if the destination process is on a different node), and then to the
destination process.

While this adds latency to message passing because of the extra hops that each message
must travel, it allows for true asynchronous message passing. Since the LAM daemon is
running in its own execution space, it can make progress on message passing regardless of
the state / status of the user's program. This can be an overall net savings in
performance and execution time for some classes of MPI programs.

It is expected that this RPI will someday become obsolete when LAM becomes multi-threaded
and allows progress to be made on message passing in separate threads rather than in
separate processes.

The lamd RPI has no tunable parameters.

tcp RPI
The tcp RPI uses pure TCP for all MPI message passing. TCP sockets are opened between MPI
processes and are used for all MPI traffic.

The tcp RPI has one tunable parameter:

rpi_tcp_short <bytes>
Tells the tcp RPI the smallest size (in bytes) for a message to be considered "long".
Short messages are sent eagerly (even if the receiving side is not expecting them).
Long messages use a rendezvous protocol (i.e., a three-way handshake) such that the
message is not actually sent until the receiver is expecting it. This value defaults
to 64k.
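
To make the threshold concrete, the shell sketch below classifies a message by size. The
helper tcp_protocol is hypothetical and only mirrors the rule stated above (a message whose
size is at least rpi_tcp_short is "long"); it assumes that the 64k default means 65536
bytes.

```shell
# Hypothetical helper: report which protocol a message of SIZE bytes would
# use for a given rpi_tcp_short value (assumed default: 64k = 65536 bytes).
tcp_protocol() {
    size=$1
    short=${2:-65536}
    if [ "$size" -ge "$short" ]; then
        echo "long (rendezvous)"
    else
        echo "short (eager)"
    fi
}

tcp_protocol 65535            # prints: short (eager)
tcp_protocol 65536            # prints: long (rendezvous)
tcp_protocol 100000 131072    # prints: short (eager)
```

The last call shows the effect of raising rpi_tcp_short to 131072: a 100000-byte message
that would otherwise use the rendezvous protocol is sent eagerly instead.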

sysv RPI
The sysv RPI uses shared memory for communication between MPI processes on the same node,
and TCP sockets for communication between MPI processes on different nodes. System V
semaphores are used to lock the shared memory pools. This RPI is best used when running
multiple MPI processes on uniprocessors (or oversubscribed SMPs) because of the blocking /
yielding nature of semaphores.

The sysv RPI has the following tunable parameters:

rpi_tcp_short <bytes>
Since the sysv RPI uses parts of the tcp RPI for off-node communication, this
parameter also has relevance to the sysv RPI. The meaning of this parameter is
discussed in the tcp RPI section.

rpi_sysv_short <bytes>
Tells the sysv RPI the smallest size (in bytes) for a message to be considered "long".
Short shared memory messages are sent using a small "postbox" protocol; long messages
use a more general shared memory pool method. This value defaults to 8k.

rpi_sysv_pollyield <bool>
If set to a nonzero number, force the use of a system call to yield the processor.
The system call will be yield(), sched_yield(), or select() (with a 1ms timeout),
depending on what LAM's configure script finds at configuration time. This value
defaults to 1.

rpi_sysv_shmpoolsize <bytes>
The size of the shared memory pool that is used for long message transfers. It is
allocated once on each node for each MPI parallel job. Specifically, if multiple MPI
processes from the same parallel job are spawned on a single node, this pool will only
be allocated once.

The configure script will try to determine a default size for the pool if none is
explicitly specified (you should always check this to see if it is reasonable).
Larger values should improve performance especially when an application passes large
messages, but will also increase the system resources used by each task.

rpi_sysv_shmmaxalloc <bytes>
To prevent a single large message transfer from monopolizing the global pool,
allocations from the pool are actually restricted to a maximum of rpi_sysv_shmmaxalloc
bytes each. Even with this restriction, it is possible for the global pool to
temporarily become exhausted. In this case, the transport will fall back to using the
postbox area to transfer the message. Performance will be degraded, but the
application will progress.

The configure script will try to determine a default size for the maximum atomic
transfer size if none is explicitly specified (you should always check this to see if
it is reasonable). Larger values should improve performance especially when an
application passes large messages, but will also increase the system resources used by
each task.
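
As a rough way to reason about rpi_sysv_shmmaxalloc, the shell sketch below estimates how
many pool allocations a long message would need. It assumes that a message larger than the
limit is moved in pieces of at most rpi_sysv_shmmaxalloc bytes each, which is an
interpretation of the description above rather than a statement about LAM's internals. The
helper sysv_pool_pieces is hypothetical, and the 65536-byte limit is an illustrative value,
not LAM's default.

```shell
# Hypothetical helper: number of pool allocations for a message of SIZE
# bytes, assuming it is moved in pieces of at most MAXALLOC bytes each.
sysv_pool_pieces() {
    size=$1
    maxalloc=$2
    # Integer ceiling of size / maxalloc.
    echo $(( (size + maxalloc - 1) / maxalloc ))
}

sysv_pool_pieces 1048576 65536    # prints 16
sysv_pool_pieces 100000 65536     # prints 2
```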

usysv RPI
The usysv RPI uses shared memory for communication between MPI processes on the same node,
and TCP sockets for communication between MPI processes on different nodes. Spin locks
are used to lock the shared memory pools. This RPI is best used when the number of MPI
processes on a single node is less than or equal to the number of processors because it
allows LAM to fully occupy the processor while waiting for a message and never be swapped
out.
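
The guidance for choosing between the sysv and usysv RPIs can be summarized in a small shell
sketch. The helper suggest_shm_rpi is hypothetical and encodes only the rule of thumb given
in this page; the process and processor counts are passed in by the caller, and the
resulting mpirun command is printed rather than executed.

```shell
# Hypothetical helper: suggest a shared memory RPI for NPROCS MPI processes
# on a node with NCPUS processors, following the rule of thumb in this page.
suggest_shm_rpi() {
    nprocs=$1
    ncpus=$2
    if [ "$nprocs" -le "$ncpus" ]; then
        echo usysv    # spin locks: each process can keep a processor busy
    else
        echo sysv     # semaphores: blocking/yielding suits oversubscription
    fi
}

rpi=$(suggest_shm_rpi 4 8)
echo "mpirun -ssi rpi $rpi C my_mpi_program"
# prints: mpirun -ssi rpi usysv C my_mpi_program
```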

The usysv RPI has many of the same tunable parameters as the sysv RPI:

rpi_tcp_short <bytes>
Same meaning as in the sysv RPI.

rpi_usysv_short <bytes>
Same meaning as rpi_sysv_short in the sysv RPI.

rpi_usysv_pollyield <bool>
Same meaning as rpi_sysv_pollyield in the sysv RPI.

rpi_usysv_shmpoolsize <bytes>
Same meaning as rpi_sysv_shmpoolsize in the sysv RPI.

rpi_usysv_shmmaxalloc <bytes>
Same meaning as rpi_sysv_shmmaxalloc in the sysv RPI.

rpi_usysv_readlockpoll <iterations>
Number of iterations to spin before yielding the processor while waiting to read.
This value defaults to 10,000.

rpi_usysv_writelockpoll <iterations>
Number of iterations to spin before yielding the processor while waiting to write.
This value defaults to 10.
