
Provided by: lam-runtime_7.1.4-7.2_amd64

NAME

       lamssi_rpi - overview of LAM's RPI SSI modules

DESCRIPTION

       The "kind" for RPI SSI modules is "rpi".  Specifically, the string "rpi" (without the
       quotes) is given to the -ssi switch on the mpirun command line to specify which RPI to
       use.  For example:

       mpirun -ssi rpi tcp C my_mpi_program
           Specifies to use the tcp RPI (and to launch a single copy of the executable
           "my_mpi_program" on each node).

       The "rpi" string is also used as a prefix to send parameters to specific RPI modules.
       For example:

       mpirun -ssi rpi tcp -ssi rpi_tcp_short 131072 C my_mpi_program
           Specifies  to  use the tcp RPI, and to pass in the value of 131072 (128K) as the short
           message length for TCP messages.  See each RPI section below for a full description of
           parameters that are accepted by each RPI.
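
       As a minimal sketch of the parameter-passing form above, the following computes the
       128K threshold in bytes and prints the resulting command line rather than running it.
       The variable names short_kib, short_bytes and cmd are illustrative only and are not
       part of LAM.

```shell
# Dry run: build the mpirun command line from the example and print it.
# short_kib, short_bytes and cmd are hypothetical names used for illustration.
short_kib=128
short_bytes=$((short_kib * 1024))   # 128K expressed in bytes

# "rpi" selects the module; "rpi_tcp_short" passes a parameter to it.
cmd="mpirun -ssi rpi tcp -ssi rpi_tcp_short $short_bytes C my_mpi_program"
echo "$cmd"
```

       Removing the echo and running the command directly would require a booted LAM
       universe and a compiled MPI program.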

       LAM currently supports six different RPI SSI modules: crtcp, gm, lamd, tcp, sysv, usysv.

SELECTING AN RPI MODULE

       Only one RPI module may be selected per command execution.  The selection of which
       module to use occurs during MPI_INIT, and the selected module is used for the duration
       of the MPI process.  It is erroneous to select different RPI modules for different
       processes.

       The kind for selecting an RPI is "rpi".  For example:

       mpirun -ssi rpi tcp C my_mpi_program
           Selects the tcp RPI and runs a single copy of the my_mpi_program executable on each
           node.

AVAILABLE MODULES

       As  with  all  SSI  modules,  it is possible to pass parameters at run time.  This section
       discusses the built-in LAM RPI modules, as well  as  the  run-time  parameters  that  they
       accept.

       In the discussion below, the parameters are discussed in terms of kind and name.  The kind
       and name may be specified as command line arguments to the mpirun command  with  the  -ssi
       switch,  or  they  may be set in environment variables of the form LAM_MPI_SSI_name=value.
       Note that using the -ssi command line switch will take precedence over any environment
       variables.
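
       The precedence rule can be sketched as follows.  This assumes that "name" in
       LAM_MPI_SSI_name is the SSI parameter name (for example rpi or rpi_tcp_short); the
       effective_rpi function is a hypothetical helper that only models the rule and is not
       a LAM command.

```shell
# Set SSI parameters through the environment (assumed naming: LAM_MPI_SSI_<name>).
LAM_MPI_SSI_rpi=tcp
LAM_MPI_SSI_rpi_tcp_short=131072
export LAM_MPI_SSI_rpi LAM_MPI_SSI_rpi_tcp_short

# Hypothetical model of the precedence rule: a value given with -ssi on the
# command line wins; otherwise the environment variable is used.
effective_rpi() {
    cli_value=${1:-}
    if [ -n "$cli_value" ]; then
        echo "$cli_value"
    else
        echo "${LAM_MPI_SSI_rpi:-}"
    fi
}

effective_rpi          # no -ssi value given: falls back to the environment
effective_rpi sysv     # "-ssi rpi sysv" given: overrides the environment
```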

       If  the  RPI that is selected is unable to run (e.g., attempting to use the gm RPI when gm
       support was not compiled into LAM, or if no gm hardware is available  on  the  nodes),  an
       appropriate error message will be printed and execution will abort.

   crtcp RPI
       The  crtcp  RPI  is  a  checkpoint/restart-able version of the tcp RPI (see below).  It is
       separate from the tcp RPI because the current implementation imposes a slight  performance
       penalty  to enable the ability to checkpoint and restart MPI jobs.  Its tunable parameters
       are the same as the tcp RPI.  This RPI probably only needs to be used when the ability  to
       checkpoint and restart MPI jobs is required.

       See  the  LAM/MPI  User's  Guide  for  more  details  on  the  crtcp  RPI  as  well as the
       checkpoint/restart capabilities of LAM/MPI.  The lamssi_cr(7) manual  page  also  contains
       additional information.

   gm RPI
       The  gm RPI is used with native Myrinet networks.  Please note that the gm RPI exists, but
       has not yet been optimized.  It gives  significantly  better  performance  than  TCP  over
       Myrinet networks, but has not yet been properly tuned and instrumented in LAM.

       That being said, there are several tunable parameters in the gm RPI:

       rpi_gm_maxport N
           If rpi_gm_port is not specified, LAM will attempt to find an open GM port to use for
           MPI communications, starting with port 1 and ending with the N value specified by
           the rpi_gm_maxport parameter.  If unspecified, LAM will try all existing GM ports.

       rpi_gm_port N
           LAM will attempt to use gm port N for MPI communications.

       rpi_gm_tinymsglen N
           Specifies the maximum message size (in bytes) for "tiny" messages (i.e., messages that
           are sent entirely in one gm message).  Tiny messages are  memcpy'ed  into  the  header
           before  it  is  sent  to  the  destination,  and  memcpy'ed out of the header into the
           destination buffer on the receiver.  Hence, it is not advisable to make this value too
           large.

       rpi_gm_fast 1
           Specifies to use the "fast" protocol for sending short gm messages.  This protocol
           is unreliable in the presence of GM errors or timeouts, and is not advised for MPI
           applications that do not make continual progress within MPI.

       rpi_gm_cr 1
           Enable  checkpoint/restart  behavior  for  gm.  This can only be enabled if the gm rpi
           module was compiled with support for the  gm_get()  function,  which  is  disabled  by
           default.   See  the  LAM  Installation  and User's Guides for more information on this
           parameter before you use it.

   lamd RPI
       The lamd RPI uses LAM's "out-of-band" communication mechanism for  passing  MPI  messages.
       Specifically, MPI messages are sent from the user process to the local LAM daemon, then to
       the remote LAM daemon (if the destination process is on a different node), and then to the
       destination process.

       While  this  adds  latency  to message passing because of the extra hops that each message
       must travel, it allows for true asynchronous message passing.  Since  the  LAM  daemon  is
       running  in its own execution space, it can make progress on message passing regardless of
       the state / status of the  user's  program.   This  can  be  an  overall  net  savings  in
       performance and execution time for some classes of MPI programs.

       It  is expected that this RPI will someday become obsolete when LAM becomes multi-threaded
       and allows progress to be made on message passing  in  separate  threads  rather  than  in
       separate processes.

       The lamd RPI has no tunable parameters.

   tcp RPI
       The tcp RPI uses pure TCP for all MPI message passing.  TCP sockets are opened between MPI
       processes and are used for all MPI traffic.

       The tcp RPI has one tunable parameter:

       rpi_tcp_short <bytes>
           Tells the tcp RPI the smallest size (in bytes) for a message to be considered "long".
           Short messages are sent eagerly (even if the receiving side is not expecting them).
           Long messages use a rendezvous protocol (i.e., a three-way handshake) such that the
           message is not actually sent until the receiver is expecting it.  This value
           defaults to 64k.
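
       The short/long decision described above can be modelled with a small sketch.  The
       tcp_protocol function is hypothetical and is not part of LAM; it assumes that "64k"
       means 65536 bytes and that a message whose size equals the threshold already counts
       as long, since the parameter is described as the smallest long size.

```shell
# Hypothetical model of the tcp RPI protocol choice (not LAM code).
# $1 = message size in bytes, $2 = rpi_tcp_short value (assumed default 65536).
tcp_protocol() {
    size=$1
    threshold=${2:-65536}
    if [ "$size" -ge "$threshold" ]; then
        echo rendezvous    # long message: sent once the receiver expects it
    else
        echo eager         # short message: sent immediately
    fi
}

tcp_protocol 1024             # below the assumed default threshold
tcp_protocol 65536            # exactly at the threshold
tcp_protocol 100000 131072    # threshold raised with rpi_tcp_short
```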

   sysv RPI
       The sysv RPI uses shared memory for communication between MPI processes on the same  node,
       and  TCP  sockets  for  communication  between MPI processes on different nodes.  System V
       semaphores are used to lock the shared memory pools.  This RPI is best used  when  running
       multiple MPI processes on uniprocessors (or oversubscribed SMPs) because of the blocking /
       yielding nature of semaphores.

       The sysv RPI has the following tunable parameters:

       rpi_tcp_short <bytes>
           Since the sysv RPI uses  parts  of  the  tcp  RPI  for  off-node  communication,  this
           parameter  also  has  relevance  to  the  sysv  RPI.  The meaning of this parameter is
           discussed in the tcp RPI section.

       rpi_sysv_short <bytes>
           Tells the sysv RPI the smallest size (in bytes) for a message to be considered "long".
           Short  shared memory messages are sent using a small "postbox" protocol; long messages
           use a more general shared memory pool method.  This value defaults to 8k.

       rpi_sysv_pollyield <bool>
           If set to a nonzero number, force the use of a system call to yield the processor.
           The system call will be yield(), sched_yield(), or select() (with a 1ms timeout),
           depending on what LAM's configure script finds at configuration time.  This value
           defaults to 1.

       rpi_sysv_shmpoolsize <bytes>
           The  size  of  the  shared memory pool that is used for long message transfers.  It is
           allocated once on each node for each MPI parallel job.  Specifically, if multiple  MPI
           processes from the same parallel job are spawned on a single node, this pool will only
           be allocated once.

           The configure script will try to determine a default size for  the  pool  if  none  is
           explicitly  specified  (you  should  always  check  this  to see if it is reasonable).
           Larger values should improve performance especially when an application  passes  large
           messages, but will also increase the system resources used by each task.

       rpi_sysv_shmmaxalloc <bytes>
           To  prevent  a  single  large  message  transfer  from  monopolizing  the global pool,
           allocations from the pool are actually restricted to a maximum of rpi_sysv_shmmaxalloc
           bytes  each.   Even  with  this  restriction,  it  is  possible for the global pool to
           temporarily become exhausted. In this case, the transport will fall back to using  the
           postbox  area  to  transfer  the  message.  Performance  will  be  degraded,  but  the
           application will progress.

           The configure script will try to determine a  default  size  for  the  maximum  atomic
           transfer  size if none is explicitly specified (you should always check this to see if
           it is reasonable).  Larger  values  should  improve  performance  especially  when  an
           application passes large messages, but will also increase the system resources used by
           each task.
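
       As an illustration of setting both pool parameters together, the sketch below prints
       a command line for the sysv RPI.  The sizes (a 16 MB pool and a 1 MB maximum
       allocation) are arbitrary example values, not recommendations, and sysv_cmd is a
       hypothetical helper.  The check that the maximum allocation does not exceed the pool
       size is a sanity check that seems reasonable, not a rule stated by this page.

```shell
# Hypothetical helper: print (do not run) an mpirun command for the sysv RPI.
# $1 = rpi_sysv_shmpoolsize in bytes, $2 = rpi_sysv_shmmaxalloc in bytes.
sysv_cmd() {
    pool=$1
    max=$2
    # Assumed sanity check: one allocation should not be larger than the pool.
    if [ "$max" -gt "$pool" ]; then
        echo "rpi_sysv_shmmaxalloc is larger than rpi_sysv_shmpoolsize" >&2
        return 1
    fi
    echo "mpirun -ssi rpi sysv -ssi rpi_sysv_shmpoolsize $pool -ssi rpi_sysv_shmmaxalloc $max C my_mpi_program"
}

# Example values only: 16 MB pool, 1 MB per allocation.
sysv_cmd $((16 * 1024 * 1024)) $((1024 * 1024))
```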

   usysv RPI
       The usysv RPI uses shared memory for communication between MPI processes on the same node,
       and  TCP  sockets  for communication between MPI processes on different nodes.  Spin locks
       are used to lock the shared memory pools.  This RPI is best used when the number of MPI
       processes on a single node is less than or equal to the number of processors because it
       allows LAM to fully occupy the processor while waiting for a message and never be
       swapped out.

       The usysv RPI has many of the same tunable parameters as the sysv RPI:

       rpi_tcp_short <bytes>
           Same meaning as in the sysv RPI.

       rpi_usysv_short <bytes>
           Same meaning as rpi_sysv_short in the sysv RPI.

       rpi_usysv_pollyield <bool>
           Same meaning as rpi_sysv_pollyield in the sysv RPI.

       rpi_usysv_shmpoolsize <bytes>
           Same meaning as rpi_sysv_shmpoolsize in the sysv RPI.

       rpi_usysv_shmmaxalloc <bytes>
           Same meaning as rpi_sysv_shmmaxalloc in the sysv RPI.

       rpi_usysv_readlockpoll <iterations>
           Number  of  iterations  to  spin  before yielding the processor while waiting to read.
           This value defaults to 10,000.

       rpi_usysv_writelockpoll <iterations>
           Number of iterations to spin before yielding the processor  while  waiting  to  write.
           This value defaults to 10.
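
       The guidance in the sysv and usysv sections can be summarized in a short sketch that
       picks between the two shared memory RPIs for a single node.  The choose_shm_rpi
       function and its arguments are hypothetical; it encodes only the rule of thumb given
       above and ignores any other considerations.

```shell
# Hypothetical helper (not part of LAM): choose between sysv and usysv.
# $1 = number of MPI processes on the node, $2 = number of processors on it.
choose_shm_rpi() {
    nprocs=$1
    ncpus=$2
    if [ "$nprocs" -le "$ncpus" ]; then
        echo usysv    # spin locks: each process can keep a processor busy
    else
        echo sysv     # semaphores: blocking suits oversubscribed nodes
    fi
}

rpi=$(choose_shm_rpi 4 4)
echo "mpirun -ssi rpi $rpi C my_mpi_program"    # dry run, prints the command
```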

SEE ALSO

       lamssi(7), lamssi_cr(7), mpirun(1), LAM User's Guide