Provided by: lam-runtime_7.1.4-6build2_amd64

NAME

       mpirun - Run MPI programs on LAM nodes.

SYNOPSIS

       mpirun  [-fhvO]  [-c  #  | -np #] [-D | -wd dir] [-ger | -nger] [-sigs | -nsigs] [-ssi key
       value] [-nw | -w] [-nx]  [-pty  |  -npty]  [-s  node]  [-t  |  -toff  |  -ton]  [-tv]  [-x
       VAR1[=VALUE1][,VAR2[=VALUE2],...]]   [[-p  prefix_str]  [-sa  |  -sf]] [where] program [--
       args]

       Note: Although each is individually optional, at least one of where, -np, or -c must be
              specified in the above form (i.e., when a schema is not used).

       mpirun  [-fhvO] [-D | -wd dir] [-ger | -nger] [-sigs | -nsigs] [-ssi key value] [-nw | -w]
              [-nx]    [-pty    |    -npty]    [-t    |    -toff    |     -ton]     [-tv]     [-x
              VAR1[=VALUE1][,VAR2[=VALUE2],...]]  schema

       Note:  The  -c2c  and  -lamd  options  are now obsolete.  Use -ssi instead.  See the "SSI"
              section, below.

QUICK SUMMARY

       If you're simply looking for how to run an MPI application, you probably want to  use  the
       following command line:

              % mpirun C my_mpi_application

       This  will  run  one  copy of my_mpi_application on every CPU in the current LAM universe.
       Alternatively,  "N"  can  be  used  in  place  of  "C",  indicating  that  one   copy   of
       my_mpi_application  should  be  run  on  every node (as opposed to CPU) in the current LAM
       universe.  Finally:

              % mpirun -np 4 my_mpi_application

       can be used to tell LAM to explicitly run four copies of my_mpi_application, scheduling in
       a  round-robin  fashion  by  CPU  in the LAM universe.  See the rest of this page for more
       details, particularly the "Location Nomenclature" section.

OPTIONS

       There are two  forms  of  the  mpirun  command  --  one  for  programs  (i.e.,  SPMD-style
       applications),  and  one for application schemas (see appschema(5)).  Both forms of mpirun
       use the following options by default: -nger -w.  These may each be overridden by their
       counterpart options, described below.

       Additionally, mpirun will send the name of the directory where it was invoked on the local
       node to each of the remote nodes, and attempt  to  change  to  that  directory.   See  the
       "Current Working Directory" section, below.

       -c #      Synonym for -np (see below).

       -D        Use the executable program location as the current working directory for created
                 processes.  The current working directory of the created processes will  be  set
                 before  the  user's  program is invoked.  This option is mutually exclusive with
                 -wd.

       -f        Do not configure standard I/O file descriptors - use defaults.

       -h        Print useful information on this command.

       -ger      Enable GER (Guaranteed Envelope  Resources)  communication  protocol  and  error
                 reporting.   See  MPI(7)  for  a  description  of  GER.  This option is mutually
                 exclusive with -nger.

       -nger     Disable GER (Guaranteed Envelope Resources).  This option is mutually  exclusive
                 with -ger.

       -nsigs    Do not have LAM catch signals in the user application.  This is the default, and
                 is mutually exclusive with -sigs.

       -np #     Run this many copies of the program on the given nodes.  This  option  indicates
                 that  the specified file is an executable program and not an application schema.
                 If no nodes are specified, all LAM nodes are considered for scheduling; LAM will
                 schedule   the  programs  in  a  round-robin  fashion,  "wrapping  around"  (and
                 scheduling multiple copies on a single node) if necessary.

       -npty     Disable pseudo-tty support.  Unless you  are  having  problems  with  pseudo-tty
                  support, you probably do not need this option.  Mutually exclusive with -pty.

       -nw       Do not wait for all processes to complete before exiting mpirun.  This option is
                 mutually exclusive with -w.

       -nx       Do  not  automatically  export  LAM_MPI_*,  LAM_IMPI_*,  or  IMPI_*  environment
                 variables to the remote nodes.

       -O        Multicomputer  is  homogeneous.   Do  no  data conversion when passing messages.
                 THIS FLAG IS NOW OBSOLETE.

       -pty      Enable pseudo-tty support.  Among other things, this enables line-buffered
                 output  (which  is  probably  what  you  want).   This is the default.  Mutually
                 exclusive with -npty.

       -s node   Load the program from this node.  This option is not valid on the  command  line
                 if an application schema is specified.

       -sigs     Have LAM catch signals in the user process.  This option is mutually exclusive
                 with -nsigs.

       -ssi key value
                 Send arguments to various SSI modules.  See the "SSI" section, below.

       -t, -ton  Enable execution trace generation for  all  processes.   Trace  generation  will
                 proceed  with  no  further  action.   These  options are mutually exclusive with
                 -toff.

       -toff     Enable execution trace generation  for  all  processes.   Trace  generation  for
                 message   passing   traffic   will   begin  after  processes  collectively  call
                 MPIL_Trace_on(2).  Note that trace generation for  datatypes  and  communicators
                 will  proceed  regardless of whether trace generation is enabled for messages or
                 not.  This option is mutually exclusive with -t and -ton.

       -tv       Launch processes under the TotalView Debugger.

       -v        Be verbose; report on important steps as they are done.

       -w        Wait for all applications to exit before mpirun exits.

       -wd dir   Change to the directory dir before the user's program executes.   Note  that  if
                 the  -wd  option  appears both on the command line and in an application schema,
                  the schema will take precedence over the command line.  This option is mutually
                 exclusive with -D.

       -x        Export  the specified environment variables to the remote nodes before executing
                 the program.  Existing environment variables can be specified (see the  Examples
                 section, below), or new variable names specified with corresponding values.  The
                 parser for the -x option is not very sophisticated; it does not even  understand
                 quoted  values.  Users are advised to set variables in the environment, and then
                 use -x to export (not define) them.

       -sa       Display the exit status of all MPI processes irrespective of whether they fail or
                 run successfully.

       -sf       Display the exit status of all processes only if one of them fails.

       -p prefix_str
                  Prefixes each process status line displayed by -sa and -sf with prefix_str.

       where     A set of node and/or CPU identifiers indicating where to start the program.  See
                 bhost(5)  for  a  description  of  the  node  and  CPU identifiers.  mpirun will
                 schedule adjoining ranks in MPI_COMM_WORLD on the same node when CPU identifiers
                 are  used.  For example, if LAM was booted with a CPU count of 4 on n0 and a CPU
                 count of 2 on n1 and where is C, ranks 0 through 3 will be  placed  on  n0,  and
                 ranks 4 and 5 will be placed on n1.

       args      Pass  these  runtime  arguments  to every new process.  These must always be the
                 last arguments to mpirun.  This option is not valid on the command  line  if  an
                 application schema is specified.

DESCRIPTION

       One  invocation of mpirun starts an MPI application running under LAM.  If the application
       is simply SPMD, the application can be specified on  the  mpirun  command  line.   If  the
       application  is MIMD, comprising multiple programs, an application schema is required in a
       separate file.  See appschema(5) for a description of the application schema  syntax,  but
       it  essentially contains multiple mpirun command lines, less the command name itself.  The
       ability to specify different options for different instantiations of a program is  another
       reason to use an application schema.
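
       For example, a minimal application schema (the file name my_schema and the program names
       manager and worker are purely illustrative; see appschema(5) for the exact syntax) might
       contain one mpirun-style line per program:

              c0 manager
              C worker

       and would be launched with:

              % mpirun -v my_schema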

   Location Nomenclature
       As  described  above,  mpirun can specify arbitrary locations in the current LAM universe.
       Locations can be specified either by CPU or by node (noted by the "where" in the SYNOPSIS
       section,  above).   Note that LAM does not bind processes to CPUs -- specifying a location
       "by CPU" is really a convenience mechanism  for  SMPs  that  ultimately  maps  down  to  a
       specific node.

       Note  that  LAM  effectively numbers MPI_COMM_WORLD ranks from left-to-right in the where,
       regardless of which nomenclature is used.  This  can  be  important  because  typical  MPI
       programs tend to communicate more with their immediate neighbors (i.e., myrank +/- X) than
       distant neighbors.  When neighbors end up on the same node, the shmem RPIs can be used for
       communication rather than the network RPIs, which can result in faster MPI performance.

       Specifying  locations  by  node  will launch one copy of an executable per specified node.
       Using a capital "N" tells LAM to use all available nodes that were lambooted (see
       lamboot(1)).   Ranges of specific nodes can also be specified in the form "nR[,R]*", where
       R specifies either a single node number or a valid range of node numbers in the  range  of
       [0, num_nodes).  For example:

       mpirun N a.out
           Runs one copy of the executable a.out on all available nodes in the LAM universe.
           MPI_COMM_WORLD rank 0 will be on n0, rank 1 will be on n1, etc.

       mpirun n0-3 a.out
           Runs one copy of the executable a.out on nodes 0 through 3.  MPI_COMM_WORLD rank 0
           will be on n0, rank 1 will be on n1, etc.

       mpirun n0-3,8-11,15 a.out
           Runs one copy of the executable a.out on nodes 0 through 3, 8 through 11, and 15.
           MPI_COMM_WORLD ranks will be ordered as follows: (0, n0), (1, n1), (2, n2),  (3,  n3),
           (4, n8), (5, n9), (6, n10), (7, n11), (8, n15).

       Specifying  by  CPU is the preferred method of launching MPI jobs.  The intent is that the
       boot schema used with lamboot(1) will indicate how many CPUs are available on  each  node,
       and  then  a  single,  simple mpirun command can be used to launch across all of them.  As
       noted above, specifying CPUs does not actually bind processes to CPUs  --  it  is  only  a
       convenience  mechanism  for launching on SMPs.  Otherwise, the by-CPU notation is the same
       as the by-node notation, except that "C" and "c" are used instead of "N" and "n".

       Assume in the following example that the LAM universe consists of  four  4-way  SMPs.   So
       c0-3 are on n0, c4-7 are on n1, c8-11 are on n2, and c12-15 are on n3.

       mpirun C a.out
           Runs one copy of the executable a.out on all available CPUs in the LAM universe.
           This is typically the simplest (and preferred) method of launching all MPI jobs  (even
           if  it  resolves  to  one  process per node).  MPI_COMM_WORLD ranks 0-3 will be on n0,
           ranks 4-7 will be on n1, ranks 8-11 will be on n2, and ranks 12-15 will be on n3.

       mpirun c0-3 a.out
           Runs one copy of the executable a.out on CPUs 0 through 3.  All four ranks of
           MPI_COMM_WORLD will be on n0.

       mpirun c0-3,8-11,15 a.out
           Runs one copy of the executable a.out on CPUs 0 through 3, 8 through 11, and 15.
           MPI_COMM_WORLD ranks 0-3 will be on n0, 4-7 will be on n2, and 8 will be on n3.

       The reason that the by-CPU nomenclature is preferred over the by-node nomenclature is best
       shown  through  example.   Consider  trying  to  run  the first CPU example (with the same
       MPI_COMM_WORLD mapping) with the by-node nomenclature -- run one copy of a.out  for  every
       available  CPU,  and  maximize  the  number of local neighbors to potentially maximize MPI
       performance.  One solution would be to use the following command:

       mpirun n0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3 a.out

       This works, but is definitely clunky to type.  It is typically easier to use the by-CPU
       notation.  One might think that the following is equivalent:

       mpirun N -np 16 a.out

       This  is  not equivalent because the MPI_COMM_WORLD rank mappings will be assigned by node
       rather than by CPU.  Hence rank 0 will be on n0, rank 1 will be on n1, etc.  Note that the
       following, however, is equivalent, because LAM interprets lack of a where as "C":

       mpirun -np 16 a.out

       However, using "C" tends to be more convenient, especially for batch-queuing scripts
       because the exact number of processes may vary between queue submissions.  Since the batch
       system  will  determine  the  final number of CPUs available, having a generic script that
       effectively says "run on everything you gave me" may lead to  more  portable  /  re-usable
       scripts.

       Finally, it should be noted that specifying multiple where clauses is perfectly
       acceptable.  As such, mixing of the by-node and by-CPU syntax is also valid, albeit
       typically not useful.  For example:

       mpirun C N a.out

       However,  in  some  cases,  specifying  multiple  where clauses can be useful.  Consider a
       parallel application where MPI_COMM_WORLD rank 0 will be a "manager" and therefore consume
       very  few  CPU  cycles  because  it  is  usually  waiting for "worker" processes to return
       results.  Hence, it is probably desirable to run one "worker"  process  on  all  available
       CPUs, and run one extra process that will be the "manager":

       mpirun c0 C manager-worker-program

   Application Schema or Executable Program?
       To  distinguish the two different forms, mpirun looks on the command line for where or the
       -c option.  If neither is specified, then the file named on the command line is assumed to
       be  an  application schema.  If either one or both are specified, then the file is assumed
       to be an executable program.  If where and -c both  are  specified,  then  copies  of  the
       program  are  started  on the specified nodes/CPUs according to an internal LAM scheduling
       policy.  Specifying just one node effectively forces LAM to run all copies of the  program
       in one place.  If -c is given, but not where, then all available CPUs on all LAM nodes are
       used.  If where is given, but not -c, then one copy of the program is run on each node.

   Program Transfer
       By default, LAM searches for executable programs on the target  node  where  a  particular
       instantiation  will  run.   If  the  file  system  is  not  shared,  the  target nodes are
       homogeneous, and the program is frequently recompiled, it can be convenient  to  have  LAM
       transfer the program from a source node (usually the local node) to each target node.  The
       -s option specifies this behavior and identifies the single source node.
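
       For example, the following (my_app being an illustrative program name) searches for my_app
       on node n0 and transfers it to each target node before running one copy per node:

              % mpirun -s n0 N my_app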

   Locating Files
       LAM looks for an executable program by  searching  the  directories  in  the  user's  PATH
       environment  variable  as defined on the source node(s).  This behavior is consistent with
       logging into the source node and executing the program from the shell.  On  remote  nodes,
       the "." path is the home directory.

       LAM  looks  for an application schema in three directories: the local directory, the value
       of the LAMAPPLDIR environment variable, and laminstalldir/boot, where  "laminstalldir"  is
       the directory where LAM/MPI was installed.

   Standard I/O
       LAM  directs UNIX standard input to /dev/null on all remote nodes.  On the local node that
       invoked mpirun, standard input is inherited from mpirun.  This default behavior (formerly
       requested with the -w option) prevents conflicting access to the terminal.

       LAM  directs  UNIX  standard  output and error to the LAM daemon on all remote nodes.  LAM
       ships all captured output/error to the node that invoked  mpirun  and  prints  it  on  the
       standard  output/error  of  mpirun.   Local processes inherit the standard output/error of
       mpirun and transfer to it directly.

       Thus it is possible to redirect standard I/O for LAM applications  by  using  the  typical
       shell redirection procedure on mpirun.

              % mpirun C my_app < my_input > my_output

       Note  that  in  this  example only the local node (i.e., the node where mpirun was invoked
       from) will receive the stream from my_input on stdin.  The stdin on all  the  other  nodes
       will  be tied to /dev/null.  However, the stdout from all nodes will be collected into the
       my_output file.

       The -f option avoids all the setup required  to  support  standard  I/O  described  above.
       Remote  processes  are  completely  directed to /dev/null and local processes inherit file
       descriptors from lamboot(1).

   Pseudo-tty support
       The -pty option enables pseudo-tty support for process output (it is also enabled by
       default).   This  allows,  among  other things, for line buffered output from remote nodes
       (which is probably what you want).  This option can be disabled with the -npty switch.

   Process Termination / Signal Handling
       During the run of an MPI application, if any rank dies abnormally (either  exiting  before
       invoking MPI_FINALIZE, or dying as the result of a signal), mpirun will print out an error
       message and kill the rest of the MPI application.

       By default, LAM/MPI only installs a  signal  handler  for  one  signal  in  user  programs
       (SIGUSR2  by  default,  but  this  can  be  overridden  when LAM is configured and built).
       Therefore, it is safe for users to install their own signal handlers in  LAM/MPI  programs
       (LAM notices death-by-signal cases by examining the process' return status provided by the
       operating system).

       User signal handlers should probably avoid trying to cleanup MPI state -- LAM  is  neither
       thread-safe  nor  async-signal-safe.   For  example,  if  a  seg  fault occurs in MPI_SEND
       (perhaps because a bad buffer was passed in) and a user signal handler is invoked, if this
       user  handler  attempts  to invoke MPI_FINALIZE, Bad Things could happen since LAM/MPI was
       already "in" MPI when the error occurred.  Since mpirun will notice that the process died
       due to a signal, it is not necessary for the user's handler to invoke MPI_FINALIZE; it is
       safest for the user to clean up only non-MPI state.

       If the -sigs option is used with mpirun, LAM/MPI will install several signal handlers
       locally on each rank to catch signals, print out error messages, and kill the rest of the
       MPI application.  This is somewhat redundant behavior since this is now all handled by
       mpirun, but it has been left for backwards compatibility.

   Process Exit Statuses
       The -sa, -sf, and -p parameters can be used to display the exit statuses of the
       individual MPI processes as they terminate.  -sa forces the exit statuses to be displayed
       for all processes; -sf only displays the exit statuses if at least one process terminates
       either by a signal or a non-zero exit status (note that exiting before invoking
       MPI_FINALIZE will cause a non-zero exit status).

       The status of each process is printed out, one per line, in the following format:

              prefix_string node pid killed status

       If killed is 1, then status is the signal number.  If killed is 0, then status is the exit
       status of the process.

       The default prefix_string is "mpirun:", but the -p option can be used to override this
       string.
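
       For example, a hypothetical invocation such as:

              % mpirun -sf -p "mystatus:" C my_app

       prints status lines only if at least one process fails.  A line such as

              mystatus: n1 1234 0 1

       would indicate that the process with PID 1234 on n1 was not killed by a signal (killed is
       0) and exited with status 1.  (The prefix, program name, node, PID, and status shown here
       are purely illustrative.)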

   Current Working Directory
       The  default  behavior  of mpirun has changed with respect to the directory that processes
       will be started in.

       The -wd option to mpirun allows the user to change to an arbitrary directory before  their
       program  is  invoked.   It can also be used in application schema files to specify working
       directories on specific nodes and/or for specific applications.

       If the -wd option appears both in a schema file and on the command line, the  schema  file
       directory will override the command line value.

       The  -D  option  will  change  the  current  working  directory to the directory where the
       executable resides.  It cannot be used in  application  schema  files.   -wd  is  mutually
       exclusive with -D.

       If  neither  -wd  nor  -D are specified, the local node will send the directory name where
       mpirun was invoked from to each of the remote nodes.  The remote nodes will  then  try  to
       change to that directory.  If they fail (e.g., if the directory does not exist on that
       node), they will start from the user's home directory.

       All directory changing occurs before the user's program is invoked; it does not wait until
       MPI_INIT is called.
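
       For example (my_app being an illustrative program name installed as /usr/local/bin/my_app
       on each node):

              % mpirun -D C /usr/local/bin/my_app

       sets the current working directory of each created process to /usr/local/bin before my_app
       is invoked.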

   Process Environment
       Processes  in  the  MPI application inherit their environment from the LAM daemon upon the
       node on which they are running.  The environment of a LAM daemon is fixed upon booting  of
       the  LAM  with lamboot(1) and is typically inherited from the user's shell.  On the origin
       node, this will be the shell from which lamboot(1) was invoked; on remote nodes, the exact
       environment is determined by the boot SSI module used by lamboot(1).  The rsh boot module,
       for example, uses rsh or ssh to launch the LAM daemon on remote nodes, and typically
       executes  one  or  more  of  the user's shell-setup files before launching the LAM daemon.
       When running dynamically linked applications which require the LD_LIBRARY_PATH environment
       variable to be set, care must be taken to ensure that it is correctly set when booting the
       LAM.
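
       For example (using a Bourne-type shell and an illustrative library path and boot schema
       file name), one way to make LD_LIBRARY_PATH available to the MPI processes is to set it in
       the shell -- and in the shell-setup files consulted on the remote nodes -- before booting
       the LAM:

              % LD_LIBRARY_PATH=/opt/mylibs/lib
              % export LD_LIBRARY_PATH
              % lamboot my_boot_schema
              % mpirun C my_app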

   Exported Environment Variables
       All environment variables that are named in the form LAM_MPI_*, LAM_IMPI_*, or IMPI_* will
       automatically  be exported to new processes on the local and remote nodes.  This exporting
       may be inhibited with the -nx option.

       Additionally, the -x option to mpirun can be used to export specific environment variables
       to  the  new  processes.   While  the syntax of the -x option allows the definition of new
       variables, note that the parser for this option is currently not very sophisticated  -  it
       does  not  even  understand  quoted  values.   Users  are  advised to set variables in the
       environment and use -x to export them, not to define them.
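
       For example, several existing environment variables can be exported at once (my_app is an
       illustrative program name):

              % mpirun -x DISPLAY,LD_LIBRARY_PATH C my_app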

   Trace Generation
       Two switches control trace generation from processes running under LAM and both must be in
       the  on  position  for traces to actually be generated.  The first switch is controlled by
       mpirun and the second switch is initially set by mpirun but can be toggled at runtime with
       MPIL_Trace_on(2) and MPIL_Trace_off(2).  The -t (-ton is equivalent) and -toff options all
       turn on the first switch.  Otherwise the first switch is off and calls to MPIL_Trace_on(2)
       in  the  application  program  are  ineffective.   The  -t option also turns on the second
       switch.  The  -toff  option  turns  off  the  second  switch.   See  MPIL_Trace_on(2)  and
       lamtrace(1) for more details.
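
       For example (my_app being an illustrative program name):

              % mpirun -toff C my_app

       starts my_app with the first switch on and the second switch off; traces for message
       passing traffic will only be generated after the processes collectively call
       MPIL_Trace_on(2).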

   MPI Data Conversion
       LAM's  MPI  library  converts MPI messages from local representation to LAM representation
       upon sending them and then back to local representation upon receiving them.  In the case
       of  a  LAM  consisting of a homogeneous network of machines where the local representation
       differs from the LAM representation this can result in unnecessary conversions.

       The -O switch used to be necessary to indicate to LAM whether the multicomputer was
       homogeneous  or  not.   LAM  now  automatically  determines  whether  a  given  MPI job is
       homogeneous or not.  The -O flag will silently be accepted for backwards compatibility,
       but it is ignored.

   SSI (System Services Interface)
       The  -ssi  switch  allows  the  passing  of  parameters to various SSI modules.  LAM's SSI
       modules are described in detail in lamssi(7).  SSI  modules  have  direct  impact  on  MPI
       programs  because  they  allow tunable parameters to be set at run time (such as which RPI
       communication device driver to use, what parameters to pass to that RPI, etc.).

       The -ssi switch takes two arguments: key and value.  The key argument generally  specifies
       which  SSI  module  will  receive the value.  For example, the key "rpi" is used to select
       which RPI to be used for transporting MPI messages.  The value argument is the value  that
       is passed.  For example:

       mpirun -ssi rpi lamd N foo
           Tells LAM to use the "lamd" RPI and to run a single copy of "foo" on every node.

       mpirun -ssi rpi tcp N foo
           Tells LAM to use the "tcp" RPI.

       mpirun -ssi rpi sysv N foo
           Tells LAM to use the "sysv" RPI.

       And so on.  LAM's RPI SSI modules are described in lamssi_rpi(7).

       The  -ssi  switch  can  be  used  multiple  times  to  specify  different key and/or value
       arguments.  If the same key is specified more than once, the values are concatenated  with
       a comma (",") separating them.

       Note  that  the  -ssi  switch is simply a shortcut for setting environment variables.  The
       same effect may be accomplished by  setting  corresponding  environment  variables  before
       running   mpirun.    The   form   of   the   environment  variables  that  LAM  sets  are:
       LAM_MPI_SSI_key=value.
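
       For example, the following two invocations are equivalent (using a Bourne-type shell; foo
       is an illustrative program name):

              % mpirun -ssi rpi tcp N foo

              % LAM_MPI_SSI_rpi=tcp
              % export LAM_MPI_SSI_rpi
              % mpirun N foo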

       Note that the -ssi switch overrides any previously set environment variables.   Also  note
       that unknown key arguments are still set as environment variables -- they are not checked
       (by mpirun) for correctness.  Illegal or incorrect value  arguments  may  or  may  not  be
       reported -- it depends on the specific SSI module.

       The  -ssi  switch  obsoletes  the  old -c2c and -lamd switches.  These switches used to be
       relevant because LAM could only have two RPIs available at a time: the lamd RPI and one
       of the C2C RPIs.  This is no longer true -- all RPIs are now available and choosable at
       run-time.  Selecting the lamd RPI is shown in the examples above.  The -c2c switch has  no
       direct  translation  since  "C2C"  used to refer to all other RPI's that were not the lamd
       RPI.  As such, -ssi rpi value must be used to select the specific desired RPI (whether  it
       is "lamd" or one of the other RPIs).

   Guaranteed Envelope Resources
       By  default, LAM will guarantee a minimum amount of message envelope buffering to each MPI
       process pair and will impede or report an error to a process  that  attempts  to  overflow
       this  system  resource.  This robustness and debugging feature is implemented in a machine
       specific manner when direct communication is used.  For normal LAM communication  via  the
       LAM  daemon,  a protocol is used.  The -nger option disables GER and the measures taken to
       support it.  The minimum GER is  configured  by  the  system  administrator  when  LAM  is
       installed.  See MPI(7) for more details.

EXAMPLES

       Be sure to also see the examples in the "Location Nomenclature" section, above.

       mpirun N prog1
           Load  and execute prog1 on all nodes.  Search the user's $PATH for the executable file
           on each node.

       mpirun -c 8 prog1
           Run 8 copies of prog1 wherever LAM wants to run them.

       mpirun n8-10 -v -nw -s n3 prog1 -q
           Load and execute prog1 on nodes 8, 9, and 10.  Search for prog1 on node 3 and transfer
           it  to  the  three  target  nodes.  Report as each process is created.  Give "-q" as a
           command line to each new process.  Do not wait for the processes  to  complete  before
           exiting mpirun.

       mpirun -v myapp
           Parse  the application schema, myapp, and start all processes specified in it.  Report
           as each process is created.

       mpirun -npty -wd /work/output -x DISPLAY C my_application

           Start one copy of "my_application" on each available CPU.   The  number  of  available
           CPUs  on  each  node was previously specified when LAM was booted with lamboot(1).  As
           noted above, mpirun will schedule adjoining ranks in MPI_COMM_WORLD on the same node
           where possible.  For example, if n0 has a CPU count of 8, and n1 has a CPU count of 4,
           mpirun will place MPI_COMM_WORLD ranks 0 through 7 on n0, and  8  through  11  on  n1.
           This tends to maximize on-node communication for many parallel applications; when used
           in conjunction with the multi-protocol network/shared memory  RPIs  in  LAM  (see  the
           RELEASE_NOTES  and  INSTALL  files  with  the LAM distribution), overall communication
           performance can be quite good.  Also disable pseudo-tty support, change  directory  to
           /work/output,   and  export  the  DISPLAY  variable  to  the  new  processes  (perhaps
           my_application will invoke an X application such as xv to display output).

DIAGNOSTICS

       mpirun: Exec format error
           This usually means that either a number of processes or an  appropriate  where  clause
           was  not  specified, indicating that LAM does not know how many processes to run.  See
           the EXAMPLES and "Location Nomenclature" sections,  above,  for  examples  on  how  to
           specify  how  many  processes  to run, and/or where to run them.  However, it can also
           mean that a non-ASCII character was detected  in  the  application  schema.   This  is
           usually a command line usage error where mpirun is expecting an application schema and
           an executable file was given.

       mpirun: syntax error in application schema, line XXX
           The application schema cannot be parsed because of a usage  or  syntax  error  on  the
           given line in the file.

       filename: No such file or directory
           This  error can occur in two cases.  Either the named file cannot be located or it has
           been found but the user does not have sufficient permissions to execute the program or
           read the application schema.

RETURN VALUE

       mpirun  returns  0 if all ranks started by mpirun exit after calling MPI_FINALIZE.  A non-
       zero value is returned if an internal error occurred in  mpirun,  or  one  or  more  ranks
       exited  before  calling  MPI_FINALIZE.   If  an  internal  error  occurred  in mpirun, the
       corresponding error code is returned.  In the event that one or  more  ranks  exit  before
       calling  MPI_FINALIZE,  the  return  value  of  the  rank of the process that mpirun first
       notices died before calling MPI_FINALIZE will be returned.  Note that,  in  general,  this
       will be the first rank that died but is not guaranteed to be so.

       However,  note  that  if  the  -nw  switch  is used, the return value from mpirun does not
       indicate the exit status of the ranks.
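
       For example, a shell script can test the result of a run (my_app is an illustrative
       program name):

              % mpirun C my_app
              % echo $?

       A value of 0 indicates that every rank called MPI_FINALIZE before exiting.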

SEE ALSO

       bhost(5), lamexec(1), lamssi(7), lamssi_rpi(7), lamtrace(1), loadgo(1),  MPIL_Trace_on(2),
       mpimsg(1), mpitask(1)