xenial (1) dmtcp_discover_rm.1.gz

Provided by: dmtcp_2.3.1-6_amd64 bug

NAME

       dmtcp - Distributed MultiThreaded Checkpointing

SYNOPSIS

       dmtcp_coordinator [port]

       dmtcp_launch command [args...]

       dmtcp_restart ckpt_FILE1.dmtcp [ckpt_FILE2.dmtcp...]

       dmtcp_command coordinatorCommand

DESCRIPTION

       DMTCP  is a tool to transparently checkpointing the state of an arbitrary group of programs spread across
       many machines and connected by sockets. It does not modify the user's program nor the  operating  system.
       MTCP is a standalone component of DMTCP available as a checkpointing library for a single process.

OPTIONS

       For  each  command,  the --help or -h flag will show the command-line options.  Most command line options
       can also be controlled through environment variables.  These can be set in bash with "export  NAME=value"
       or in tcsh with "setenv NAME value".

       DMTCP_CHECKPOINT_INTERVAL=integer
              Time  in  seconds  between  automatic  checkpoints.  Checkpoints can also be initiated manually by
              typing 'c' into the coordinator. (default: 0, disabled; dmtcp_coordinator only)

       DMTCP_HOST=string
              Hostname where  the  cluster-wide  coordinator  is  running.  (default:  localhost;  dmtcp_launch,
              dmtcp_restart only)

       DMTCP_PORT=integer
              The port the cluster-wide coordinator listens on. (default: 7779)

       DMTCP_GZIP=(1|0)
              Set  to  "0"  to  disable  compression  of  checkpoint  images.  (default: 1, compression enabled;
              dmtcp_launch only) WARNING:  gzip adds seconds.  Without gzip, ckpt/restart is often less than 1 s

       DMTCP_CHECKPOINT_DIR=path
              Directory to store checkpoint images in. (default: ./)

       DMTCP_SIGCKPT=integer
              Internal signal number to use for checkpointing.  Must not be used by the user program.  (default:
              SIGUSR2; dmtcp_launch only)

DMTCP_COORDINATOR

       Each computation to be checkpointed must include a DMTCP coordinator process.  One can explicitly start a
       coordinator through dmtcp_coordinator, or allow one to be started  implicitly  in  background  by  either
       dmtcp_launch  or  dmtcp_restart to operate.  The address of the unique coordinator should be specified by
       dmtcp_launch, dmtcp_restart, and dmtcp_command either through the --host and --port command-line flags or
       through the the DMTCP_HOST and DMTCP_PORT environment variables.  If neither is given, the host-port pair
       defaults to localhost-7779.  The host-port pair associated with a particular coordinator is given by  the
       command-line flags used in the dmtcp_coordinator command, or the environment variables then in effect, or
       the default of localhost-7779.

       The coordinator is stateless and is not checkpointed.  On restart, one can  use  an  existing  or  a  new
       coordinator.   Multiple  computations  under  DMTCP control can coexist by providing a unique coordinator
       (with a unique host-port pair) for each such computation.

       The coordinator initiates a checkpoint for all processes in its computation group.  Checkpoints  can  be:
       performed  automatically  on  an interval (see DMTCP_CHECKPOINT_INTERVAL above); or initiated manually on
       the standard input of the coordinator (see next paragraph); or initiated directly under  program  control
       by the comptuation through the dmtcpaware API (see below).

       The coordinator accepts the following commands on its standard input.  Each command should be followed by
       the <return> key.  The commands are:
         l : List connected nodes
         s : Print status message
         c : Checkpoint all nodes
         f : Force a restart even if there are missing nodes (debugging)
         k : Kill all nodes
         q : Kill all nodes and quit
         ? : Show this message

       Coordinator commands can also be issued remotely using dmtcp_command.

EXAMPLE USAGE

       1. In a separate terminal window, start the dmtcp_coodinator.
              (See previous section.)

               dmtcp_coordinator

       2. In separate terminal(s), replace each command(s) with "dmtcp_launch
              [command]".  The checkpointed program will connect to the coordinator specified by DMTCP_HOST  and
              DMTCP_PORT.   New  threads  will  be  checkpointed  as  part of the process.  Child processes will
              automatically be checkpointed.  Remote processes started via ssh will automatically  checkpointed.
              (Internally, DMTCP modifies the ssh command line to call dmtcp_launch on the remote host.)

               dmtcp_launch ./myprogram

       3. To manually initiate a checkpoint, either run the command below
              or  type "c" followed by <return> into the coordinator.  Checkpoint files for each process will be
              written to DMTCP_CHECKPOINT_DIR. The dmtcp_coordinator will write "dmtcp_restart_script.sh" to its
              working  directory.   This  script  contains  the  necessary calls to dmtcp_restart to restart the
              entire computation, including remote processes created via ssh.

                   dmtcp_command -c
              OR:  dmtcp_command --checkpoint

       4. To restart, one should execute dmtcp_restart_script.sh, which is
              created by the dmtcp_coordinator in its working directory at  the  time  of  checkpoint.  One  can
              optionally  edit  this  script  to  migrate  processes  to  different hosts.  By default, only one
              restarted process will be restarted in the foreground and receive the standard input.  The  script
              may be edited to choose which process will be restarted in the foreground.

               ./dmtcp_restart_script.sh

DMTCPAWARE API

       DMTCP provides a programming interface to allow checkpointed applications to interact with dmtcp.  In the
       source distribution, see dmtcpaware/dmtcpaware.h for the functions available.  See test/dmtcpaware[123].c
       for three example applications.  For an example of its usage, try:

        cd test; rm dmtcpaware1; make dmtcpaware1; ./autotest -v dmtcpaware1

       The  user  application  should  link  with  libdmtcpaware.so  (-ldmtcpaware)  and  use  the  header  file
       dmtcp/dmtcpaware.h.

DMTCP PLUGIN MODULES

       The source distribution includes a top-level plugin directory, with examples of how  to  write  a  plugin
       module  for DMTCP.  Further examples are in the test/plugin directory.  The plugin feature adds three new
       user-programmable capabilities.  A plugin may: add wrappers around system calls; take special actions  at
       during  certain  events  (e.g. pre-checkpoint, resume/post-checkpoint, restart); and may insert key-value
       pairs into a database at restart time that is then available to be queried by the restarted processes  of
       a  computation.  (The events available to the plugin feature form a superset of the events available with
       the dmtcpaware interface.)  One or more plugins are  invoked  via  a  list  of  colon-separated  absolute
       pathnames.

         dmtcp_launch --with-plugin PLUGIN1[:PLUGIN2]...

RETURN CODE

       A  target program under DMTCP control normally returns the same return code as if executed without DMTCP.
       However, if DMTCP fails (as opposed to the target program failing), DMTCP returns a DMTCP-specific return
       code, rc (or rc+1, rc+2 for two special cases), where rc is the integer value of the environment variable
       DMTCP_FAIL_RC if set, or else the default value, 99.

SEE ALSO

       Full documentation is available from http://dmtcp.sourceforge.net/

AUTHORS

       DMTCP and its standalone single-process compontent MTCP (MultiThreaded CheckPointing)  were  created  and
       are  maintained  by  Jason  Ansel,  Kapil Arya, Gene Cooperman, Artem Y. Polyakov, Mike Rieker, Ana-Maria
       Visan, and a series of newer contributors including Alex Brick,  Tyler  Denniston,  Rohan  Garg,  Gregory
       Kerr, and others.