Provided by: coop-computing-tools_7.0.22-1ubuntu1_amd64

NAME

       resource_monitor - monitors the cpu, memory, io, and disk usage of a tree of processes.

SYNOPSIS

       resource_monitor [options] -- command [command-options]

DESCRIPTION

       resource_monitor  is  a  tool  to  monitor the computational resources used by the process created by the
       command given as an argument, and all its descendants.  The monitor works indirectly, by observing the
       processes while they run, so all the information reported should be considered an estimate (in contrast
       with direct methods, such as ptrace). It works on Linux, and can be used automatically by makeflow and
       Work Queue applications.

       Additionally,  the  user  can specify maximum resource limits in the form of a file, or a string given at
       the command line. If one of the resources goes over the limit specified, then the monitor terminates  the
       task, and reports which resource went over the respective limits.

       In systems that support it, resource_monitor wraps some libc functions to obtain a better estimate of the
       resources used.

       Currently, the monitor does not support interactive applications. That is, if a  process  issues  a  read
       call  from  standard  input,  and  standard  input  has  not  been  redirected,  then the process tree is
       terminated. This is likely to change in future versions of the tool.

       resource_monitor generates up to three log files: a summary file encoded as JSON with the maximum  values
       of  the resources used, a time-series that shows the resources used at given time intervals, and a list
       of files that were opened during execution.

       The summary file is a JSON document with the following fields. Unless indicated, all fields are an  array
       with  two  values,  a  number that describes the measurement, and a string describing the units (e.g.,
       [ measurement, "units" ]).

               command:                  the command line given as an argument
               start:                    time at start of execution, since the epoch
               end:                      time at end of execution, since the epoch
               exit_type:                one of "normal", "signal" or "limit" (a string)
               signal:                   number of the signal that terminated the process
                                         Only present if exit_type is signal
               cores:                    maximum number of cores used
               cores_avg:                number of cores as cpu_time/wall_time
               exit_status:              final status of the parent process
               max_concurrent_processes: the maximum number of processes running concurrently
               total_processes:          count of all of the processes created
               wall_time:                duration of execution, end - start
               cpu_time:                 user+system time of the execution
               virtual_memory:           maximum virtual memory across all processes
               memory:                   maximum resident size across all processes
               swap_memory:              maximum swap usage across all processes
               bytes_read:               amount of data read from disk
               bytes_written:            amount of data written to disk
               bytes_received:           amount of data read from network interfaces
               bytes_sent:               amount of data written to network interfaces
               bandwidth:                maximum bandwidth used
               total_files:              total maximum number of files and directories of
                                         all the working directories in the tree
               disk:                     size of all working directories in the tree
               limits_exceeded:          resources over the limit with -l, -L options (JSON object)
               peak_times:               seconds from start when a maximum occurred (JSON object)
               snapshots:                List of intermediate measurements, identified by
                                         snapshot_name (JSON object)
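
       The layout above can be read with any JSON parser. The following Python sketch assumes a summary whose
       measurement fields follow the [ value, "units" ] convention described above. The sample document and
       its numbers are invented for illustration and are not the output of a real run; the helper names
       measurement and average_cores are hypothetical.

```python
import json

# Hypothetical summary content, made up for illustration only.
SAMPLE_SUMMARY = """
{
    "command": "sleep 10",
    "exit_type": "normal",
    "wall_time": [10.0, "s"],
    "cpu_time": [0.5, "s"],
    "memory": [12, "MB"]
}
"""


def measurement(summary, field):
    """Return (value, units) for a field stored as a two-element array."""
    value, units = summary[field]
    return value, units


def average_cores(summary):
    """Recompute cores_avg as cpu_time / wall_time (same units required)."""
    cpu, cpu_units = measurement(summary, "cpu_time")
    wall, wall_units = measurement(summary, "wall_time")
    if cpu_units != wall_units:
        raise ValueError("cpu_time and wall_time use different units")
    return cpu / wall if wall > 0 else 0.0


summary = json.loads(SAMPLE_SUMMARY)
print(measurement(summary, "memory"))
print(average_cores(summary))
```

       Keeping the units next to each value lets a caller check them before combining fields, as
       average_cores does here.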

       The time-series log has a row per time sample. For each row, the columns have the following meaning  (all
       columns are integers):

               wall_clock                the sample time, since the epoch, in microseconds
               cpu_time                  accumulated user + kernel time, in microseconds
               cores                     current number of cores used
               max_concurrent_processes  concurrent processes at the time of the sample
               virtual_memory            current virtual memory size, in MB
               memory                    current resident memory size, in MB
               swap_memory               current swap usage, in MB
               bytes_read                accumulated number of bytes read, in bytes
               bytes_written             accumulated number of bytes written, in bytes
               bytes_received            accumulated number of bytes received, in bytes
               bytes_sent                accumulated number of bytes sent, in bytes
               bandwidth                 current bandwidth, in bps
               total_files               current number of files and directories, across all
                                         working directories in the tree
               disk                      current size of working directories in the tree, in MB
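
       As a rough illustration of the column order above, the following Python sketch turns time-series rows
       into dictionaries. It assumes that columns are separated by whitespace and that lines starting with
       '#' are headers or comments; neither detail is stated in this page, so adjust the parser to the actual
       file. The sample rows are invented.

```python
# Column names in the order listed in this page.
COLUMNS = [
    "wall_clock", "cpu_time", "cores", "max_concurrent_processes",
    "virtual_memory", "memory", "swap_memory", "bytes_read",
    "bytes_written", "bytes_received", "bytes_sent", "bandwidth",
    "total_files", "disk",
]


def parse_series(lines):
    """Parse rows of integers into one dict per time sample."""
    rows = []
    for line in lines:
        line = line.strip()
        # Assumption: blank lines and '#' lines carry no samples.
        if not line or line.startswith("#"):
            continue
        values = [int(field) for field in line.split()]
        if len(values) != len(COLUMNS):
            raise ValueError("unexpected number of columns: %d" % len(values))
        rows.append(dict(zip(COLUMNS, values)))
    return rows


def peak(rows, column):
    """Largest value seen for a column across all samples."""
    return max(row[column] for row in rows)


# Invented sample: two observations taken one second apart.
SAMPLE = [
    "# hypothetical header line",
    "1500000000000000 0 1 1 10 2 0 0 0 0 0 0 3 1",
    "1500000001000000 900000 1 2 14 5 0 4096 0 0 0 0 3 1",
]

rows = parse_series(SAMPLE)
elapsed_seconds = (rows[-1]["wall_clock"] - rows[0]["wall_clock"]) / 1e6
print(peak(rows, "memory"), elapsed_seconds)
```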

OPTIONS

       -d, --debug=<subsystem>
              Enable debugging for this subsystem.

       -o, --debug-file=<file>
              Write  debugging output to this file. By default, debugging is sent to stderr (":stderr"). You may
              specify logs be sent to stdout (":stdout"), to the system syslog (":syslog"), or  to  the  systemd
              journal (":journal").

       -v, --version
              Show version string.

       -h, --help
              Show help text.

       -i, --interval=<n>
              Maximum interval between observations, in seconds (default=1).

        --pid=pid
              Track pid instead of executing a command line (warning: less precise measurements).

        --accurate-short-processes
              Accurately measure short running processes (adds overhead).

       -c, --sh=<str>
              Read command line from str, and execute as '/bin/sh -c str'.

       -l, --limits-file=<file>
              Use file, containing a list of var: value pairs, for resource limits.

       -L, --limits=<string>
              String of the form "var: value, var: value" to specify resource limits. (May be specified
              multiple times.)

        -f, --child-in-foreground
              Keep the monitored process in foreground (for interactive use).

       -O, --with-output-files=<template>
              Specify template for log files (default=resource-pid).

        --with-time-series
              Write resource time series to template.series.

        --with-inotify
              Write inotify statistics of opened files to template.files.

       -V, --verbatim-to-summary=<str>
              Include this string verbatim in a line in the summary. (May be specified multiple times.)

        --measure-dir=dir
              Follow the size of dir. By default the directory at the start of execution  is  followed.  Can  be
              specified multiple times. See --without-disk-footprint below.

        --follow-chdir
              Follow processes' current working directories.

        --without-disk-footprint
              Do not measure working directory footprint. Overrides --measure-dir.

        --no-pprint
              Do not pretty-print summaries.

        --snapshot-events=file
              Configuration file for snapshots on file patterns. See below.

       The limits file should contain lines of the form:

               resource: max_value

       It may contain any of the following fields, in the same units as defined for the summary file:

       max_concurrent_processes,  wall_time, cpu_time, virtual_memory, resident_memory, swap_memory, bytes_read,
       bytes_written, workdir_number_files_dirs, workdir_footprint
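
       The two ways of passing limits share the same var: value syntax. The Python sketch below formats a
       dictionary either as the string taken by -L or as the lines of a limits file for -l. The function
       names are hypothetical, and the set of accepted names is copied from the list above.

```python
# Field names accepted in limits, as listed in this page.
KNOWN_LIMITS = {
    "max_concurrent_processes", "wall_time", "cpu_time", "virtual_memory",
    "resident_memory", "swap_memory", "bytes_read", "bytes_written",
    "workdir_number_files_dirs", "workdir_footprint",
}


def check_limits(limits):
    """Reject names that are not in the documented list."""
    unknown = sorted(set(limits) - KNOWN_LIMITS)
    if unknown:
        raise ValueError("unknown limit(s): " + ", ".join(unknown))


def limits_string(limits):
    """Format limits as "var: value, var: value" for the -L option."""
    check_limits(limits)
    return ", ".join(f"{name}: {value}" for name, value in limits.items())


def limits_file_text(limits):
    """Format limits as one "resource: max_value" line each, for -l."""
    check_limits(limits)
    return "".join(f"{name}: {value}\n" for name, value in limits.items())


limits = {"wall_time": 5, "max_concurrent_processes": 4}
print(limits_string(limits))
print(limits_file_text(limits), end="")
```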

ENVIRONMENT VARIABLES

       CCTOOLS_RESOURCE_MONITOR_HELPER
              Location of the desired helper library to wrap libc calls. If not provided

EXIT STATUS

       •   0 The command exit status was 0, and the monitor process ran without errors.

       •   1 The command exit status was non-zero, and the monitor process ran without errors.

       •   2 The command was terminated because it ran out of resources (see options -l, -L).

       •   3 The command did not run successfully because the monitor process had an error.

       To obtain the exit status of the original command, see the generated file with extension .summary.
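
       A wrapper script can branch on these four values. The Python sketch below only encodes the table
       above; the names EXIT_MEANING, describe_exit and should_retry are hypothetical, and the retry policy
       is one possible choice, not something the monitor prescribes.

```python
# Meaning of each monitor exit status, as listed in this page.
EXIT_MEANING = {
    0: "command exited with status 0; monitor ran without errors",
    1: "command exited with a non-zero status; monitor ran without errors",
    2: "command was terminated because it exceeded a resource limit",
    3: "command did not run successfully because the monitor had an error",
}


def describe_exit(code):
    """Translate the monitor's exit status into a short message."""
    return EXIT_MEANING.get(code, "unknown monitor exit status")


def should_retry(code):
    """Example policy (an assumption): retry only on monitor errors."""
    return code == 3


for code in (0, 2, 3):
    print(code, describe_exit(code), should_retry(code))
```

       The code returned by the monitor does not carry the command's own exit status, so a wrapper that needs
       it still has to read the .summary file.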

SNAPSHOTS

       The  resource_monitor   can  be  directed  to take snapshots of the resources used according to the files
       created by the processes monitored. The typical use of monitoring snapshots is to set a watch  on  a  log
       file,  and  generate  a  snapshot  when  a  line  in  the log matches a pattern. To activate the snapshot
       facility, use the command line argument --snapshot-events=file, in which file is a JSON-encoded  document
       with the following format:

                   {
                       "FILENAME": {
                           "from-start":boolean,
                           "from-start-if-truncated":boolean,
                           "delete-if-found":boolean,
                           "events": [
                               {
                                   "label":"EVENT_NAME",
                                   "on-create":boolean,
                                   "on-truncate":boolean,
                                   "on-pattern":"REGEXP",
                                   "count":integer
                               },
                               {
                                   "label":"EVENT_NAME",
                                   ...
                               }
                           ]
                       },
                       "FILENAME": {
                           ...
                       }
                   }

       All fields but label are optional.

       •   FILENAME                  Name of a file to watch.

       •   from-start                If FILENAME exists when the monitor starts running, process from
                                     line 1. Default: false, as monitored processes may be appending to
                                     already existing files.

       •   from-start-if-truncated   If FILENAME is truncated, process from line 1. Default: true, to
                                     account for log rotations.

       •   delete-if-found           Delete FILENAME when found. Default: false

       •   events                    List of event descriptions, each with the following fields:

           •   label         Name that identifies the snapshot. Only alphanumeric, -, and _ characters
                             are allowed.

           •   on-create     Take a snapshot every time the file is created. Default: false

           •   on-delete     Take a snapshot every time the file is deleted. Default: false

           •   on-truncate   Take a snapshot when the file is truncated. Default: false

           •   on-pattern    Take a snapshot when a line matches the regexp pattern. Default: none

           •   count         Maximum number of snapshots for this label. Default: -1 (no limit)

       The snapshots are recorded both in the main resource summary file under the key snapshots, and as
       separate JSON-encoded documents. Each snapshot is identified with the key "snapshot_name", which is a
       comma separated string of label(count) elements. A label corresponds to a name that identifies the
       snapshot, and the count is the number of times an event was triggered since the last check (several
       events may be triggered, for example, when several matching lines are written to the log). Several
       events may have the same label, and exactly one of on-create, on-truncate, and on-pattern should be
       specified per event.
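
       The rules above (allowed label characters, exactly one trigger per event, and the label(count) form of
       snapshot_name) can be checked before a configuration is handed to the monitor. The Python sketch
       below is one way to do so; make_event and parse_snapshot_name are hypothetical helpers, and the
       snapshot_name value used at the end is invented.

```python
import json
import re

LABEL_RE = re.compile(r"^[A-Za-z0-9_-]+$")
TRIGGERS = ("on-create", "on-truncate", "on-pattern")


def make_event(label, **fields):
    """Build one event; keyword names use '_' where the file uses '-'."""
    if not LABEL_RE.match(label):
        raise ValueError("label may only contain alphanumerics, '-' and '_'")
    event = {"label": label}
    for key, value in fields.items():
        event[key.replace("_", "-")] = value
    # Enforce the rule stated above: exactly one trigger per event.
    chosen = [name for name in TRIGGERS if name in event]
    if len(chosen) != 1:
        raise ValueError("specify exactly one of: " + ", ".join(TRIGGERS))
    return event


def parse_snapshot_name(name):
    """Split 'a(1),b(2)' into [('a', 1), ('b', 2)]."""
    pairs = re.finditer(r"([A-Za-z0-9_-]+)\((\d+)\)", name)
    return [(m.group(1), int(m.group(2))) for m in pairs]


config = {
    "my.log": {
        "events": [
            make_event("started", on_pattern="^# START"),
            make_event("created", on_create=True, count=1),
        ]
    }
}
print(json.dumps(config, indent=4))
print(parse_snapshot_name("started(1),end-of-start(2)"))
```

       Writing the configuration with json.dumps also avoids hand-editing mistakes such as a missing quote
       or comma.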

EXAMPLES

       To monitor 'sleep 10', at 2 second intervals, with output to sleep-log.summary, and with a monitor  alarm
       at 5 seconds:

               % resource_monitor --interval=2 -L"wall_time: 5" -O sleep-log -- sleep 10
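
       The same invocation can be assembled from a script. The Python sketch below builds the argument list
       using only options documented in this page (--interval, -L, -O, and the -- separator) and reads the
       resulting template.summary file. The function names are hypothetical, and read_summary assumes the
       summary parses as a single JSON document.

```python
import json


def monitor_command(command, template, interval=1, limits=None):
    """Build the argv list for running command under resource_monitor."""
    argv = ["resource_monitor", f"--interval={interval}", "-O", template]
    if limits:
        limit_string = ", ".join(f"{k}: {v}" for k, v in limits.items())
        argv += ["-L", limit_string]
    # Everything after "--" is the monitored command and its options.
    return argv + ["--"] + list(command)


def read_summary(template):
    """Load template.summary (assumed to be plain JSON)."""
    with open(template + ".summary") as handle:
        return json.load(handle)


argv = monitor_command(["sleep", "10"], "sleep-log",
                       interval=2, limits={"wall_time": 5})
print(argv)
```

       The list can be passed to subprocess.run(argv), whose returncode attribute then holds the monitor's
       exit status.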

       Execute 'date' and redirect its output to a file:

               % resource_monitor --sh 'date > date.output'

       It can also be run automatically from makeflow, by specifying the '-M' flag:

               % makeflow --monitor=some-log-dir Makeflow

       In  this case, makeflow wraps every command line rule with the monitor, and writes the resulting logs per
       rule in the some-log-dir directory.

       Additionally, it can be run automatically from Work Queue:

               q = work_queue_create_monitoring(port);
               work_queue_enable_monitoring(q, "some-log-dir", /*kill tasks on exhaustion*/ 1);

       This wraps every task with the monitor and writes the resulting summaries in some-log-dir.

SNAPSHOTS EXAMPLES

       Generate a snapshot when "my.log" is created:

               {
                   "my.log":
                       {
                           "events":[
                               {
                                   "label":"MY_LOG_STARTED",
                                   "on-create":true
                               }
                           ]
                       }
               }

       Generate snapshots every time a line is added to "my.log":

               {
                   "my.log":
                       {
                           "events":[
                               {
                                   "label":"MY_LOG_LINE",
                                   "on-pattern":"^.*$"
                               }
                           ]
                       }
               }

       Generate snapshots on particular lines of "my.log":

               {
                   "my.log":
                       {
                           "events":[
                               {
                                   "label":"started",
                                   "on-pattern":"^# START"
                               },
                               {
                                   "label":"end-of-start",
                                   "on-pattern":"^# PROCESSING"
                               },
                               {
                                   "label":"end-of-processing",
                                   "on-pattern":"^# ANALYSIS"
                               }
                           ]
                       }
               }

       The monitor can also generate a snapshot when a particular file is created. The monitor can detect this
       file,  generate  a  snapshot,  and  delete  the file to get ready for the next snapshot. In the following
       example the monitor takes a snapshot every time the file please-take-a-snapshot is created:

               {
                   "please-take-a-snapshot":
                       {
                           "delete-if-found":true,
                           "events":[
                               {
                                   "label":"manual-snapshot",
                                   "on-create":true
                               }
                           ]
                       }
               }
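
       With that configuration in place, a monitored program (or a helper running beside it) requests a
       snapshot simply by creating the file. A minimal Python sketch follows; request_snapshot is a
       hypothetical helper, and the temporary directory only stands in for the directory where the monitor
       watches for the file.

```python
import tempfile
from pathlib import Path


def request_snapshot(workdir, name="please-take-a-snapshot"):
    """Create the trigger file; the monitor is expected to delete it."""
    trigger = Path(workdir) / name
    trigger.touch()
    return trigger


# Demonstration in a throwaway directory, without a monitor running.
with tempfile.TemporaryDirectory() as workdir:
    trigger = request_snapshot(workdir)
    created = trigger.exists()
    print(trigger.name, created)
```

       If the file still exists because the monitor has not yet removed it, touching it again does not
       create it anew, so requests issued in quick succession may yield a single snapshot.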

BUGS AND KNOWN ISSUES

       •   The monitor cannot track the children of statically linked executables.

       •   The option --snapshot-events assumes that the watched files are written by appending  to  them.  File
           truncation may not be detected if, by the next check, the file has grown back to at least the size
           it had at the previous check. File checks are fixed at intervals of 1 second.

COPYRIGHT

       The Cooperative Computing Tools are Copyright (C) 2003-2004 Douglas Thain and Copyright (C) 2005-2015 The
       University  of  Notre  Dame.  This software is distributed under the GNU General Public License.  See the
       file COPYING for details.