Provided by: mon_1.4.0-1_amd64

NAME

       mon - monitor services for availability, sending alarms upon failures.

SYNOPSIS

       mon [-dfhlMSv] [-a dir] [-A authfile] [-b dir] [-B dir] [-c config] [-D dir] [-i secs] [-k
       num] [-l [statetype]] [-L dir] [-m num] [-p num] [-P pidfile] [-r delay] [-s dir]

DESCRIPTION

       mon is a general-purpose scheduler for  monitoring  service  availability  and  triggering
       alerts upon detecting failures.  mon was designed to be open in the sense that it supports
       arbitrary monitoring facilities and alert methods via a common interface, which are easily
       implemented  through  programs (in C, Perl, shell, etc.), SNMP traps, and special Mon (UDP
       packet) traps.

OPTIONS

       -a dir Path to alert scripts.  Default  is  /usr/local/lib/mon/alert.d:alert.d.   Multiple
              alert  paths  may be specified by separating them with a colon.  Non-absolute paths
              are taken to be relative to the base directory (/usr/lib/mon by default).

       -b dir Base directory for mon. scriptdir, alertdir, and statedir are all relative to  this
              directory unless specified from /.  Default is /usr/lib/mon.

       -B dir Configuration  file  base  directory.  All config files are located here, including
              mon.cf, monusers.cf, and auth.cf.

       -A authfile
              Authentication configuration file. By  default  this  is  /etc/mon/auth.cf  if  the
              /etc/mon directory exists, or /usr/lib/mon/auth.cf otherwise.

       -c file
              Read configuration from file.  This defaults to /etc/mon/mon.cf if the /etc/mon
              directory exists, otherwise to /etc/mon.cf.

       -d     Enable debugging mode.

       -D dir Path to state directory.  Default is the first of /var/state/mon, /var/lib/mon, and
              /usr/lib/mon/state.d which exists.

       -f     Fork and run as a daemon process. This is the preferred way to run mon.

       -h     Print help information.

       -i secs
              Sleep  interval,  in seconds. Defaults to 1. This shouldn't need to be adjusted for
              any reason.

       -k num Set log history to a maximum of num entries. Defaults to 100.

       -l statetype
              Load state from the last saved state file. The  supported  saved  state  types  are
              disabled  for disabled watches, services, and hosts, opstatus for failure/alert/ack
              status of all services, and all for both.  If no statetype is provided, disabled is
              assumed.

       -L dir Sets  the  log  dir.  See  also  logdir  in the configuration file.  The default is
              /var/log/mon if that directory exists, otherwise log.d in the base directory.

       -M     Pre-process the configuration file with the macro expansion package m4.

       -m num Set the throttle for the maximum number of processes to num.

       -p num Make server listen on port num.  This defaults to 2583.

       -S     Start with the scheduler stopped.

       -P pidfile
              Store the server's pid in pidfile, the default is the  first  of  /run/mon/mon.pid,
              /run/mon.pid,  and  /etc/mon.pid  whose directory exists.  An empty value tells mon
              not to use a pid file.

       -r delay
              Sets the number of seconds used to randomize the startup delay before each  service
              is scheduled. Refer to the global randstart variable in the configuration file.

       -s dir Path to monitor scripts. Default is /usr/local/lib/mon/mon.d:mon.d.  Multiple
              monitor paths may be specified by separating them with a colon.  Non-absolute
              paths are taken to be relative to the base directory (/usr/lib/mon by default).

       -v     Print version information.

DEFINITIONS

       monitor
              A  program  which  tests for a certain condition, returns either true or false, and
              optionally produces output to be passed back to  the  scheduler.   Common  monitors
              detect host reachability via ICMP echo messages, or connection to TCP services.

       period A period in time as interpreted by the Time::Period module.

       alert  A program which sends a message when invoked by the scheduler.  The scheduler calls
              upon an alert when it detects a failure from a monitor.  An alert program accepts a
              set  of command-line arguments from the scheduler, in addition to data via standard
              input.

       hostgroup
              A single host or list of hosts, specified as names or IP addresses.

       service
              A collection of parameters used to deal with monitoring a particular resource which
              is  provided  by a group. Services are usually modeled after things such as an SMTP
              server, ICMP echo capability, server disk space availability, or SNMP events.

       view   A collection of hostgroups, used to filter mon output for client display.  For
              example, a 'network-services' view might be defined so your network staff can see
              just the hostgroups which matter to them, without having to see all hostgroups
              defined in Mon.

       watch  A collection of services which apply to a particular group.

OPERATION

       When the mon scheduler starts, it reads a configuration file to determine the services it
       needs to monitor. The configuration file defaults to /etc/mon/mon.cf if the /etc/mon
       directory exists, otherwise to /etc/mon.cf, and can be specified using the -c parameter.
       If the -M option is specified, then the configuration file is pre-processed with m4.  If
       the configuration file name ends with .m4, the file is also processed by m4 automatically.

       The  scheduler  enters  a  loop which handles client connections, monitor invocations, and
       failure alerts. Each service has a timer, specified  in  the  configuration  file  as  the
       interval  variable,  which tells the scheduler how frequently to invoke a monitor process.
       The scheduler may be temporarily stopped. While it is stopped, client access still
       functions, but no monitors are scheduled. This is useful when resetting the server,
       because you can save the hosts and services which are disabled, reset the server with
       the scheduler stopped, re-disable those hosts and services, then start the scheduler.
       It also allows making atomic changes across several client connections.  See the moncmd
       man page for more information.

MONITOR PROGRAMS

       Monitor processes are invoked with the arguments specified in the configuration file,
       followed by the hosts from the applicable host group. For example, if the watch group is
       "servers", which contains the hostnames "smtp", "nntp", and "ns", and the monitor line
       reads as follows,
        monitor fping.monitor -t 4000 -r 2
       then the executable "fping.monitor" will be executed with these parameters:
        MONITOR_DIR/fping.monitor -t 4000 -r 2 smtp nntp ns

       MONITOR_DIR  is  actually  a  search  path,  by  default   /usr/local/lib/mon/mon.d   then
       /usr/lib/mon/mon.d,  but  it  can  be  overridden by the -s option or in the configuration
       file.  If all hosts in the hostgroup have been disabled, then a warning is sent to  syslog
       and  the  monitor is not run. This behavior may be overridden with the "allow_empty_group"
       option in the service definition.  If the final argument to the "monitor" line is ";;" (it
       must  be preceded by whitespace), then the host list will not be appended to the parameter
       list.
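
       The argument handling described above can be sketched as follows. This is an
       illustration only, not mon's actual code: the function name build_monitor_argv and the
       disabled parameter are hypothetical, splitting on whitespace is an assumption about how
       the monitor line is tokenized, and the sketch assumes that only non-disabled hosts are
       passed. The allow_empty_group option is not modeled.

```python
def build_monitor_argv(monitor_line, hosts, disabled=()):
    """Return the argument list for one monitor run, or None to skip the run.

    monitor_line is the text after the "monitor" keyword, for example
    "fping.monitor -t 4000 -r 2". Whitespace tokenization is an assumption.
    """
    args = monitor_line.split()

    # A final ";;" (preceded by whitespace) means: do not append the host list.
    if args and args[-1] == ";;":
        return args[:-1]

    # Assumption: disabled hosts are left out of the appended host list.
    active = [h for h in hosts if h not in disabled]
    if not active:
        # Every host is disabled: mon logs a warning and does not run the
        # monitor (allow_empty_group, which overrides this, is ignored here).
        return None

    return args + active
```

       With the example from the text, build_monitor_argv("fping.monitor -t 4000 -r 2",
       ["smtp", "nntp", "ns"]) yields the monitor name, its options, and then the three hosts.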

       In addition to environment variables defined by the user in the service definition, mon
       passes certain variables to the monitor process.

       MON_LAST_SUMMARY
              The  first  line  of the output from the last time the monitor exited.  This is not
              the summary of the current monitor run, but the previous one.  This may be used  by
              an alert script to provide historical context in an alert.

       MON_LAST_OUTPUT
              The  entire  output  of  the monitor from the last time it exited.  This is not the
              output of the current monitor run, but the previous one.  This may be  used  by  an
              alert script to provide historical context in an alert.

       MON_LAST_FAILURE
              The time(2) of the last failure for this service.

       MON_FIRST_FAILURE
              The time(2) of the first time this service failed.

       MON_LAST_SUCCESS
              The time(2) of the last time this service passed.

       MON_DESCRIPTION
              The  description  of  this  service, as defined in the configuration file using the
              description tag.

       MON_DEPEND_STATUS
              The dependency status: "0" if a dependency has failed, "1" otherwise.

       MON_LOGDIR
              The directory where log files should be placed, as indicated by the logdir global
              configuration variable.

       MON_STATEDIR
              The directory where state files should be kept, as indicated by the statedir global
              configuration variable.

       MON_CFBASEDIR
              The directory where configuration  files  should  be  kept,  as  indicated  by  the
              cfbasedir global configuration variable.

       "fping.monitor"  should  return an exit status of 0 if it completed successfully (found no
       problems), or nonzero if a problem was detected. The first line of output from the monitor
       script has a special meaning: it is used as a brief summary of the exact failure which was
       detected, and is passed to the alert program. All remaining output is also passed  to  the
       alert program, but it has no required interpretation.

       If  a monitor for a particular service is still running, and the time comes for mon to run
       another monitor for that service, it will not start another monitor. For example,  if  the
       interval  is 10s, and the monitor does not finish running within 10 seconds, then mon will
       wait until the first monitor exits before running another one.
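
       A minimal monitor following these conventions might look like the sketch below. The
       names tcp_check and run_monitor, the choice of port 25, and the use of the failed host
       names as the summary line are assumptions made for illustration; the only behavior taken
       from the text is the exit status (0 for success, nonzero for failure) and the special
       role of the first output line.

```python
import socket


def tcp_check(host, port=25, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def run_monitor(hosts, check=tcp_check):
    """Check every host and return (exit_status, output).

    exit_status is 0 when no problem was found and 1 otherwise. When there
    are failures, the first output line is the brief summary and the
    remaining lines are free-form detail.
    """
    failed = [h for h in hosts if not check(h)]
    if not failed:
        return 0, ""
    summary = " ".join(failed)  # summary line: here, the failed host names
    detail = "\n".join(f"{h}: check failed" for h in failed)
    return 1, summary + "\n" + detail + "\n"
```

       A real script would call run_monitor with the host names mon appends to the command
       line, write the returned output to standard output, and exit with the returned status.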

ALERT DECISION LOGIC

       Upon a non-zero or zero exit status, the associated alert or upalert program
       (respectively) is started, pending the following conditions:

       1. If an alert for a specific service is disabled, do not send an alert.

       2. If dep_behavior is set to 'a', or alertdepend is set, and a parent dependency is
          failing, then suppress the alert.

       3. If the alert has previously been acknowledged, do not send the alert, unless it is
          an upalert.

       4. If an alert is not within the specified period, record the failure via syslog(3)
          and do not send an alert.

       5. If the failure does not fall within a defined period, do not send an alert.

       6. No upalerts are sent without corresponding down alerts, unless no_comp_alerts is
          defined in the period section. An upalert will only be sent if the previous state
          is a failure.

       7. If an alert was already sent within the last alertevery interval, do not send
          another alert, unless the summary output from the current monitor run differs from
          that of the previous run.

       8. Otherwise, send an alert using each alert program listed for that period.

       The observe_detail argument to alertevery affects this behavior by observing the
       changes in the detail part of the output in addition to the summary line.  If a monitor
       has successive failures and the summary output changes in each of them, alertevery will
       not suppress multiple consecutive alerts.  The reasoning is that if the summary output
       changes, then a significant event occurred and the user should be alerted.  The
       "strict" argument to alertevery disables the comparison of the previous monitor output
       with the current output, and also prevents a successful return value of the monitor from
       resetting the alertevery timer. For example, "alertevery 24h strict" will only send out
       an alert once every 24 hours, regardless of whether the monitor output changes, or
       whether the service stops failing and then starts failing again.
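
       The alertevery part of this logic can be sketched as a single predicate. This is a
       reading of the description above, not mon's implementation: the function name
       alertevery_allows and all of its parameters are hypothetical, times are plain seconds,
       and the effect of "strict" on resetting the timer after a success is not modeled.

```python
def alertevery_allows(now, last_alert, alertevery, summary, last_summary,
                      detail="", last_detail="",
                      observe_detail=False, strict=False):
    """Return True if alertevery does not suppress an alert at time `now`.

    last_alert is the time of the previous alert (None if there was none),
    alertevery is the interval in seconds.
    """
    # No earlier alert, or the alertevery interval has already elapsed.
    if last_alert is None or now - last_alert >= alertevery:
        return True

    # "strict": never compare output, so nothing can break the suppression.
    if strict:
        return False

    # A changed summary line is treated as a significant event.
    if summary != last_summary:
        return True

    # observe_detail extends the comparison to the detail part of the output.
    if observe_detail and detail != last_detail:
        return True

    return False
```

       The other conditions (disabled alerts, dependencies, acknowledgements, periods, and
       upalert pairing) would be checked before this predicate is consulted.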

ALERT PROGRAMS

       Alert programs are found in the path supplied with the -a parameter, or in the default
       search path (/usr/local/lib/mon/alert.d:alert.d, see the -a option) if not specified.
       They are invoked with the following command-line parameters:

       -s service
              Service tag from the configuration file.

       -g group
              Host group name from the configuration file.

       -h hosts
              The expanded version of the host group, space delimited, but contained in one shell
              "word".

       -l alertevery
              The number of seconds until the next alarm will be sent.

       -O     This option is supplied to an alert only if the alert is being generated as a
              result of an expected trap timing out.

       -t time
              The time (in time(2) format) of when this failure condition was detected.

       -T     This option is supplied to an alert only if the alert was triggered by a trap

       -u     This option is supplied to an alert only if it is being called as an upalert.

       The  remaining  arguments  are  supplied from the trailing parameters in the configuration
       file, after the "alert" service parameter.

       As with monitor programs, alert programs are invoked with environment variables defined by
       the  user in the service definition, in addition to the following which are explicitly set
       by the server:

       MON_LAST_SUMMARY
              The first line of the output from the last time the monitor exited.

       MON_LAST_OUTPUT
              The entire output of the monitor from the last time it exited.

       MON_LAST_FAILURE
              The time(2) of the last failure for this service.

       MON_FIRST_FAILURE
              The time(2) of the first time this service failed.

       MON_LAST_SUCCESS
              The time(2) of the last time this service passed.

       MON_DESCRIPTION
              The description of this service, as defined in the  configuration  file  using  the
              description tag.

       MON_GROUP
              The watch group which triggered this alarm

       MON_SERVICE
              The service heading which generated this alert

       MON_RETVAL
              The  exit  value  of the failed monitor program, or return value as accepted from a
              trap.

       MON_OPSTATUS
              The operational status of the service.

       MON_ALERTTYPE
              Has  one  of  the  following  values:  "failure",  "up",  "startup",   "trap",   or
              "traptimeout", and signifies the type of alert which was triggered.

       MON_TRAP_INTENDED
              This is only set when an unknown mon trap is received and caught by the
              default/default watch/service. This contains colon-separated entries of the
              trap's intended watch group and service name.

       MON_LOGDIR
              The directory where log files should be placed, as indicated by the logdir global
              configuration variable.

       MON_STATEDIR
              The directory where state files should be kept, as indicated by the statedir global
              configuration variable.

       MON_CFBASEDIR
              The  directory  where  configuration  files  should  be  kept,  as indicated by the
              cfbasedir global configuration variable.

       The first line from standard input must be  used  as  a  brief  summary  of  the  problem,
       normally  supplied as the subject line of an email, or text sent to an alphanumeric pager.
       Interpretation of all subsequent lines read from stdin is left up to the alerting program.
       The  usual  parameters  are  a  list  of  recipients  to deliver the notification to.  The
       interpretation of the recipients is not specified, and is up to the alert program.
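
       An alert program's input handling could be sketched like this. The function names
       parse_alert_args and read_alert_input are hypothetical, and the sketch assumes that the
       trailing parameters from the configuration file are plain words (such as recipient
       addresses) rather than further options. The value of -l is kept as a string because its
       exact format is not specified here.

```python
import argparse


def parse_alert_args(argv):
    """Parse the options mon passes to an alert program."""
    # add_help=False frees "-h", which mon uses for the host list.
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("-s", dest="service")            # service tag
    parser.add_argument("-g", dest="group")              # host group name
    parser.add_argument("-h", dest="hosts", default="")  # hosts in one word
    parser.add_argument("-l", dest="alertevery")         # secs to next alarm
    parser.add_argument("-t", dest="time", type=int)     # time(2) of failure
    parser.add_argument("-O", dest="trap_timeout", action="store_true")
    parser.add_argument("-T", dest="trap", action="store_true")
    parser.add_argument("-u", dest="upalert", action="store_true")
    # Trailing parameters from the "alert" line, assumed to be recipients.
    parser.add_argument("recipients", nargs="*")
    opts = parser.parse_args(argv)
    opts.hosts = opts.hosts.split()  # space-delimited list inside one word
    return opts


def read_alert_input(stream):
    """Return (summary, detail) read from the alert's standard input."""
    summary = stream.readline().rstrip("\n")  # first line: brief summary
    detail = stream.read()                    # the rest: free-form detail
    return summary, detail
```

       A mail-style alert would typically use the summary as the subject line and the detail
       as the message body, and could add context from variables such as MON_ALERTTYPE or
       MON_LAST_SUMMARY read from the environment.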

CONFIGURATION FILE

       The configuration file consists of zero or more global variable definitions, zero or  more
       hostgroup  definitions,  and one or more watch definitions. Each watch definition may have
       one or more service definitions. A watch definition is terminated by a blank line, another
       definition,  or the end of the file. A line beginning with optional leading whitespace and
       a pound ("#") is regarded as a comment, and is ignored.

       Lines are parsed as they are read. Long lines may be  continued  by  ending  them  with  a
       backslash  ("\").   If  a  line  is continued, then the backslash, the trailing whitespace
       after the backslash, and the leading whitespace of the following line are removed. The end
       result is assembled into a single line.
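
       The continuation rule can be illustrated with a short sketch. The function name
       join_continuations is hypothetical and this is not mon's parser; it only applies the
       three removals described above (the backslash, whitespace after it, and leading
       whitespace of the next line).

```python
import re


def join_continuations(text):
    """Return the logical lines of `text` with continued lines joined."""
    logical = []
    pending = None  # text of a continued line waiting for its next part
    for raw in text.splitlines():
        # Leading whitespace is dropped only on lines that continue another.
        line = raw if pending is None else pending + raw.lstrip()
        match = re.search(r"\\\s*$", line)
        if match:
            # Drop the backslash and any whitespace that follows it.
            pending = line[:match.start()]
        else:
            logical.append(line)
            pending = None
    if pending is not None:
        # The file ended in the middle of a continuation.
        logical.append(pending)
    return logical
```

       Whitespace before the backslash is kept, so a space there is what separates the last
       word of one physical line from the first word of the next.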

       Typically the configuration file has the following layout:

       1. Global variable definitions

       2. Hostgroup definitions

       3. Watch definitions

       See the "etc/example.cf" file which comes with the distribution for an example.

   Global Variables
       The  following variables may be set to override compiled-in defaults. Command-line options
       will have a higher precedence than these definitions.

       alertdir = dir
              dir is the full path to the alert scripts. This is the value set by the -a command-
              line parameter.

              Multiple  alert  paths  may  be  specified  by  separating them with a colon.  Non-
              absolute paths are taken to be relative to  the  base  directory  (/usr/lib/mon  by
              default).

              When  the  configuration file is read, all alerts referenced from the configuration
              will be looked up in each of these paths, and the full path to the  first  instance
              of the alert found is stored in a hash. This hash is only generated upon startup or
              after a "reset" command, so newly added alert scripts will not be recognized  until
              a "reset" is performed.

       mondir = dir
              dir  is  the full path to the monitor scripts. This value may also be set by the -s
              command-line parameter. If this path does not begin with a "/", it will be relative
              to basedir.

              Multiple monitor paths may be specified by separating them with a colon. All
              paths must be absolute.

              When the configuration file is read, all monitors referenced from the configuration
              will  be  looked up in each of these paths, and the full path to the first instance
              of the monitor found is stored in a hash. This hash is only generated upon  startup
              or  after  a "reset" command, so newly added monitor scripts will not be recognized
              until a "reset" is performed.

       statedir = dir
              dir is the full path to the state directory.   mon  uses  this  directory  to  save
              various  state  information.  If  this  path  does not begin with a "/", it will be
              relative to basedir.

       logdir = dir
              dir is the full path to the log directory.  mon uses this directory to save various
              logs,  including  the downtime log. If this path does not begin with a "/", it will
              be relative to basedir.

       basedir = dir
              dir is the full path for the state, log, monitor, and alert directories.

       cfbasedir = dir
              dir is the full path where all the config files can be found (monusers.cf, auth.cf,
              etc.).

       authfile = file
              file is the path to the authentication file. If the path does not begin with a "/",
              it will be relative to cfbasedir.

       authtype = type [type...]
              type is the type of authentication to use. A space-separated list of types may be
              specified, and they will be checked in the order they are listed. As soon as a
              successful authentication is performed, the user is considered authenticated by mon
              for the duration of the session and no more authentication checks are performed.

              If  type is getpwnam, then the standard Unix passwd file authentication method will
              be used (calls getpwnam(3) on the user and compares the crypt(3)ed version  of  the
              password  with  what it gets from getpwnam). This will not work if shadow passwords
              are enabled on the system.

              If type is userfile, then usernames and hashed passwords are  read  from  userfile,
              which is defined via the userfile configuration variable.

              If  type  is  pam,  then  PAM  (pluggable  authentication modules) will be used for
              authentication.  The service specified by the pamservice global will be used. If no
              global is given, the PAM passwd service will be used.

              If type is trustlocal, then if the client connection comes from localhost, the
              username passed from the client will be trusted, and the password will be ignored.
              This can be used when you want the client to handle the authentication for you,
              for example a CGI script using one of the many Apache authentication methods.

       userfile = file
              This file is used when authtype is set to userfile.  It consists of a  sequence  of
              lines of the format 'username : password'.  password is stored as the hash returned
              by the standard  Unix  crypt(3)  function.   NOTE:  the  format  of  this  file  is
              compatible with the Apache file based username/password file format. It is possible
              to use the htpasswd program supplied with Apache to manage the mon userfile.

              Blank lines and lines beginning with # are ignored.
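
              Reading such a file can be sketched as follows. The function name parse_userfile
              is hypothetical, treating whitespace around the colon as optional is an
              assumption, and lines without a colon are simply skipped. Verifying a password
              against the stored hash is not shown.

```python
def parse_userfile(text):
    """Return a dict mapping usernames to their stored password hashes."""
    users = {}
    for line in text.splitlines():
        line = line.strip()
        # Blank lines and lines beginning with "#" are ignored.
        if not line or line.startswith("#"):
            continue
        username, sep, pw_hash = line.partition(":")
        if not sep:
            continue  # not a "username : password" line; skipped in this sketch
        users[username.strip()] = pw_hash.strip()
    return users
```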

       pamservice = service
              The PAM service used for authentication.  This  is  applicable  only  if  "pam"  is
              specified as a parameter to the authtype setting. If this global is not defined, it
              defaults to passwd.

       serverbind = addr

       trapbind = addr

              serverbind and trapbind specify which address to bind the server and trap ports to,
              respectively.   If  these are not defined, the default address is INADDR_ANY, which
              allows connections on all interfaces. For security reasons, it could be a good idea
              to bind only to the loopback interface.

       dtlogfile = file
              file is a file which will be used to record the downtime log. Whenever a service
              fails for some amount of time and then stops failing, this event is written to the
              log. If this parameter is not set, no logging is done. The format of the file is as
              follows (# is a comment and may be ignored):

              timenoticed group service firstfail downtime interval summary.

              timenoticed is the time(2) the service came back up.

              group service is the group and service which failed.

              firstfail is the time(2) when the service began to fail.

              downtime is the number of seconds the service failed.

              interval is the frequency (in seconds) that the service is polled.

              summary is the summary line from when the service was failing.
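
              A reader for this format might look like the sketch below. The names
              DowntimeEntry and parse_dtlog_line are hypothetical. The sketch assumes the
              fields are separated by whitespace and that the summary is everything after the
              sixth field; numeric fields are read as floats so that fractional values, if
              present, are tolerated.

```python
from typing import NamedTuple, Optional


class DowntimeEntry(NamedTuple):
    timenoticed: float  # time(2) the service came back up
    group: str
    service: str
    firstfail: float    # time(2) the service began to fail
    downtime: float     # seconds the service failed
    interval: float     # polling frequency in seconds
    summary: str


def parse_dtlog_line(line: str) -> Optional[DowntimeEntry]:
    """Parse one downtime log line; return None for comments and blanks."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    fields = line.split(None, 6)  # the summary may itself contain spaces
    if len(fields) < 6:
        raise ValueError(f"malformed downtime log line: {line!r}")
    summary = fields[6] if len(fields) > 6 else ""
    return DowntimeEntry(float(fields[0]), fields[1], fields[2],
                         float(fields[3]), float(fields[4]),
                         float(fields[5]), summary)
```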

       monerrfile = filename
              By default, when mon daemonizes itself, it connects stdout and stderr to /dev/null.
              If  monerrfile  is  set  to a file, then stdout and stderr will be appended to that
              file. In all cases stdin is connected to /dev/null. If mon is told to  run  in  the
              foreground   and   to   not   daemonize,   then   none   of   this  applies,  since
              stdin/stdout/stderr stay connected to whatever they were at the time of invocation.

       dtlogging = yes/no

              Turns downtime logging on or off. The default is off.

       histlength = num
              num is the maximum number of events to be retained in the history list. The
              default is 100.  This value may also be set by the -k command-line parameter.

       historicfile = file
              If this variable is set, then alerts are logged to file, and upon startup, some (or
              all) of the past history is read into memory.

       historictime = timeval
              timeval is the amount of the history file to read upon startup: entries from
              "now" minus timeval onward are read. See the explanation of interval in the
              "Service Definitions" section for a description of timeval.

       serverport = port
              port is the TCP port number that the server should bind to. This value may also  be
              set  by  the  -p  command-line  parameter.  Normally  this  port  is  looked up via
              getservbyname(3), and it defaults to 2583.

       trapport = port
              port is the UDP port number that the trap server should  bind  to.   Normally  this
              port is looked up via getservbyname(3), and it defaults to 2583.

       pidfile = path
              path is the file the server will store its pid in.  This value may also be set by
              the -P command-line parameter.

       maxprocs = num
              Throttles the number of concurrently forked processes to num.   The  intent  is  to
              provide  a  safety  net for the unlikely situation when the server tries to take on
              too many tasks at once.  Note that this situation has only been reported to  happen
              when  trying  to  use a garbled configuration file! You don't want to use a garbled
              configuration file now, do you?

       cltimeout = secs
              Sets the client inactivity timeout to secs.  This is meant to help thwart denial of
              service  attacks  or  recover  from  crashed  clients.   secs  is  interpreted as a
              "1h/1m/1s" string, where "1m" = 60 seconds.

       randstart = interval
              When the server starts, normally no service is scheduled until the interval
              defined in its service section has elapsed.  This can cause long delays
              before the first check of a service, and possibly a high  load  on  the  server  if
              multiple  things  are  scheduled  at  the  same  intervals.  This option is used to
              randomize the scheduling of the first test for  all  services  during  the  startup
              period,  and  immediately  after  the  reset command.  If randstart is defined, the
              scheduled run time of all services of all watch groups  will  be  a  random  number
              between zero and randstart seconds.
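
              The effect of randstart can be pictured with a small sketch. The function name
              initial_schedule is hypothetical, randstart is taken as a number of seconds, and
              the use of a uniform floating-point distribution is an assumption; the text only
              states that the first run time is a random number between zero and randstart
              seconds.

```python
import random


def initial_schedule(services, randstart, rng=None):
    """Return a dict mapping each service name to its first-run delay.

    Each delay is drawn independently between 0 and randstart seconds, so
    services sharing the same interval do not all start at once.
    """
    rng = rng or random.Random()
    return {name: rng.uniform(0, randstart) for name in services}
```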

       dep_recur_limit = depth
              Limit dependency recursion level to depth.  If dependency recursion (dependencies
              which depend on other dependencies) tries to go beyond depth, then the recursion is
              aborted and a message is logged to syslog.  The default limit is 10.

       dep_behavior = {a|m|hm}
              dep_behavior  controls  whether  the  dependency  expression suppresses one of: the
              running of alerts, the running of monitors, or the passing of individual  hosts  to
              the  monitors.   Read  more about the behavior in the "Service Definitions" section
              below.

              This is a global setting which controls  the  default  settings  for  the  service-
              specified variable.

       dep_memory = timeval
              If set, dep_memory will cause dependencies to continue to prevent alerts/monitoring
              for a period of time after the service returns to a normal state.  This can be used
              to  prevent  over-eager alerting when a machine is rebooting, for example.  See the
              explanation of interval in the "Service Definitions" section for a  description  of
              timeval.

              This  is  a  global  setting  which  controls the default settings for the service-
              specified variable.

       syslog_facility = facility
              Specifies the syslog facility used for logging.  daemon is the default.

       startupalerts_on_reset = {yes|no}

              If set to "yes", startupalerts will be invoked when the  reset  client  command  is
              executed. The default is "no".

       monremote = program

              If  set,  this  external program will be called by Mon when various client requests
              are processed.  This can be used to propagate those changes from one Mon server  to
              another, if you have multiple monitoring machines.  An example script, monremote.pl
              is available in the clients directory.

   Hostgroup Entries
       Hostgroup entries begin with the keyword hostgroup, and are followed by  a  hostgroup  tag
       and one or more hostnames or IP addresses, separated by whitespace. The hostgroup tag must
       be composed of alphanumeric characters, a dash ("-"), a period  ("."),  or  an  underscore
       ("_").  Non-blank  lines  following  the  first  hostgroup  line  are  interpreted as more
       hostnames.  The hostgroup definition ends with a blank line. For example:

              hostgroup servers nameserver smtpserver nntpserver
                   nfsserver httpserver smbserver

              hostgroup router_group cisco7000 agsplus
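
       The hostgroup syntax can be illustrated with a small parser sketch. The function name
       parse_hostgroups is hypothetical and this is not mon's parser: comments and line
       continuations are not handled, and any line outside a hostgroup definition is ignored.

```python
import re

# Tag characters allowed by the text: alphanumerics, "-", "." and "_".
TAG_RE = re.compile(r"[A-Za-z0-9._-]+")


def parse_hostgroups(text):
    """Return a dict mapping each hostgroup tag to its list of hosts."""
    groups = {}
    current = None  # host list of the definition being read, if any
    for line in text.splitlines():
        words = line.split()
        if not words:
            current = None  # a blank line ends the definition
            continue
        if words[0] == "hostgroup":
            if len(words) < 2 or not TAG_RE.fullmatch(words[1]):
                raise ValueError(f"bad hostgroup line: {line!r}")
            current = groups.setdefault(words[1], [])
            current.extend(words[2:])
        elif current is not None:
            # Non-blank lines after the first one add more hostnames.
            current.extend(words)
    return groups
```

       Applied to the example above, the sketch returns six hosts for "servers" and two for
       "router_group".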

   View Entries
       View entries begin with the keyword view, and are followed by a view tag and the names  of
       one  or more hostgroups.  The view tag must be composed of alphanumeric characters, a dash
       ("-"), a period ("."), or an underscore ("_"). Non-blank lines following  the  first  view
       line are interpreted as more hostgroup names.  The view definition ends with a blank line.
       For example:

              view servers dns-servers web-servers file-servers
                   mail-servers

              view network-services routers switches vpn-servers

   Watch Group Entries
       Watch entries begin with a line that starts with the keyword watch, followed by whitespace
       and  a single word which normally refers to a pre-defined hostgroup. If the second word is
       not recognized as a hostgroup tag, a new hostgroup is created whose tag is that word,  and
       that word is its only member.

       Watch entries consist of one or more service definitions.

       A  watch  group  is  terminated  by  a blank line, the end of the file, or by a subsequent
       definition, "watch", "hostgroup", or otherwise.

       There may be a special watch group entry called "default". If a  default  watch  group  is
       defined  with  a  service  entry  named  "default",  then  this definition will be used in
       handling traps received for an unrecognized watch and service.

   Service Definitions
       service servicename
              A service definition begins with the keyword service followed by a word which is
              the tag for this service.  This word must be unique among all services defined for
              the same watch group.

              The components of a service are an interval, monitor, and one or more  time  period
              definitions, as defined below.

              If a service name of "default" is defined within a watch group called "default"
              (see above), then the default/default definition will be used for handling unknown
              mon traps.

              The   following  configuration  parameters  are  valid  only  following  a  service
              definition:

       VARIABLE=value
              Environment variables may be defined for each service, which will  be  included  in
              the  environment of monitors and alerts. Variables must be specified in all capital
              letters, must begin with an alphabetical character or an underscore, and there must
              be no spaces to the left of the equal sign.
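
              A hedged illustration of per-service variables, in which both variable names, the
              db-servers hostgroup, and the check-db.monitor script are hypothetical:

                     # DBNAME and TIMEOUT are made-up names; a monitor or alert script
                     # would read them from its environment
                     watch db-servers
                          service db
                               DBNAME=inventory
                               TIMEOUT=30
                               interval 5m
                               monitor check-db.monitor
                               period wd {Sun-Sat}
                                    alert mail.alert someone@your.org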

       interval timeval
              The keyword interval followed by a time value specifies how often the monitor
              script will be triggered.  Time values are written as "30s", "5m", "1h", or "1d",
              meaning 30 seconds, 5 minutes, 1 hour, or 1 day.  The numeric portion may be a
              fraction, such as "1.5h" for an hour and a half.  This format of a time
              specification will be referred to as timeval.

       failure_interval timeval
              Adjusts  the  polling interval to timeval when the service check is failing. Resets
              the interval to the original when the service succeeds.
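
              For instance, a service might be polled every five minutes while healthy and every
              minute while failing.  This is a sketch; web-servers is assumed to be a defined
              hostgroup and fping.monitor an installed monitor script:

                     # fping.monitor is an assumed monitor script name
                     watch web-servers
                          service ping
                               interval 5m
                               failure_interval 1m
                               monitor fping.monitor
                               period wd {Sun-Sat}
                                    alert mail.alert someone@your.org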

       traptimeout timeval
              This keyword takes the same time specification argument as interval, and makes  the
              service  expect  a trap from an external source at least that often, else a failure
              will be registered. This is used for a heartbeat-style service.

       trapduration timeval
              If a trap is received, the status of the service the trap  was  delivered  to  will
              normally  remain  constant. If trapduration is specified, the status of the service
              will remain in a failure state for the duration specified by timeval, and  then  it
              will be reset to "success".
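
              A heartbeat-style service built from traptimeout and trapduration might look like
              the following sketch.  The watch and service names are hypothetical, and no monitor
              is defined because the status is expected to arrive by trap:

                     # expects a trap at least every 15 minutes; a failure state set by a
                     # trap is held for one hour and then reset to "success"
                     watch backup-host
                          service heartbeat
                               description TRAP: backup job heartbeat
                               traptimeout 15m
                               trapduration 1h
                               period wd {Sun-Sat}
                                    alert mail.alert someone@your.org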

       randskew timeval
              Rather than schedule the monitor script to run at the start of each interval,
              randomly adjust the interval specified by the interval parameter by plus or minus
              randskew.  The skew value is specified in the same way as the interval parameter:
              "30s", "5m", etc.  For example, if interval is "1m" and randskew is "5s", then mon
              will schedule the monitor script at some time between every 55 seconds and 65
              seconds.  The intent is to help distribute the load on the server when many
              services are scheduled at the same intervals.

       monitor monitor-name [arg...]
              The  keyword  monitor followed by a script name and arguments specifies the monitor
              to run when the timer expires. Shell-like quoting  conventions  are  followed  when
              specifying the arguments to send to the monitor script.  The script is invoked from
              the directory given with the -s argument, and all following words are  supplied  as
              arguments  to  the  monitor  program,  followed  by  the list of hosts in the group
              referred to by the current watch group.  If the monitor line ends with  ";;"  as  a
              separate  word,  the  host  groups  are  not appended to the argument list when the
              program is invoked.
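
              The sketch below contrasts the two invocation styles.  Both script names and the
              arguments shown are hypothetical:

                     # first service: mon appends the hosts in web-servers after "-t 10"
                     # second service: the trailing ";;" suppresses the host list
                     watch web-servers
                          service http
                               interval 5m
                               monitor check-http.monitor -t 10
                               period wd {Sun-Sat}
                                    alert mail.alert someone@your.org
                          service site-wide
                               interval 10m
                               monitor check-site.monitor --url "http://www.example.com/" ;;
                               period wd {Sun-Sat}
                                    alert mail.alert someone@your.org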

       allow_empty_group
              The allow_empty_group option will allow a monitor  to  be  invoked  even  when  the
              hostgroup  for  that watch is empty because of disabled hosts. The default behavior
              is not to invoke the monitor when all hosts in a hostgroup have been disabled.

       description descriptiontext
              The text following description can be queried by client programs and is passed to
              alerts and monitors via an environment variable.  It should contain a brief
              description of the service, suitable for inclusion in an email or on a web page.

       exclude_hosts host [host...]
              Any hosts listed after exclude_hosts will be excluded from the service check.

       exclude_period periodspec
              Do not run a scheduled monitor during the time identified by periodspec.
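
              The description, exclude_hosts, and exclude_period options could be combined as in
              the following sketch, in which the host name, the maintenance window, and the
              fping.monitor script are illustrative assumptions:

                     # leave out host "web3" and pause checks on Sundays during the
                     # 2am and 3am hours
                     watch web-servers
                          service ping
                               description ICMP reachability of the web servers
                               exclude_hosts web3
                               exclude_period wd {Sun} hr {2am-3am}
                               interval 5m
                               monitor fping.monitor
                               period wd {Sun-Sat}
                                    alert mail.alert someone@your.org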

       depend dependexpression
              The depend keyword is used to specify a dependency expression, which evaluates to
              either true or false, in the boolean sense.  Dependencies are actual Perl
              expressions, and must obey all syntactical rules. The expressions are evaluated  in
              their  own  package space so as to not accidentally have some unwanted side-effect.
              If a syntax error is found when evaluating the expression, it is logged via syslog.

              Before evaluation, the following substitutions on  the  expression  occur:  phrases
              which  look  like  "group:service"  are  substituted  with the value of the current
              operational status of that specified  service.  These  opstatus  substitutions  are
              computed recursively, so if service A depends upon service B, and service B depends
              upon service C, then service A  depends  upon  service  C.  Successful  operational
              statuses (which evaluate to "1") are "STAT_OK", "STAT_COLDSTART", "STAT_WARMSTART",
              and "STAT_UNKNOWN".  The word "SELF" (in all caps) can be used for the group  (e.g.
              "SELF:service"), and is an abbreviation for the current watch group.

              This  feature  can  be  used  to control alerts for services which are dependent on
              other services, e.g. an SMTP test which is dependent upon the machine  being  ping-
              reachable.
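
              The SMTP case mentioned above might be written as in this sketch.  The mail-servers
              hostgroup and both monitor script names are assumptions; how the result of the
              expression is applied is governed by dep_behavior:

                     # the smtp service depends on the ping service of the same watch group
                     watch mail-servers
                          service ping
                               interval 1m
                               monitor fping.monitor
                               period wd {Sun-Sat}
                                    alert mail.alert someone@your.org
                          service smtp
                               interval 5m
                               monitor smtp.monitor
                               depend SELF:ping
                               period wd {Sun-Sat}
                                    alert mail.alert someone@your.org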

       dep_behavior {a|m|hm}
              The  evaluation  of  the  dependency  graphs  specified  via the depend keyword can
              control the suppression of alert or monitor  invocations,  or  the  suppression  of
              individual hosts passed to the monitor.

              Alert  suppression.   If  this option is set to "a", then the dependency expression
              will be evaluated after the monitor for the  service  exits  or  after  a  trap  is
              received.  An alert will only be sent if the evaluation succeeds, meaning that none
              of the nodes in the dependency graph indicate failure.

              Monitor suppression.  If it is set to "m", then the dependency expression will be
              evaluated before the monitor for the service is about to run.  If the evaluation
              succeeds, then the monitor will be run.  Otherwise, the monitor will not be run and
              the status of the service will remain the same.

              Host  suppression.  If it is set to "hm" then Mon will extract the list of "parent"
              services from the dependency expression.  (In fact the expression  can  be  just  a
              list  of  services.)  Then when the monitor for the service is about to be run, for
              each host in the current hostgroup Mon will search all the  parent  services  which
              are  currently failing and look for the hostname in the current summary output.  If
              the hostname is found, this host will be excluded from this  run  of  the  monitor.
              This  can  be  used  to e.g. allow an SMTP test on a group of hosts to still be run
              even when a single host is not ping-reachable.  If all the rest of  the  hosts  are
              working  fine,  the  service  will be in an OK state, but if another host fails the
              SMTP test Mon can still alert about that host even though the parent dependency was
              failing.  The dependency expression will not be used recursively in this case.
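
              Host suppression might be configured as in this sketch, which assumes that the same
              watch group also defines a "ping" service and that a script named smtp.monitor is
              installed:

                     # hosts reported as failing by SELF:ping are left out of the smtp check
                     watch mail-servers
                          service smtp
                               interval 5m
                               monitor smtp.monitor
                               depend SELF:ping
                               dep_behavior hm
                               period wd {Sun-Sat}
                                    alert mail.alert someone@your.org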

       alertdepend dependexpression

       monitordepend dependexpression

       hostdepend dependexpression
              These  keywords  allow  you to specify multiple dependency expressions of different
              types.  Each one corresponds to the different dep_behavior settings  listed  above.
              They will be evaluated independently in the different contexts as listed above.  If
              depend is present, it takes precedence over the matching keyword, depending on  the
              dep_behavior setting.
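
              As an illustration, separate expressions can be given for monitor suppression and
              alert suppression.  The "routers" watch group, the "ping" services, and the
              smtp.monitor script are assumptions:

                     # skip the check while routers:ping is failing, and withhold alerts
                     # while this group's own ping service is failing
                     watch mail-servers
                          service smtp
                               interval 5m
                               monitor smtp.monitor
                               monitordepend routers:ping
                               alertdepend SELF:ping
                               period wd {Sun-Sat}
                                    alert mail.alert someone@your.org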

       dep_memory timeval
              If set, dep_memory will cause dependencies to continue to prevent alerts/monitoring
              for a period of time after the service returns to a normal state.  This can be used
              to  prevent  over-eager alerting when a machine is rebooting, for example.  See the
              explanation of interval in the "Service Definitions" section for a  description  of
              timeval.

       redistribute alert [arg...]
              A service may have one redistribute option, which is a special form of an alert
              definition.  This alert will  be  called  on  every  service  status  update,  even
              sequential  success status updates.  This can be used to integrate Mon with another
              monitoring system, or to link together multiple Mon servers  via  an  alert  script
              that generates Mon traps.  See the "ALERT PROGRAMS" section above for a list of the
              parameters mon will pass automatically to alert programs.

       unack_summary
              Remove the "acknowledged" state from a service if  the  summary  component  of  the
              failure  message  changes.   In  most common usage the summary is the list of hosts
              that are failing, so additional hosts failing would remove an ack.

   Period Definitions
       Periods are used to define the conditions which should allow alerts to be delivered.

       period [label:] periodspec
              A period groups one or more alarms and variables which control how often  an  alert
              happens  when  there  is a failure.  The period definition has two forms. The first
              takes an argument which is a period specification from Patrick Ryan's  Time::Period
              Perl 5 module. Refer to "perldoc Time::Period" for more information.

              The  second  form  requires  a label followed by a period specification, as defined
              above. The label is a tag consisting  of  an  alphabetic  character  or  underscore
              followed by zero or more alphanumerics or underscores and ending with a colon. This
              form allows multiple periods with the same period definition. One use is to have  a
              period definition which has no alertafter or alertevery parameters for a particular
              time period, and another for the same time period with a different  set  of  alerts
              that does contain those parameters.
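
              A sketch of this use, with two labeled periods that share one period
              specification.  The labels, the fping.monitor script, and the pager.alert script
              with its argument are hypothetical:

                     # "always" mails on every failure; "escalate" pages only after three
                     # consecutive failures, and then at most once per hour
                     watch web-servers
                          service ping
                               interval 1m
                               monitor fping.monitor
                               period always: wd {Sun-Sat}
                                    alert mail.alert someone@your.org
                               period escalate: wd {Sun-Sat}
                                    alertafter 3
                                    alertevery 1h
                                    alert pager.alert oncall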

              Period  definitions, in either the first or second form, must be unique within each
              service definition. For example, if you need to define two  periods  both  for  "wd
              {Sun-Sat}", then one or both of the period definitions must specify a label such as
              "period t1: wd {Sun-Sat}" and "period t2: wd {Sun-Sat}".

       alertevery timeval [observe_detail | strict]
              The alertevery keyword (within a period definition) takes the same type of argument
              as  the interval variable, and limits the number of times an alert is sent when the
              service continues to fail.  For example, if the alertevery interval is "1h", then
              the alerts in the period section will be triggered only once every hour.  If the
              alertevery keyword is omitted in a period entry, an alert will be  sent  out  every
              time  a  failure  is  detected. By default, if the summary output of two successive
              failures changes, then the alertevery interval is overridden, and an alert will  be
              sent.   If  the string "observe_detail" is the last argument, then both the summary
              and detail output lines will be considered when comparing the output of  successive
              failures.   If  the  string  "strict"  is the last argument, then the output of the
              monitor or the state change of the service will have no effect on when  alerts  are
              sent.  That is, "alertevery 24h strict" will send only one alert every 24 hours, no
              matter what.  Please refer to the ALERT  DECISION  LOGIC  section  for  a  detailed
              explanation of how alerts are suppressed.
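
              The three variants might be written as follows, shown in separate labeled periods
              for comparison.  This is a sketch; the labels and the fping.monitor script are
              assumptions:

                     # p1: a changed summary line overrides the one-hour limit
                     # p2: a changed summary or detail output overrides the limit
                     # p3: at most one alert every 24 hours, whatever the output
                     watch web-servers
                          service ping
                               interval 1m
                               monitor fping.monitor
                               period p1: wd {Mon-Fri}
                                    alertevery 1h
                                    alert mail.alert someone@your.org
                               period p2: wd {Sat}
                                    alertevery 1h observe_detail
                                    alert mail.alert someone@your.org
                               period p3: wd {Sun}
                                    alertevery 24h strict
                                    alert mail.alert someone@your.org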

       alertafter num

       alertafter num timeval

       alertafter timeval
              The  alertafter  keyword  (within  a period section) has three forms: only with the
              "num" argument, or with the "num timeval" arguments, or  only  with  the  "timeval"
              argument.  In the first form, an alert will only be invoked after "num" consecutive
              failures.

              In the second form, the arguments are a positive integer followed by  an  interval,
              as  described  by  the interval variable above.  If these parameters are specified,
              then the alerts for that period will only be called after that many failures happen
              within  that  interval.  For example, if alertafter is given the arguments "3 30m",
              then the alert will be called if 3 failures happen within 30 minutes.

              In the third form, the argument is  an  interval,  as  described  by  the  interval
              variable above.  Alerts for that period will only be called if the service has been
              in a failure state for more than the length of time described by the interval,
              regardless of the number of failures noticed within that interval.
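
              The three forms could appear as in the sketch below.  Labels are used because all
              three periods share one period specification; the labels and the fping.monitor
              script are assumptions:

                     # t1: alert after 3 consecutive failures
                     # t2: alert after 3 failures within 30 minutes
                     # t3: alert once the service has been failing for more than 30 minutes
                     watch web-servers
                          service ping
                               interval 5m
                               monitor fping.monitor
                               period t1: wd {Sun-Sat}
                                    alertafter 3
                                    alert mail.alert someone@your.org
                               period t2: wd {Sun-Sat}
                                    alertafter 3 30m
                                    alert mail.alert someone@your.org
                               period t3: wd {Sun-Sat}
                                    alertafter 30m
                                    alert mail.alert someone@your.org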

       numalerts num
              This  variable  tells  the server to call no more than num alerts during a failure.
              The alert counter is kept on a per-period basis, and is reset upon each success.

       no_comp_alerts
              If this option is specified, then upalerts will  be  called  whenever  the  service
              state  changes  from  failure  to  success,  rather than only after a corresponding
              "down" alert.

       alert alert [arg...]
              A period may contain multiple alerts, which  are  triggered  upon  failure  of  the
              service. An alert is specified with the alert keyword, followed by an optional exit
              parameter, and arguments which are interpreted the same as the monitor  definition,
              but  without  the  ";;"  exception.  The exit parameter takes the form of exit=x or
              exit=x-y and has the effect that the alert is only called if the exit status of the
              monitor script falls within the range of the exit parameter.  If, for example, the
              alert line is "alert exit=10-20 mail.alert mis", then mail.alert will only be
              invoked with mis as its argument if the monitor program's exit value is between 10
              and 20.  This feature allows you to trigger different alerts at different severity
              levels (like when free disk space goes from 8% to 3%).

              See the ALERT PROGRAMS section above for a list of the parameters mon will pass
              automatically to alert programs.
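
              Severity levels could be mapped to different alerts as in this sketch.  The
              check-disk.monitor script, its exit codes, and the pager.alert script are
              hypothetical:

                     # assumed convention: exit 1-9 means low space, exit 10-20 means critical
                     watch file-servers
                          service diskspace
                               interval 15m
                               monitor check-disk.monitor
                               period wd {Sun-Sat}
                                    alert exit=1-9 mail.alert someone@your.org
                                    alert exit=10-20 pager.alert oncall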

       upalert alert [arg...]
              An upalert is the complement of an alert.  An upalert is called when a service
              makes the state transition from failure to success, if a corresponding "down" alert
              was previously sent.  The upalert script is called with the same parameters as the
              alert script, plus the -u parameter, which simply lets an alert script know that it
              is being called as an upalert.  Multiple upalerts may be specified for each period
              definition.  Set the per-period no_comp_alerts option to send an upalert regardless
              of whether a "down" alert was sent.

       startupalert alert [arg...]
              A startupalert is only called when the mon  server  starts  execution,  or  when  a
              "reset"  command  was  issued  to  the  server,  depending  on  the  setting of the
              startupalerts_on_reset global.  Unlike other alerts, startupalerts are  not  called
              following the exit of a monitor, i.e. they are called in their own right, therefore
              the "exit=" argument is not applicable to startupalert.

       upalertafter timeval
              The upalertafter parameter is specified as a string that follows the syntax of  the
              interval  parameter ("30s", "1m", etc.), and controls the triggering of an upalert.
              If a service comes back up after being down for a time greater than or equal to the
              value of this option, an upalert will be called.  Use this option to prevent
              upalerts from being called because of "blips" (brief outages).
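
              A sketch combining upalert and upalertafter, so that recovery from an outage
              shorter than five minutes produces no upalert.  The fping.monitor script name is an
              assumption:

                     # -u is not written here, on the reading that mon supplies it when it
                     # calls an upalert
                     watch web-servers
                          service ping
                               interval 1m
                               monitor fping.monitor
                               period wd {Sun-Sat}
                                    alert mail.alert someone@your.org
                                    upalert mail.alert someone@your.org
                                    upalertafter 5m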

AUTHENTICATION CONFIGURATION FILE

       The file specified by the authfile variable in the configuration file (or passed  via  the
       -A  parameter)  will  be  loaded  upon startup.  This file defines restrictions upon which
       client commands may be executed by which users. It  is  a  text  file  which  consists  of
       comments,  command definitions, and trap authentication parameters.  A comment line begins
       with optional whitespace followed by a pound sign. Blank lines are ignored.

       The file is separated into a command section and a trap section. Sections are specified by
       a single line containing one of the following statements:

                   command section

       or

                   trap section

       Lines  following one of the above statements apply to that section until either the end of
       the file or another section begins.

       A command definition consists of a command, followed by a  colon,  followed  by  a  comma-
       separated  list  of  users  who may execute the command.  The default is that no users may
       execute any commands unless they are explicitly allowed in this  configuration  file.  For
       clarity,  a user can be denied by prefixing the user name with "!". If the word "AUTH_ANY"
       is used for a username, then any  authenticated  user  will  be  allowed  to  execute  the
       command.  If  the  word "all" is used for a username, then that command may be executed by
       any user, authenticated or not.

       The trap section allows configuration of which users may send traps from which hosts.  The
       syntax  is  a  source host (name or ip address), whitespace, a username, whitespace, and a
       plaintext password for that user. If the source host is "*", then  allow  traps  from  any
       host.  If  the  username  is  "*",  then  accept  traps without regard for the username or
       password. If no hosts or users are specified, then no traps will be accepted.

       An example configuration file:

              command section
              list:          all
              reset:         root,admin
              loadstate:          root
              savestate:          root

              trap section
              127.0.0.1 root r@@tp4sswrd

       This means that all clients are able to perform the list command, "root" is able to
       perform "reset", "loadstate", and "savestate", and "admin" is able to execute the "reset"
       command.
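
       As a further sketch, AUTH_ANY could be used to open selected commands to every
       authenticated user.  The command names "disable" and "enable" are assumed here to be valid
       client commands; see moncmd for the actual list:

              # disable and enable are assumed command names
              command section
              list:          all
              disable:       AUTH_ANY
              enable:        AUTH_ANY
              reset:         root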

CLIENT-SERVER INTERFACE

       The server listens on TCP port 2583, which may be overridden  using  the  -p port  option.
       Commands  are  a  single  line  each,  terminated by a newline.  The server can handle any
       number of simultaneous client connections.

CLIENT INTERFACE COMMANDS

       See manual page for moncmd.

MON TRAPPING

       Mon has the facility to receive special "mon traps" from  any  local  or  remote  machine.
       Currently, the only available method for sending mon traps is through the Mon::Client
       perl interface, though the UDP packet format is defined well enough to permit the  writing
       of traps in other languages.

       Traps are handled similarly to monitors: a trap sends an operational status, summary line,
       and description text, and mon generates an alert or upalert as necessary.

       Traps can be caught by any watch/service group set up in the mon configuration file;
       however, it is suggested that you configure watch/service groups specifically for the traps
       you expect to receive. When defining a special  watch/service  group  for  traps,  do  not
       include  a  "monitor"  directive  (as  no monitor need be invoked). Since a monitor is not
       being invoked, it is not necessary for the watch definition  to  have  a  hostgroup  which
       contains  real  host names.  Just make up a useful name, and mon will automatically create
       the watch group for you.

       Here is a simple config file example:

              watch trap-service
                   service host1-disks
                        description TRAP: for host1 disk status
                        period wd {Sun-Sat}
                             alert mail.alert someone@your.org
                             upalert mail.alert -u someone@your.org

       Since mon listens on a UDP port for any trap, a default facility is available for handling
       traps  to  unknown  groups  or  services.   To  enable  this  facility, you must include a
       "default" watch group with a "default" service entry containing the specifics  of  alarms.
       If  a  default/default  watch group and service are not configured, then unknown traps get
       logged via syslog, and no alarm is sent.  NOTE: The default/default facility is  a  single
       entity  as  far  as accounting and alarming go. Alarm programs which are not aware of this
       fact may send confusing information when a failure trap comes from one  machine,  followed
       by  a  success  (ok)  trap  from  a  different machine. See the alarm environment variable
       MON_TRAP_INTENDED  above  for  a  possible  way  around  this.   It   is   intended   that
       default/default  be  used  as  a facility to catch unknown traps, and should not be relied
       upon to catch all traps in a production environment. If you are lazy and only want to  use
       default/default  for catching all traps, it would be best to disable upalerts, and use the
       MON_TRAP_INTENDED environment variable in alert scripts to make the alerts more meaningful
       to you.

       Here is an example default facility:

              watch default
                   service default
                        description Default trap service
                        period wd {Sun-Sat}
                             alert mail.alert someone@your.org
                             upalert mail.alert -u someone@your.org

EXAMPLES

       The mon distribution comes with an example configuration called example.cf.  Refer to that
       file for more information.

SEE ALSO

       moncmd(1), Time::Period(3pm), Mon::Client(3pm)

HISTORY

       mon was written because I couldn't find anything out there that did just  what  I  needed,
       and nothing was worth modifying to add the features I wanted. It doesn't have a cool name,
       and that bothers me because I couldn't think of one.

BUGS

       Report bugs to the email address below.

AUTHOR

       Jim Trocki <trockij@arctic.org>