Provided by: mon_1.2.0-8_amd64

NAME

       mon - monitor services for availability, sending alarms upon failures.

SYNOPSIS

       mon  [-dfhlMSv]  [-a  dir]  [-A  authfile]  [-b dir] [-B dir] [-c config] [-D dir] [-i secs] [-k num] [-l
       [statetype]] [-L dir] [-m num] [-p num] [-P pidfile] [-r delay] [-s dir]

DESCRIPTION

       mon is a general-purpose scheduler  for  monitoring  service  availability  and  triggering  alerts  upon
       detecting  failures.   mon  was  designed  to  be open in the sense that it supports arbitrary monitoring
       facilities and alert methods via a common interface, which are easily implemented through programs (in C,
       Perl, shell, etc.), SNMP traps, and special Mon (UDP packet) traps.

OPTIONS

       -a dir Path to alert scripts. Default is /usr/local/lib/mon/alert.d:alert.d.  Multiple alert paths may be
              specified by separating them with a colon.  Non-absolute paths are taken to  be  relative  to  the
              base directory (/usr/lib/mon by default).

       -b dir Base  directory  for  mon.  scriptdir,  alertdir,  and statedir are all relative to this directory
              unless specified from /.  Default is /usr/lib/mon.

       -B dir Configuration  file  base  directory.  All  config  files  are  located  here,  including  mon.cf,
              monusers.cf, and auth.cf.

       -A authfile
              Authentication  configuration  file. By default this is /etc/mon/auth.cf if the /etc/mon directory
              exists, or /usr/lib/mon/auth.cf otherwise.

       -c file
              Read configuration from file.  This defaults to /etc/mon/mon.cf if the /etc/mon directory
              exists, otherwise to /etc/mon.cf.

       -d     Enable debugging mode.

       -D dir Path   to   state   directory.    Default  is  the  first  of  /var/state/mon,  /var/lib/mon,  and
              /usr/lib/mon/state.d which exists.

       -f     Fork and run as a daemon process. This is the preferred way to run mon.

       -h     Print help information.

       -i secs
              Sleep interval, in seconds. Defaults to 1. This shouldn't need to be adjusted for any reason.

       -k num Set log history to a maximum of num entries. Defaults to 100.

       -l statetype
              Load state from the last saved state file. The  supported  saved  state  types  are  disabled  for
              disabled  watches, services, and hosts, opstatus for failure/alert/ack status of all services, and
              all for both.  If no statetype is provided, disabled is assumed.

       -L dir Sets the log dir. See also logdir in the configuration file.  The default is /var/log/mon if  that
              directory exists, otherwise log.d in the base directory.

       -M     Pre-process the configuration file with the macro expansion package m4.

       -m num Set the throttle for the maximum number of processes to num.

       -p num Make server listen on port num.  This defaults to 2583.

       -S     Start with the scheduler stopped.

       -P pidfile
              Store   the   server's  pid  in  pidfile,  the  default  is  the  first  of  /var/run/mon/mon.pid,
              /var/run/mon.pid, and /etc/mon.pid whose directory exists.  An empty value tells mon not to use  a
              pid file.

       -r delay
              Sets  the  number of seconds used to randomize the startup delay before each service is scheduled.
              Refer to the global randstart variable in the configuration file.

       -s dir Path to monitor scripts. Default is /usr/local/lib/mon/mon.d:mon.d.  Multiple monitor paths may
              be specified by separating them with a colon.  Non-absolute paths are taken to be relative to
              the base directory (/usr/lib/mon by default).

       -v     Print version information.

DEFINITIONS

       monitor
              A program which tests for a certain condition,  returns  either  true  or  false,  and  optionally
              produces  output to be passed back to the scheduler.  Common monitors detect host reachability via
              ICMP echo messages, or connection to TCP services.

       period A period in time as interpreted by the Time::Period module.

       alert  A program which sends a message when invoked by the scheduler.  The scheduler calls upon an  alert
              when  it  detects  a  failure  from  a  monitor.   An  alert program accepts a set of command-line
              arguments from the scheduler, in addition to data via standard input.

       hostgroup
              A single host or list of hosts, specified as names or IP addresses.

       service
              A collection of parameters used to deal with monitoring a particular resource which is provided by
              a  group.  Services are usually modeled after things such as an SMTP server, ICMP echo capability,
              server disk space availability, or SNMP events.

       view   A collection of hostgroups, used to filter mon output for client display.  For example, a
              'network-services' view might be defined so your network staff can see just the hostgroups
              which matter to them, without having to see all hostgroups defined in Mon.

       watch  A collection of services which apply to a particular group.

OPERATION

       When the mon scheduler starts, it reads a configuration file  to  determine  the  services  it  needs  to
       monitor.  The configuration file defaults to /etc/mon.cf, and can be specified using the -c parameter. If
       the -M option is specified, then the configuration file is pre-processed with m4.  If  the  configuration
       file ends with .m4, the file is also processed by m4 automatically.

       The  scheduler  enters  a loop which handles client connections, monitor invocations, and failure alerts.
       Each service has a timer, specified in the configuration file as the interval variable, which  tells  the
       scheduler how frequently to invoke a monitor process.  The scheduler may be temporarily stopped. While it
       is stopped, client access still functions, but no monitors are scheduled. This is useful when
       resetting the server, because you can save the hosts and services which are disabled, reset the
       server with the scheduler stopped, re-disable those hosts and services, then start the scheduler.
       It also allows making atomic changes across several client connections.  See the moncmd man page
       for more information.

MONITOR PROGRAMS

       Monitor processes are invoked with the arguments specified in the configuration  file,  appended  by  the
       hosts from the applicable host group. For example, if the watch group is "servers", which contains
       the hostnames "smtp", "nntp", and "ns", and the monitor line reads as follows,
        monitor fping.monitor -t 4000 -r 2
       then the executable "fping.monitor" will be executed with these parameters:
        MONITOR_DIR/fping.monitor -t 4000 -r 2 smtp nntp ns

       MONITOR_DIR is actually a search path, by default /usr/local/lib/mon/mon.d then  /usr/lib/mon/mon.d,  but
       it  can  be overridden by the -s option or in the configuration file.  If all hosts in the hostgroup have
       been disabled, then a warning is sent to syslog and  the  monitor  is  not  run.  This  behavior  may  be
       overridden  with  the "allow_empty_group" option in the service definition.  If the final argument to the
       "monitor" line is ";;" (it must be preceded by whitespace), then the host list will not  be  appended  to
       the parameter list.
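
       The argument assembly described above can be sketched as follows. This is only an illustration,
       not mon's own code: the function name build_monitor_argv is hypothetical, the line is split on
       plain whitespace (any quoting rules mon applies are not modeled), and MONITOR_DIR stands in for
       whichever directory of the search path holds the script.

```python
def build_monitor_argv(monitor_line, hosts, monitor_dir="MONITOR_DIR"):
    """Return the argument list for a "monitor" line and a host list."""
    words = monitor_line.split()
    # Accept either the full config line or just the part after "monitor".
    if words and words[0] == "monitor":
        words = words[1:]
    # A final ";;" (preceded by whitespace) suppresses the host list.
    append_hosts = True
    if words and words[-1] == ";;":
        words = words[:-1]
        append_hosts = False
    if not words:
        raise ValueError("monitor line names no program")
    argv = [f"{monitor_dir}/{words[0]}"] + words[1:]
    if append_hosts:
        argv.extend(hosts)
    return argv
```

       With the example above, build_monitor_argv("monitor fping.monitor -t 4000 -r 2", ["smtp",
       "nntp", "ns"]) yields the program path followed by "-t 4000 -r 2 smtp nntp ns" as separate
       arguments.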

       In addition to environment variables defined by the user in the service definition, mon passes
       certain variables to the monitor process.

       MON_LAST_SUMMARY
              The first line of the output from the last time the monitor exited.  This is not  the  summary  of
              the  current  monitor  run,  but the previous one.  This may be used by an alert script to provide
              historical context in an alert.

       MON_LAST_OUTPUT
              The entire output of the monitor from the last time it exited.  This is  not  the  output  of  the
              current  monitor  run,  but  the  previous  one.   This  may be used by an alert script to provide
              historical context in an alert.

       MON_LAST_FAILURE
              The time(2) of the last failure for this service.

       MON_FIRST_FAILURE
              The time(2) of the first time this service failed.

       MON_LAST_SUCCESS
              The time(2) of the last time this service passed.

       MON_DESCRIPTION
              The description of this service, as defined in the configuration file using the description tag.

       MON_DEPEND_STATUS
              The depend status, "0" if dependency failure, "1" otherwise.

       MON_LOGDIR
              The directory where log files should be placed, as indicated by the logdir global
              configuration variable.

       MON_STATEDIR
              The  directory where state files should be kept, as indicated by the statedir global configuration
              variable.

       MON_CFBASEDIR
              The directory where configuration files should be kept,  as  indicated  by  the  cfbasedir  global
              configuration variable.

       "fping.monitor"  should  return  an exit status of 0 if it completed successfully (found no problems), or
       nonzero if a problem was detected. The first line of  output  from  the  monitor  script  has  a  special
       meaning:  it  is  used  as  a brief summary of the exact failure which was detected, and is passed to the
       alert program. All remaining output is also  passed  to  the  alert  program,  but  it  has  no  required
       interpretation.
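
       A minimal monitor following this contract might look like the sketch below. It is an assumption-
       laden example rather than a monitor shipped with mon: the names check_tcp, report, and main are
       hypothetical, the default port 25 is arbitrary, and listing the failed hosts on the summary line
       is a convention chosen here, not a requirement stated above.

```python
import socket
import sys

def check_tcp(host, port=25, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def report(results):
    """Turn {host: ok} into (exit_status, output_text)."""
    failed = sorted(h for h, ok in results.items() if not ok)
    if not failed:
        return 0, ""            # exit status 0: no problems found
    summary = " ".join(failed)  # first line: brief summary of the failure
    detail = "\n".join(f"{h}: connection failed" for h in failed)
    return 1, f"{summary}\n{detail}\n"

def main(argv):
    # The hosts of the host group arrive as trailing arguments.
    status, output = report({h: check_tcp(h) for h in argv[1:]})
    sys.stdout.write(output)
    return status
```

       A real script would finish with sys.exit(main(sys.argv)) so that the exit status reaches the
       scheduler.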

       If a monitor for a particular service is still running, and the time comes for mon to run another monitor
       for that service, it will not start another monitor. For example, if the interval is 10s, and the monitor
       does  not  finish  running  within  10  seconds,  then mon will wait until the first monitor exits before
       running another one.

ALERT DECISION LOGIC

       Upon a non-zero or zero exit status, the associated alert or upalert program (respectively)  is  started,
       pending  the  following conditions: If an alert for a specific service is disabled, do not send an alert.
       If dep_behavior is set to 'a', or alertdepend is set, and a parent dependency is failing,  then  suppress
       the  alert.   If  the  alert  has  previously  been  acknowledged, do not send the alert, unless it is an
       upalert.  If an alert is not within the specified period, record the failure via  syslog(3)  and  do  not
       send  an alert.  If the failure does not fall within a defined period, do not send an alert.  No upalerts
       are sent without corresponding down alerts, unless no_comp_alerts is defined in the  period  section.  An
       upalert  will  only  be sent if the previous state is a failure.  If an alert was already sent within the
       last alertevery interval, do not send another alert, unless the summary output from the  current  monitor
       program  differs from the last monitor process.  Otherwise, send an alert using each alert program listed
       for that period. The observe_detail argument to alertevery affects this behavior by observing the changes
       in  the  detail part of the output in addition to the summary line.  If a monitor has successive failures
       and the summary output changes in each of them, alertevery will not suppress multiple consecutive alerts.
       The  reasoning  is  that  if  the  summary output changes, then a significant event occurred and the user
       should be alerted.  The "strict" argument to alertevery will suppress both comparing the output from  the
       previous  monitor  run to the current and prevent a successful return value of the monitor from resetting
       the alertevery timer. For example, "alertevery 24h strict" will only send out  an  alert  once  every  24
       hours, regardless of whether the monitor output changes, or if the service stops and then starts failing.
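
       The alertevery part of this logic can be sketched as a single decision function. This is a
       simplified reading of the paragraph above, not mon's implementation: the function and parameter
       names are hypothetical, times are plain seconds, and the second effect of "strict" (a successful
       run not resetting the timer) is left out because it depends on state kept elsewhere.

```python
def alertevery_allows(now, last_alert, alertevery, summary, last_summary,
                      detail="", last_detail="",
                      observe_detail=False, strict=False):
    """Return True if alertevery does not suppress the alert."""
    # No alert sent yet, or the alertevery interval has elapsed: send.
    if last_alert is None or now - last_alert >= alertevery:
        return True
    # "strict" disables comparison with the previous monitor output.
    if strict:
        return False
    # A changed summary line is treated as a significant event.
    if summary != last_summary:
        return True
    # observe_detail also compares the detail part of the output.
    if observe_detail and detail != last_detail:
        return True
    return False
```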

ALERT PROGRAMS

       Alert programs are found in the path supplied with the -a parameter, or in the default alert path
       (/usr/local/lib/mon/alert.d:alert.d, as described under the -a option) if not specified.  They are
       invoked with the following command-line parameters:

       -s service
              Service tag from the configuration file.

       -g group
              Host group name from the configuration file.

       -h hosts
              The expanded version of the host group, space delimited, but contained in one shell "word".

       -l alertevery
              The number of seconds until the next alarm will be sent.

       -O     This option is supplied to an alert only if the alert is being generated as a result of an
              expected trap timing out.

       -t time
              The time (in time(2) format) of when this failure condition was detected.

       -T     This option is supplied to an alert only if the alert was triggered by a trap

       -u     This option is supplied to an alert only if it is being called as an upalert.

       The  remaining  arguments  are supplied from the trailing parameters in the configuration file, after the
       "alert" service parameter.

       As with monitor programs, alert programs are invoked with environment variables defined by  the  user  in
       the service definition, in addition to the following which are explicitly set by the server:

       MON_LAST_SUMMARY
              The first line of the output from the last time the monitor exited.

       MON_LAST_OUTPUT
              The entire output of the monitor from the last time it exited.

       MON_LAST_FAILURE
              The time(2) of the last failure for this service.

       MON_FIRST_FAILURE
              The time(2) of the first time this service failed.

       MON_LAST_SUCCESS
              The time(2) of the last time this service passed.

       MON_DESCRIPTION
              The description of this service, as defined in the configuration file using the description tag.

       MON_GROUP
              The watch group which triggered this alarm

       MON_SERVICE
              The service heading which generated this alert

       MON_RETVAL
              The exit value of the failed monitor program, or return value as accepted from a trap.

       MON_OPSTATUS
              The operational status of the service.

       MON_ALERTTYPE
              Has  one  of  the  following  values:  "failure",  "up",  "startup", "trap", or "traptimeout", and
              signifies the type of alert which was triggered.

       MON_TRAP_INTENDED
              This is only set when an unknown mon trap is received and caught by the default/default
              watch/service. This contains colon-separated entries of the trap's intended watch group and
              service name.

       MON_LOGDIR
              The directory where log files should be placed, as indicated by the logdir global
              configuration variable.

       MON_STATEDIR
              The  directory where state files should be kept, as indicated by the statedir global configuration
              variable.

       MON_CFBASEDIR
              The directory where configuration files should be kept,  as  indicated  by  the  cfbasedir  global
              configuration variable.

       The  first  line from standard input must be used as a brief summary of the problem, normally supplied as
       the subject line of an email, or text sent to an alphanumeric pager.  Interpretation  of  all  subsequent
       lines  read  from stdin is left up to the alerting program. The usual parameters are a list of recipients
       to deliver the notification to.  The interpretation of the recipients is not specified, and is up to  the
       alert program.
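
       An alert program's handling of its arguments and standard input could be sketched as below. The
       option letters are those listed above; everything else (the function names, the attribute names,
       and treating the trailing arguments as recipients) is an assumption made for illustration.

```python
import argparse
import io

def parse_alert_args(argv):
    """Parse the options mon passes to an alert program."""
    # add_help=False because -h carries the host list here, not help.
    p = argparse.ArgumentParser(add_help=False)
    p.add_argument("-s", dest="service")
    p.add_argument("-g", dest="group")
    p.add_argument("-h", dest="hosts")
    p.add_argument("-l", dest="alertevery")
    p.add_argument("-t", dest="time")
    p.add_argument("-O", dest="trap_timeout", action="store_true")
    p.add_argument("-T", dest="trap", action="store_true")
    p.add_argument("-u", dest="upalert", action="store_true")
    p.add_argument("recipients", nargs="*")
    ns = p.parse_args(argv)
    # The host list arrives space delimited inside one shell word.
    ns.hosts = ns.hosts.split() if ns.hosts else []
    return ns

def read_alert_input(stream):
    """Return (summary, detail) from the alert's standard input."""
    summary = stream.readline().rstrip("\n")
    detail = stream.read()
    return summary, detail
```

       The summary would typically become the subject line of a mail message, with the detail as its
       body.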

CONFIGURATION FILE

       The  configuration  file  consists  of  zero  or more global variable definitions, zero or more hostgroup
       definitions, and one or more watch definitions. Each watch  definition  may  have  one  or  more  service
       definitions.  A  watch  definition  is  terminated by a blank line, another definition, or the end of the
       file. A line beginning with optional leading whitespace and a pound ("#") is regarded as a  comment,  and
       is ignored.

       Lines are parsed as they are read. Long lines may be continued by ending them with a backslash ("\").  If
       a line is continued, then the backslash, the trailing whitespace after the  backslash,  and  the  leading
       whitespace of the following line are removed. The end result is assembled into a single line.
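
       The continuation rule can be illustrated with a short sketch. The function name join_continued
       is hypothetical, and comment handling is deliberately left out; only the three removals named
       above (the backslash, whitespace after it, and leading whitespace of the next line) are modeled.

```python
import re

def join_continued(lines):
    """Join backslash-continued lines into single logical lines."""
    logical = []
    pending = None
    for raw in lines:
        line = raw.rstrip("\n")
        if pending is not None:
            # Drop the leading whitespace of a continuation line.
            line = pending + line.lstrip()
            pending = None
        match = re.search(r"\\\s*$", line)
        if match:
            # Drop the backslash and any whitespace after it.
            pending = line[:match.start()]
        else:
            logical.append(line)
    if pending is not None:
        logical.append(pending)
    return logical
```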

       Typically the configuration file has the following layout:

       1. Global variable definitions

       2. Hostgroup definitions

       3. Watch definitions

       See the "etc/example.cf" file which comes with the distribution for an example.

   Global Variables
       The  following  variables  may  be set to override compiled-in defaults. Command-line options will have a
       higher precedence than these definitions.

       alertdir = dir
              dir is the full path to the alert scripts. This is the value set by the -a command-line parameter.

              Multiple alert paths may be specified by separating them with a  colon.   Non-absolute  paths  are
              taken to be relative to the base directory (/usr/lib/mon by default).

              When  the  configuration file is read, all alerts referenced from the configuration will be looked
              up in each of these paths, and the full path to the first instance of the alert found is stored in
              a  hash. This hash is only generated upon startup or after a "reset" command, so newly added alert
              scripts will not be recognized until a "reset" is performed.

       mondir = dir
              dir is the full path to the monitor scripts. This value may also be set  by  the  -s  command-line
              parameter. If this path does not begin with a "/", it will be relative to basedir.

              Multiple monitor paths may be specified by separating them with a colon. All paths must be
              absolute.

              When the configuration file is read, all monitors referenced from the configuration will be looked
              up in each of these paths, and the full path to the first instance of the monitor found is  stored
              in  a  hash.  This  hash is only generated upon startup or after a "reset" command, so newly added
              monitor scripts will not be recognized until a "reset" is performed.

       statedir = dir
              dir is the full path to the state directory.  mon  uses  this  directory  to  save  various  state
              information. If this path does not begin with a "/", it will be relative to basedir.

       logdir = dir
              dir  is  the  full  path  to  the  log  directory.   mon uses this directory to save various logs,
              including the downtime log. If this path does not begin  with  a  "/",  it  will  be  relative  to
              basedir.

       basedir = dir
              dir is the full path for the state, log, monitor, and alert directories.

       cfbasedir = dir
              dir is the full path where all the config files can be found (monusers.cf, auth.cf, etc.).

       authfile = file
              file  is  the  path  to the authentication file. If the path does not begin with a "/", it will be
              relative to cfbasedir.

       authtype = type [type...]
              type is the type of authentication to use. A space-separated list of types may be specified,
              and they will be checked in the order they are listed.  As soon as a successful authentication is
              performed, the user is considered authenticated by mon for the duration of the session and no more
              authentication checks are performed.

              If  type is getpwnam, then the standard Unix passwd file authentication method will be used (calls
              getpwnam(3) on the user and compares the crypt(3)ed version of the password with what it gets from
              getpwnam). This will not work if shadow passwords are enabled on the system.

              If  type is userfile, then usernames and hashed passwords are read from userfile, which is defined
              via the userfile configuration variable.

              If type is pam, then PAM (pluggable authentication modules) will be used for authentication.   The
              service  specified  by  the  pamservice global will be used. If no global is given, the PAM passwd
              service will be used.

              If type is trustlocal, then if the client connection comes from localhost, the username passed
              from the client will be trusted, and the password will be ignored.  This can be used when you
              want the client to handle the authentication for you, for example a CGI script using one of
              the many Apache authentication methods.

       userfile = file
              This  file  is  used  when authtype is set to userfile.  It consists of a sequence of lines of the
              format 'username : password'.  password is stored as  the  hash  returned  by  the  standard  Unix
              crypt(3)  function.   NOTE:  the  format  of  this  file  is compatible with the Apache file based
              username/password file format. It is possible to use the htpasswd program supplied with Apache  to
              manage the mon userfile.

              Blank lines and lines beginning with # are ignored.
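
              Reading such a file could be sketched as follows. The function name parse_userfile is
              hypothetical, whitespace around the colon is tolerated as an assumption, and no password
              checking is attempted; the sketch only maps usernames to their stored hashes.

```python
def parse_userfile(text):
    """Return {username: hashed_password} from userfile contents."""
    users = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Blank lines and lines beginning with # are ignored.
        if not line or line.startswith("#"):
            continue
        name, sep, hashed = line.partition(":")
        if not sep:
            raise ValueError(f"missing ':' in line: {raw!r}")
        users[name.strip()] = hashed.strip()
    return users
```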

       pamservice = service
              The  PAM  service  used  for  authentication.  This  is applicable only if "pam" is specified as a
              parameter to the authtype setting. If this global is not defined, it defaults to passwd.

       serverbind = addr

       trapbind = addr

              serverbind and trapbind specify which address to bind the server and trap ports to,  respectively.
              If  these  are  not  defined,  the  default address is INADDR_ANY, which allows connections on all
              interfaces. For security reasons, it could be a good idea to bind only to the loopback interface.

       dtlogfile = file
              file is a file which will be used to record the downtime log. Whenever a service fails for some
              amount of time and then stops failing, this event is written to the log. If this parameter is not
              set, no logging is done. The format of the file is as follows (# is a comment and may be ignored):

              timenoticed group service firstfail downtime interval summary.

              timenoticed is the time(2) the service came back up.

              group service is the group and service which failed.

              firstfail is the time(2) when the service began to fail.

              downtime is the number of seconds the service failed.

              interval is the frequency (in seconds) that the service is polled.

              summary is the summary line from when the service was failing.
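
              A reader for this format might be sketched as below. The names DowntimeEntry and
              parse_dtlog_line are hypothetical, fields are assumed to be separated by whitespace, and
              the numeric fields are read as floats because the text above does not say whether they
              are always integers.

```python
from typing import NamedTuple

class DowntimeEntry(NamedTuple):
    timenoticed: float
    group: str
    service: str
    firstfail: float
    downtime: float
    interval: float
    summary: str

def parse_dtlog_line(line):
    """Parse one downtime log line; return None for comments and blanks."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    # The summary is the remainder of the line and may contain spaces.
    fields = line.split(None, 6)
    if len(fields) < 6:
        raise ValueError(f"too few fields: {line!r}")
    summary = fields[6] if len(fields) == 7 else ""
    return DowntimeEntry(float(fields[0]), fields[1], fields[2],
                         float(fields[3]), float(fields[4]),
                         float(fields[5]), summary)
```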

       monerrfile = filename
              By default, when mon daemonizes itself, it connects stdout and stderr to /dev/null. If  monerrfile
              is  set  to  a  file,  then stdout and stderr will be appended to that file. In all cases stdin is
              connected to /dev/null. If mon is told to run in the foreground and to not daemonize, then none of
              this  applies,  since  stdin/stdout/stderr  stay  connected  to  whatever they were at the time of
              invocation.

       dtlogging = yes/no

              Turns downtime logging on or off. The default is off.

       histlength = num
              num is the maximum number of events to be retained in the history list. The default is 100.
              This value may also be set by the -k command-line parameter.

       historicfile = file
              If  this  variable  is set, then alerts are logged to file, and upon startup, some (or all) of the
              past history is read into memory.

       historictime = timeval
              timeval is the amount of the history file to read upon startup.  History from ("now" - timeval)
              up to the present is read. See the explanation of interval in the "Service Definitions"
              section for a description of timeval.

       serverport = port
              port  is  the TCP port number that the server should bind to. This value may also be set by the -p
              command-line parameter. Normally this port is looked up via getservbyname(3), and it  defaults  to
              2583.

       trapport = port
              port  is the UDP port number that the trap server should bind to.  Normally this port is looked up
              via getservbyname(3), and it defaults to 2583.

       pidfile = path
              path is the file the server will store its pid in.  This value may also be set by the -P
              command-line parameter.

       maxprocs = num
              Throttles  the  number of concurrently forked processes to num.  The intent is to provide a safety
              net for the unlikely situation when the server tries to take on too many tasks at once.  Note that
              this  situation  has only been reported to happen when trying to use a garbled configuration file!
              You don't want to use a garbled configuration file now, do you?

       cltimeout = secs
              Sets the client inactivity timeout to secs.  This is  meant  to  help  thwart  denial  of  service
              attacks or recover from crashed clients.  secs is interpreted as a "1h/1m/1s" string, where "1m" =
              60 seconds.

       randstart = interval
              When the server starts, normally no service is scheduled until the interval defined in the
              respective service section has elapsed.  This can cause long delays before the first check of a service,
              and possibly a high load on the server if multiple things are scheduled  at  the  same  intervals.
              This  option  is  used  to  randomize the scheduling of the first test for all services during the
              startup period, and immediately after the reset command.  If randstart is defined,  the  scheduled
              run  time  of  all services of all watch groups will be a random number between zero and randstart
              seconds.

       dep_recur_limit = depth
              Limit dependency recursion level to depth.  If dependency recursion (dependencies which depend
              on other dependencies) tries to go beyond depth, then the recursion is aborted and a message
              is logged to syslog.  The default limit is 10.

       dep_behavior = {a|m|hm}
              dep_behavior controls whether the dependency expression suppresses one of: the running of  alerts,
              the  running of monitors, or the passing of individual hosts to the monitors.  Read more about the
              behavior in the "Service Definitions" section below.

              This is a global setting which controls the default settings for the service-specified variable.

       dep_memory = timeval
              If set, dep_memory will cause dependencies to continue to prevent alerts/monitoring for  a  period
              of  time  after  the  service  returns  to a normal state.  This can be used to prevent over-eager
              alerting when a machine is rebooting, for  example.   See  the  explanation  of  interval  in  the
              "Service Definitions" section for a description of timeval.

              This is a global setting which controls the default settings for the service-specified variable.

       syslog_facility = facility
              Specifies the syslog facility used for logging.  daemon is the default.

       startupalerts_on_reset = {yes|no}

              If  set  to  "yes",  startupalerts  will be invoked when the reset client command is executed. The
              default is "no".

       monremote = program

              If set, this external program will be called by Mon when various client  requests  are  processed.
              This  can  be used to propagate those changes from one Mon server to another, if you have multiple
              monitoring machines.  An example script, monremote.pl is available in the clients directory.

   Hostgroup Entries
       Hostgroup entries begin with the keyword hostgroup, and are followed by a hostgroup tag and one  or  more
       hostnames  or  IP  addresses, separated by whitespace. The hostgroup tag must be composed of alphanumeric
       characters, a dash ("-"), a period ("."), or an underscore ("_"). Non-blank  lines  following  the  first
       hostgroup  line  are interpreted as more hostnames.  The hostgroup definition ends with a blank line. For
       example:

              hostgroup servers nameserver smtpserver nntpserver
                   nfsserver httpserver smbserver

              hostgroup router_group cisco7000 agsplus
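
       A sketch of reading hostgroup entries under the rules above follows. It is illustrative only:
       parse_hostgroups is a hypothetical name, comment lines are skipped, and a line starting with
       "watch" or "view" is assumed to end the current hostgroup, since another definition also
       terminates an entry.

```python
import re

_TAG = re.compile(r"^[A-Za-z0-9._-]+$")

def parse_hostgroups(text):
    """Return {hostgroup_tag: [hosts]} from configuration text."""
    groups = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            current = None          # a blank line ends the definition
            continue
        if line.startswith("#"):
            continue
        words = line.split()
        if words[0] == "hostgroup":
            if len(words) < 2 or not _TAG.match(words[1]):
                raise ValueError(f"bad hostgroup line: {raw!r}")
            current = words[1]
            groups[current] = words[2:]
        elif words[0] in ("watch", "view"):
            current = None          # another definition starts here
        elif current is not None:
            groups[current].extend(words)   # continuation: more hosts
    return groups
```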

   View Entries
       View entries begin with the keyword view, and are followed by a view tag and the names  of  one  or  more
       hostgroups.   The  view tag must be composed of alphanumeric characters, a dash ("-"), a period ("."), or
       an underscore ("_"). Non-blank lines following the first view line  are  interpreted  as  more  hostgroup
       names.  The view definition ends with a blank line. For example:

              view servers dns-servers web-servers file-servers
                   mail-servers

              view network-services routers switches vpn-servers

   Watch Group Entries
       Watch  entries  begin with a line that starts with the keyword watch, followed by whitespace and a single
       word which normally refers to a pre-defined hostgroup.  If  the  second  word  is  not  recognized  as  a
       hostgroup tag, a new hostgroup is created whose tag is that word, and that word is its only member.

       Watch entries consist of one or more service definitions.

       A watch group is terminated by a blank line, the end of the file, or by a subsequent definition, "watch",
       "hostgroup", or otherwise.

       There may be a special watch group entry called "default". If a default watch group  is  defined  with  a
       service  entry  named  "default",  then  this  definition  will be used in handling traps received for an
       unrecognized watch and service.

   Service Definitions
       service servicename
              A service definition begins with the keyword service followed by a word which is the tag for
              this service.  This word must be unique among all services defined for the same watch group.

              The  components of a service are an interval, monitor, and one or more time period definitions, as
              defined below.

              If a service name of "default" is defined within a watch group called "default" (see above),
              then the default/default definition will be used for handling unknown mon traps.

              The following configuration parameters are valid only following a service definition:

       VARIABLE=value
              Environment  variables  may be defined for each service, which will be included in the environment
              of monitors and alerts. Variables must be specified in all capital letters,  must  begin  with  an
              alphabetical  character  or  an  underscore,  and there must be no spaces to the left of the equal
              sign.

       interval timeval
              The keyword interval followed by a time value specifies the frequency that a monitor  script  will
              be  triggered.   Time  values  are  defined  as  "30s", "5m", "1h", or "1d", meaning 30 seconds, 5
              minutes, 1 hour, or 1 day. The numeric portion may be a fraction, such as "1.5h" for an hour and a
              half. This format of a time specification will be referred to as timeval.

       failure_interval timeval
              Adjusts  the polling interval to timeval when the service check is failing. Resets the interval to
              the original when the service succeeds.
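
              A minimal sketch, assuming a monitor named fping.monitor is installed and using a hypothetical
              service name: the service below is polled every 5 minutes while it succeeds and every minute
              while it is failing.

                     service ping
                          interval 5m
                          failure_interval 1m
                          monitor fping.monitor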

       traptimeout timeval
              This keyword takes the same time specification argument as interval, and makes the service  expect
              a  trap  from  an  external source at least that often, else a failure will be registered. This is
              used for a heartbeat-style service.
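
              For example, a heartbeat service might look like the following sketch. The watch and service
              names, the description, and the mail address are placeholders. A failure would be registered
              if no trap arrives for this service within 10 minutes.

                     watch trap-service
                          service heartbeat
                               description TRAP: heartbeat from host1
                               traptimeout 10m
                               period wd {Sun-Sat}
                                    alert mail.alert someone@your.org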

       trapduration timeval
              If a trap is received, the status of the service the trap was delivered to  will  normally  remain
              constant.  If  trapduration is specified, the status of the service will remain in a failure state
              for the duration specified by timeval, and then it will be reset to "success".

       randskew timeval
              Rather than schedule the monitor script to run at the start of each interval, randomly adjust  the
              interval specified by the interval parameter by plus-or-minus randskew.  The skew value is
              specified in the same way as the interval parameter: "30s", "5m", etc.  For example, if interval is 1m and
              randskew is "5s", then mon will schedule the monitor script some time between every 55 seconds and
              65 seconds.  The intent is to help distribute the load  on  the  server  when  many  services  are
              scheduled at the same intervals.
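
              The case just described could be written as follows (the service name is hypothetical and
              fping.monitor is assumed to be installed):

                     service ping
                          interval 1m
                          randskew 5s
                          monitor fping.monitor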

       monitor monitor-name [arg...]
              The  keyword monitor followed by a script name and arguments specifies the monitor to run when the
              timer expires. Shell-like quoting conventions are followed when specifying the arguments  to  send
              to  the  monitor script.  The script is invoked from the directory given with the -s argument, and
              all following words are supplied as arguments to the monitor program,  followed  by  the  list  of
              hosts  in the group referred to by the current watch group.  If the monitor line ends with ";;" as
              a separate word, the host groups are not appended  to  the  argument  list  when  the  program  is
              invoked.
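
              A sketch of both forms, with hypothetical script names and a hypothetical --quiet argument.
              The first monitor is invoked with the hosts of the watch group appended to its argument list.
              The second, whose line ends in ";;", is invoked with --quiet only.

                     service http
                          interval 10m
                          monitor http.monitor
                     service local
                          interval 10m
                          monitor local-check.monitor --quiet ;;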

       allow_empty_group
              The  allow_empty_group  option will allow a monitor to be invoked even when the hostgroup for that
              watch is empty because of disabled hosts. The default behavior is not to invoke the  monitor  when
              all hosts in a hostgroup have been disabled.

       description descriptiontext
              The text following description can be queried by client programs and is passed to alerts and monitors via an
              environment variable. It should contain a brief description of the service, suitable for inclusion
              in an email or on a web page.

       exclude_hosts host [host...]
              Any hosts listed after exclude_hosts will be excluded from the service check.

       exclude_period periodspec
              Do not run a scheduled monitor during the time identified by periodspec.

       depend dependexpression
              The  depend  keyword is used to specify a dependency expression, which evaluates to either true or
              false, in the boolean sense.   Dependencies  are  actual  Perl  expressions,  and  must  obey  all
              syntactical  rules.  The  expressions  are  evaluated  in  their  own  package  space so as to not
              accidentally have some unwanted side-effect.  If a syntax  error  is  found  when  evaluating  the
              expression, it is logged via syslog.

              Before  evaluation,  the  following substitutions on the expression occur: phrases which look like
              "group:service" are substituted with the value of the current operational status of that specified
              service.  These  opstatus  substitutions  are  computed  recursively, so if service A depends upon
              service B, and service B depends upon service C, then service A depends upon service C. Successful
              operational  statuses  (which  evaluate to "1") are "STAT_OK", "STAT_COLDSTART", "STAT_WARMSTART",
              and  "STAT_UNKNOWN".   The  word  "SELF"  (in  all  caps)  can  be  used  for  the   group   (e.g.
              "SELF:service"), and is an abbreviation for the current watch group.

              This  feature  can  be  used to control alerts for services which are dependent on other services,
              e.g. an SMTP test which is dependent upon the machine being ping-reachable.
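
              The SMTP case could be sketched as below. The names are assumptions: "mail-servers" and
              "routers" are taken to be watch groups that each define a "ping" service, and fping.monitor
              and smtp.monitor are taken to be installed. Since the expression is Perl, the two statuses
              are combined with "&&", so the dependency holds only when both ping services are successful.

                     watch mail-servers
                          service ping
                               interval 5m
                               monitor fping.monitor
                          service smtp
                               interval 10m
                               monitor smtp.monitor
                               depend SELF:ping && routers:ping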

       dep_behavior {a|m|hm}
              The evaluation of the  dependency  graphs  specified  via  the  depend  keyword  can  control  the
              suppression  of alert or monitor invocations, or the suppression of individual hosts passed to the
              monitor.

              Alert suppression.  If this option is set to "a", then the dependency expression will be evaluated
              after  the  monitor for the service exits or after a trap is received.  An alert will only be sent
              if the evaluation succeeds, meaning that none of  the  nodes  in  the  dependency  graph  indicate
              failure.

              Monitor suppression.  If it is set to "m", then the dependency expression will be evaluated before
              the monitor for the service is about to run.  If the evaluation succeeds, then the monitor will be
              run. Otherwise, the monitor will not be run and the status of the service will remain the same.

              Host  suppression.   If it is set to "hm" then Mon will extract the list of "parent" services from
              the dependency expression.  (In fact the expression can be just a list of services.) Then when the
              monitor for the service is about to be run, for each host in the current hostgroup Mon will search
              all the parent services which are currently failing and look  for  the  hostname  in  the  current
              summary  output.   If  the  hostname  is  found,  this  host will be excluded from this run of the
              monitor.  This can be used to e.g. allow an SMTP test on a group of hosts to  still  be  run  even
              when  a  single  host  is  not ping-reachable.  If all the rest of the hosts are working fine, the
              service will be in an OK state, but if another host fails the SMTP test Mon can still alert  about
              that  host  even  though the parent dependency was failing.  The dependency expression will not be
              used recursively in this case.

       alertdepend dependexpression

       monitordepend dependexpression

       hostdepend dependexpression
              These keywords allow you to specify multiple dependency expressions of different types.  Each  one
              corresponds  to  the  different  dep_behavior  settings  listed  above.   They  will  be evaluated
              independently in the different  contexts  as  listed  above.   If  depend  is  present,  it  takes
              precedence over the matching keyword, depending on the dep_behavior setting.
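
              As a sketch only (the group and service names are hypothetical), a service could suppress its
              monitor while a router check is failing and, separately, drop individual hosts that are not
              ping-reachable:

                     service smtp
                          interval 10m
                          monitor smtp.monitor
                          monitordepend routers:ping
                          hostdepend SELF:ping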

       dep_memory timeval
              If  set,  dep_memory will cause dependencies to continue to prevent alerts/monitoring for a period
              of time after the service returns to a normal state.  This  can  be  used  to  prevent  over-eager
              alerting  when  a  machine  is  rebooting,  for  example.   See the explanation of interval in the
              "Service Definitions" section for a description of timeval.

       redistribute alert [arg...]
              A service may have one redistribute option, which is a special form of an alert definition.
              This  alert will be called on every service status update, even sequential success status updates.
              This can be used to integrate Mon with another monitoring system, or to link together multiple Mon
              servers  via an alert script that generates Mon traps.  See the "ALERT PROGRAMS" section above for
              a list of the parameters mon will pass automatically to alert programs.

       unack_summary
              Remove the "acknowledged" state from a service if the summary component  of  the  failure  message
              changes.   In  most  common usage the summary is the list of hosts that are failing, so additional
              hosts failing would remove an ack.

   Period Definitions
       Periods are used to define the conditions which should allow alerts to be delivered.

       period [label:] periodspec
              A period groups one or more alarms and variables which control how often  an  alert  happens  when
              there  is  a failure.  The period definition has two forms. The first takes an argument which is a
              period  specification  from  Patrick  Ryan's  Time::Period  Perl  5  module.  Refer  to   "perldoc
              Time::Period" for more information.

              The  second  form requires a label followed by a period specification, as defined above. The label
              is a  tag  consisting  of  an  alphabetic  character  or  underscore  followed  by  zero  or  more
              alphanumerics  or  underscores and ending with a colon. This form allows multiple periods with the
              same period definition. One use is to  have  a  period  definition  which  has  no  alertafter  or
              alertevery  parameters  for  a particular time period, and another for the same time period with a
              different set of alerts that does contain those parameters.

              Period definitions, in either the first or  second  form,  must  be  unique  within  each  service
              definition.  For  example,  if you need to define two periods both for "wd {Sun-Sat}", then one or
              both of the period definitions must specify a label such as "period t1: wd {Sun-Sat}" and  "period
              t2: wd {Sun-Sat}".
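
              For instance, the following sketch defines two labeled periods covering the same time span.
              The service name, labels, and mail addresses are placeholders. The first period alerts on
              every failure, while the second applies alertafter and alertevery to a different recipient.

                     service ping
                          interval 5m
                          monitor fping.monitor
                          period t1: wd {Sun-Sat}
                               alert mail.alert someone@your.org
                          period t2: wd {Sun-Sat}
                               alertafter 3
                               alertevery 1h
                               alert mail.alert oncall@your.org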

       alertevery timeval [observe_detail | strict]
              The  alertevery  keyword  (within  a  period  definition)  takes  the same type of argument as the
              interval variable, and limits the number of times an alert is sent when the service  continues  to
              fail.  For example, if the timeval is "1h", then the alerts in the period section will be
              triggered at most once every hour. If the alertevery keyword is omitted in a period entry, an alert
              will  be  sent  out  every  time  a  failure is detected. By default, if the summary output of two
              successive failures changes, then the alertevery interval is overridden,  and  an  alert  will  be
              sent.   If  the  string  "observe_detail"  is  the last argument, then both the summary and detail
              output lines will be considered when comparing the output of successive failures.  If  the  string
              "strict"  is  the last argument, then the output of the monitor or the state change of the service
              will have no effect on when alerts are sent. That is, "alertevery 24h strict" will send  only  one
              alert  every  24  hours,  no  matter what.  Please refer to the ALERT DECISION LOGIC section for a
              detailed explanation of how alerts are suppressed.
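
              A sketch of the default and strict forms, using a hypothetical service and placeholder mail
              address. On weekdays an alert is sent at most hourly unless the summary output changes. On
              weekends at most one alert is sent every 24 hours regardless of output.

                     service ping
                          interval 5m
                          monitor fping.monitor
                          period wd {Mon-Fri}
                               alertevery 1h
                               alert mail.alert someone@your.org
                          period wd {Sat Sun}
                               alertevery 24h strict
                               alert mail.alert someone@your.org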

       alertafter num

       alertafter num timeval

       alertafter timeval
              The alertafter keyword (within a period section) has three forms: only with the "num" argument, or
              with  the  "num  timeval"  arguments,  or only with the "timeval" argument.  In the first form, an
              alert will only be invoked after "num" consecutive failures.

              In the second form, the arguments are a positive integer followed by an interval, as described  by
              the  interval  variable above.  If these parameters are specified, then the alerts for that period
              will only be called after that  many  failures  happen  within  that  interval.  For  example,  if
              alertafter  is  given  the  arguments  "3 30m", then the alert will be called if 3 failures happen
              within 30 minutes.

              In the third form, the argument is an interval, as  described  by  the  interval  variable  above.
              Alerts  for  that  period  will only be called if the service has been in a failure state for more
              than the length of time described by the interval, regardless of the  number  of  failures  noticed
              within that interval.
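
              The three forms side by side, as a sketch with made-up labels and a placeholder address.
              Period "n" alerts after 3 consecutive failures, "nt" after 3 failures within 30 minutes, and
              "t" once the service has been failing for more than 15 minutes.

                     service ping
                          interval 5m
                          monitor fping.monitor
                          period n: wd {Sun-Sat}
                               alertafter 3
                               alert mail.alert someone@your.org
                          period nt: wd {Sun-Sat}
                               alertafter 3 30m
                               alert mail.alert someone@your.org
                          period t: wd {Sun-Sat}
                               alertafter 15m
                               alert mail.alert someone@your.org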

       numalerts num
              This variable tells the server to call no more than num alerts during a failure. The alert counter
              is kept on a per-period basis, and is reset upon each success.

       no_comp_alerts
              If this option is specified, then upalerts will be called whenever the service state changes  from
              failure to success, rather than only after a corresponding "down" alert.

       alert alert [arg...]
              A period may contain multiple alerts, which are triggered upon failure of the service. An alert is
              specified with the alert keyword, followed by an optional exit parameter, and arguments which  are
              interpreted the same as the monitor definition, but without the ";;" exception. The exit parameter
              takes the form of exit=x or exit=x-y and has the effect that the alert is only called if the  exit
              status  of  the  monitor script falls within the range of the exit parameter. If, for example, the
              alert line is "alert exit=10-20 mail.alert mis", then mail.alert will only be invoked with mis as its
              argument if the monitor program's exit value is between 10 and 20. This feature allows you to
              trigger different alerts at different severity levels (like when free disk space goes from  8%  to
              3%).

              See  the  ALERT PROGRAMS section above for a list of the parameters mon will pass automatically to
              alert programs.
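
              As a sketch of severity levels, assuming a monitor that reports severity through its exit
              status (the exit ranges and recipients here are illustrative only):

                     period wd {Sun-Sat}
                          alert exit=1-9 mail.alert mis
                          alert exit=10-20 mail.alert oncall@your.org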

       upalert alert [arg...]
              An upalert is the complement of an alert.  An upalert is called when a service makes the state
              transition  from  failure  to  success,  if  a corresponding "down" alert was previously sent. The
              upalert script is called supplying the same parameters as the alert script, with the  addition  of
              the  -u  parameter  which is simply used to let an alert script know that it is being called as an
              upalert. Multiple upalerts may be specified  for  each  period  definition.   Set  the  per-period
              no_comp_alerts option to send an upalert regardless of whether a "down" alert was sent.

       startupalert alert [arg...]
              A  startupalert is only called when the mon server starts execution, or when a "reset" command was
              issued to the server, depending on the setting of the startupalerts_on_reset global.  Unlike other
              alerts,  startupalerts  are  not  called  following the exit of a monitor, i.e. they are called in
              their own right, therefore the "exit=" argument is not applicable to startupalert.

       upalertafter timeval
              The upalertafter parameter is specified as a string  that  follows  the  syntax  of  the  interval
              parameter ("30s", "1m", etc.), and controls the triggering of an upalert.  If a service comes back
              up after being down for a time greater than or equal to the value of this option, an upalert  will
              be called. Use this option to prevent upalerts from being called because of "blips" (brief outages).
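
              For example, in the sketch below (placeholder address), an upalert is sent only when the
              service recovers after having been down for at least 5 minutes:

                     period wd {Sun-Sat}
                          alert mail.alert someone@your.org
                          upalert mail.alert someone@your.org
                          upalertafter 5m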

AUTHENTICATION CONFIGURATION FILE

       The  file  specified  by the authfile variable in the configuration file (or passed via the -A parameter)
       will be loaded upon startup.  This file defines restrictions upon which client commands may  be  executed
       by  which  users.  It  is  a  text  file  which  consists  of  comments,  command  definitions,  and trap
       authentication parameters.  A comment line begins with optional whitespace followed by a pound sign (#).  Blank
       lines are ignored.

       The  file is separated into a command section and a trap section. Sections are specified by a single line
       containing one of the following statements:

                   command section

       or

                   trap section

       Lines following one of the above statements apply to that section until either the end  of  the  file  or
       another section begins.

       A  command  definition  consists of a command, followed by a colon, followed by a comma-separated list of
       users who may execute the command.  The default is that no users may execute any commands unless they are
       explicitly  allowed  in  this configuration file. For clarity, a user can be denied by prefixing the user
       name with "!". If the word "AUTH_ANY" is used for a username, then any authenticated user will be allowed
       to  execute  the  command. If the word "all" is used for a username, then that command may be executed by
       any user, authenticated or not.
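
       A sketch of these forms, in which the user name "guest" is hypothetical: "list" is open to everyone,
       "reset" to any authenticated user, and "savestate" to "root" with "guest" explicitly denied.

              command section
              list:          all
              reset:         AUTH_ANY
              savestate:     root,!guest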

       The trap section allows configuration of which users may send traps from which hosts.  The  syntax  is  a
       source  host  (name or ip address), whitespace, a username, whitespace, and a plaintext password for that
       user. If the source host is "*", then allow traps from any host. If the  username  is  "*",  then  accept
       traps without regard for the username or password. If no hosts or users are specified, then no traps will
       be accepted.
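
       For illustration, with a made-up address, user names, and passwords: the first line below accepts
       traps from one host for user "trapuser", and the second accepts traps for user "hbuser" from any
       host.

              trap section
              192.168.1.10   trapuser   s3cretpw
              *              hbuser     0therpw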

       An example configuration file:

              command section
              list:          all
              reset:         root,admin
              loadstate:          root
              savestate:          root

              trap section
              127.0.0.1 root r@@tp4sswrd

       This means that all clients are able to perform the "list" command, "root" is able to perform "reset",
       "loadstate", and "savestate", and "admin" is able to execute the "reset" command.  Traps are accepted
       only from 127.0.0.1 when sent as user "root" with the given password.

CLIENT-SERVER INTERFACE

       The  server  listens  on  TCP port 2583, which may be overridden using the -p port option. Commands are a
       single line each, terminated by a newline.  The server can  handle  any  number  of  simultaneous  client
       connections.

CLIENT INTERFACE COMMANDS

       See manual page for moncmd.

MON TRAPPING

       Mon has the facility to receive special "mon traps" from any local or remote machine. Currently, the only
       available method for sending mon traps is through the Mon::Client Perl interface, though the UDP  packet
       format is defined well enough to permit the writing of traps in other languages.

       Traps  are  handled  similarly  to  monitors:  a  trap  sends  an  operational  status, summary line, and
       description text, and mon generates an alert or upalert as necessary.

       Traps can be caught by any watch/service group set up in the mon configuration file; however, it is
       suggested  that you configure watch/service groups specifically for the traps you expect to receive. When
       defining a special watch/service group for traps, do not include a "monitor"  directive  (as  no  monitor
       need  be  invoked). Since a monitor is not being invoked, it is not necessary for the watch definition to
       have a hostgroup which contains real host names.  Just make up a useful name, and mon will  automatically
       create the watch group for you.

       Here is a simple config file example:

              watch trap-service
                   service host1-disks
                        description TRAP: for host1 disk status
                        period wd {Sun-Sat}
                             alert mail.alert someone@your.org
                             upalert mail.alert -u someone@your.org

       Since  mon  listens  on  a  UDP  port for any trap, a default facility is available for handling traps to
       unknown groups or services.  To enable this facility, you must include a "default"  watch  group  with  a
       "default" service entry containing the specifics of alarms.  If a default/default watch group and service
       are not configured, then unknown traps  get  logged  via  syslog,  and  no  alarm  is  sent.   NOTE:  The
       default/default  facility  is  a single entity as far as accounting and alarming go. Alarm programs which
       are not aware of this fact may send confusing information when a failure trap  comes  from  one  machine,
       followed  by  a  success  (ok)  trap  from  a  different  machine.  See  the  alarm  environment variable
       MON_TRAP_INTENDED above for a possible way around this. It is intended that default/default be used as  a
       facility  to  catch  unknown  traps,  and  should  not  be relied upon to catch all traps in a production
       environment. If you are lazy and only want to use default/default for catching all  traps,  it  would  be
       best to disable upalerts, and use the MON_TRAP_INTENDED environment variable in alert scripts to make the
       alerts more meaningful to you.

       Here is an example default facility:

              watch default
                   service default
                        description Default trap service
                        period wd {Sun-Sat}
                             alert mail.alert someone@your.org
                             upalert mail.alert -u someone@your.org

EXAMPLES

       The mon distribution comes with an example configuration called example.cf.  Refer to that file for  more
       information.

SEE ALSO

       moncmd(1), Time::Period(3pm), Mon::Client(3pm)

HISTORY

       mon  was  written because I couldn't find anything out there that did just what I needed, and nothing was
       worth modifying to add the features I wanted. It doesn't have a cool name, and that bothers me because  I
       couldn't think of one.

BUGS

       Report bugs to the email address below.

AUTHOR

       Jim Trocki <trockij@arctic.org>