Provided by: mon_1.2.0-9_i386

NAME

       mon - monitor services for availability, sending alarms upon failures.

SYNOPSIS

       mon [-dfhlMSv] [-a dir] [-A authfile] [-b dir] [-B dir] [-c config] [-D
       dir] [-i secs] [-k num] [-l [statetype]] [-L dir] [-m num] [-p num] [-P
       pidfile] [-r delay] [-s dir]

DESCRIPTION

       mon  is a general-purpose scheduler for monitoring service availability
       and triggering alerts upon detecting failures.  mon was designed to  be
       open  in the sense that it supports arbitrary monitoring facilities and
       alert methods via a common  interface,  which  are  easily  implemented
       through programs (in C, Perl, shell, etc.), SNMP traps, and special Mon
       (UDP packet) traps.

OPTIONS

       -a dir Path       to       alert       scripts.       Default        is
              /usr/local/lib/mon/alert.d:alert.d.  Multiple alert paths may be
              specified by separating them with a colon.   Non-absolute  paths
              are  taken to be relative to the base directory (/usr/lib/mon by
              default).

       -b dir Base directory for mon. scriptdir, alertdir,  and  statedir  are
              all relative to this directory unless specified from /.  Default
              is /usr/lib/mon.

       -B dir Configuration file base directory. All config files are  located
              here, including mon.cf, monusers.cf, and auth.cf.

       -A authfile
              Authentication   configuration   file.   By   default   this  is
              /etc/mon/auth.cf  if   the   /etc/mon   directory   exists,   or
              /usr/lib/mon/auth.cf otherwise.

       -c file
              Read configuration from file.  This defaults to /etc/mon/mon.cf
              if the /etc/mon directory exists, otherwise to /etc/mon.cf.

       -d     Enable debugging mode.

       -D dir Path   to   state   directory.    Default   is   the   first  of
              /var/state/mon,  /var/lib/mon,  and  /usr/lib/mon/state.d  which
              exists.

       -f     Fork  and  run as a daemon process. This is the preferred way to
              run mon.

       -h     Print help information.

       -i secs
              Sleep interval, in seconds. Defaults to 1. This  shouldn't  need
              to be adjusted for any reason.

       -k num Set log history to a maximum of num entries. Defaults to 100.

       -l statetype
              Load  state  from the last saved state file. The supported saved
              state types are disabled for  disabled  watches,  services,  and
              hosts,  opstatus  for  failure/alert/ack status of all services,
              and all for both.  If no  statetype  is  provided,  disabled  is
              assumed.

       -L dir Sets  the  log  dir.  See also logdir in the configuration file.
              The default is /var/log/mon if that directory exists,  otherwise
              log.d in the base directory.

       -M     Pre-process  the  configuration  file  with  the macro expansion
              package m4.

       -m num Set the throttle for the maximum number of processes to num.

       -p num Make server listen on port num.  This defaults to 2583.

       -S     Start with the scheduler stopped.

       -P pidfile
              Store the server's pid in pidfile.  The default is the first of
              /var/run/mon/mon.pid, /var/run/mon.pid, and /etc/mon.pid whose
              directory exists.  An empty value tells mon not to use a pid
              file.

       -r delay
              Sets  the  number of seconds used to randomize the startup delay
              before each service is scheduled. Refer to the global  randstart
              variable in the configuration file.

       -s dir Path       to       monitor       scripts.       Default      is
              /usr/local/lib/mon/mon.d:mon.d.  Multiple monitor paths may be
              specified by separating them with a colon.  Non-absolute paths
              are taken to be relative to the base directory (/usr/lib/mon  by
              default).

       -v     Print version information.

DEFINITIONS

       monitor
              A  program  which  tests for a certain condition, returns either
              true or false, and optionally produces output to be passed  back
              to  the scheduler.  Common monitors detect host reachability via
              ICMP echo messages, or connection to TCP services.

       period A period in time as interpreted by the Time::Period module.

       alert  A program which sends a message when invoked by  the  scheduler.
              The scheduler calls upon an alert when it detects a failure from
              a monitor.  An alert  program  accepts  a  set  of  command-line
              arguments  from  the scheduler, in addition to data via standard
              input.

       hostgroup
              A single host or  list  of  hosts,  specified  as  names  or  IP
              addresses.

       service
              A  collection  of  parameters  used  to  deal  with monitoring a
              particular resource which is provided by a group.  Services  are
              usually  modeled  after things such as an SMTP server, ICMP echo
              capability, server disk space availability, or SNMP events.

       view   A collection of hostgroups, used to filter mon output for client
              display.  For example, a 'network-services' view might be
              defined so your network staff can see just the hostgroups which
              matter to them, without having to see all hostgroups defined in
              Mon.

       watch  A collection of services which apply to a particular group.

OPERATION

       When the mon scheduler starts, it reads a configuration file to
       determine the services it needs to monitor.  The configuration file
       defaults to /etc/mon/mon.cf if the /etc/mon directory exists,
       otherwise to /etc/mon.cf, and can be specified using the -c parameter.
       If the -M option is specified, then the configuration file is
       pre-processed with m4.  If the configuration file ends with .m4, the
       file is also processed by m4 automatically.

       The scheduler enters a loop which handles client connections, monitor
       invocations, and failure alerts.  Each service has a timer, specified
       in the configuration file as the interval variable, which tells the
       scheduler how frequently to invoke a monitor process.  The scheduler
       may be temporarily stopped.  While it is stopped, client access still
       functions, but nothing is scheduled.  This is useful when resetting
       the server, because you can save the hosts and services which are
       disabled, reset the server with the scheduler stopped, re-disable
       those hosts and services, and then start the scheduler.  It also
       allows making atomic changes across several client connections.  See
       the moncmd man page for more information.

MONITOR PROGRAMS

       Monitor processes are invoked with the arguments specified in the
       configuration file, followed by the hosts from the applicable host
       group.  For example, if the watch group is "servers", which contains
       the hostnames "smtp", "nntp", and "ns", and the monitor line reads as
       follows,
        monitor fping.monitor -t 4000 -r 2
       then the executable "fping.monitor" will be executed with these
       parameters:
        MONITOR_DIR/fping.monitor -t 4000 -r 2 smtp nntp ns

       MONITOR_DIR     is    actually    a    search    path,    by    default
       /usr/local/lib/mon/mon.d  then  /usr/lib/mon/mon.d,  but  it   can   be
       overridden by the -s option or in the configuration file.  If all hosts
       in the hostgroup have been disabled, then a warning is sent  to  syslog
       and  the  monitor  is not run. This behavior may be overridden with the
       "allow_empty_group" option in the service  definition.   If  the  final
       argument  to  the  "monitor"  line  is  ";;"  (it  must  be preceded by
       whitespace), then the host list will not be appended to  the  parameter
       list.
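
       The argument handling described above can be sketched in Python.
       This is an illustration only, not mon's own code: the function name
       build_monitor_argv is hypothetical, and splitting the monitor line on
       whitespace is a simplifying assumption.

```python
def build_monitor_argv(monitor_line, hosts):
    """Return the argument list for a monitor, per the rules above.

    monitor_line is the text after the "monitor" keyword; hosts is the
    list of enabled hosts in the hostgroup.
    """
    args = monitor_line.split()
    if args and args[-1] == ";;":
        # A trailing ";;" means the host list is not appended.
        return args[:-1]
    return args + list(hosts)


print(build_monitor_argv("fping.monitor -t 4000 -r 2", ["smtp", "nntp", "ns"]))
print(build_monitor_argv("local.monitor --check ;;", ["smtp", "nntp", "ns"]))
```

       The first call reproduces the fping.monitor example; the second uses
       a made-up monitor name to show the effect of the trailing ";;".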

       In addition to environment variables defined by the user in the service
       definition, mon passes certain variables to monitor processes.

       MON_LAST_SUMMARY
              The first line of the output from  the  last  time  the  monitor
              exited.  This is not the summary of the current monitor run, but
              the previous one.  This may  be  used  by  an  alert  script  to
              provide historical context in an alert.

       MON_LAST_OUTPUT
              The  entire  output of the monitor from the last time it exited.
              This is not the output of  the  current  monitor  run,  but  the
              previous  one.   This  may be used by an alert script to provide
              historical context in an alert.

       MON_LAST_FAILURE
              The time(2) of the last failure for this service.

       MON_FIRST_FAILURE
              The time(2) of the first time this service failed.

       MON_LAST_SUCCESS
              The time(2) of the last time this service passed.

       MON_DESCRIPTION
              The description of this service, as defined in the configuration
              file using the description tag.

       MON_DEPEND_STATUS
              The depend status, "0" if dependency failure, "1" otherwise.

       MON_LOGDIR
              The directory where log files should be placed, as indicated by
              the logdir global configuration variable.

       MON_STATEDIR
              The directory where state files should be kept, as indicated  by
              the statedir global configuration variable.

       MON_CFBASEDIR
              The  directory  where  configuration  files  should  be kept, as
              indicated by the cfbasedir global configuration variable.

       "fping.monitor" should return an exit  status  of  0  if  it  completed
       successfully (found no problems), or nonzero if a problem was detected.
       The first line of output from the monitor script has a special meaning:
       it  is used as a brief summary of the exact failure which was detected,
       and is passed to the alert program. All remaining output is also passed
       to the alert program, but it has no required interpretation.
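
       A monitor following this convention might be structured as in the
       Python sketch below.  The names run_monitor and check are
       hypothetical, and listing the failed hosts on the summary line is an
       assumed convention, not something this page requires.

```python
def run_monitor(hosts, check):
    """Test each host and return (exit_status, output).

    check(host) must return (ok, message).  The first output line is the
    brief summary; the remaining lines are free-form detail.
    """
    failed = []
    detail = []
    for host in hosts:
        ok, message = check(host)
        if not ok:
            failed.append(host)
            detail.append(f"{host}: {message}")
    if not failed:
        return 0, ""  # exit status 0: no problems found
    summary = " ".join(failed)
    return 1, "\n".join([summary] + detail) + "\n"


# Illustration with a fake check that fails only for "nntp".
status, output = run_monitor(
    ["smtp", "nntp", "ns"],
    lambda host: (host != "nntp", "no response"),
)
print(status)
print(output, end="")
```

       A real monitor would take the hosts from its command-line arguments,
       write the output to stdout, and exit with the returned status.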

       If  a  monitor  for a particular service is still running, and the time
       comes for mon to run another monitor for  that  service,  it  will  not
       start  another  monitor.  For  example, if the interval is 10s, and the
       monitor does not finish running within 10 seconds, then mon  will  wait
       until the first monitor exits before running another one.

ALERT DECISION LOGIC

       Upon a non-zero or zero exit status, the associated alert or upalert
       program (respectively) is started, pending the following conditions:

       1. If an alert for a specific service is disabled, do not send an
          alert.

       2. If dep_behavior is set to 'a', or alertdepend is set, and a parent
          dependency is failing, then suppress the alert.

       3. If the alert has previously been acknowledged, do not send the
          alert, unless it is an upalert.

       4. If an alert is not within the specified period, record the failure
          via syslog(3) and do not send an alert.

       5. If the failure does not fall within a defined period, do not send
          an alert.

       6. No upalerts are sent without corresponding down alerts, unless
          no_comp_alerts is defined in the period section.  An upalert will
          only be sent if the previous state is a failure.

       7. If an alert was already sent within the last alertevery interval,
          do not send another alert, unless the summary output from the
          current monitor program differs from the last monitor process.

       8. Otherwise, send an alert using each alert program listed for that
          period.

       The observe_detail argument to alertevery affects this behavior by
       observing the changes in the detail part of the output in addition to
       the summary line.  If a monitor has successive failures and the
       summary output changes in each of them, alertevery will not suppress
       multiple consecutive alerts.  The reasoning is that if the summary
       output changes, then a significant event occurred and the user should
       be alerted.

       The "strict" argument to alertevery both suppresses the comparison of
       the output from the previous monitor run with the current one and
       prevents a successful return value of the monitor from resetting the
       alertevery timer.  For example, "alertevery 24h strict" will only send
       out an alert once every 24 hours, regardless of whether the monitor
       output changes, or if the service stops and then starts failing.
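
       The alertevery part of this logic can be sketched in Python as
       follows.  This is a simplified model, not mon's implementation: the
       function and parameter names are hypothetical, times are plain
       seconds, and the other conditions (disabled services, dependencies,
       acknowledgements, periods, observe_detail, and the timer reset that
       "strict" prevents) are left out.

```python
def should_send_alert(now, last_alert_time, alertevery,
                      summary, last_summary, strict=False):
    """Decide whether a failure alert passes the alertevery check."""
    if last_alert_time is None:
        return True  # no alert has been sent yet
    if now - last_alert_time >= alertevery:
        return True  # the alertevery interval has elapsed
    if strict:
        return False  # "strict" ignores changes in the monitor output
    # Within the interval, only a changed summary line triggers an alert.
    return summary != last_summary


# Same summary 100 seconds after the last alert, alertevery of 600 seconds.
print(should_send_alert(200, 100, 600, "smtp", "smtp"))
# The summary changed, so the alert is sent despite the interval.
print(should_send_alert(200, 100, 600, "smtp ns", "smtp"))
```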

ALERT PROGRAMS

       Alert programs are found in the path supplied with the -a parameter, or
       in the default path (/usr/local/lib/mon/alert.d:alert.d, see the -a
       option) if not specified.  They are invoked with the following
       command-line parameters:

       -s service
              Service tag from the configuration file.

       -g group
              Host group name from the configuration file.

       -h hosts
              The  expanded  version  of  the host group, space delimited, but
              contained in one shell "word".

       -l alertevery
              The number of seconds until the next alarm will be sent.

       -O     This option is supplied to an alert only if the alert is being
              generated as a result of an expected trap timing out.

       -t time
              The  time (in time(2) format) of when this failure condition was
              detected.

       -T     This option is supplied to an alert only if the alert was
              triggered by a trap.

       -u     This  option  is supplied to an alert only if it is being called
              as an upalert.

       The remaining arguments are supplied from the  trailing  parameters  in
       the configuration file, after the "alert" service parameter.
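
       An alert program written in Python could read these parameters with
       the standard argparse module, as sketched below.  The function name
       and the attribute names are hypothetical.  Note that -h carries the
       host list here, so argparse's built-in help option is turned off.

```python
import argparse


def parse_alert_args(argv):
    """Parse the command-line parameters passed to an alert program."""
    parser = argparse.ArgumentParser(add_help=False)  # -h means hosts
    parser.add_argument("-s", dest="service")
    parser.add_argument("-g", dest="group")
    parser.add_argument("-h", dest="hosts", default="")
    parser.add_argument("-l", dest="alertevery")
    parser.add_argument("-t", dest="time", type=int)
    parser.add_argument("-O", dest="trap_timeout", action="store_true")
    parser.add_argument("-T", dest="trap", action="store_true")
    parser.add_argument("-u", dest="upalert", action="store_true")
    # Trailing parameters from the "alert" line, e.g. recipients.
    parser.add_argument("rest", nargs="*")
    args = parser.parse_args(argv)
    # The hosts arrive space delimited in a single shell word.
    args.hosts = args.hosts.split()
    return args


args = parse_alert_args(
    ["-s", "ping", "-g", "servers", "-h", "smtp nntp",
     "-t", "1000000000", "-u", "admin@example.com"]
)
print(args.service, args.group, args.hosts, args.upalert, args.rest)
```

       The address admin@example.com is a placeholder.  Trailing parameters
       that begin with a dash would need extra handling in this sketch.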

       As  with  monitor programs, alert programs are invoked with environment
       variables defined by the user in the service definition, in addition to
       the following which are explicitly set by the server:

       MON_LAST_SUMMARY
              The  first  line  of  the  output from the last time the monitor
              exited.

       MON_LAST_OUTPUT
              The entire output of the monitor from the last time it exited.

       MON_LAST_FAILURE
              The time(2) of the last failure for this service.

       MON_FIRST_FAILURE
              The time(2) of the first time this service failed.

       MON_LAST_SUCCESS
              The time(2) of the last time this service passed.

       MON_DESCRIPTION
              The description of this service, as defined in the configuration
              file using the description tag.

       MON_GROUP
              The watch group which triggered this alarm.

       MON_SERVICE
              The service heading which generated this alert.

       MON_RETVAL
              The exit value of the failed monitor program, or return value as
              accepted from a trap.

       MON_OPSTATUS
              The operational status of the service.

       MON_ALERTTYPE
              Has one of the following  values:  "failure",  "up",  "startup",
              "trap",  or "traptimeout", and signifies the type of alert which
              was triggered.

       MON_TRAP_INTENDED
              This is only set when an unknown mon trap is received and caught
              by the default/default watch/service.  This contains
              colon-separated entries of the trap's intended watch group and
              service name.

       MON_LOGDIR
              The directory where log files should be placed, as indicated by
              the logdir global configuration variable.

       MON_STATEDIR
              The directory where state files should be kept, as indicated  by
              the statedir global configuration variable.

       MON_CFBASEDIR
              The  directory  where  configuration  files  should  be kept, as
              indicated by the cfbasedir global configuration variable.

       The first line from standard input must be used as a brief  summary  of
       the problem, normally supplied as the subject line of an email, or text
       sent to an alphanumeric pager. Interpretation of all  subsequent  lines
       read  from  stdin  is  left  up  to  the  alerting  program.  The usual
       parameters are a list of recipients to  deliver  the  notification  to.
       The interpretation of the recipients is not specified, and is up to the
       alert program.
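
       As a sketch of how an alert program might collect its input, the
       Python function below reads the summary and detail from a stream and
       picks up a few of the environment variables listed above.  The
       function name and the returned dictionary layout are hypothetical.

```python
import io
import os


def read_alert_input(stream, environ=os.environ):
    """Collect the summary, detail, and selected MON_* variables."""
    summary = stream.readline().rstrip("\n")  # first line: brief summary
    detail = stream.read()                    # the rest is free-form
    return {
        "summary": summary,
        "detail": detail,
        "alert_type": environ.get("MON_ALERTTYPE", ""),
        "group": environ.get("MON_GROUP", ""),
        "service": environ.get("MON_SERVICE", ""),
    }


# Illustration with canned input instead of sys.stdin and os.environ.
sample = io.StringIO("smtp\nsmtp: connection refused\n")
info = read_alert_input(
    sample,
    {"MON_ALERTTYPE": "failure", "MON_GROUP": "servers",
     "MON_SERVICE": "smtp"},
)
print(info["summary"])
print(info["alert_type"], info["group"], info["service"])
```

       A real alert program would pass sys.stdin and use the default
       environment.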

CONFIGURATION FILE

       The configuration  file  consists  of  zero  or  more  global  variable
       definitions,  zero or more hostgroup definitions, and one or more watch
       definitions. Each  watch  definition  may  have  one  or  more  service
       definitions.  A watch definition is terminated by a blank line, another
       definition, or the end of the file.  A  line  beginning  with  optional
       leading  whitespace  and a pound ("#") is regarded as a comment, and is
       ignored.

       Lines are parsed as they are read.  Long  lines  may  be  continued  by
       ending  them  with a backslash ("\").  If a line is continued, then the
       backslash, the trailing whitespace after the backslash, and the leading
       whitespace  of  the  following  line  are  removed.  The  end result is
       assembled into a single line.
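
       The continuation rule can be illustrated with a short Python sketch.
       The function name is hypothetical and this is not the parser mon
       itself uses; it only applies the rule as stated above.

```python
import re


def join_continued_lines(text):
    """Assemble backslash-continued lines into single logical lines."""
    logical = []
    pending = None
    for raw in text.splitlines():
        # Leading whitespace of a continuation line is removed.
        line = raw if pending is None else pending + raw.lstrip()
        match = re.search(r"\\\s*$", line)
        if match:
            # Drop the backslash and any whitespace that follows it.
            pending = line[:match.start()]
        else:
            logical.append(line)
            pending = None
    if pending is not None:
        logical.append(pending)
    return logical


sample = "alert mail.alert \\\n    admin@example.com\nservice ping"
print(join_continued_lines(sample))
```

       The sample text uses a placeholder address.  Whitespace before the
       backslash is kept, which is what separates the two joined words.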

       Typically the configuration file has the following layout:

       1. Global variable definitions

       2. Hostgroup definitions

       3. Watch definitions

       See the "etc/example.cf" file which comes with the distribution for an
       example.

   Global Variables
       The  following  variables  may be set to override compiled-in defaults.
       Command-line  options  will  have  a  higher  precedence   than   these
       definitions.

       alertdir = dir
              dir is the full path to the alert scripts. This is the value set
              by the -a command-line parameter.

              Multiple alert paths may be specified by separating them with  a
              colon.   Non-absolute paths are taken to be relative to the base
              directory (/usr/lib/mon by default).

              When the configuration file is read, all alerts referenced  from
              the  configuration will be looked up in each of these paths, and
              the full path to the first instance of the alert found is stored
              in  a  hash. This hash is only generated upon startup or after a
              "reset" command, so  newly  added  alert  scripts  will  not  be
              recognized until a "reset" is performed.

       mondir = dir
              dir is the full path to the monitor scripts. This value may also
              be set by the -s command-line parameter. If this path  does  not
              begin with a "/", it will be relative to basedir.

              Multiple monitor paths may be specified by separating them with
              a colon. All paths must be absolute.

              When the configuration file is  read,  all  monitors  referenced
              from the configuration will be looked up in each of these paths,
              and the full path to the first instance of the monitor found  is
              stored  in  a  hash. This hash is only generated upon startup or
              after a "reset" command, so newly added monitor scripts will not
              be recognized until a "reset" is performed.

       statedir = dir
              dir  is  the  full  path  to the state directory.  mon uses this
              directory to save various state information. If this  path  does
              not begin with a "/", it will be relative to basedir.

       logdir = dir
              dir  is  the  full  path  to  the  log directory.  mon uses this
              directory to save various logs, including the downtime  log.  If
              this  path  does  not  begin  with a "/", it will be relative to
              basedir.

       basedir = dir
              dir is the full path for the  state,  log,  monitor,  and  alert
              directories.

       cfbasedir = dir
              dir  is  the  full  path where all the config files can be found
              (monusers.cf, auth.cf, etc.).

       authfile = file
              file is the path to the authentication file. If  the  path  does
              not begin with a "/", it will be relative to cfbasedir.

       authtype = type [type...]
              type is the type of authentication to use.  A space-separated
              list of types may be specified, and they will be checked in the
              order they are listed.  As soon as a successful authentication
              is performed, the user is considered authenticated by mon for
              the duration of the session and no more authentication checks
              are performed.

              If  type  is  getpwnam,  then  the  standard  Unix  passwd  file
              authentication  method  will  be  used (calls getpwnam(3) on the
              user and compares the crypt(3)ed version of  the  password  with
              what  it  gets  from  getpwnam).  This  will  not work if shadow
              passwords are enabled on the system.

              If type is userfile, then usernames  and  hashed  passwords  are
              read   from   userfile,   which  is  defined  via  the  userfile
              configuration variable.

              If type is pam, then PAM (pluggable authentication modules) will
              be  used  for  authentication.   The  service  specified  by the
              pamservice global will be used. If no global is given,  the  PAM
              passwd service will be used.

              If type is trustlocal, then if the client connection comes from
              localhost, the username passed from the client will be trusted,
              and the password will be ignored.  This can be used when you
              want the client to handle the authentication for you, for
              example a CGI script using one of the many Apache
              authentication methods.

       userfile = file
              This file is used when authtype is set to userfile.  It consists
              of a sequence of lines of  the  format  'username  :  password'.
              password  is  stored  as  the hash returned by the standard Unix
              crypt(3) function.  NOTE: the format of this file is  compatible
              with  the Apache file based username/password file format. It is
              possible to use the htpasswd program  supplied  with  Apache  to
              manage the mon userfile.

              Blank lines and lines beginning with # are ignored.
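
              The file format can be read with a few lines of Python, as in
              the sketch below.  The function name is hypothetical, the hash
              strings are placeholders, and ignoring whitespace before a
              leading # is an assumption of this sketch.

```python
def parse_userfile(text):
    """Return a dict mapping usernames to their stored password hashes."""
    users = {}
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # blank lines and comments are ignored
        username, sep, password_hash = stripped.partition(":")
        if not sep:
            continue  # not a 'username : password' line; skipped here
        users[username.strip()] = password_hash.strip()
    return users


sample = """\
# mon users
alice : PLACEHOLDERHASH1

bob:PLACEHOLDERHASH2
"""
print(parse_userfile(sample))
```

              Verifying a password against the stored hash is left out.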

       pamservice = service
              The PAM service used for authentication. This is applicable only
              if "pam" is specified as a parameter to the authtype setting. If
              this global is not defined, it defaults to passwd.

       serverbind = addr

       trapbind = addr

              serverbind and trapbind specify which address to bind the server
              and trap ports to, respectively.  If these are not defined,  the
              default  address  is INADDR_ANY, which allows connections on all
              interfaces. For security reasons, it could be  a  good  idea  to
              bind only to the loopback interface.

       dtlogfile = file
              file is a file which will be used to record the downtime log.
              Whenever a service fails for some amount of time and then stops
              failing, this event is written to the log.  If this parameter
              is not set, no logging is done.  The format of the file is as
              follows (# is a comment and may be ignored):

              timenoticed group service firstfail downtime interval summary.

              timenoticed is the time(2) the service came back up.

              group service is the group and service which failed.

              firstfail is the time(2) when the service began to fail.

              downtime is the number of seconds the service failed.

              interval  is  the  frequency  (in  seconds)  that the service is
              polled.

              summary is the summary line from when the service was failing.
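
              A downtime log line could be parsed as in the following Python
              sketch.  It assumes that fields are separated by whitespace,
              that the numeric fields are whole seconds, and that everything
              after the sixth field is the summary; the names used are
              hypothetical and the sample line is made up.

```python
from collections import namedtuple

DowntimeEntry = namedtuple(
    "DowntimeEntry",
    "timenoticed group service firstfail downtime interval summary",
)


def parse_dtlog_line(line):
    """Parse one downtime log line; return None for comments and blanks."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    fields = line.split(None, 6)
    if len(fields) < 6:
        raise ValueError(f"too few fields: {line!r}")
    summary = fields[6] if len(fields) == 7 else ""
    return DowntimeEntry(
        int(fields[0]), fields[1], fields[2],
        int(fields[3]), int(fields[4]), int(fields[5]), summary,
    )


entry = parse_dtlog_line("1000000600 servers ping 1000000000 600 60 smtp nntp")
print(entry.group, entry.service, entry.downtime, entry.summary)
```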

       monerrfile = filename
              By default, when mon daemonizes itself, it connects  stdout  and
              stderr to /dev/null. If monerrfile is set to a file, then stdout
              and stderr will be appended to that file. In all cases stdin  is
              connected  to /dev/null. If mon is told to run in the foreground
              and  to  not  daemonize,  then  none  of  this  applies,   since
              stdin/stdout/stderr  stay connected to whatever they were at the
              time of invocation.

       dtlogging = yes/no

              Turns downtime logging on or off. The default is off.

       histlength = num
              num is the maximum number of events to be retained in the
              history list. The default is 100.  This value may also be set by
              the -k command-line parameter.

       historicfile = file
              If this variable is set, then alerts are  logged  to  file,  and
              upon  startup,  some  (or  all) of the past history is read into
              memory.

       historictime = timeval
              timeval is the amount of the history file to read upon startup:
              entries dated from "now" - timeval onward are read.  See the
              explanation of interval in the "Service Definitions" section
              for a description of timeval.

       serverport = port
              port is the TCP port number that the server should bind to. This
              value may also be set by the -p command-line parameter. Normally
              this port is looked up via getservbyname(3), and it defaults  to
              2583.

       trapport = port
              port is the UDP port number that the trap server should bind to.
              Normally this port is looked up  via  getservbyname(3),  and  it
              defaults to 2583.

       pidfile = path
              path is the file the server will store its pid in.  This value
              may also be set by the -P command-line parameter.

       maxprocs = num
              Throttles the number of concurrently forked  processes  to  num.
              The intent is to provide a safety net for the unlikely situation
              when the server tries to take on too many tasks at  once.   Note
              that this situation has only been reported to happen when trying
              to use a garbled configuration file! You don't  want  to  use  a
              garbled configuration file now, do you?

       cltimeout = secs
              Sets  the  client  inactivity timeout to secs.  This is meant to
              help thwart denial of service attacks or  recover  from  crashed
              clients.  secs is interpreted as a "1h/1m/1s" string, where "1m"
              = 60 seconds.
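
              The "1h/1m/1s" notation can be converted to seconds as in the
              Python sketch below.  Only the three suffixes named above are
              handled, and whole-number counts are assumed; the function
              name is hypothetical.

```python
import re

# Seconds per unit for the suffixes mentioned above.
_UNITS = {"s": 1, "m": 60, "h": 3600}


def parse_timeval(text):
    """Convert a string such as "30s", "1m", or "2h" to seconds."""
    match = re.fullmatch(r"(\d+)([smh])", text.strip())
    if match is None:
        raise ValueError(f"unrecognized time value: {text!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


print(parse_timeval("1m"))
print(parse_timeval("2h"))
```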

       randstart = interval
              When the server starts, normally no service is scheduled until
              the interval defined in its service section has elapsed.  This
              can cause long delays before the first check of a
              service,  and  possibly  a  high  load on the server if multiple
              things are scheduled at the same intervals.  This option is used
              to  randomize  the scheduling of the first test for all services
              during the startup  period,  and  immediately  after  the  reset
              command.  If randstart is defined, the scheduled run time of all
              services of all watch groups will be  a  random  number  between
              zero and randstart seconds.
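
              The effect of randstart on the first run of each service can
              be modelled with the Python sketch below.  It is an
              illustration, not mon's scheduler: the function name is
              hypothetical and a uniform distribution is assumed.

```python
import random


def first_run_times(intervals, now, randstart=None, rng=random):
    """Return the time of the first scheduled run for each service.

    intervals maps a service name to its interval in seconds.
    """
    times = {}
    for service, interval in intervals.items():
        if randstart is None:
            # Without randstart the first run waits a full interval.
            times[service] = now + interval
        else:
            # With randstart the first run falls within randstart seconds.
            times[service] = now + rng.uniform(0, randstart)
    return times


print(first_run_times({"ping": 60, "smtp": 300}, now=1000))
print(first_run_times({"ping": 60, "smtp": 300}, now=1000, randstart=30))
```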

       dep_recur_limit = depth
              Limit  dependency  recursion  level  to  depth.   If  dependency
              recursion (dependencies  which  depend  on  other  dependencies)
              tries to go beyond depth, then the recursion is aborted and a
              message is logged to syslog.  The default limit is 10.

       dep_behavior = {a|m|hm}
              dep_behavior  controls   whether   the   dependency   expression
              suppresses  one  of:  the  running  of  alerts,  the  running of
              monitors, or the passing of individual hosts  to  the  monitors.
              Read  more  about  the  behavior  in  the  "Service Definitions"
              section below.

              This is a global setting which controls the default settings for
              the service-specified variable.

       dep_memory = timeval
              If  set,  dep_memory  will  cause  dependencies  to  continue to
              prevent alerts/monitoring for a period of time after the service
              returns  to  a  normal state.  This can be used to prevent over-
              eager alerting when a machine is rebooting,  for  example.   See
              the explanation of interval in the "Service Definitions" section
              for a description of timeval.

              This is a global setting which controls the default settings for
              the service-specified variable.

       syslog_facility = facility
              Specifies  the  syslog facility used for logging.  daemon is the
              default.

       startupalerts_on_reset = {yes|no}

              If set to "yes", startupalerts will be invoked  when  the  reset
              client command is executed. The default is "no".

       monremote = program

              If set, this external program will be called by Mon when various
              client requests are processed.  This can be  used  to  propagate
              those  changes  from  one  Mon  server  to  another, if you have
              multiple monitoring machines.  An example script, monremote.pl,
              is available in the clients directory.

   Hostgroup Entries
       Hostgroup entries begin with the keyword hostgroup, and are followed by
       a hostgroup tag and one or more hostnames or IP addresses, separated by
       whitespace.   The  hostgroup  tag  must  be  composed  of  alphanumeric
       characters, a dash ("-"), a period ("."), or an underscore ("_").  Non-
       blank  lines following the first hostgroup line are interpreted as more
       hostnames.  The hostgroup  definition  ends  with  a  blank  line.  For
       example:

              hostgroup servers nameserver smtpserver nntpserver
                   nfsserver httpserver smbserver

              hostgroup router_group cisco7000 agsplus
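
       The hostgroup syntax can be read with a small Python sketch such as
       the one below.  The function name is hypothetical, and the sketch
       only understands hostgroup blocks: it ends a block at a blank line
       and does not recognize other keywords that might follow directly.

```python
import re

# Tags may contain alphanumerics, dashes, periods, and underscores.
TAG_RE = re.compile(r"[A-Za-z0-9._-]+")


def parse_hostgroups(text):
    """Return a dict mapping each hostgroup tag to its list of hosts."""
    groups = {}
    current = None
    for line in text.splitlines():
        words = line.split()
        if not words:
            current = None  # a blank line ends the definition
        elif words[0] == "hostgroup":
            if len(words) < 2 or not TAG_RE.fullmatch(words[1]):
                raise ValueError(f"bad hostgroup line: {line!r}")
            current = groups.setdefault(words[1], [])
            current.extend(words[2:])
        elif current is not None:
            current.extend(words)  # following lines add more hosts
    return groups


sample = """\
hostgroup servers nameserver smtpserver nntpserver
     nfsserver httpserver smbserver

hostgroup router_group cisco7000 agsplus
"""
print(parse_hostgroups(sample))
```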

   View Entries
       View  entries  begin  with the keyword view, and are followed by a view
       tag and the names of one or more hostgroups.   The  view  tag  must  be
       composed  of  alphanumeric characters, a dash ("-"), a period ("."), or
       an underscore ("_"). Non-blank lines following the first view line  are
       interpreted  as  more hostgroup names.  The view definition ends with a
       blank line. For example:

              view servers dns-servers web-servers file-servers
                   mail-servers

              view network-services routers switches vpn-servers

   Watch Group Entries
       Watch entries begin with a line that starts  with  the  keyword  watch,
       followed  by  whitespace  and  a single word which normally refers to a
       pre-defined hostgroup. If the  second  word  is  not  recognized  as  a
       hostgroup  tag,  a new hostgroup is created whose tag is that word, and
       that word is its only member.

       Watch entries consist of one or more service definitions.

       A watch group is terminated by a blank line, the end of the file, or by
       a subsequent definition, "watch", "hostgroup", or otherwise.

       There may be a special watch group entry called "default". If a default
       watch group is defined with a service entry named "default", then  this
       definition  will be used in handling traps received for an unrecognized
       watch and service.

   Service Definitions
       service servicename
              A service definition begins with the keyword service followed
              by  a word which is the tag for this service.  This word must be
              unique among all services defined for the same watch group.

              The components of a service are an interval, monitor, and one or
              more time period definitions, as defined below.

              If  a  service name of "default" is defined within a watch group
              called "default" (see above), then the default/default
              definition will be used for handling unknown mon traps.

              The  following configuration parameters are valid only following
              a service definition:

       VARIABLE=value
              Environment variables may be defined  for  each  service,  which
              will  be  included  in  the  environment of monitors and alerts.
              Variables must be specified in all capital letters,  must  begin
              with  an alphabetical character or an underscore, and there must
              be no spaces to the left of the equal sign.
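
              A minimal sketch of a per-service variable follows.  The
              variable name HTTP_TIMEOUT and the hostgroup name are
              hypothetical, and http.monitor is assumed to be installed in
              the monitor directory; whether a monitor actually reads such
              a variable depends on the script:

                     watch webservers
                          service http
                               # hypothetical variable, exported to the
                               # environment of monitors and alerts
                               HTTP_TIMEOUT=30
                               interval 5m
                               monitor http.monitor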

       interval timeval
              The keyword interval followed by a time value specifies how
              often a monitor script will be triggered.  Time values are
              written as "30s", "5m", "1h", or "1d", meaning 30 seconds, 5
              minutes, 1 hour, or 1 day.  The numeric portion may be a
              fraction, such as "1.5h" for an hour and a half.  This format
              of a time specification will be referred to as timeval.

       failure_interval timeval
              Adjusts  the  polling interval to timeval when the service check
              is failing. Resets the interval to the original when the service
              succeeds.

       traptimeout timeval
              This  keyword  takes  the  same  time  specification argument as
              interval, and makes the service expect a trap from  an  external
              source  at  least that often, else a failure will be registered.
              This is used for a heartbeat-style service.

       trapduration timeval
              If a trap is received, the status of the service  the  trap  was
              delivered  to  will normally remain constant. If trapduration is
              specified, the status of the service will remain  in  a  failure
              state for the duration specified by timeval, and then it will be
              reset to "success".
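
              A sketch of two trap-driven services, one using traptimeout
              as a heartbeat and one using trapduration.  The watch name,
              service names, and mail address are made up for illustration,
              and mail.alert is assumed to be installed in the alert
              directory:

                     watch backup-host
                          service heartbeat
                               description TRAP: heartbeat from backup job
                               # fail if no trap arrives within 15 minutes
                               traptimeout 15m
                               period wd {Sun-Sat}
                                    alert mail.alert someone@your.org
                          service disk-warning
                               description TRAP: disk warnings
                               # a failure trap is reset to success after 1h
                               trapduration 1h
                               period wd {Sun-Sat}
                                    alert mail.alert someone@your.org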

       randskew timeval
              Rather than schedule the monitor script to run at the start of
              each interval, randomly adjust the interval specified by the
              interval parameter by plus-or-minus randskew.  The skew value
              is specified in the same way as the interval parameter: "30s",
              "5m", etc.  For example, if interval is "1m" and randskew is
              "5s", then mon will schedule the monitor script to run at
              intervals of between 55 and 65 seconds.  The intent is to help
              distribute the load on the server when many services are
              scheduled at the same intervals.
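
              As a sketch, interval, failure_interval, and randskew might be
              combined as follows.  The hostgroup name "servers" is taken
              from the hostgroup example earlier in this page, and
              fping.monitor is assumed to be installed in the monitor
              directory:

                     watch servers
                          service ping
                               interval 1m
                               failure_interval 15s
                               randskew 5s
                               monitor fping.monitor

              With these values the check would run roughly every 55 to 65
              seconds while the service is succeeding, and the polling
              interval would be adjusted to 15 seconds while it is failing.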

       monitor monitor-name [arg...]
              The keyword monitor followed by  a  script  name  and  arguments
              specifies  the monitor to run when the timer expires. Shell-like
              quoting conventions are followed when specifying  the  arguments
              to  send  to the monitor script.  The script is invoked from the
              directory given with the -s argument, and  all  following  words
              are  supplied  as  arguments to the monitor program, followed by
              the list of hosts in the group referred to by the current  watch
              group.   If  the monitor line ends with ";;" as a separate word,
              the host groups are not appended to the argument list  when  the
              program is invoked.
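
              Two sketches of monitor lines, each of which would appear in a
              different service definition.  The script name queue.monitor
              and its options are hypothetical; fping.monitor is assumed to
              be installed in the monitor directory:

                     # the hosts of the watch group are appended
                     monitor fping.monitor

                     # hypothetical script; ";;" suppresses the host list
                     monitor queue.monitor --name "mail queue" --max 100 ;;

              In the second line the quoted string is passed as a single
              argument, following the shell-like quoting described above.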

       allow_empty_group
              The  allow_empty_group option will allow a monitor to be invoked
              even when the hostgroup for  that  watch  is  empty  because  of
              disabled  hosts.  The  default  behavior  is  not  to invoke the
              monitor when all hosts in a hostgroup have been disabled.

       description descriptiontext
              The text following description can be queried by client
              programs and is passed to alerts and monitors via an
              environment variable.  It should contain a brief description
              of the service, suitable for inclusion in an email or on a web
              page.

       exclude_hosts host [host...]
              Any  hosts  listed after exclude_hosts will be excluded from the
              service check.

       exclude_period periodspec
              Do not run a scheduled monitor during  the  time  identified  by
              periodspec.
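
              A sketch of both exclusion keywords inside a service
              definition.  The host name is taken from the hostgroup example
              earlier in this page, and the period is an arbitrary
              Time::Period specification chosen for illustration:

                     service ping
                          interval 5m
                          monitor fping.monitor
                          exclude_hosts smbserver
                          exclude_period wd {Sat} hr {2am-4am}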

       depend dependexpression
              The  depend  keyword is used to specify a dependency expression,
              which evaluates to either true or false, in the boolean sense.
              Dependencies  are  actual  Perl  expressions,  and must obey all
              syntactical rules. The expressions are evaluated  in  their  own
              package space so as to not accidentally have some unwanted side-
              effect.   If  a  syntax  error  is  found  when  evaluating  the
              expression, it is logged via syslog.

              Before evaluation, the following substitutions on the expression
              occur: phrases which look like "group:service"  are  substituted
              with  the  value  of  the  current  operational  status  of that
              specified service. These  opstatus  substitutions  are  computed
              recursively, so if service A depends upon service B, and service
              B depends upon service C, then service A depends upon service C.
              Successful  operational  statuses  (which  evaluate  to "1") are
              "STAT_OK",     "STAT_COLDSTART",      "STAT_WARMSTART",      and
              "STAT_UNKNOWN".   The  word "SELF" (in all caps) can be used for
              the group (e.g. "SELF:service"), and is an abbreviation for  the
              current watch group.

              This  feature  can  be used to control alerts for services which
              are dependent on other services, e.g.  an  SMTP  test  which  is
              dependent upon the machine being ping-reachable.
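
              Two alternative depend lines, as a sketch; only one would be
              used in a given service.  The watch group "routers" and the
              service name "ping" are assumptions and would have to be
              defined elsewhere in the configuration:

                     # depends on the ping service of the same watch group
                     depend SELF:ping

                     # a Perl expression combining two services
                     depend routers:ping && SELF:ping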

       dep_behavior {a|m|hm}
              The evaluation of the dependency graphs specified via the depend
              keyword  can  control  the  suppression  of  alert  or   monitor
              invocations,  or  the  suppression of individual hosts passed to
              the monitor.

              Alert suppression.  If this option  is  set  to  "a",  then  the
              dependency  expression  will  be evaluated after the monitor for
              the service exits or after a trap is received.   An  alert  will
              only  be  sent  if the evaluation succeeds, meaning that none of
              the nodes in the dependency graph indicate failure.

              Monitor suppression.  If it is set to "m", then the dependency
              expression will be evaluated before the monitor for the service
              is about to run.  If the evaluation succeeds, then the monitor
              will be run.  Otherwise, the monitor will not be run and the
              status of the service will remain the same.

              Host suppression.  If it is set to "hm" then  Mon  will  extract
              the  list  of  "parent" services from the dependency expression.
              (In fact the expression can be just a list  of  services.)  Then
              when  the  monitor  for the service is about to be run, for each
              host in the current hostgroup Mon will  search  all  the  parent
              services  which  are currently failing and look for the hostname
              in the current summary output.  If the hostname is  found,  this
              host will be excluded from this run of the monitor.  This can be
              used to e.g. allow an SMTP test on a group of hosts to still  be
              run  even  when a single host is not ping-reachable.  If all the
              rest of the hosts are working fine, the service will be in an OK
              state,  but  if  another  host fails the SMTP test Mon can still
              alert about that host even  though  the  parent  dependency  was
              failing.  The dependency expression will not be used recursively
              in this case.
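
              A sketch of host suppression for the SMTP case described
              above.  The hostgroup and host names are made up, and
              fping.monitor, smtp.monitor, and mail.alert are assumed to be
              installed:

                     hostgroup mailservers mail1 mail2 mail3

                     watch mailservers
                          service ping
                               interval 1m
                               monitor fping.monitor
                               period wd {Sun-Sat}
                                    alert mail.alert someone@your.org
                          service smtp
                               interval 5m
                               monitor smtp.monitor
                               depend SELF:ping
                               dep_behavior hm
                               period wd {Sun-Sat}
                                    alert mail.alert someone@your.org

              If mail2 appears in the summary output of the failing ping
              service, it would be left out of the host list passed to
              smtp.monitor, while mail1 and mail3 would still be tested.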

       alertdepend dependexpression

       monitordepend dependexpression

       hostdepend dependexpression
              These  keywords  allow  you  to  specify   multiple   dependency
              expressions  of  different  types.   Each one corresponds to the
              different dep_behavior settings  listed  above.   They  will  be
              evaluated  independently  in  the  different  contexts as listed
              above.  If depend is  present,  it  takes  precedence  over  the
              matching keyword, depending on the dep_behavior setting.

       dep_memory timeval
              If  set,  dep_memory  will  cause  dependencies  to  continue to
              prevent alerts/monitoring for a period of time after the service
              returns  to  a  normal state.  This can be used to prevent over-
              eager alerting when a machine is rebooting,  for  example.   See
              the explanation of interval in the "Service Definitions" section
              for a description of timeval.

       redistribute alert [arg...]
              A service may have one redistribute option, which is  a  special
              form of an alert definition.  This alert will be called on
              every service status  update,  even  sequential  success  status
              updates.   This  can  be  used  to  integrate  Mon  with another
              monitoring system, or to link together multiple Mon servers  via
              an  alert  script  that  generates  Mon  traps.   See the "ALERT
              PROGRAMS" section above for a list of the  parameters  mon  will
              pass automatically to alert programs.

       unack_summary
              Remove  the  "acknowledged"  state from a service if the summary
              component of the failure message changes.  In most common  usage
              the summary is the list of hosts that are failing, so additional
              hosts failing would remove an ack.

   Period Definitions
       Periods are used to define the conditions which should allow alerts  to
       be delivered.

       period [label:] periodspec
              A  period  groups one or more alarms and variables which control
              how often an alert happens when there is a failure.  The  period
              definition has two forms. The first takes an argument which is a
              period specification from Patrick  Ryan's  Time::Period  Perl  5
              module. Refer to "perldoc Time::Period" for more information.

              The   second   form  requires  a  label  followed  by  a  period
              specification, as defined above. The label is a  tag  consisting
              of  an  alphabetic  character  or underscore followed by zero or
              more alphanumerics or underscores and ending with a colon.  This
              form  allows  multiple  periods with the same period definition.
              One use is to have a period definition which has  no  alertafter
              or  alertevery  parameters  for  a  particular  time period, and
              another for the same time period with a different set of  alerts
              that does contain those parameters.

              Period  definitions, in either the first or second form, must be
              unique within each service definition. For example, if you  need
              to  define two periods both for "wd {Sun-Sat}", then one or both
              of the period definitions must specify a label such  as  "period
              t1: wd {Sun-Sat}" and "period t2: wd {Sun-Sat}".
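
              A sketch of the use described above: two periods covering the
              same time range, one without alertafter or alertevery and one
              with them.  The label PAGE and the mail addresses are
              hypothetical:

                     period wd {Sun-Sat}
                          alert mail.alert someone@your.org
                     period PAGE: wd {Sun-Sat}
                          alertafter 3 30m
                          alertevery 1h
                          alert mail.alert oncall@your.org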

       alertevery timeval [observe_detail | strict]
              The  alertevery  keyword  (within a period definition) takes the
              same type of argument as the interval variable, and  limits  the
              number  of  times an alert is sent when the service continues to
              fail.  For example, if alertevery is "1h", then the alerts in
              the period section will be triggered only once every hour.  If
              the alertevery keyword is omitted in a period entry, an
              alert  will  be  sent  out  every time a failure is detected. By
              default, if  the  summary  output  of  two  successive  failures
              changes,  then  the  alertevery  interval  is overridden, and an
              alert will be sent.  If the string "observe_detail" is the  last
              argument,  then both the summary and detail output lines will be
              considered when comparing the output of successive failures.  If
              the string "strict" is the last argument, then the output of the
              monitor or the state change of the service will have  no  effect
              on  when  alerts are sent. That is, "alertevery 24h strict" will
              send only one alert every 24  hours,  no  matter  what.   Please
              refer  to  the  ALERT  DECISION  LOGIC  section  for  a detailed
              explanation of how alerts are suppressed.
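
              The three variants, as a sketch; only one alertevery line
              would appear in a given period:

                     # at most one alert per hour, unless the summary changes
                     alertevery 1h

                     # as above, but detail output is also compared
                     alertevery 1h observe_detail

                     # one alert every 24 hours, regardless of output
                     alertevery 24h strict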

       alertafter num

       alertafter num timeval

       alertafter timeval
              The alertafter keyword  (within  a  period  section)  has  three
              forms:  only  with the "num" argument, or with the "num timeval"
              arguments, or only with the "timeval" argument.   In  the  first
              form,  an  alert  will  only  be invoked after "num" consecutive
              failures.

              In the  second  form,  the  arguments  are  a  positive  integer
              followed  by  an interval, as described by the interval variable
              above.  If these parameters are specified, then the  alerts  for
              that  period will only be called after that many failures happen
              within that interval. For example, if alertafter  is  given  the
              arguments  "3 30m",  then the alert will be called if 3 failures
              happen within 30 minutes.

              In the third form, the argument is an interval, as described  by
              the  interval  variable above.  Alerts for that period will only
              be called if the service has been in a failure  state  for  more
              than the length of time described by the interval, regardless of
              the number of failures noticed within that interval.

       numalerts num

              This variable tells the server to call no more than  num  alerts
              during  a  failure.  The  alert  counter is kept on a per-period
              basis, and is reset upon each success.
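
              A sketch of a period combining alertafter, alertevery, and
              numalerts.  The values and the mail address are arbitrary
              choices for illustration:

                     period wd {Mon-Fri} hr {7am-10pm}
                          # alert once 3 failures occur within 30 minutes
                          alertafter 3 30m
                          # then at most once per hour
                          alertevery 1h
                          # and no more than 5 alerts per failure
                          numalerts 5
                          alert mail.alert someone@your.org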

       no_comp_alerts

              If this option  is  specified,  then  upalerts  will  be  called
              whenever  the  service  state  changes  from failure to success,
              rather than only after a corresponding "down" alert.

       alert alert [arg...]
              A period may contain multiple alerts, which are  triggered  upon
              failure  of  the  service.  An alert is specified with the alert
              keyword, followed by an optional exit parameter,  and  arguments
              which  are  interpreted  the same as the monitor definition, but
              without the ";;" exception. The exit parameter takes the form of
              exit=x  or  exit=x-y  and  has the effect that the alert is only
              called if the exit status of the monitor script falls within the
              range of the exit parameter.  If, for example, the alert line
              is "alert exit=10-20 mail.alert mis", then mail.alert will be
              invoked with mis as its argument only if the monitor program's
              exit value is between 10 and 20.  This feature allows you to
              trigger different alerts at different severity levels (like
              when free disk space goes from 8% to 3%).

              See  the  ALERT  PROGRAMS  section  above  for  a  list  of  the
              parameters mon will pass automatically to alert programs.
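
              A sketch of severity levels using the exit parameter.  The
              exit ranges are hypothetical and only make sense for a monitor
              script that is written to return such values; the mail
              addresses are also made up:

                     period wd {Sun-Sat}
                          # hypothetical "warning" exit codes
                          alert exit=1-9 mail.alert someone@your.org
                          # hypothetical "critical" exit codes
                          alert exit=10-20 mail.alert oncall@your.org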

       upalert alert [arg...]
              An upalert is the complement of an alert.  An upalert is called
              when a service makes the state transition from failure to
              success, if a corresponding "down" alert was previously sent.
              The upalert script is called with the same parameters as the
              alert script, with the addition of the -u parameter, which
              simply lets an alert script know that it is being called as an
              upalert.  Multiple upalerts may be specified for each period
              definition.  Set the per-period no_comp_alerts option to send
              an upalert regardless of whether a "down" alert was sent.

       startupalert alert [arg...]
              A startupalert  is  only  called  when  the  mon  server  starts
              execution,  or  when a "reset" command was issued to the server,
              depending on the setting of the  startupalerts_on_reset  global.
              Unlike  other alerts, startupalerts are not called following the
              exit of a monitor, i.e. they are  called  in  their  own  right,
              therefore   the   "exit="   argument   is   not   applicable  to
              startupalert.

       upalertafter timeval
              The upalertafter parameter is specified as a string that follows
              the  syntax  of  the interval parameter ("30s", "1m", etc.), and
              controls the triggering of an upalert.  If a service comes  back
              up  after  being  down  for  a time greater than or equal to the
              value of this option, an upalert will be called. Use this option
              to prevent upalerts from being called because of "blips" (brief
              outages).
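
              A sketch of a period using upalert together with upalertafter.
              The mail address and the 5 minute threshold are arbitrary, and
              the -u flag is not written out here on the assumption that mon
              supplies it, as described under upalert:

                     period wd {Sun-Sat}
                          alert mail.alert someone@your.org
                          upalert mail.alert someone@your.org
                          # no upalert for outages shorter than 5 minutes
                          upalertafter 5m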

AUTHENTICATION CONFIGURATION FILE

       The file specified by the authfile variable in the  configuration  file
       (or  passed  via  the  -A parameter) will be loaded upon startup.  This
       file defines restrictions upon which client commands may be executed by
       which  users.  It  is  a  text file which consists of comments, command
       definitions, and trap authentication parameters.  A comment line begins
       with optional whitespace followed by a pound sign ("#").  Blank lines are
       ignored.

       The file is separated into  a  command  section  and  a  trap  section.
       Sections are specified by a single line containing one of the following
       statements:

                   command section

       or

                   trap section

       Lines following one of the above statements apply to that section until
       either the end of the file or another section begins.

       A  command  definition  consists  of  a  command,  followed by a colon,
       followed by a  comma-separated  list  of  users  who  may  execute  the
       command.   The default is that no users may execute any commands unless
       they are explicitly allowed in this configuration file. For clarity,  a
       user  can  be  denied  by prefixing the user name with "!". If the word
       "AUTH_ANY" is used for a username, then any authenticated user will  be
       allowed  to  execute  the  command.  If  the  word  "all" is used for a
       username, then that command may be executed by any user,  authenticated
       or not.

       The  trap  section  allows  configuration of which users may send traps
       from which hosts. The syntax is a source host  (name  or  IP  address),
       whitespace,  a  username, whitespace, and a plaintext password for that
       user. If the source host is "*", then allow traps from any host. If the
       username  is  "*", then accept traps without regard for the username or
       password. If no hosts or users are specified, then  no  traps  will  be
       accepted.
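
       A sketch of the less common forms described above: AUTH_ANY, a user
       denied with "!", and a trap entry that accepts any source host.  The
       user names and the password are hypothetical, and "disable" is assumed
       to be one of the client commands documented in moncmd:

              command section
              # anyone, authenticated or not
              list:          all
              # any authenticated user
              disable:       AUTH_ANY
              # root only; guest is denied explicitly for clarity
              reset:         root,!guest

              trap section
              # traps from any host, for this user and password
              * trapuser tr4pp4ss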

       An example configuration file:

              command section
              list:          all
              reset:         root,admin
              loadstate:          root
              savestate:          root

              trap section
              127.0.0.1 root r@@tp4sswrd

       This means that all clients are able to perform the "list" command,
       "root" is able to perform "reset", "loadstate", and "savestate", and
       "admin" is able to execute the "reset" command.

CLIENT-SERVER INTERFACE

       The  server listens on TCP port 2583, which may be overridden using the
       -p port option. Commands are  a  single  line  each,  terminated  by  a
       newline.   The  server  can  handle  any  number of simultaneous client
       connections.

CLIENT INTERFACE COMMANDS

       See manual page for moncmd.

MON TRAPPING

       Mon has the facility to receive special "mon traps" from any  local  or
       remote machine.  Currently, the only available method for sending mon
       traps is through the Mon::Client Perl interface, though the UDP packet
       format  is  defined well enough to permit the writing of traps in other
       languages.

       Traps are handled similarly to monitors: a trap  sends  an  operational
       status,  summary line, and description text, and mon generates an alert
       or upalert as necessary.

       Traps can be caught by any  watch/service  group  set  up  in  the  mon
       configuration   file,  however  it  is  suggested  that  you  configure
       watch/service groups specifically for the traps you expect to  receive.
       When defining a special watch/service group for traps, do not include a
       "monitor" directive (as no monitor need be invoked). Since a monitor is
       not being invoked, it is not necessary for the watch definition to have
       a hostgroup which contains real host names.   Just  make  up  a  useful
       name, and mon will automatically create the watch group for you.

       Here is a simple config file example:

              watch trap-service
                   service host1-disks
                        description TRAP: for host1 disk status
                        period wd {Sun-Sat}
                             alert mail.alert someone@your.org
                             upalert mail.alert -u someone@your.org

       Since  mon  listens  on  a UDP port for any trap, a default facility is
       available for handling traps to unknown groups or services.  To  enable
       this  facility,  you  must  include  a  "default"  watch  group  with a
       "default" service entry containing  the  specifics  of  alarms.   If  a
       default/default  watch  group  and  service  are  not  configured, then
       unknown traps get logged via syslog, and no alarm is sent.   NOTE:  The
       default/default  facility  is  a single entity as far as accounting and
       alarming go. Alarm programs which are not aware of this fact  may  send
       confusing  information  when  a  failure  trap  comes from one machine,
       followed by a success (ok) trap from a different machine. See the alarm
       environment  variable MON_TRAP_INTENDED above for a possible way around
       this. It is intended that default/default be  used  as  a  facility  to
       catch  unknown  traps, and should not be relied upon to catch all traps
       in a production environment. If you are  lazy  and  only  want  to  use
       default/default  for  catching  all  traps, it would be best to disable
       upalerts, and use the MON_TRAP_INTENDED environment variable  in  alert
       scripts to make the alerts more meaningful to you.

       Here is an example default facility:

              watch default
                   service default
                        description Default trap service
                        period wd {Sun-Sat}
                             alert mail.alert someone@your.org
                             upalert mail.alert -u someone@your.org

EXAMPLES

       The  mon  distribution  comes  with  an  example  configuration  called
       example.cf.  Refer to that file for more information.

SEE ALSO

       moncmd(1), Time::Period(3pm), Mon::Client(3pm)

HISTORY

       mon was written because I couldn't find anything  out  there  that  did
       just what I needed, and nothing was worth modifying to add the features
       I wanted. It doesn't have a cool name, and that bothers  me  because  I
       couldn't think of one.

BUGS

       Report bugs to the email address below.

AUTHOR

       Jim Trocki <trockij@arctic.org>