Ubuntu Manpage: analysis.cfg - Configuration file for the xymond

NAME

       analysis.cfg - Configuration file for the xymond_client module

SYNOPSIS

       ~Xymon/server/etc/analysis.cfg

DESCRIPTION

       The  analysis.cfg  file  controls  what  color is assigned to the status-messages that are
       generated from the Xymon client data - typically the cpu, disk, memory, procs-  and  msgs-
       columns.  Color  is  decided  on the basis of some settings defined in this file; settings
       apply to specific hosts through a set of rules.

       Note: This file is only used on the Xymon server - it is not used by the Xymon client,  so
       there is no need to distribute it to your client systems.

FILE FORMAT

       Blank lines and lines starting with a hash mark (#) are treated as comments and ignored.

CPU STATUS COLUMN SETTINGS

LOAD warnlevel paniclevel

If the system load exceeds "warnlevel" or "paniclevel", the "cpu" status will go yellow or
red, respectively. These are decimal numbers.

Defaults: warnlevel=5.0, paniclevel=10.0

UP bootlimit toolonglimit [color]

The cpu status goes yellow/red if the system has been up for less than "bootlimit" time,
or longer than "toolonglimit". The time is in minutes, or you can add h/d/w for
hours/days/weeks - e.g. "2h" for two hours, or "4w" for 4 weeks.

Defaults: bootlimit=1h, toolonglimit=-1 (infinite), color=yellow.

CLOCK max.offset [color]

The cpu status goes yellow/red if the system clock on the client differs more than
"max.offset" seconds from that of the Xymon server. Note that this is not a particularly
accurate test, since it is affected by network delays between the client and the server,
and the load on both systems. You should therefore not rely on this being accurate to more
than +/- 5 seconds, but it will let you catch a client clock that goes completely wrong.
The default is NOT to check the system clock.

NOTE: Correct operation of this test obviously requires that the system clock of the Xymon
server is correct. You should therefore make sure that the Xymon server is synchronized to
the real clock, e.g. by using NTP.

Example: Go yellow if the load average exceeds 5, and red if it exceeds 10. Also, go
yellow for 10 minutes after a reboot, and after 4 weeks uptime. Finally, check that the
system clock is at most 15 seconds offset from the clock of the Xymon system and go red if
that is exceeded.

LOAD 5 10
UP 10m 4w yellow
CLOCK 15 red

DISK STATUS COLUMN SETTINGS

       DISK filesystem warnlevel paniclevel
       DISK filesystem IGNORE
       INODE filesystem warnlevel paniclevel
       INODE filesystem IGNORE

       If the utilization of "filesystem" is reported to exceed "warnlevel" or "paniclevel",  the
       "disk"  status  will  go  yellow  or  red, respectively.  "warnlevel" and "paniclevel" are
       either the percentage used, or the space available as reported by the local  "df"  command
       on the host.  For the latter type of check, the "warnlevel" must be followed by the letter
       "U", e.g. "1024U".

       The special keyword "IGNORE" causes this filesystem to be ignored completely, i.e. it will
       not  appear  in  the  "disk"  status column and it will not be tracked in a graph. This is
       useful for e.g. removable devices, backup-disks and similar hardware.

       "filesystem" is the mount-point where the filesystem is mounted, e.g.  "/usr" or  "/home".
       A  filesystem-name  that  begins  with  "%"  is  interpreted  as a Perl-compatible regular
       expression; e.g. "%^/oracle.*/" will match any filesystem  whose  mountpoint  begins  with
       "/oracle".

       "INODE" works identically to "DISK", but uses the count of i-nodes in the filesystem instead
       of the amount of disk space.

       Defaults DISK: warnlevel=90%, paniclevel=95%

       Defaults INODE: warnlevel=70%, paniclevel=90%
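
       Example (a sketch using hypothetical mount points and limits): Go yellow at 80% and red at
       90% usage of "/var", let every filesystem mounted below "/oracle" fill up to 95% and 98%,
       skip a removable device entirely, and check i-node usage on "/home".

              DISK /var 80 90
              DISK %^/oracle.*/ 95 98
              DISK /media/usb IGNORE
              INODE /home 70 90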

MEMORY STATUS COLUMN SETTINGS

       MEMPHYS warnlevel paniclevel
       MEMACT warnlevel paniclevel
       MEMSWAP warnlevel paniclevel

       If  the  memory  utilization  exceeds the "warnlevel" or "paniclevel", the "memory" status
       will change to yellow or red, respectively.  Note: The words "PHYS", "ACT" and "SWAP"  are
       also recognized.

       Example: Go yellow if more than 20% swap is used, and red if more than 40% swap is used or
       the actual memory utilisation exceeds 90%. Don't alert on physical memory usage.

              MEMSWAP 20 40
              MEMACT 90 90
              MEMPHYS 101 101

       Defaults:

              MEMPHYS warnlevel=100 paniclevel=101 (i.e. it will never go red).
              MEMSWAP warnlevel=50 paniclevel=80
              MEMACT  warnlevel=90 paniclevel=97

PROCS STATUS COLUMN SETTINGS

PROC processname minimumcount maximumcount color [TRACK=id] [TEXT=text]

The "ps" listing sent by the client will be scanned for how many processes containing
"processname" are running, and this is then matched against the min/max settings defined
here. If the running count is outside the thresholds, the color of the "procs" status
changes to "color".

To check for a process that must NOT be running: Set minimum and maximum to 0.

"processname" can be a simple string, in which case this string must show up in the "ps"
listing as a command. The scanner will find a ps-listing of e.g. "/usr/sbin/cron" if you
only specify "processname" as "cron". "processname" can also be a Perl-compatible
regular expression, e.g. "%java.*inst[0123]" can be used to find entries in the ps-
listing for "java -Xmx512m inst2" and "java -Xmx256 inst3". In that case, "processname"
must begin with "%" followed by the regular expression. Note that Xymon defaults to case-
insensitive pattern matching; if that is not what you want, put "(?-i)" between the "%"
and the regular expression to turn this off. E.g. "%(?-i)HTTPD" will match the word HTTPD
only when it is upper-case.

If "processname" contains whitespace (blanks or TAB), you must enclose the full string in
double quotes - including the "%" if you use regular expression matching. E.g.

PROC "%xymond_channel --channel=data.*xymond_rrd" 1 1 yellow

PROC "java -DCLASSPATH=/opt/java/lib" 2 5

You can have multiple "PROC" entries for the same host; all of the checks are merged into
the "procs" status, and the most severe check defines the color of the status.

The optional TRACK=id setting causes Xymon to track the number of processes found in an
RRD file, and put this into a graph which is shown on the "procs" status display. The id
setting is a simple text string which will be used as the legend for the graph, and also
as part of the RRD filename. It is recommended that you use only letters and digits for
the ID.

Note that the tracked process counts are sampled only once per client poll cycle - i.e.
the counts represent snapshots of the system state, not an average value over the poll
cycle. Peaks or dips in the actual process counts will therefore not show up in the graphs
if they happen while the Xymon client is not polling.

The optional TEXT=text setting is used in the summary of the "procs" status. Normally, the
summary will show the "processname" to identify the process and the related count and
limits. But this may be a regular expression which is not easily recognizable, so if
defined, the text setting string will be used instead. This only affects the "procs"
status display - it has no effect on how the rule counts or recognizes processes in the
"ps" output.

Example: Check that "cron" is running:
PROC cron

Example: Check that at least 5 "httpd" processes are running, but not more than 20:
PROC httpd 5 20

Defaults:
mincount=1, maxcount=-1 (unlimited), color="red".
Note that no processes are checked by default.
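
Example (a sketch; the process names, limits and the TRACK/TEXT values are only
illustrative): Go red if any "telnetd" process is running. Go yellow unless one to four of
the Java instances matched by the regular expression are running, track the count in a
graph, and show a readable name in the status summary:
PROC telnetd 0 0 red
PROC %java.*inst[0123] 1 4 yellow TRACK=java TEXT=Java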

MSGS STATUS COLUMN SETTINGS

LOG logfilename pattern [COLOR=color] [IGNORE=excludepattern] [OPTIONAL]

The Xymon client extracts interesting lines from one or more logfiles - see the client-
local.cfg(5) man-page for information about how to configure which logs a client should
look at.

The LOG setting determines how these extracts of log entries are processed, and what
warnings or alerts trigger as a result.

"logfilename" is the name of the logfile. Only logentries from this filename will be
matched against this rule. Note that "logfilename" can be a regular expression (if
prefixed with a '%' character).

"pattern" is a string or regular expression. If the logfile data matches "pattern", it
will trigger the "msgs" column to change color. If no "color" parameter is present, the
default is to go "red" when the pattern is matched. To match against a regular expression,
"pattern" must begin with a '%' sign - e.g. "%WARNING|NOTICE" will match any lines
containing either of these two words. Note that Xymon defaults to case-insensitive
pattern matching; if that is not what you want, put "(?-i)" between the "%" and the
regular expression to turn this off. E.g. "%(?-i)WARNING" will match the word WARNING only
when it is upper-case.

"excludepattern" is a string or regular expression that can be used to filter out any
unwanted strings that happen to match "pattern".

The OPTIONAL keyword causes the check to be skipped if the logfile does not exist.

Example: Trigger a red alert when the string "ERROR" appears in the "/var/log/syslog"
file:
LOG /var/log/syslog ERROR

Example: Trigger a yellow warning on all occurrences of the word "WARNING" or "NOTICE" in
the "daemon.log" file, except those from the "lpr" system:
LOG /var/log/daemon.log %WARNING|NOTICE COLOR=yellow IGNORE=lpr

Defaults:
color="red", no "excludepattern".

Note that no logfiles are checked by default. Any log data reported by a client will just
show up on the "msgs" column with status OK (green).
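
Example (a sketch with a hypothetical application log directory): Go red on the upper-case
word "FATAL" in any logfile below "/var/log/myapp", ignore lines that mention "selftest",
and skip the check on hosts where the logfile does not exist:
LOG %^/var/log/myapp/ %(?-i)FATAL COLOR=red IGNORE=selftest OPTIONAL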

FILES STATUS COLUMN SETTINGS

FILE filename [color] [things to check] [OPTIONAL] [TRACK]

DIR directoryname [color] [size<MAXSIZE] [size>MINSIZE] [TRACK]

These entries control the status of the "files" column. They allow you to check on various
data for files and directories.

filename and directoryname are names of files or directories, with a full path. You can
use a regular expression to match the names of files and directories reported by the
client, if you prefix the expression with a '%' character.

color is the color that triggers when one or more of the checks fail.

The OPTIONAL keyword causes this check to be skipped if the file does not exist. E.g. you
can use this to check if files that should be temporary are not deleted, by checking that
they are not older than the max time you would expect them to stick around, and then using
OPTIONAL to ignore the state where no files exist.

The TRACK keyword causes the size of the file or directory to be tracked in an RRD file,
and presented in a graph on the "files" status display.

For files, you can check one or more of the following:

noexist
triggers a warning if the file exists. By default, a warning is triggered for files
that have a FILE entry, but which do not exist.

ifexist
only checks the file if it exists. If the file is reported as missing by the
client, it is ignored.

type=TYPE
where TYPE is one of "file", "dir", "char", "block", "fifo", or "socket". Triggers
warning if the file is not of the specified type.

ownerid=OWNER
triggers a warning if the owner does not match what is listed here. OWNER is
specified either with the numeric uid, or the user name.

groupid=GROUP
triggers a warning if the group does not match what is listed here. GROUP is
specified either with the numeric gid, or the group name.

mode=MODE
triggers a warning if the file permissions are not as listed. MODE is written in
the standard octal notation, e.g. "644" for the rw-r--r-- permissions.

size<MAX.SIZE and size>MIN.SIZE
triggers a warning if the file size is greater than MAX.SIZE or less than MIN.SIZE,
respectively. For filesizes, you can use the letters "K", "M", "G" or "T" to
indicate that the filesize is in Kilobytes, Megabytes, Gigabytes or Terabytes,
respectively. If there is no such modifier, Kilobytes is assumed. E.g. to warn if a
file grows larger than 1MB, use size<1M.

mtime>MIN.MTIME mtime<MAX.MTIME
checks how long ago the file was last modified (in seconds). E.g. to check if a
file was updated within the past 10 minutes (600 seconds): mtime<600. Or to check
that a file has NOT been updated in the past 24 hours: mtime>86400.

mtime=TIMESTAMP
checks if a file was last modified at TIMESTAMP. TIMESTAMP is a unix epoch time
(seconds since midnight Jan 1 1970 UTC).

ctime>MIN.CTIME, ctime<MAX.CTIME, ctime=TIMESTAMP
works like the mtime checks, but for the ctime timestamp (when the directory entry of
the file was last changed, e.g. by chown, chgrp or chmod).

md5=MD5SUM, sha1=SHA1SUM, rmd160=RMD160SUM
and so on for SHA256, SHA512, SHA224, and SHA384: triggers a warning if the
file checksum using the specified message digest algorithm does not match the one
configured here. Note: The "file" entry in the client-local.cfg(5) file must
specify which algorithm to use as that is the only one that will be sent.

For directories, you can check one or more of the following:

size<MAX.SIZE and size>MIN.SIZE
triggers a warning if the directory size is greater than MAX.SIZE or less than
MIN.SIZE, respectively. Directory sizes are reported in whatever unit the du
command on the client uses - often KB or diskblocks - so MAX.SIZE and MIN.SIZE must
be given in the same unit.

Experience shows that it can be difficult to get these rules right, especially when
defining minimum/maximum values for file sizes, when they were last modified etc. The one
thing you must remember when setting up these checks is that the rules describe criteria
that must be met - only when they are met will the status be green.

So "mtime<600" means "the difference between current time and the mtime of the file must
be less than 600 seconds - if not, the file status will go red".
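
Example (a sketch; all paths and limits are hypothetical): Go red if "/etc/passwd" is not a
regular file owned by root with mode 644. Go yellow if the application log grows beyond
100 MB, and track its size in a graph. Go yellow if a lock file has not been modified for
more than one hour, but do not complain when it is absent. Finally, go yellow if the spool
directory exceeds 500000 units as reported by "du" on the client:
FILE /etc/passwd red type=file ownerid=root mode=644
FILE /var/log/myapp.log yellow size<100M TRACK
FILE /var/run/myapp.lock yellow mtime<3600 OPTIONAL
DIR /var/spool/myapp yellow size<500000 TRACK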

PORTS STATUS COLUMN SETTINGS

       PORT criteria [MIN=mincount] [MAX=maxcount] [COLOR=color] [TRACK=id] [TEXT=displaytext]

       The  "netstat"  listing  sent by the client will be scanned for how many sockets match the
       criteria listed.  The criteria you can use are:

       LOCAL=addr
              "addr" is a (partial) local address specification in the format used on the  output
              from netstat.

       EXLOCAL=addr
              Exclude certain local addresses from the rule.

       REMOTE=addr
              "addr" is a (partial) remote address specification in the format used on the output
              from netstat.

       EXREMOTE=addr
              Exclude certain remote addresses from the rule.

       STATE=state
              Causes only the sockets in the specified state to be included; "state" is usually
              LISTEN or ESTABLISHED but can be any socket state reported by the client's
              "netstat" command.

       EXSTATE=state
              Exclude certain states from the rule.

       "addr" is typically "10.0.0.1:80" for the IP 10.0.0.1, port 80.  Or "*:80" for  any  local
       address,  port  80.  Note that the Xymon clients normally report only the numeric data for
       IP-addresses and port-numbers, so you must specify the port number (e.g. "80") instead  of
       the service name ("www").

       "addr" and "state" can also be a Perl-compatible regular expression, e.g.
       "LOCAL=%[.:](80|443)" can be used to find entries in the netstat local port for both http
       (port 80) and https (port 443). In that case, "addr" or "state" must begin with "%"
       followed by the regular expression.

       The socket count found is then matched against the min/max settings defined here.  If  the
       count  is  outside the thresholds, the color of the "ports" status changes to "color".  To
       check for a socket that must NOT exist: Set minimum and maximum to 0.

       The optional TRACK=id setting causes Xymon to track the number of sockets found in an  RRD
       file,  and  put  this  into  a  graph which is shown on the "ports" status display. The id
       setting is a simple text string which will be used as the legend for the graph,  and  also
       as  part  of  the RRD filename. It is recommended that you use only letters and digits for
       the ID.

       Note that the tracked socket counts are sampled only once per client poll cycle - i.e.
       the counts represent snapshots of the system state, not an average value over the poll
       cycle. Peaks or dips in the actual socket counts will therefore not show up in the graphs
       if they happen while the Xymon client is not polling.

       The TEXT=displaytext option affects how the port appears on the "ports"  status  page.  By
       default,  the port is listed with the local/remote/state rules as identification, but this
       may be somewhat difficult to understand. You can then use e.g. "TEXT=Secure Shell" to make
       these ports appear with the name "Secure Shell" instead.

       Defaults: mincount=1, maxcount=-1 (unlimited), color="red".  Note: No ports are checked by
       default.

       Example: Check that the SSH daemon is listening on port 22. Track the number of active SSH
       connections, and warn if there are more than 5.
               PORT LOCAL=%[.:]22$ STATE=LISTEN "TEXT=SSH listener"
               PORT LOCAL=%[.:]22$ STATE=ESTABLISHED MAX=5 TRACK=ssh TEXT=SSH

SVCS status (Microsoft Windows clients)

       SVC servicename status=(started|stopped) [startup=automatic|disabled|manual]

DS - RRD based status override

DS column filename:dataset rules COLOR=colorname TEXT=explanation

"column" is the statuscolumn that will be modified. "filename" is the name of the RRD file
holding the data you use for comparison. "dataset" is the name of the dataset in the RRD
file - the "rrdtool info" command is useful when determining these. "rules" determine
when to apply the override. You can use ">", ">=", "<" or "<=" to compare the current
measurement value against one or more thresholds. "explanation" is a text that will be
shown to explain the override - you can use some placeholders in the text: "&N" is
replaced with the name of the dataset, "&V" is replaced with the current value, "&L" is
replaced by the low threshold, "&U" is replaced with the upper threshold.

NOTE: This rule uses the raw data value from a client to examine the rules. So this type
of test is only really suitable for datasets that are of the "GAUGE" type. It cannot be
used meaningfully for datasets that use "COUNTER" or "DERIVE" - e.g. the datasets that are
used to capture network packet traffic - because the data stored in the RRD for COUNTER-
based datasets undergo a transformation (calculation) when going into the RRD. Xymon does
not have direct access to the calculated data.

Example: Flag the "conn" status as yellow if the response time exceeds 100 msec.
DS conn tcp.conn.rrd:sec >0.1 COLOR=yellow TEXT="Response time &V exceeds &U seconds"

MQ Series SETTINGS

       MQ_QUEUE queuename [age-warning=N] [age-critical=N] [depth-warning=N] [depth-critical=N]
       MQ_CHANNEL channelname [warning=PATTERN] [alert=PATTERN]

       This is a set of checks for checking the health of IBM MQ message-queues.  It requires the
       "mq.sh" or similar collector module to run on a node with access to the MQ "Queue Manager"
       so it can report the status of queues and channels.

       The  MQ_QUEUE  setting checks the health of a single queue: You can warn (yellow) or alert
       (red) based on the depth of the queue, and/or the age of the oldest entry  in  the  queue.
       These values are taken directly from the output generated by the "runmqsc" utility.

       The  MQ_CHANNEL  setting  checks  the health of a single MQ channel: You can warn or alert
       based on the reported status of the channel. The PATTERN is a normal pattern, i.e.  either
       a list of status keywords, or a regular expression pattern.
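
       Example (a sketch; the queue and channel names are hypothetical, the age values are
       assumed to be in whatever unit "runmqsc" reports, and the channel status keywords are
       assumed to appear as shown in the collector output): Warn when the queue holds more than
       100 messages or its oldest entry is older than 300, and alert at 500 messages or an age
       of 900. Warn while the channel is retrying, and alert when it is stopped.

              MQ_QUEUE ORDERS.IN age-warning=300 age-critical=900 depth-warning=100 depth-critical=500
              MQ_CHANNEL TO.PARTNER warning=RETRYING alert=STOPPED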

CHANGING THE DEFAULT SETTINGS

       If you would like to use different defaults for the settings described above, then you can
       define the new defaults after a DEFAULT line. E.g. this would explicitly define all of the
       default settings:

              DEFAULT
                   UP      1h
                   LOAD    5.0 10.0
                   DISK    * 90 95
                   MEMPHYS 100 101
                   MEMSWAP 50 80
                   MEMACT  90 97

RULES TO SELECT HOSTS

All of the settings can be applied to a group of hosts, by preceding them with rules. A
rule consists of one or more filters using these keywords (note that this is identical to
the rule definitions used in the alerts.cfg(5) file).

PAGE=targetstring Rule matching an alert by the name of the page in Xymon. "targetstring"
is the path of the page as defined in the hosts.cfg file. E.g. if you have this setup:

page servers All Servers
subpage web Webservers
10.0.0.1 www1.foo.com
subpage db Database servers
10.0.0.2 db1.foo.com

Then the "All Servers" page is found with PAGE=servers, the "Webservers" page is
PAGE=servers/web and the "Database servers" page is PAGE=servers/db. Note that you can
also use regular expressions to specify the page name, e.g. PAGE=%.*/db would find the
"Database servers" page regardless of where this page was placed in the hierarchy.

The top-level page has the fixed name /, e.g. PAGE=/ would match all hosts on the Xymon
frontpage. If you need it in a regular expression, use PAGE=%^/ to avoid matching the
forward-slash present in subpage-names.

EXPAGE=targetstring Rule excluding a host if the pagename matches.

HOST=targetstring Rule matching a host by the hostname. "targetstring" is either a comma-
separated list of hostnames (from the hosts.cfg file), "*" to indicate "all hosts", or a
Perl-compatible regular expression. E.g. "HOST=dns.foo.com,www.foo.com" identifies two
specific hosts; "HOST=%www.*.foo.com EXHOST=www-test.foo.com" matches all hosts with a
name beginning with "www", except the "www-test" host.

EXHOST=targetstring Rule excluding a host by matching the hostname.

CLASS=classname Rule matching by the client class-name. You specify the class-name for a host
when starting the client through the "--class=NAME" option to the runclient.sh script. If
no class is specified, the host by default goes into a class named by the operating
system.

EXCLASS=classname Exclude all hosts belonging to "classname" from this rule.

DISPLAYGROUP=groupstring Rule matching an alert by the text of the display-group (text
following the group, group-only, group-except heading) in the hosts.cfg file.
"groupstring" is the text for the group, stripped of any HTML tags. E.g. if you have this
setup:

group Web
10.0.0.1 www1.foo.com
10.0.0.2 www2.foo.com
group Production databases
10.0.1.1 db1.foo.com

Then the hosts in the Web-group can be matched with DISPLAYGROUP=Web, and the database
servers can be matched with DISPLAYGROUP="Production databases". Note that you can also
use regular expressions, e.g. DISPLAYGROUP=%database. If there is no group-setting for
the host, use "DISPLAYGROUP=NONE".

EXDISPLAYGROUP=groupstring Rule excluding a group by matching the display-group string.

TIME=timespecification Rule matching by the time-of-day. This is specified as the DOWNTIME
time specification in the hosts.cfg file. E.g. "TIME=W:0800:2200" applied to a rule will
make this rule active only on week-days between 8AM and 10PM.

EXTIME=timespecification Rule excluding by the time-of-day. This is also specified as the
DOWNTIME time specification in the hosts.cfg file. E.g. "EXTIME=W:0400:0600" applied to a
rule will keep this rule from applying on week-days between 4AM and 6AM. This applies on
top of any TIME= specification, so both must match.

DIRECTING ALERTS TO GROUPS

       For  some tests - e.g. "procs" or "msgs" - the right group of people to alert in case of a
       failure may be different, depending on which of  the  client  rules  actually  detected  a
       problem.  E.g.  if  you  have  PROC  rules  for  a  host checking both "httpd" and "sshd"
       processes, then the Web admins should handle httpd-failures, whereas "sshd"  failures  are
       handled by the Unix admins.

       To  handle  this,  all  rules can have a "GROUP=groupname" setting.  When a rule with this
       setting triggers a yellow or red status, the groupname is passed on to  the  Xymon  alerts
       module,  so you can use it in the alert rule definitions in alerts.cfg(5) to direct alerts
       to the correct group of people.
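
       Example (a sketch; the group names are hypothetical and must match what the alert rules
       in alerts.cfg(5) refer to, and GROUP= is assumed to go at the end of the setting):

              PROC httpd 5 20 red GROUP=webadmins
              PROC sshd 1 -1 red GROUP=unixadmins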

RULES: APPLYING SETTINGS TO SELECTED HOSTS

       Rules must be placed after the settings, e.g.

              LOAD 8.0 12.0  HOST=db.foo.com TIME=*:0800:1600

       If you have multiple settings that you want to apply the same rules to, you can write the
       rules on a line of their own, followed by the settings. E.g.

              HOST=%db.*.foo.com TIME=W:0800:1600
                   LOAD 8.0 12.0
                   DISK /db  98 100
                   PROC mysqld 1

       will  apply  the three settings to all of the "db" hosts on week-days between 8AM and 4PM.
       This can be combined with per-setting rules, in which case the per-setting rule overrides
       the general rule; e.g.

              HOST=%.*.foo.com
                   LOAD 7.0 12.0 HOST=bax.foo.com
                   LOAD 3.0 8.0

       will  result in the load-limits being 7.0/12.0 for the "bax.foo.com" host, and 3.0/8.0 for
       all other foo.com hosts.

       The entire file is evaluated from the top to bottom, and the first match found is used. So
       you should put the specific settings first, and the generic ones last.
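
       Example (a sketch with hypothetical hosts and limits): Because the first match wins, the
       host-specific setting comes first, then the setting for the database servers page, and
       the catch-all setting last.

              LOAD 12.0 20.0 HOST=batch.foo.com
              LOAD 8.0 12.0  PAGE=servers/db TIME=W:0800:2200
              LOAD 5.0 10.0  HOST=*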

NOTES

       For  the  LOG, FILE and DIR checks, it is necessary also to configure the actual file- and
       directory-names in the client-local.cfg(5) file. If the filenames are  not  listed  there,
       the  clients  will not collect any data about these files/directories, and the settings in
       the analysis.cfg file will be silently ignored.

       The ability to compute file checksums with MD5, SHA1 or RMD160  should  not  be  used  for
       general-purpose  file  integrity  checking,  since  the overhead of calculating these on a
       large number of files can be significant. If you need this, look  at  tools  designed  for
       this purpose - e.g. Tripwire or AIDE.

       At  the  time  of  writing  (April  2006),  the SHA-1 and RMD160 algorithms are considered
       cryptographically safe. The MD5 algorithm has been shown to have some weaknesses,  and  is
       not considered strong enough when a high level of security is required.