Ubuntu Manpage: analysis.cfg - Configuration file for the xymond

NAME

       analysis.cfg - Configuration file for the xymond_client module

SYNOPSIS

       ~Xymon/server/etc/analysis.cfg

DESCRIPTION

       The  analysis.cfg file controls what color is assigned to the status-messages that are generated from the
       Xymon client data - typically the cpu, disk, memory, procs- and msgs-columns. Color  is  decided  on  the
       basis of some settings defined in this file; settings apply to specific hosts through a set of rules.

       Note:  This  file  is  only used on the Xymon server - it is not used by the Xymon client, so there is no
       need to distribute it to your client systems.

FILE FORMAT

       Blank lines and lines starting with a hash mark (#) are treated as comments and ignored.

CPU STATUS COLUMN SETTINGS

LOAD warnlevel paniclevel

If the system load exceeds "warnlevel" or "paniclevel", the "cpu" status will go yellow or red,
respectively. These are decimal numbers.

Defaults: warnlevel=5.0, paniclevel=10.0

UP bootlimit toolonglimit [color]

The cpu status goes yellow/red if the system has been up for less than "bootlimit" time, or longer than
"toolonglimit". The time is in minutes, or you can add h/d/w for hours/days/weeks - e.g. "2h" for two
hours, or "4w" for four weeks.

Defaults: bootlimit=1h, toolonglimit=-1 (infinite), color=yellow.

CLOCK max.offset [color]

The cpu status goes yellow/red if the system clock on the client differs more than "max.offset" seconds
from that of the Xymon server. Note that this is not a particularly accurate test, since it is affected
by network delays between the client and the server, and the load on both systems. You should therefore
not rely on this being accurate to more than +/- 5 seconds, but it will let you catch a client clock that
goes completely wrong. The default is NOT to check the system clock.

NOTE: Correct operation of this test requires that the system clock of the Xymon server is
correct. You should therefore make sure that the Xymon server is synchronized to an accurate time source,
e.g. by using NTP.

Example: Go yellow if the load average exceeds 5, and red if it exceeds 10. Also, go yellow for 10
minutes after a reboot, and after 4 weeks uptime. Finally, check that the system clock is at most 15
seconds offset from the clock of the Xymon system and go red if that is exceeded.

       LOAD 5 10
       UP 10m 4w yellow
       CLOCK 15 red

DISK STATUS COLUMN SETTINGS

       DISK filesystem warnlevel paniclevel
       DISK filesystem IGNORE
       INODE filesystem warnlevel paniclevel
       INODE filesystem IGNORE

       If the utilization of "filesystem" is reported to exceed "warnlevel" or "paniclevel", the  "disk"  status
       will go yellow or red, respectively.  "warnlevel" and "paniclevel" are either the percentage used, or the
       space  available  as  reported  by the local "df" command on the host.  For the latter type of check, the
       "warnlevel" must be followed by the letter "U", e.g. "1024U".

       The special keyword "IGNORE" causes this filesystem to be ignored completely, i.e. it will not appear  in
       the  "disk"  status  column  and  it  will  not  be tracked in a graph. This is useful for e.g. removable
       devices, backup-disks and similar hardware.

       "filesystem" is the mount-point where the filesystem is mounted, e.g.  "/usr" or "/home".  A  filesystem-
       name  that  begins  with  "%" is interpreted as a Perl-compatible regular expression; e.g. "%^/oracle.*/"
       will match any filesystem whose mountpoint begins with "/oracle".

       "INODE" works identically to "DISK", but uses the count of i-nodes in the filesystem instead of the
       amount of disk space.

       Defaults DISK: warnlevel=90%, paniclevel=95%
       Defaults INODE: warnlevel=70%, paniclevel=90%
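
       As an illustrative sketch of the syntax above (the mount points are hypothetical and the thresholds are
       arbitrary percentages chosen for the example), a set of disk rules might look like this:

              # Hypothetical mount points and thresholds
              DISK /home 85 95
              DISK %^/oracle.*/ 95 98
              DISK /mnt/backup IGNORE
              INODE /var 80 95

       Read against the descriptions above, "/home" would go yellow above 85% and red above 95% utilization,
       any mount point beginning with "/oracle" would get the higher limits, "/mnt/backup" would be left out
       of the "disk" status entirely, and "/var" would be checked on its i-node usage.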

MEMORY STATUS COLUMN SETTINGS

       MEMPHYS warnlevel paniclevel
       MEMACT warnlevel paniclevel
       MEMSWAP warnlevel paniclevel

       If  the  memory  utilization  exceeds the "warnlevel" or "paniclevel", the "memory" status will change to
       yellow or red, respectively.  Note: The words "PHYS", "ACT" and "SWAP" are also recognized.

       Example: Go yellow if more than 20% swap is used, and red if more than 40% swap is  used  or  the  actual
       memory utilisation exceeds 90%. Don't alert on physical memory usage.

              MEMSWAP 20 40
              MEMACT 90 90
              MEMPHYS 101 101

       Defaults:

              MEMPHYS warnlevel=100 paniclevel=101 (i.e. it will never go red).
              MEMSWAP warnlevel=50 paniclevel=80
              MEMACT  warnlevel=90 paniclevel=97

PROCS STATUS COLUMN SETTINGS

PROC processname [minimumcount [maximumcount [color]]] [TRACK=id] [TEXT=text]

The "ps" listing sent by the client will be scanned for how many processes containing "processname" are
running, and this is then matched against the min/max settings defined here. If the running count is
outside the thresholds, the color of the "procs" status changes to "color".

To check for a process that must NOT be running: Set minimum and maximum to 0.

"processname" can be a simple string, in which case this string must show up in the "ps" listing as a
command. The scanner will find a ps-listing of e.g. "/usr/sbin/cron" if you only specify "processname" as
"cron". "processname" can also be a Perl-compatible regular expression, e.g. "%java.*inst[0123]" can
be used to find entries in the ps-listing for "java -Xmx512m inst2" and "java -Xmx256 inst3". In that
case, "processname" must begin with "%" followed by the regular expression. Note that Xymon defaults to
case-insensitive pattern matching; if that is not what you want, put "(?-i)" between the "%" and the
regular expression to turn this off. E.g. "%(?-i)HTTPD" will match the word HTTPD only when it is
upper-case.

If "processname" contains whitespace (blanks or TAB), you must enclose the full string in double quotes -
including the "%" if you use regular expression matching. E.g.

       PROC "%xymond_channel --channel=data.*xymond_rrd" 1 1 yellow

       PROC "java -DCLASSPATH=/opt/java/lib" 2 5

You can have multiple "PROC" entries for the same host; all of the checks are merged into the "procs"
status, and the most severe check defines the color of the status.

The optional TRACK=id setting causes Xymon to track the number of processes found in an RRD file, and put
this into a graph which is shown on the "procs" status display. The id setting is a simple text string
which will be used as the legend for the graph, and also as part of the RRD filename. It is recommended
that you use only letters and digits for the ID.
Note that the tracked process counts are sampled only once per client poll cycle - i.e. the counts
represent snapshots of the system state, not an average value over the client poll cycle. Therefore there
may be peaks or dips in the actual process counts which will not show up in the graphs, because they
happen while the Xymon client is not doing any polling.

The optional TEXT=text setting is used in the summary of the "procs" status. Normally, the summary will
show the "processname" to identify the process and the related count and limits. But this may be a
regular expression which is not easily recognizable, so if defined, the text setting string will be used
instead. This only affects the "procs" status display - it has no effect on how the rule counts or
recognizes processes in the "ps" output.

Example: Check that "cron" is running:

       PROC cron

Example: Check that at least 5 "httpd" processes are running, but not more than 20:

       PROC httpd 5 20

Defaults:
mincount=1, maxcount=-1 (unlimited), color="red".
Note that no processes are checked by default.
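
A sketch combining the options above. The process names, the TRACK id and the TEXT label are made up for
illustration, and quoting the whole TEXT setting follows the convention shown in the PORT example rather
than anything stated for PROC:

       # Hypothetical process names and labels
       PROC telnetd 0 0 red
       PROC %java.*inst[0123] 1 4 yellow TRACK=javainst "TEXT=Java instances"

The first line should turn "procs" red if any process matching "telnetd" is running, since both minimum
and maximum are 0. The second should go yellow unless one to four matching java processes are found,
graph the count under the id "javainst", and show "Java instances" in the summary in place of the regular
expression.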

MSGS STATUS COLUMN SETTINGS

LOG logfilename pattern [COLOR=color] [IGNORE=excludepattern] [OPTIONAL]

The Xymon client extracts interesting lines from one or more logfiles - see the client-local.cfg(5) man-
page for information about how to configure which logs a client should look at.

The LOG setting determines how these extracts of log entries are processed, and what warnings or alerts
are triggered as a result.

"logfilename" is the name of the logfile. Only log entries from this filename will be matched against this
rule. Note that "logfilename" can be a regular expression (if prefixed with a '%' character).

"pattern" is a string or regular expression. If the logfile data matches "pattern", it will trigger the
"msgs" column to change color. If no "color" parameter is present, the default is to go "red" when the
pattern is matched. To match against a regular expression, "pattern" must begin with a '%' sign - e.g.
"%WARNING|NOTICE" will match any lines containing either of these two words. Note that Xymon defaults to
case-insensitive pattern matching; if that is not what you want, put "(?-i)" between the "%" and the
regular expression to turn this off. E.g. "%(?-i)WARNING" will match the word WARNING only when it is
upper-case.

"excludepattern" is a string or regular expression that can be used to filter out any unwanted strings
that happen to match "pattern".

The OPTIONAL keyword causes the check to be skipped if the logfile does not exist.

Example: Trigger a red alert when the string "ERROR" appears in the "/var/log/syslog" file:

       LOG /var/log/syslog ERROR

Example: Trigger a yellow warning on all occurrences of the word "WARNING" or "NOTICE" in the
"daemon.log" file, except those from the "lpr" system:

       LOG /var/log/daemon.log %WARNING|NOTICE COLOR=yellow IGNORE=lpr
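
A further sketch, assuming a hypothetical application log "/var/log/myapp.log" that the client has been
configured to report; the patterns are likewise invented for the example:

       # Hypothetical logfile and patterns
       LOG /var/log/myapp.log %(?-i)FATAL COLOR=red IGNORE=selftest OPTIONAL

Read against the descriptions above, this should go red on lines containing upper-case "FATAL", pass over
lines that also match "selftest", and skip the check altogether if the logfile does not exist.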

Defaults:
color="red", no "excludepattern".

Note that no logfiles are checked by default. Any log data reported by a client will just show up on the
"msgs" column with status OK (green).

FILES STATUS COLUMN SETTINGS

FILE filename [color] [things to check] [OPTIONAL] [TRACK]

DIR directoryname [color] [size<MAXSIZE] [size>MINSIZE] [TRACK]

These entries control the status of the "files" column. They allow you to check on various data for files
and directories.

filename and directoryname are names of files or directories, with a full path. You can use a regular
expression to match the names of files and directories reported by the client, if you prefix the
expression with a '%' character.

color is the color that triggers when one or more of the checks fail.

The OPTIONAL keyword causes this check to be skipped if the file does not exist. E.g. you can use this to
check if files that should be temporary are not deleted, by checking that they are not older than the max
time you would expect them to stick around, and then using OPTIONAL to ignore the state where no files
exist.

The TRACK keyword causes the size of the file or directory to be tracked in an RRD file, and presented in
a graph on the "files" status display.

For files, you can check one or more of the following:

noexist
       triggers a warning if the file exists. By default, a warning is triggered for files that have a
       FILE entry, but which do not exist.

ifexist
       only checks the file if it exists. If the file is reported as missing by the client, it is
       ignored.

type=TYPE
       where TYPE is one of "file", "dir", "char", "block", "fifo", or "socket". Triggers a warning if the
       file is not of the specified type.

ownerid=OWNER
       triggers a warning if the owner does not match what is listed here. OWNER is specified either
       with the numeric uid, or the user name.

groupid=GROUP
       triggers a warning if the group does not match what is listed here. GROUP is specified either
       with the numeric gid, or the group name.

mode=MODE
       triggers a warning if the file permissions are not as listed. MODE is written in the standard
       octal notation, e.g. "644" for the rw-r--r-- permissions.

size<MAX.SIZE and size>MIN.SIZE
       triggers a warning if the file size is greater than MAX.SIZE or less than MIN.SIZE, respectively.
       For filesizes, you can use the letters "K", "M", "G" or "T" to indicate that the filesize is in
       Kilobytes, Megabytes, Gigabytes or Terabytes, respectively. If there is no such modifier,
       Kilobytes is assumed. E.g. to warn if a file grows larger than 1MB, use size<1M.

mtime>MIN.MTIME mtime<MAX.MTIME
       checks how long ago the file was last modified (in seconds). E.g. to check if a file was updated
       within the past 10 minutes (600 seconds): mtime<600. Or to check that a file has NOT been updated
       in the past 24 hours: mtime>86400.

mtime=TIMESTAMP
       checks if a file was last modified at TIMESTAMP. TIMESTAMP is a unix epoch time (seconds since
       midnight Jan 1 1970 UTC).

ctime>MIN.CTIME, ctime<MAX.CTIME, ctime=TIMESTAMP
       acts as the mtime checks, but for the ctime timestamp (when the directory entry of the file was
       last changed, e.g. by chown, chgrp or chmod).

md5=MD5SUM, sha1=SHA1SUM, rmd160=RMD160SUM
       and so on for SHA256, SHA512, SHA224, and SHA384: triggers a warning if the file checksum
       using the specified message digest algorithm does not match the one configured here. Note: The
       "file" entry in the client-local.cfg(5) file must specify which algorithm to use, as that is the
       only one that will be sent.

For directories, you can check one or more of the following:

size<MAX.SIZE and size>MIN.SIZE
       triggers a warning if the directory size is greater than MAX.SIZE or less than MIN.SIZE,
       respectively. Directory sizes are reported in whatever unit the du command on the client uses -
       often KB or diskblocks - so MAX.SIZE and MIN.SIZE must be given in the same unit.

Experience shows that it can be difficult to get these rules right, especially when defining
minimum/maximum values for file sizes, modification times etc. The one thing you must remember
when setting up these checks is that the rules describe criteria that must be met - only when they are
met will the status be green.

So "mtime<600" means "the difference between current time and the mtime of the file must be less than 600
seconds - if not, the file status will go red".
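
A sketch of how these checks might be combined; the file names, owner and limits are hypothetical:

       # Hypothetical files and limits
       FILE /etc/myapp.conf red type=file ownerid=root mode=644
       FILE /var/run/myapp.lock yellow mtime<3600 OPTIONAL
       FILE /etc/nologin red noexist
       DIR /var/spool/myapp yellow size<500000 TRACK

Since each check states a condition that must hold, the first rule should stay green only while the file
is a regular file owned by root with mode 644. The second should go yellow when the lock file exists but
was last modified more than an hour ago, and is skipped when the file is absent. The third should go red
if "/etc/nologin" exists. The fourth should go yellow once the directory size exceeds 500000, counted in
whatever unit the client's du command reports, and the size is graphed.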

PORTS STATUS COLUMN SETTINGS

       PORT criteria [MIN=mincount] [MAX=maxcount] [COLOR=color] [TRACK=id] [TEXT=displaytext]

       The  "netstat" listing sent by the client will be scanned for how many sockets match the criteria listed.
       The criteria you can use are:

       LOCAL=addr
              "addr" is a (partial) local address specification in the format used on the output from netstat.

       EXLOCAL=addr
              Exclude certain local addresses from the rule.

       REMOTE=addr
              "addr" is a (partial) remote address specification in the format used on the output from netstat.

       EXREMOTE=addr
              Exclude certain remote addresses from the rule.

       STATE=state
              Causes only the sockets in the specified state to be included. "state" is usually LISTEN or
              ESTABLISHED but can be any socket state reported by the client's "netstat" command.

       EXSTATE=state
              Exclude certain states from the rule.

       "addr"  is  typically  "10.0.0.1:80" for the IP 10.0.0.1, port 80.  Or "*:80" for any local address, port
       80. Note that the Xymon clients normally report only the numeric data for IP-addresses and  port-numbers,
       so you must specify the port number (e.g. "80") instead of the service name ("www").
       "addr" and "state" can also be a Perl-compatible regular expression, e.g. "LOCAL=%[.:](80|443)" can be
       used to find entries in the netstat local port for both http (port 80) and https (port 443). In that
       case, "addr" or "state" must begin with "%" followed by the regular expression.

       The socket count found is then matched against the min/max settings defined here. If the count is outside
       the  thresholds, the color of the "ports" status changes to "color".  To check for a socket that must NOT
       exist: Set minimum and maximum to 0.

       The optional TRACK=id setting causes Xymon to track the number of sockets found in an RRD file,  and  put
       this  into  a  graph which is shown on the "ports" status display. The id setting is a simple text string
       which will be used as the legend for the graph, and also as part of the RRD filename. It  is  recommended
       that you use only letters and digits for the ID.
       Note that the tracked socket counts are sampled only once per client poll cycle - i.e. the counts
       represent snapshots of the system state, not an average value over the client poll cycle. Therefore
       there may be peaks or dips in the actual socket counts which will not show up in the graphs, because
       they happen while the Xymon client is not doing any polling.

       The TEXT=displaytext option affects how the port appears on the "ports" status page. By default, the port
       is listed with the local/remote/state rules as identification, but this  may  be  somewhat  difficult  to
       understand.  You  can  then use e.g. "TEXT=Secure Shell" to make these ports appear with the name "Secure
       Shell" instead.

       Defaults: mincount=1, maxcount=-1 (unlimited), color="red".  Note: No ports are checked by default.

       Example: Check that the SSH daemon is listening on port 22. Track the number of active  SSH  connections,
       and warn if there are more than 5.
               PORT LOCAL=%[.:]22$ STATE=LISTEN "TEXT=SSH listener"
               PORT LOCAL=%[.:]22$ STATE=ESTABLISHED MAX=5 TRACK=ssh TEXT=SSH
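
       As a sketch of the "must NOT exist" case described above (port 23 and the display text are chosen only
       for illustration):

               # Hypothetical rule: nothing should be listening on port 23
               PORT LOCAL=%[.:]23$ STATE=LISTEN MIN=0 MAX=0 "TEXT=Telnet listener"

       With both limits at 0, any listening socket on port 23 should turn the "ports" status red, which is
       the default color.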

SVCS status (Microsoft Windows clients)

       SVC servicename status=(started|stopped) [startup=automatic|disabled|manual]

DS - RRD based status override

DS column filename:dataset rules COLOR=colorname TEXT=explanation

"column" is the statuscolumn that will be modified. "filename" is the name of the RRD file holding the
data you use for comparison. "dataset" is the name of the dataset in the RRD file - the "rrdtool info"
command is useful when determining these. "rules" determine when to apply the override. You can use ">",
">=", "<" or "<=" to compare the current measurement value against one or more thresholds. "explanation"
is a text that will be shown to explain the override - you can use some placeholders in the text: "&N" is
replaced with the name of the dataset, "&V" is replaced with the current value, "&L" is replaced by the
low threshold, "&U" is replaced with the upper threshold.

NOTE: This rule uses the raw data value from a client to examine the rules. So this type of test is only
really suitable for datasets that are of the "GAUGE" type. It cannot be used meaningfully for datasets
that use "COUNTER" or "DERIVE" - e.g. the datasets that are used to capture network packet traffic -
because the data stored in the RRD for COUNTER-based datasets undergo a transformation (calculation) when
going into the RRD. Xymon does not have direct access to the calculated data.

Example: Flag the "conn" status as yellow if the response time exceeds 100 msec.

       DS conn tcp.conn.rrd:sec >0.1 COLOR=yellow TEXT="Response time &V exceeds &U seconds"

MQ Series SETTINGS

       MQ_QUEUE queuename [age-warning=N] [age-critical=N] [depth-warning=N] [depth-critical=N]
       MQ_CHANNEL channelname [warning=PATTERN] [alert=PATTERN]

       This is a set of checks for checking the health of IBM MQ message-queues.  It  requires  the  "mq.sh"  or
       similar  collector  module  to  run  on a node with access to the MQ "Queue Manager" so it can report the
       status of queues and channels.

       The MQ_QUEUE setting checks the health of a single queue: You can warn (yellow) or alert (red)  based  on
       the  depth of the queue, and/or the age of the oldest entry in the queue. These values are taken directly
       from the output generated by the "runmqsc" utility.

       The MQ_CHANNEL setting checks the health of a single MQ channel: You can  warn  or  alert  based  on  the
       reported  status  of the channel. The PATTERN is a normal pattern, i.e. either a list of status keywords,
       or a regular expression pattern.
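
       A sketch of a queue rule using the syntax above. The queue name is hypothetical and the numbers are
       placeholders; the unit of the age thresholds is not stated here, so check what your collector reports
       before relying on them:

              # Hypothetical queue name; thresholds are placeholders
              MQ_QUEUE ORDERS.IN depth-warning=100 depth-critical=500 age-warning=300 age-critical=900

       The intent is a yellow status once the queue depth or the age of the oldest entry passes the warning
       values, and a red status once either passes the critical values.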

CHANGING THE DEFAULT SETTINGS

       If you would like to use different defaults for the settings described above, then you can define the new
       defaults after a DEFAULT line. E.g. this would explicitly define all of the default settings:

              DEFAULT
                   UP      1h
                   LOAD    5.0 10.0
                   DISK    * 90 95
                   MEMPHYS 100 101
                   MEMSWAP 50 80
                   MEMACT  90 97

RULES TO SELECT HOSTS

All of the settings can be applied to a group of hosts, by preceding them with rules. A rule consists of
one or more filters using these keywords (note that this is identical to the rule definitions used in the
alerts.cfg(5) file).

PAGE=targetstring
       Rule matching a host by the name of the page in Xymon. "targetstring" is the path of the page as
       defined in the hosts.cfg file. E.g. if you have this setup:

              page servers All Servers
              subpage web Webservers
              10.0.0.1 www1.foo.com
              subpage db Database servers
              10.0.0.2 db1.foo.com

       Then the "All Servers" page is found with PAGE=servers, the "Webservers" page is PAGE=servers/web and
       the "Database servers" page is PAGE=servers/db. Note that you can also use regular expressions to
       specify the page name, e.g. PAGE=%.*/db would find the "Database servers" page regardless of where
       this page was placed in the hierarchy.

       The top-level page has the fixed name /, e.g. PAGE=/ would match all hosts on the Xymon frontpage. If
       you need it in a regular expression, use PAGE=%^/ to avoid matching the forward-slash present in
       subpage-names.

EXPAGE=targetstring
       Rule excluding a host if the pagename matches.

HOST=targetstring
       Rule matching a host by the hostname. "targetstring" is either a comma-separated list of hostnames
       (from the hosts.cfg file), "*" to indicate "all hosts", or a Perl-compatible regular expression. E.g.
       "HOST=dns.foo.com,www.foo.com" identifies two specific hosts; "HOST=%www.*.foo.com
       EXHOST=www-test.foo.com" matches all hosts with a name beginning with "www", except the "www-test"
       host.

EXHOST=targetstring
       Rule excluding a host by matching the hostname.

CLASS=classname
       Rule matching by the client class-name. You specify the class-name for a host when starting the
       client through the "--class=NAME" option to the runclient.sh script. If no class is specified, the
       host by default goes into a class named by the operating system.

EXCLASS=classname
       Exclude all hosts belonging to "classname" from this rule.

DISPLAYGROUP=groupstring
       Rule matching a host by the text of the display-group (text following the group, group-only,
       group-except heading) in the hosts.cfg file. "groupstring" is the text for the group, stripped of
       any HTML tags. E.g. if you have this setup:

              group Web
              10.0.0.1 www1.foo.com
              10.0.0.2 www2.foo.com
              group Production databases
              10.0.1.1 db1.foo.com

       Then the hosts in the Web-group can be matched with DISPLAYGROUP=Web, and the database servers can be
       matched with DISPLAYGROUP="Production databases". Note that you can also use regular expressions,
       e.g. DISPLAYGROUP=%database. If there is no group-setting for the host, use "DISPLAYGROUP=NONE".

EXDISPLAYGROUP=groupstring
       Rule excluding a group by matching the display-group string.

TIME=timespecification
       Rule matching by the time-of-day. This is specified as the DOWNTIME time specification in the
       hosts.cfg file. E.g. "TIME=W:0800:2200" applied to a rule will make this rule active only on
       week-days between 8AM and 10PM.

EXTIME=timespecification
       Rule excluding by the time-of-day. This is also specified as the DOWNTIME time specification in the
       hosts.cfg file. E.g. "EXTIME=W:0400:0600" applied to a rule will make this rule inactive on
       week-days between 4AM and 6AM. This applies on top of any TIME= specification, so both must match.
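
A sketch of how several filters might be combined, each written after a setting on the same line. It
reuses the page, group and host names from the hosts.cfg fragments above; the class name "linux" and the
thresholds are assumptions made for the example:

       # Filters follow the setting on the same line
       LOAD 8.0 12.0 PAGE=servers/web
       PROC httpd 5 20 DISPLAYGROUP=Web TIME=W:0800:2200 EXTIME=W:1200:1300
       DISK /var 95 98 CLASS=linux EXHOST=db1.foo.com

The first line would apply only to hosts on the "Webservers" subpage. The second would apply to hosts in
the "Web" group on week-days between 8AM and 10PM, except between noon and 1PM. The third would apply to
every host in the assumed "linux" class other than db1.foo.com.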

DIRECTING ALERTS TO GROUPS

       For some tests - e.g. "procs" or "msgs" - the right group of people to alert in case of a failure may be
       different, depending on which of the client rules actually detected a problem. E.g. if you have PROC
       rules for a host checking both "httpd" and "sshd" processes, then the Web admins should handle "httpd"
       failures, whereas "sshd" failures are handled by the Unix admins.

       To handle this, all rules can have a "GROUP=groupname" setting.  When a rule with this setting triggers a
       yellow  or  red  status,  the groupname is passed on to the Xymon alerts module, so you can use it in the
       alert rule definitions in alerts.cfg(5) to direct alerts to the correct group of people.
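
       A sketch of the scenario above. The group names "webadmins" and "unixadmins" are hypothetical and would
       need matching entries in your alerts.cfg(5) rules:

              # Hypothetical group names
              PROC httpd 1 GROUP=webadmins
              PROC sshd 1 GROUP=unixadmins

       If no "httpd" process is found, the "procs" status should go red with "webadmins" passed on to the
       alerts module; a missing "sshd" would pass on "unixadmins" instead.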

RULES: APPLYING SETTINGS TO SELECTED HOSTS

       Rules must be placed after the settings, e.g.

              LOAD 8.0 12.0  HOST=db.foo.com TIME=*:0800:1600

       If you have multiple settings that you want to apply the same rules to, you can write the rules alone
       on one line, followed by the settings. E.g.

              HOST=%db.*.foo.com TIME=W:0800:1600
                   LOAD 8.0 12.0
                   DISK /db  98 100
                   PROC mysqld 1

       will apply the three settings to all of the "db" hosts on week-days between 8AM  and  4PM.  This  can  be
       combined with per-settings rule, in which case the per-settings rule overrides the general rule; e.g.

              HOST=%.*.foo.com
                   LOAD 7.0 12.0 HOST=bax.foo.com
                   LOAD 3.0 8.0

       will  result  in  the  load-limits  being  7.0/12.0 for the "bax.foo.com" host, and 3.0/8.0 for all other
       foo.com hosts.

       The entire file is evaluated from the top to bottom, and the first match found is used. So you should put
       the specific settings first, and the generic ones last.
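
       A sketch of that ordering, using the host names from the example above:

              # Specific rule first, generic rule last
              HOST=bax.foo.com
                   LOAD 7.0 12.0
              HOST=%.*.foo.com
                   LOAD 3.0 8.0

       Because the first match is used, "bax.foo.com" would get 7.0/12.0 from the first block. With the two
       blocks reversed, the generic rule would match that host first and it would get 3.0/8.0 instead.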

NOTES

       For the LOG, FILE and DIR checks, it is necessary also to configure the actual file- and  directory-names
       in  the client-local.cfg(5) file. If the filenames are not listed there, the clients will not collect any
       data about these files/directories, and the settings in the analysis.cfg file will be silently ignored.

       The ability to compute file checksums with MD5, SHA1 or RMD160 should not  be  used  for  general-purpose
       file  integrity  checking,  since  the  overhead  of  calculating these on a large number of files can be
       significant. If you need this, look at tools designed for this purpose - e.g. Tripwire or AIDE.

       At the time of writing (April 2006), the SHA-1 and RMD160 algorithms are considered cryptographically
       safe. The MD5 algorithm has been shown to have some weaknesses, and is not considered strong enough when
       a high level of security is required.