Provided by: watchdog_5.15-2_amd64 bug

NAME

       watchdog - a software watchdog daemon

SYNOPSIS

       watchdog    [-F|--foreground]    [-f|--force]    [-c    filename|--config-file   filename]
       [-v|--verbose] [-s|--sync] [-b|--softboot] [-q|--no-action]

DESCRIPTION

       The Linux kernel can reset the system if serious  problems  are  detected.   This  can  be
       implemented  via  special watchdog hardware, or via a slightly less reliable software-only
       watchdog inside the kernel. Either way, there needs to be a daemon that tells  the  kernel
       the system is working fine. If the daemon stops doing that, the system is reset.

       watchdog is such a daemon. It opens /dev/watchdog, and keeps writing to it often enough to
       keep the kernel from resetting, at least once per minute. Each  write  delays  the  reboot
       time  another  minute.  After  a minute of inactivity the watchdog hardware will cause the
       reset. In the case of the software watchdog the ability to reboot will depend on the state
       of the machines and interrupts.

       The watchdog daemon can be stopped without causing a reboot if the device /dev/watchdog is
       closed correctly, unless your kernel is compiled with the CONFIG_WATCHDOG_NOWAYOUT  option
       enabled.

TESTS

       The watchdog daemon does several tests to check the system status:

       •  Is the process table full?

       •  Is there enough free memory?

       •  Is there enough allocatable memory?

       •  Are some files accessible?

       •  Have some files changed within a given interval?

       •  Is the average work load too high?

       •  Has a file table overflow occurred?

       •  Is a process still running? The process is specified by a pid file.

       •  Do some IP addresses answer to ping?

       •  Do network interfaces receive traffic?

       •  Is the temperature too high? (Temperature data not always available.)

       •  Execute a user defined command to do arbitrary tests.

       •  Execute  one or more test/repair commands found in /etc/watchdog.d.  These commands are
          called with the argument test or repair.

       If any of these checks fail watchdog will cause a shutdown.  Should  any  of  these  tests
       except  the  user defined binary last longer than one minute the machine will be rebooted,
       too.

OPTIONS

       Available command line options are the following:

       -v, --verbose
              Set verbose mode. Only implemented if compiled with SYSLOG feature. This mode  will
              log  each  several  infos in LOG_DAEMON with priority LOG_DEBUG.  This is useful if
              you want to see exactly what happened  until  the  watchdog  rebooted  the  system.
              Currently it logs the temperature (if available), the load average, the change date
              of the files it checks and how often it went to sleep.

       -s, --sync
              Try to synchronize the filesystem every time the process is awake.  Note  that  the
              system is rebooted if for any reason the synchronizing lasts longer than a minute.

       -b, --softboot
              Soft-boot  the system if an error occurs during the main loop, e.g. if a given file
              is not accessible via the stat(2) call. Note  that  this  does  not  apply  to  the
              opening  of  /dev/watchdog and /proc/loadavg, which are opened before the main loop
              starts. Now this is implemented by disabling the error re-try timer.

       -F, --foreground
              Run in foreground mode, useful for running under systemd (for example).

       -f, --force
              Force the usage of the interval given or the maximal  load  average  given  in  the
              config file. Without this option these values are sanity checked.

       -c config-file, --config-file config-file
              Use    config-file   as   the   configuration   file   instead   of   the   default
              /etc/watchdog.conf.

       -q, --no-action
              Do not reboot or halt the machine. This is for testing  purposes.  All  checks  are
              executed  and  the  results are logged as usual, but no action is taken.  Also your
              hardware card or the kernel software watchdog driver is  not  enabled.  NOTE:  This
              still  allows  'repair'  actions  to  run, but the daemon itself will not attempt a
              reboot.

       -X num, --loop-exit num
              Run for 'num' loops then exit as if SIGTERM was received. Intended  for  test/debug
              (e.g.  using  valgrind  for  checking memory access). If the daemon exits on a loop
              counter and you have the CONFIG_WATCHDOG_NOWAYOUT option compiled for the kernel or
              device-driver then an unplanned reboot will follow - be warned!

FUNCTION

       After  watchdog  starts,  it  puts  itself  into  the background and then tries all checks
       specified in its configuration file in turn. Between each two tests it will write  to  the
       kernel  device  to  prevent  a reset. After finishing all tests watchdog goes to sleep for
       some time. The kernel drivers expects  a  write  to  the  watchdog  device  every  minute.
       Otherwise  the  system  will  be reset.  watchdog will sleep for a configure interval that
       defaults to 1 second to make sure it triggers the device early enough.

       Under high system load watchdog might be swapped out of memory and may  fail  to  make  it
       back  in  in  time.  Under these circumstances the Linux kernel will reset the machine. To
       make sure you won't get unnecessary reboots make sure you have the variable  realtime  set
       to  yes in the configuration file watchdog.conf.  This adds real time support to watchdog:
       it will lock itself into memory and there should  be no problem even under the highest  of
       loads.

       On  system  running  out  of  memory  the kernel will try to free enough memory by killing
       process. The watchdog daemon itself is exempted from this so-called out-of-memory killer.

       Also you can specify a maximal allowed load average. Once this load average is reached the
       system  is  rebooted.  You may specify maximal load averages for 1 minute, 5 minutes or 15
       minutes. The default values is to disable this test. Be careful not to set this  parameter
       too  low.  To set a value less then the predefined minimal value of 2, you have to use the
       -f option.

       You can also specify a minimal amount of virtual memory you  want  to  have  available  as
       free.  As soon as more virtual memory is used action is taken by watchdog.  Note, however,
       that watchdog does not distinguish between different types of memory usage. It just checks
       for free virtual memory.

       If  you  have  a  machine  with  temperature sensor(s) you can specify the maximal allowed
       temperature. Once this temperature is reached on any sensor the system is powered off. The
       default  value  is  90 C. Typically the temperature information is provided by the sensors
       package as files in the virtual  filesystem  /sys/device  and  can  be  found  using,  for
       example, the command

           find /sys -name 'temp*input' -print

       These  files  hold the temperature in milli-Celsius. You can have multiple sensors used in
       the config file. For example to change to 75C maximum and to check two virtual  files  for
       the system temperature you might have this:

           max-temperature = 75
           temperature-sensor = /sys/class/hwmon/hwmon0/device/temp1_input
           temperature-sensor = /sys/class/hwmon/hwmon0/device/temp2_input

       The  watchdog  will  issue warnings once the temperature increases 90%, 95% and 98% of the
       configured maximum temperature.

       When using file mode watchdog will try to stat(2) the given files. Errors returned by stat
       will  not cause a reboot. For a reboot the stat call has to last at least the re-try time-
       out value (default 1 minute).  This may happen if the file is located on  an  NFS  mounted
       filesystem.  If your system relies on an NFS mounted filesystem you might try this option.
       However, in such a case the sync option may not work if the NFS server is not answering.

       watchdog can read the pid from a pid file and see whether the  process  still  exists.  If
       not,  action  is  taken by watchdog.  So you can for instance restart the server from your
       repair-binary.

       watchdog will try periodically to fork itself to see whether the process  table  is  full.
       This  process  will  leave  a zombie process until watchdog wakes up again and catches it;
       this is harmless, don't worry about it.

       In ping mode watchdog tries to ping the given IPv4 addresses. These addresses do not  have
       to be a single machine. It is possible to ping to a broadcast address instead to see if at
       least one machine in a subnet is still living.

       Do not use this broadcast ping unless your MIS person a) knows about it and b)  has  given
       you explicit permission to use it!

       watchdog will send out three ping packages and wait up to <interval> seconds for the reply
       with <interval> being the time it goes to sleep between two times triggering the  watchdog
       device. Thus a unreachable network will not cause a hard reset but a soft reboot.

       You  can  also  test  passively  for  an  unreachable  network  by just monitoring a given
       interface for traffic. If no traffic arrives the network is considered unreachable causing
       a soft reboot or action from the repair binary.

       watchdog  can  run  an  external command for user-defined tests. A return code not equal 0
       means an error occurred and watchdog should react. If the external command is killed by an
       uncaught  signal this is considered an error by watchdog too.  The command may take longer
       than the time slice defined for the  kernel  device  without  a  problem.  However,  error
       messages are generated into the syslog facility. If you have enabled softboot on error the
       machine will be rebooted if the binary doesn't exit  in  half  the  time  watchdog  sleeps
       between two tries triggering the kernel device.

       If  you specify a repair binary it will be started instead of shutting down the system. If
       this binary is not able to fix the problem watchdog will still cause a reboot afterwards.

       If the machine is halted an email is sent to notify a human  that  the  machine  is  going
       down.  Starting  with  version  4.4  watchdog  will also notify the human in charge if the
       machine is rebooted.

       The re-try timer applies to most errors, except reset/reboot calls and too hot.  It allows
       a  given  error source to recover, and treats most tests in this way.  Exceptions are file
       handle test, load averages, and system memory. If set to the minimum time of 1  second  it
       will still allow a single re-try at any polling interval of the system.

SOFT REBOOT

       A  soft  reboot (i.e. controlled shutdown and reboot) is initiated for every error that is
       found. Since there might be no more processes available, watchdog does it all by  himself.
       That means:

       1.  Kill all processes with SIGTERM.

       2.  After a short pause kill all remaining processes with SIGKILL.

       3.  Record a shutdown entry in wtmp.

       4.  Save  the random seed from /dev/urandom.  If the device is non-existant or there is no
           filename for saving this step is skipped.

       5.  Turn off accounting.

       6.  Turn off quota and swap.

       7.  Unmount all partitions except the root partition.

       8.  Remount the root partition read-only.

       9.  Shut down all network interfaces.

       10. Finally reboot.

CHECK BINARY

       If the return code of the check binary is not zero  watchdog  will  assume  an  error  and
       reboot  the  system.  Be  careful  with  this if you are using the real-time properties of
       watchdog since watchdog will wait for the return of this binary before proceeding. An exit
       code  smaller  than  245 is interpreted as an system error code (see errno.h for details).
       Values of 245 or larger than are special to watchdog:

       255 (based on -1 as unsigned 8-bit number)
              Reboot the system. This is not exactly an error message but a command to  watchdog.
              If  the  return  code  is  this  the watchdog will not try to run a shutdown script
              instead.

       254    Reset the system. This is not exactly an error message but a command  to  watchdog.
              If  the  return  code  is  this the watchdog will attempt to hard-reset the machine
              without attempting any sort of orderly stopping  of  process,  unmounting  of  file
              systems, etc.

       253    Maximum load average exceeded.

       252    The temperature inside is too high.

       251    /proc/loadavg contains no (or not enough) data.

       250    The given file was not changed in the given interval.

       249    /proc/meminfo contains invalid data.

       248    Child process was killed by a signal.

       247    Child process did not return in time.

       246    Free for personal watchdog-specific use (was -10 as an unsigned 8-bit number).

       245    Reserved  for  an  unknown result, for example a slow background test that is still
              running so neither a success nor an error.

REPAIR BINARY

       The repair binary is started with one parameter: the error number that caused watchdog  to
       initiate the boot process. After trying to repair the system the binary should exit with 0
       if the system was successfully repaired and thus there is  no  need  to  boot  anymore.  A
       return  value  not  equal 0 tells watchdog to reboot. The return code of the repair binary
       should be the error number of the error causing watchdog to reboot. Be careful  with  this
       if  you are using the real-time properties since watchdog will wait for the return of this
       binary before proceeding.

       The configuration file parameter repair-maximum controls the number of  successive  repair
       attempts  that  report  0  (i.e.  success)  but fail to clear the tested fault. If this is
       exceeded then a reboot takes place. If set to zero then a reboot can always be blocked  by
       the repair program reporting success.

TEST DIRECTORY

       Executables  placed  in  the  test directory are discovered by watchdog on startup and are
       automatically executed.  They are bounded  time-wise  by  the  test-timeout  directive  in
       watchdog.conf.

       These  executables are called with either "test" as the first argument (if a test is being
       performed) or "repair" as the first argument (if a repair for a  previously-failed  "test"
       operation on is being performed).

       The  as  with test binaries and repair binaries, expected exit codes for a successful test
       or repair operation is always zero.

       If an executable's test operation fails, the same executable is automatically called  with
       the "repair" argument as well as the return code of the previously-failed test operation.

       For example, if the following execution returns 42:

           /etc/watchdog.d/my-test test

       The watchdog daemon will attempt to repair the problem by calling:

           /etc/watchdog.d/my-test repair 42

       This  enables  administrators  and  application developers to make intelligent test/repair
       commands.  If the "repair" operation is not required (or is not likely to succeed), it  is
       important that the author of the command return a non-zero value so the machine will still
       reboot as expected.

       Note that the watchdog daemon may interpret and act upon any of the reserved return  codes
       noted in the Check Binary section prior to calling a given command in "repair" mode.

       As  for  the  repair  binary, the configuration parameter repair-maximum also controls the
       number of successive repair attempts that report success (return 0) but fail to clear  the
       fault.

BUGS

       None known so far.

AUTHORS

       The original code is an example written by Alan Cox <alan@lxorguk.ukuu.org.uk>, the author
       of the kernel driver. All additions were written by  Michael  Meskes  <meskes@debian.org>.
       Johnie  Ingram  <johnie@netgod.net> had the idea of testing the load average. He also took
       over the Debian  specific  work.  Dave  Cinege  <dcinege@psychosis.com>  brought  up  some
       hardware watchdog issues and helped testing this stuff.

FILES

       /dev/watchdog
              The watchdog device.

       /var/run/watchdog.pid
              The pid file of the running watchdog.

SEE ALSO

       watchdog.conf(5)