Provided by: watchdog_5.16-1_amd64 bug

NAME

       watchdog - a software watchdog daemon

SYNOPSIS

       watchdog  [-F|--foreground]  [-f|--force] [-c filename|--config-file filename] [-v|--verbose] [-s|--sync]
       [-b|--softboot] [-q|--no-action]

DESCRIPTION

       The Linux kernel can reset the system if serious problems are detected.   This  can  be  implemented  via
       special  watchdog  hardware,  or  via  a slightly less reliable software-only watchdog inside the kernel.
       Either way, there needs to be a daemon that tells the kernel the system is working fine.  If  the  daemon
       stops doing that, the system is reset.

       watchdog  is  such  a  daemon.  It  opens /dev/watchdog, and keeps writing to it often enough to keep the
       kernel from resetting, at least once per minute. Each write delays the reboot time another minute.  After
       a  minute  of inactivity the watchdog hardware will cause the reset. In the case of the software watchdog
       the ability to reboot will depend on the state of the machines and interrupts.

       The watchdog daemon can be stopped without causing  a  reboot  if  the  device  /dev/watchdog  is  closed
       correctly, unless your kernel is compiled with the CONFIG_WATCHDOG_NOWAYOUT option enabled.

TESTS

       The watchdog daemon does several tests to check the system status:

       •  Is the process table full?

       •  Is there enough free memory?

       •  Is there enough allocatable memory?

       •  Are some files accessible?

       •  Have some files changed within a given interval?

       •  Is the average work load too high?

       •  Has a file table overflow occurred?

       •  Is a process still running? The process is specified by a pid file.

       •  Do some IP addresses answer to ping?

       •  Do network interfaces receive traffic?

       •  Is the temperature too high? (Temperature data not always available.)

       •  Execute a user defined command to do arbitrary tests.

       •  Execute one or more test/repair commands found in /etc/watchdog.d.  These commands are called with the
          argument test or repair.

       If any of these checks fail watchdog will cause a shutdown. Should any of these  tests  except  the  user
       defined binary last longer than one minute the machine will be rebooted, too.

OPTIONS

       Available command line options are the following:

       -v, --verbose
              Set  verbose  mode.  Only  implemented  if  compiled  with SYSLOG feature. This mode will log each
              several infos in LOG_DAEMON with priority LOG_DEBUG.  This is useful if you want  to  see  exactly
              what  happened  until  the  watchdog  rebooted  the  system. Currently it logs the temperature (if
              available), the load average, the change date of the files it checks and  how  often  it  went  to
              sleep. You can use this twice to enable some more verbose debug message for testing.

       -s, --sync
              Try  to  synchronize  the  filesystem  every  time  the  process is awake. Note that the system is
              rebooted if for any reason the synchronizing lasts longer than a minute.

       -b, --softboot
              Soft-boot the system if an error occurs during the  main  loop,  e.g.  if  a  given  file  is  not
              accessible via the stat(2) call. Note that this does not apply to the opening of /dev/watchdog and
              /proc/loadavg, which are opened before the main loop starts. Now this is implemented by  disabling
              the error re-try timer.

       -F, --foreground
              Run in foreground mode, useful for running under systemd (for example).

       -f, --force
              Force  the  usage  of  the  interval  given  or the maximal load average given in the config file.
              Without this option these values are sanity checked.

       -c config-file, --config-file config-file
              Use config-file as the configuration file instead of the default /etc/watchdog.conf.

       -q, --no-action
              Do not reboot or halt the machine. This is for testing purposes. All checks are executed  and  the
              results  are  logged  as  usual,  but  no  action is taken.  Also your hardware card or the kernel
              software watchdog driver is not enabled. NOTE: This still allows 'repair' actions to run, but  the
              daemon itself will not attempt a reboot.

       -X num, --loop-exit num
              Run  for  'num'  loops  then  exit as if SIGTERM was received. Intended for test/debug (e.g. using
              valgrind for checking memory access). If the daemon exits on a  loop  counter  and  you  have  the
              CONFIG_WATCHDOG_NOWAYOUT  option compiled for the kernel or device-driver then an unplanned reboot
              will follow - be warned!

FUNCTION

       After watchdog starts, it puts itself into the background and then tries  all  checks  specified  in  its
       configuration file in turn. Between each two tests it will write to the kernel device to prevent a reset.
       After finishing all tests watchdog goes to sleep for some time. The kernel drivers expects a write to the
       watchdog  device  every  minute. Otherwise the system will be reset.  watchdog will sleep for a configure
       interval that defaults to 1 second to make sure it triggers the device early enough.

       Under high system load watchdog might be swapped out of memory and may fail to make it back in  in  time.
       Under these circumstances the Linux kernel will reset the machine. To make sure you won't get unnecessary
       reboots make sure you have the variable realtime set to yes  in  the  configuration  file  watchdog.conf.
       This  adds real time support to watchdog: it will lock itself into memory and there should  be no problem
       even under the highest of loads.

       On system running out of memory the kernel will try  to  free  enough  memory  by  killing  process.  The
       watchdog daemon itself is exempted from this so-called out-of-memory killer.

       Also  you  can  specify  a  maximal allowed load average. Once this load average is reached the system is
       rebooted. You may specify maximal load averages for 1 minute, 5 minutes or 15 minutes. The default values
       is  to  disable  this  test.  Be  careful not to set this parameter too low. To set a value less then the
       predefined minimal value of 2, you have to use the -f option.

       You can also specify a minimal amount of virtual memory you want to have available as free.  As  soon  as
       more  virtual  memory  is  used  action  is  taken  by  watchdog.   Note, however, that watchdog does not
       distinguish between different types of memory usage. It just checks for free virtual memory.

       If you have a machine with temperature sensor(s) you can specify the maximal  allowed  temperature.  Once
       this temperature is reached on any sensor the system is powered off. The default value is 90 C. Typically
       the temperature information is provided by the  sensors  package  as  files  in  the  virtual  filesystem
       /sys/device and can be found using, for example, the command

           find /sys -name 'temp*input' -print

       These files hold the temperature in milli-Celsius. You can have multiple sensors used in the config file.
       For example to change to 75C maximum and to check two virtual files for the system temperature you  might
       have this:

           max-temperature = 75
           temperature-sensor = /sys/class/hwmon/hwmon0/device/temp1_input
           temperature-sensor = /sys/class/hwmon/hwmon0/device/temp2_input

       The  watchdog  will  issue  warnings  once  the  temperature increases 90%, 95% and 98% of the configured
       maximum temperature.

       When using file mode watchdog will try to stat(2) the given files. Errors returned by stat will not cause
       a  reboot.  For a reboot the stat call has to last at least the re-try time-out value (default 1 minute).
       This may happen if the file is located on an NFS mounted filesystem. If your  system  relies  on  an  NFS
       mounted  filesystem  you  might try this option.  However, in such a case the sync option may not work if
       the NFS server is not answering.

       watchdog can read the pid from a pid file and see whether the process still exists.  If  not,  action  is
       taken by watchdog.  So you can for instance restart the server from your repair-binary.

       watchdog will try periodically to fork itself to see whether the process table is full. This process will
       leave a zombie process until watchdog wakes up again and catches it; this is harmless, don't worry  about
       it.

       In  ping mode watchdog tries to ping the given IPv4 addresses. These addresses do not have to be a single
       machine. It is possible to ping to a broadcast address instead to see if at least one machine in a subnet
       is still living.

       Do  not  use  this  broadcast ping unless your MIS person a) knows about it and b) has given you explicit
       permission to use it!

       watchdog will send out three ping packages  and  wait  up  to  <interval>  seconds  for  the  reply  with
       <interval>  being  the  time  it  goes  to sleep between two times triggering the watchdog device. Thus a
       unreachable network will not cause a hard reset but a soft reboot.

       You can also test passively for an unreachable network by just monitoring a given interface for  traffic.
       If  no  traffic  arrives  the  network is considered unreachable causing a soft reboot or action from the
       repair binary.

       watchdog can run an external command for user-defined tests. A return code not equal  0  means  an  error
       occurred  and  watchdog  should  react.  If  the external command is killed by an uncaught signal this is
       considered an error by watchdog too.  The command may take longer than the time  slice  defined  for  the
       kernel  device  without a problem. However, error messages are generated into the syslog facility. If you
       have enabled softboot on error the machine will be rebooted if the binary doesn't exit in half  the  time
       watchdog sleeps between two tries triggering the kernel device.

       If  you specify a repair binary it will be started instead of shutting down the system. If this binary is
       not able to fix the problem watchdog will still cause a reboot afterwards.

       If the machine is halted an email is sent to notify a human that the machine is going down. Starting with
       version 4.4 watchdog will also notify the human in charge if the machine is rebooted.

       The  re-try timer applies to most errors, except reset/reboot calls and too hot.  It allows a given error
       source to recover, and treats most tests in this way.  Exceptions are file handle  test,  load  averages,
       and  system  memory.  If  set  to the minimum time of 1 second it will still allow a single re-try at any
       polling interval of the system.

SOFT REBOOT

       A soft reboot (i.e. controlled shutdown and reboot) is initiated for every error  that  is  found.  Since
       there might be no more processes available, watchdog does it all by himself. That means:

       1.  Kill all processes with SIGTERM.

       2.  After a short pause kill all remaining processes with SIGKILL.

       3.  Record a shutdown entry in wtmp.

       4.  Save  the  random  seed from /dev/urandom.  If the device is non-existant or there is no filename for
           saving this step is skipped.

       5.  Turn off accounting.

       6.  Turn off quota and swap.

       7.  Unmount all partitions

       8.  Finally reboot.

CHECK BINARY

       If the return code of the check binary is not zero watchdog will assume an error and reboot  the  system.
       Be  careful  with this if you are using the real-time properties of watchdog since watchdog will wait for
       the return of this binary before proceeding. An exit code smaller than 245 is interpreted  as  an  system
       error code (see errno.h for details). Values of 245 or larger than are special to watchdog:

       255    (based on -1 as unsigned 8-bit number) Reboot the system. This is not exactly an error message but
              a command to watchdog.  If the return code is this the watchdog will not try  to  run  a  shutdown
              script instead.

       254    Reset  the  system. This is not exactly an error message but a command to watchdog.  If the return
              code is this the watchdog will attempt to hard-reset the machine without attempting  any  sort  of
              orderly stopping of process, unmounting of file systems, etc.

       253    Maximum load average exceeded.

       252    The temperature inside is too high.

       251    /proc/loadavg contains no (or not enough) data.

       250    The given file was not changed in the given interval.

       249    /proc/meminfo contains invalid data.

       248    Child process was killed by a signal.

       247    Child process did not return in time.

       246    Free for personal watchdog-specific use (was -10 as an unsigned 8-bit number).

       245    Reserved  for  an  unknown  result,  for  example  a slow background test that is still running so
              neither a success nor an error.

REPAIR BINARY

       The repair binary is started with one parameter: the error number that caused watchdog  to  initiate  the
       boot  process.  After  trying  to  repair  the  system  the  binary  should exit with 0 if the system was
       successfully repaired and thus there is no need to boot  anymore.  A  return  value  not  equal  0  tells
       watchdog  to reboot. The return code of the repair binary should be the error number of the error causing
       watchdog to reboot. Be careful with this if you are using the real-time properties  since  watchdog  will
       wait for the return of this binary before proceeding.

       The  configuration  file  parameter repair-maximum controls the number of successive repair attempts that
       report 0 (i.e. success) but fail to clear the tested fault. If this  is  exceeded  then  a  reboot  takes
       place. If set to zero then a reboot can always be blocked by the repair program reporting success.

TEST DIRECTORY

       Executables  placed  in  the  test  directory are discovered by watchdog on startup and are automatically
       executed.  They are bounded time-wise by the test-timeout directive in watchdog.conf.

       These executables are called with either "test" as the first argument (if a test is being  performed)  or
       "repair"  as  the  first  argument  (if  a  repair  for  a previously-failed "test" operation on is being
       performed).

       As with test binaries and repair binaries, expected exit codes for a successful test or repair  operation
       is always zero.

       If  an  executable's  test operation fails, the same executable is automatically called with the "repair"
       argument as well as the return code of the previously-failed test operation.

       For example, if the following execution returns 42:

           /etc/watchdog.d/my-test test

       The watchdog daemon will attempt to repair the problem by calling:

           /etc/watchdog.d/my-test repair 42

       This enables administrators and application developers to make intelligent test/repair commands.  If  the
       "repair"  operation is not required (or is not likely to succeed), it is important that the author of the
       command return a non-zero value so the machine will still reboot as expected.

       Note that the watchdog daemon may interpret and act upon any of the reserved return codes  noted  in  the
       Check Binary section prior to calling a given command in "repair" mode.

       As  for  the  repair  binary,  the  configuration  parameter  repair-maximum  also controls the number of
       successive repair attempts that report success (return 0) but fail to clear the fault.

BUGS

       None known so far.

AUTHORS

       The original code is an example written by Alan Cox <alan@lxorguk.ukuu.org.uk>, the author of the  kernel
       driver.   All   additions   were   written   by   Michael   Meskes   <meskes@debian.org>.  Johnie  Ingram
       <johnie@netgod.net> had the idea of testing the load average. He also took over the Debian specific work.
       Dave  Cinege  <dcinege@psychosis.com>  brought  up  some hardware watchdog issues and helped testing this
       stuff.

FILES

       /dev/watchdog
              The watchdog device.

       /var/run/watchdog.pid
              The pid file of the running watchdog.

SEE ALSO

       watchdog.conf(5)