Provided by: cman_3.0.12-2ubuntu2_i386 bug


       qdisk - a disk-based quorum daemon for CMAN / Linux-Cluster

1. Overview

1.1 Problem

       In  some  situations,  it  may  be  necessary or desirable to sustain a
       majority node failure of a cluster without  introducing  the  need  for
       asymmetric  cluster  configurations  (e.g.  client-server,  or heavily-
       weighted voting nodes).

1.2. Design Requirements

       * Ability to sustain 1..(n-1)/n simultaneous node failures, without the
       danger  of  a simple network partition causing a split brain.  That is,
       we need to be able to ensure that the  majority  failure  case  is  not
       merely the result of a network partition.

       *  Ability  to use external reasons for deciding which partition is the
       the quorate partition in a partitioned cluster.  For  example,  a  user
       may  have  a  service running on one node, and that node must always be
       the master in the event of a network partition.  Or, a node might  lose
       all  network  connectivity  except  the cluster communication path - in
       which case, a user may wish that node to be evicted from the cluster.

       * Integration with CMAN.  We must not require CMAN to run with  us  (or
       without  us).   Linux-Cluster does not require a quorum disk normally -
       introducing new requirements on the base of how Linux-Cluster  operates
       is not allowed.

       * Data integrity.  In order to recover from a majority failure, fencing
       is required.  The fencing  subsystem  is  already  provided  by  Linux-

       *  Non-reliance  on  hardware  or  protocol specific methods (i.e. SCSI
       reservations).  This ensures the quorum disk algorithm can be  used  on
       the widest range of hardware configurations possible.

       *  Little  or  no  memory allocation after initialization.  In critical
       paths during failover, we do not want to  have  to  worry  about  being
       killed  during  a  memory  pressure situation because we request a page
       fault, and the Linux OOM killer responds...

1.3. Hardware Considerations and Requirements

1.3.1. Concurrent, Synchronous, Read/Write Access

       This quorum daemon requires  a  shared  block  device  with  concurrent
       read/write  access  from  all  nodes  in the cluster.  The shared block
       device can be a multi-port SCSI RAID array, a Fiber-Channel RAID SAN, a
       RAIDed  iSCSI target, or even GNBD.  The quorum daemon uses O_DIRECT to
       write to the device.

1.3.2. Bargain-basement JBODs need not apply

       There is a minimum performance requirement inherent  when  using  disk-
       based  cluster  quorum  algorithms, so design your cluster accordingly.
       Using a cheap JBOD with old SCSI2 disks on a multi-initiator  bus  will
       cause problems at the first load spike.  Plan your loads accordingly; a
       node's inability to write to the quorum disk in a  timely  manner  will
       cause  the  cluster  to  evict  the  node.   Using  host-RAID or multi-
       initiator  parallel  SCSI  configurations  with  the  qdisk  daemon  is
       unlikely  to  work,  and  will  probably  cause administrators a lot of
       frustration.   That  having  been  said,  because  the   timeouts   are
       configurable,  most  hardware  should work if the timeouts are set high

1.3.3. Fencing is Required

       In order to maintain data integrity under all failure scenarios, use of
       this  quorum  daemon  requires adequate fencing, preferably power-based
       fencing.  Watchdog timers and software-based solutions  to  reboot  the
       node   internally,   while  possibly  sufficient,  are  not  considered
       'fencing' for the purposes of using the quorum disk.

1.4. Limitations

       * At this time, this daemon supports a maximum of 16  nodes.   This  is
       primarily  a  scalability  issue:  As  we  increase  the node count, we
       increase the amount of synchronous I/O contention on the shared  quorum

       *  Cluster  node  IDs must be statically configured in cluster.conf and
       must be numbered from 1..16 (there can be gaps, of course).

       * Cluster node votes must all be 1.

       * CMAN must be running before the qdisk program  can  operate  in  full
       capacity.  If CMAN is not running, qdisk will wait for it.

       *  CMAN's eviction timeout should be at least 2x the quorum daemon's to
       give the quorum daemon adequate time to converge on a master  during  a
       failure  +  load  spike  situation.   See  section  3.3.1  for specific

       * For 'all-but-one'  failure  operation,  the  total  number  of  votes
       assigned  to  the  quorum device should be equal to or greater than the
       total number of node-votes in the cluster.  While  it  is  possible  to
       assign  only  one (or a few) votes to the quorum device, the effects of
       doing so have not been explored.

       * For 'tiebreaker'  operation  in  a  two-node  cluster,  unset  CMAN's
       two_node  flag  (or set it to 0), set CMAN's expected votes to '3', set
       each node's vote to '1', and leave qdisk's vote count unset.  This will
       allow  the  cluster  to  operate  if either both nodes are online, or a
       single node & the heuristics.

       * Currently, the quorum disk daemon is difficult to use  with  CLVM  if
       the  quorum  disk  resides  on  a CLVM logical volume.  CLVM requires a
       quorate cluster to correctly operate, which introduces  a  chicken-and-
       egg problem for starting the cluster: CLVM needs quorum, but the quorum
       daemon needs CLVM (if and only if  the  quorum  device  lies  on  CLVM-
       managed  storage).   One  way  to  work around this is to *not* set the
       cluster's expected votes to include the quorum daemon's  votes.   Bring
       all nodes online, and start the quorum daemon *after* the whole cluster
       is running.  This will allow the expected votes to increase  naturally.

2. Algorithms

2.1. Heartbeating & Liveliness Determination

       Nodes  update  individual  status  blocks on the quorum disk at a user-
       defined rate.  Each write of a status block alters the timestamp, which
       is  what other nodes use to decide whether a node has hung or not.  If,
       after a user-defined number of 'misses' (that is, failure to  update  a
       timestamp),  a  node  is  declared  offline.  After a certain number of
       'hits' (changed timestamp + "i am alive" state), the node  is  declared

       The  status block contains additional information, such as a bitmask of
       the nodes that node believes are online.  Some of this  information  is
       used  by the master - while some is just for performance recording, and
       may be used at a later time.  The most important pieces of  information
       a node writes to its status block are:

            - Timestamp
            - Internal state (available / not available)
            - Score
            -  Known  max  score  (may be used in the future to detect invalid
            - Vote/bid messages
            - Other nodes it thinks are online

2.2. Scoring & Heuristics

       The administrator can configure up to 10 purely  arbitrary  heuristics,
       and  must  exercise  caution  in doing so.  At least one administrator-
       defined heuristic is required for operation, but it is generally a good
       idea  to  have more than one heuristic.  By default, only nodes scoring
       over 1/2 of the total maximum score will claim they are  available  via
       the quorum disk, and a node (master or otherwise) whose score drops too
       low will remove itself (usually, by rebooting).

       The heuristics themselves can be any command  executable  by  'sh  -c'.
       For example, in early testing the following was used:

            <heuristic program="[ -f /quorum ]" score="10" interval="2"/>

       This is a literal sh-ism which tests for the existence of a file called
       "/quorum".  Without that file, the node would claim it was unavailable.
       This is an awful example, and should never, ever be used in production,
       but is provided as an example as to what one could do...

       Typically, the heuristics should be snippets of shell code or  commands
       which  help  determine  a  node's usefulness to the cluster or clients.
       Ideally, you want to add traces for all of  your  network  paths  (e.g.
       check  links,  or  ping routers), and methods to detect availability of
       shared storage.

2.3. Master Election

       Only one master is present at any one time in the  cluster,  regardless
       of  how many partitions exist within the cluster itself.  The master is
       elected by a simple voting  scheme  in  which  the  lowest  node  which
       believes  it  is  capable of running (i.e. scores high enough) bids for
       master status.  If the other nodes agree, it becomes the master.   This
       algorithm is run whenever no master is present.

       If another node comes online with a lower node ID while a node is still
       bidding for master status, it will rescind its bid  and  vote  for  the
       lower  node  ID.   If  a master dies or a bidding node dies, the voting
       algorithm is started over.  The voting algorithm  typically  takes  two
       passes to complete.

       Master  deaths  take  marginally longer to recover from than non-master
       deaths, because a new master must be elected before the old master  can
       be evicted & fenced.

2.4. Master Duties

       The  master  node  decides who is or is not in the master partition, as
       well as handles eviction of dead nodes (both via the  quorum  disk  and
       via  the  linux-cluster  fencing  system  by using the cman_kill_node()

2.5. How it All Ties Together

       When a master is present, and if the  master  believes  a  node  to  be
       online,  that  node  will  advertise  to  CMAN  that the quorum disk is
       available.  The master will only grant a node membership if:

            (a) CMAN believes the node to be online, and
            (b) that node has made enough consecutive, timely writes
                to the quorum disk, and
            (c) the node has a high enough score to consider itself online.

3. Configuration

3.1. The <quorumd> tag

       This tag is a child of the top-level <cluster> tag.

            This is the frequency of read/write cycles, in seconds.

            This is the number of cycles a node  must  miss  in  order  to  be
            declared  dead.   The  default for this number is dependent on the
            configured token timeout.

            This is the number of cycles a node must be seen in  order  to  be
            declared online.  Default is floor(tko/3).

            This  is the number of cycles a node must wait before initiating a
            bid for master status after heuristic scoring becomes  sufficient.
            The default is 2.  This can not be set to 0, and should not exceed

            This is the number of cycles a node must  wait  for  votes  before
            declaring   itself   master   after  making  a  bid.   Default  is
            floor(tko/2).  This can not be less than 2, must be  greater  than
            tko_up, and should not exceed tko.

            This  is  the number of votes the quorum daemon advertises to CMAN
            when it has a high enough score.  The default  is  the  number  of
            nodes  in  the cluster minus 1.  For example, in a 4 node cluster,
            the default is 3.  This value may change during normal  operation,
            for example when adding or removing a node from the cluster.

            This  controls  the  verbosity  of the quorum daemon in the system
            logs.  0 = emergencies; 7 = debug.  This option is deprecated.

            This controls the syslog facility used by the quorum  daemon  when
            logging.   For  a  complete  list  of  available  facilities,  see
            syslog.conf(5).  The default value for  this  is  'daemon'.   This
            option is deprecated.

            Write  internal  states  out  to this file periodically ("-" = use
            stdout).  This is primarily used for debugging.  The default value
            for this attribute is undefined.  This option can be changed while
            qdiskd is running.

            Absolute minimum score to be  consider  one's  self  "alive".   If
            omitted,  or  set  to  0, the default function "floor((n+1)/2)" is
            used, where n is the total of all  of  defined  heuristics'  score
            attribute.   This  must  never  exceed  the  sum  of the heuristic
            scores, or else the quorum disk will never be available.

            If set to 0 (off), qdiskd  will  *not*  reboot  after  a  negative
            transition  as  a  result  in a change in score (see section 2.2).
            The default for this value is 1 (on).  This option can be  changed
            while qdiskd is running.

            If  set to 1 (on), only the qdiskd master will advertise its votes
            to CMAN.  In a network  partition,  only  the  qdisk  master  will
            provide votes to CMAN.  Consequently, that node will automatically
            "win" in a fence race.

            This option requires careful  tuning  of  the  CMAN  timeout,  the
            qdiskd  timeout,  and  CMAN's quorum_dev_poll value.  As a rule of
            thumb, CMAN's quorum_dev_poll value should  be  equal  to  Totem's
            token  timeout  and qdiskd's timeout (interval*tko) should be less
            than half of Totem's token timeout.  See section  3.3.1  for  more

            This   option  only  takes  effect  if  there  are  no  heuristics
            configured.  Usage of this option in configurations with more than
            two cluster nodes is undefined and should not be done.

            In a two-node cluster with no heuristics and no defined vote count
            (see above), this mode is turned by default.  If enabled  in  this
            way  at  startup  and  a  node  is  later  added  to  the  cluster
            configuration or the vote count is set to a value  other  than  1,
            this mode will be disabled.

            If  set  to  0  (off), qdiskd will *not* instruct to kill nodes it
            thinks are dead (as a result of not writing to the  quorum  disk).
            The  default for this value is 1 (on).  This option can be changed
            while qdiskd is running.

            If set to 1 (on), qdiskd will watch internal timers and reboot the
            node  if it takes more than (interval * tko) seconds to complete a
            quorum disk pass.  The default for this value is  0  (off).   This
            option can be changed while qdiskd is running.

            If set to 1 (on), qdiskd will watch internal timers and reboot the
            node if qdisk is not able to write to disk after (interval *  tko)
            seconds.   The default for this value is 0 (off). If io_timeout is
            active max_error_cycles is overridden and set to off.

            Valid  values  are  'rr',  'fifo',  and  'other'.    Selects   the
            scheduling  queue  in the Linux kernel for operation of the main &
            score threads (does not affect the heuristics; they are always run
            in    the    'other'    queue).     Default    is    'rr'.     See
            sched_setscheduler(2) for more details.

            Valid values for 'rr' and  'fifo'  are  1..100  inclusive.   Valid
            values  for  'other'  are -20..20 inclusive.  Sets the priority of
            the main & score threads.  The default value is 1 (in the  RR  and
            FIFO  queues,  higher  numbers  denote  higher priority; in OTHER,
            lower values denote higher priority).  This option can be  changed
            while qdiskd is running.

            Ordinarily,  cluster membership is left up to CMAN, not qdisk.  If
            this parameter is set to 1 (on), qdiskd will tell  CMAN  to  leave
            the  cluster  if it is unable to initialize the quorum disk during
            startup.  This can be used to prevent cluster participation  by  a
            node  which  has  been disconnected from the SAN.  The default for
            this value is 0 (off).  This option can be changed while qdiskd is

            If  this  parameter  is set to 1 (on), qdiskd will use values from
            /proc/uptime for internal timings.  This is  a  bit  less  precise
            than  gettimeofday(2), but the benefit is that changing the system
            clock will not affect qdiskd's behavior  -  even  if  paranoid  is
            enabled.   If  set to 0, qdiskd will use gettimeofday(2), which is
            more precise.  The default for this value is 1 (on / use  uptime).

            This  is  the device the quorum daemon will use.  This device must
            be the same on all nodes.

            This overrides the device field if  present.   If  specified,  the
            quorum  daemon  will  read  /proc/partitions  and  check for qdisk
            signatures on  every  block  device  found,  comparing  the  label
            against  the  specified  label.   This is useful in configurations
            where the block device name differs on a per-node basis.

            This overrides the  label  advertised  to  CMAN  if  present.   If
            specified,  the quorum daemon will register with this name instead
            of the actual device name.

            If we receive an I/O error during a cycle, we do not poll CMAN and
            tell  it we are alive.  If specified, this value will cause qdiskd
            to exit after the specified number of  consecutive  cycles  during
            which  I/O  errors  occur.   The  default is 0 (no maximum).  This
            option can be changed while qdiskd is  running.   This  option  is
            ignored if io_timeout is set to 1.


3.3.1. Quorum Disk Timings

       Qdiskd  should  not be used in environments requiring failure detection
       times of less than approximately 10 seconds.

       Qdiskd will attempt to automatically configure  timings  based  on  the
       totem  timeout  and  the  TKO.   If configuring manually, Totem's token
       timeout must be set to a value at least 1 interval greater than the the
       following function:

         interval * (tko + master_wait + upgrade_wait)

       So,  if  you  have  an  interval of 2, a tko of 7, master_wait of 2 and
       upgrade_wait of 2, the token timeout should  be  at  least  24  seconds
       (24000 msec).

       It  is  recommended  to have at least 3 intervals to reduce the risk of
       quorum loss during heavy I/O load.  As a rule of thumb, using  a  totem
       timeout  more than 2x of qdiskd's timeout will result in good behavior.

       An improper timing configuration will cause CMAN to give up on  qdiskd,
       causing a temporary loss of quorum during master transition.

3.2. The <heuristic> tag

       This  tag  is  a  child  of  the  <quorumd> tag.  Heuristics may not be
       changed while qdiskd is running.

            This is the program used to determine if this heuristic is  alive.
            This  can  be  anything  which  may  be executed by /bin/sh -c.  A
            return value of zero indicates success;  anything  else  indicates
            failure.  This is required.

            This is the weight of this heuristic.  Be careful when determining
            scores for heuristics.  The default score for each heuristic is 1.

            This is the frequency (in seconds) at which we poll the heuristic.
            The default interval for every heuristic is 2 seconds.

            After this many failed  attempts  to  run  the  heuristic,  it  is
            considered  DOWN,  and  its score is removed.  The default tko for
            each heuristic is 1, which may be inadequate for  things  such  as

3.3. Examples

3.3.1. 3 cluster nodes & 3 routers

        <cman expected_votes="6" .../>
            <clusternode name="node1" votes="1" ... />
            <clusternode name="node2" votes="1" ... />
            <clusternode name="node3" votes="1" ... />
        <quorumd interval="1" tko="10" votes="3" label="testing">
            <heuristic   program="ping   A  -c1  -t1"  score="1"  interval="2"
            <heuristic  program="ping  B  -c1  -t1"   score="1"   interval="2"
            <heuristic   program="ping   C  -c1  -t1"  score="1"  interval="2"

3.3.2. 2 cluster nodes & 1 IP tiebreaker

        <cman two_node="0" expected_votes="3" .../>
            <clusternode name="node1" votes="1" ... />
            <clusternode name="node2" votes="1" ... />
        <quorumd interval="1" tko="10" votes="1" label="testing">
            <heuristic  program="ping  A  -c1  -t1"   score="1"   interval="2"

3.4. Heuristic score considerations

       *  Heuristic  timeouts  should be set high enough to allow the previous
       run of a given heuristic to complete.

       * Heuristic scripts returning anything except 0 as  their  return  code
       are considered failed.

       *  The worst-case for improperly configured quorum heuristics is a race
       to fence where two partitions simultaneously try to kill each other.

3.5. Creating a quorum disk partition

       The mkqdisk utility can create and  list  currently  configured  quorum
       disks visible to the local node; see mkqdisk(8) for more details.


       mkqdisk(8), qdiskd(8), cman(5), syslog.conf(5), gettimeofday(2)

                                  20 Feb 2007                         QDisk(5)