Provided by: sanlock_3.8.5-3_amd64

NAME

       sanlock - shared storage lock manager

SYNOPSIS

       sanlock [COMMAND] [ACTION] ...

DESCRIPTION

       sanlock  is  a lock manager built on shared storage.  Hosts with access to the storage can
       perform locking.  An application running on the hosts is given a small amount of space  on
       the  shared  block  device  or  file,  and  uses  sanlock for its own application-specific
       synchronization.  Internally, the sanlock daemon manages locks using two disk-based  lease
       algorithms: delta leases and paxos leases.

       • delta leases are slow to acquire and demand regular i/o to shared storage.  sanlock only
         uses them internally to hold a lease on its "host_id" (an integer host  identifier  from
         1-2000).   They  prevent two hosts from using the same host identifier.  The delta lease
         renewals also indicate if a host is alive.  ("Light-Weight  Leases  for  Storage-Centric
         Coordination", Chockler and Malkhi.)

       • paxos  leases  are  fast  to acquire and sanlock makes them available to applications as
         general purpose resource leases.  The disk paxos algorithm uses host_id's internally  to
         represent  different hosts, and the owner of a paxos lease.  delta leases provide unique
         host_id's for implementing paxos leases, and delta lease renewals serve as a  proxy  for
         paxos lease renewal.  ("Disk Paxos", Eli Gafni and Leslie Lamport.)

       Externally,  the sanlock daemon exposes a locking interface through libsanlock in terms of
       "lockspaces" and "resources".  A lockspace  is  a  locking  context  that  an  application
       creates  for  itself  on shared storage.  When the application on each host is started, it
       "joins" the lockspace.  It can then  create  "resources"  on  the  shared  storage.   Each
       resource  represents  an  application-specific  entity.   The  application can acquire and
       release leases on resources.

       To use sanlock from an application:

       • Allocate shared storage for an application, e.g. a shared LUN or LV from a SAN, or files
         from NFS.

       • Provide the storage to the application.

       • The  application  uses  this storage with libsanlock to create a lockspace and resources
         for itself.

       • The application joins the lockspace when it starts.

       • The application acquires and releases leases on resources.

       How lockspaces and resources translate to delta leases and paxos leases within sanlock:

       Lockspaces

       • A lockspace is based on delta leases held by each host using the lockspace.

       • A lockspace is a series of 2000 delta leases on disk, and requires 1MB of storage.  (See
         Storage below for size variations.)

       • A  lockspace  can  support  up to 2000 concurrent hosts using it, each using a different
         delta lease.

       • Applications can i) create, ii) join and iii) leave a lockspace, which corresponds to i)
         initializing  the set of delta leases on disk, ii) acquiring one of the delta leases and
         iii) releasing the delta lease.

       • When a lockspace is created, a unique lockspace name and disk location  is  provided  by
         the application.

       • When  a  lockspace  is created/initialized, sanlock formats the sequence of 2000 on-disk
         delta lease structures on the file or disk,  e.g.  /mnt/leasefile  (NFS)  or  /dev/vg/lv
         (SAN).

       • The   2000   individual   delta   leases  in  a  lockspace  are  identified  by  number:
         1,2,3,...,2000.

       • Each delta lease is a 512 byte sector in the 1MB lockspace, offset by its  number,  e.g.
         delta  lease  1  is  offset  0,  delta lease 2 is offset 512, delta lease 2000 is offset
         1023488.  (See Storage below for size variations.)

       • When an application joins a lockspace, it must specify the lockspace name, the lockspace
         location  on  shared disk/file, and the local host's host_id.  sanlock then acquires the
         delta lease corresponding to the host_id, e.g. joining  the  lockspace  with  host_id  1
         acquires delta lease 1.

       • The terms delta lease, lockspace lease, and host_id lease are used interchangeably.

       • sanlock acquires a delta lease by writing the host's unique name to the delta lease disk
         sector, reading it back after a delay, and verifying it is the same.

       • If a unique host name is not specified, sanlock generates a uuid to use  as  the  host's
         name.  The delta lease algorithm depends on hosts using unique names.

       • The  application  on  each  host  should  be configured with a unique host_id, where the
         host_id is an integer 1-2000.

       • If hosts are misconfigured and have the same  host_id,  the  delta  lease  algorithm  is
         designed  to  detect  this conflict, and only one host will be able to acquire the delta
         lease for that host_id.

       • A delta lease ensures that a lockspace host_id is being used by a single host  with  the
         unique name specified in the delta lease.

       • Resolving  delta  lease conflicts is slow, because the algorithm is based on waiting and
         watching for some time for other hosts to write to the  same  delta  lease  sector.   If
         multiple  hosts  try  to use the same delta lease, the delay is increased substantially.
         So, it is best to configure applications to use unique host_id's that will not conflict.

       • After sanlock acquires a delta lease, the lease must be renewed  until  the  application
         leaves the lockspace (which corresponds to releasing the delta lease on the host_id.)

       • sanlock  renews  delta  leases  every 20 seconds (by default) by writing a new timestamp
         into the delta lease sector.

       • When a host acquires a delta lease in a lockspace, it can be referred  to  as  "joining"
         the  lockspace.   Once it has joined the lockspace, it can use resources associated with
         the lockspace.
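
       The offset arithmetic for delta leases described above can be sketched as a small shell
       function.  This is an illustration only: the function name and arguments are hypothetical,
       and scaling by a sector size other than 512 is an assumption based on the note about size
       variations, not something sanlock provides as a command.

```shell
# Byte offset of the delta lease for a given host_id.
# Arguments: host_id, sector size in bytes (default 512),
#            byte offset of the lockspace on the device (default 0).
# Hypothetical helper; not part of sanlock.
delta_lease_offset() {
    host_id=$1
    sector_size=${2:-512}
    lockspace_offset=${3:-0}
    # Delta lease N occupies sector N-1 of the lockspace.
    echo $(( lockspace_offset + (host_id - 1) * sector_size ))
}

delta_lease_offset 1
delta_lease_offset 2
delta_lease_offset 2000
```

       With the default 512 byte sector size these calls print 0, 512 and 1023488, matching the
       offsets listed above.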

       Resources

       • A lockspace is  a  context  for  resources  that  can  be  locked  and  unlocked  by  an
         application.

       • sanlock  uses  paxos leases to implement leases on resources.  The terms paxos lease and
         resource lease are used interchangeably.

       • A paxos lease exists on shared storage and requires 1MB of space.  It contains a  unique
         resource name and the name of the lockspace.

       • An  application  assigns  its own meaning to a sanlock resource and the leases on it.  A
         sanlock resource could represent some shared object like a file,  or  some  unique  role
         among the hosts.

       • Resource  leases  are associated with a specific lockspace and can only be used by hosts
         that have joined that lockspace (they are holding a delta lease on  a  host_id  in  that
         lockspace.)

       • An  application  must  keep track of the disk locations of its lockspaces and resources.
         sanlock does not maintain any persistent index or directory of lockspaces  or  resources
         that have been created by applications, so applications need to remember where they have
         placed their own leases (which files or disks and offsets).

       • sanlock does not renew paxos leases directly (although it could).  Instead, the  renewal
         of  a  host's  delta lease represents the renewal of all that host's paxos leases in the
         associated lockspace. In effect, many paxos lease renewals are  factored  out  into  one
         delta lease renewal.  This reduces i/o when many paxos leases are used.

       • The  disk paxos algorithm allows multiple hosts to all attempt to acquire the same paxos
         lease at once, and will produce a single winner/owner of the  resource  lease.   (Shared
         resource leases are also possible in addition to the default exclusive leases.)

       • The disk paxos algorithm involves a specific sequence of reading and writing the sectors
         of the paxos lease disk area.  Each host has a dedicated 512 byte sector  in  the  paxos
         lease  disk  area  where it writes its own "ballot", and each host reads the entire disk
         area to see the ballots of other hosts.  The first  sector  of  the  disk  area  is  the
         "leader record" that holds the result of the last paxos ballot.  The winner of the paxos
         ballot writes the result of the ballot to the leader record (the winner  of  the  ballot
         may have selected another contending host as the owner of the paxos lease.)

       • After a paxos lease is acquired, no further i/o is done in the paxos lease disk area.

       • Releasing the paxos lease involves writing a single sector to clear the current owner in
         the leader record.

       • If a host holding a paxos lease fails, the disk area of the paxos lease still  indicates
         that  the  paxos lease is owned by the failed host.  If another host attempts to acquire
         the paxos lease, and finds the lease is held by another host_id, it will check the delta
         lease  of  that  host_id.   If the delta lease of the host_id is being renewed, then the
         paxos lease is owned and cannot be acquired.  If the delta lease of the owner's  host_id
         has  expired,  then  the  paxos  lease is expired and can be taken (by going through the
         paxos lease algorithm.)

       • Hosts interact with, or are aware of, each other only when they attempt to acquire the
         same paxos lease and need to check whether the referenced delta lease has expired.

       • When hosts do not attempt to lock the same resources  concurrently,  there  is  no  host
         interaction or awareness.  The state or actions of one host have no effect on others.

       • To  speed  up  checking  delta lease expiration (in the case of a paxos lease conflict),
         sanlock keeps track of past renewals of other delta leases in the lockspace.
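
       The takeover rule above (a paxos lease held by another host_id can be taken only once that
       host's delta lease has expired) can be sketched as follows.  This is a deliberately
       simplified model: the function name is hypothetical, the timeout is a parameter supplied
       by the caller because this page does not state its value, and expiry is treated as a plain
       age comparison rather than the way sanlock actually observes renewals.

```shell
# Report whether a paxos lease owned by another host is still owned or expired.
# Arguments: time of the owner's last observed delta lease renewal,
#            current time, expiry timeout (all in seconds).
# Hypothetical helper; the timeout value is an assumption supplied by the caller.
paxos_lease_state() {
    last_renewal=$1
    now=$2
    timeout=$3
    if [ $(( now - last_renewal )) -lt "$timeout" ]; then
        echo owned      # delta lease still being renewed: cannot be acquired
    else
        echo expired    # delta lease expired: may be taken via the paxos algorithm
    fi
}

paxos_lease_state 100 110 80
paxos_lease_state 100 200 80
```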

       Resource Index

       The resource index (rindex) is an optional sanlock feature that applications  can  use  to
       keep  track of resource lease offsets.  Without the rindex, an application must keep track
       of where its resource leases exist on disk and find available locations when creating  new
       leases.

       The  sanlock  rindex uses two align-size areas on disk following the lockspace.  The first
       area holds rindex entries; each entry records a resource lease  name  and  location.   The
       second  area  holds  a  private  paxos lease, used by sanlock internally to protect rindex
       updates.

       The application creates the rindex on disk with the "format" function.  Format is a  disk-
       only  operation and does not interact with the live lockspace, so it can be called without
       first calling add_lockspace.  The application needs to follow the  convention  of  writing
       the  lockspace at the start of the device (offset 0) and formatting the rindex immediately
       following the lockspace area.  When formatting, the application must set flags for  sector
       size and align size to match those for the lockspace.
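
       The on-disk convention described above can be sketched as offset arithmetic.  The function
       name and output labels are hypothetical, and the "first_resource" value assumes resource
       leases are placed directly after the two rindex areas, which this page implies but does
       not state.

```shell
# Print the byte offsets implied by the rindex layout convention:
# lockspace at offset 0, followed by two align-size rindex areas.
# Argument: align size in bytes (default 1048576, i.e. 1M).
# Hypothetical helper; not part of sanlock.
rindex_layout() {
    align=${1:-1048576}
    echo "lockspace=0"
    echo "rindex_entries=$(( align ))"        # first area: rindex entries
    echo "rindex_lease=$(( align * 2 ))"      # second area: private paxos lease
    echo "first_resource=$(( align * 3 ))"    # assumed start of resource leases
}

rindex_layout
```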

       To use the rindex, the application:

       • Uses the "create" function to create a new resource lease on disk.  This takes the place
         of the write_resource function.  The create function requires the location of the rindex
         and the name of the new resource lease.  sanlock finds a free lease area, writes the new
         resource lease at that location, updates the rindex with the  name:offset,  and  returns
         the  offset  to  the  caller.   The  caller uses this offset when acquiring the resource
         lease.

       • Uses the "delete" function to remove a resource lease on disk (also corresponding to the
         write_resource function.)  sanlock clears the resource lease and the rindex entry for
         it.  A subsequent call to create may use this same disk location for a different
         resource lease.

       • Uses the "lookup" function to discover the offset of a resource lease given the resource
         lease name.  The caller would typically call this prior to acquiring the resource lease.

       • Uses the "rebuild" function  to  recreate  the  rindex  if  it  is  damaged  or  becomes
         inconsistent.   This  function scans the disk for resource leases and creates new rindex
         entries to match the leases it finds.

       • The "update" function manipulates rindex entries directly and  should  not  normally  be
         used  by  the  application.  In normal usage, the create and delete functions manipulate
         rindex entries.  Update is mainly useful for testing or repairs.

       Expiration

       • If a host fails to renew its delta lease, e.g. it loses access to the storage, its
         delta  lease  will  eventually  expire  and  another  host will be able to take over any
         resource leases held by the host.  sanlock must  ensure  that  the  application  on  two
         different hosts is not holding and using the same lease concurrently.

       • When  sanlock  has  failed  to  renew  a delta lease for a period of time, it will begin
         taking measures to stop local processes (applications) from using  any  resource  leases
         associated with the expiring lockspace delta lease.  sanlock enters this "recovery mode"
         well ahead of the time when another host could  take  over  the  locally  owned  leases.
         sanlock  must  have  sufficient  time  to  stop  all  local processes that are using the
         expiring leases.

       • sanlock uses three methods to stop local processes that are using expiring leases:

         1. Graceful shutdown.  sanlock will execute  a  "graceful  shutdown"  program  that  the
         application  previously  specified  for  this  case.   The  shutdown  program  tells the
         application to shut down because its leases are expiring.  The application must  respond
         by  stopping  its activities and releasing its leases (or exit).  If an application does
         not specify a graceful shutdown program, sanlock sends SIGTERM to the  process  instead.
         The  process must release its leases or exit in a prescribed amount of time (see -g), or
         sanlock proceeds to the next method of stopping.

         2. Forced shutdown.  sanlock will send SIGKILL to processes using the  expiring  leases.
         The  processes  have  a fixed amount of time to exit after receiving SIGKILL.  If any do
         not exit in this time, sanlock will proceed to the next method.

         3. Host reset.  sanlock will trigger the host's watchdog device to  forcibly  reset  it.
         sanlock  carefully  manages  the  timing of the watchdog device so that it fires shortly
         before any other host could take over the resource leases held by local processes.
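
       The escalation through the three methods can be sketched as a function of the time spent
       in recovery mode.  The function name and both durations are hypothetical placeholders;
       this page only says that the graceful period is set with -g and that the forced period is
       fixed, and in practice the final stage is the watchdog firing rather than a decision made
       by a script.

```shell
# Which stop method applies a given number of seconds after recovery mode begins.
# Arguments: elapsed seconds, graceful period, forced (SIGKILL) period.
# Both periods are hypothetical values chosen by the caller.
recovery_stage() {
    elapsed=$1
    graceful_sec=$2
    kill_sec=$3
    if [ "$elapsed" -lt "$graceful_sec" ]; then
        echo graceful_shutdown
    elif [ "$elapsed" -lt $(( graceful_sec + kill_sec )) ]; then
        echo forced_shutdown
    else
        echo host_reset
    fi
}

recovery_stage 5 40 20
recovery_stage 45 40 20
recovery_stage 60 40 20
```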

       Failures

       If a process holding resource leases fails or exits without releasing its leases,  sanlock
       will  release  the  leases  for  it  automatically (unless persistent resource leases were
       used.)

       If the sanlock daemon cannot renew a lockspace delta lease for a specific period  of  time
       (see Expiration), sanlock will enter "recovery mode" where it attempts to stop and/or kill
       any processes holding resource leases in the expiring lockspace.  If the processes do  not
       exit in time, sanlock will force the host to be reset using the local watchdog device.

       If  the  sanlock  daemon  crashes  or hangs, it will not renew the expiry time of the per-
       lockspace connections it had to the wdmd daemon.  This will lead to the expiration of  the
       local watchdog device, and the host will be reset.

       Watchdog

       sanlock  uses  the  wdmd(8)  daemon  to  access  /dev/watchdog.  wdmd multiplexes multiple
       timeouts onto the single watchdog timer.  This is required because delta leases  for  each
       lockspace are renewed and expire independently.

       sanlock  maintains  a  wdmd connection for each lockspace delta lease being renewed.  Each
       connection has an expiry time set some seconds in the future.  After each successful delta
       lease  renewal,  the  expiry  time is renewed for the associated wdmd connection.  If wdmd
       finds any connection expired, it will not renew the  /dev/watchdog  timer.   Given  enough
       successive  failed renewals, the watchdog device will fire and reset the host.  (Given the
       multiplexing nature of wdmd, shorter overlapping renewal failures from multiple lockspaces
       could cause spurious watchdog firing.)
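
       The multiplexing rule (one expired connection is enough to stop watchdog renewals) can be
       sketched as below.  This models only the decision described above; the function name and
       the representation of connections as a list of expiry times are assumptions, not the wdmd
       implementation.

```shell
# Succeed (return 0) if the watchdog timer may be renewed.
# Arguments: current time, then the expiry time of each connection.
# Hypothetical helper modelling the rule described above.
may_renew_watchdog() {
    now=$1
    shift
    for expiry in "$@"; do
        if [ "$expiry" -le "$now" ]; then
            return 1    # one expired connection blocks the renewal
        fi
    done
    return 0
}

if may_renew_watchdog 100 120 130; then echo renew; else echo skip; fi
if may_renew_watchdog 100 120 90; then echo renew; else echo skip; fi
```

       The first call prints "renew" because both connections expire in the future; the second
       prints "skip" because one connection expired at time 90.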

       The  direct link between delta lease renewals and watchdog renewals provides a predictable
       watchdog firing time based on delta lease renewal timestamps that are visible  from  other
       hosts.   sanlock  knows the time the watchdog on another host has fired based on the delta
       lease time.  Furthermore, if the watchdog device on another host fails  to  fire  when  it
       should,  the  continuation  of  delta  lease  renewals  from the other host will make this
       evident and prevent leases from being taken from the failed host.

       If sanlock is able to stop/kill all processes using an expiring lockspace, the associated
       wdmd connection for that lockspace is removed.  The expired wdmd connection will no longer
       block /dev/watchdog renewals, and the host should avoid being reset.

       Storage

       The sector size and the align size  should  be  specified  when  creating  lockspaces  and
       resources  (and  rindex).   The  "align  size"  is  the  size  on disk of a lockspace or a
       resource, i.e. the amount of disk space it uses.   Lockspaces  and  resources  should  use
       matching sector and align sizes, and must use offsets in multiples of the align size.  The
       max number of hosts that can use a lockspace or resource depends  on  the  combination  of
       sector  size and align size, shown below.  The host_id of hosts using the lockspace can be
       no larger than the max_hosts value for the lockspace.

       Accepted combinations of sector size and align size, and the corresponding max_hosts  (and
       max host_id) are:

       sector_size 512, align_size 1M, max_hosts 2000
       sector_size 4096, align_size 1M, max_hosts 250
       sector_size 4096, align_size 2M, max_hosts 500
       sector_size 4096, align_size 4M, max_hosts 1000
       sector_size 4096, align_size 8M, max_hosts 2000
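
       The table above can be expressed as a lookup, which is convenient when validating a
       host_id before joining a lockspace.  The function names are hypothetical; the values are
       copied from the table.

```shell
# Print max_hosts for a sector size (bytes) and align size (1M, 2M, 4M, 8M).
# Hypothetical helper; values taken from the table above.
max_hosts() {
    case "$1/$2" in
        512/1M)  echo 2000 ;;
        4096/1M) echo 250 ;;
        4096/2M) echo 500 ;;
        4096/4M) echo 1000 ;;
        4096/8M) echo 2000 ;;
        *) echo "unsupported sector/align combination: $1/$2" >&2
           return 1 ;;
    esac
}

# Succeed if host_id is usable with the given sector and align size.
# Arguments: host_id, sector size, align size.
host_id_valid() {
    limit=$(max_hosts "$2" "$3") || return 1
    [ "$1" -ge 1 ] && [ "$1" -le "$limit" ]
}

max_hosts 4096 2M
```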

       When  sector_size  and  align_size  are  not  specified, the behavior matches the behavior
       before these sizes  could  be  configured:  on  devices  which  report  sector  size  512,
       512/1M/2000  is  used, on devices which report sector size 4096, 4096/8M/2000 is used, and
       on files, 512/1M/2000 is always used.  (Other combinations are not compatible with sanlock
       version 3.6 or earlier.)

       Using  sanlock  on shared block devices that do host based mirroring or replication is not
       likely to work correctly.  When using sanlock on shared files, all sanlock io should go to
       one file server.

       Example

       This  is  an example of creating and using lockspaces and resources from the command line.
       (Most applications would use sanlock through libsanlock rather than  through  the  command
       line.)

       1.  Allocate shared storage for sanlock leases.

           This example assumes 512 byte sectors on the device, in which case the lockspace needs
           1MB and each resource needs 1MB.

           The example shared block device accessible to all hosts is /dev/leases.

       2.  Start sanlock on all hosts.

           The -w 0 disables use of the watchdog for testing.

           # sanlock daemon -w 0

       3.  Start a dummy application on all hosts.

           This sanlock command registers with  sanlock,  then  execs  the  sleep  command  which
           inherits the registered fd.  The sleep process acts as the dummy application.  Because
           the sleep process is registered with sanlock, leases can be acquired for it.

           # sanlock client command -c /bin/sleep 600 &

       4.  Create a lockspace for the application (from one host).

           The lockspace is named "test".

           # sanlock client init -s test:0:/dev/leases:0

       5.  Join the lockspace for the application.

           Use a unique host_id on each host.

           host1:
           # sanlock client add_lockspace -s test:1:/dev/leases:0
           host2:
           # sanlock client add_lockspace -s test:2:/dev/leases:0

       6.  Create two resources for the application (from one host).

           The resources are named "RA" and "RB".  Offsets are used on the  same  device  as  the
           lockspace.  Different LVs or files could also be used.

           # sanlock client init -r test:RA:/dev/leases:1048576
           # sanlock client init -r test:RB:/dev/leases:2097152

       7.  Acquire resource leases for the application on host1.

           Acquire  an  exclusive  lease  (the default) on the first resource, and a shared lease
           (SH) on the second resource.

           # export P=`pidof sleep`
           # sanlock client acquire -r test:RA:/dev/leases:1048576 -p $P
           # sanlock client acquire -r test:RB:/dev/leases:2097152:SH -p $P

       8.  Acquire resource leases for the application on host2.

           Acquiring the exclusive lease on the first resource will fail because it  is  held  by
           host1.  Acquiring the shared lease on the second resource will succeed.

           # export P=`pidof sleep`
           # sanlock client acquire -r test:RA:/dev/leases:1048576 -p $P
           # sanlock client acquire -r test:RB:/dev/leases:2097152:SH -p $P

       9.  Release resource leases for the application on both hosts.

           The sleep process could also be killed, in which case the sanlock daemon releases the
           leases held by that process when the process exits.

           # sanlock client release -r test:RA:/dev/leases:1048576 -p $P
           # sanlock client release -r test:RB:/dev/leases:2097152 -p $P

       10. Leave the lockspace for the application.

           host1:
           # sanlock client rem_lockspace -s test:1:/dev/leases:0
           host2:
           # sanlock client rem_lockspace -s test:2:/dev/leases:0

       11. Stop sanlock on all hosts.

           # sanlock shutdown
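
       The RESOURCE strings used in this example follow a simple pattern: with 1MB alignment and
       the lockspace at offset 0, the Nth resource sits at N times 1048576.  A small helper can
       build them; the function name and the choice to place resources back to back after the
       lockspace are assumptions made for this example only.

```shell
# Build a RESOURCE string for the Nth resource (1-based) placed after a
# lockspace that starts at offset 0 on the same device.
# Arguments: lockspace name, resource name, path, index,
#            align size in bytes (default 1048576).
# Hypothetical helper; not part of sanlock.
resource_string() {
    lockspace=$1
    name=$2
    path=$3
    index=$4
    align=${5:-1048576}
    echo "$lockspace:$name:$path:$(( index * align ))"
}

resource_string test RA /dev/leases 1
resource_string test RB /dev/leases 2
```

       These calls print test:RA:/dev/leases:1048576 and test:RB:/dev/leases:2097152, the same
       strings passed to -r in steps 6 through 9.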

OPTIONS

       COMMAND can be one of three primary top level choices:

       sanlock daemon    start daemon
       sanlock client    send request to daemon (default command if none given)
       sanlock direct    access storage directly (no coordination with daemon)

   Daemon Command
       sanlock daemon [options]

       -D         no fork and print all logging to stderr

       -Q 0|1     quiet error messages for common lock contention

       -R 0|1     renewal debugging, log debug info for each renewal

       -L pri     write logging at priority level and up to logfile (-1 none)

       -S pri     write logging at priority level and up to syslog (-1 none)

       -U uid     user id

       -G gid     group id

       -H num     renewal history size

       -t num     max worker threads

       -g sec     seconds for graceful recovery

       -w 0|1     use watchdog through wdmd

       -h 0|1     use high priority (RR) scheduling

       -l num     use mlockall (0 none, 1 current, 2 current and future)

       -b sec     seconds a host id bit will remain set in delta lease bitmap

       -e str     local host name used in delta leases

   Client Command
       sanlock client action [options]

       sanlock client status

       Print processes, lockspaces, and resources being managed by the sanlock daemon.  Add -D to
       show extra internal daemon status for debugging.  Add -o p to show resources by pid, or -o
       s to show resources by lockspace.

       sanlock client host_status

       Print state of host_id delta leases read during the last renewal.  State of all lockspaces
       is  shown  (use  -s  to  select  one).   Add  -D  to show extra internal daemon status for
       debugging.

       sanlock client gets

       Print lockspaces being managed by the  sanlock  daemon.   The  LOCKSPACE  string  will  be
       followed  by ADD or REM if the lockspace is currently being added or removed.  Add -h 1 to
       also show hosts in each lockspace.

       sanlock client renewal -s LOCKSPACE

       Print a history of renewals with timing details.  See the Renewal history section below.

       sanlock client log_dump

       Print the sanlock daemon internal debug log.

       sanlock client shutdown

       Ask the sanlock daemon to exit.  Without the force option (-f  0),  the  command  will  be
       ignored  if  any lockspaces exist.  With the force option (-f 1), any registered processes
       will be killed, their resource leases released, and lockspaces  removed.   With  the  wait
       option  (-w  1), the command will wait for a result from the daemon indicating that it has
       shut down and is exiting, or cannot shut down because lockspaces exist (command fails).

       sanlock client init -s LOCKSPACE

       Tell the sanlock daemon to initialize a lockspace on disk.  The -o option can be  used  to
       specify  the io timeout to be written in the host_id leases.  The -Z and -A options can be
       used to specify the sector size and align size, and both should be  set  together.   (Also
       see sanlock direct init.)

       sanlock client init -r RESOURCE

       Tell the sanlock daemon to initialize a resource lease on disk.  The -Z and -A options can
       be used to specify the sector size and align size, and both should be set together.  (Also
       see sanlock direct init.)

       sanlock client read -s LOCKSPACE

       Tell the sanlock daemon to read a lockspace from disk.  Only the LOCKSPACE path and offset
       are required.  If host_id is zero, the first record at offset (host_id 1)  is  used.   The
       complete  LOCKSPACE  is printed.  Add -D to print other details.  (Also see sanlock direct
       read_leader.)

       sanlock client read -r RESOURCE

       Tell the sanlock daemon to read a resource lease from disk.  Only the  RESOURCE  path  and
       offset  are  required.   The complete RESOURCE is printed.  Add -D to print other details.
       (Also see sanlock direct read_leader.)

       sanlock client add_lockspace -s LOCKSPACE

       Tell the sanlock daemon to acquire the specified host_id  in  the  lockspace.   This  will
       allow resources to be acquired in the lockspace.  The -o option can be used to specify the
       io timeout of the acquiring host, and will be written in the host_id lease.

       sanlock client inq_lockspace -s LOCKSPACE

       Inquire about the state of the lockspace in the sanlock daemon, whether it is being  added
       or removed, or is joined.

       sanlock client rem_lockspace -s LOCKSPACE

       Tell the sanlock daemon to release the specified host_id in the lockspace.  Any processes
       holding resource leases in this lockspace will be killed, and the resource leases will
       not be released.

       sanlock client command -r RESOURCE -c path args

       Register  with  the  sanlock  daemon,  acquire  the specified resource lease, and exec the
       command at path with args.  When the command exits, the sanlock daemon  will  release  the
       lease.  -c must be the final option.

       sanlock client acquire -r RESOURCE -p pid
       sanlock client release -r RESOURCE -p pid

       Tell  the  sanlock daemon to acquire or release the specified resource lease for the given
       pid.  The pid must be registered with the sanlock daemon.  acquire can optionally  take  a
       versioned  RESOURCE string RESOURCE:lver, where lver is the version of the lease that must
       be acquired, or fail.

       sanlock client convert -r RESOURCE -p pid

       Tell the sanlock daemon to convert the mode of the specified resource lease for the  given
       pid.   If the existing mode is exclusive (default), the mode of the lease can be converted
       to shared with RESOURCE:SH.  If the existing mode is shared, the mode of the lease can  be
       converted to exclusive with RESOURCE (no :SH suffix).

       sanlock client inquire -p pid

       Print the resource leases held by the given pid.  The format is a versioned RESOURCE string
       "RESOURCE:lver" where lver is the version of the lease held.

       sanlock client request -r RESOURCE -f force_mode

       Request the owner of a  resource  do  something  specified  by  force_mode.   A  versioned
       RESOURCE:lver  string  must  be  used with a greater version than is presently held.  Zero
       lver and force_mode clears the request.

       sanlock client examine -r RESOURCE

       Examine the request record for the currently held resource lease and carry out the  action
       specified by the requested force_mode.

       sanlock client examine -s LOCKSPACE

       Examine  requests  for  all  resource  leases currently held in the named lockspace.  Only
       lockspace_name is used from the LOCKSPACE argument.

       sanlock client set_event -s LOCKSPACE -i host_id -g gen -e num -d num

       Set an event for another host.  When the sanlock daemon next renews its  delta  lease  for
       the  lockspace it will: set the bit for the host_id in its bitmap, and set the generation,
       event and data values in its own delta lease.  An  application  that  has  registered  for
       events  from  this  lockspace on the destination host will get the event that has been set
       when the destination sees the event during its next delta lease renewal.

       sanlock client set_config -s LOCKSPACE

       Set a configuration value for a lockspace.  Only lockspace_name is used from the LOCKSPACE
       argument.   The  USED  flag  has  the  same  effect  on a lockspace as a process holding a
       resource lease that will not exit.  The USED_BY_ORPHANS flag means that an orphan resource
       lease will have the same effect as the USED.
       -u 0|1 Set (1) or clear (0) the USED flag.
       -O 0|1 Set (1) or clear (0) the USED_BY_ORPHANS flag.

       sanlock client format -x RINDEX

       Create  a  resource index on disk.  Use -Z and -A to set the sector size and align size to
       match the lockspace.

       sanlock client create -x RINDEX -e resource_name

       Create a new resource lease on disk, using the rindex to find a free offset.

       sanlock client delete -x RINDEX -e resource_name[:offset]

       Delete an existing resource lease on disk.

       sanlock client lookup -x RINDEX -e resource_name

       Look up the offset of an existing resource lease by name on disk, using the rindex.   With
       no -e option, lookup returns the next free lease offset.  If -e specifies both name and
       offset, the lookup verifies both are correct.

       sanlock client update -x RINDEX -e resource_name[:offset] [-z 0|1]

       Add (-z 0) or remove (-z 1) an rindex entry on disk.

       sanlock client rebuild -x RINDEX

       Rebuild the rindex entries by scanning the disk for resource leases.

   Direct Command
       sanlock direct action [options]

       -o sec     io timeout in seconds

       sanlock direct init -s LOCKSPACE
       sanlock direct init -r RESOURCE

       Initialize storage for a lockspace or resource.  Use the -Z and -A flags  to  specify  the
       sector  size  and  align size.  The max hosts that can use the lockspace/resource (and the
       max possible host_id) is  determined  by  the  sector/align  size  combination.   Possible
       combinations  are:  512/1M,  4096/1M, 4096/2M, 4096/4M, 4096/8M.  Lockspaces and resources
       both use the same amount of space (align_size) for each combination.  When initializing  a
       lockspace,  sanlock  initializes  delta  leases  for  max_hosts  in the given space.  When
       initializing a resource, sanlock initializes a single paxos lease in the space.  With  -s,
       the  -o option specifies the io timeout to be written in the host_id leases.  With -r, the
       -z 1  option  invalidates  the  resource  lease  on  disk  so  it  cannot  be  used  until
       reinitialized normally.

       sanlock direct read_leader -s LOCKSPACE
       sanlock direct read_leader -r RESOURCE

       Read  a  leader  record  from  disk and print the fields.  The leader record is the single
       sector of a delta lease, or the first sector of a paxos lease.

       sanlock direct dump path[:offset[:size]]

       Read disk sectors and print leader records for delta or paxos leases.  Add -f 1  to  print
       the  request  record  values  for  paxos  leases, host_ids set in delta lease bitmaps, and
       rindex entries.

       sanlock direct format -x RINDEX
       sanlock direct lookup -x RINDEX -e resource_name
       sanlock direct update -x RINDEX -e resource_name[:offset] [-z 0|1]
       sanlock direct rebuild -x RINDEX

       Access the resource index  on  disk  without  going  through  the  sanlock  daemon.   This
       precludes  using  the  internal  paxos  lease to protect rindex modifications.  See client
       equivalents for descriptions.

   LOCKSPACE option string
       -s lockspace_name:host_id:path:offset

       lockspace_name name of lockspace
       host_id local host identifier in lockspace
       path path to storage to use for leases
       offset offset on path (bytes)

   RESOURCE option string
       -r lockspace_name:resource_name:path:offset

       lockspace_name name of lockspace
       resource_name name of resource
       path path to storage to use for leases
       offset offset on path (bytes)

   RESOURCE option string with suffix
       -r lockspace_name:resource_name:path:offset:lver

       lver leader version

       -r lockspace_name:resource_name:path:offset:SH

       SH indicates shared mode

   RINDEX option string
       -x lockspace_name:path:offset

       lockspace_name name of lockspace
       path path to storage to use for leases
       offset offset on path (bytes) of rindex
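
       The option strings above are colon-separated fields in a fixed order.  A minimal Python
       sketch of assembling them follows.  The helper names and the path /dev/vg/leases are
       hypothetical, chosen only for illustration, and the helpers do not handle a path that
       itself contains a colon.

```python
from typing import Optional


def lockspace_arg(name: str, host_id: int, path: str, offset: int = 0) -> str:
    """Build the value for -s: lockspace_name:host_id:path:offset."""
    return f"{name}:{host_id}:{path}:{offset}"


def resource_arg(lockspace: str, resource: str, path: str, offset: int,
                 suffix: Optional[str] = None) -> str:
    """Build the value for -r; suffix is an lver number or "SH" (shared mode)."""
    base = f"{lockspace}:{resource}:{path}:{offset}"
    return base if suffix is None else f"{base}:{suffix}"


def rindex_arg(lockspace: str, path: str, offset: int) -> str:
    """Build the value for -x: lockspace_name:path:offset."""
    return f"{lockspace}:{path}:{offset}"


# Hypothetical device path; offsets are in bytes.
print(lockspace_arg("foo", 1, "/dev/vg/leases", 0))
print(resource_arg("foo", "example1", "/dev/vg/leases", 1048576))
print(resource_arg("foo", "example1", "/dev/vg/leases", 1048576, "SH"))
print(rindex_arg("foo", "/dev/vg/leases", 2097152))
```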

   Defaults
       sanlock help shows the default values for the options above.

       sanlock version shows the build version.

OTHER

   Request/Examine
       The first part of making a request for a resource is writing the  request  record  of  the
       resource (the sector following the leader record).  To make a successful request:

       • RESOURCE:lver  must  be  greater  than  the lver presently held by the other host.  This
         implies the leader record must be read to discover the lver, prior to making a request.

       • RESOURCE:lver must be greater than or equal to the lver presently written to the request
         record.   Two hosts may write a new request at the same time for the same lver, in which
         case both would succeed, but the force_mode from the last would win.

       • The force_mode must be greater than zero.

       • To unconditionally clear the request record (set both lver and force_mode to 0), make
         a request with RESOURCE:0 and force_mode 0.
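
       The rules above can be sketched as a small Python check.  This is an illustration of
       the listed conditions only, not sanlock's implementation; the function name and its
       return strings are hypothetical.  The force_mode values 1 (FORCE) and 2 (GRACEFUL) are
       those listed in this section.

```python
FORCE = 1
GRACEFUL = 2


def check_request(req_lver: int, force_mode: int,
                  owner_lver: int, recorded_lver: int) -> str:
    """Return "clear", "ok", or a message naming the rule that fails.

    req_lver      -- lver given in RESOURCE:lver
    owner_lver    -- lver read from the leader record of the current owner
    recorded_lver -- lver currently written in the request record
    """
    if req_lver == 0 and force_mode == 0:
        return "clear"  # unconditional clear of the request record
    if force_mode <= 0:
        return "force_mode must be greater than zero"
    if req_lver <= owner_lver:
        return "lver must be greater than the owner's lver"
    if req_lver < recorded_lver:
        return "lver must be >= the lver in the request record"
    return "ok"


print(check_request(6, FORCE, owner_lver=5, recorded_lver=6))
print(check_request(5, GRACEFUL, owner_lver=5, recorded_lver=0))
print(check_request(0, 0, owner_lver=5, recorded_lver=9))
```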

       The  owner  of the requested resource will not know of the request unless it is explicitly
       told to examine its resources via the "examine" api/command, or otherwise notified.

       The second part of making a request is notifying the resource lease owner that  it  should
       examine the request records of its resource leases.  The notification will cause the lease
       owner to automatically run the equivalent of "sanlock client examine -s LOCKSPACE" for the
       lockspace of the requested resource.

       The  notification is made using a bitmap in each host_id delta lease.  Each bit represents
       one of the possible host_ids (1-2000).  If host A wants to notify host B to examine its
       resources,  A sets the bit in its own bitmap that corresponds to the host_id of B.  When B
       next renews its delta lease, it reads the delta leases  for  all  hosts  and  checks  each
       bitmap  to  see if its own host_id has been set.  It finds the bit for its own host_id set
       in A's bitmap, and examines its resource request records.  (The bit  remains  set  in  A's
       bitmap for set_bitmap_seconds.)
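
       The notification scheme can be modelled in a few lines of Python.  The bit layout used
       here (host_id N stored at bit N-1, least significant bit first within each byte) is an
       assumption made for the sketch and is not taken from the on-disk format; the function
       names are hypothetical.

```python
MAX_HOSTS = 2000
BITMAP_BYTES = MAX_HOSTS // 8  # 250 bytes hold one bit per possible host_id


def set_bit(bitmap: bytearray, host_id: int) -> None:
    """Mark host_id in a bitmap (assumed layout: bit host_id-1, LSB first)."""
    index = host_id - 1
    bitmap[index // 8] |= 1 << (index % 8)


def bit_is_set(bitmap: bytes, host_id: int) -> bool:
    """Report whether host_id is marked in a bitmap."""
    index = host_id - 1
    return bool(bitmap[index // 8] & (1 << (index % 8)))


def hosts_notifying(bitmaps: dict, my_host_id: int) -> list:
    """Return the host_ids whose bitmap has my_host_id set."""
    return sorted(h for h, bm in bitmaps.items()
                  if h != my_host_id and bit_is_set(bm, my_host_id))


# One bitmap per delta lease, keyed by the host_id that owns the lease.
bitmaps = {1: bytearray(BITMAP_BYTES), 2: bytearray(BITMAP_BYTES)}
set_bit(bitmaps[1], 2)              # host A (1) asks host B (2) to examine
print(hosts_notifying(bitmaps, 2))  # what B finds on its next renewal
```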

       force_mode determines the action the resource lease owner should take:

       • FORCE  (1):  kill  the process holding the resource lease.  When the process has exited,
         the resource lease will be released, and can then  be  acquired  by  anyone.   The  kill
         signal is SIGKILL (or SIGTERM if SIGKILL is restricted).

       • GRACEFUL (2): run the program configured by sanlock_killpath against the process holding
         the resource lease.  If no killpath is defined, then FORCE is used.

   Persistent and orphan resource leases
       A resource lease can be acquired with the PERSISTENT flag (-P 1).  If the process holding
       the lease exits, the lease will not be released, but kept on an orphan list.  Another
       local process can acquire or release an orphan lease using the ORPHAN flag (-O 1).  All
       orphan leases can be released by specifying the lockspace name (-s lockspace_name) with
       no resource name.

   Renewal history
       sanlock saves a limited history of lease  renewal  information  in  each  lockspace.   See
       sanlock.conf renewal_history_size to set the amount of history or to disable (set to 0).

       IO times are measured during delta lease renewal (each delta lease renewal includes one
       read and one write).

       For each successful renewal, a record is saved that includes:

       • the timestamp written in the delta lease by the renewal

       • the time in milliseconds taken by the delta lease read

       • the time in milliseconds taken by the delta lease write

       Also counted and recorded are the number of io timeouts and other io errors that occur
       between successful renewals.

       Two consecutive successful renewals would be recorded as:
       timestamp=5332 read_ms=482 write_ms=5525 next_timeouts=0 next_errors=0
       timestamp=5353 read_ms=99 write_ms=3161 next_timeouts=0 next_errors=0

       Those fields are:

       • timestamp is the value written into the delta lease during that renewal.

       • read_ms/write_ms are the milliseconds taken for the renewal read/write ios.

       • next_timeouts is the number of io timeouts that occurred after the renewal recorded on
         that line, and before the next successful renewal on the following line.

       • next_errors is the number of io errors (not timeouts) that occurred after the renewal
         recorded on that line, and before the next successful renewal on the following line.

       The  command  'sanlock  client  renewal  -s  lockspace_name'  reports  the full history of
       renewals saved by sanlock, which by default is 180 records, about 1 hour of  history  when
       using a 20 second renewal interval for a 10 second io timeout.
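
       A short Python sketch of reading such records follows.  It parses the two example lines
       shown above and repeats the history arithmetic (180 records at a 20 second renewal
       interval).  The function name is hypothetical.

```python
def parse_renewal(line: str) -> dict:
    """Turn 'key=value key=value ...' into a dict of integers."""
    return {k: int(v) for k, v in (field.split("=", 1) for field in line.split())}


lines = [
    "timestamp=5332 read_ms=482 write_ms=5525 next_timeouts=0 next_errors=0",
    "timestamp=5353 read_ms=99 write_ms=3161 next_timeouts=0 next_errors=0",
]
records = [parse_renewal(line) for line in lines]

# Seconds between the two renewals, from the delta lease timestamps.
interval = records[1]["timestamp"] - records[0]["timestamp"]
# Slowest renewal write seen in the history, in milliseconds.
slowest_write_ms = max(r["write_ms"] for r in records)
# Default history depth: 180 records at a 20 second renewal interval.
history_seconds = 180 * 20

print(interval)          # 21
print(slowest_write_ms)  # 5525
print(history_seconds)   # 3600, i.e. about 1 hour
```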

INTERNALS

   Disk Format
       • This example uses 512 byte sectors.

       • Each  lockspace  is  1MB.   It holds 2000 delta_leases, one per sector, supporting up to
         2000 hosts.

       • Each paxos_lease is 1MB.  It is used as a lease for one resource.

       • The leader_record structure is used differently by each lease type.

       • To display all leader_record fields, see sanlock direct read_leader.

       • A lockspace is often followed on disk by the paxos_leases used  within  that  lockspace,
         but this layout is not required.

       • The request_record and host_id bitmap are used for requests/events.

       • The mode_block contains the SHARED flag indicating a lease is held in the shared mode.

       • In  a  lockspace, the host using host_id N writes to a single delta_lease in sector N-1.
         No other hosts write to this sector.  All hosts read all lockspace sectors when renewing
         their own delta_lease, and are able to monitor renewals of all delta_leases.

       • In  a  paxos_lease,  each  host  has a dedicated sector it writes to, containing its own
         paxos_dblock and mode_block structures.  Its sector is based on its host_id;  host_id  1
         writes to the dblock/mode_block in sector 2 of the paxos_lease.

       • The  paxos_dblock  structures  are  used by the paxos_lease algorithm, and the result is
         written to the leader_record.

       0x000000 lockspace foo:0:/path:0

       (There is no representation on disk of the lockspace in  general,  only  the  sequence  of
       specific delta_leases which collectively represent the lockspace.)

       delta_lease foo:1:/path:0
       0x000 0         leader_record         (sector 0, for host_id 1)
                       magic: 0x12212010
                       space_name: foo
                       resource_name: host uuid/name
                       ...
                       host_id bitmap        (leader_record + 256)

       delta_lease foo:2:/path:0
       0x200 512       leader_record         (sector 1, for host_id 2)
                       magic: 0x12212010
                       space_name: foo
                       resource_name: host uuid/name
                       ...
                       host_id bitmap        (leader_record + 256)

       delta_lease foo:3:/path:0
       0x400 1024      leader_record         (sector 2, for host_id 3)
                       magic: 0x12212010
                       space_name: foo
                       resource_name: host uuid/name
                       ...
                       host_id bitmap        (leader_record + 256)

       delta_lease foo:2000:/path:0
       0xF9E00         leader_record         (sector 1999, for host_id 2000)
                       magic: 0x12212010
                       space_name: foo
                       resource_name: host uuid/name
                       ...
                       host_id bitmap        (leader_record + 256)

       0x100000 paxos_lease foo:example1:/path:1048576
       0x000 0         leader_record         (sector 0)
                       magic: 0x06152010
                       space_name: foo
                       resource_name: example1

       0x200 512       request_record        (sector 1)
                       magic: 0x08292011

       0x400 1024      paxos_dblock          (sector 2, for host_id 1)
       0x480 1152      mode_block            (paxos_dblock + 128)

       0x600 1536      paxos_dblock          (sector 3, for host_id 2)
       0x680 1664      mode_block            (paxos_dblock + 128)

       0x800 2048      paxos_dblock          (sector 4, for host_id 3)
       0x880 2176      mode_block            (paxos_dblock + 128)

       0xFA200         paxos_dblock          (sector 2001, for host_id 2000)
       0xFA280         mode_block            (paxos_dblock + 128)

       0x200000 paxos_lease foo:example2:/path:2097152
       0x000 0         leader_record         (sector 0)
                       magic: 0x06152010
                       space_name: foo
                       resource_name: example2

       0x200 512       request_record        (sector 1)
                       magic: 0x08292011

       0x400 1024      paxos_dblock          (sector 2, for host_id 1)
       0x480 1152      mode_block            (paxos_dblock + 128)

       0x600 1536      paxos_dblock          (sector 3, for host_id 2)
       0x680 1664      mode_block            (paxos_dblock + 128)

       0x800 2048      paxos_dblock          (sector 4, for host_id 3)
       0x880 2176      mode_block            (paxos_dblock + 128)

       0xFA200         paxos_dblock          (sector 2001, for host_id 2000)
       0xFA280         mode_block            (paxos_dblock + 128)
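
       The offsets in the layout above follow from the sector size and the host_id.  The
       Python sketch below reproduces them for the 512 byte sector example; the function names
       are hypothetical, and the arithmetic is taken only from the layout shown here.

```python
SECTOR_SIZE = 512          # this example uses 512 byte sectors
ALIGN_SIZE = 1024 * 1024   # each lockspace and each paxos_lease is 1MB


def delta_lease_offset(lockspace_offset: int, host_id: int) -> int:
    """host_id N writes its delta_lease in sector N-1 of the lockspace."""
    return lockspace_offset + (host_id - 1) * SECTOR_SIZE


def host_bitmap_offset(lockspace_offset: int, host_id: int) -> int:
    """The host_id bitmap sits at leader_record + 256."""
    return delta_lease_offset(lockspace_offset, host_id) + 256


def paxos_dblock_offset(resource_offset: int, host_id: int) -> int:
    """Sectors 0 and 1 hold the leader and request records; host_id N uses sector N+1."""
    return resource_offset + (host_id + 1) * SECTOR_SIZE


def mode_block_offset(resource_offset: int, host_id: int) -> int:
    """The mode_block sits at paxos_dblock + 128."""
    return paxos_dblock_offset(resource_offset, host_id) + 128


print(hex(delta_lease_offset(0, 2000)))    # 0xf9e00
print(hex(paxos_dblock_offset(0, 2000)))   # 0xfa200
print(hex(mode_block_offset(0, 1)))        # 0x480
# Absolute position of host_id 1's dblock in the lease at offset 1048576.
print(paxos_dblock_offset(ALIGN_SIZE, 1))  # 1049600
```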

   Lease ownership
       Not  shown  in  the  leader_record structures above are the owner_id, owner_generation and
       timestamp fields.  These are the fields that define the lease owner.

       The delta_lease  at  sector  N  for  host_id  N+1  has  leader_record.owner_id  N+1.   The
       leader_record.owner_generation is incremented each time the delta_lease is acquired.  When
       a delta_lease is acquired, the leader_record.timestamp field is set to  the  time  of  the
       host  and the leader_record.resource_name is set to the unique name of the host.  When the
       host renews the delta_lease,  it  writes  a  new  leader_record.timestamp.   When  a  host
       releases a delta_lease, it writes zero to leader_record.timestamp.

       When  a  host  acquires  a  paxos_lease,  it  uses  the  host_id/generation value from the
       delta_lease it holds in the lockspace.  It uses this host_id/generation to identify itself
       in  the paxos_dblock when running the paxos algorithm.  The result of the algorithm is the
       winning  host_id/generation  -  the  new  owner   of   the   paxos_lease.    The   winning
       host_id/generation   are   written   to   the   paxos_lease   leader_record.owner_id   and
       leader_record.owner_generation fields and leader_record.timestamp is  set.   When  a  host
       releases a paxos_lease, it sets leader_record.timestamp to 0.

       When  a  paxos_lease is free (leader_record.timestamp is 0), multiple hosts may attempt to
       acquire it.  The paxos algorithm, using the paxos_dblock structures, will select only  one
       of  the  hosts  as  the  new  owner,  and that owner is written in the leader_record.  The
       paxos_lease will no longer be free (non-zero timestamp).  Other hosts will  see  this  and
       will not attempt to acquire the paxos_lease until it is free again.

       If  a  paxos_lease  is  owned  (non-zero  timestamp),  but  the  owner has not renewed its
       delta_lease for a specific length of time, then the owner value in the paxos_lease becomes
       expired,  and other hosts will use the paxos algorithm to acquire the paxos_lease, and set
       a new owner.
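
       The three states described above (free, owned, expired) can be summarised in Python.
       This is a simplified sketch: the expire_seconds parameter stands in for the "specific
       length of time", whose value is not given here, and the function name and arguments are
       hypothetical.  Owner generation checks are left out.

```python
def paxos_lease_state(lease_timestamp: int, owner_last_renewal: int,
                      now: int, expire_seconds: int) -> str:
    """Classify a paxos_lease from its leader_record and the owner's delta_lease.

    lease_timestamp    -- leader_record.timestamp of the paxos_lease
    owner_last_renewal -- last timestamp seen in the owner's delta_lease
    now                -- current time, in the same units
    expire_seconds     -- assumed expiry period (hypothetical parameter)
    """
    if lease_timestamp == 0:
        return "free"      # any host may run the paxos algorithm to acquire it
    if now - owner_last_renewal >= expire_seconds:
        return "expired"   # owner stopped renewing; other hosts may acquire
    return "owned"         # other hosts will not attempt to acquire


print(paxos_lease_state(0, 0, now=1000, expire_seconds=80))
print(paxos_lease_state(900, 990, now=1000, expire_seconds=80))
print(paxos_lease_state(900, 910, now=1000, expire_seconds=80))
```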

FILES

       /etc/sanlock/sanlock.conf

       • quiet_fail = 1
         See -Q

       • debug_renew = 0
         See -R

       • logfile_priority = 4
         See -L

       • logfile_use_utc = 0
         Use UTC instead of local time in log messages.

       • syslog_priority = 3
         See -S

       • names_log_priority = 4
         Log resource names at this priority level (uses syslog priority numbers).   If  this  is
         greater  than or equal to logfile_priority, each requested resource name and location is
         recorded in sanlock.log.

       • use_watchdog = 1
         See -w

       • high_priority = 1
         See -h

       • mlock_level = 1
         See -l

       • sh_retries = 8
         The number of times to try acquiring the paxos lease while acquiring a shared lease,
         when the paxos lease is held by another host that is also acquiring a shared lease.

       • uname = sanlock
         See -U

       • gname = sanlock
         See -G

       • our_host_name = <str>
         See -e

       • renewal_read_extend_sec = <seconds>
         If  a  renewal  read  i/o  times out, wait this many additional seconds for that read to
         complete at the start of the subsequent renewal attempt.  When not  configured,  sanlock
         waits for an additional io_timeout seconds for a previous timed out read to complete.

       • renewal_history_size = 180
         See -H

       • paxos_debug_all = 0
         Include all details in the paxos debug logging.

       • debug_io = <str>
         Add  debug  logging  for  each  i/o.   "submit"  (no  quotes)  produces  debug output at
         submission  time,  "complete"  produces   debug   output   at   completion   time,   and
         "submit,complete" (no space) produces both.

       • max_sectors_kb = <str>|<num>
         Set  to "ignore" (no quotes) to prevent sanlock from checking or changing max_sectors_kb
         for the lockspace disk when starting a lockspace.  Set to "align"  (no  quotes)  to  set
         max_sectors_kb  for  the  lockspace  disk  to the align size of the lockspace.  Set to a
         number to set a specific number of KB for all lockspace disks.

       • debug_clients = 0
         Enable or disable debug logging for all client connections to the sanlock daemon.

       • debug_cmd = +|-<name>
         Enable (+name) or disable (-name) debug logging at  the  command  processing  level  for
         specifically   named   commands,   e.g.   "debug_cmd   =   +acquire",  or  "debug_cmd  =
         -inq_lockspace".  Repeat this line for each command name.  Use a plus prefix before  the
         name  to enable and a minus prefix to disable.  By default sanlock disables some command
         level debugging for commands that are often repetitive and fill the in-memory debug
         buffer.   This only affects debug logging, not errors or warnings, and disabling command
         level debugging for a command does not disable lower level debugging for  that  command.
         Special  values  +all and -all can be used to enable or disable all commands, and can be
         used before or after other debug_cmd lines.

       • write_init_io_timeout = <seconds>
         The io timeout to use when initializing on-disk lease structures for a lockspace or
         resource.  This timeout is not used as a part of either lease algorithm (as the standard
         io_timeout is).

       • max_worker_threads = <num>
         See -t
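
       The settings above use a "name = value" form, and debug_cmd may be repeated.  A minimal
       Python sketch of reading such text follows.  It is not sanlock's own parser: the
       function name is hypothetical, and skipping lines that start with "#" is an assumption
       made for the sketch.

```python
def parse_sanlock_conf(text: str) -> dict:
    """Collect 'name = value' settings; repeated debug_cmd lines become a list."""
    settings = {}
    debug_cmds = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):  # assumption: '#' starts a comment
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a 'name = value' line
        key, value = key.strip(), value.strip()
        if key == "debug_cmd":
            debug_cmds.append(value)
        else:
            settings[key] = value
    settings["debug_cmd"] = debug_cmds
    return settings


sample = """
use_watchdog = 1
renewal_history_size = 180
debug_cmd = +acquire
debug_cmd = -inq_lockspace
"""
conf = parse_sanlock_conf(sample)
print(conf["renewal_history_size"])
print(conf["debug_cmd"])
```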

SEE ALSO

       wdmd(8)

                                            2015-01-23                                 SANLOCK(8)