Provided by: nfs-ganesha-rados-grace_4.3-2_amd64

NAME

       ganesha-rados-cluster-design - Clustered RADOS Recovery Backend Design

OVERVIEW

       This  document  aims  to  explain  the theory and design behind the rados_cluster recovery
       backend, which coordinates  grace  period  enforcement  among  multiple,  independent  NFS
       servers.

       In  order to understand the clustered recovery backend, it's first necessary to understand
       how recovery works with a single server:

SINGLETON SERVER RECOVERY

       NFSv4 is a lease-based protocol. Clients set up a relationship  to  the  server  and  must
       periodically  renew  their  lease  in order to maintain their ephemeral state (open files,
       locks, delegations or layouts).

       When a singleton NFS server is restarted, any ephemeral state is  lost.  When  the  server
       comes back online, NFS clients detect that the server has been restarted and will
       reclaim the ephemeral state that they held at the time of  their  last  contact  with  the
       server.

SINGLETON GRACE PERIOD

       In order to ensure that we don't end up with conflicts, clients are barred from acquiring
       any new state while the server is in the Recovery phase. Only reclaim operations are
       allowed.

       This period of time is called the grace period. Most NFS servers have a grace period that
       lasts around two lease periods. However, nfs-ganesha can and will lift the grace period
       early if it determines that no more clients will be allowed to recover.

       Once the grace period ends, the server will move into its Normal operation  state.  During
       this period, no more recovery is allowed and new state can be acquired by NFS clients.

REBOOT EPOCHS

       The  lifecycle  of  a singleton NFS server can be considered to be a series of transitions
       from the Recovery period to Normal operation and back. In the remainder of  this  document
       we'll consider such a period to be an epoch, and assign each a number beginning with 1.

       Visually,  we  can represent it like this, such that each Normal -> Recovery transition is
       marked by a change in the epoch value:

          +-------+-------+-------+---------------+-------+
          | State | R | N | R | N | R | R | R | N | R | N |
          +-------+-------+-------+---------------+-------+
          | Epoch |   1   |   2   |       3       |   4   |
          +-------+-------+-------+---------------+-------+

       Note that it is possible to restart during the grace period (as shown above during epoch
       3). That just serves to extend the recovery period and the epoch. A new epoch is only
       declared on a Normal -> Recovery transition.
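
       As a minimal sketch of the numbering shown in the table above (Python, with a
       hypothetical function name that is not part of nfs-ganesha), the epoch can be derived
       from a sequence of states. It assumes the time before the first start counts as
       Normal, so the first Recovery period opens epoch 1:

```python
def assign_epochs(states):
    """Return the epoch number for each entry in a list of 'R'/'N' states."""
    epochs = []
    epoch = 0
    previous = "N"  # assumption: treat the time before first start as Normal
    for state in states:
        # Only a Normal -> Recovery transition changes the epoch value.
        # A restart during Recovery (R followed by R) keeps the same epoch.
        if state == "R" and previous == "N":
            epoch += 1
        epochs.append(epoch)
        previous = state
    return epochs


timeline = ["R", "N", "R", "N", "R", "R", "R", "N", "R", "N"]
print(assign_epochs(timeline))
```

       The three consecutive R entries model restarts during the grace period and all fall in
       the same epoch.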

CLIENT RECOVERY DATABASE

       There are some potential edge cases  that  can  occur  involving  network  partitions  and
       multiple  reboots.  In  order to prevent those, the server must maintain a list of clients
       that hold state on the server at any given time. This list must be  maintained  on  stable
       storage.  If a client sends a request to reclaim some state, then the server must check to
       make sure it's on that list before allowing the request.

       Thus when the server allows reclaim requests it must always gate it against  the  recovery
       database  from the previous epoch. As clients come in to reclaim, we establish records for
       them in a new database associated with the current epoch.

       The transition from recovery to normal  operation  should  perform  an  atomic  switch  of
       recovery  databases.  A  recovery database only becomes legitimate on a recovery to normal
       transition. Until that point, the  recovery  database  from  the  previous  epoch  is  the
       canonical one.
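
       The gating and switch described above can be sketched as a toy in-memory model. The
       class and attribute names are hypothetical, and a real server would have to keep both
       databases on stable storage and make the switch atomic there:

```python
class RecoveryDB:
    """Toy model of per-epoch client recovery databases (illustrative only)."""

    def __init__(self, previous_clients):
        self.canonical = set(previous_clients)  # database from the previous epoch
        self.pending = set()                    # database for the current epoch
        self.in_grace = True

    def reclaim(self, client_id):
        """Allow a reclaim only during grace, and only for a listed client."""
        if not self.in_grace or client_id not in self.canonical:
            return False
        self.pending.add(client_id)  # record the client for the current epoch
        return True

    def end_grace(self):
        """Recovery -> Normal: the new database becomes the canonical one."""
        self.canonical, self.pending = self.pending, set()
        self.in_grace = False


db = RecoveryDB({"client-a"})
print(db.reclaim("client-a"), db.reclaim("client-b"))
```

       A client that never reclaimed during the grace period is absent from the new canonical
       database, so it cannot reclaim after a later restart.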

EXPORTING A CLUSTERED FILESYSTEM

       Let's  consider  a set of independent NFS servers, all serving out the same content from a
       clustered backend filesystem of any flavor. Each NFS server in this  case  can  itself  be
       considered  a  clustered  FS client. This means that the NFS server is really just a proxy
       for state on the clustered filesystem.

       The filesystem must make some guarantees to the NFS server. First filesystem guarantee:

       1. The filesystem ensures that the NFS servers (aka the FS clients)  cannot  obtain  state
          that conflicts with that of another NFS server.

       This  is  somewhat  obvious and is what we expect from any clustered filesystem outside of
       any requirements of NFS. If the clustered filesystem can provide this, then we  know  that
       conflicting state during normal operations cannot be granted.

       The  recovery  period  has  a  different  set  of  rules.  If an NFS server crashes and is
       restarted, then we have a window of time when that NFS server does not know what state was
       held by its clients.

       If  the  state  held  by  the  crashed NFS server is immediately released after the crash,
       another NFS server could hand out conflicting state before the original NFS client  has  a
       chance to recover it.

       This must be prevented. Second filesystem guarantee:

       2. The  filesystem must not release state held by a server during the previous epoch until
          all servers in the cluster are enforcing the grace period.

       In practical terms, we want the filesystem to provide a way for an NFS server to  tell  it
       when  it's  safe to release state held by a previous instance of itself. The server should
       do this once it knows that all of its siblings are enforcing the grace period.

       Note that we do not require that all servers restart and allow reclaim at that point. It's
       sufficient  for them to simply begin grace period enforcement as soon as possible once one
       server needs it.

CLUSTERED GRACE PERIOD DATABASE

       At this point the cluster siblings are no longer completely  independent,  and  the  grace
       period has become a cluster-wide property. This means that we must track the current epoch
       on some sort of shared storage that the servers can all access.

       We must also keep track of whether a cluster-wide grace period is in effect. All running
       nodes should be informed when either of these values changes, so that they can take
       appropriate steps when it occurs.

       In the rados_cluster backend, we track these using two epoch values:

       C: the current epoch. This represents the current epoch value of the cluster.

       R: the recovery epoch. This represents the epoch from which clients are allowed to
          recover. A non-zero value here means that a cluster-wide grace period is in effect.
          Setting this to 0 ends that grace period.

       In  order to decide when to make grace period transitions, each server must also advertise
       its state to the other nodes. Specifically, each server must be able  to  determine  these
       two things about each of its siblings:

       1. Does  this  server  have  clients  from  the previous epoch that will require recovery?
          (NEED)

       2. Is this server enforcing the grace period by refusing non-reclaim locks?  (ENFORCING)

       We do this with a pair of flags per sibling (NEED and ENFORCING).  Each  server  typically
       manages its own flags.
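
       One way to picture the shared state is as a small record holding C, R and the per-node
       flags. This is only an illustrative Python model with hypothetical names; it does not
       describe the actual layout of the stored object:

```python
from dataclasses import dataclass, field


@dataclass
class NodeFlags:
    need: bool = False       # NEED: has clients from the previous epoch to recover
    enforcing: bool = False  # ENFORCING: is enforcing the grace period


@dataclass
class GraceState:
    cur: int = 1                               # C: current epoch of the cluster
    rec: int = 0                               # R: recovery epoch, 0 means no grace period
    nodes: dict = field(default_factory=dict)  # node id -> NodeFlags

    def in_grace(self):
        """A non-zero R means a cluster-wide grace period is in effect."""
        return self.rec != 0


state = GraceState(cur=6, rec=5, nodes={"a": NodeFlags(True, True), "b": NodeFlags(False, True)})
print(state.in_grace())
```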

       The  rados_cluster backend stores all of this information in a single RADOS object that is
       modified using read/modify/write cycles. Typically we'll read the whole object, modify it,
       and  then  attempt  to  write it back. If something changes between the read and write, we
       redo the read and try it again.
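
       The read/modify/write cycle can be sketched with an in-memory stand-in for the shared
       object. The version counter used here to detect a concurrent change is an assumption
       made for illustration, as are all of the names; this is not the RADOS API:

```python
import copy


class VersionedObject:
    """In-memory stand-in for a shared object with a version counter."""

    def __init__(self, value):
        self.value = value
        self.version = 0

    def read(self):
        # Return a private copy so callers cannot change the stored value directly.
        return copy.deepcopy(self.value), self.version

    def write_if_unchanged(self, new_value, expected_version):
        """Write only if nobody else has written since our read."""
        if self.version != expected_version:
            return False
        self.value = new_value
        self.version += 1
        return True


def update(obj, modify, max_attempts=10):
    """Read the whole object, modify it, write it back; redo on conflict."""
    for _ in range(max_attempts):
        value, version = obj.read()
        if obj.write_if_unchanged(modify(value), version):
            return True
    return False


grace = VersionedObject({"C": 1, "R": 0})
update(grace, lambda v: {**v, "R": v["C"], "C": v["C"] + 1})
print(grace.value)
```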

CLUSTERED CLIENT RECOVERY DATABASES

       In rados_cluster the client recovery databases are stored as RADOS objects. Each NFS
       server has its own set of them, and they are given names that have the current epoch (C)
       embedded in them. This ensures that recovery databases are specific to a particular epoch.

       In general, it's safe to delete any recovery database that precedes R when R is  non-zero,
       and safe to remove any recovery database except for the current one (the one with C in the
       name) when the grace period is not in effect (R==0).
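
       The cleanup rule above can be written as a small helper. The function name is
       hypothetical, and databases are identified here only by their epoch number rather than
       by their real object names:

```python
def deletable_epochs(db_epochs, cur, rec):
    """Return the epochs of recovery databases that are safe to delete.

    db_epochs: epochs for which a recovery database exists
    cur: the current epoch (C)
    rec: the recovery epoch (R), 0 when no grace period is in effect
    """
    if rec != 0:
        # Grace period in effect: anything that precedes R can go.
        return sorted(e for e in db_epochs if e < rec)
    # No grace period: everything except the current database can go.
    return sorted(e for e in db_epochs if e != cur)


print(deletable_epochs([3, 4, 5, 6], cur=6, rec=5))
print(deletable_epochs([3, 4, 5, 6], cur=6, rec=0))
```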

ESTABLISHING A NEW GRACE PERIOD

       When a server restarts and wants  to  allow  clients  to  reclaim  their  state,  it  must
       establish  a  new  epoch  by  incrementing the current epoch to declare a new grace period
       (R=C; C=C+1).

       The exception to this rule is when the cluster is already in a grace period. In that
       case, a server simply joins the in-progress grace period instead of establishing a new
       one.

       In either case, the server should also set its NEED and ENFORCING flags at the same time.
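
       A minimal sketch of this start-or-join decision follows. The dictionary layout and the
       function name are hypothetical, and in a real cluster the change would have to be
       applied to the shared object in a way that is safe against concurrent writers:

```python
def start_or_join_grace(state, node):
    """Begin a new grace period, or join one that is already in progress.

    state: {"C": int, "R": int, "flags": {node id: set of flag names}}
    """
    if state["R"] == 0:
        # No grace period is active: establish a new epoch (R=C; C=C+1).
        state["R"] = state["C"]
        state["C"] += 1
    # In either case, set this node's NEED and ENFORCING flags.
    state["flags"].setdefault(node, set()).update({"NEED", "ENFORCING"})
    return state


state = {"C": 5, "R": 0, "flags": {}}
start_or_join_grace(state, "node-a")  # starts a new grace period
start_or_join_grace(state, "node-b")  # joins the one already in progress
print(state["C"], state["R"])
```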

       The other surviving cluster siblings should take steps to begin grace  period  enforcement
       as soon as possible. This entails "draining off" any in-progress state-morphing operations
       and then blocking the acquisition of any new state (usually with a return of NFS4ERR_GRACE
       to  clients  that attempt it). Again, there is no need for the survivors from the previous
       epoch to allow recovery here.

       The surviving servers must however establish a new client recovery database at this  point
       to ensure that their clients can do recovery in the event of a crash afterward.

       Once  all  of  the siblings are enforcing the grace period, the recovering server can then
       request that the filesystem release the old state, and allow clients to  begin  reclaiming
       their  state.  In  the rados_cluster backend driver, we do this by stalling server startup
       until all hosts in the cluster are enforcing the grace period.
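
       The stall can be pictured as a loop that waits until every node shows ENFORCING. The
       sketch below polls purely for simplicity; the names, the polling approach and the
       timeout values are assumptions, not a description of how the backend is notified:

```python
import time


def all_enforcing(flags):
    """True when every node in the cluster has its ENFORCING flag set."""
    return all("ENFORCING" in node_flags for node_flags in flags.values())


def wait_until_all_enforcing(read_flags, interval=1.0, attempts=30, sleep=time.sleep):
    """Stall until all nodes enforce the grace period, or give up.

    read_flags: callable returning {node id: set of flag names}
    """
    for _ in range(attempts):
        if all_enforcing(read_flags()):
            return True  # safe to release old state and allow reclaim
        sleep(interval)
    return False


flags = {"node-a": {"NEED", "ENFORCING"}, "node-b": {"ENFORCING"}}
print(wait_until_all_enforcing(lambda: flags))
```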

LIFTING THE GRACE PERIOD

       Transitioning from recovery to normal operation really consists of two different steps:

       1. the server decides that it no longer requires a grace period, either because the grace
          period timed out or because no clients would be allowed to reclaim.

       2. the server stops enforcing the grace period and transitions to normal operation

       These concepts are often conflated in singleton servers, but in a cluster we must consider
       them independently.

       When a server is finished with its own local recovery period, it should clear its NEED
       flag. That server should continue enforcing the grace period until the grace period is
       fully lifted, but it must not permit reclaims after clearing its NEED flag.

       If the server's own NEED flag is the last one set, then that server can lift the grace
       period (by setting R=0). At that point, all servers in the cluster can end grace period
       enforcement, and communicate that fact to the others by clearing their ENFORCING flags.
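
       The two steps can be sketched separately, as below. The function names and the
       dictionary layout are hypothetical, and the sketch ignores concurrent updates to the
       shared state:

```python
def finish_local_recovery(state, node):
    """Step 1: this node no longer needs a grace period."""
    state["flags"][node].discard("NEED")
    # If no node has NEED set any more, this node lifts the grace period.
    if not any("NEED" in f for f in state["flags"].values()):
        state["R"] = 0
    return state


def stop_enforcing_if_lifted(state, node):
    """Step 2: stop enforcing only once the grace period is fully lifted."""
    if state["R"] == 0:
        state["flags"][node].discard("ENFORCING")
    return state


state = {"C": 6, "R": 5, "flags": {"a": {"NEED", "ENFORCING"}, "b": {"NEED", "ENFORCING"}}}
finish_local_recovery(state, "a")     # "b" still needs grace, so R stays at 5
stop_enforcing_if_lifted(state, "a")  # no effect yet: "a" keeps enforcing
finish_local_recovery(state, "b")     # last NEED flag cleared, so R becomes 0
print(state["R"])
```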

                                           Apr 28, 2023           GANESHA-RADOS-CLUSTER-DESIGN(8)