Provided by: manpages-dev_6.03-2_all bug

NAME

       seccomp_unotify - Seccomp user-space notification mechanism

LIBRARY

       Standard C library (libc, -lc)

SYNOPSIS

       #include <linux/seccomp.h>
       #include <linux/filter.h>
       #include <linux/audit.h>

       int seccomp(unsigned int operation, unsigned int flags, void *args);

       #include <sys/ioctl.h>

       int ioctl(int fd, SECCOMP_IOCTL_NOTIF_RECV,
                 struct seccomp_notif *req);
       int ioctl(int fd, SECCOMP_IOCTL_NOTIF_SEND,
                 struct seccomp_notif_resp *resp);
       int ioctl(int fd, SECCOMP_IOCTL_NOTIF_ID_VALID, __u64 *id);
       int ioctl(int fd, SECCOMP_IOCTL_NOTIF_ADDFD,
                 struct seccomp_notif_addfd *addfd);

DESCRIPTION

       This page describes the user-space notification mechanism provided by the Secure Computing
       (seccomp) facility.  As well as the use of the SECCOMP_FILTER_FLAG_NEW_LISTENER flag,  the
       SECCOMP_RET_USER_NOTIF  action  value, and the SECCOMP_GET_NOTIF_SIZES operation described
       in seccomp(2), this mechanism involves the use of a number of related ioctl(2)  operations
       (described below).

   Overview
       In  conventional  usage of a seccomp filter, the decision about how to treat a system call
       is made by the filter itself.  By contrast, the user-space notification  mechanism  allows
       the  seccomp  filter  to  delegate  the  handling of the system call to another user-space
       process.  Note that this mechanism is explicitly not intended  as  a  method  implementing
       security policy; see NOTES.

       In  the discussion that follows, the thread(s) on which the seccomp filter is installed is
       (are) referred to as the target, and the  process  that  is  notified  by  the  user-space
       notification mechanism is referred to as the supervisor.

       A  suitably privileged supervisor can use the user-space notification mechanism to perform
       actions on behalf of the target.  The advantage of the user-space  notification  mechanism
       is  that  the supervisor will usually be able to retrieve information about the target and
       the performed system call that the seccomp filter itself cannot.   (A  seccomp  filter  is
       limited in the information it can obtain and the actions that it can perform because it is
       running on a virtual machine inside the kernel.)

       An overview of the steps performed by the target and the supervisor is as follows:

       (1)  The  target  establishes  a  seccomp  filter  in  the  usual  manner,  but  with  two
            differences:

            •  The  seccomp(2) flags argument includes the flag SECCOMP_FILTER_FLAG_NEW_LISTENER.
               Consequently, the return value of  the  (successful)  seccomp(2)  call  is  a  new
               "listening"  file  descriptor that can be used to receive notifications.  Only one
               "listening" seccomp filter can be installed for a thread.

            •  In cases where it is appropriate, the seccomp  filter  returns  the  action  value
               SECCOMP_RET_USER_NOTIF.  This return value will trigger a notification event.

       (2)  In  order  that  the  supervisor  can  obtain  notifications using the listening file
            descriptor, (a duplicate of) that file descriptor must be passed from the  target  to
            the  supervisor.   One  way  in  which  this  could  be  done  is by passing the file
            descriptor over a UNIX domain socket connection between the target and the supervisor
            (using  the  SCM_RIGHTS ancillary message type described in unix(7)).  Another way to
            do this is through the use of pidfd_getfd(2).

       (3)  The supervisor will receive notification events on  the  listening  file  descriptor.
            These  events  are  returned  as  structures  of  type  seccomp_notif.   Because this
            structure and its size may evolve over kernel versions,  the  supervisor  must  first
            determine  the  size  of  this structure using the seccomp(2) SECCOMP_GET_NOTIF_SIZES
            operation, which returns a structure of  type  seccomp_notif_sizes.   The  supervisor
            allocates  a  buffer  of  size  seccomp_notif_sizes.seccomp_notif  bytes  to  receive
            notification events.  In addition,the supervisor allocates  another  buffer  of  size
            seccomp_notif_sizes.seccomp_notif_resp    bytes    for   the   response   (a   struct
            seccomp_notif_resp structure) that it will  provide  to  the  kernel  (and  thus  the
            target).

       (4)  The  target  then  performs  its  workload,  which includes system calls that will be
            controlled by the seccomp filter.  Whenever one of  these  system  calls  causes  the
            filter  to  return the SECCOMP_RET_USER_NOTIF action value, the kernel does not (yet)
            execute the system call; instead, execution of  the  target  is  temporarily  blocked
            inside  the  kernel  (in  a  sleep  state  that  is  interruptible  by signals) and a
            notification event is generated on the listening file descriptor.

       (5)  The  supervisor  can  now  repeatedly  monitor  the  listening  file  descriptor  for
            SECCOMP_RET_USER_NOTIF-triggered  events.   To  do  this,  the  supervisor  uses  the
            SECCOMP_IOCTL_NOTIF_RECV ioctl(2) operation to read information about a  notification
            event;  this  operation  blocks until an event is available.  The operation returns a
            seccomp_notif structure containing information about the system call  that  is  being
            attempted  by  the  target.   (As described in NOTES, the file descriptor can also be
            monitored with select(2), poll(2), or epoll(7).)

       (6)  The  seccomp_notif  structure  returned  by  the  SECCOMP_IOCTL_NOTIF_RECV  operation
            includes  the  same  information  (a  seccomp_data  structure) that was passed to the
            seccomp filter.  This information allows the supervisor to discover the  system  call
            number and the arguments for the target's system call.  In addition, the notification
            event contains the ID of the thread that triggered  the  notification  and  a  unique
            cookie   value   that   is   used   in  subsequent  SECCOMP_IOCTL_NOTIF_ID_VALID  and
            SECCOMP_IOCTL_NOTIF_SEND operations.

            The information in the notification can be used to discover  the  values  of  pointer
            arguments  for  the target's system call.  (This is something that can't be done from
            within a seccomp filter.)  One way in which the supervisor can do this is to open the
            corresponding  /proc/tid/mem file (see proc(5)) and read bytes from the location that
            corresponds to  one  of  the  pointer  arguments  whose  value  is  supplied  in  the
            notification  event.   (The supervisor must be careful to avoid a race condition that
            can occur when doing this; see the description  of  the  SECCOMP_IOCTL_NOTIF_ID_VALID
            ioctl(2)  operation  below.)   In  addition,  the  supervisor can access other system
            information that is visible in user space but which is not accessible from a  seccomp
            filter.

       (7)  Having  obtained information as per the previous step, the supervisor may then choose
            to perform an action in response to the target's system call (which, as noted  above,
            is  not  executed  when  the seccomp filter returns the SECCOMP_RET_USER_NOTIF action
            value).

            One example use case here relates to containers.  The target may be located inside  a
            container where it does not have sufficient capabilities to mount a filesystem in the
            container's mount namespace.  However,  the  supervisor  may  be  a  more  privileged
            process that does have sufficient capabilities to perform the mount operation.

       (8)  The  supervisor  then  sends a response to the notification.  The information in this
            response is used by the kernel to construct a return value for  the  target's  system
            call and provide a value that will be assigned to the errno variable of the target.

            The  response is sent using the SECCOMP_IOCTL_NOTIF_SEND ioctl(2) operation, which is
            used to transmit a  seccomp_notif_resp  structure  to  the  kernel.   This  structure
            includes  a  cookie value that the supervisor obtained in the seccomp_notif structure
            returned by the SECCOMP_IOCTL_NOTIF_RECV operation.  This  cookie  value  allows  the
            kernel  to  associate  the response with the target.  This structure must include the
            cookie value that the supervisor obtained in the seccomp_notif structure returned  by
            the SECCOMP_IOCTL_NOTIF_RECV operation; the cookie allows the kernel to associate the
            response with the target.

       (9)  Once the notification has been sent, the system call in the target  thread  unblocks,
            returning  the  information  that  was provided by the supervisor in the notification
            response.

       As a variation on the last two steps, the supervisor can send a response  that  tells  the
       kernel  that  it  should  execute  the  target thread's system call; see the discussion of
       SECCOMP_USER_NOTIF_FLAG_CONTINUE, below.

IOCTL OPERATIONS

       The following ioctl(2) operations are supported by  the  seccomp  user-space  notification
       file  descriptor.   For  each of these operations, the first (file descriptor) argument of
       ioctl(2) is the listening file descriptor returned  by  a  call  to  seccomp(2)  with  the
       SECCOMP_FILTER_FLAG_NEW_LISTENER flag.

   SECCOMP_IOCTL_NOTIF_RECV
       The  SECCOMP_IOCTL_NOTIF_RECV  operation  (available  since Linux 5.0) is used to obtain a
       user-space notification event.  If no such  event  is  currently  pending,  the  operation
       blocks  until an event occurs.  The third ioctl(2) argument is a pointer to a structure of
       the following form which contains information about the event.   This  structure  must  be
       zeroed out before the call.

           struct seccomp_notif {
               __u64  id;              /* Cookie */
               __u32  pid;             /* TID of target thread */
               __u32  flags;           /* Currently unused (0) */
               struct seccomp_data data;   /* See seccomp(2) */
           };

       The fields in this structure are as follows:

       id     This is a cookie for the notification.  Each such cookie is guaranteed to be unique
              for the corresponding seccomp filter.

              •  The cookie can be used with the SECCOMP_IOCTL_NOTIF_ID_VALID ioctl(2)  operation
                 described below.

              •  When  returning  a  notification  response  to  the  kernel, the supervisor must
                 include the cookie value in the seccomp_notif_resp structure that  is  specified
                 as the argument of the SECCOMP_IOCTL_NOTIF_SEND operation.

       pid    This is the thread ID of the target thread that triggered the notification event.

       flags  This  is  a  bit  mask of flags providing further information on the event.  In the
              current implementation, this field is always zero.

       data   This is a seccomp_data structure containing information about the system call  that
              triggered  the  notification.   This  is  the  same structure that is passed to the
              seccomp filter.  See seccomp(2) for details of this structure.

       On success, this operation returns 0; on failure, -1 is returned,  and  errno  is  set  to
       indicate the cause of the error.  This operation can fail with the following errors:

       EINVAL (since Linux 5.5)
              The seccomp_notif structure that was passed to the call contained nonzero fields.

       ENOENT The  target thread was killed by a signal as the notification information was being
              generated, or the target's (blocked)  system  call  was  interrupted  by  a  signal
              handler.

   SECCOMP_IOCTL_NOTIF_ID_VALID
       The  SECCOMP_IOCTL_NOTIF_ID_VALID  operation  (available since Linux 5.0) is used to check
       that a notification ID returned by an earlier SECCOMP_IOCTL_NOTIF_RECV operation is  still
       valid (i.e., that the target still exists and its system call is still blocked waiting for
       a response).

       The  third  ioctl(2)  argument  is  a  pointer  to  the  cookie  (id)  returned   by   the
       SECCOMP_IOCTL_NOTIF_RECV operation.

       This  operation is necessary to avoid race conditions that can occur when the pid returned
       by the SECCOMP_IOCTL_NOTIF_RECV operation terminates, and that process  ID  is  reused  by
       another process.  An example of this kind of race is the following

       (1)  A  notification  is  generated  on  the  listening  file  descriptor.   The  returned
            seccomp_notif contains the TID of  the  target  thread  (in  the  pid  field  of  the
            structure).

       (2)  The target terminates.

       (3)  Another thread or process is created on the system that by chance reuses the TID that
            was freed when the target terminated.

       (4)  The supervisor open(2)s the /proc/tid/mem file for the TID obtained in step  1,  with
            the  intention  of  (say)  inspecting  the  memory  location(s)  that  containing the
            argument(s) of the system call that triggered the notification in step 1.

       In the above scenario, the risk is that the supervisor may try to access the memory  of  a
       process  other than the target.  This race can be avoided by following the call to open(2)
       with a SECCOMP_IOCTL_NOTIF_ID_VALID operation to verify that the  process  that  generated
       the  notification  is  still  alive.  (Note that if the target terminates after the latter
       step, a subsequent read(2) from the file descriptor may return 0, indicating end of file.)

       See NOTES for a discussion of other cases where SECCOMP_IOCTL_NOTIF_ID_VALID  checks  must
       be performed.

       On  success  (i.e.,  the  notification  ID  is still valid), this operation returns 0.  On
       failure (i.e., the notification ID is no longer valid), -1 is returned, and errno  is  set
       to ENOENT.

   SECCOMP_IOCTL_NOTIF_SEND
       The  SECCOMP_IOCTL_NOTIF_SEND  operation  (available  since  Linux  5.0) is used to send a
       notification response back to the kernel.  The third ioctl(2) argument of  this  structure
       is a pointer to a structure of the following form:

           struct seccomp_notif_resp {
               __u64 id;           /* Cookie value */
               __s64 val;          /* Success return value */
               __s32 error;        /* 0 (success) or negative error number */
               __u32 flags;        /* See below */
           };

       The fields of this structure are as follows:

       id     This  is  the  cookie  value  that  was obtained using the SECCOMP_IOCTL_NOTIF_RECV
              operation.  This cookie  value  allows  the  kernel  to  correctly  associate  this
              response with the system call that triggered the user-space notification.

       val    This  is  the value that will be used for a spoofed success return for the target's
              system call; see below.

       error  This is the value that will be used as the error number (errno) for a spoofed error
              return for the target's system call; see below.

       flags  This is a bit mask that includes zero or more of the following flags:

              SECCOMP_USER_NOTIF_FLAG_CONTINUE (since Linux 5.5)
                     Tell the kernel to execute the target's system call.

       Two kinds of response are possible:

       •  A response to the kernel telling it to execute the target's system call.  In this case,
          the flags field includes SECCOMP_USER_NOTIF_FLAG_CONTINUE and the error and val  fields
          must be zero.

          This  kind  of  response can be useful in cases where the supervisor needs to do deeper
          analysis of the target's system call than is possible  from  a  seccomp  filter  (e.g.,
          examining  the  values  of pointer arguments), and, having decided that the system call
          does not require emulation by the supervisor, the supervisor wants the system  call  to
          be executed normally in the target.

          The SECCOMP_USER_NOTIF_FLAG_CONTINUE flag should be used with caution; see NOTES.

       •  A spoofed return value for the target's system call.  In this case, the kernel does not
          execute the target's system call, instead causing the system call to return  a  spoofed
          value  as  specified  by  fields  of  the seccomp_notif_resp structure.  The supervisor
          should set the fields of this structure as follows:

          +  flags does not contain SECCOMP_USER_NOTIF_FLAG_CONTINUE.

          +  error is set either to 0 for a spoofed "success"  return  or  to  a  negative  error
             number  for  a  spoofed "failure" return.  In the former case, the kernel causes the
             target's system call to return the value specified in the val field.  In the  latter
             case, the kernel causes the target's system call to return -1, and errno is assigned
             the negated error value.

          +  val is set to a value that will be used as the return value for a spoofed  "success"
             return  for  the  target's  system  call.  The value in this field is ignored if the
             error field contains a nonzero value.

       On success, this operation returns 0; on failure, -1 is returned,  and  errno  is  set  to
       indicate the cause of the error.  This operation can fail with the following errors:

       EINPROGRESS
              A response to this notification has already been sent.

       EINVAL An invalid value was specified in the flags field.

       EINVAL The  flags  field  contained SECCOMP_USER_NOTIF_FLAG_CONTINUE, and the error or val
              field was not zero.

       ENOENT The blocked system call in the target has been interrupted by a signal  handler  or
              the target has terminated.

   SECCOMP_IOCTL_NOTIF_ADDFD
       The  SECCOMP_IOCTL_NOTIF_ADDFD operation (available since Linux 5.9) allows the supervisor
       to install a file descriptor into the target's file descriptor table.  Much like  the  use
       of  SCM_RIGHTS messages described in unix(7), this operation is semantically equivalent to
       duplicating a file descriptor  from  the  supervisor's  file  descriptor  table  into  the
       target's file descriptor table.

       The  SECCOMP_IOCTL_NOTIF_ADDFD operation permits the supervisor to emulate a target system
       call (such as socket(2) or openat(2)) that generates a file  descriptor.   The  supervisor
       can  perform  the system call that generates the file descriptor (and associated open file
       description) and then use this operation to allocate a file descriptor that refers to  the
       same  open file description in the target.  (For an explanation of open file descriptions,
       see open(2).)

       Once this operation has been performed, the supervisor can close  its  copy  of  the  file
       descriptor.

       In  the  target, the received file descriptor is subject to the same Linux Security Module
       (LSM) checks as are applied to a  file  descriptor  that  is  received  in  an  SCM_RIGHTS
       ancillary  message.   If  the  file  descriptor refers to a socket, it inherits the cgroup
       version 1 network controller settings (classid and netprioidx) of the target.

       The third ioctl(2) argument is a pointer to a structure of the following form:

           struct seccomp_notif_addfd {
               __u64 id;           /* Cookie value */
               __u32 flags;        /* Flags */
               __u32 srcfd;        /* Local file descriptor number */
               __u32 newfd;        /* 0 or desired file descriptor
                                      number in target */
               __u32 newfd_flags;  /* Flags to set on target file
                                      descriptor */
           };

       The fields in this structure are as follows:

       id     This field should be set to the notification ID (cookie value)  that  was  obtained
              via SECCOMP_IOCTL_NOTIF_RECV.

       flags  This  field  is  a  bit  mask  of  flags that modify the behavior of the operation.
              Currently, only one flag is supported:

              SECCOMP_ADDFD_FLAG_SETFD
                     When allocating the file descriptor in the target, use the  file  descriptor
                     number specified in the newfd field.

              SECCOMP_ADDFD_FLAG_SEND (since Linux 5.14)
                     Perform     the     equivalent     of     SECCOMP_IOCTL_NOTIF_ADDFD     plus
                     SECCOMP_IOCTL_NOTIF_SEND as an atomic operation.  On successful  invocation,
                     the  target  process's errno will be 0 and the return value will be the file
                     descriptor number that was allocated in the target.  If allocating the  file
                     descriptor  in  the  target  fails, the target's system call continues to be
                     blocked until a successful response is sent.

       srcfd  This field should be set to the number of the file  descriptor  in  the  supervisor
              that is to be duplicated.

       newfd  This  field determines which file descriptor number is allocated in the target.  If
              the SECCOMP_ADDFD_FLAG_SETFD flag is set, then  this  field  specifies  which  file
              descriptor  number  should be allocated.  If this file descriptor number is already
              open in the target,  it  is  atomically  closed  and  reused.   If  the  descriptor
              duplication  fails due to an LSM check, or if srcfd is not a valid file descriptor,
              the file descriptor newfd will not be closed in the target process.

              If the SECCOMP_ADDFD_FLAG_SETFD flag it not set, then this field must be 0, and the
              kernel allocates the lowest unused file descriptor number in the target.

       newfd_flags
              This field is a bit mask specifying flags that should be set on the file descriptor
              that is received in the target process.  Currently,  only  the  following  flag  is
              implemented:

              O_CLOEXEC
                     Set the close-on-exec flag on the received file descriptor.

       On  success,  this  ioctl(2)  call  returns  the  number  of  the file descriptor that was
       allocated in the target.  Assuming that the emulated system call is  one  that  returns  a
       file  descriptor  as  its function result (e.g., socket(2)), this value can be used as the
       return value (resp.val) that is supplied in the response that is  subsequently  sent  with
       the SECCOMP_IOCTL_NOTIF_SEND operation.

       On error, -1 is returned and errno is set to indicate the cause of the error.

       This operation can fail with the following errors:

       EBADF  Allocating the file descriptor in the target would cause the target's RLIMIT_NOFILE
              limit to be exceeded (see getrlimit(2)).

       EBUSY  If the flag SECCOMP_IOCTL_NOTIF_SEND  is  used,  this  means  the  operation  can't
              proceed until other SECCOMP_IOCTL_NOTIF_ADDFD requests are processed.

       EINPROGRESS
              The  user-space  notification specified in the id field exists but has not yet been
              fetched (by a SECCOMP_IOCTL_NOTIF_RECV) or has already  been  responded  to  (by  a
              SECCOMP_IOCTL_NOTIF_SEND).

       EINVAL An invalid flag was specified in the flags or newfd_flags field, or the newfd field
              is nonzero and the SECCOMP_ADDFD_FLAG_SETFD flag was not  specified  in  the  flags
              field.

       EMFILE The  file  descriptor  number  specified  in  newfd  exceeds the limit specified in
              /proc/sys/fs/nr_open.

       ENOENT The blocked system call in the target has been interrupted by a signal  handler  or
              the target has terminated.

       Here   is   some   sample   code   (with   error   handling   omitted)   that   uses   the
       SECCOMP_ADDFD_FLAG_SETFD operation (here, to emulate a call to openat(2)):

           int fd, removeFd;

           fd = openat(req->data.args[0], path, req->data.args[2],
                           req->data.args[3]);

           struct seccomp_notif_addfd addfd;
           addfd.id = req->id; /* Cookie from SECCOMP_IOCTL_NOTIF_RECV */
           addfd.srcfd = fd;
           addfd.newfd = 0;
           addfd.flags = 0;
           addfd.newfd_flags = O_CLOEXEC;

           targetFd = ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_ADDFD, &addfd);

           close(fd);          /* No longer needed in supervisor */

           struct seccomp_notif_resp *resp;
               /* Code to allocate 'resp' omitted */
           resp->id = req->id;
           resp->error = 0;        /* "Success" */
           resp->val = targetFd;
           resp->flags = 0;
           ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_SEND, resp);

NOTES

       One example use case for the user-space notification mechanism is  to  allow  a  container
       manager  (a  process  which  is  typically  running with more privilege than the processes
       inside the container) to mount block devices or create device  nodes  for  the  container.
       The  mount  use  case  provides  an  example of where the SECCOMP_USER_NOTIF_FLAG_CONTINUE
       ioctl(2) operation is useful.  Upon receiving a notification for the mount(2) system call,
       the  container  manager  (the  "supervisor")  can  distinguish  a request to mount a block
       filesystem (which would not be possible for a "target" process inside the  container)  and
       mount  that  file  system.   If, on the other hand, the container manager detects that the
       operation could be performed by the process inside the  container  (e.g.,  a  mount  of  a
       tmpfs(5)  filesystem),  it can notify the kernel that the target process's mount(2) system
       call can continue.

   select()/poll()/epoll semantics
       The   file   descriptor    returned    when    seccomp(2)    is    employed    with    the
       SECCOMP_FILTER_FLAG_NEW_LISTENER  flag  can  be  monitored  using  poll(2),  epoll(7), and
       select(2).  These interfaces indicate that the file descriptor is ready as follows:

       •  When a notification is pending, these interfaces indicate that the file  descriptor  is
          readable.  Following such an indication, a subsequent SECCOMP_IOCTL_NOTIF_RECV ioctl(2)
          will not block, returning either information about a notification or else failing  with
          the  error  EINTR if the target has been killed by a signal or its system call has been
          interrupted by a signal handler.

       •  After the  notification  has  been  received  (i.e.,  by  the  SECCOMP_IOCTL_NOTIF_RECV
          ioctl(2)  operation),  these  interfaces indicate that the file descriptor is writable,
          meaning that a notification response can be  sent  using  the  SECCOMP_IOCTL_NOTIF_SEND
          ioctl(2) operation.

       •  After  the last thread using the filter has terminated and been reaped using waitpid(2)
          (or similar), the file descriptor  indicates  an  end-of-file  condition  (readable  in
          select(2); POLLHUP/EPOLLHUP in poll(2)/ epoll_wait(2)).

   Design goals; use of SECCOMP_USER_NOTIF_FLAG_CONTINUE
       The intent of the user-space notification feature is to allow system calls to be performed
       on behalf of the target.  The target's  system  call  should  either  be  handled  by  the
       supervisor or allowed to continue normally in the kernel (where standard security policies
       will be applied).

       Note well: this mechanism must not be used to make security  policy  decisions  about  the
       system call, which would be inherently race-prone for reasons described next.

       The  SECCOMP_USER_NOTIF_FLAG_CONTINUE  flag  must  be  used  with  caution.  If set by the
       supervisor, the target's system call will continue.  However, there  is  a  time-of-check,
       time-of-use  race  here,  since  an  attacker could exploit the interval of time where the
       target is blocked waiting on the "continue" response to do things such  as  rewriting  the
       system call arguments.

       Note  furthermore that a user-space notifier can be bypassed if the existing filters allow
       the use of seccomp(2) or prctl(2) to install a filter that returns an action value with  a
       higher precedence than SECCOMP_RET_USER_NOTIF (see seccomp(2)).

       It  should thus be absolutely clear that the seccomp user-space notification mechanism can
       not be used to implement a security policy!  It should only  ever  be  used  in  scenarios
       where  a more privileged process supervises the system calls of a lesser privileged target
       to get around kernel-enforced security restrictions when the supervisor deems  this  safe.
       In  other  words,  in  order to continue a system call, the supervisor should be sure that
       another security mechanism or the kernel itself will sufficiently block the system call if
       its arguments are rewritten to something unsafe.

   Caveats regarding the use of /proc/[tid]/mem
       The  discussion above noted the need to use the SECCOMP_IOCTL_NOTIF_ID_VALID ioctl(2) when
       opening the /proc/tid/mem file of the target to avoid the  possibility  of  accessing  the
       memory of the wrong process in the event that the target terminates and its ID is recycled
       by another (unrelated) thread.  However, the  use  of  this  ioctl(2)  operation  is  also
       necessary in other situations, as explained in the following paragraphs.

       Consider  the following scenario, where the supervisor tries to read the pathname argument
       of a target's blocked mount(2) system call:

       (1)  From one of its functions (func()), the target calls mount(2), which triggers a user-
            space notification and causes the target to block.

       (2)  The  supervisor  receives  the  notification, opens /proc/tid/mem, and (successfully)
            performs the SECCOMP_IOCTL_NOTIF_ID_VALID check.

       (3)  The target receives a signal, which causes the mount(2) to abort.

       (4)  The signal handler executes in the target, and returns.

       (5)  Upon return from the handler, the execution of func() resumes, and  it  returns  (and
            perhaps other functions are called, overwriting the memory that had been used for the
            stack frame of func()).

       (6)  Using the address provided in the notification information, the supervisor reads from
            the target's memory location that used to contain the pathname.

       (7)  The  supervisor now calls mount(2) with some arbitrary bytes obtained in the previous
            step.

       The conclusion from the above scenario is this: since the target's blocked system call may
       be  interrupted  by  a  signal  handler, the supervisor must be written to expect that the
       target may abandon its system call at any time; in such an event, any information that the
       supervisor obtained from the target's memory must be considered invalid.

       To  prevent such scenarios, every read from the target's memory must be separated from use
       of the bytes so obtained by a SECCOMP_IOCTL_NOTIF_ID_VALID check.  In the  above  example,
       the  check  would  be  placed  between the two final steps.  An example of such a check is
       shown in EXAMPLES.

       Following on from the above, it should be clear that a write by the  supervisor  into  the
       target's memory can never be considered safe.

   Caveats regarding blocking system calls
       Suppose  that  the  target  performs  a  blocking  system  call (e.g., accept(2)) that the
       supervisor should handle.  The supervisor might then in turn  execute  the  same  blocking
       system call.

       In  this  scenario,  it  is  important  to  note  that  if the target's system call is now
       interrupted by a signal, the supervisor is not informed of this.  If the  supervisor  does
       not  take  suitable  steps  to  actively  discover  that the target's system call has been
       canceled, various difficulties can occur.  Taking the example of accept(2), the supervisor
       might  remain blocked in its accept(2) holding a port number that the target (which, after
       the interruption by the signal handler, perhaps closed  its listening socket) might expect
       to be able to reuse in a bind(2) call.

       Therefore,  when the supervisor wishes to emulate a blocking system call, it must do so in
       such a way that it gets informed if the target's system call is interrupted  by  a  signal
       handler.   For  example,  if the supervisor itself executes the same blocking system call,
       then it  could  employ  a  separate  thread  that  uses  the  SECCOMP_IOCTL_NOTIF_ID_VALID
       operation  to  check if the target is still blocked in its system call.  Alternatively, in
       the accept(2) example, the supervisor might use poll(2) to monitor both  the  notification
       file  descriptor (so as to discover when the target's accept(2) call has been interrupted)
       and the listening file descriptor (so as to know when a connection is available).

       If the target's system call is interrupted, the  supervisor  must  take  care  to  release
       resources (e.g., file descriptors) that it acquired on behalf of the target.

   Interaction with SA_RESTART signal handlers
       Consider the following scenario:

       (1)  The  target  process  has  used  sigaction(2)  to  install  a signal handler with the
            SA_RESTART flag.

       (2)  The target has made a system call that triggered a  seccomp  user-space  notification
            and  the  target  is  currently  blocked  until  the  supervisor sends a notification
            response.

       (3)  A signal is delivered to the target and the signal handler is executed.

       (4)  When  (if)  the  supervisor  attempts  to   send   a   notification   response,   the
            SECCOMP_IOCTL_NOTIF_SEND ioctl(2)) operation will fail with the ENOENT error.

       In  this  scenario,  the  kernel will restart the target's system call.  Consequently, the
       supervisor will receive another user-space notification.   Thus,  depending  on  how  many
       times  the  blocked  system  call  is  interrupted by a signal handler, the supervisor may
       receive multiple notifications for the same instance of a system call in the target.

       One oddity is that system call restarting as described in this scenario  will  occur  even
       for  the  blocking system calls listed in signal(7) that would never normally be restarted
       by the SA_RESTART flag.

       Furthermore,  if   the   supervisor   response   is   a   file   descriptor   added   with
       SECCOMP_IOCTL_NOTIF_ADDFD, then the flag SECCOMP_ADDFD_FLAG_SEND can be used to atomically
       add the file descriptor and return  that  value,  making  sure  no  file  descriptors  are
       inadvertently leaked into the target.

BUGS

       If a SECCOMP_IOCTL_NOTIF_RECV ioctl(2) operation is performed after the target terminates,
       then the ioctl(2) call simply blocks (rather than returning an error to indicate that  the
       target no longer exists).

EXAMPLES

       The  (somewhat  contrived)  program  shown  below  demonstrates  the use of the interfaces
       described in this page.  The program creates a child process that serves as  the  "target"
       process.    The   child   process   installs   a   seccomp   filter   that   returns   the
       SECCOMP_RET_USER_NOTIF action value if a call is made to mkdir(2).  The child process then
       calls  mkdir(2)  once  for  each  of  the supplied command-line arguments, and reports the
       result  returned  by  the  call.   After  processing  all  arguments,  the  child  process
       terminates.

       The  parent  process  acts  as  the  supervisor,  listening for the notifications that are
       generated when the target process calls mkdir(2).  When such a  notification  occurs,  the
       supervisor examines the memory of the target process (using /proc/pid/mem) to discover the
       pathname argument that was supplied  to  the  mkdir(2)  call,  and  performs  one  of  the
       following actions:

       •  If  the pathname begins with the prefix "/tmp/", then the supervisor attempts to create
          the specified directory, and then spoofs a return for the target process based  on  the
          return  value of the supervisor's mkdir(2) call.  In the event that that call succeeds,
          the spoofed success return value is the length of the pathname.

       •  If the pathname begins with "./" (i.e., it is  a  relative  pathname),  the  supervisor
          sends  a SECCOMP_USER_NOTIF_FLAG_CONTINUE response to the kernel to say that the kernel
          should execute the target process's mkdir(2) call.

       •  If the pathname begins with some other prefix, the supervisor spoofs  an  error  return
          for the target process, so that the target process's mkdir(2) call appears to fail with
          the error EOPNOTSUPP ("Operation  not  supported").   Additionally,  if  the  specified
          pathname is exactly "/bye", then the supervisor terminates.

       This  program  can  be  used to demonstrate various aspects of the behavior of the seccomp
       user-space notification mechanism.  To help aid  such  demonstrations,  the  program  logs
       various messages to show the operation of the target process (lines prefixed "T:") and the
       supervisor (indented lines prefixed "S:").

       In the following example, the target  attempts  to  create  the  directory  /tmp/x.   Upon
       receiving  the  notification, the supervisor creates the directory on the target's behalf,
       and spoofs a success return to be received by the target process's mkdir(2) call.

           $ ./seccomp_unotify /tmp/x
           T: PID = 23168

           T: about to mkdir("/tmp/x")
                   S: got notification (ID 0x17445c4a0f4e0e3c) for PID 23168
                   S: executing: mkdir("/tmp/x", 0700)
                   S: success! spoofed return = 6
                   S: sending response (flags = 0; val = 6; error = 0)
           T: SUCCESS: mkdir(2) returned 6

           T: terminating
                   S: target has terminated; bye

       In the above output, note that the spoofed return value seen by the target  process  is  6
       (the length of the pathname /tmp/x), whereas a normal mkdir(2) call returns 0 on success.

       In the next example, the target attempts to create a directory using the relative pathname
       ./sub.    Since   this   pathname   starts   with   "./",   the   supervisor    sends    a
       SECCOMP_USER_NOTIF_FLAG_CONTINUE   response   to   the   kernel,   and   the  kernel  then
       (successfully) executes the target process's mkdir(2) call.

           $ ./seccomp_unotify ./sub
           T: PID = 23204

           T: about to mkdir("./sub")
                   S: got notification (ID 0xddb16abe25b4c12) for PID 23204
                   S: target can execute system call
                   S: sending response (flags = 0x1; val = 0; error = 0)
           T: SUCCESS: mkdir(2) returned 0

           T: terminating
                   S: target has terminated; bye

       If the target process attempts to create a directory with a pathname  that  doesn't  start
       with  "."  and  doesn't begin with the prefix "/tmp/", then the supervisor spoofs an error
       return (EOPNOTSUPP, "Operation not  supported") for the target's mkdir(2) call  (which  is
       not executed):

           $ ./seccomp_unotify /xxx
           T: PID = 23178

           T: about to mkdir("/xxx")
                   S: got notification (ID 0xe7dc095d1c524e80) for PID 23178
                   S: spoofing error response (Operation not supported)
                   S: sending response (flags = 0; val = 0; error = -95)
           T: ERROR: mkdir(2): Operation not supported

           T: terminating
                   S: target has terminated; bye

       In  the  next example, the target process attempts to create a directory with the pathname
       /tmp/nosuchdir/b.  Upon receiving the notification, the supervisor attempts to create that
       directory,  but  the  mkdir(2)  call  fails  because the directory /tmp/nosuchdir does not
       exist.  Consequently, the supervisor spoofs an error return that passes the error that  it
       received back to the target process's mkdir(2) call.

           $ ./seccomp_unotify /tmp/nosuchdir/b
           T: PID = 23199

           T: about to mkdir("/tmp/nosuchdir/b")
                   S: got notification (ID 0x8744454293506046) for PID 23199
                   S: executing: mkdir("/tmp/nosuchdir/b", 0700)
                   S: failure! (errno = 2; No such file or directory)
                   S: sending response (flags = 0; val = 0; error = -2)
           T: ERROR: mkdir(2): No such file or directory

           T: terminating
                   S: target has terminated; bye

       If  the  supervisor  receives  a  notification  and sees that the argument of the target's
       mkdir(2) is the string "/bye", then  (as  well  as  spoofing  an  EOPNOTSUPP  error),  the
       supervisor  terminates.  If the target process subsequently executes another mkdir(2) that
       triggers its seccomp filter to return the SECCOMP_RET_USER_NOTIF action  value,  then  the
       kernel  causes  the  target process's system call to fail with the error ENOSYS ("Function
       not implemented").  This is demonstrated by the following example:

           $ ./seccomp_unotify /bye /tmp/y
           T: PID = 23185

           T: about to mkdir("/bye")
                   S: got notification (ID 0xa81236b1d2f7b0f4) for PID 23185
                   S: spoofing error response (Operation not supported)
                   S: sending response (flags = 0; val = 0; error = -95)
                   S: terminating **********
           T: ERROR: mkdir(2): Operation not supported

           T: about to mkdir("/tmp/y")
           T: ERROR: mkdir(2): Function not implemented

           T: terminating

   Program source
       #define _GNU_SOURCE
       #include <err.h>
       #include <errno.h>
       #include <fcntl.h>
       #include <limits.h>
       #include <linux/audit.h>
       #include <linux/filter.h>
       #include <linux/seccomp.h>
       #include <signal.h>
       #include <stdbool.h>
       #include <stddef.h>
       #include <stdint.h>
       #include <stdio.h>
       #include <stdlib.h>
       #include <string.h>
       #include <sys/ioctl.h>
       #include <sys/prctl.h>
       #include <sys/socket.h>
       #include <sys/stat.h>
       #include <sys/syscall.h>
       #include <sys/types.h>
       #include <sys/un.h>
       #include <unistd.h>

       #define ARRAY_SIZE(arr)  (sizeof(arr) / sizeof((arr)[0]))

       /* Send the file descriptor 'fd' over the connected UNIX domain socket
          'sockfd'. Returns 0 on success, or -1 on error. */

       static int
       sendfd(int sockfd, int fd)
       {
           int             data;
           struct iovec    iov;
           struct msghdr   msgh;
           struct cmsghdr  *cmsgp;

           /* Allocate a char array of suitable size to hold the ancillary data.
              However, since this buffer is in reality a 'struct cmsghdr', use a
              union to ensure that it is suitably aligned. */
           union {
               char   buf[CMSG_SPACE(sizeof(int))];
                               /* Space large enough to hold an 'int' */
               struct cmsghdr align;
           } controlMsg;

           /* The 'msg_name' field can be used to specify the address of the
              destination socket when sending a datagram. However, we do not
              need to use this field because 'sockfd' is a connected socket. */

           msgh.msg_name = NULL;
           msgh.msg_namelen = 0;

           /* On Linux, we must transmit at least one byte of real data in
              order to send ancillary data. We transmit an arbitrary integer
              whose value is ignored by recvfd(). */

           msgh.msg_iov = &iov;
           msgh.msg_iovlen = 1;
           iov.iov_base = &data;
           iov.iov_len = sizeof(int);
           data = 12345;

           /* Set 'msghdr' fields that describe ancillary data */

           msgh.msg_control = controlMsg.buf;
           msgh.msg_controllen = sizeof(controlMsg.buf);

           /* Set up ancillary data describing file descriptor to send */

           cmsgp = CMSG_FIRSTHDR(&msgh);
           cmsgp->cmsg_level = SOL_SOCKET;
           cmsgp->cmsg_type = SCM_RIGHTS;
           cmsgp->cmsg_len = CMSG_LEN(sizeof(int));
           memcpy(CMSG_DATA(cmsgp), &fd, sizeof(int));

           /* Send real plus ancillary data */

           if (sendmsg(sockfd, &msgh, 0) == -1)
               return -1;

           return 0;
       }

       /* Receive a file descriptor on a connected UNIX domain socket. Returns
          the received file descriptor on success, or -1 on error. */

       static int
       recvfd(int sockfd)
       {
           int            data, fd;
           ssize_t        nr;
           struct iovec   iov;
           struct msghdr  msgh;

           /* Allocate a char buffer for the ancillary data. See the comments
              in sendfd() */
           union {
               char   buf[CMSG_SPACE(sizeof(int))];
               struct cmsghdr align;
           } controlMsg;
           struct cmsghdr *cmsgp;

           /* The 'msg_name' field can be used to obtain the address of the
              sending socket. However, we do not need this information. */

           msgh.msg_name = NULL;
           msgh.msg_namelen = 0;

           /* Specify buffer for receiving real data */

           msgh.msg_iov = &iov;
           msgh.msg_iovlen = 1;
           iov.iov_base = &data;       /* Real data is an 'int' */
           iov.iov_len = sizeof(int);

           /* Set 'msghdr' fields that describe ancillary data */

           msgh.msg_control = controlMsg.buf;
           msgh.msg_controllen = sizeof(controlMsg.buf);

           /* Receive real plus ancillary data; real data is ignored */

           nr = recvmsg(sockfd, &msgh, 0);
           if (nr == -1)
               return -1;

           cmsgp = CMSG_FIRSTHDR(&msgh);

           /* Check the validity of the 'cmsghdr' */

           if (cmsgp == NULL
               || cmsgp->cmsg_len != CMSG_LEN(sizeof(int))
               || cmsgp->cmsg_level != SOL_SOCKET
               || cmsgp->cmsg_type != SCM_RIGHTS)
           {
               errno = EINVAL;
               return -1;
           }

           /* Return the received file descriptor to our caller */

           memcpy(&fd, CMSG_DATA(cmsgp), sizeof(int));
           return fd;
       }

       static void
       sigchldHandler(int sig)
       {
           char msg[] = "\tS: target has terminated; bye\n";

           write(STDOUT_FILENO, msg, sizeof(msg) - 1);
           _exit(EXIT_SUCCESS);
       }

       static int
       seccomp(unsigned int operation, unsigned int flags, void *args)
       {
           return syscall(SYS_seccomp, operation, flags, args);
       }

       /* The following is the x86-64-specific BPF boilerplate code for checking
          that the BPF program is running on the right architecture + ABI. At
          completion of these instructions, the accumulator contains the system
          call number. */

       /* For the x32 ABI, all system call numbers have bit 30 set */

       #define X32_SYSCALL_BIT         0x40000000

       #define X86_64_CHECK_ARCH_AND_LOAD_SYSCALL_NR \
               BPF_STMT(BPF_LD | BPF_W | BPF_ABS, \
                        (offsetof(struct seccomp_data, arch))), \
               BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 0, 2), \
               BPF_STMT(BPF_LD | BPF_W | BPF_ABS, \
                        (offsetof(struct seccomp_data, nr))), \
               BPF_JUMP(BPF_JMP | BPF_JGE | BPF_K, X32_SYSCALL_BIT, 0, 1), \
               BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS)

       /* installNotifyFilter() installs a seccomp filter that generates
          user-space notifications (SECCOMP_RET_USER_NOTIF) when the process
          calls mkdir(2); the filter allows all other system calls.

          The function return value is a file descriptor from which the
          user-space notifications can be fetched. */

       static int
       installNotifyFilter(void)
       {
           int notifyFd;

           struct sock_filter filter[] = {
               X86_64_CHECK_ARCH_AND_LOAD_SYSCALL_NR,

               /* mkdir() triggers notification to user-space supervisor */

               BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_mkdir, 0, 1),
               BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_USER_NOTIF),

               /* Every other system call is allowed */

               BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
           };

           struct sock_fprog prog = {
               .len = ARRAY_SIZE(filter),
               .filter = filter,
           };

           /* Install the filter with the SECCOMP_FILTER_FLAG_NEW_LISTENER flag;
              as a result, seccomp() returns a notification file descriptor. */

           notifyFd = seccomp(SECCOMP_SET_MODE_FILTER,
                              SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
           if (notifyFd == -1)
               err(EXIT_FAILURE, "seccomp-install-notify-filter");

           return notifyFd;
       }

       /* Close a pair of sockets created by socketpair() */

       static void
       closeSocketPair(int sockPair[2])
       {
           if (close(sockPair[0]) == -1)
               err(EXIT_FAILURE, "closeSocketPair-close-0");
           if (close(sockPair[1]) == -1)
               err(EXIT_FAILURE, "closeSocketPair-close-1");
       }

       /* Implementation of the target process; create a child process that:

          (1) installs a seccomp filter with the
              SECCOMP_FILTER_FLAG_NEW_LISTENER flag;
          (2) writes the seccomp notification file descriptor returned from
              the previous step onto the UNIX domain socket, 'sockPair[0]';
          (3) calls mkdir(2) for each element of 'argv'.

          The function return value in the parent is the PID of the child
          process; the child does not return from this function. */

       static pid_t
       targetProcess(int sockPair[2], char *argv[])
       {
           int    notifyFd, s;
           pid_t  targetPid;

           targetPid = fork();

           if (targetPid == -1)
               err(EXIT_FAILURE, "fork");

           if (targetPid > 0)          /* In parent, return PID of child */
               return targetPid;

           /* Child falls through to here */

           printf("T: PID = %ld\n", (long) getpid());

           /* Install seccomp filter(s) */

           if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
               err(EXIT_FAILURE, "prctl");

           notifyFd = installNotifyFilter();

           /* Pass the notification file descriptor to the tracing process over
              a UNIX domain socket */

           if (sendfd(sockPair[0], notifyFd) == -1)
               err(EXIT_FAILURE, "sendfd");

           /* Notification and socket FDs are no longer needed in target */

           if (close(notifyFd) == -1)
               err(EXIT_FAILURE, "close-target-notify-fd");

           closeSocketPair(sockPair);

           /* Perform a mkdir() call for each of the command-line arguments */

           for (char **ap = argv; *ap != NULL; ap++) {
               printf("\nT: about to mkdir(\"%s\")\n", *ap);

               s = mkdir(*ap, 0700);
               if (s == -1)
                   perror("T: ERROR: mkdir(2)");
               else
                   printf("T: SUCCESS: mkdir(2) returned %d\n", s);
           }

           printf("\nT: terminating\n");
           exit(EXIT_SUCCESS);
       }

       /* Check that the notification ID provided by a SECCOMP_IOCTL_NOTIF_RECV
          operation is still valid. It will no longer be valid if the target
          process has terminated or is no longer blocked in the system call that
          generated the notification (because it was interrupted by a signal).

          This operation can be used when doing such things as accessing
          /proc/PID files in the target process in order to avoid TOCTOU race
          conditions where the PID that is returned by SECCOMP_IOCTL_NOTIF_RECV
          terminates and is reused by another process. */

       static bool
       cookieIsValid(int notifyFd, uint64_t id)
       {
           return ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_ID_VALID, &id) == 0;
       }

       /* Access the memory of the target process in order to fetch the
          pathname referred to by the system call argument 'argNum' in
          'req->data.args[]'.  The pathname is returned in 'path',
          a buffer of 'len' bytes allocated by the caller.

          Returns true if the pathname is successfully fetched, and false
          otherwise. For possible causes of failure, see the comments below. */

       static bool
       getTargetPathname(struct seccomp_notif *req, int notifyFd,
                         int argNum, char *path, size_t len)
       {
           int      procMemFd;
           char     procMemPath[PATH_MAX];
           ssize_t  nread;

           snprintf(procMemPath, sizeof(procMemPath), "/proc/%d/mem", req->pid);

           procMemFd = open(procMemPath, O_RDONLY | O_CLOEXEC);
           if (procMemFd == -1)
               return false;

           /* Check that the process whose info we are accessing is still alive
              and blocked in the system call that caused the notification.
              If the SECCOMP_IOCTL_NOTIF_ID_VALID operation (performed in
              cookieIsValid()) succeeded, we know that the /proc/PID/mem file
              descriptor that we opened corresponded to the process for which we
              received a notification. If that process subsequently terminates,
              then read() on that file descriptor will return 0 (EOF). */

           if (!cookieIsValid(notifyFd, req->id)) {
               close(procMemFd);
               return false;
           }

           /* Read bytes at the location containing the pathname argument */

           nread = pread(procMemFd, path, len, req->data.args[argNum]);

           close(procMemFd);

           if (nread <= 0)
               return false;

           /* Once again check that the notification ID is still valid. The
              case we are particularly concerned about here is that just
              before we fetched the pathname, the target's blocked system
              call was interrupted by a signal handler, and after the handler
              returned, the target carried on execution (past the interrupted
              system call). In that case, we have no guarantees about what we
              are reading, since the target's memory may have been arbitrarily
              changed by subsequent operations. */

           if (!cookieIsValid(notifyFd, req->id)) {
               perror("\tS: notification ID check failed!!!");
               return false;
           }

           /* Even if the target's system call was not interrupted by a signal,
              we have no guarantees about what was in the memory of the target
              process. (The memory may have been modified by another thread, or
              even by an external attacking process.) We therefore treat the
              buffer returned by pread() as untrusted input. The buffer should
              contain a terminating null byte; if not, then we will trigger an
              error for the target process. */

           if (strnlen(path, nread) < nread)
               return true;

           return false;
       }

       /* Allocate buffers for the seccomp user-space notification request and
          response structures. It is the caller's responsibility to free the
          buffers returned via 'req' and 'resp'. */

       static void
       allocSeccompNotifBuffers(struct seccomp_notif **req,
                                struct seccomp_notif_resp **resp,
                                struct seccomp_notif_sizes *sizes)
       {
           size_t  resp_size;

           /* Discover the sizes of the structures that are used to receive
              notifications and send notification responses, and allocate
              buffers of those sizes. */

           if (seccomp(SECCOMP_GET_NOTIF_SIZES, 0, sizes) == -1)
               err(EXIT_FAILURE, "seccomp-SECCOMP_GET_NOTIF_SIZES");

           *req = malloc(sizes->seccomp_notif);
           if (*req == NULL)
               err(EXIT_FAILURE, "malloc-seccomp_notif");

           /* When allocating the response buffer, we must allow for the fact
              that the user-space binary may have been built with user-space
              headers where 'struct seccomp_notif_resp' is bigger than the
              response buffer expected by the (older) kernel. Therefore, we
              allocate a buffer that is the maximum of the two sizes. This
              ensures that if the supervisor places bytes into the response
              structure that are past the response size that the kernel expects,
              then the supervisor is not touching an invalid memory location. */

           resp_size = sizes->seccomp_notif_resp;
           if (sizeof(struct seccomp_notif_resp) > resp_size)
               resp_size = sizeof(struct seccomp_notif_resp);

           *resp = malloc(resp_size);
           if (*resp == NULL)
               err(EXIT_FAILURE, "malloc-seccomp_notif_resp");

       }

       /* Handle notifications that arrive via the SECCOMP_RET_USER_NOTIF file
          descriptor, 'notifyFd'. */

       static void
       handleNotifications(int notifyFd)
       {
           bool                        pathOK;
           char                        path[PATH_MAX];
           struct seccomp_notif        *req;
           struct seccomp_notif_resp   *resp;
           struct seccomp_notif_sizes  sizes;

           allocSeccompNotifBuffers(&req, &resp, &sizes);

           /* Loop handling notifications */

           for (;;) {

               /* Wait for next notification, returning info in '*req' */

               memset(req, 0, sizes.seccomp_notif);
               if (ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_RECV, req) == -1) {
                   if (errno == EINTR)
                       continue;
                   err(EXIT_FAILURE, "\tS: ioctl-SECCOMP_IOCTL_NOTIF_RECV");
               }

               printf("\tS: got notification (ID %#llx) for PID %d\n",
                      req->id, req->pid);

               /* The only system call that can generate a notification event
                  is mkdir(2). Nevertheless, we check that the notified system
                  call is indeed mkdir() as kind of future-proofing of this
                  code in case the seccomp filter is later modified to
                  generate notifications for other system calls. */

               if (req->data.nr != SYS_mkdir) {
                   printf("\tS: notification contained unexpected "
                          "system call number; bye!!!\n");
                   exit(EXIT_FAILURE);
               }

               pathOK = getTargetPathname(req, notifyFd, 0, path, sizeof(path));

               /* Prepopulate some fields of the response */

               resp->id = req->id;     /* Response includes notification ID */
               resp->flags = 0;
               resp->val = 0;

               /* If getTargetPathname() failed, trigger an EINVAL error
                  response (sending this response may yield an error if the
                  failure occurred because the notification ID was no longer
                  valid); if the directory is in /tmp, then create it on behalf
                  of the supervisor; if the pathname starts with '.', tell the
                  kernel to let the target process execute the mkdir();
                  otherwise, give an error for a directory pathname in any other
                  location. */

               if (!pathOK) {
                   resp->error = -EINVAL;
                   printf("\tS: spoofing error for invalid pathname (%s)\n",
                          strerror(-resp->error));
               } else if (strncmp(path, "/tmp/", strlen("/tmp/")) == 0) {
                   printf("\tS: executing: mkdir(\"%s\", %#llo)\n",
                          path, req->data.args[1]);

                   if (mkdir(path, req->data.args[1]) == 0) {
                       resp->error = 0;            /* "Success" */
                       resp->val = strlen(path);   /* Used as return value of
                                                      mkdir() in target */
                       printf("\tS: success! spoofed return = %lld\n",
                              resp->val);
                   } else {

                       /* If mkdir() failed in the supervisor, pass the error
                          back to the target */

                       resp->error = -errno;
                       printf("\tS: failure! (errno = %d; %s)\n", errno,
                              strerror(errno));
                   }
               } else if (strncmp(path, "./", strlen("./")) == 0) {
                   resp->error = resp->val = 0;
                   resp->flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;
                   printf("\tS: target can execute system call\n");
               } else {
                   resp->error = -EOPNOTSUPP;
                   printf("\tS: spoofing error response (%s)\n",
                          strerror(-resp->error));
               }

               /* Send a response to the notification */

               printf("\tS: sending response "
                      "(flags = %#x; val = %lld; error = %d)\n",
                      resp->flags, resp->val, resp->error);

               if (ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_SEND, resp) == -1) {
                   if (errno == ENOENT)
                       printf("\tS: response failed with ENOENT; "
                              "perhaps target process's syscall was "
                              "interrupted by a signal?\n");
                   else
                       perror("ioctl-SECCOMP_IOCTL_NOTIF_SEND");
               }

               /* If the pathname is just "/bye", then the supervisor breaks out
                  of the loop and terminates. This allows us to see what happens
                  if the target process makes further calls to mkdir(2). */

               if (strcmp(path, "/bye") == 0)
                   break;
           }

           free(req);
           free(resp);
           printf("\tS: terminating **********\n");
           exit(EXIT_FAILURE);
       }

       /* Implementation of the supervisor process:

          (1) obtains the notification file descriptor from 'sockPair[1]'
          (2) handles notifications that arrive on that file descriptor. */

       static void
       supervisor(int sockPair[2])
       {
           int notifyFd;

           notifyFd = recvfd(sockPair[1]);

           if (notifyFd == -1)
               err(EXIT_FAILURE, "recvfd");

           closeSocketPair(sockPair);  /* We no longer need the socket pair */

           handleNotifications(notifyFd);
       }

       int
       main(int argc, char *argv[])
       {
           int               sockPair[2];
           struct sigaction  sa;

           setbuf(stdout, NULL);

           if (argc < 2) {
               fprintf(stderr, "At least one pathname argument is required\n");
               exit(EXIT_FAILURE);
           }

           /* Create a UNIX domain socket that is used to pass the seccomp
              notification file descriptor from the target process to the
              supervisor process. */

           if (socketpair(AF_UNIX, SOCK_STREAM, 0, sockPair) == -1)
               err(EXIT_FAILURE, "socketpair");

           /* Create a child process--the "target"--that installs seccomp
              filtering. The target process writes the seccomp notification
              file descriptor onto 'sockPair[0]' and then calls mkdir(2) for
              each directory in the command-line arguments. */

           (void) targetProcess(sockPair, &argv[optind]);

           /* Catch SIGCHLD when the target terminates, so that the
              supervisor can also terminate. */

           sa.sa_handler = sigchldHandler;
           sa.sa_flags = 0;
           sigemptyset(&sa.sa_mask);
           if (sigaction(SIGCHLD, &sa, NULL) == -1)
               err(EXIT_FAILURE, "sigaction");

           supervisor(sockPair);

           exit(EXIT_SUCCESS);
       }

SEE ALSO

       ioctl(2), pidfd_getfd(2), pidfd_open(2), seccomp(2)

       A further example program can be found in the  kernel  source  file  samples/seccomp/user-
       trap.c.