oracular (8) tc-bpf.8.gz

Provided by: iproute2_6.10.0-2ubuntu1_amd64 bug

NAME

       BPF - BPF programmable classifier and actions for ingress/egress queueing disciplines

SYNOPSIS

   eBPF classifier (filter) or action:
       tc  filter  ...  bpf  [  object-file OBJ_FILE ] [ section CLS_NAME ] [ export UDS_FILE ] [
       verbose ] [ direct-action | da ] [ skip_hw | skip_sw ] [ police  POLICE_SPEC  ]  [  action
       ACTION_SPEC ] [ classid CLASSID ]
       tc  action  ...  bpf  [  object-file OBJ_FILE ] [ section CLS_NAME ] [ export UDS_FILE ] [
       verbose ]

   cBPF classifier (filter) or action:
       tc filter ... bpf [ bytecode-file BPF_FILE | bytecode BPF_BYTECODE ] [ police  POLICE_SPEC
       ] [ action ACTION_SPEC ] [ classid CLASSID ]
       tc action ... bpf [ bytecode-file BPF_FILE | bytecode BPF_BYTECODE ]

DESCRIPTION

       Extended  Berkeley  Packet  Filter ( eBPF ) and classic Berkeley Packet Filter (originally
       known as BPF, for better distinction referred to as cBPF here) are  both  available  as  a
       fully  programmable and highly efficient classifier and actions. They both offer a minimal
       instruction set for implementing small programs which can safely be loaded into the kernel
       and  thus  executed  in  a  tiny  virtual machine from kernel space. An in-kernel verifier
       guarantees that a specified program always terminates and neither crashes nor  leaks  data
       from the kernel.

       In  Linux,  it's  generally  considered  that  eBPF  is the successor of cBPF.  The kernel
       internally transforms cBPF expressions into eBPF  expressions  and  executes  the  latter.
       Execution  of  them can be performed in an interpreter or at setup time, they can be just-
       in-time compiled (JIT'ed) to run as native machine code.

       Currently, the eBPF JIT compiler is available for the following architectures:

       *   x86_64 (since Linux 3.18)
       *   arm64 (since Linux 3.18)
       *   s390 (since Linux 4.1)
       *   ppc64 (since Linux 4.8)
       *   sparc64 (since Linux 4.12)
       *   mips64 (since Linux 4.13)
       *   arm32 (since Linux 4.14)
       *   x86_32 (since Linux 4.18)

       Whereas the following architectures have cBPF, but  did  not  (yet)  switch  to  eBPF  JIT
       support:

       *   ppc32
       *   sparc32
       *   mips32

       eBPF's  instruction  set has similar underlying principles as the cBPF instruction set, it
       however is  modelled  closer  to  the  underlying  architecture  to  better  mimic  native
       instruction  sets with the aim to achieve a better run-time performance. It is designed to
       be JIT'ed with a one to one mapping, which can also open up the possibility for  compilers
       to  generate  optimized  eBPF code through an eBPF backend that performs almost as fast as
       natively compiled code. Given that LLVM provides such an eBPF backend, eBPF  programs  can
       therefore  easily  be  programmed  in  a  subset  of the C language. Other than that, eBPF
       infrastructure also comes with a construct called "maps". eBPF maps are  key/value  stores
       that  are  shared  between multiple eBPF programs, but also between eBPF programs and user
       space applications.

       For the traffic control subsystem, classifier and actions that can be attached to  ingress
       and  egress qdiscs can be written in eBPF or cBPF. The advantage over other classifier and
       actions is that eBPF/cBPF provides the generic framework, while users can implement  their
       highly specialized use cases efficiently. This means that the classifier or action written
       that way will not suffer from feature bloat, and can therefore  execute  its  task  highly
       efficient.  It  allows for non-linear classification and even merging the action part into
       the classification. Combined with efficient eBPF map data structures, user space can  push
       new  policies  like  classids  into  the  kernel without reloading a classifier, or it can
       gather statistics that are pushed into one map and use another one  for  dynamically  load
       balancing traffic based on the determined load, just to provide a few examples.

PARAMETERS

   object-file
       points  to  an  object  file that has an executable and linkable format (ELF) and contains
       eBPF opcodes and eBPF map definitions. The LLVM compiler infrastructure with clang(1) as a
       C  language  front end is one project that supports emitting eBPF object files that can be
       passed to the eBPF classifier (more details in  the  EXAMPLES  section).  This  option  is
       mandatory when an eBPF classifier or action is to be loaded.

   section
       is  the  name of the ELF section from the object file, where the eBPF classifier or action
       resides. By default the section name for the classifier is called  "classifier",  and  for
       the  action  "action". Given that a single object file can contain multiple classifier and
       actions, the corresponding section name needs to be specified,  if  it  differs  from  the
       defaults.

   export
       points  to a Unix domain socket file. In case the eBPF object file also contains a section
       named "maps" with eBPF map specifications, then the map file descriptors can be handed off
       via  the  Unix domain socket to an eBPF "agent" herding all descriptors after tc lifetime.
       This can be some third party application implementing the IPC counterpart for the  import,
       that  uses  them  for  calling into bpf(2) system call to read out or update eBPF map data
       from user space, for example, for monitoring purposes or to push down new policies.

   verbose
       if set, it will dump the eBPF verifier output,  even  if  loading  the  eBPF  program  was
       successful. By default, only on error, the verifier log is being emitted to the user.

   direct-action | da
       instructs  eBPF  classifier  to not invoke external TC actions, instead use the TC actions
       return codes (TC_ACT_OK, TC_ACT_SHOT etc.) for classifiers.

   skip_hw | skip_sw
       hardware offload control flags. By default TC will try to offload filters to  hardware  if
       possible.  skip_hw explicitly disables the attempt to offload.  skip_sw forces the offload
       and disables running the eBPF program in the kernel.  If hardware offload is not  possible
       and this flag was set kernel will report an error and filter will not be installed at all.

   police
       is  an  optional  parameter  for  an eBPF/cBPF classifier that specifies a police in tc(1)
       which is attached to the classifier, for example, on an ingress qdisc.

   action
       is an optional parameter for an eBPF/cBPF classifier that specifies a subsequent action in
       tc(1) which is attached to a classifier.

   classid
   flowid
       provides  the  default traffic control class identifier for this eBPF/cBPF classifier. The
       default class identifier can also be overwritten by  the  return  code  of  the  eBPF/cBPF
       program.  A default return code of -1 specifies the here provided default class identifier
       to be used. A return code of the eBPF/cBPF program of 0 implies that no match took  place,
       and  a return code other than these two will override the default classid. This allows for
       efficient, non-linear classification with only a single eBPF/cBPF program  as  opposed  to
       having  multiple  individual  programs  for  various class identifiers which would need to
       reparse packet contents.

   bytecode
       is being used for loading cBPF classifier and actions only. The cBPF bytecode is  directly
       passed  as a text string in the form of 's,c t f k,c t f k,c t f k,...'  , where s denotes
       the number of subsequent 4-tuples. One such 4-tuple consists of c t f k decimals, where  c
       represents  the cBPF opcode, t the jump true offset target, f the jump false offset target
       and k the immediate constant/literal. There are various tools that generate code  in  this
       loadable  format,  for example, bpf_asm that ships with the Linux kernel source tree under
       tools/net/ , so it is certainly not expected  to  hack  this  by  hand.  The  bytecode  or
       bytecode-file option is mandatory when a cBPF classifier or action is to be loaded.

   bytecode-file
       also being used to load a cBPF classifier or action. It's effectively the same as bytecode
       only that the cBPF bytecode is not passed directly via command line, but rather resides in
       a text file.

EXAMPLES

   eBPF TOOLING
       A  full  blown  example  including eBPF agent code can be found inside the iproute2 source
       package under: examples/bpf/

       As prerequisites, the kernel needs to have the eBPF system call namely bpf(2) enabled  and
       ships with cls_bpf and act_bpf kernel modules for the traffic control subsystem. To enable
       eBPF/eBPF JIT support, depending which of the two the given architecture supports:

           echo 1 > /proc/sys/net/core/bpf_jit_enable

       A given restricted C file can be compiled via LLVM as:

           clang -O2 -emit-llvm -c bpf.c -o - | llc -march=bpf -filetype=obj -o bpf.o

       The compiler invocation might still simplify in future, so for now, it's  quite  handy  to
       alias this construct in one way or another, for example:

           __bcc() {
                   clang -O2 -emit-llvm -c $1 -o - | \
                   llc -march=bpf -filetype=obj -o "`basename $1 .c`.o"
           }

           alias bcc=__bcc

       A minimal, stand-alone unit, which matches on all traffic with the default classid (return
       code of -1) looks like:

           #include <linux/bpf.h>

           #ifndef __section
           # define __section(x)  __attribute__((section(x), used))
           #endif

           __section("classifier") int cls_main(struct __sk_buff *skb)
           {
                   return -1;
           }

           char __license[] __section("license") = "GPL";

       More examples can be found further below in subsection eBPF PROGRAMMING as focus here will
       be on tooling.

       There  can be various other sections, for example, also for actions.  Thus, an object file
       in eBPF can contain multiple entrance points.  Always a specific entrance point,  however,
       must  be  specified  when  configuring with tc. A license must be part of the restricted C
       code and the license string syntax is the same as with Linux kernel modules.   The  kernel
       reserves  its  right  that  some eBPF helper functions can be restricted to GPL compatible
       licenses only, and thus may reject a program from loading into  the  kernel  when  such  a
       license mismatch occurs.

       The  resulting  object  file  from  the compilation can be inspected with the usual set of
       tools that also operate on normal object files, for example objdump(1) for inspecting  ELF
       section headers:

           objdump -h bpf.o
           [...]
           3 classifier    000007f8  0000000000000000  0000000000000000  00000040  2**3
                           CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
           4 action-mark   00000088  0000000000000000  0000000000000000  00000838  2**3
                           CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
           5 action-rand   00000098  0000000000000000  0000000000000000  000008c0  2**3
                           CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
           6 maps          00000030  0000000000000000  0000000000000000  00000958  2**2
                           CONTENTS, ALLOC, LOAD, DATA
           7 license       00000004  0000000000000000  0000000000000000  00000988  2**0
                           CONTENTS, ALLOC, LOAD, DATA
           [...]

       Adding  an  eBPF  classifier from an object file that contains a classifier in the default
       ELF section is trivial (note that instead of "object-file" also shortcuts  such  as  "obj"
       can be used):

           bcc bpf.c
           tc filter add dev em1 parent 1: bpf obj bpf.o flowid 1:1

       In  case the classifier resides in ELF section "mycls", then that same command needs to be
       invoked as:

           tc filter add dev em1 parent 1: bpf obj bpf.o sec mycls flowid 1:1

       Dumping the classifier configuration will tell the location of the  classifier,  in  other
       words that it's from object file "bpf.o" under section "mycls":

           tc filter show dev em1
           filter parent 1: protocol all pref 49152 bpf
           filter parent 1: protocol all pref 49152 bpf handle 0x1 flowid 1:1 bpf.o:[mycls]

       The same program can also be installed on ingress qdisc side as opposed to egress ...

           tc qdisc add dev em1 handle ffff: ingress
           tc filter add dev em1 parent ffff: bpf obj bpf.o sec mycls flowid ffff:1

       ... and again dumped from there:

           tc filter show dev em1 parent ffff:
           filter protocol all pref 49152 bpf
           filter protocol all pref 49152 bpf handle 0x1 flowid ffff:1 bpf.o:[mycls]

       Attaching  a  classifier and action on ingress has the restriction that it doesn't have an
       actual underlying queueing discipline.  What  ingress  can  do  is  to  classify,  mangle,
       redirect  or  drop  packets.  When queueing is required on ingress side, then ingress must
       redirect packets to the ifb device, otherwise policing can be used. Moreover, ingress  can
       be  used  to  have an early drop point of unwanted packets before they hit upper layers of
       the networking stack, perform network accounting with eBPF maps that could be shared  with
       egress, or have an early mangle and/or redirection point to different networking devices.

       Multiple  eBPF  actions  and  classifier  can  be  placed into a single object file within
       various sections. In that case, non-default section names must be provided, which  is  the
       case for both actions in this example:

           tc filter add dev em1 parent 1: bpf obj bpf.o flowid 1:1 \
                                    action bpf obj bpf.o sec action-mark \
                                    action bpf obj bpf.o sec action-rand ok

       The  advantage of this is that the classifier and the two actions can then share eBPF maps
       with each other, if implemented in the programs.

       In order to access eBPF maps from user space beyond tc(8) setup  lifetime,  the  ownership
       can  be  transferred to an eBPF agent via Unix domain sockets. There are two possibilities
       for implementing this:

       1) implementation of an own eBPF agent that takes care  of  setting  up  the  Unix  domain
       socket  and  implementing  the protocol that tc(8) dictates. A code example of this can be
       found inside the iproute2 source package under: examples/bpf/

       2) use tc exec for transferring the eBPF  map  file  descriptors  through  a  Unix  domain
       socket,  and  spawning an application such as sh(1) . This approach's advantage is that tc
       will place the file descriptors into the environment and thus  make  them  available  just
       like  stdin,  stdout, stderr file descriptors, meaning, in case user applications run from
       within this fd-owner shell, they can terminate and restart without losing eBPF  maps  file
       descriptors. Example invocation with the previous classifier and action mixture:

           tc exec bpf imp /tmp/bpf
           tc filter add dev em1 parent 1: bpf obj bpf.o exp /tmp/bpf flowid 1:1 \
                                    action bpf obj bpf.o sec action-mark \
                                    action bpf obj bpf.o sec action-rand ok

       Assuming that eBPF maps are shared with classifier and actions, it's enough to export them
       once, for example, from within the classifier or action command. tc will  setup  all  eBPF
       map file descriptors at the time when the object file is first parsed.

       When  a  shell  has  been  spawned,  the  environment  will  have a couple of eBPF related
       variables. BPF_NUM_MAPS provides the total number of maps that have been transferred  over
       the  Unix  domain  socket.  BPF_MAP<X>'s  value  is the file descriptor number that can be
       accessed in eBPF agent applications, in other words, it can directly be used as  the  file
       descriptor  value  for  the  bpf(2)  system call to retrieve or alter eBPF map values. <X>
       denotes the identifier of the eBPF  map.  It  corresponds  to  the  id  member  of  struct
       bpf_elf_map  from the tc eBPF map specification.

       The environment in this example looks as follows:

           sh# env | grep BPF
               BPF_NUM_MAPS=3
               BPF_MAP1=6
               BPF_MAP0=5
               BPF_MAP2=7
           sh# ls -la /proc/self/fd
               [...]
               lrwx------. 1 root root 64 Apr 14 16:46 5 -> anon_inode:bpf-map
               lrwx------. 1 root root 64 Apr 14 16:46 6 -> anon_inode:bpf-map
               lrwx------. 1 root root 64 Apr 14 16:46 7 -> anon_inode:bpf-map
           sh# my_bpf_agent

       eBPF  agents  are  very  useful  in  that  they can prepopulate eBPF maps from user space,
       monitor statistics via maps and based on that feedback, for example, rewrite  classids  in
       eBPF  map  values  during  runtime.  Given  that  eBPF  agents  are  implemented as normal
       applications, they can also dynamically receive traffic  control  policies  from  external
       controllers  and  thus  push  them  down  into  eBPF  maps to dynamically adapt to network
       conditions. Moreover, eBPF maps can also be shared with other  eBPF  program  types  (e.g.
       tracing), thus very powerful combination can therefore be implemented.

   eBPF PROGRAMMING
       eBPF classifier and actions are being implemented in restricted C syntax (in future, there
       could additionally be new language frontends supported).

       The header file linux/bpf.h provides eBPF helper functions that can be called from an eBPF
       program.   This  man page will only provide two minimal, stand-alone examples, have a look
       at examples/bpf from the iproute2 source  package  for  a  fully  fledged  flow  dissector
       example to better demonstrate some of the possibilities with eBPF.

       Supported 32 bit classifier return codes from the C program and their meanings:
           0 , denotes a mismatch
           -1 , denotes the default classid configured from the command line
           else  ,  everything  else  will override the default classid to provide a facility for
           non-linear matching

       Supported  32  bit  action  return  codes  from  the  C  program  and  their  meanings   (
       linux/pkt_cls.h ):
           TC_ACT_OK (0) , will terminate the packet processing pipeline and allows the packet to
           proceed
           TC_ACT_SHOT (2) , will terminate the packet processing pipeline and drops the packet
           TC_ACT_UNSPEC (-1) , will use the default action  configured  from  tc  (similarly  as
           returning -1 from a classifier)
           TC_ACT_PIPE (3) , will iterate to the next action, if available
           TC_ACT_RECLASSIFY  (1)  ,  will  terminate  the  packet  processing pipeline and start
           classification from the beginning
           else , everything else is an unspecified return code

       Both classifier and action return codes are supported in eBPF and cBPF programs.

       To demonstrate restricted C syntax, a minimal toy classifier example  is  provided,  which
       assumes  that  egress  packets, for instance originating from a container, have previously
       been marked in interval [0, 255]. The program keeps statistics on different marks for user
       space and maps the classid to the root qdisc with the marking itself as the minor handle:

           #include <stdint.h>
           #include <asm/types.h>

           #include <linux/bpf.h>
           #include <linux/pkt_sched.h>

           #include "helpers.h"

           struct tuple {
                   long packets;
                   long bytes;
           };

           #define BPF_MAP_ID_STATS        1 /* agent's map identifier */
           #define BPF_MAX_MARK            256

           struct bpf_elf_map __section("maps") map_stats = {
                   .type           =       BPF_MAP_TYPE_ARRAY,
                   .id             =       BPF_MAP_ID_STATS,
                   .size_key       =       sizeof(uint32_t),
                   .size_value     =       sizeof(struct tuple),
                   .max_elem       =       BPF_MAX_MARK,
                   .pinning        =       PIN_GLOBAL_NS,
           };

           static inline void cls_update_stats(const struct __sk_buff *skb,
                                               uint32_t mark)
           {
                   struct tuple *tu;

                   tu = bpf_map_lookup_elem(&map_stats, &mark);
                   if (likely(tu)) {
                           __sync_fetch_and_add(&tu->packets, 1);
                           __sync_fetch_and_add(&tu->bytes, skb->len);
                   }
           }

           __section("cls") int cls_main(struct __sk_buff *skb)
           {
                   uint32_t mark = skb->mark;

                   if (unlikely(mark >= BPF_MAX_MARK))
                           return 0;

                   cls_update_stats(skb, mark);

                   return TC_H_MAKE(TC_H_ROOT, mark);
           }

           char __license[] __section("license") = "GPL";

       Another  small  example  is  a  port redirector which demuxes destination port 80 into the
       interval [8080, 8087] steered by RSS, that can then be  attached  to  ingress  qdisc.  The
       exercise of adding the egress counterpart and IPv6 support is left to the reader:

           #include <asm/types.h>
           #include <asm/byteorder.h>

           #include <linux/bpf.h>
           #include <linux/filter.h>
           #include <linux/in.h>
           #include <linux/if_ether.h>
           #include <linux/ip.h>
           #include <linux/tcp.h>

           #include "helpers.h"

           static inline void set_tcp_dport(struct __sk_buff *skb, int nh_off,
                                            __u16 old_port, __u16 new_port)
           {
                   bpf_l4_csum_replace(skb, nh_off + offsetof(struct tcphdr, check),
                                       old_port, new_port, sizeof(new_port));
                   bpf_skb_store_bytes(skb, nh_off + offsetof(struct tcphdr, dest),
                                       &new_port, sizeof(new_port), 0);
           }

           static inline int lb_do_ipv4(struct __sk_buff *skb, int nh_off)
           {
                   __u16 dport, dport_new = 8080, off;
                   __u8 ip_proto, ip_vl;

                   ip_proto = load_byte(skb, nh_off +
                                        offsetof(struct iphdr, protocol));
                   if (ip_proto != IPPROTO_TCP)
                           return 0;

                   ip_vl = load_byte(skb, nh_off);
                   if (likely(ip_vl == 0x45))
                           nh_off += sizeof(struct iphdr);
                   else
                           nh_off += (ip_vl & 0xF) << 2;

                   dport = load_half(skb, nh_off + offsetof(struct tcphdr, dest));
                   if (dport != 80)
                           return 0;

                   off = skb->queue_mapping & 7;
                   set_tcp_dport(skb, nh_off - BPF_LL_OFF, __constant_htons(80),
                                 __cpu_to_be16(dport_new + off));
                   return -1;
           }

           __section("lb") int lb_main(struct __sk_buff *skb)
           {
                   int ret = 0, nh_off = BPF_LL_OFF + ETH_HLEN;

                   if (likely(skb->protocol == __constant_htons(ETH_P_IP)))
                           ret = lb_do_ipv4(skb, nh_off);

                   return ret;
           }

           char __license[] __section("license") = "GPL";

       The related helper header file helpers.h in both examples was:

           /* Misc helper macros. */
           #define __section(x) __attribute__((section(x), used))
           #define offsetof(x, y) __builtin_offsetof(x, y)
           #define likely(x) __builtin_expect(!!(x), 1)
           #define unlikely(x) __builtin_expect(!!(x), 0)

           /* Object pinning settings */
           #define PIN_NONE       0
           #define PIN_OBJECT_NS  1
           #define PIN_GLOBAL_NS  2

           /* ELF map definition */
           struct bpf_elf_map {
               __u32 type;
               __u32 size_key;
               __u32 size_value;
               __u32 max_elem;
               __u32 flags;
               __u32 id;
               __u32 pinning;
               __u32 inner_id;
               __u32 inner_idx;
           };

           /* Some used BPF function calls. */
           static int (*bpf_skb_store_bytes)(void *ctx, int off, void *from,
                                             int len, int flags) =
                 (void *) BPF_FUNC_skb_store_bytes;
           static int (*bpf_l4_csum_replace)(void *ctx, int off, int from,
                                             int to, int flags) =
                 (void *) BPF_FUNC_l4_csum_replace;
           static void *(*bpf_map_lookup_elem)(void *map, void *key) =
                 (void *) BPF_FUNC_map_lookup_elem;

           /* Some used BPF intrinsics. */
           unsigned long long load_byte(void *skb, unsigned long long off)
               asm ("llvm.bpf.load.byte");
           unsigned long long load_half(void *skb, unsigned long long off)
               asm ("llvm.bpf.load.half");

       Best practice, we recommend to only have a single eBPF classifier loaded in tc and perform
       all necessary matching and mangling from there instead of a list of individual  classifier
       and  separate actions. Just a single classifier tailored for a given use-case will be most
       efficient to run.

   eBPF DEBUGGING
       Both tc filter and action commands for bpf support an optional verbose parameter that  can
       be used to inspect the eBPF verifier log. It is dumped by default in case of an error.

       In  case  the eBPF/cBPF JIT compiler has been enabled, it can also be instructed to emit a
       debug output of the resulting opcode image into the kernel log,  which  can  be  read  via
       dmesg(1) :

           echo 2 > /proc/sys/net/core/bpf_jit_enable

       The  Linux  kernel  source  tree ships additionally under tools/net/ a small helper called
       bpf_jit_disasm that reads out the opcode image dump from the  kernel  log  and  dumps  the
       resulting disassembly:

           bpf_jit_disasm -o

       Other  than  that, the Linux kernel also contains an extensive eBPF/cBPF test suite module
       called test_bpf . Upon ...

           modprobe test_bpf

       ... it performs a diversity of test cases and dumps the results into the kernel  log  that
       can  be  inspected  with  dmesg(1)  .  The results can differ depending on whether the JIT
       compiler is enabled or not. In case of failed test cases, the module will fail to load. In
       such  cases, we urge you to file a bug report to the related JIT authors, Linux kernel and
       networking mailing lists.

   cBPF
       Although we generally recommend switching to implementing eBPF classifier and actions, for
       the sake of completeness, a few words on how to program in cBPF will be lost here.

       Likewise,  the  bpf_jit_enable switch can be enabled as mentioned already. Tooling such as
       bpf_jit_disasm is also independent whether eBPF or cBPF code is being loaded.

       Unlike in eBPF, classifier and action are not implemented in restricted C, but rather in a
       minimal assembler-like language or with the help of other tooling.

       The raw interface with tc takes opcodes directly. For example, the most minimal classifier
       matching on every packet resulting in the default classid of 1:1 looks like:

           tc filter add dev em1 parent 1: bpf bytecode '1,6 0 0 4294967295,' flowid 1:1

       The first decimal of the bytecode sequence denotes the number of  subsequent  4-tuples  of
       cBPF  opcodes.  As  mentioned,  such  a  4-tuple  consists  of  c  t f k decimals, where c
       represents the cBPF opcode, t the jump true offset target, f the jump false offset  target
       and  k the immediate constant/literal. Here, this denotes an unconditional return from the
       program with immediate value of -1.

       Thus, for egress classification, Willem de Bruijn implemented a minimal stand-alone helper
       tool  under  the GNU General Public License version 2 for iptables(8) BPF extension, which
       abuses the libpcap internal classic BPF compiler, his code derived  here  for  usage  with
       tc(8) :

           #include <pcap.h>
           #include <stdio.h>

           int main(int argc, char **argv)
           {
                   struct bpf_program prog;
                   struct bpf_insn *ins;
                   int i, ret, dlt = DLT_RAW;

                   if (argc < 2 || argc > 3)
                           return 1;
                   if (argc == 3) {
                           dlt = pcap_datalink_name_to_val(argv[1]);
                           if (dlt == -1)
                                   return 1;
                   }

                   ret = pcap_compile_nopcap(-1, dlt, &prog, argv[argc - 1],
                                             1, PCAP_NETMASK_UNKNOWN);
                   if (ret)
                           return 1;

                   printf("%d,", prog.bf_len);
                   ins = prog.bf_insns;

                   for (i = 0; i < prog.bf_len - 1; ++ins, ++i)
                           printf("%u %u %u %u,", ins->code,
                                  ins->jt, ins->jf, ins->k);
                   printf("%u %u %u %u",
                          ins->code, ins->jt, ins->jf, ins->k);

                   pcap_freecode(&prog);
                   return 0;
           }

       Given  this  small  helper, any tcpdump(8) filter expression can be abused as a classifier
       where a match will result in the default classid:

           bpftool EN10MB 'tcp[tcpflags] & tcp-syn != 0' > /var/bpf/tcp-syn
           tc filter add dev em1 parent 1: bpf bytecode-file /var/bpf/tcp-syn flowid 1:1

       Basically, such a minimal generator is equivalent to:

           tcpdump -iem1 -ddd 'tcp[tcpflags] & tcp-syn != 0' | tr '\n' ',' > /var/bpf/tcp-syn

       Since libpcap does not support all Linux' specific cBPF extensions in  its  compiler,  the
       Linux  kernel  also  ships  under  tools/net/  a  minimal BPF assembler called bpf_asm for
       providing full control. For detailed syntax and semantics on implementing such programs by
       hand, see references under FURTHER READING .

       Trivial  toy  example  in  bpf_asm  for classifying IPv4/TCP packets, saved in a text file
       called foobar :

           ldh [12]
           jne #0x800, drop
           ldb [23]
           jneq #6, drop
           ret #-1
           drop: ret #0

       Similarly, such a classifier can be loaded as:

           bpf_asm foobar > /var/bpf/tcp-syn
           tc filter add dev em1 parent 1: bpf bytecode-file /var/bpf/tcp-syn flowid 1:1

       For BPF classifiers, the Linux kernel provides additionally under tools/net/ a  small  BPF
       debugger  called  bpf_dbg  ,  which  can  be used to test a classifier against pcap files,
       single-step or add various breakpoints into  the  classifier  program  and  dump  register
       contents during runtime.

       Implementing  an action in classic BPF is rather limited in the sense that packet mangling
       is not supported. Therefore, it's generally  recommended  to  make  the  switch  to  eBPF,
       whenever possible.

FURTHER READING

       Further  and  more  technical details about the BPF architecture can be found in the Linux
       kernel source tree under Documentation/networking/filter.txt .

       Further details on eBPF tc(8) examples can be found in  the  iproute2  source  tree  under
       examples/bpf/ .

SEE ALSO

       tc(8), tc-ematch(8) bpf(2) bpf(4)

AUTHORS

       Manpage written by Daniel Borkmann.

       Please  report  corrections  or  improvements to the Linux kernel networking mailing list:
       <netdev@vger.kernel.org>