HISTORY & INTRODUCTION

       HFSC - Hierarchical Fair Service Curve was first presented at SIGCOMM'97. Developed as a
       part of ALTQ (ALTernative Queuing) on NetBSD, it quickly found its way to other BSD
       systems, and then a few years ago became part of the Linux kernel. Still, it's not the
       most popular scheduling algorithm - especially compared to HTB - and it's not well
       documented from the end user's perspective. This introduction aims to explain how HFSC
       works without going too deep into the math side of things (although some of it will be
       inevitable).

       In short, HFSC aims to:

           1)  guarantee precise bandwidth and delay allocation for all  leaf  classes  (realtime
               criterion)

           2)  allocate  excess  bandwidth  fairly  as  specified by class hierarchy (linkshare &
               upperlimit criterion)

           3)  minimize any discrepancy between the  service  curve  and  the  actual  amount  of
               service provided during linksharing

       The main "selling" point of HFSC is feature (1), which is achieved by using nonlinear
       service curves (more about what that actually means later). This is particularly useful
       for VoIP or games, where not only a guarantee of consistent bandwidth is important, but
       also the initial delay of a data stream. Note that it matters only for leaf classes
       (where the actual queues are) - thus the class hierarchy is ignored in the realtime
       case.

       Feature (2) is, well, obvious - any algorithm featuring class hierarchy (such as HTB or
       CBQ) strives to achieve that. HFSC does it well, although you might end up in unusual
       situations if you define service curves carelessly - see the CORNER CASES section for
       examples.

       Feature (3) is mentioned due to the nature of the problem. There may be situations where
       it's either not possible to guarantee the service of all curves at the same time, or
       impossible to do so fairly. Both will be explained later. Note that this is mainly
       related to interior (aka aggregate) classes, as the leaves are already handled by (1).
       Still, it's perfectly possible to create a leaf class without realtime service, and in
       such a case the caveats will naturally extend to leaf classes as well.

ABBREVIATIONS

       For the remaining part of the document, we'll use the following shortcuts:

           RT - realtime
           LS - linkshare
           UL - upperlimit
           SC - service curve

BASICS OF HFSC

       To understand how HFSC works, we must first introduce a service curve. Overall, it's a
       nondecreasing function of some time unit, returning the amount of service (an allowed or
       allocated amount of bandwidth) by some specific point in time. Its purpose should be
       intuitively obvious: if a class was allowed to transfer not less than the amount
       specified by its service curve, then the service curve is not violated.

       Still, we need a more elaborate criterion than just the above (although in the most
       generic case it can be reduced to it). The criterion has to take two things into
       account:

            ·   idling periods

            ·   the ability to "look back", so that if the service curve seems violated during
                the current active period, it may turn out not to be, once we count the excess
                bandwidth received during earlier active period(s)

       Let's define the criterion as follows:

            (1) For each t1, there must exist t0 in set B, such that S(t1-t0) <= w(t0,t1)

       Here 'w' denotes the amount of service received during some time period between t0 and
       t1. B is the set of all times where a session becomes active after an idling period
       (further denoted as 'becoming backlogged'). For a clearer picture, imagine two
       situations:

           a)  our session was active during two periods, with a small time gap between them

           b)  as in (a), but with a larger gap

       Consider (a) - if the service received during both periods meets (1), then all is good.
       But what if it doesn't do so during the 2nd period? If the amount of service received
       during the 1st period is bigger than the service curve requires, then it might
       compensate for the smaller service during the 2nd period and the gap - if the gap is
       small enough.

       If the gap is larger (b), then it's less likely to happen (unless the excess bandwidth
       allocated during the 1st part was really large). Still, the larger the gap, the less
       relevant what happened in the past becomes (e.g. 10 minutes ago) - what matters is the
       current traffic that just started.

       From HFSC's perspective, it is more interesting to answer the following question: when
       should we start transferring packets, so that the service curve of a class is not
       violated? Or, rephrasing it: what amount of service X() should a session have received
       by time t, so that the service curve is not violated? The function X() defined below is
       the basic building block of HFSC, used in the eligible, deadline, virtual-time and
       fit-time curves. Of course, X() is based on equation (1) and is defined recursively:

            ·   At the beginning of the 1st backlogged period, X() is initialized to the
                generic service curve assigned to the class

            ·   At any subsequent backlogged period, X() is:
                min(X() from the previous period ; w(t0)+S(t-t0) for t>=t0),
                ... where t0 denotes the beginning of the current backlogged period.

       HFSC uses either linear or two-piece linear service curves. In the case of linear or
       two-piece linear convex functions (first slope < second slope), the min() in X()'s
       definition reduces to the 2nd argument. But in the case of two-piece concave functions,
       the 1st argument might quickly become the smaller one for some t>=t0. Note that for some
       backlogged period, X() is defined only from that period's beginning. We also define
       X^(-1)(w) as the smallest t>=t0 for which X(t) = w. We have to define it this way, as
       X() is usually not an injection.

       The above generic X() can be one of the following:

            E() In the realtime criterion, selects packets eligible for sending. If none are
                eligible, HFSC will use the linkshare criterion. Eligible time 'et' is
                calculated with reference to packets' heads (et = E^(-1)(w)). It's based on the
                RT service curve, but in the case of a convex curve, uses its 2nd slope only.

            D() In the realtime criterion, selects the most suitable packet from the ones
                chosen by E(). Deadline time 'dt' corresponds to packets' tails
                (dt = D^(-1)(w+l), where 'l' is the packet's length). Based on the RT service
                curve.

            V() In the linkshare criterion, arbitrates which packet to send next. Note that V()
                is a function of virtual time - see the LINKSHARING CRITERION section for
                details. Virtual time 'vt' corresponds to packets' heads (vt = V^(-1)(w)).
                Based on the LS service curve.

            F() An extension to the linkshare criterion, used to limit the speed at which the
                linkshare criterion is allowed to dequeue. Fit-time 'ft' corresponds to
                packets' heads as well (ft = F^(-1)(w)). Based on the UL service curve.

       Be sure to make a clean distinction between a session's RT, LS and UL service curves
       and the above "utility" functions.
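
       As a practical note - in tc-hfsc(8) syntax, each of the RT, LS and UL service curves is
       given as an optional first slope m1 lasting for d time units, followed by a mandatory
       second slope m2 (a purely linear curve is just m2). For example (the device and classids
       below are made up for illustration):

        # concave two-piece curve - 40mbit for the first 50ms, then 10mbit
        tc class add dev eth0 parent 1:0 classid 1:10 hfsc ls m1 40mbit d 50ms m2 10mbit

        # purely linear 10mbit curve
        tc class add dev eth0 parent 1:0 classid 1:20 hfsc ls m2 10mbit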

REALTIME CRITERION

       The RT criterion ignores class hierarchy and guarantees precise bandwidth and delay
       allocation. We say that a packet is eligible for sending when the current real time is
       bigger than its eligible time. Of all eligible packets, the one most suited for sending
       is the one with the smallest deadline time. Sounds simple, but consider the following
       example:

       An interface with 10mbit capacity and two classes, both with two-piece linear service
       curves:

           ·   1st class - 2mbit for 100ms, then 7mbit (convex - 1st slope < 2nd slope)

           ·   2nd class - 7mbit for 100ms, then 2mbit (concave - 1st slope > 2nd slope)
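
       In tc terms, those two classes could be created as follows (the device, the classids and
       the qdisc setup line are assumptions made for this example):

        tc qdisc add dev eth0 root handle 1:0 hfsc
        tc class add dev eth0 parent 1:0 classid 1:1 hfsc rt m1 2mbit d 100ms m2 7mbit
        tc class add dev eth0 parent 1:0 classid 1:2 hfsc rt m1 7mbit d 100ms m2 2mbit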

       Assume for a moment that we only use D() both for finding eligible packets and for
       choosing the most fitting one, so that eligible time would be computed as D^(-1)(w) and
       deadline time would be computed as D^(-1)(w+l). If the 2nd class starts sending packets
       1 second after the 1st class, it's of course impossible to guarantee 14mbit, as the
       interface capacity is only 10mbit. The only workaround in this scenario is to allow the
       1st class to send packets earlier than would normally be allowed. That's where the
       separate E() comes to help. Putting all the math aside (see the HFSC paper for details),
       E() for an RT concave service curve is just like D(), but for an RT convex service curve
       it's constructed using only the RT service curve's 2nd slope (in our example - 7mbit).

       The effect of such an E() is that packets will be sent earlier, and at the same time D()
       will be updated - so the current deadline time calculated from it will be bigger. Thus,
       when the 2nd class starts sending packets later, both the 1st and the 2nd class will be
       eligible, but the 2nd session's deadline time will be smaller and its packets will be
       sent first. When the 1st class becomes idle at some later point, the 2nd class will be
       able to "buffer" up again for a later active period of the 1st class.

       A short remark - in a situation where the total amount of bandwidth available on the
       interface is bigger than the total of the allocated realtime parts (imagine a 10mbit
       interface, but with 1mbit/2mbit and 2mbit/1mbit classes), the sole speed of the
       interface could suffice to guarantee the times (here, the two curves sum to at most
       4mbit at any point in time, so the interface can always keep up even without the
       separate E()).

       An important part of the RT criterion is that apart from updating its D() and E(), it
       also updates V() used by the LS criterion. Generally the RT criterion is secondary to
       the LS one, and is used only if there's a risk of violating precise realtime
       requirements. Still, the "participation" in bandwidth distributed by the LS criterion is
       there, so V() has to be updated along the way. The LS criterion can then properly
       compensate for any non-ideal fair sharing situation caused by RT scheduling. If you use
       a UL service curve, its F() will be updated as well (the UL service curve is an
       extension to the LS one - see the UPPERLIMIT CRITERION section).

       Anyway - careless specification of LS and RT service curves can lead to potentially
       undesired situations (see CORNER CASES for examples). This wasn't the case in the HFSC
       paper, where LS and RT service curves couldn't be specified separately.

LINKSHARING CRITERION

       The LS criterion's task is to distribute bandwidth according to the specified class
       hierarchy. Contrary to the RT criterion, there are no comparisons between current real
       time and virtual time - the decision is based solely on a direct comparison of the
       virtual times of all active subclasses - the one with the smallest vt wins and gets
       scheduled. One immediate conclusion from this fact is that absolute values don't matter,
       only the ratios between them (so for example, two child classes with simple linear 1mbit
       service curves will get the same treatment from the LS criterion's perspective as if
       they were 5mbit ones). The other conclusion is that in a perfectly fluid system with
       linear curves, all virtual times across the whole class hierarchy would be equal.

       Why is V() defined in terms of virtual time (and what is virtual time)?

       Imagine an example: class A with two children - A1 and A2, both with, let's say, 10mbit
       SCs. If A2 is idle, A1 receives all the bandwidth of A (and updates its V() in the
       process). When A2 becomes active, A1's virtual time is already far bigger than A2's.
       Considering the type of decision made by the LS criterion, A1 would remain idle for a
       long time. We can work around this situation by adjusting the virtual time of the class
       becoming active - we do that by bringing such time "up to date". HFSC uses the mean of
       the smallest and the biggest virtual time of the currently active children fit for
       sending. As it's not real time anymore (excluding the trivial case where all classes
       become active at the same time and never become idle), it's called virtual time.

       Such an approach has its price though. The problem is analogous to what was presented in
       the previous section and is caused by the non-linearity of service curves:

        1)  either it's impossible to guarantee service curves and satisfy fairness during
            certain time periods:

            Recall the example from the RT section, slightly modified (with 3mbit slopes
            instead of 2mbit ones):

           ·   1st class - 3mbit for 100ms, then 7mbit (convex - 1st slope < 2nd slope)

           ·   2nd class - 7mbit for 100ms, then 3mbit (concave - 1st slope > 2nd slope)

            They sum up nicely to 10mbit - the interface's capacity. But if we wanted to use
            only LS for guarantees and fairness - it simply won't work. In the LS context, only
            V() is used for deciding which class to schedule. If the 2nd class becomes active
            while the 1st one is in its second slope, fairness will be preserved - the ratio
            will be 1:1 (7mbit:7mbit), but LS itself is of course unable to guarantee the
            absolute values themselves - as it would have to go beyond what the interface is
            capable of.

        2)  and/or it's impossible to guarantee the service curves of all classes at the same
            time [fairly or not]:

           This  is  similar  to  the  above  case,  but  a bit more subtle. We will consider two
           subtrees, arbitrated by their common (root here) parent:

           R (root) - 10mbit

           A  - 7mbit, then 3mbit
           A1 - 5mbit, then 2mbit
           A2 - 2mbit, then 1mbit

           B  - 3mbit, then 7mbit

            R arbitrates between the left subtree (A) and the right one (B). Assume that A2
            and B are constantly backlogged, and at some later point A1 becomes backlogged
            (when all the other classes are in their 2nd linear part). A tc sketch of this
            hierarchy is shown right after this list.

            What happens now? B (the choice made by R) will always get 7mbit, as R is
            (obviously) concerned only with the ratio between its direct children. Thus the A
            subtree gets 3mbit, but its children would want (at the point when A1 became
            backlogged) 5mbit + 1mbit. That's of course impossible, as they can only get 3mbit
            due to the interface limitation.

            In the left subtree we have the same situation as previously (a fair split between
            A1 and A2, but violated guarantees), but in the whole tree there's no fairness (B
            got 7mbit, but A1 and A2 have to fit together into 3mbit) and there are no
            guarantees for all classes (only B got what it wanted). Even if we violated
            fairness in the A subtree and set A2's service curve to 0, A1 would still not get
            the required bandwidth.
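
       As promised above, here is a minimal tc sketch of that hierarchy (the device, the
       classids, the qdisc setup line and the 100ms switch point are assumptions; R corresponds
       to the root qdisc, with the 10mbit being the interface's own capacity):

        tc qdisc add dev eth0 root handle 1:0 hfsc
        tc class add dev eth0 parent 1:0  classid 1:10 hfsc ls m1 7mbit d 100ms m2 3mbit   # A
        tc class add dev eth0 parent 1:10 classid 1:11 hfsc ls m1 5mbit d 100ms m2 2mbit   # A1
        tc class add dev eth0 parent 1:10 classid 1:12 hfsc ls m1 2mbit d 100ms m2 1mbit   # A2
        tc class add dev eth0 parent 1:0  classid 1:20 hfsc ls m1 3mbit d 100ms m2 7mbit   # B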

UPPERLIMIT CRITERION

       The UL criterion is an extension to the LS one, permitting the sending of packets only
       if the current real time is bigger than the fit-time ('ft'). So the modified LS
       criterion becomes: choose the smallest virtual time from all active children, such that
       fit-time < current real time also holds. Fit-time is calculated from F(), which is based
       on the UL service curve. As you can see, its role is somewhat similar to that of E()
       used in the RT criterion. Also, for obvious reasons, you can't specify a UL service
       curve without an LS one.

       The main purpose of the UL service curve is to limit HFSC to the bandwidth available on
       the upstream router (think of an ADSL home modem/router, and a Linux server acting as
       NAT/firewall/etc. with a 100mbit+ connection to that modem/router). Typically, it's used
       to create a single class directly under root, setting a linear UL service curve to the
       available bandwidth - and then creating your class structure from that class downwards.
       Of course, you're free to add a UL service curve (linear or not) to any class with the
       LS criterion.
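
       For example, a typical setup for a line with, say, 900kbit of real upstream could start
       like this (the device, the rate and the classids are assumptions):

        tc qdisc add dev eth0 root handle 1:0 hfsc default 10
        tc class add dev eth0 parent 1:0 classid 1:1 hfsc ls m2 900kbit ul m2 900kbit
        # further classes go under 1:1, e.g. the default one:
        tc class add dev eth0 parent 1:1 classid 1:10 hfsc ls m2 450kbit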

       An important property of the UL service curve is that whenever at some point in time a
       class doesn't qualify for linksharing due to its fit-time, the next time it does
       qualify, it will update its virtual time to the smallest virtual time of all active
       children fit for linksharing. This way, one of the main things the LS criterion tries to
       achieve - equality of all virtual times across the whole hierarchy - is preserved (in a
       perfectly fluid system with only linear curves, all virtual times would be equal).

       Without that, 'vt' would lag behind the other virtual times, and could cause problems.
       Consider an interface with a capacity of 10mbit, and the following leaf classes (just in
       case you're skipping this text quickly - this example shows behavior that doesn't
       happen):

       A - ls 5.0mbit
       B - ls 2.5mbit
       C - ls 2.5mbit, ul 2.5mbit

       If B was idle while A and C were constantly backlogged, they would normally (as far as
       the LS criterion is concerned) divide bandwidth in a 2:1 ratio. But due to the UL
       service curve in place, C would get at most 2.5mbit, and A would get the remaining
       7.5mbit. The longer the backlogged period, the further the virtual times of A and C
       would drift apart. If B became backlogged at some later point in time, its virtual time
       would be set to (A's vt + C's vt)/2, thus blocking A from sending any traffic until B's
       virtual time catches up with A's.
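
       Expressed in tc terms, those three leaves could look like this (the device, the qdisc
       setup line and the classids are assumptions):

        tc qdisc add dev eth0 root handle 1:0 hfsc
        tc class add dev eth0 parent 1:0 classid 1:1 hfsc ls m2 5mbit                    # A
        tc class add dev eth0 parent 1:0 classid 1:2 hfsc ls m2 2500kbit                 # B
        tc class add dev eth0 parent 1:0 classid 1:3 hfsc ls m2 2500kbit ul m2 2500kbit  # C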

SEPARATE LS / RT SCs

       Another difference from the original HFSC paper is that RT and LS SCs can be specified
       separately. Moreover, leaf classes are allowed to have only an RT SC or only an LS SC.
       For interior classes, only LS SCs make sense - any RT SC will be ignored.

CORNER CASES

       Separate service curves for the LS and RT criteria can lead to certain traps that come
       from "fighting" between ideal linksharing and enforced realtime guarantees. Those
       situations didn't exist in the original HFSC paper, where specifying separate LS / RT
       service curves was not discussed.

       Consider an interface with a capacity of 10mbit, with the following leaf classes:

       A - ls 5.0mbit, rt 8mbit
       B - ls 2.5mbit
       C - ls 2.5mbit

       Imagine A and C are constantly backlogged. As B is idle, A and C would divide bandwidth
       in a 2:1 ratio per the LS service curves (so in theory - 6.66 and 3.33). Alas, the RT
       criterion takes precedence, so A will get 8mbit and the LS criterion will be able to
       compensate class C for only 2mbit - this will cause a discrepancy between the virtual
       times of A and C.

       Assume this situation lasts for a long time with no idle periods, and suddenly B becomes
       active. B's virtual time will be updated to (A's vt + C's vt)/2, effectively landing in
       the middle between A's and C's virtual times. The effect - B, having no RT guarantees,
       will be punished and will not be allowed to transfer until C's virtual time catches up.

       If the interface had a higher capacity - for example 100mbit - this example would behave
       perfectly fine though.
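
       In tc terms, class A above (with both LS and RT curves specified) would be created along
       these lines (the device and classid are assumptions):

        tc class add dev eth0 parent 1:0 classid 1:1 hfsc ls m2 5mbit rt m2 8mbit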

       Let's look a bit closer at the above example - it "cleverly" invalidates one of the
       basic things the LS criterion tries to achieve - equality of all virtual times across
       the class hierarchy. Leaf classes without RT service curves are literally left to their
       own fate (governed by messed up virtual times).

       Also - it doesn't make much sense. Class A will always be guaranteed up to 8mbit, and
       this is more than any absolute bandwidth that could result from its LS criterion
       (excluding the trivial case of only A being active). If the bandwidth taken by A is
       smaller than the absolute value from the LS criterion, the unused part will be
       automatically assigned to the other active classes (as A has idling periods in such a
       case). The only "advantage" is that even in the case of low average bandwidth, bursts
       would be handled at the speed defined by the RT criterion. Still, if extra speed is
       needed (e.g. due to latency), nonlinear service curves should be used instead.

       In other words - the LS criterion is meaningless in the above example.

       You can quickly "work around" it by making sure each leaf class has an RT service curve
       assigned (thus guaranteeing all of them will get some bandwidth), but it doesn't make it
       any more valid.

       Keep in mind - if you use nonlinear curves and the irregularities explained above happen
       only in the first segment, then there's little wrong with "overusing" the RT curve a
       bit:

       A - ls 5.0mbit, rt 9mbit/30ms, then 1mbit
       B - ls 2.5mbit
       C - ls 2.5mbit

       Here, the vt of A will "spike" in the initial period, but then A will never get more
       than 1mbit until B & C catch up. Then everything will be back to normal.
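
       For reference, the "9mbit/30ms, then 1mbit" RT curve of class A translates to tc-hfsc(8)
       syntax as follows (the device and classid are assumptions):

        tc class add dev eth0 parent 1:0 classid 1:1 hfsc \
            ls m2 5mbit rt m1 9mbit d 30ms m2 1mbit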

LINUX AND TIMER RESOLUTION

       In certain situations the scheduler can throttle itself, and set up a so-called watchdog
       to wake up the dequeue function at some later time. In the case of HFSC this happens,
       for example, when no packet is eligible for scheduling, and the UL service curve is used
       to limit the speed at which the LS criterion is allowed to dequeue packets. It's called
       throttling, and its accuracy depends on how the kernel is compiled.

       There are three important options in modern kernels, as far as timer resolution goes:
       'tickless system', 'high resolution timer support' and 'timer frequency'.

       If you have 'tickless system' enabled, then the timer interrupt will trigger as rarely
       as possible, but each time a scheduler throttles itself (or any other part of the kernel
       needs better accuracy), the rate will be increased as needed / possible. The ceiling is
       either 'timer frequency' if 'high resolution timer support' is not available or not
       compiled in, or it's hardware dependent and can go far beyond the highest 'timer
       frequency' setting available.

       If 'tickless system' is not enabled, the timer will trigger at a fixed rate specified by
       'timer frequency' - regardless of whether high resolution timers are available or not.
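
       Those options correspond to the CONFIG_NO_HZ, CONFIG_HIGH_RES_TIMERS and CONFIG_HZ
       kernel config options, so - assuming your distribution ships the kernel config under
       /boot - you can check a running kernel along these lines:

        grep -E 'CONFIG_NO_HZ|CONFIG_HIGH_RES_TIMERS|CONFIG_HZ=' /boot/config-$(uname -r)

       (If /proc/config.gz is available, zgrep against that file works as well.)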

       It is important to keep those settings in mind, as in a scenario like: no tickless, no
       HR timers, frequency set to 100Hz - throttling accuracy would be 10ms. It doesn't
       automatically mean you would be limited to ~0.8mbit/s (assuming packets at ~1KB) - as
       long as your queues are prepared to cover for the timer inaccuracy. Of course, in the
       case of e.g. locally generated UDP traffic, an appropriate socket buffer size is needed
       as well. A short example to make it more understandable (assume hardcore anti-scheduling
       settings - HZ=100, no HR timers, no tickless):

       tc qdisc add dev eth0 root handle 1:0 hfsc default 1
       tc class add dev eth0 parent 1:0 classid 1:1 hfsc rt m2 10mbit

       Assuming a packet of ~1KB size and HZ=100, that averages to ~0.8mbit (at worst one
       packet per 10ms tick: 1KB * 8 bits * 100 ticks/s = ~0.8mbit/s) - anything beyond it
       (e.g. the above example with a specified rate over 10x bigger) will require appropriate
       queuing and will cause bursts every ~10ms. As you can imagine, any of HFSC's RT
       guarantees will be seriously invalidated by that. The aforementioned example is mainly
       important if you deal with old hardware, as it's particularly popular for home server
       chores. Even then, you can easily set HZ=1000 and have very accurate scheduling for
       typical ADSL speeds.

       Anything modern (APIC or even HPET MSI based timers + 'tickless system') will provide
       enough accuracy for superb 1gbit scheduling. For example, on one of the basically cheap
       dual core AMD boards I have, with the following settings:

        tc qdisc add dev eth0 root handle 1:0 hfsc default 1
        tc class add dev eth0 parent 1:0 classid 1:1 hfsc rt m2 300mbit

        And a simple netcat sender/receiver pair (the first command on the scheduling machine,
        the second on dst.host.com):

       nc -u dst.host.com 54321 </dev/zero
       nc -l -p 54321 >/dev/null

        ...will yield the following effects over a period of ~10 seconds (taken from
        /proc/interrupts):

       319: 42124229   0  HPET_MSI-edge  hpet2 (before)
       319: 42436214   0  HPET_MSI-edge  hpet2 (after 10s.)

       That's roughly 31000 interrupts per second. Now compare it with a HZ=1000 setting. The
       obvious drawback is that CPU load can be rather high from servicing that many timer
       interrupts. The example with a 300mbit RT service curve on a 1gbit link is particularly
       ugly, as it requires a lot of throttling with minuscule delays.

       Also note that it's just an example showing the capability of current hardware. The
       above example (essentially a 300mbit TBF emulator) is pointless on an internal interface
       to begin with - you will pretty much always want a regular LS service curve there, and
       in such a scenario HFSC simply doesn't throttle at all.

       300mbit RT service curve (selected columns from mpstat -P ALL 1):

       10:56:43 PM  CPU  %sys     %irq   %soft   %idle
       10:56:44 PM  all  20.10    6.53   34.67   37.19
       10:56:44 PM    0  35.00    0.00   63.00    0.00
       10:56:44 PM    1   4.95   12.87    6.93   73.27

       So, in the rare case where you need those speeds with only an RT service curve, or with
       a UL service curve - remember the drawbacks.

CAVEAT: RANDOM ONLINE EXAMPLES

       For reasons unknown (though easily guessed), many examples you can google love to
       overuse the UL criterion and stuff it into every node possible. This makes no sense and
       works against what HFSC tries to do (and does pretty damn well). Use UL where it makes
       sense: on the uppermost node, to match the upstream router's uplink capacity. Or in
       special cases, such as testing (limiting a certain subtree to some speed) or customers
       that must never get more than a certain speed. In the last case you can usually achieve
       the same by just using the RT criterion without LS+UL on the leaf nodes.

       As for the router case - remember that it's good to differentiate between "traffic to
       the router" (remote console, web config, etc.) and "outgoing traffic", so for example:

       tc qdisc add dev eth0 root handle 1:0 hfsc default 0x8002
       tc class add dev eth0 parent 1:0 classid 1:999 hfsc rt m2 50mbit
       tc class add dev eth0 parent 1:0 classid 1:1 hfsc ls m2 2mbit ul m2 2mbit

        ... so the "internet" tree goes under 1:1 and the "router itself" under 1:999.
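
       The default class (0x8002 above) is then expected to exist somewhere under the 1:1
       subtree - for example, a hypothetical catch-all leaf:

        tc class add dev eth0 parent 1:1 classid 1:8002 hfsc ls m2 1mbit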

LAYER2 ADAPTATION

       Please refer to tc-stab(8).

SEE ALSO

       tc(8), tc-hfsc(8), tc-stab(8)

       Please direct bug reports and patches to: <net...@vger.kernel.org>

AUTHOR

       Manpage created by Michal Soltys (sol...@ziu.info)