
Provided by: iproute2_5.5.0-1ubuntu1_amd64

NAME

       tc-hfsc - Hierarchical Fair Service Curve

HISTORY & INTRODUCTION

       HFSC (Hierarchical Fair Service Curve) is a network packet scheduling algorithm that was first
       presented at SIGCOMM'97. Developed as a part of ALTQ (ALTernative Queuing) on NetBSD, it quickly
       found its way to other BSD systems, and a few years ago became part of the linux kernel. Still,
       it's not the most popular scheduling algorithm - especially if compared to HTB - and it's not well
       documented for the end user. This introduction aims to explain how HFSC works without using too
       much math (although some math will be inevitable).

       In short, HFSC aims to:

           1)  guarantee precise bandwidth and delay allocation for all leaf classes (realtime criterion)

           2)  allocate excess bandwidth  fairly  as  specified  by  class  hierarchy  (linkshare  &  upperlimit
               criterion)

           3)  minimize  any  discrepancy  between  the  service curve and the actual amount of service provided
               during linksharing

       The main "selling" point of HFSC is feature (1), which is achieved by using nonlinear service
       curves (more about what they actually are later). This is particularly useful for VoIP or games,
       where not only a guarantee of consistent bandwidth is important, but also limiting the initial
       delay of a data stream. Note that it matters only for leaf classes (where the actual queues are) -
       thus class hierarchy is ignored in the realtime case.

       Feature (2) is, well, obvious - any algorithm featuring class hierarchy (such as HTB or CBQ)
       strives to achieve that. HFSC does it well, although you might end up with unusual situations if
       you define service curves carelessly - see the CORNER CASES section for examples.

       Feature (3) is mentioned due to the nature of the problem. There may be situations where it's
       either not possible to guarantee the service of all curves at the same time, or impossible to do
       so fairly. Both will be explained later. Note that this is mainly related to interior (aka
       aggregate) classes, as the leaves are already handled by (1). Still, it's perfectly possible to
       create a leaf class without realtime service, and in such a case the caveats will naturally extend
       to leaf classes as well.

ABBREVIATIONS

       For the remaining part of the document, we'll use the following shortcuts:

           RT - realtime
           LS - linkshare
           UL - upperlimit
           SC - service curve

BASICS OF HFSC

       To understand how HFSC works, we must first introduce a service curve. Overall, it's a
       nondecreasing function of time, returning the amount of service (an allowed or allocated amount of
       bandwidth) at some specific point in time. Its purpose should be intuitively obvious: if a class
       was allowed to transfer not less than the amount specified by its service curve, then the service
       curve is not violated.

       Still, we need a more elaborate criterion than just the above (although in the most generic case
       it can be reduced to it). The criterion has to take two things into account:

           •   idling periods

           •   the ability to "look back", so if the service curve is violated during the current active
               period, maybe it isn't violated if we count the excess bandwidth received during earlier
               active period(s)

       Let's define the criterion as follows:

            (1) For each t1, there must exist t0 in set B, such that S(t1-t0) <= w(t0,t1)

       Here 'w' denotes the amount of service received during some time period between t0 and t1. B is
       the set of all points in time where a session becomes active after an idle period (further denoted
       as 'becoming backlogged'). For a clearer picture, imagine two situations:

           a)  our session was active during two periods, with a small time gap between them

           b)  as in (a), but with a larger gap

       Consider (a): if the service received during both periods meets (1), then all is well.  But  what  if  it
       doesn't  do  so  during the 2nd period? If the amount of service received during the 1st period is larger
       than the service curve, then it might compensate for smaller service during the 2nd period and the gap  -
       if the gap is small enough.

       If the gap is larger (b), then it's less likely to happen (unless the excess bandwidth allocated
       during the 1st part was really large). Still, the larger the gap, the less interesting is what
       happened in the past (e.g. 10 minutes ago) - what matters is the current traffic that has just
       started.
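
       To make (1) more tangible, consider a small worked example (with freely chosen numbers): let S be
       a linear 1Mbit service curve, and let the session be active during [0s, 1s], receiving 1.5Mb of
       service - 0.5Mb more than S requires. It then idles, and becomes backlogged again at t0=1.25s, so
       B = {0s, 1.25s}. Checking (1) at t1=1.5s:

            t0=0s:     S(1.5s) = 1.5Mb <= w(0s,1.5s) = 1.5Mb   (holds even with no new service)
            t0=1.25s:  S(0.25s) = 0.25Mb <= w(1.25s,1.5s)      (would require 0.25Mb of new service)

       The earlier excess keeps the criterion satisfied for a while via t0=0s. Had the session stayed
       idle until t0=2s, then at t1=2.25s the choice t0=0s would require 2.25Mb of total service - more
       than the excess could cover - and only the current backlogged period would matter.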

       From HFSC's perspective, it is more interesting to answer the following question: when should we
       start transferring packets, so that the service curve of a class is not violated? Rephrasing it:
       what amount of service X() should a session have received by time t, so that the service curve is
       not violated? The function X() defined below is the basic building block of HFSC, used in the
       eligible, deadline, virtual-time and fit-time curves. Of course, X() is based on equation (1) and
       is defined recursively:

           •   At the beginning of the 1st backlogged period, the function X() is initialized to the
               generic service curve assigned to the class

           •   At any subsequent backlogged period, X() is:
               min(X() from the previous period ; w(t0)+S(t-t0) for t>=t0),
               ... where t0 denotes the beginning of the current backlogged period, and w(t0) the total
               amount of service received by that time.

       HFSC uses either linear, or two-piece linear service curves. In the case of linear or two-piece
       linear convex functions (first slope < second slope), the min() in X's definition reduces to the
       2nd argument. But in the case of two-piece concave functions, the 1st argument might quickly
       become smaller for some t>=t0. Note that for some backlogged period, X() is defined only from that
       period's beginning. We also define X^(-1)(w) as the smallest t>=t0 for which X(t) = w. We have to
       define it this way, as X() is usually not an injection.
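
       As an illustration of the above min() (again with freely chosen numbers), take a concave SC -
       7Mbit for 100ms, then 2Mbit - so S(100ms) = 0.7Mb, growing at 2Mbit afterwards. Assume a class was
       served exactly along X() during its 1st backlogged period, went idle at t=200ms with w = 0.9Mb,
       and became backlogged again at t0=250ms. The 2nd argument, w(t0)+S(t-t0), restarts the steep 7Mbit
       segment at t0, but the old X() keeps following the flat 2Mbit slope: at t=300ms the old X() equals
       1.1Mb, while w(t0)+S(t-t0) = 0.9Mb + 0.35Mb = 1.25Mb. The min() thus selects the old X(), and the
       class can't claim the initial burst again right after a short idle period.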

       The above generic X() can be one of the following:

            E() In the realtime criterion, selects packets eligible for sending. If none are eligible,
                HFSC will use the linkshare criterion. Eligible time 'et' is calculated with reference to
                packets' heads (et = E^(-1)(w)). It's based on the RT service curve, but in the case of a
                convex curve, uses its 2nd slope only.

            D() In the realtime criterion, selects the most suitable packet from the ones chosen by E().
                Deadline time 'dt' corresponds to packets' tails (dt = D^(-1)(w+l), where 'l' is the
                packet's length). Based on the RT service curve.

            V() In the linkshare criterion, arbitrates which packet to send next. Note that V() is a
                function of virtual time - see the LINKSHARING CRITERION section for details. Virtual
                time 'vt' corresponds to packets' heads (vt = V^(-1)(w)). Based on the LS service curve.

            F() An extension to the linkshare criterion, used to limit the speed at which the linkshare
                criterion is allowed to dequeue. Fit-time 'ft' corresponds to packets' heads as well
                (ft = F^(-1)(w)). Based on the UL service curve.

       Be sure to make a clean distinction between a session's RT, LS and UL service curves and the above
       "utility" functions.
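
       In tc(8) terms, a service curve is specified as '[ m1 RATE ] [ d TIME ] m2 RATE' (1st slope, its
       duration, 2nd slope) after the rt, ls or ul keyword, which select the curves E()/D(), V() and F()
       are derived from, respectively. A minimal sketch (device, handles and rates are illustrative):

        tc class add dev eth0 parent 1:0 classid 1:10 hfsc rt m1 4Mbit d 50ms m2 1Mbit ls m2 2Mbit

       This assigns a two-piece RT service curve (4Mbit for 50ms, then 1Mbit) and a linear 2Mbit LS
       service curve to leaf class 1:10.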

REALTIME CRITERION

       The RT criterion ignores class hierarchy and guarantees precise bandwidth and delay allocation. We
       say that a packet is eligible for sending when the current real time is later than the eligible
       time of the packet. Of all eligible packets, the one most suited for sending is the one with the
       shortest deadline time. This sounds simple, but consider the following example:

       Interface 10Mbit, two classes, both with two-piece linear service curves:

           •   1st class - 2Mbit for 100ms, then 7Mbit (convex - 1st slope < 2nd slope)

           •   2nd class - 7Mbit for 100ms, then 2Mbit (concave - 1st slope > 2nd slope)

       Assume for a moment that we only use D() both for finding eligible packets and for choosing the
       most fitting one, so eligible time would be computed as D^(-1)(w) and deadline time would be
       computed as D^(-1)(w+l). If the 2nd class starts sending packets 1 second after the 1st class,
       it's of course impossible to guarantee 14Mbit, as the interface capability is only 10Mbit. The
       only workaround in this scenario is to allow the 1st class to send packets earlier than would
       normally be allowed. That's where a separate E() comes to help. Putting all the math aside (see
       the HFSC paper for details), E() for an RT concave service curve is just like D(), but for an RT
       convex service curve it's constructed using only the RT service curve's 2nd slope (in our example
       7Mbit).

       The effect of such an E() is that packets will be sent earlier, and at the same time D() will be
       updated - so the current deadline time calculated from it will be later. Thus, when the 2nd class
       starts sending packets later, both the 1st and the 2nd class will be eligible, but the 2nd
       session's deadline time will be smaller and its packets will be sent first. When the 1st class
       becomes idle at some later point, the 2nd class will be able to "buffer" up again for a later
       active period of the 1st class.
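
       For reference, the two classes from this example could be created as follows (device and handles
       are illustrative):

        tc qdisc add dev eth0 root handle 1:0 hfsc
        tc class add dev eth0 parent 1:0 classid 1:1 hfsc rt m1 2Mbit d 100ms m2 7Mbit
        tc class add dev eth0 parent 1:0 classid 1:2 hfsc rt m1 7Mbit d 100ms m2 2Mbit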

       A short remark - in a situation where the total amount of bandwidth available on the interface is
       larger than the total allocated realtime parts (imagine a 10Mbit interface, but 1Mbit/2Mbit and
       2Mbit/1Mbit classes), the sole speed of the interface could suffice to guarantee the times.

       An important aspect of the RT criterion is that apart from updating its D() and E(), it also
       updates the V() used by the LS criterion. Generally the RT criterion is secondary to the LS one,
       and is used only if there's a risk of violating precise realtime requirements. Still, the
       "participation" in bandwidth distributed by the LS criterion is there, so V() has to be updated
       along the way. The LS criterion can then properly compensate for a non-ideal fair sharing
       situation caused by RT scheduling. If you use a UL service curve, its F() will be updated as well
       (the UL service curve is an extension to the LS one - see the UPPERLIMIT CRITERION section).

       Anyway - careless specification of LS and RT service curves can lead to potentially undesired
       situations (see CORNER CASES for examples). This wasn't the case in the HFSC paper, where LS and
       RT service curves couldn't be specified separately.

LINKSHARING CRITERION

       The LS criterion's task is to distribute bandwidth according to the specified class hierarchy.
       Contrary to the RT criterion, there are no comparisons between current real time and virtual time
       - the decision is based solely on a direct comparison of the virtual times of all active
       subclasses - the one with the smallest vt wins and gets scheduled. One immediate conclusion from
       this fact is that absolute values don't matter - only the ratios between them (so for example, two
       child classes with simple linear 1Mbit service curves will get the same treatment from the LS
       criterion's perspective as if they were 5Mbit). The other conclusion is that in a perfectly fluid
       system with linear curves, all virtual times across the whole class hierarchy would be equal.

       Why is V() defined in terms of virtual time (and what is it)?

       Imagine an example: class A with two children - A1 and A2, both with, let's say, 10Mbit SCs. If A2
       is idle, A1 receives all the bandwidth of A (and updates its V() in the process). When A2 becomes
       active, A1's virtual time is already far later than A2's. Considering the type of decision made by
       the LS criterion, A1 would not get scheduled for a long time. We can work around this situation by
       adjusting the virtual time of the class becoming active - we do that by bringing such time "up to
       date". HFSC uses the mean of the smallest and the biggest virtual time of the currently active
       children fit for sending. As it's not real time anymore (excluding the trivial case where all
       classes become active at the same time and never become idle), it's called virtual time.
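
       For example (values assumed), if the currently active children fit for sending have virtual times
       ranging from 4s to 9s, a class becoming active will start with vt = (4s + 9s) / 2 = 6.5s - close
       enough to compete for bandwidth immediately, without the unfair advantage of a long-stale, small
       vt.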

       Such an approach has its price though. The problem is analogous to what was presented in the
       previous section, and is caused by the non-linearity of service curves:

       1)  either it's impossible to guarantee service curves and satisfy fairness during certain time periods:

            Recall the example from the RT section, slightly modified (with 3Mbit slopes instead of 2Mbit
            ones):

           •   1st class - 3Mbit for 100ms, then 7Mbit (convex - 1st slope < 2nd slope)

           •   2nd class - 7Mbit for 100ms, then 3Mbit (concave - 1st slope > 2nd slope)

            They sum up nicely to 10Mbit - the interface's capacity. But if we wanted to use only LS for
            guarantees and fairness - it simply wouldn't work. In the LS context, only V() is used for
            deciding which class to schedule. If the 2nd class becomes active when the 1st one is in its
            second slope, fairness will be preserved - the ratio will be 1:1 (7Mbit:7Mbit), but LS itself
            is of course unable to guarantee the absolute values themselves - as it would have to go
            beyond what the interface is capable of.

       2)  and/or it's impossible to guarantee service curves of all classes at the same time [fairly or not]:

            This is similar to the above case, but a bit more subtle. We will consider two subtrees,
            arbitrated by their common parent (the root here):

            R (root) - 10Mbit

              A  - 7Mbit, then 3Mbit
                A1 - 5Mbit, then 2Mbit
                A2 - 2Mbit, then 1Mbit

              B  - 3Mbit, then 7Mbit

            R arbitrates between the left subtree (A) and the right one (B). Assume that A2 and B are
            constantly backlogged, and at some later point A1 becomes backlogged (when all other classes
            are in their 2nd linear part).

            What happens now? B (the choice made by R) will always get 7Mbit, as R is (obviously) only
            concerned with the ratio between its direct children. Thus the A subtree gets 3Mbit, but its
            children would want (at the point when A1 became backlogged) 5Mbit + 1Mbit. That's of course
            impossible, as they can only get 3Mbit due to the interface limitation.

            In the left subtree we have the same situation as previously (a fair split between A1 and A2,
            but violated guarantees), but in the whole tree there's no fairness (B got 7Mbit, while A1
            and A2 have to fit together into 3Mbit), and there are no guarantees for all classes (only B
            got what it wanted). Even if we violated fairness in the A subtree and set A2's service curve
            to 0, A1 would still not get the required bandwidth.
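
       For reference, the hierarchy from case (2) could be built as follows (device and handles are
       illustrative, and so is the assumed 100ms duration of the 1st segments, which the example leaves
       unspecified):

        tc qdisc add dev eth0 root handle 1:0 hfsc
        tc class add dev eth0 parent 1:0 classid 1:1 hfsc ls m1 7Mbit d 100ms m2 3Mbit    # A
        tc class add dev eth0 parent 1:1 classid 1:10 hfsc ls m1 5Mbit d 100ms m2 2Mbit   # A1
        tc class add dev eth0 parent 1:1 classid 1:11 hfsc ls m1 2Mbit d 100ms m2 1Mbit   # A2
        tc class add dev eth0 parent 1:0 classid 1:2 hfsc ls m1 3Mbit d 100ms m2 7Mbit    # B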

UPPERLIMIT CRITERION

       The UL criterion is an extension to the LS one, which permits sending packets only if the current
       real time is later than the fit-time ('ft'). So the modified LS criterion becomes: choose the
       smallest virtual time from all active children, such that fit-time < current real time also holds.
       Fit-time is calculated from F(), which is based on the UL service curve. As you can see, its role
       is somewhat similar to that of E() used in the RT criterion. Also, for obvious reasons, you can't
       specify a UL service curve without an LS one.

       The main purpose of the UL service curve is to limit HFSC to the bandwidth available on the
       upstream router (think of an adsl home modem/router, and a linux server as NAT/firewall/etc. with
       a 100Mbit+ connection to the mentioned modem/router). Typically, it's used to create a single
       class directly under the root, setting a linear UL service curve to the available bandwidth - and
       then creating your class structure from that class downwards. Of course, you're free to add a UL
       service curve (linear or not) to any class with the LS criterion.

       An important property of the UL service curve is that whenever at some point in time a class
       doesn't qualify for linksharing due to its fit-time, the next time it does qualify it will update
       its virtual time to the smallest virtual time of all active children fit for linksharing. This
       way, one of the main things the LS criterion tries to achieve - equality of all virtual times
       across the whole hierarchy - is preserved (in a perfectly fluid system with only linear curves,
       all virtual times would be equal).

       Without that, 'vt' would lag behind the other virtual times, and could cause problems. Consider an
       interface with a capacity of 10Mbit, and the following leaf classes (just in case you're skimming
       this text quickly - this example shows behavior that doesn't happen):

       A - ls 5.0Mbit
       B - ls 2.5Mbit
       C - ls 2.5Mbit, ul 2.5Mbit
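
       In tc terms (device and handles are illustrative), the above corresponds to something like:

        tc qdisc add dev eth0 root handle 1:0 hfsc
        tc class add dev eth0 parent 1:0 classid 1:1 hfsc ls m2 5Mbit                 # A
        tc class add dev eth0 parent 1:0 classid 1:2 hfsc ls m2 2.5Mbit               # B
        tc class add dev eth0 parent 1:0 classid 1:3 hfsc ls m2 2.5Mbit ul m2 2.5Mbit # C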

       If B were idle while A and C were constantly backlogged, A and C would normally (as far as the LS
       criterion is concerned) divide bandwidth in a 2:1 ratio. But due to the UL service curve in place,
       C would get at most 2.5Mbit, and A would get the remaining 7.5Mbit. The longer the backlogged
       period, the more the virtual times of A and C would drift apart. If B became backlogged at some
       later point in time, its virtual time would be set to (A's vt + C's vt)/2, thus blocking A from
       sending any traffic until B's virtual time catches up with A's.

SEPARATE LS / RT SCs

       Another difference from the original HFSC paper is that RT and LS SCs can be specified separately.
       Moreover, leaf classes are allowed to have only an RT SC or only an LS SC. For interior classes,
       only LS SCs make sense: any RT SC will be ignored.
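
       For example, the following leaf classes are all valid (device, handles and rates are
       illustrative):

        tc class add dev eth0 parent 1:1 classid 1:10 hfsc rt m2 1Mbit
        tc class add dev eth0 parent 1:1 classid 1:11 hfsc ls m2 2Mbit
        tc class add dev eth0 parent 1:1 classid 1:12 hfsc rt m2 1Mbit ls m2 2Mbit

       The first gets realtime guarantees only, the second participates in linksharing only, and the
       third in both.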

CORNER CASES

       Separate service curves for the LS and RT criteria can lead to certain traps that come from
       "fighting" between ideal linksharing and enforced realtime guarantees. Those situations didn't
       exist in the original HFSC paper, where specifying separate LS / RT service curves was not
       discussed.

       Consider an interface with a 10Mbit capacity, with the following leaf classes:

       A - ls 5.0Mbit, rt 8Mbit
       B - ls 2.5Mbit
       C - ls 2.5Mbit
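
       In tc terms (device and handles are illustrative):

        tc qdisc add dev eth0 root handle 1:0 hfsc
        tc class add dev eth0 parent 1:0 classid 1:1 hfsc rt m2 8Mbit ls m2 5Mbit   # A
        tc class add dev eth0 parent 1:0 classid 1:2 hfsc ls m2 2.5Mbit             # B
        tc class add dev eth0 parent 1:0 classid 1:3 hfsc ls m2 2.5Mbit             # C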

       Imagine A and C are constantly backlogged. As B is idle, A and C would divide bandwidth in a 2:1
       ratio, considering the LS service curves (so in theory - 6.66 and 3.33). Alas, the RT criterion
       takes priority, so A will get 8Mbit and LS will be able to compensate class C with only 2Mbit -
       this will cause a discrepancy between the virtual times of A and C.

       Assume this situation lasts for a long time with no idle periods, and then suddenly B becomes
       active. B's virtual time will be updated to (A's vt + C's vt)/2, effectively landing in the middle
       between A's and C's virtual times. The effect - B, having no RT guarantees, will be punished and
       will not be allowed to transfer until C's virtual time catches up.

       If the interface had a higher capacity, for example 100Mbit, this example  would  behave  perfectly  fine
       though.

       Let's look a bit closer at the above example - it "cleverly" invalidates one of the basic things
       the LS criterion tries to achieve - equality of all virtual times across the class hierarchy. Leaf
       classes without RT service curves are literally left to their own fate (governed by messed up
       virtual times).

       Also, it doesn't make much sense. Class A will always be guaranteed up to 8Mbit, and this is more
       than any absolute bandwidth that could result from its LS criterion (excluding the trivial case of
       only A being active). If the bandwidth taken by A is smaller than the absolute value from the LS
       criterion, the unused part will be automatically assigned to the other active classes (as A has
       idle periods in such a case). The only "advantage" is that even in the case of low bandwidth on
       average, bursts would be handled at the speed defined by the RT criterion. Still, if extra speed
       is needed (e.g. due to latency), nonlinear service curves should be used in such a case.

       In other words: the LS criterion is meaningless in the above example.

       You can quickly "work around" it by making sure each leaf class has an RT service curve assigned
       (thus guaranteeing all of them will get some bandwidth), but it doesn't make it any more valid.

       Keep in mind - if you use nonlinear curves and the irregularities explained above happen only in
       the first segment, then there's little wrong with "overusing" the RT curve a bit:

       A - ls 5.0Mbit, rt 9Mbit/30ms, then 1Mbit
       B - ls 2.5Mbit
       C - ls 2.5Mbit

       Here, the vt of A will "spike" in the initial period, but then A will never get more than 1Mbit until B &
       C catch up. Then everything will be back to normal.
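
       With tc, class A above might be specified as follows (device and handles are illustrative):

        tc class add dev eth0 parent 1:0 classid 1:1 hfsc rt m1 9Mbit d 30ms m2 1Mbit ls m2 5Mbit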

LINUX AND TIMER RESOLUTION

       In certain situations, the scheduler can throttle itself and set up a so-called watchdog to wake
       up the dequeue function at some later time. In the case of HFSC this happens, for example, when no
       packet is eligible for scheduling, and the UL service curve is used to limit the speed at which
       the LS criterion is allowed to dequeue packets. It's called throttling, and its accuracy depends
       on how the kernel is compiled.

       There are 3 important options in modern kernels, as far as timer resolution goes: 'tickless
       system', 'high resolution timer support' and 'timer frequency'.

       If you have 'tickless system' enabled, then the timer interrupt will trigger as rarely as
       possible, but each time a scheduler throttles itself (or any other part of the kernel needs better
       accuracy), the rate will be increased as needed / possible. The ceiling is either 'timer
       frequency' if 'high resolution timer support' is not available or not compiled in, or it's
       hardware dependent and can go far beyond the highest 'timer frequency' setting available.

       If 'tickless system' is not enabled, the timer will trigger at a fixed rate specified by 'timer
       frequency' - regardless of whether high resolution timers are available or not.

       It is important to keep those settings in mind, as in a scenario like: no tickless, no HR timers,
       frequency set to 100hz - throttling accuracy would be 10ms. It doesn't automatically mean you
       would be limited to ~0.8Mbit/s (assuming packets of ~1KB: 100 dequeue opportunities per second *
       ~8Kbit per packet = ~0.8Mbit/s) - as long as your queues are prepared to cover for the timer
       inaccuracy. Of course, in the case of e.g. locally generated UDP traffic, an appropriate socket
       size is needed as well. A short example to make it more understandable (assume hardcore
       anti-scheduling settings - HZ=100, no HR timers, no tickless):

       tc qdisc add dev eth0 root handle 1:0 hfsc default 1
       tc class add dev eth0 parent 1:0 classid 1:1 hfsc rt m2 10Mbit

       Assuming a packet size of ~1KB and HZ=100, that averages to ~0.8Mbit - anything beyond it (e.g.
       the above example with a specified rate over 10x larger) will require appropriate queuing and will
       cause bursts every ~10ms. As you can imagine, any of HFSC's RT guarantees will be seriously
       invalidated by that. The aforementioned example is mainly important if you deal with old hardware
       - as is particularly popular for home server chores. Even then, you can easily set HZ=1000 and
       have very accurate scheduling for typical adsl speeds.

       Anything modern (apic or even hpet msi based timers + 'tickless system') will provide enough accuracy for
       superb  1Gbit  scheduling.  For  example,  on  one  of my cheap dual-core AMD boards I have the following
       settings:

        tc qdisc add dev eth0 root handle 1:0 hfsc default 1
       tc class add dev eth0 parent 1:0 classid 1:1 hfsc rt m2 300mbit

       And a simple:

       nc -u dst.host.com 54321 </dev/zero
        nc -u -l -p 54321 >/dev/null

       ...will yield the following effects over a period of ~10 seconds (taken from /proc/interrupts):

       319: 42124229   0  HPET_MSI-edge  hpet2 (before)
       319: 42436214   0  HPET_MSI-edge  hpet2 (after 10s.)

       That's roughly 31000 interrupts per second. Now compare it with a HZ=1000 setting. The obvious
       drawback is that cpu load can be rather high with servicing that many timer interrupts. The
       example with a 300Mbit RT service curve on a 1Gbit link is particularly ugly, as it requires a lot
       of throttling with minuscule delays.

       Also note that it's just an example showing the capabilities of  current  hardware.   The  above  example
       (essentially a 300Mbit TBF emulator) is pointless on an internal interface to begin with: you will pretty
       much always want a regular LS service curve there, and in such a scenario HFSC simply doesn't throttle at
       all.

       300Mbit RT service curve (selected columns from mpstat -P ALL 1):

       10:56:43 PM  CPU  %sys     %irq   %soft   %idle
       10:56:44 PM  all  20.10    6.53   34.67   37.19
       10:56:44 PM    0  35.00    0.00   63.00    0.00
       10:56:44 PM    1   4.95   12.87    6.93   73.27

       So, in the rare case you need those speeds with only an RT service curve, or with a UL service
       curve: remember the drawbacks.

CAVEAT: RANDOM ONLINE EXAMPLES

       For reasons unknown (though well guessed), many examples you can google love to overuse the UL
       criterion and stuff it in every node possible. This makes no sense and works against what HFSC
       tries to do (and does pretty damn well). Use UL where it makes sense: on the uppermost node to
       match the upstream router's uplink capacity. Or in special cases, such as testing (limiting a
       certain subtree to some speed), or for customers that must never get more than a certain speed. In
       the last case you can usually achieve the same by just using an RT criterion without LS+UL on leaf
       nodes.

       As for the router case - remember that it's good to differentiate between "traffic to the router"
       (remote console, web config, etc.) and "outgoing traffic", so for example:

       tc qdisc add dev eth0 root handle 1:0 hfsc default 0x8002
       tc class add dev eth0 parent 1:0 classid 1:999 hfsc rt m2 50Mbit
       tc class add dev eth0 parent 1:0 classid 1:1 hfsc ls m2 2Mbit ul m2 2Mbit

       ... with the "internet" tree under 1:1 and the "router itself" under 1:999
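
       The "internet" tree under 1:1 could then be continued with ordinary LS classes, for example (rates
       and handles purely illustrative):

        tc class add dev eth0 parent 1:1 classid 1:10 hfsc ls m2 1500kbit
        tc class add dev eth0 parent 1:1 classid 1:11 hfsc ls m2 500kbit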

LAYER2 ADAPTATION

       Please refer to tc-stab(8).

SEE ALSO

       tc(8), tc-hfsc(8), tc-stab(8)

       Please direct bug reports and patches to: <netdev@vger.kernel.org>

AUTHOR

       Manpage created by Michal Soltys (soltys@ziu.info)