
NAME
BPFTUNE-TCP-CONN - TCP connection bpftune plugin for auto-selection of congestion control algorithm
DESCRIPTION
The TCP connection tuner sets the congestion control algorithm on TCP sockets. Linux uses cubic
by default, and it works well in a wide range of settings.
However, when losses are observed it can underestimate network capacity, and as a result throughput
can drop excessively. In such cases BBR is a good fit, since it continuously estimates the
bottleneck bandwidth and paces sending to match it.
When we have limited information about a remote host - i.e. we have not had more than
REMOTE_HOST_MIN_INSTANCES connections involving it - the only auto-selection performed is to switch
to BBR when the loss rate exceeds 1/(2^DROP_SHIFT) (~1.5%) of the packets sent; under such loss,
BBR performs better than the other congestion control algorithms.
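As a rough sketch of that check - identifiers and the exact DROP_SHIFT value here are illustrative,
not the tuner's actual code - the comparison can be done with a shift rather than a division:

    /* Illustrative only: losses exceed 1/(2^DROP_SHIFT) of packets sent.
     * With DROP_SHIFT = 6 the threshold is 1/64, i.e. ~1.5%. */
    #define DROP_SHIFT 6

    static int should_switch_to_bbr(unsigned long long segs_out,
                                    unsigned long long drops)
    {
            return drops > (segs_out >> DROP_SHIFT);
    }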
For cases where we connect multiple times, we can try different algorithms and select the best.
In selecting the appropriate congestion control algorithm, a reinforcement learning-based method is
used whereby we choose the congestion control algorithm that best fits the optimal bandwidth delay
product (BDP):
BDP = BottleneckBandwidth * MinRoundTripTime
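As a concrete illustration of that relationship (the numbers below are arbitrary and not taken from
the tuner):

    /* Illustrative BDP calculation: a 100 Mbit/s bottleneck with a 10 ms
     * minimum RTT can hold ~125 KB in flight before queueing builds up. */
    #include <stdio.h>

    int main(void)
    {
            unsigned long long bw_bytes_per_sec = 100ULL * 1000 * 1000 / 8; /* 100 Mbit/s */
            unsigned long long min_rtt_usec = 10 * 1000;                    /* 10 ms */
            unsigned long long bdp_bytes = bw_bytes_per_sec * min_rtt_usec / 1000000;

            printf("BDP = %llu bytes\n", bdp_bytes);                        /* 125000 */
            return 0;
    }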
The algorithm works as follows: BPF maintains a map of metrics keyed by remote IP address. For each
remote IP address, we track the minimum RTT observed across all TCP connections and the max bandwidth
observed. The former tells us - as closely as we can determine - what the true RTT of the link is.
The latter estimates the bandwidth limit of the link. Knowing both of these allows us to determine
the optimum operating state for a congestion control algorithm, where we feed the pipe enough to reach
bandwidth limits but do not overwhelm it.
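A minimal sketch of the kind of per-remote-host BPF map this implies is shown below; the structure,
field names and sizes are illustrative only - the authoritative definitions are in the bpftune
sources:

    /* Illustrative only: a hash map keyed by remote address, tracking the
     * minimum RTT and maximum delivery rate seen across connections. */
    #include <linux/bpf.h>
    #include <linux/in6.h>
    #include <bpf/bpf_helpers.h>

    struct remote_host_metrics {
            __u64 min_rtt_usec;        /* lowest RTT seen for this host */
            __u64 max_delivery_rate;   /* highest delivery rate seen */
            __u64 cong_alg_cost[4];    /* learned cost per candidate algorithm */
    };

    struct {
            __uint(type, BPF_MAP_TYPE_HASH);
            __uint(max_entries, 1024);
            __type(key, struct in6_addr);  /* remote address (v4-mapped or v6) */
            __type(value, struct remote_host_metrics);
    } remote_host_map SEC(".maps");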
Tracking both of these allows us to determine the optimal BDP, so any loss function we use for
optimization should drive us towards congestion control algorithms that realize that optimal BDP by
getting as close as possible to the minimum RTT and as close as possible to the maximum packet
delivery rate. We cannot use raw BDP alone because it is the product of the delivery rate and the
RTT, so instead the metric used is:
(current_min_rtt - overall_min_rtt)*S/overall_min_rtt +
(overall_max_delivery_rate - cong_alg_max_delivery_rate)*S/overall_max_delivery_rate
The numerators are multiplied by a scaling factor S so that the integer divisions do not collapse to zero.
See ../src/tcp_conn_tuner.h for the metric computation.
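A minimal user-space sketch of that metric follows; the variable names and the value of S are
illustrative - the version in ../src/tcp_conn_tuner.h is authoritative:

    /* Illustrative cost metric: zero when the connection sees the overall
     * minimum RTT and the algorithm has reached the overall maximum delivery
     * rate; positive otherwise.  S is an arbitrary scaling factor here. */
    #define S 1000

    static unsigned long long cong_metric(unsigned long long curr_min_rtt,
                                          unsigned long long overall_min_rtt,
                                          unsigned long long alg_max_rate,
                                          unsigned long long overall_max_rate)
    {
            if (!overall_min_rtt || !overall_max_rate)
                    return 0;
            return (curr_min_rtt - overall_min_rtt) * S / overall_min_rtt +
                   (overall_max_rate - alg_max_rate) * S / overall_max_rate;
    }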
Note that while we use the current RTT for the connection, we use the maximum delivery rate observed
for the congestion algorithm to compare with the overall maximum. The reasoning here is that because
the delivery rate fluctuates so much between connections (depending on service type etc.), it is
too unstable to use on a per-connection basis. RTT is less variable across connections, so we can
use the current RTT in the metric calculation.
For a TCP connection with optimal BDP (minimum RTT and maximum delivery rate), the loss function
yields 0; otherwise it yields a positive cost. This cost is used to update the stored cost for that
congestion control algorithm via the usual reinforcement learning update, i.e.:
cong_alg_cost = cong_alg_cost +
learning_rate*(curr_cost - cong_alg_cost)
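Expressed as a user-space sketch with the learning rate as an integer fraction (the rate of 1/4
here is illustrative, not the tuner's actual value):

    /* Illustrative exponentially-weighted update: move the stored cost a
     * fraction of the way towards the newly observed cost. */
    static long long update_cost(long long cong_alg_cost, long long curr_cost)
    {
            return cong_alg_cost + (curr_cost - cong_alg_cost) / 4;
    }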
We use an epsilon-greedy approach, whereby the vast majority of the time the lowest-cost algorithm is
used, but 5% of the time we randomly select an algorithm. This ensures that if network conditions
change we can adapt accordingly - without this, we can get stuck and never discover that another
algorithm is doing better.
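A minimal sketch of that epsilon-greedy selection follows; the 5% exploration rate mirrors the
description above, while the candidate count and random source are illustrative:

    /* Illustrative epsilon-greedy choice: usually pick the lowest-cost
     * algorithm, but ~5% of the time explore a random one. */
    #include <stdlib.h>

    #define NR_ALGS 4

    static int pick_cong_alg(const long long cost[NR_ALGS])
    {
            int i, best = 0;

            if (rand() % 100 < 5)           /* explore ~5% of the time */
                    return rand() % NR_ALGS;
            for (i = 1; i < NR_ALGS; i++)   /* otherwise be greedy */
                    if (cost[i] < cost[best])
                            best = i;
            return best;
    }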
How does this work in practice? To benchmark this, we tested iperf3 performance between network
namespaces on the same system, with a 10% loss rate imposed via netem. What we see is that bpftune
converges to using BBR:
IPAddress        CongAlg   Metric    Count  Greedy  MinRtt  MaxRtDlvr
192.168.168.1    cubic     2338876       9       9       3       1737
192.168.168.1    bbr        923173      61      59       3      10024
192.168.168.1    htcp      2318283       5       4       3        620
192.168.168.1    dctcp     3506360       3       1       9        160
Note that we selected the BBR congestion control algorithm 61 out of 78 times and its associated cost
was less than half that of the other algorithms. This is due to it exhibiting the maximum delivery
rate and the lowest RTT.
iperf3 performance also improved as a result of selecting BBR, from a baseline of 58 Mbit/s (running
the Linux default cubic algorithm) to 490 Mbit/s running bpftune and auto-selecting BBR.
So this approach appears to find the right answer and converge quickly under loss conditions; what
about normal network conditions?
We might worry that grounding our model in assumptions closely tied to BBR's design might unduly
favour BBR in all circumstances; do we see this in practice outside of conditions where BBR is
optimal?
Thankfully no; we see a convergence to dctcp as the optimal congestion control algorithm; again it has
the maximum delivery rate and minimum RTT:
IPAddress        CongAlg   Metric    Count  Greedy  MinRtt  MaxRtDlvr
192.168.168.1    cubic     1710535       6       4       3       8951
192.168.168.1    bbr       2309881       1       1       7        206
192.168.168.1    htcp      3333333       3       3       3       8784
192.168.168.1    dctcp     1466296      71      70       3       9377
Note however that it is a close-run thing; cubic's metric is close, it matches dctcp's minimum RTT
(3us), and its maximum delivery rate is nearly as high (8951 for cubic versus 9377 for dctcp).
References:
BBR: Congestion-Based Congestion Control
<https://queue.acm.org/detail.cfm?id=3022184>
SEE ALSO
bpf(2), bpftune(8)