Provided by: xen-utils-common_4.17.3+10-g091466ba55-1.1ubuntu3_amd64

NAME

       xl-numa-placement - Guest Automatic NUMA Placement in libxl and xl

DESCRIPTION

   Rationale
       NUMA (which stands for Non-Uniform Memory Access) means that the memory access times of
       a program running on a CPU depend on the relative distance between that CPU and that
       memory. In fact, most NUMA systems are built in such a way that each processor has
       its local memory, on which it can operate very fast. On the other hand, getting and
       storing data from and on remote memory (that is, memory local to some other processor) is
       considerably more complex and slower. On these machines, a NUMA node is usually defined as
       a set of processor cores (typically a physical CPU package) and the memory directly
       attached to that set of cores.

       NUMA awareness becomes very important as soon as many domains start running memory-
       intensive workloads on a shared host. In fact, the cost of accessing non node-local memory
       locations is very high, and the performance degradation is likely to be noticeable.

       For more information, have a look at the Xen NUMA Introduction
       <https://wiki.xenproject.org/wiki/Xen_on_NUMA_Machines> page on the Wiki.

   Xen and NUMA machines: the concept of node-affinity
       The Xen hypervisor deals with NUMA machines through the concept of node-affinity. The
       node-affinity of a domain is the set of NUMA nodes of the host where the memory for the
       domain is being allocated (mostly, at domain creation time). This is, at least in
       principle, different from and unrelated to the vCPU (hard and soft, see below) scheduling
       affinity, which instead is the set of pCPUs where the vCPU is allowed (or prefers) to run.

       Of course, despite the fact that they belong to and affect different subsystems, the
       domain node-affinity and the vCPUs' affinity are not completely independent.  In fact, if
       the domain node-affinity is not explicitly specified by the user, via the proper libxl
       calls or xl config item, it will be computed based on the vCPUs' scheduling affinity.

       Notice that, even if the node-affinity of a domain may change on-line, it is very
       important to "place" the domain correctly when it is first created, as most of its
       memory is allocated at that time and cannot (for now) be moved easily.

   Placing via pinning and cpupools
       The simplest way of placing a domain on a NUMA node is setting the hard scheduling
       affinity of the domain's vCPUs to the pCPUs of the node. This also goes under the name of
       vCPU pinning, and can be done through the "cpus=" option in the config file (more about
       this below). Another option is to pool together the pCPUs spanning the node and put the
       domain in such a cpupool with the "pool=" config option (as documented in our Wiki
       <https://wiki.xenproject.org/wiki/Cpupools_Howto>).

       In both the above cases, the domain will not be able to execute outside the specified set
       of pCPUs for any reason, even if all those pCPUs are busy doing something else while
       there are other, idle pCPUs.

       So, when doing this, local memory accesses are 100% guaranteed, but that may come at the
       cost of some load imbalances.

   NUMA aware scheduling
       If using the credit1 scheduler, and starting from Xen 4.3, the scheduler itself always
       tries to run the domain's vCPUs on one of the nodes in its node-affinity. Only if that
       turns out to be impossible will it pick any free pCPU. Locality of access is less
       guaranteed than in the pinning case, but that comes along with better chances to exploit
       all the host resources (e.g., the pCPUs).

       Starting from Xen 4.5, credit1 supports two forms of affinity: hard and soft, both on a
       per-vCPU basis. This means each vCPU can have its own soft affinity, stating where that
       vCPU prefers to execute. This is less strict than what (also starting from 4.5) is
       called hard affinity, as the vCPU can potentially run everywhere; it just prefers some
       pCPUs over others.  In Xen 4.5, therefore, NUMA-aware scheduling is achieved by
       matching the soft affinity of the vCPUs of a domain with its node-affinity.

       In fact, as it was for 4.3, if all the pCPUs in a vCPU's soft affinity are busy, it is
       possible for the domain to run outside of them. The idea is that slower execution (due
       to remote memory accesses) is still better than no execution at all (as would happen
       with pinning). For this reason, NUMA aware scheduling has the potential to bring
       substantial performance benefits, although this will depend on the workload.

       Notice that, for each vCPU, the following three scenarios are possible:

       •   a vCPU is pinned to some pCPUs and does not have any soft affinity. In this case, the
           vCPU is always scheduled on one of the pCPUs to which it is pinned, without any
           specific preference among them;

       •   a vCPU has its own soft affinity and is not pinned to any particular pCPU. In this
           case, the vCPU can run on every pCPU. Nevertheless, the scheduler will try to have it
           running on one of the pCPUs in its soft affinity;

       •   a vCPU has its own soft affinity and is also pinned to some pCPUs. In this case,
           the vCPU is always scheduled on one of the pCPUs onto which it is pinned, with, among
           them, a preference for the ones that also form its soft affinity. In case pinning and
           soft affinity form two disjoint sets of pCPUs, pinning "wins", and the soft affinity
           is just ignored.
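
       The three scenarios above can be condensed into a single rule. The following is a minimal
       sketch, not Xen scheduler code: the "pcpu_mask" type and the "preferred_pcpus" function
       are hypothetical names, and it assumes that "not pinned" and "no soft affinity" are both
       modelled as a mask with every bit set.

```c
#include <stdint.h>

/* Illustrative only: one bit per pCPU (up to 64). Not a Xen type. */
typedef uint64_t pcpu_mask;

/*
 * Return the pCPUs the scheduler would try first for a vCPU.
 *
 * hard: pCPUs the vCPU may run on (all bits set if not pinned).
 * soft: pCPUs the vCPU prefers   (all bits set if no soft affinity).
 *
 * The vCPU never runs outside "hard". Within "hard", the pCPUs that
 * are also in "soft" are preferred. If the two sets are disjoint,
 * pinning "wins" and the soft affinity is ignored.
 */
static pcpu_mask preferred_pcpus(pcpu_mask hard, pcpu_mask soft)
{
    pcpu_mask both = hard & soft;

    return both ? both : hard;
}
```

       For instance, with a hard affinity of 0x0F (pCPUs 0-3) and a soft affinity of 0x3C (pCPUs
       2-5), the preferred set is 0x0C (pCPUs 2-3); with a disjoint soft affinity of 0xF0, the
       preferred set falls back to the whole hard affinity, 0x0F.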

   Guest placement in xl
       If using xl for creating and managing guests, it is very easy to ask for either manual or
       automatic placement of them across the host's NUMA nodes.

       Note that xm/xend does a very similar thing, the only differences being the details of the
       heuristics adopted for automatic placement (see below), and the lack of support (in both
       xm/xend and the Xen versions where that was the default toolstack) for NUMA aware
       scheduling.

   Placing the guest manually
       Thanks to the "cpus=" option, it is possible to specify where a domain should be created
       and scheduled, directly in its config file. This affects NUMA placement and memory
       accesses as, in this case, the hypervisor constructs the node-affinity of a VM based
       directly on its vCPU pinning when it is created.

       This is very simple and effective, but requires the user/system administrator to
       explicitly specify the pinning for each and every domain, or Xen won't be able to
       guarantee the locality for their memory accesses.

       That, of course, also means the vCPUs of the domain will only be able to execute on those
       same pCPUs.

       It is also possible to have a "cpus_soft=" option in the xl config file, to specify the
       soft affinity for all the vCPUs of the domain. This affects the NUMA placement in the
       following way:

       •   if only "cpus_soft=" is present, the VM's node-affinity will be equal to the nodes to
           which the pCPUs in the soft affinity mask belong;

       •   if both "cpus_soft=" and "cpus=" are present, the VM's node-affinity will be equal to
           the nodes to which the pCPUs present both in hard and soft affinity belong.
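
       The way the node-affinity follows from "cpus=" and "cpus_soft=" can be sketched as below.
       This is only an illustrative model, not libxl or hypervisor code: the "node_affinity"
       function and the "cpu_to_node" table are hypothetical, a mask of 0 stands for "option not
       given", and node numbers are assumed to be smaller than 64. What happens when the two
       masks do not intersect is not covered by the text above, so the sketch simply returns an
       empty node mask in that case.

```c
#include <stdint.h>

#define MAX_PCPUS 64

/*
 * hard, soft:  one bit per pCPU; 0 means the option was not given.
 * cpu_to_node: hypothetical topology table, pCPU index -> NUMA node.
 * Returns a node mask with one bit per NUMA node.
 */
static uint64_t node_affinity(uint64_t hard, uint64_t soft,
                              const int cpu_to_node[], int nr_pcpus)
{
    uint64_t cpus;
    uint64_t nodes = 0;

    if (hard && soft)
        cpus = hard & soft;   /* both given: pCPUs present in both masks */
    else if (soft)
        cpus = soft;          /* only cpus_soft= given */
    else
        cpus = hard;          /* only cpus= given (or neither: empty) */

    /* Collect the nodes that the selected pCPUs belong to. */
    for (int cpu = 0; cpu < nr_pcpus && cpu < MAX_PCPUS; cpu++)
        if (cpus & (UINT64_C(1) << cpu))
            nodes |= UINT64_C(1) << cpu_to_node[cpu];

    return nodes;
}
```

       On a hypothetical host where pCPUs 0-3 belong to node 0 and pCPUs 4-7 to node 1, a soft
       affinity of 0xF0 alone yields node 1 only, while a hard affinity of 0x3C combined with a
       soft affinity of 0x0F yields node 0 only, because the two masks share just pCPUs 2 and 3.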

   Placing the guest automatically
       If neither "cpus=" nor "cpus_soft=" is present in the config file, libxl tries to figure
       out on its own on which node(s) the domain could fit best.  If it finds one (some), the
       domain's node-affinity gets set to it (them), and both memory allocations and NUMA aware
       scheduling (for the credit scheduler and starting from Xen 4.3) will comply with it.
       Starting from Xen 4.5, this also means that the mask resulting from this "fitting"
       procedure will become the soft affinity of all the vCPUs of the domain.

       It is worth noting that optimally fitting a set of VMs on the NUMA nodes of a host
       is an instance of the Bin Packing Problem. In fact, the various VMs with different
       memory sizes are the items to be packed, and the host nodes are the bins. As this problem
       is known to be NP-hard, some heuristics are used.

       The first thing to do is find the nodes or the sets of nodes (from now on referred to as
       'candidates') that have enough free memory and enough physical CPUs for accommodating the
       new domain. The idea is to find a spot for the domain with at least as much free memory as
       it is configured to have, and as many pCPUs as it has vCPUs.  After that, the actual
       decision on which candidate to pick happens according to the following heuristics:

       •   candidates involving fewer nodes are considered better. In case two (or more)
           candidates span the same number of nodes,

       •   candidates with a smaller number of vCPUs runnable on them (due to previous placement
           and/or plain vCPU pinning) are considered better. In case the same number of vCPUs can
           run on two (or more) candidates,

       •   the candidate with the greatest amount of free memory is considered to be the
           best one.
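
       The candidate filter and the three tie-breaking rules above can be sketched as a
       qsort()-style comparator. This follows the description in this page literally and is not
       the libxl implementation, which may weigh the criteria differently; the "candidate"
       structure, its field names and both functions are hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>

/* Illustrative description of one candidate (a node or a set of nodes). */
struct candidate {
    int nr_nodes;         /* NUMA nodes spanned by the candidate */
    int nr_pcpus;         /* pCPUs available on those nodes */
    int nr_vcpus;         /* vCPUs already runnable on those nodes */
    uint64_t free_memkb;  /* free memory on those nodes */
};

/* A candidate must have enough free memory and at least as many pCPUs
 * as the new domain has vCPUs. */
static int is_candidate(const struct candidate *c,
                        uint64_t need_memkb, int need_vcpus)
{
    return c->free_memkb >= need_memkb && c->nr_pcpus >= need_vcpus;
}

/* Returns a negative value when a is a better candidate than b. */
static int candidate_cmp(const void *pa, const void *pb)
{
    const struct candidate *a = pa, *b = pb;

    /* 1. Fewer nodes is better. */
    if (a->nr_nodes != b->nr_nodes)
        return a->nr_nodes < b->nr_nodes ? -1 : 1;

    /* 2. Fewer vCPUs already runnable there is better. */
    if (a->nr_vcpus != b->nr_vcpus)
        return a->nr_vcpus < b->nr_vcpus ? -1 : 1;

    /* 3. More free memory is better. */
    if (a->free_memkb != b->free_memkb)
        return a->free_memkb > b->free_memkb ? -1 : 1;

    return 0;
}
```

       Sorting an array of feasible candidates with qsort() and this comparator leaves the
       preferred candidate in the first element.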

       Giving preference to candidates with fewer nodes ensures better performance for the guest,
       as it avoids spreading its memory among different nodes. Favoring candidates with fewer
       vCPUs already runnable there ensures a good balance of the overall host load. Finally, if
       several candidates fulfil these criteria, prioritizing the nodes that have the largest
       amounts of free memory helps keep memory fragmentation small, and maximizes the
       probability of being able to put more domains there.

   Guest placement in libxl
       xl achieves automatic NUMA placement because that is what libxl does by default. No API is
       provided (yet) for modifying the behaviour of the placement algorithm. However, if your
       program is calling libxl, it is possible to set the "numa_placement" build info key to
       "false" (it is "true" by default) with something like the following, to prevent any
       placement from happening:

           libxl_defbool_set(&domain_build_info->numa_placement, false);

       Also, if "numa_placement" is set to "true", the domain's vCPUs must not be pinned (i.e.,
       "domain_build_info->cpumap" must have all its bits set, as it is by default), or domain
       creation will fail with "ERROR_INVAL".
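
       The rule just stated can be summarized with a small toy model. It deliberately does not
       use the libxl headers: "build_info_model", "check_placement_request" and the two result
       codes are hypothetical stand-ins, and the numeric value used for the error is not libxl's
       actual "ERROR_INVAL" value.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical result codes; not the values libxl uses. */
enum { MODEL_OK = 0, MODEL_ERROR_INVAL = -1 };

/* Toy stand-in for the two build info fields discussed above. */
struct build_info_model {
    bool numa_placement;  /* stands in for the "numa_placement" key */
    uint64_t cpumap;      /* stands in for "cpumap", one bit per pCPU */
};

/*
 * Automatic placement and vCPU pinning are mutually exclusive:
 * if placement is requested, every bit of the cpumap must be set.
 */
static int check_placement_request(const struct build_info_model *b,
                                   int nr_pcpus)
{
    uint64_t all = nr_pcpus >= 64 ? UINT64_MAX
                                  : (UINT64_C(1) << nr_pcpus) - 1;
    bool pinned = (b->cpumap & all) != all;

    if (b->numa_placement && pinned)
        return MODEL_ERROR_INVAL;

    return MODEL_OK;
}
```

       In this model, a request with placement enabled and a cpumap of 0x0F on an 8-pCPU host is
       rejected, while the same cpumap is accepted once placement is disabled.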

       Starting from Xen 4.3, in case automatic placement happens (and is successful), it will
       affect the domain's node-affinity and not its vCPU pinning. Namely, the domain's vCPUs
       will not be pinned to any pCPU on the host, but the memory for the domain will come from
       the selected node(s) and the NUMA aware scheduling (if the credit scheduler is in use)
       will try to keep the domain's vCPUs there as much as possible.

       Besides that, for looking at and/or tweaking the placement algorithm, search for
       "Automatic NUMA placement" in libxl_internal.h.

       Note this may change in future versions of Xen/libxl.

   Xen < 4.5
       The concept of vCPU soft affinity was introduced for the first time in Xen 4.5. In
       4.3, it is the domain's node-affinity that drives the NUMA-aware scheduler. The main
       difference is that soft affinity is per-vCPU, so each vCPU can have its own mask of pCPUs,
       while node-affinity is per-domain, which is the equivalent of having all the vCPUs with
       the same soft affinity.

   Xen < 4.3
       As NUMA aware scheduling is a new feature of Xen 4.3, things are a little bit different
       for earlier versions of Xen. If no "cpus=" option is specified and Xen 4.2 is in use, the
       automatic placement algorithm still runs, but the result is used to pin the vCPUs of the
       domain to the output node(s).  This is consistent with what was happening with xm/xend.

       On a version of Xen earlier than 4.2, there is no automatic placement at all in xl or
       libxl, and hence no node-affinity, vCPU affinity or pinning is introduced/modified.
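
       As a recap of the version-dependent behaviour described in this and the previous sections,
       the following sketch maps a Xen version to what a successful automatic placement acts
       upon. The enum and function names are hypothetical and exist only for illustration; they
       are not part of any Xen or libxl interface.

```c
/* What the result of automatic placement is applied to. */
enum placement_effect {
    PLACEMENT_NONE,           /* before 4.2: no automatic placement */
    PLACEMENT_PINS_VCPUS,     /* 4.2: vCPUs pinned to the chosen node(s) */
    PLACEMENT_NODE_AFFINITY,  /* 4.3, 4.4: per-domain node-affinity */
    PLACEMENT_SOFT_AFFINITY   /* 4.5 and later: per-vCPU soft affinity
                                 matching the node-affinity */
};

/* True if version major.minor is at least want_major.want_minor. */
static int version_at_least(int major, int minor,
                            int want_major, int want_minor)
{
    return major > want_major ||
           (major == want_major && minor >= want_minor);
}

static enum placement_effect effect_for_version(int major, int minor)
{
    if (version_at_least(major, minor, 4, 5))
        return PLACEMENT_SOFT_AFFINITY;
    if (version_at_least(major, minor, 4, 3))
        return PLACEMENT_NODE_AFFINITY;
    if (version_at_least(major, minor, 4, 2))
        return PLACEMENT_PINS_VCPUS;
    return PLACEMENT_NONE;
}
```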

   Limitations
       Analyzing various possible placement solutions is what makes the algorithm flexible and
       quite effective. However, that also means it won't scale well to systems with an arbitrary
       number of nodes.  For this reason, automatic placement is disabled (with a warning) if it
       is requested on a host with more than 16 NUMA nodes.
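
       To give a feeling for why the search does not scale, the sketch below counts how many
       sets of nodes exist on a host, assuming (as a simplification) that every non-empty set of
       nodes is a potential candidate; the real algorithm may examine fewer. The macro and
       function names are hypothetical; only the limit of 16 nodes comes from the text above.

```c
#include <stdint.h>

/* Limit stated in this page; the macro name is made up for this sketch. */
#define MAX_PLACEMENT_NODES 16

/* Automatic placement is only attempted up to the node limit. */
static int placement_allowed(int nr_nodes)
{
    return nr_nodes <= MAX_PLACEMENT_NODES;
}

/* Number of non-empty sets of nodes on a host: 2^n - 1. */
static uint64_t nr_node_sets(int nr_nodes)
{
    if (nr_nodes <= 0)
        return 0;
    if (nr_nodes >= 64)
        return UINT64_MAX;  /* saturate instead of overflowing */

    return (UINT64_C(1) << nr_nodes) - 1;
}
```

       Under that assumption, a 4-node host has 15 possible sets of nodes, while a 16-node host
       already has 65535, and each additional node doubles the count.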