Provided by: xen-utils-common_4.17.3+10-g091466ba55-1.1ubuntu3_amd64

NAME

       xen-tscmode - Xen TSC (time stamp counter) and timekeeping discussion

OVERVIEW

       As of Xen 4.0, a new config option called tsc_mode may be specified for each domain.  The default for
       tsc_mode handles the vast majority of hardware and software environments.  This document is intended for
       Xen users and administrators who may need to select a non-default tsc_mode.

       Proper selection of tsc_mode depends on an understanding not only of the guest operating system (OS), but
       also of every application that will ever run on this guest OS.  This is because tsc_mode applies
       equally to both the OS and ALL apps that are running on this domain, now or in the future.

       Key questions to be answered for the OS and/or each application are:

       •   Does the OS/app use the rdtsc instruction at all?  (We will explain below how to determine this.)

       •   At what frequency is the rdtsc instruction executed by either the OS or any running apps?  If the sum
           exceeds about 10,000 rdtsc instructions per second per processor, we call this a "high-TSC-frequency"
           OS/app/environment.  (This is relatively rare, and developers of OSes and apps that are high-TSC-
           frequency are usually aware of it.)

       •   If the OS/app does use rdtsc, will it behave incorrectly if "time goes backwards" or if the frequency
           of the TSC suddenly changes?  If so, we call this a "TSC-sensitive" app or OS; otherwise it is "TSC-
           resilient".

       This last is the US$64,000 question as it may be very difficult (or, for legacy apps, even impossible) to
       predict all possible failure cases.  As a result, unless proven otherwise, any app that uses rdtsc must
       be assumed to be TSC-sensitive and, as we will see, this is the default starting in Xen 4.0.

       Xen's new tsc_mode parameter determines the circumstances under which the family of rdtsc instructions
       is executed "natively" vs emulated.  Roughly speaking, native means rdtsc is fast but TSC-sensitive apps
       may, under unpredictable circumstances, run incorrectly; emulated means there is some performance
       degradation (unobservable in most cases), but TSC-sensitive apps will always run correctly.  Prior to Xen
       4.0, all rdtsc instructions were native: "fast but potentially incorrect."  Starting at Xen 4.0, the
       default is that all rdtsc instructions are "correct but potentially slow".  The tsc_mode parameter in 4.0
       provides an intelligent default but allows system administrators to adjust how rdtsc instructions are
       executed for each domain.

       The non-default choices for tsc_mode are:

       •   tsc_mode=1 (always emulate).

           All rdtsc instructions are emulated; this is the best choice when TSC-sensitive apps are running and
           it is necessary to understand worst-case performance degradation for a specific hardware environment.

       •   tsc_mode=2 (never emulate).

           This is the same as prior to Xen 4.0 and is the best choice if it is certain that all apps running in
           this VM are TSC-resilient and highest performance is required.

       •   tsc_mode=3 (PVRDTSCP).

           This mode has been removed.

       If tsc_mode is left unspecified (or set to tsc_mode=0), a hybrid algorithm is utilized to ensure
       correctness while providing the best performance possible given:

       •   the requirement of correctness,

       •   the underlying hardware, and

       •   whether or not the VM has been saved/restored/migrated.

       The rest of this document explains this in more detail.
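
       The guidance above on choosing a non-default mode can be summarized as a small decision helper.  This
       is only a sketch: the function and its parameter names are hypothetical, and the answers it takes as
       input must still come from the administrator's knowledge of the OS and apps.

```python
def suggest_tsc_mode(all_apps_tsc_resilient, need_highest_performance,
                     measuring_worst_case=False):
    """Return a numeric tsc_mode following the guidance in the text above.

    All parameter names are hypothetical; they stand for answers that the
    administrator has to supply.
    """
    if not all_apps_tsc_resilient and measuring_worst_case:
        return 1  # always emulate: exposes worst-case emulation cost
    if all_apps_tsc_resilient and need_highest_performance:
        return 2  # never emulate: same behavior as prior to Xen 4.0
    return 0      # default: hybrid algorithm chosen by Xen
```

       Any combination of answers not covered by the two non-default cases falls through to tsc_mode=0.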

DETERMINING RDTSC FREQUENCY

       To determine the frequency of rdtsc instructions that are emulated, an "xl" command can be used by a
       privileged user of domain0.  The command:

           # xl debug-key s; xl dmesg | tail

       provides information about TSC usage in each domain where TSC emulation is currently enabled.
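
       Once two cumulative counts of emulated rdtsc instructions have been read by hand for a domain, the
       rate can be compared with the "high-TSC-frequency" threshold of about 10,000 per second per processor.
       The sketch below only does the arithmetic; it does not parse any Xen output, and the function names
       and sample numbers are made up for illustration.

```python
HIGH_TSC_FREQUENCY = 10_000  # rdtsc per second per processor (from the text)


def emulated_rdtsc_rate(count_start, count_end, seconds, processors):
    """Average emulated rdtsc instructions per second per processor."""
    if seconds <= 0 or processors <= 0:
        raise ValueError("seconds and processors must be positive")
    if count_end < count_start:
        raise ValueError("counts are cumulative and must not decrease")
    return (count_end - count_start) / seconds / processors


def is_high_tsc_frequency(rate):
    """True if the rate exceeds the threshold named in the text."""
    return rate > HIGH_TSC_FREQUENCY


# Hypothetical sample: 1,200,000 emulated rdtsc in 30 seconds on 2 processors.
rate = emulated_rdtsc_rate(0, 1_200_000, seconds=30, processors=2)
```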

TSC HISTORY

       To understand tsc_mode completely, some background on TSC is required:

       The x86 "timestamp counter", or TSC, is a 64-bit register on each processor that increases monotonically.
       Historically, TSC incremented every processor cycle, but on recent processors, it increases at a constant
       rate even if the processor changes frequency (for example, to reduce processor power usage).  TSC is
       known by x86 programmers as the fastest, highest-precision measurement of the passage of time so it is
       often used as a foundation for performance monitoring.  And since it is guaranteed to be monotonically
       increasing and, at 64 bits, is guaranteed to not wrap around within 10 years, it is sometimes used as a
       random number or a unique sequence identifier, such as to stamp transactions so they can be replayed in a
       specific order.
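
       The wraparound claim can be checked with simple arithmetic.  The sketch below assumes a counter that
       starts at zero and ticks at a fixed rate; the 3 GHz figure is an illustrative assumption, not a value
       taken from any particular processor.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def years_until_wrap(tsc_hz, bits=64):
    """Years before a counter of the given width, ticking at tsc_hz, wraps."""
    return (2 ** bits) / tsc_hz / SECONDS_PER_YEAR


years_at_3ghz = years_until_wrap(3e9)  # assumed 3 GHz TSC
```

       At the assumed 3 GHz this comes to roughly 195 years, comfortably beyond the 10 years mentioned above.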

       On most older SMP and early multi-core machines, TSC was not synchronized between processors.  Thus if an
       application were to read the TSC on one processor, then was moved by the OS to another processor, then
       read TSC again, it might appear that "time went backwards".  This loss of monotonicity resulted in many
       obscure application bugs when TSC-sensitive apps were ported from a uniprocessor to an SMP environment;
       as a result, many applications -- especially in the Windows world -- removed their dependency on TSC and
       replaced their timestamp needs with OS-specific functions, losing both performance and precision. On some
       more recent generations of multi-core machines, especially multi-socket multi-core machines, the TSC was
       synchronized but if one processor were to enter certain low-power states, its TSC would stop, destroying
       the synchrony and again causing obscure bugs.  This reinforced decisions to avoid use of TSC altogether.
       On the most recent generations of multi-core machines, however, synchronization is provided across all
       processors in all power states, even on multi-socket machines, and the hardware provides a flag indicating
       that TSC is synchronized and "invariant".  Thus TSC is once again useful for applications, and even newer
       operating systems are using and depending upon TSC for critical timekeeping tasks when running on these
       recent machines.

       We will refer to hardware that ensures TSC is both synchronized and invariant as "TSC-safe" and any
       hardware on which TSC is not (or may not remain) synchronized as "TSC-unsafe".

       As a result of TSC's sordid history, two classes of applications use TSC: old applications designed for
       single processors, and the most recent enterprise applications which require high-frequency high-
       precision timestamping.

       We will refer to apps that might break if running on a TSC-unsafe machine as "TSC-sensitive"; apps that
       don't use TSC, or that use TSC in a way for which monotonicity and frequency invariance are
       unimportant, we will refer to as "TSC-resilient".

       The emergence of virtualization once again complicates the usage of TSC.  When features such as
       save/restore or live migration are employed, a guest OS and all its currently running applications may be
       invisibly transported to an entirely different physical machine.  While TSC may be "safe" on one machine,
       it is essentially impossible to precisely synchronize TSC across a data center or even a pool of
       machines.  As a result, when run in a virtualized environment, rare and obscure "time going backwards"
       problems might once again occur for those TSC-sensitive applications.  Worse, if a guest OS moves from,
       for example, a 3GHz machine to a 1.5GHz machine, attempts by an OS/app to measure time intervals with TSC
       may without notice be incorrect by a factor of two.
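
       The factor-of-two error can be made concrete.  The sketch below assumes an app that calibrated itself
       at 3 GHz on the original machine and keeps using that figure after the guest has moved to a 1.5 GHz
       machine with a native TSC; the function name and numbers are illustrative only.

```python
def elapsed_seconds(tsc_start, tsc_end, assumed_hz):
    """Interval an app computes from two TSC reads and its assumed frequency."""
    return (tsc_end - tsc_start) / assumed_hz


ASSUMED_HZ = 3_000_000_000   # frequency the app calibrated on the old host
ACTUAL_HZ = 1_500_000_000    # frequency of the new host's TSC

true_interval = 2                           # seconds of real time
tsc_delta = true_interval * ACTUAL_HZ       # ticks actually counted
measured = elapsed_seconds(0, tsc_delta, ASSUMED_HZ)
```

       Here the app reports 1 second for an interval that really lasted 2 seconds.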

       The rdtsc (read timestamp counter) instruction is used to read the TSC register.  The rdtscp instruction
       is a variant of rdtsc on recent processors.  We refer to these together as the rdtsc family of
       instructions, or just "rdtsc".  Instructions in the rdtsc family are non-privileged, but privileged
       software may set a control register bit (CR4.TSD) to cause rdtsc family instructions to trap.  This trap can be detected
       by Xen, which can then transparently "emulate" the results of the rdtsc instruction and return control to
       the code following the rdtsc instruction.

       To provide a "safe" TSC, i.e. to ensure both TSC monotonicity and a fixed rate, Xen provides rdtsc
       emulation whenever necessary or when explicitly specified by a per-VM configuration option.  TSC
       emulation is relatively slow -- roughly 15-20 times slower than the rdtsc instruction when executed
       natively.  However, except when an OS or application uses the rdtsc instruction at a high frequency (e.g.
       more than about 10,000 times per second per processor), this performance degradation is not noticeable
       (i.e. <0.3%).  Moreover, TSC emulation is nearly always faster than OS-provided alternatives (e.g. Linux's
       gettimeofday).  For environments where it is certain that all apps are TSC-resilient (i.e. "TSC-
       safeness" is not necessary) and highest performance is a requirement, TSC emulation may be entirely
       disabled (tsc_mode==2).
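
       The overhead figure follows from multiplying the rdtsc rate by the extra time each emulated rdtsc
       costs.  In the sketch below, the 300 ns extra cost per emulated rdtsc is an assumption chosen only so
       that the result lines up with the 0.3% bound quoted above; the real cost depends on the hardware.

```python
def emulation_overhead_fraction(rdtsc_per_second, extra_ns_per_rdtsc):
    """Fraction of one processor's time spent on the extra emulation cost."""
    return rdtsc_per_second * extra_ns_per_rdtsc * 1e-9


# 10,000 rdtsc per second per processor (threshold from the text),
# 300 ns extra per emulated rdtsc (assumed, hardware dependent).
overhead = emulation_overhead_fraction(10_000, 300)
```

       Under these assumptions the overhead is about 0.003 of one processor, i.e. 0.3%; below that rate it
       shrinks proportionally.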

       The default mode (tsc_mode==0) checks TSC-safeness of the underlying hardware on which the virtual
       machine is launched.  If it is TSC-safe, rdtsc will execute at hardware speed; if it is not, rdtsc will
       be emulated.  Once a virtual machine is save/restored or migrated, however, there are two possibilities:
       TSC remains native IF the source physical machine and target physical machine have the same TSC frequency
       (or, for HVM/PVH guests, if TSC scaling support is available); else TSC is emulated.  Note that, though
       emulated, the "apparent" TSC frequency will be the TSC frequency of the initial physical machine, even
       after migration.
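
       The default-mode rules in the paragraph above can be written as a single predicate.  This is a sketch
       of the rules as described here, not of Xen's implementation: the function and parameter names are
       hypothetical, and it assumes that a domain whose rdtsc was emulated at launch stays emulated.

```python
def rdtsc_native_in_default_mode(launch_host_tsc_safe, migrated=False,
                                 same_tsc_frequency=True, guest_type="hvm",
                                 tsc_scaling_available=False):
    """Return True if rdtsc runs natively under tsc_mode=0, else False.

    guest_type is one of "pv", "hvm" or "pvh" (hypothetical labels).
    """
    if not launch_host_tsc_safe:
        return False  # emulated from launch (assumed to stay emulated)
    if not migrated:
        return True   # TSC-safe launch host: hardware speed
    if same_tsc_frequency:
        return True   # source and target TSC frequencies match
    # Different frequency: only HVM/PVH guests can rely on TSC scaling.
    return guest_type in ("hvm", "pvh") and tsc_scaling_available
```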

       Finally, tsc_mode==1 always enables TSC emulation, regardless of the underlying physical hardware. The
       "apparent" TSC frequency will be the TSC frequency of the initial physical machine, even after migration.
       This mode is useful to measure any performance degradation that might be encountered by a tsc_mode==0
       domain after migration occurs.

       Note that while Xen ensures that an emulated TSC is "safe" across migration, it does not ensure that it
       continues to tick at the same rate during the actual migration.  As an oversimplified example, if TSC is
       ticking once per second in a guest, and the guest is saved when the TSC is 1000, then restored 30 seconds
       later, TSC is only guaranteed to be greater than or equal to 1001, not precisely 1030.  This has some OS
       implications as will be seen in the next section.
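
       Using the numbers from the oversimplified example above, the sketch below shows why a guest cannot
       infer how long it was suspended from the TSC alone.  The function names are hypothetical.

```python
def satisfies_restore_guarantee(saved_tsc, restored_tsc):
    """The only guarantee: the TSC has moved forward after restore."""
    return restored_tsc >= saved_tsc + 1


def apparent_gap_seconds(saved_tsc, restored_tsc, tsc_hz):
    """Time gap the guest would infer from the TSC alone."""
    return (restored_tsc - saved_tsc) / tsc_hz


# TSC ticking once per second, saved at 1000, restored 30 seconds later.
ok_minimal = satisfies_restore_guarantee(1000, 1001)
ok_exact = satisfies_restore_guarantee(1000, 1030)
inferred = apparent_gap_seconds(1000, 1001, tsc_hz=1)
```

       Both 1001 and 1030 satisfy the guarantee, so in the worst case the guest infers a gap of 1 second
       although 30 seconds of wallclock time have passed.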

TSC INVARIANT BIT and NO_MIGRATE

       Related to TSC emulation, the "TSC Invariant" bit is architecturally defined in a cpuid bit on the most
       recent x86 processors.  If set, TSC invariance ensures that the TSC is "safe", that is, it will increment
       at a constant rate regardless of power events, will be synchronized across all processors, and was
       properly initialized to zero on all processors at boot-time by system hardware/BIOS.  As long as system
       software never writes to TSC, TSC will be safe and continuously incremented at a fixed rate and thus can
       be used as a system "clocksource".

       This bit is used by some OSes, and specifically by Linux starting with version 2.6.30(?), to select TSC
       as a system clocksource.  Once selected, TSC remains the Linux system clocksource unless manually
       overridden.  In a virtualized environment, since it is not possible to synchronize TSC across all the
       machines in a pool or data center, a migration may "break" TSC as a usable clocksource; while time will
       not go backwards, it may not track wallclock time well enough to avoid certain time-sensitive
       consequences.  As a result, Xen can only expose the TSC Invariant bit to a guest OS if it is certain that
       the domain will never migrate.  As of Xen 4.0, the "no_migrate=1" VM configuration option may be
       specified to disable migration.  If no_migrate is selected and the VM is running on a physical machine
       with "TSC Invariant", Linux 2.6.30+ will safely use TSC as the system clocksource.  However, attempts to
       migrate this domain or, once it has been saved, to restore it will fail.
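
       The relationship between no_migrate and the TSC Invariant bit described above amounts to two simple
       conditions.  The sketch below restates them; the function names are hypothetical and this is not how
       Xen itself is structured.

```python
def expose_tsc_invariant(host_has_tsc_invariant, no_migrate):
    """The bit is exposed only if the host has it and the domain cannot migrate."""
    return bool(host_has_tsc_invariant and no_migrate)


def migrate_or_restore_allowed(no_migrate):
    """With no_migrate=1, attempts to migrate or restore the domain fail."""
    return not no_migrate
```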

       There is another cpuid-related complication: The x86 cpuid instruction is non-privileged.  HVM domains
       are configured to always trap this instruction to Xen, where Xen can "filter" the result.  In a PV OS,
       all cpuid instructions have been replaced by a paravirtualized equivalent of the cpuid instruction
       ("pvcpuid") and also trap to Xen.  But apps in a PV guest that use a cpuid instruction execute it
       directly, without a trap to Xen.  As a result, an app may directly examine the physical TSC Invariant
       cpuid bit and make decisions based on that bit.

HARDWARE TSC SCALING

       Intel VMX TSC scaling and AMD SVM TSC ratio allow the guest TSC read by the guest's rdtsc and rdtscp
       instructions to increase at a frequency different from the host TSC frequency.

       If an HVM container in default TSC mode (tsc_mode=0) is created on a host that provides constant TSC, its
       guest TSC frequency will be the same as the host. If it is later migrated to another host that provides
       constant TSC and supports Intel VMX TSC scaling/AMD SVM TSC ratio, its guest TSC frequency will be the
       same before and after migration.

       For the HVM container above in default TSC mode (tsc_mode=0), if the hosts above support rdtscp, both
       guest rdtsc and rdtscp instructions will be executed natively before and after migration.
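
       Conceptually, TSC scaling multiplies the host TSC by the ratio of guest to host frequency.  The sketch
       below uses a fixed-point ratio to show the idea; the 32 fractional bits and the function names are
       illustrative assumptions, and the actual hardware formats are not described in this document.

```python
def tsc_ratio(guest_khz, host_khz, frac_bits=32):
    """Fixed-point ratio guest_khz / host_khz (frac_bits is an assumption)."""
    if host_khz <= 0:
        raise ValueError("host_khz must be positive")
    return (guest_khz << frac_bits) // host_khz


def scale_tsc(host_tsc, ratio, frac_bits=32):
    """Guest-visible TSC value for a given host TSC value."""
    return (host_tsc * ratio) >> frac_bits


# Guest created on a 3 GHz host, later migrated to a 1.5 GHz host.
ratio = tsc_ratio(guest_khz=3_000_000, host_khz=1_500_000)
guest_ticks = scale_tsc(1_500_000_000, ratio)  # one second of host ticks
```

       One second of host ticks on the 1.5 GHz machine appears to the guest as 3,000,000,000 ticks, so the
       guest still sees its original 3 GHz TSC frequency.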

AUTHORS

       Dan Magenheimer <dan.magenheimer@oracle.com>