Provided by: numatop_1.0.4-8_amd64 bug

NAME

       numatop - a tool for memory access locality characterization and analysis.

SYNOPSIS

       numatop [-s] [-l] [-f] [-d]

       numatop [-h]

DESCRIPTION

       This manual page briefly documents the numatop command.

       Most  modern systems use a Non-Uniform Memory Access (NUMA) design for multiprocessing. In
       NUMA systems, memory and processors are organized in such a way that some parts of  memory
       are  closer  to  a given processor, while other parts are farther from it. A processor can
       access memory that is closer to it much faster than the memory that is  farther  from  it.
       Hence,  the  latency between the processors and different portions of the memory in a NUMA
       machine may be significantly different.

       numatop is an observation tool for runtime memory locality characterization  and  analysis
       of  processes  and threads running on a NUMA system. It helps the user to characterize the
       NUMA behavior of processes and threads and to identify where the NUMA-related  performance
       bottlenecks  reside.  The tool uses hardware performance counter sampling technologies and
       associates the performance data with Linux system runtime information to provide real-time
       analysis in production systems. The tool can be used to:

       A)  Characterize  the locality of all running processes and threads to identify those with
       the poorest locality in the system.

       B) Identify the "hot" memory areas, report average memory access latency, and provide  the
       location   where   accessed   memory   is   allocated.   A  "hot"  memory  area  is  where
       process/thread(s) accesses are most frequent. numatop has a metric called  "ACCESS%"  that
       specifies what percentage of memory accesses are attributable to each memory area.

       Note:  numatop  records  only  the  memory  accesses  which  have latencies greater than a
       predefined threshold (128 CPU cycles).

       C) Provide the call-chain(s) in the process/thread code that accesses a given  hot  memory
       area.

       D)  Provide  the  call-chain(s)  when  the process/thread generates certain counter events
       (RMA/LMA/IR/CYCLE). The call-chain(s) helps to locate the source code that  generates  the
       events.

       RMA: Remote Memory Access.
       LMA: Local Memory Access.
       IR: Instruction Retired.
       CYCLE: CPU cycles.

       E)  Provide  per-node  statistics  for  memory and CPU utilization. A node is: a region of
       memory in which every byte has the same distance from each CPU.

       F) Show, using a user-friendly interface, the list of  processes/threads  sorted  by  some
       metrics  (by  default, sorted by CPU utilization), with the top process having the highest
       CPU utilization in the system and the bottom one having the lowest CPU utilization.  Users
       can  also  use  hotkeys to resort the output by these metrics: RMA, LMA, RMA/LMA, CPI, and
       CPU%.

       RMA/LMA: ratio of RMA/LMA.
       CPI: CPU cycle per instruction.
       CPU%: CPU utilization.

       numatop is a GUI tool that periodically tracks and analyzes the NUMA activity of processes
       and  threads and displays useful metrics. Users can scroll up/down by using the up or down
       key to navigate in the current window and can use several hot keys shown at the bottom  of
       the  window,  to  switch  between windows or to change the running state of the tool.  For
       example, hotkey 'R' refreshes the data in the current window.

       Below is a detailed description of the various display windows and  the  data  items  that
       they display:

       [WIN1 - Monitoring processes and threads]:
       Get the locality characterization of all processes. This is the first window upon startup,
       it's numatop's "Home" window. This window displays a list of processes.  The  top  process
       has  the  highest  system  CPU utilization (CPU%), while the bottom process has the lowest
       CPU% in the system. Generally, the memory-intensive process is also CPU-intensive, so  the
       processes  shown  in this window are sorted by CPU% by default. The user can press hotkeys
       '1', '2', '3', '4', or '5' to resort the output by  "RMA",  "LMA",  "RMA/LMA",  "CPI",  or
       "CPU%".

       [KEY METRICS]:
       RMA(K): number of Remote Memory Access (unit is 1000).
               RMA(K) = RMA / 1000;
       LMA(K): number of Local Memory Access (unit is 1000).
               LMA(K) = LMA / 1000;
       RMA/LMA: ratio of RMA/LMA.
       CPI: CPU cycles per instruction.
       CPU%: system CPU utilization (busy time across all CPUs).

       [HOTKEY]:
       Q: Quit the application.
       H: WIN1 refresh.
       R: Refresh to show the latest data.
       I: Switch to WIN2 to show the normalized data.
       N: Switch to WIN11 to show the per-node statistics.
       1: Sort by RMA.
       2: Sort by LMA.
       3: Sort by RMA/LMA.
       4: Sort by CPI.
       5: Sort by CPU%

       [WIN2 - Monitoring processes and threads (normalized)]:
       Get the normalized locality characterization of all processes.

       [KEY METRICS]:
       RPI(K): RMA normalized by 1000 instructions.
               RPI(K) = RMA / (IR / 1000);
       LPI(K): LMA normalized by 1000 instructions.
               LPI(K) = LMA / (IR / 1000);
       Other metrics remain the same.

       [HOTKEY]:
       Q: Quit the application.
       H: Switch to WIN1.
       B: Back to previous window.
       R: Refresh to show the latest data.
       N: Switch to WIN11 to show the per-node statistics.
       1: Sort by RPI.
       2: Sort by LPI.
       3: Sort by RMA/LMA.
       4: Sort by CPI.
       5: Sort by CPU%

       [WIN3 - Monitoring the process]:
       Get the locality characterization with node affinity of a specified process.

       [KEY METRICS]:
       NODE: the node ID.
       CPU%: per-node CPU utilization.
       Other metrics remain the same.

       [HOTKEY]:
       Q: Quit the application.
       H: Switch to WIN1.
       B: Back to previous window.
       R: Refresh to show the latest data.
       N: Switch to WIN11 to show the per-node statistics.
       L: Show the latency information.
       C: Show the call-chain.

       [WIN4 - Monitoring all threads]:
       Get the locality characterization of all threads in a specified process.

       [KEY METRICS]:
       CPU%: per-CPU CPU utilization.
       Other metrics remain the same.

       [HOTKEY]:
       Q: Quit the application.
       H: Switch to WIN1.
       B: Back to previous window.
       R: Refresh to show the latest data.
       N: Switch to WIN11 to show the per-node statistics.

       [WIN5 - Monitoring the thread]:
       Get the locality characterization with node affinity of a specified thread.

       [KEY METRICS]:
       CPU%: per-CPU CPU utilization.
       Other metrics remain the same.

       [HOTKEY]:
       Q: Quit the application.
       H: Switch to WIN1.
       B: Back to previous window.
       R: Refresh to show the latest data.
       N: Switch to WIN11 to show the per-node statistics.
       L: Show the latency information.
       C: Show the call-chain.

       [WIN6 - Monitoring memory areas]:
       Get   the   memory  area  use  with  the  associated  accessing  latency  of  a  specified
       process/thread.

       [KEY METRICS]:
       ADDR: starting address of the memory area.
       SIZE: size of memory area (K/M/G bytes).
       ACCESS%: percentage of memory accesses are to this memory area.
       LAT(ns): the average latency (nanoseconds) of memory accesses.
       DESC: description of memory area (from /proc/<pid>/maps).

       [HOTKEY]:
       Q: Quit the application.
       H: Switch to WIN1.
       B: Back to previous window.
       R: Refresh to show the latest data.
       A: Show the memory access node distribution.
       C: Show the call-chain when process/thread accesses the memory area.

       [WIN7 - Memory access node distribution overview]:
       Get the percentage of memory accesses originated from the process/thread to each node.

       [KEY METRICS]:
       NODE: the node ID.
       ACCESS%: percentage of memory accesses are to this node.
       LAT(ns): the average latency (nanoseconds) of memory accesses to this node.

       [HOTKEY]:
       Q: Quit the application.
       H: Switch to WIN1.
       B: Back to previous window.
       R: Refresh to show the latest data.

       [WIN8 - Break down the memory area into physical memory on node]:
       Break down the memory area into the physical mapping on node with the associated accessing
       latency of a process/thread.

       [KEY METRICS]:
       NODE: the node ID.
       Other metrics remain the same.

       [HOTKEY]:
       Q: Quit the application.
       H: Switch to WIN1.
       B: Back to previous window.
       R: Refresh to show the latest data.

       [WIN9 - Call-chain when process/thread generates the event ("RMA"/"LMA"/"CYCLE"/"IR")]:
       Determine the call-chains to the code that generates "RMA"/"LMA"/"CYCLE"/"IR".

       [KEY METRICS]:
       Call-chain list: a list of call-chains.

       [HOTKEY]:
       Q: Quit the application.
       H: Switch to WIN1.
       B: Back to the previous window.
       R: Refresh to show the latest data.
       1: Locate call-chain when process/thread generates "RMA"
       2: Locate call-chain when process/thread generates "LMA"
       3: Locate call-chain when process/thread generates "CYCLE" (CPU cycle)
       4: Locate call-chain when process/thread generates "IR" (Instruction Retired)

       [WIN10 - Call-chain when process/thread access the memory area]:
       Determine  the call-chains to the code that references this memory area.  The latency must
       be greater than the predefined latency threshold (128 CPU cycles).

       [KEY METRICS]:
       Call-chain list: a list of call-chains.
       Other metrics remain the same.

       [HOTKEY]:
       Q: Quit the application.
       H: Switch to WIN1.
       B: Back to previous window.
       R: Refresh to show the latest data.

       [WIN11 - Node Overview]:
       Show the basic per-node statistics for this system

       [KEY METRICS]:
       MEM.ALL: total usable RAM (physical RAM minus a few reserved bits and  the  kernel  binary
       code).
       MEM.FREE: sum of LowFree + HighFree (overall stat) .
       CPU%: per-node CPU utilization.
       Other metrics remain the same.

       [WIN12 - Information of Node N]:
       Show the memory use and CPU utilization for the selected node.

       [KEY METRICS]:
       CPU: array of logical CPUs which belong to this node.
       CPU%: per-node CPU utilization.
       MEM  active:  the  amount  of  memory  that has been used more recently and is not usually
       reclaimed unless absolute necessary.
       MEM inactive: the amount of memory that has not been used for a while and is  eligible  to
       be swapped to disk.
       Dirty: the amount of memory waiting to be written back to the disk.
       Writeback: the amount of memory actively being written back to the disk.
       Mapped: all pages mapped into a process.

       [HOTKEY]:
       Q: Quit the application.
       H: Switch to WIN1.
       B: Back to previous window.
       R: Refresh to show the latest data.

OPTIONS

       The following options are supported by numatop:

       -s sampling_precision
       normal: balance precision and overhead (default)
       high: high sampling precision (high overhead)
       low: low sampling precision, suitable for high load system

       -l log_level
       Specifies the level of logging in the log file. Valid values are:
       1: unknown (reserved for future use)
       2: all

       -f log_file
       Specifies  the log file where output will be written. If the log file is not writable, the
       tool will prompt "Cannot open '<file name>' for writting.".

       -d dump_file
       Specifies the dump file where the screen data will be written. Generally the dump file  is
       used  for  automated  test. If the dump file is not writable, the tool will prompt "Cannot
       open <file name> for dump writing."

       -h
       Displays the command's usage.

EXAMPLES

       Example 1: Launch numatop with high sampling precision
       numatop -s high

       Example 2: Write all warning messages in /tmp/numatop.log
       numatop -l 2 -o /tmp/numatop.log

       Example 3: Dump screen data in /tmp/dump.log
       numatop -d /tmp/dump.log

EXIT STATUS

       0: successful operation.
       Other value: an error occurred.

USAGE

       You must have root privileges to run numatop.
       Or set -1 in /proc/sys/kernel/perf_event_paranoid

       Note: The perf_event_paranoid setting  has  security  implications  and  a  non-root  user
       probably  doesn't  have  authority to access /proc. It is highly recommended that the user
       runs numatop as root.

VERSION

       numatop requires a patch set to support PEBS Load Latency functionality in the kernel. The
       patch  set  has  not  been  integrated  in 3.8. Probably it will be integrated in 3.9. The
       following steps show how to get and apply the patch set.

       1. git clone git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git
       2. cd tip
       3. git checkout perf/x86
       4. build kernel as usual

       numatop supports the Intel Xeon processors: 5500-series,  6500/7500-series,  5600  series,
       E7-x8xx-series,  and  E5-16xx/24xx/26xx/46xx-series.  Note: CPU microcode version 0x618 or
       0x70c or later is required on E5-16xx/24xx/26xx/46xx-series. It also supports  IBM  Power8
       and Power9 processors.

                                          April 3, 2013                                NUMATOP(8)