Provided by: linux-tools-common_6.11.0-13.14_all

NAME

       perf-amd-ibs - Support for AMD Instruction-Based Sampling (IBS) with perf tool

SYNOPSIS

       perf record -e ibs_op//
       perf record -e ibs_fetch//

DESCRIPTION

       Instruction-Based Sampling (IBS) provides precise Instruction Pointer (IP) profiling
       support on AMD platforms. IBS has two independent components: IBS Op and IBS Fetch. IBS Op
       sampling provides information about instruction execution (micro-op execution to be
       precise) with details like d-cache hit/miss, d-TLB hit/miss, cache miss latency,
       load/store data source, branch behavior etc. IBS Fetch sampling provides information about
       instruction fetch with details like i-cache hit/miss, i-TLB hit/miss, fetch latency etc.
       IBS is per SMT thread, i.e. each SMT hardware thread has its own standalone IBS units.

       Both IBS Op and IBS Fetch are exposed as PMUs by Linux and can be used through the
       Linux perf utility. The following directories are created at boot time if IBS is
       supported by the hardware and kernel.

           /sys/bus/event_source/devices/ibs_op/
           /sys/bus/event_source/devices/ibs_fetch/
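
       A minimal sketch for checking whether these PMUs are registered on the running
       system. The function name ibs_pmus_present and its optional directory argument are
       illustrative and not part of perf:

```shell
# Report whether the IBS PMU directories exist under an event_source directory.
# The directory argument is a hypothetical convenience; it defaults to the sysfs path
# named in this page.
ibs_pmus_present() {
    base="${1:-/sys/bus/event_source/devices}"
    for pmu in ibs_op ibs_fetch; do
        if [ -d "$base/$pmu" ]; then
            echo "$pmu: present"
        else
            echo "$pmu: missing"
        fi
    done
}

ibs_pmus_present
```

       If either PMU is reported as missing, the corresponding -e ibs_op// or
       -e ibs_fetch// event cannot be opened on that system.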

       IBS Op PMU supports two events: cycles and micro ops. IBS Fetch PMU supports one event:
       fetch ops.

       IBS PMUs do not have user/kernel filtering capability and thus require CAP_SYS_ADMIN
       or CAP_PERFMON privilege.
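
       As a rough sketch, the effective capability mask of the current shell can be
       inspected before attempting an IBS profile. The helper name has_ibs_caps is
       illustrative; the bit positions used (CAP_SYS_ADMIN = 21, CAP_PERFMON = 38) are the
       values defined in the Linux capability headers. This checks capability bits only and
       does not model any other kernel policy:

```shell
# Succeed if a CapEff hex mask (as printed in /proc/<pid>/status) contains
# CAP_SYS_ADMIN (bit 21) or CAP_PERFMON (bit 38).
has_ibs_caps() {
    mask=$(( 0x$1 ))
    [ $(( (mask >> 21) & 1 )) -eq 1 ] || [ $(( (mask >> 38) & 1 )) -eq 1 ]
}

# Read the effective capability mask; fall back to 0 if it cannot be read.
cap_eff=$(awk '/^CapEff:/ {print $2}' /proc/self/status 2>/dev/null)
cap_eff=${cap_eff:-0}

if has_ibs_caps "$cap_eff"; then
    echo "capabilities look sufficient for IBS"
else
    echo "need CAP_SYS_ADMIN or CAP_PERFMON"
fi
```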

IBS VS. REGULAR CORE PMU

       IBS gives samples with precise IP, i.e. the IP recorded with an IBS sample has no skid,
       whereas the IP recorded by the regular core PMU will have some skid (the sample was
       generated at IP X but perf records it at IP X+n). Hence, the regular core PMU might not
       help for profiling with instruction-level precision. Further, IBS provides additional
       information about the sample in question. On the other hand, the regular core PMU has
       its own advantages, such as a plethora of events, counting mode (less interference), up
       to 6 parallel counters, event grouping support, filtering capabilities etc.

       Three regular core PMU events are internally forwarded to the IBS Op PMU when the
       precise_ip attribute is set:

           -e cpu-cycles:p becomes -e ibs_op//
           -e r076:p becomes -e ibs_op//
           -e r0C1:p becomes -e ibs_op/cnt_ctl=1/
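
       The mapping above can be written down as a small lookup, for example to annotate
       scripts that still use the core PMU spellings. The function name ibs_equivalent is
       hypothetical; the forwarding itself happens inside the kernel, and this sketch only
       mirrors the three cases listed above:

```shell
# Print the IBS Op event that a precise core PMU event is forwarded to,
# or echo the input unchanged when it is not one of the three listed events.
ibs_equivalent() {
    case "$1" in
        cpu-cycles:p|r076:p) echo "ibs_op//" ;;
        r0C1:p)              echo "ibs_op/cnt_ctl=1/" ;;
        *)                   echo "$1" ;;
    esac
}

ibs_equivalent cpu-cycles:p
ibs_equivalent r0C1:p
```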

EXAMPLES

   IBS Op PMU
       System-wide profile, cycles event, sampling period: 100000

           # perf record -e ibs_op// -c 100000 -a

       Per-cpu profile (cpu10), cycles event, sampling period: 100000

           # perf record -e ibs_op// -c 100000 -C 10

       Per-cpu profile (cpu10), cycles event, sampling freq: 1000

           # perf record -e ibs_op// -F 1000 -C 10

       System-wide profile, uOps event, sampling period: 100000

           # perf record -e ibs_op/cnt_ctl=1/ -c 100000 -a

       Same command, but also capture IBS register raw dump along with perf sample:

           # perf record -e ibs_op/cnt_ctl=1/ -c 100000 -a --raw-samples

       System-wide profile, uOps event, sampling period: 100000, L3MissOnly (Zen4 onward)

           # perf record -e ibs_op/cnt_ctl=1,l3missonly=1/ -c 100000 -a

       Per process, existing PID (upstream v6.2 onward), uOps event, sampling period: 100000

           # perf record -e ibs_op/cnt_ctl=1/ -c 100000 -p 1234

       Per process, launched via perf (upstream v6.2 onward), uOps event, sampling period: 100000

           # perf record -e ibs_op/cnt_ctl=1/ -c 100000 -- ls

       To analyse recorded profile in aggregate mode

           # perf report
           /* Select a line and press 'a' to drill down at instruction level. */

       To go over each sample

           # perf script

       Raw dump of IBS registers when profiled with --raw-samples

           # perf report -D
           /* Look for PERF_RECORD_SAMPLE */

           Example register raw dump:

           ibs_op_ctl:     000002c30006186a MaxCnt    100000 L3MissOnly 0 En 1
                   Val 1 CntCtl 0=cycles CurCnt       707
           IbsOpRip:       ffffffff8204aea7
           ibs_op_data:    0000010002550001 CompToRetCtr     1 TagToRetCtr   597
                   BrnRet 0  RipInvalid 0 BrnFuse 0 Microcode 1
           ibs_op_data2:   0000000000000013 RmtNode 1 DataSrc 3=DRAM
           ibs_op_data3:   0000000031960092 LdOp 0 StOp 1 DcL1TlbMiss 0
                   DcL2TlbMiss 0 DcL1TlbHit2M 1 DcL1TlbHit1G 0 DcL2TlbHit2M 0
                   DcMiss 1 DcMisAcc 0 DcWcMemAcc 0 DcUcMemAcc 0 DcLockedOp 0
                   DcMissNoMabAlloc 0 DcLinAddrValid 1 DcPhyAddrValid 1
                   DcL2TlbHit1G 0 L2Miss 1 SwPf 0 OpMemWidth 32 bytes
                   OpDcMissOpenMemReqs 12 DcMissLat     0 TlbRefillLat     0
           IbsDCLinAd:     ff110008a5398920
           IbsDCPhysAd:    00000008a5398920
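
       To show how the decoded fields relate to the raw value, the ibs_op_ctl line of the
       dump above can be unpacked by hand. This is a sketch that assumes the following bit
       layout, which should be checked against the AMD documentation for the target
       processor: MaxCnt in bits 15:0 (in units of 16) with extension bits 26:20,
       L3MissOnly in bit 16, En in bit 17, Val in bit 18, CntCtl in bit 19 and CurCnt in
       bits 58:32. The function name decode_ibs_op_ctl is illustrative:

```shell
# Decode a raw ibs_op_ctl value given as hex digits without the 0x prefix.
# The bit layout is an assumption stated in the text above.
decode_ibs_op_ctl() {
    reg=$(( 0x$1 ))
    # The low 16 bits hold MaxCnt[19:4]; bits 26:20 hold MaxCnt[26:20].
    max_cnt=$(( ((reg & 0xffff) << 4) | (((reg >> 20) & 0x7f) << 20) ))
    l3_miss_only=$(( (reg >> 16) & 1 ))
    en=$(( (reg >> 17) & 1 ))
    val=$(( (reg >> 18) & 1 ))
    cnt_ctl=$(( (reg >> 19) & 1 ))
    cur_cnt=$(( (reg >> 32) & 0x7ffffff ))
    echo "MaxCnt $max_cnt L3MissOnly $l3_miss_only En $en Val $val CntCtl $cnt_ctl CurCnt $cur_cnt"
}

decode_ibs_op_ctl 000002c30006186a
# prints: MaxCnt 100000 L3MissOnly 0 En 1 Val 1 CntCtl 0 CurCnt 707
```

       The decoded MaxCnt matches the -c 100000 sampling period, and CntCtl 0 corresponds
       to the cycles event.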

       IBS applied in a real-world use case

           A ~90% regression was observed in tbench with a specific scheduler
           hint, which was counterintuitive. IBS profiles of the good and bad
           runs, captured using perf, helped identify the exact cause of the problem:

           https://lore.kernel.org/r/20220921063638.2489-1-kprateek.nayak@amd.com

   IBS Fetch PMU
       Similar commands can be used with Fetch PMU as well.

       System-wide profile, fetch ops event, sampling period: 100000

           # perf record -e ibs_fetch// -c 100000 -a

       System-wide profile, fetch ops event, sampling period: 100000, Random enable

           # perf record -e ibs_fetch/rand_en=1/ -c 100000 -a

           Random enable adds a small degree of variability to the sample period.
           This helps in cases like long-running loops, where the PMU keeps tagging
           the same instruction over and over because of a fixed sample period.

       etc.

PERF MEM AND PERF C2C

       perf mem is a memory access profiler tool and perf c2c is a shared data cacheline analyser
       tool. Both of them internally use the IBS Op PMU on AMD. Below is a simple example of the
       perf mem tool.

           # perf mem record -c 100000 -- make
           # perf mem report

       A normal perf mem report output provides a detailed memory access profile. However, it
       can also be aggregated based on output fields. For example:

           # perf mem report -F mem,sample,snoop
           Samples: 3M of event 'ibs_op//', Event count (approx.): 23524876
           Memory access                                 Samples  Snoop
           N/A                                           1903343  N/A
           L1 hit                                        1056754  N/A
           L2 hit                                          75231  N/A
           L3 hit                                           9496  HitM
           L3 hit                                           2270  N/A
           RAM hit                                          8710  N/A
           Remote node, same socket RAM hit                 3241  N/A
           Remote core, same node Any cache hit             1572  HitM
           Remote core, same node Any cache hit              514  N/A
           Remote node, same socket Any cache hit           1216  HitM
           Remote node, same socket Any cache hit            350  N/A
           Uncached hit                                       18  N/A

       Please refer to their man pages for more detail.
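
       The aggregated perf mem report shown earlier in this section still splits each
       memory level by snoop result. As a sketch, the rows can be merged per memory level
       with awk. The function name sum_mem_levels is hypothetical, and the parsing assumes
       the three-column text layout shown above (label, sample count, snoop value):

```shell
# Sum sample counts per "Memory access" label, merging rows that differ only in Snoop.
# Rows whose second-to-last field is not a number (headers, blank lines) are skipped.
sum_mem_levels() {
    awk 'NF >= 3 && $(NF-1) ~ /^[0-9]+$/ {
        label = $1
        for (i = 2; i <= NF - 2; i++) label = label " " $i
        total[label] += $(NF-1)
    }
    END { for (l in total) printf "%s: %d\n", l, total[l] }' | sort
}

# A few rows copied from the report above, used as sample input.
sum_mem_levels <<'EOF'
Memory access                                 Samples  Snoop
L3 hit                                           9496  HitM
L3 hit                                           2270  N/A
RAM hit                                          8710  N/A
EOF
```

       On a live system the same function could be fed from the text output of
       perf mem report -F mem,sample,snoop instead of the here-document.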

SEE ALSO

       perf-record(1), perf-script(1), perf-report(1), perf-mem(1), perf-c2c(1)