
NAME

       likwid-bench - low-level benchmark suite and microbenchmarking framework

SYNOPSIS

       likwid-bench [-hap] [-t <testname>] [-s <min_time>] [-w <workgroup_expression>]
       [-W <workgroup_expression_short>] [-l <testname>] [-d <delimiter>] [-i <iterations>]
       [-f <filepath>]

DESCRIPTION

       likwid-bench is a benchmark suite for low-level (assembly) benchmarks to measure bandwidths and
       instruction throughput for specific instruction code on x86 systems. The currently included benchmark
       codes cover common data access patterns like load and store as well as calculations like vector triad
       and sum.  likwid-bench includes architecture-specific benchmarks for x86, x86_64 and x86 for Intel
       Xeon Phi coprocessors. Since LIKWID 5, ARM and POWER benchmarks are supported as well. The performance
       values can either be calculated by likwid-bench or measured with performance counters by using
       likwid-perfctr as a wrapper to likwid-bench. This requires building likwid-bench with instrumentation
       enabled in config.mk.  Benchmarks can be added dynamically by placing a proper ptt file at
       $HOME/.likwid/bench/<arch>/<testname>.ptt.  Such files are translated to a .S (assembly) file and
       compiled using either gcc, icc or pgcc (whichever is found in $PATH). The default folder for the
       generated files is /tmp/<PID>. Possible values for <arch> are 'x86', 'x86-64', 'phi', 'armv7',
       'armv8' and 'power'.
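
       A minimal sketch of adding a dynamic benchmark on an x86_64 machine (the benchmark name mycopy is a
       hypothetical example; the ptt syntax itself is documented in the likwid wiki):

       mkdir -p $HOME/.likwid/bench/x86-64
       cp mycopy.ptt $HOME/.likwid/bench/x86-64/
       likwid-bench -a                      # mycopy should now show up in the list
       likwid-bench -t mycopy -w S0:100kB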

OPTIONS

       -h     prints a help message to standard output, then exits.

       -a     list available benchmark codes for the current system.

       -p     list available thread domains.

       -s <min_time>
              Run the benchmark for at least <min_time> seconds. The number of iterations is determined
              from this value. Default: 1 second.

       -t <testname>
              Name of the benchmark code to run (mandatory).

       -w <workgroup_expression>
              Specify  the  affinity domain, thread count and data set size for the current benchmarking run (-w
              or -W mandatory). First thread in thread domain initializes the stream.

       -W <workgroup_expression_short>
              Specify the affinity domain, thread count and data set size for the current benchmarking  run  (-w
              or -W mandatory). Each thread in the workgroup initializes its own chunk of the stream.

       -l <testname>
              list properties of a benchmark code.

       -i <iterations>
              Set the number of iterations per thread (optional).

       -f <filepath>
              Path used for the dynamic generation of benchmarks. Default: /tmp/. <PID> is always
              appended.
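
       A typical exploration workflow combining these options might look as follows (triad is one of the
       benchmark codes shipped with likwid-bench; the annotations are only explanatory):

       likwid-bench -p                      # list the available thread domains, e.g. S0, S1
       likwid-bench -a                      # list the benchmark codes for this system
       likwid-bench -l triad                # show the properties of triad, e.g. its number of streams
       likwid-bench -t triad -w S0:1GB -s 2 # run triad on socket 0 for at least 2 seconds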

WORKGROUP SYNTAX

       <thread_domain>:<size>[:<num_threads>[:<chunk_size>:<stride>]][-<streamId>:<domain_id>,...] with
       <size> given in kB, MB or GB. The <thread_domain> defines where the threads are placed.  <size> is
       the total data set size for the benchmark; the allocated vectors in memory sum up to this size.
       <num_threads> specifies how many threads are used in the <thread_domain>.  Threads are always placed
       using a compact policy in likwid-bench.  This means that by default all SMT threads are used.
       Optionally, similar to the expression-based syntax of likwid-pin, a <chunk_size> and <stride> can be
       provided.  The placement of every stream (array, vector) can also be controlled.  By default all
       arrays are placed in the same <thread_domain> the threads are running in. To place the data of a
       stream in a different domain, the target domain can be specified per stream (the total number of
       streams of a benchmark case can be acquired with the -l option). Multiple streams are comma
       separated. Either no explicit placement is given, or all streams have to be placed explicitly.
       Please refer to the Wiki pages on https://github.com/RRZE-HPC/likwid/wiki/Likwid-Bench for further
       details and examples on usage.  With -W each thread initializes its own chunk of the streams, but
       explicit placement of the streams is deactivated.
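
       As an illustration, a complete workgroup expression decomposes as follows (the values are arbitrary
       and only meant to show the syntax):

       likwid-bench -t copy -w S0:1GB:4:1:2-0:S1,1:S1

       This runs 4 threads in thread domain S0 (socket 0) with a chunk size of 1 and a stride of 2, i.e.
       one thread per physical core on a system with 2 SMT threads per core, uses a total data set size of
       1GB, and places both streams of the copy benchmark (stream 0 and stream 1) in domain S1 (socket 1).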

EXAMPLE

       1.  Run the copy benchmark on socket 0 ( S0 ) with a total data set size of 100kB.

       likwid-bench -t copy -w S0:100kB

       Since no <num_threads> is given in the workgroup expression, each core of socket 0 gets one thread.
       The workload is split up among all threads and the number of iterations is determined automatically.

       2.  Run the triad benchmark code with explicitly 100 iterations per thread with 2 threads on the socket 0
           ( S0 ) and a data size of 1GB.

       likwid-bench -t triad -i 100 -w S0:1GB:2:1:2

       Assuming socket 0 ( S0 ) has 2 physical cores with SMT enabled, hence 4 hardware threads in total,
       one thread is assigned to each physical core of socket 0.

       3.  Run the update benchmark on socket 0 ( S0 ) with a workload of 100kB and on socket 1 ( S1 ) with  the
           same workload.

       likwid-bench -t update -w S0:100kB -w S1:100kB

       The results of both workgroups are combined for the output. Hence the workload in each workgroup
       expression should have the same size.

       4.  Run the update benchmark but measure the memory traffic with likwid-perfctr.  The option
           INSTRUMENT_BENCH in config.mk needs to be true at compile time to use that feature.

       likwid-perfctr -c E:S0:4 -g MEM -m likwid-bench -t update -w S0:100kB

       likwid-perfctr will configure and start the performance counters on socket 0 ( S0 ) with 4 threads
       prior to the execution of likwid-bench.  The performance counters are read right before and after
       running the benchmarking code to minimize the interference of the measurement.

       5.  Run the copy benchmark and place the data on another socket

       likwid-bench -t copy -w S0:1GB:10:1:2-0:S1,1:S1

       Stream id 0 and stream id 1 are placed in thread domain S1, which is socket 1. This can be verified
       in the output of the initialization threads, which print where they are running.

WARNING

       Since LIKWID 5.0, it is possible to have different numbers of threads in the workgroups. Different
       sizes are also allowed. Both features seem promising, but they come with a range of problems. If you
       have a NUMA system and run with multiple threads on NUMA node 0 but with fewer threads on NUMA node
       1, the threads on NUMA node 1 cause less pressure on the memory interface and consequently achieve
       higher throughput. They finish early compared to the threads on NUMA node 0. The runtime used for
       calculating the bandwidth and MFlops/s values is the maximal runtime over all threads, hence that of
       a thread on NUMA node 0.  Similar problems exist with different sizes: one workgroup might run in
       cache while the other waits for data from the memory interface.
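
       A hypothetical run that triggers the imbalance described above, assuming the system exposes the NUMA
       memory domains M0 and M1:

       likwid-bench -t load -w M0:1GB:8 -w M1:1GB:2

       The two threads in M1 see less memory contention than the eight threads in M0 and finish earlier;
       the aggregate numbers are then computed over the runtime of the slowest thread in M0.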

AUTHOR

       Written by Thomas Gruber <thomas.roehl@googlemail.com>.

BUGS

       Report Bugs on <https://github.com/RRZE-HPC/likwid/issues>.

SEE ALSO

       likwid-perfctr(1), likwid-pin(1), likwid-topology(1), likwid-setFrequencies(1)