NAME

       likwid-pin - pin a sequential or threaded application to dedicated processors

SYNOPSIS

       likwid-pin [-vhSpqim] [-V <verbosity>] [-c/-C <corelist>] [-s <skip_mask>] [-d <delim>] <command>

DESCRIPTION

       likwid-pin is a command line application to pin a sequential or multithreaded
       application to dedicated processors. It can be used as a replacement for taskset.
       In contrast to taskset, it takes a list of single processors rather than an affinity
       mask. For multithreaded applications based on the pthread library, the pthread_create
       library call is overloaded through LD_PRELOAD and each created thread is pinned to a
       dedicated processor as specified in <corelist>.
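
       As an illustration (./myApp stands for any hypothetical multithreaded binary),
       compare the following two invocations:

       taskset -c 0,2 ./myApp
       likwid-pin -c 0,2 ./myApp

       taskset lets the process and all of its threads float between CPUs 0 and 2, whereas
       likwid-pin fixes the parent process on CPU 0 and the first created thread on CPU 2.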

       By default, every generated thread is pinned to a processor in the order of calls to
       pthread_create. With the -s option it is possible to skip single threads (see the
       examples at the end of the OPTIONS section).

       The OpenMP implementations of the GCC and ICC compilers are explicitly supported.
       Clang's OpenMP backend should also work, as it is built on top of Intel's OpenMP
       runtime library. Other runtimes may work as well. likwid-pin sets the environment
       variable OMP_NUM_THREADS for you if it is not already present, using as many threads
       as are present in the pin expression. Be aware that with pthreads the parent thread
       is always pinned: if you create, for example, 4 threads with pthread_create and do
       not use the parent process as a worker, you still have to provide num_threads+1
       processor ids.
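
       For example, for a hypothetical OpenMP binary ./myOmpApp, the following invocation
       pins four threads to the first four processors and, if the variable is not already
       set, exports OMP_NUM_THREADS=4:

       likwid-pin -c 0-3 ./myOmpApp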

       likwid-pin  supports  different  numberings  for  pinning.  See section CPU EXPRESSION for
       details.

       For applications where the first touch policy on NUMA systems cannot be employed,
       likwid-pin can be used to turn on interleaved memory placement. This can
       significantly improve the performance of memory-bound multithreaded codes. All NUMA
       nodes the user pinned threads to are used for interleaving.
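
       For instance, assuming CPUs 0-7 span two NUMA domains on a hypothetical machine, the
       following invocation pins eight threads and interleaves memory placement across both
       domains:

       likwid-pin -i -c 0-7 ./myApp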

OPTIONS

       -h,--help
              prints a help message to standard output, then exits.

       -v,--version
              prints version information to standard output, then exits.

       -V, --verbose <level>
              verbose output during execution for debugging. 0 for errors only, 1 for
              informational output, 2 for detailed output and 3 for developer output.

       -c,-C <cpu expression>
              specify a numerical list of processors. The list may contain multiple items,
              separated by comma, and ranges. For example 0,3,9-11. Other formats are
              available, see the CPU EXPRESSION section.

       -s, --skip <skip_mask>
              Specify skip mask as HEX number. For each set bit the corresponding thread is
              skipped (see the examples at the end of this section).

       -S,--sweep
              All ccNUMA memory domains belonging to the specified thread list will be
              cleaned before the run. This can solve file buffer cache problems on Linux.

       -p     prints the available thread domains for logical pinning

       -i     set the NUMA memory policy to interleave across all NUMA nodes involved in pinning

       -m     set the NUMA memory policy to membind across all NUMA nodes involved in pinning

       -d <delim>
              usable with -p to specify the CPU delimiter in the cpulist

       -q,--quiet
              silent execution without output
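
       As an illustration of combining these options (./myApp is again a hypothetical
       binary), the first invocation below prints the available thread domains, and the
       second sweeps the involved ccNUMA domains before the run and skips the first created
       thread via the skip mask, as required e.g. by the Intel OpenMP runtime 11.0/11.1
       (see IMPORTANT NOTICE):

       likwid-pin -p
       likwid-pin -S -s 0x1 -c 0-4 ./myApp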

CPU EXPRESSION

       1.  The most intuitive CPU selection method is a comma-separated list of  hardware  thread
           IDs.  An example for this is 0,2 which schedules the threads on hardware threads 0 and
           2.  The physical numbering also allows the usage of ranges like 0-2 which  results  in
           the list 0,1,2.

       2.  The CPUs can be selected by their indices inside of an affinity domain. The
           affinity domain is optional and if not given, Likwid assumes the domain 'N' for
           the whole node. The format is L:<indexlist> for selecting the CPUs inside of
           domain 'N' or L:<domain>:<indexlist> for selecting the CPUs inside the given
           domain. Assume a virtual affinity domain 'P' that contains the CPUs
           0,4,1,5,2,6,3,7. After sorting it to have physical hardware threads first, we
           get 0,1,2,3,4,5,6,7. The logical numbering L:P:0-2 results in the selection
           0,1,2 from this physical-hardware-threads-first list.

       3.  The expression syntax enables the selection according to a selection function
           with variable input parameters. The format is either
           E:<affinity domain>:<numberOfThreads> to use the first <numberOfThreads> threads
           in affinity domain <affinity domain>, or
           E:<affinity domain>:<numberOfThreads>:<chunksize>:<stride> to use
           <numberOfThreads> threads, selecting <chunksize> threads in a row with a stride
           of <stride> threads between the beginnings of consecutive chunks in affinity
           domain <affinity domain>. Examples are E:N:4:1:2 for selecting the first four
           physical CPUs on a system with 2 hardware threads per CPU core, or E:P:4:2:4 for
           choosing the first two threads in affinity domain P, skipping the next two and
           selecting again two threads. The resulting CPU list for virtual affinity domain
           P is 0,4,2,6 (see the example invocations after this list).

       4.  The last format schedules the threads not only in a single affinity domain but
           distributes them evenly over all available affinity domains of the same kind. In
           contrast to the other formats, the selection is done using the physical hardware
           threads first and then the virtual hardware threads (aka SMT threads). The
           format is <affinity domain without number>:scatter, like M:scatter to schedule
           the threads evenly in all available memory affinity domains. Assuming the two
           socket domains S0 = 0,4,1,5 and S1 = 2,6,3,7, the expression S:scatter results
           in the CPU list 0,2,1,3,4,6,5,7.
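
       As concrete invocations (using the virtual affinity domain 'P' from above and a
       hypothetical binary ./myApp), the logical and expression formats are passed to -c
       like any other CPU list:

       likwid-pin -c L:P:0-2 ./myApp
       likwid-pin -c E:P:4:2:4 ./myApp

       The first command selects the CPUs 0,1,2 and the second the CPUs 0,4,2,6, as derived
       above.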

EXAMPLE

       1.   For a standard pthread application:

       likwid-pin -c 0,2,4-6 ./myApp

       The parent process is pinned to processor 0 which is likely to be  thread  0  in  ./myApp.
       Thread  1  is  pinned to processor 2, thread 2 to processor 4, thread 3 to processor 5 and
       thread 4 to processor 6. If more threads are created than specified in the processor list,
       these threads are pinned to processor 0 as fallback.

       2.   For  selection  of  CPUs  inside  of  a CPUset only the logical numbering is allowed.
            Assuming CPUset 0,4,1,5:

       likwid-pin -c L:1,3 ./myApp

       This command pins ./myApp on CPU 4 and the thread started by ./myApp on CPU 5.

       3.   A common use-case for the numbering by expression is pinning of an application on the
            Intel Xeon Phi coprocessor with its 60 cores each having 4 SMT threads.

       likwid-pin -c E:N:60:1:4 ./myApp

       This command schedules one application thread per physical CPU core for ./myApp.
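
       4.   For distributing the threads of a hypothetical ./myApp evenly across all socket
            domains:

       likwid-pin -c S:scatter ./myApp

       With the two socket domains from the CPU EXPRESSION section, S0 = 0,4,1,5 and
       S1 = 2,6,3,7, this results in the CPU list 0,2,1,3,4,6,5,7.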

IMPORTANT NOTICE

       The detection of shepherd threads works for the Intel/LLVM OpenMP runtime (>=12.0),
       for GCC's OpenMP runtime as well as for PGI's OpenMP runtime. If you encounter
       problems with pinning, please set a proper skip mask to skip the not-detected
       shepherd threads. The Intel OpenMP runtime 11.0/11.1 requires a skip mask of 0x1.

AUTHOR

       Written by Thomas Gruber <thomas.roehl@googlemail.com>.

BUGS

       Report bugs at <https://github.com/RRZE-HPC/likwid/issues>.

SEE ALSO

       taskset(1), likwid-perfctr(1), likwid-features(1), likwid-topology(1)