Provided by: gromacs-data_2023.1-2ubuntu1_all

NAME

       gmx-nonbonded-benchmark - Benchmarking tool for the non-bonded pair kernels.

SYNOPSIS

          gmx nonbonded-benchmark [-o [<.csv>]] [-size <int>] [-nt <int>]
                       [-simd <enum>] [-coulomb <enum>] [-[no]table]
                       [-combrule <enum>] [-[no]halflj] [-[no]energy]
                       [-[no]all] [-cutoff <real>] [-iter <int>]
                       [-warmup <int>] [-[no]cycles] [-[no]time]

DESCRIPTION

       gmx  nonbonded-benchmark  runs  benchmarks for one or more so-called Nbnxm non-bonded pair
       kernels. The non-bonded pair kernels are the most compute-intensive part of MD simulations
       and  usually  comprise  60  to 90 percent of the runtime.  For this reason they are highly
       optimized and several  different  setups  are  available  to  compute  the  same  physical
       interactions.    In   addition,   there  are  different  physical  treatments  of  Coulomb
       interactions and optimizations for atoms without  Lennard-Jones  interactions.  There  are
       also different physical treatments of Lennard-Jones interactions, but only a plain cut-off
       is supported in this tool, as that is by far the  most  common  treatment.   And  finally,
       while  force  output is always necessary, energy output is only required at certain steps.
       In total there are 12 relevant combinations of options. The combinations double to 24 when
       two  different  SIMD  setups  are  supported.  These combinations can be run with a single
       invocation using the -all option. The performance of each kernel is affected by caching
       behavior, which is determined by the hardware used together with the system size and the
       cut-off radius. The larger the number of atoms per thread, the more L1 cache is needed  to
       avoid L1 cache misses.  The cut-off radius mainly affects the data reuse: a larger cut-off
       results in more data reuse and makes the kernel less sensitive to cache misses.
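
       For example, all supported kernel combinations can be benchmarked with a single command
       such as the following (a minimal sketch; the thread count is illustrative and assumes the
       gmx binary is on the path):

          # run every combination of coulomb, halflj and combrule on 4 threads
          gmx nonbonded-benchmark -all -nt 4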

       OpenMP parallelization is used to exploit multiple hardware threads within a compute node.
       In these benchmarks there is no explicit interaction between threads, apart from starting
       and closing a single OpenMP parallel region per iteration; threads do interact implicitly
       by sharing and evicting data in shared caches. The number of threads to use is set
       with the -nt option.  Thread affinity is important, especially with SMT and shared caches.
       Affinities  can  be set through the OpenMP library using the GOMP_CPU_AFFINITY environment
       variable.
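
       For example, with the GNU OpenMP library, threads could be pinned to specific cores like
       this (the core numbering is hardware-specific and shown only for illustration):

          # pin 4 OpenMP threads to the first four cores
          GOMP_CPU_AFFINITY='0 1 2 3' gmx nonbonded-benchmark -nt 4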

       The benchmark tool times one or more kernels by running them repeatedly for  a  number  of
       iterations  set  by  the  -iter option. An initial kernel call is done to avoid additional
       initial cache misses. Times are recorded in cycles read from efficient, high-accuracy
       counters in the CPU. Note that these often do not correspond to actual clock cycles. For
       each kernel, the tool reports the total number of cycles, cycles per iteration, and (total
       and  useful)  pair interactions per cycle.  Because a cluster pair list is used instead of
       an atom pair list, interactions are also computed for some atom pairs that are beyond  the
       cut-off distance. These pairs are not useful (except for additional buffering, but that is
       not of interest here); they are merely a side effect of the cluster-pair setup. The SIMD
       2xMM kernel has a higher useful pair ratio than the 4xM kernel due to its smaller cluster
       size, but a
       lower total pair throughput.  It is best to run this, or for that  matter  any,  benchmark
       with  locked  CPU  clocks,  as thermal throttling can significantly affect performance. If
       that is not an option, the -warmup option can be used to run initial,  untimed  iterations
       to warm up the processor.
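
       For example, on a machine where the CPU clocks cannot be locked, untimed warm-up
       iterations can be run before the timed ones (the values are illustrative):

          # 50 untimed warm-up iterations, then 200 timed iterations per kernel
          gmx nonbonded-benchmark -iter 200 -warmup 50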

       The most relevant regime is between 0.1 and 1 millisecond per iteration. Thus it is useful
       to run with system sizes that cover both ends of this regime.
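
       For example, both ends of this regime could be covered with a small and a large system
       (the multipliers are illustrative; the resulting timings depend on the hardware):

          # 3000 atoms and 90000 atoms, respectively
          gmx nonbonded-benchmark -size 1
          gmx nonbonded-benchmark -size 30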

       The -simd and -table options select different implementations to compute the same physics.
       The  choice  of  these  options  should  ideally  be  optimized  for  the target hardware.
       Historically, we only found tabulated Ewald correction to be  useful  on  2-wide  SIMD  or
       4-wide SIMD without FMA support. As all modern architectures are wider and support FMA, we
       do not use tables by default. The only exceptions are kernels  without  SIMD,  which  only
       support  tables.   Options  -coulomb,  -combrule and -halflj depend on the force field and
       composition  of  the  simulated  system.   The  optimization  of  computing  Lennard-Jones
       interactions  for  only half of the atoms in a cluster is useful for water, which does not
       use Lennard-Jones on hydrogen atoms in most water models.  In the MD engine, any  clusters
       where  at  most half of the atoms have LJ interactions will automatically use this kernel.
       And finally, the -energy option selects the computation of  energies,  which  are  usually
       only needed infrequently.
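
       For example, a water-like system could be benchmarked with reaction-field electrostatics,
       the half-LJ optimization, and energy output enabled (a sketch; the appropriate flags
       depend on the force field and the composition of the simulated system):

          gmx nonbonded-benchmark -coulomb reaction-field -halflj -energy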

OPTIONS

       Options to specify output files:

       -o [<.csv>] (nonbonded-benchmark.csv) (Optional)
              Also output results in CSV format (see the example below)

       Other options:

       -size <int> (1)
              The system size is 3000 atoms times this value

       -nt <int> (1)
              The number of OpenMP threads to use

       -simd <enum> (auto)
              SIMD  type,  auto  runs  all  supported  SIMD  setups  or  no SIMD when SIMD is not
              supported: auto, no, 4xm, 2xmm

       -coulomb <enum> (ewald)
              The functional form for the Coulomb interactions: ewald, reaction-field

       -[no]table (no)
              Use lookup table for Ewald correction instead of analytical

       -combrule <enum> (geometric)
              The LJ combination rule: geometric, lb, none

       -[no]halflj (no)
              Use optimization for LJ on half of the atoms

       -[no]energy (no)
              Compute energies in addition to forces

       -[no]all (no)
              Run all 12 combinations of options for coulomb, halflj, combrule

       -cutoff <real> (1)
              Pair-list and interaction cut-off distance, in nm

       -iter <int> (100)
              The number of iterations for each kernel

       -warmup <int> (0)
              The number of iterations for initial warmup

       -[no]cycles (no)
              Report cycles/pair instead of pairs/cycle

       -[no]time (no)
              Report microseconds instead of cycles
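
       As a combined illustration of the output options above, the following writes the results
       to a CSV file and reports times in microseconds (the file name is illustrative):

          gmx nonbonded-benchmark -o results.csv -time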

SEE ALSO

       gmx(1)

       More information about GROMACS is available at <http://www.gromacs.org/>.

COPYRIGHT

       2023, GROMACS development team