Provided by: lmbench_3.0-a9+debian.1-2_amd64

NAME

       par_mem - memory parallelism benchmark

SYNOPSIS

       par_mem [ -L <line size> ] [ -M <len> ] [ -W <warmups> ] [ -N <repetitions> ]

DESCRIPTION

       par_mem  measures  the available parallelism in the memory hierarchy, up to len bytes.  Modern processors
       can often service multiple memory requests in parallel, while older processors typically blocked on  LOAD
       instructions  and  had no available parallelism (other than that provided by cache prefetching).  par_mem
       measures the available parallelism at a variety of points, since the available  parallelism  is  often  a
       function of the data location in the memory hierarchy.

       To measure the available parallelism, par_mem conducts one experiment per level of parallelism at
       each memory size. It builds a pointer chain of the desired length, then creates an array of pointers
       to chain entries that are evenly spaced across the chain, and runs those pointers forward through
       the chain in parallel. This yields the average memory latency for each level of parallelism. The
       available parallelism is the average memory latency for parallelism 1 divided by the minimum average
       memory latency across all levels of parallelism.
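
       The chain setup described above can be sketched as follows. This is a minimal illustration, not
       the par_mem source: the helper names build_chain, spread_pointers and chase are hypothetical, and
       the chain is a simple sequential ring, whereas a real benchmark would presumably lay the chain out
       so that hardware prefetching does not hide the latency.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: link n slots into one circular pointer chain.
 * Slot i holds the address of slot (i + 1) % n. This sequential layout
 * is an assumption made to keep the sketch short. */
char **build_chain(size_t n)
{
    assert(n > 0);
    char **chain = malloc(n * sizeof *chain);
    assert(chain != NULL);
    for (size_t i = 0; i < n; ++i)
        chain[i] = (char *)&chain[(i + 1) % n];
    return chain;
}

/* Hypothetical helper: place k starting pointers evenly spaced along
 * the chain, n / k slots apart. */
void spread_pointers(char **chain, size_t n, char **p[], size_t k)
{
    assert(k > 0 && k <= n);
    for (size_t j = 0; j < k; ++j)
        p[j] = &chain[j * (n / k)];
}

/* Hypothetical helper: advance all k pointers by `steps` links. The k
 * loads inside one step do not depend on each other, so hardware that
 * supports it may overlap them. */
void chase(char **p[], size_t k, size_t steps)
{
    for (size_t s = 0; s < steps; ++s)
        for (size_t j = 0; j < k; ++j)
            p[j] = (char **)*p[j];
}
```

       With n = 8 and k = 2, the two pointers start at slots 0 and 4; after three steps they sit at
       slots 3 and 7. Timing chase for k = 1, 2, 3, ... would give the per-level latencies.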

       For example, the inner loop which measures parallelism 2 would look something like:

       for (i = 0; i < N; ++i) {
            p0 = (char **)*p0;
            p1 = (char **)*p1;
       }

       The overhead of the for loop is not significant, because the loop body is unrolled to 100 loads. In
       this case, if the hardware can process two LOAD operations in parallel, the overall latency of the
       loop should be equivalent to that of a single pointer chain, so the measured parallelism would be
       roughly two. If, however, the hardware can process only a single LOAD operation at a time, or if
       there is significant resource contention between the two LOAD operations, the loop will be much
       slower than a loop with a single pointer chain, so the measured parallelism will be less than two,
       though probably no smaller than one.
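
       The arithmetic behind these figures can be sketched as below. The function name
       available_parallelism and the lat array are hypothetical; the sketch assumes that lat[i] holds the
       measured average latency per load with i + 1 chains running, and that the reported value is the
       parallelism-1 latency divided by the smallest per-load latency seen at any level.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: lat[i] is the average latency per load (in any
 * consistent unit) measured with i + 1 pointer chains in flight.
 * Returns lat[0] divided by the smallest latency over all levels. */
double available_parallelism(const double *lat, size_t levels)
{
    assert(levels > 0 && lat[0] > 0.0);
    double best = lat[0];
    for (size_t i = 1; i < levels; ++i) {
        /* Ignore non-positive values, which would indicate a failed
         * measurement rather than a real latency. */
        if (lat[i] > 0.0 && lat[i] < best)
            best = lat[i];
    }
    return lat[0] / best;
}
```

       For instance, if one chain averages 100 ns per load and two chains average 50 ns per load, the
       result is 2.0. If two chains still average 100 ns per load, the result is 1.0, and an average of
       80 ns per load gives 1.25, the contended case that falls between one and two.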

OUTPUT

       The output format is intended as input to xgraph or a similar program (we use a Perl script that
       produces pic input). One data set is produced for each stride. The data set title is the stride
       size, and each data point gives the array size in megabytes (a floating-point value) and the load
       latency over all points in that array.

SEE ALSO

       lmbench(8), line(8), cache(8), tlb(8), par_ops(8).

AUTHOR

       Carl Staelin and Larry McVoy

       Comments, suggestions, and bug reports are always welcome.

(c)2000 Carl Staelin and Larry McVoy                 $Date$                                           PAR_MEM(8)