Provided by: lmbench_3.0-a9+debian.1-6_amd64


       par_mem - memory parallelism benchmark


       par_mem [ -L <line size> ] [ -M <len> ] [ -W <warmups> ] [ -N <repetitions> ]


       par_mem  measures  the  available  parallelism  in  the memory hierarchy, up to len bytes.
       Modern processors can often service multiple memory  requests  in  parallel,  while  older
       processors  typically blocked on LOAD instructions and had no available parallelism (other
       than that provided by cache prefetching).  par_mem measures the available parallelism at a
       variety  of  points,  since  the  available  parallelism  is  often a function of the data
       location in the memory hierarchy.

       To measure the available parallelism, par_mem runs a series of experiments at each
       memory size, one for each level of parallelism.  It builds a pointer chain of the
       desired length, then creates an array of pointers to chain entries that are evenly
       spaced across the chain, and runs those pointers forward through the chain in
       parallel.  From this it measures the average memory latency for each level of
       parallelism.  The available parallelism is the minimum average memory latency for
       parallelism 1 divided by the average memory latency across all levels of available
       parallelism.
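
       The chain construction described above can be sketched roughly as follows.  This is
       a minimal illustration, not the par_mem source: the names build_chain, spaced_starts
       and chase2 are hypothetical, and the chain is linked in simple sequential order,
       whereas a real benchmark would likely scatter the entries to defeat prefetching.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: link n slots into one circular chain, where
 * slot i holds the address of slot (i + 1) % n.  Sequential order is
 * an assumption made to keep the sketch short. */
char **build_chain(size_t n)
{
    char **chain = malloc(n * sizeof *chain);
    if (chain == NULL)
        return NULL;
    for (size_t i = 0; i < n; ++i)
        chain[i] = (char *)&chain[(i + 1) % n];
    return chain;
}

/* Hypothetical helper: fill starts[0..k-1] with pointers to chain
 * entries that are evenly spaced (n / k slots apart) along the chain. */
void spaced_starts(char **chain, size_t n, char ***starts, size_t k)
{
    assert(k > 0 && k <= n);
    for (size_t j = 0; j < k; ++j)
        starts[j] = &chain[j * (n / k)];
}

/* Advance two pointers through the chain together.  The two loads in
 * the loop body do not depend on each other, so hardware that can
 * overlap memory requests may service them in parallel. */
void chase2(char ***p0, char ***p1, size_t steps)
{
    char **a = *p0;
    char **b = *p1;
    for (size_t i = 0; i < steps; ++i) {
        a = (char **)*a;
        b = (char **)*b;
    }
    *p0 = a;
    *p1 = b;
}
```

       With an 8-entry chain and two start pointers, the pointers begin at entries 0 and 4
       and never depend on each other's loads, which is what lets the two chases overlap.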

       For example, the inner loop which measures parallelism 2 would look something like:

       for (i = 0; i < N; ++i) {
               p0 = (char **)*p0;
               p1 = (char **)*p1;
       }

       The overhead of the for loop is not significant, because the loop body is unrolled
       to 100 loads.  In this case, if the hardware can process two LOAD operations in
       parallel, then the overall latency of the loop should be equivalent to that of a
       single pointer chain, so the measured parallelism would be roughly two.  If, however,
       the hardware can only process a single LOAD operation at a time, or if there is
       significant resource contention between the two LOAD operations, then the loop will
       be much slower than a loop with a single pointer chain, so the measured parallelism
       will be less than two, though probably no smaller than one.
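
       The arithmetic behind that estimate can be sketched as below.  This is one reading
       of the description, not the par_mem source: the function name estimate_parallelism
       is hypothetical, per_load_ns[k-1] is assumed to hold the average time per load
       measured with k chains in flight, and the result is assumed to be the best ratio of
       the single-chain latency to the latency at any level, with a floor of one.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: per_load_ns[k-1] is the average time per load
 * (in nanoseconds) observed with k independent chains in flight.
 * Returns the largest ratio of the single-chain latency to the
 * latency at any level; the floor of 1.0 is an assumption. */
double estimate_parallelism(const double *per_load_ns, size_t levels)
{
    assert(levels > 0 && per_load_ns[0] > 0.0);
    double best = 1.0;
    for (size_t k = 1; k < levels; ++k) {
        if (per_load_ns[k] <= 0.0)
            continue;               /* skip unusable measurements */
        double ratio = per_load_ns[0] / per_load_ns[k];
        if (ratio > best)
            best = ratio;
    }
    return best;
}
```

       For instance, if a single chain costs 100 ns per load and two chains together cost
       50 ns per load, the ratio is two, matching the case where both LOAD operations are
       fully overlapped.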


       The output format is intended as input to xgraph or a similar program (we use a perl
       script that produces pic input).  One set of data is produced for each stride.  The
       data set title is the stride size, and the data points are the array size in
       megabytes (a floating point value) and the load latency over all points in that
       array.


       lmbench(8), line(8), cache(8), tlb(8), par_ops(8).


       Carl Staelin and Larry McVoy

       Comments, suggestions, and bug reports are always welcome.

(c)2000 Carl Staelin and Larry McVoy          $Date$                                   PAR_MEM(8)