Ubuntu Manpage: par_mem - memory parallelism benchmark

Provided by: lmbench-doc_3.0-a9-1.1ubuntu0.1_all

NAME

       par_mem - memory parallelism benchmark

SYNOPSIS

       par_mem [ -L <line size> ] [ -M <len> ] [ -W <warmups> ] [ -N <repetitions> ]

DESCRIPTION

par_mem measures the available parallelism in the memory hierarchy, up to len bytes.
Modern processors can often service multiple memory requests in parallel, while older
processors typically blocked on LOAD instructions and had no available parallelism (other
than that provided by cache prefetching). par_mem measures the available parallelism at a
variety of points, since the available parallelism is often a function of the data
location in the memory hierarchy.

To measure the available parallelism, par_mem runs a series of experiments at each memory
size, one for each level of parallelism. It builds a pointer chain of the desired length,
then creates an array of pointers that point to chain entries evenly spaced across the
chain, and runs those pointers forward through the chain in parallel. It then measures the
average memory latency for each level of parallelism. The available parallelism is the
average memory latency for parallelism 1 divided by the minimum average memory latency
across all levels of parallelism.
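
The setup described above might be sketched as follows. This is an illustration only, not the par_mem source: the function names build_chain, spread_heads and walk are hypothetical, and the chain is linked in plain sequential order for simplicity, which is an assumption (the ordering par_mem actually uses is not stated here).

```c
#include <assert.h>
#include <stddef.h>

/* Link n slots into a single cycle: slot i holds the address of
 * slot (i + 1) % n. Sequential order is an assumption of this sketch. */
void build_chain(char **slots, size_t n)
{
    assert(slots != NULL && n > 0);
    for (size_t i = 0; i < n; ++i)
        slots[i] = (char *)&slots[(i + 1) % n];
}

/* Place k starting pointers evenly across the chain, n / k slots apart. */
void spread_heads(char **slots, size_t n, char ***heads, size_t k)
{
    assert(slots != NULL && heads != NULL);
    assert(k > 0 && k <= n);
    for (size_t j = 0; j < k; ++j)
        heads[j] = &slots[j * (n / k)];
}

/* Advance every head by `steps` loads. Each head's next load depends on
 * its own previous load, but the k heads are independent of one another,
 * which is what lets the hardware overlap them. */
void walk(char ***heads, size_t k, size_t steps)
{
    assert(heads != NULL);
    for (size_t s = 0; s < steps; ++s)
        for (size_t j = 0; j < k; ++j)
            heads[j] = (char **)*heads[j];
}
```

With an 8-slot chain and k = 2, the two heads start at slots 0 and 4. A real measurement would keep each head in its own local variable and unroll the loop, rather than index an array inside the timed region as walk does.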

For example, the inner loop which measures parallelism 2 would look something like:

       for (i = 0; i < N; ++i) {
               p0 = (char **)*p0;
               p1 = (char **)*p1;
       }

The overhead of the for loop is not significant, because the real loop body is unrolled to
100 loads. In this case, if the hardware can process two LOAD operations in parallel, then
the overall latency of the loop should be equivalent to that of a single pointer chain, so
the measured parallelism would be roughly two. If, however, the hardware can only process
a single LOAD operation at a time, or if there is significant resource contention between
the two LOAD operations, then the loop will be much slower than a loop with a single
pointer chain, so the measured parallelism will be less than two, though probably no
smaller than one.
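
The arithmetic behind this example can be sketched as below. The helper names per_load_latency and measured_parallelism are hypothetical, and the sketch assumes one reading of the description: the reported parallelism is the per-load latency of the single-chain loop divided by the smallest per-load latency seen at any level.

```c
#include <assert.h>
#include <stddef.h>

/* Average latency per load for a loop that issues `chains` loads in each
 * of `iterations` iterations and takes `loop_time` in total. */
double per_load_latency(double loop_time, size_t iterations, size_t chains)
{
    assert(iterations > 0 && chains > 0);
    return loop_time / ((double)iterations * (double)chains);
}

/* Parallelism estimate: lat[0] is the per-load latency at parallelism 1,
 * lat[i] the per-load latency at parallelism i + 1. Because lat[0] takes
 * part in the minimum, the result is never below 1. */
double measured_parallelism(const double *lat, size_t levels)
{
    assert(lat != NULL && levels > 0);
    double best = lat[0];
    for (size_t i = 1; i < levels; ++i)
        if (lat[i] < best)
            best = lat[i];
    return lat[0] / best;
}
```

If the two-chain loop takes the same total time as the single-chain loop, its per-load latency is half as large and the estimate is 2. If it takes twice as long, the per-load latency is unchanged and the estimate is 1.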

OUTPUT

       Output format is intended as input to xgraph or some similar program (we use a perl script
       that produces pic input). There is a set of data produced for each stride. The data set
       title is the stride size and the data points are the array size in megabytes (floating
       point value) and the load latency over all points in that array.

AUTHOR

       Carl Staelin and Larry McVoy

       Comments, suggestions, and bug reports are always welcome.

(c)2000 Carl Staelin and Larry McVoy          $Date$                                   PAR_MEM(8)
