Provided by: lmbench_3.0-a9-1.1_i386


       lat_mem_rd - memory read latency benchmark


       lat_mem_rd  [  -P <parallelism> ] [ -W <warmups> ] [ -N <repetitions> ]
       size_in_megabytes stride [ stride stride...  ]


       lat_mem_rd measures memory read latency for varying  memory  sizes  and
       strides.   The  results  are  reported in nanoseconds per load and have
       been verified accurate to within a few nanoseconds on an SGI Indy.

       The entire  memory  hierarchy  is  measured,  including  onboard  cache
       latency and size, external cache latency and size, main memory latency,
       and TLB miss latency.

       Only data accesses are measured; the instruction cache is not measured.

       The benchmark runs as two nested loops.  The outer loop iterates over
       the stride sizes and the inner loop over the array sizes.  For each
       array size, the benchmark creates a ring of pointers, each of which
       points backward one stride.  The array is traversed by

            p = (char **)*p;

       in a for loop (the overhead of the for loop is not significant; the
       loop is an unrolled loop 100 loads long).
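
       The ring construction and traversal described above can be sketched
       as follows.  This is a minimal illustration, not the lmbench source:
       the function names build_ring and chase are hypothetical, the loop
       is not unrolled, and no timing is done.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper (not from lmbench): fill buf with a ring of
 * pointers in which each element points back one stride and the
 * first element wraps around to the last.  Returns the last element,
 * which serves as the starting point of the traversal. */
char **build_ring(char *buf, size_t len, size_t stride)
{
    /* Each element must be able to hold an aligned pointer. */
    assert(stride >= sizeof(char *) && stride % sizeof(char *) == 0);
    assert(len >= stride);

    size_t n = len / stride;                    /* elements in the ring */
    for (size_t i = 1; i < n; i++)
        *(char **)(buf + i * stride) = buf + (i - 1) * stride;
    *(char **)buf = buf + (n - 1) * stride;     /* close the ring */

    return (char **)(buf + (n - 1) * stride);
}

/* Follow the chain for the given number of dependent loads. */
char **chase(char **p, size_t loads)
{
    while (loads-- > 0)
        p = (char **)*p;
    return p;
}
```

       Because every load depends on the result of the previous one, the
       loads cannot overlap, so the elapsed time divided by the number of
       loads approximates the latency of a single load.  A caller would
       allocate the buffer (for example with malloc), build the ring once,
       and time a long call to chase.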

       The size of the array  varies  from  512  bytes  to  (typically)  eight
       megabytes.  For the small sizes, the cache will have an effect, and the
       loads will be much faster.  This becomes much more  apparent  when  the
       data is plotted.

       Since this benchmark uses fixed-stride offsets in the pointer chain, it
       may  be  vulnerable  to  smart,  stride-sensitive   cache   prefetching
       policies.    Older   machines  were  typically  able  to  prefetch  for
       sequential access patterns, and some were able to prefetch for  strided
       forward  access  patterns,  but  only a few could prefetch for backward
       strided patterns.  These capabilities are becoming more  widespread  in
       newer processors.


       Output  format  is  intended as input to xgraph or some similar program
       (we use a perl script that produces pic input).  There is a set of data
       produced  for  each  stride.  The data set title is the stride size and
       the data points are the array size in megabytes (floating point  value)
       and the load latency over all points in that array.
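
       For readers who want to post-process the data without xgraph, one
       data point could be parsed as sketched below.  The helper name
       parse_point is hypothetical, and the sketch assumes that each data
       point is a single line holding two whitespace-separated floating
       point values (array size, then latency); check this against the
       actual output before relying on it.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper (not part of lmbench): parse one data point,
 * ASSUMED to look like "<size in MB> <latency in ns>".  Returns 1 on
 * success and 0 for any other line, such as a data set title. */
int parse_point(const char *line, double *size_mb, double *latency_ns)
{
    assert(line != NULL && size_mb != NULL && latency_ns != NULL);
    return sscanf(line, "%lf %lf", size_mb, latency_ns) == 2;
}
```

       A line that does not begin with two numbers is rejected, so the
       caller can treat it as the title of the next data set.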


       The output is best examined as a graph, which typically shows four
       plateaus.  The graph should be plotted with log base 2 of the array
       size on the X axis and the latency on the Y axis.  Each stride is
       then plotted as a curve.  The plateaus that appear correspond to the
       onboard cache (if present), external cache (if present), main memory
       latency, and TLB miss latency.

       As a rough guide, you may be able  to  extract  the  latencies  of  the
       various  parts  as  follows,  but you should really look at the graphs,
       since these rules of thumb do not always work (some systems do not have
       onboard cache, for example).

       onboard cache   Try stride of 128 and array size of .00098 (1 KB).

       external cache  Try stride of 128 and array size of .125 (128 KB).

       main memory     Try stride of 128 and array size of 8 (8 MB).

       TLB miss        Try the largest stride and the largest array.


       This  program  is dependent on the correct operation of mhz(8).  If you
       are getting numbers that seem off, check that mhz(8) is  giving  you  a
       clock rate that you believe.


       Funding   for  the  development  of  this  tool  was  provided  by  Sun
       Microsystems Computer Corporation.


       lmbench(8), tlb(8), cache(8), line(8).


       Carl Staelin and Larry McVoy

       Comments, suggestions, and bug reports are always welcome.

(c)1994 Larry McVoy                 $Date$                       LAT_MEM_RD(8)