Provided by: lmbench-doc_3.0-a9-1.1_all


       lat_mem_rd - memory read latency benchmark


       lat_mem_rd  [  -P  <parallelism> ] [ -W <warmups> ] [ -N <repetitions> ] size_in_megabytes
       stride [ stride stride...  ]


       lat_mem_rd measures memory read latency for varying memory sizes and strides.  The results
       are  reported  in  nanoseconds  per  load  and have been verified accurate to within a few
       nanoseconds on an SGI Indy.

       The entire memory hierarchy  is  measured,  including  onboard  cache  latency  and  size,
       external cache latency and size, main memory latency, and TLB miss latency.

       Only data accesses are measured; the instruction cache is not measured.

       The benchmark runs as two nested loops: the outer loop iterates over the stride sizes and
       the inner loop over the array sizes.  For each array size, the benchmark creates a ring of
       pointers, each pointing backward one stride.  The array is traversed by

            p = (char **)*p;

       in a for loop (the overhead of the for loop is not significant; the loop body is unrolled
       to 100 loads).
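
       A minimal sketch of the ring described above, assuming a caller-supplied buffer and a
       stride that is a multiple of the pointer size.  The names build_ring and chase are
       illustrative and are not taken from the lmbench source; the plain loop in chase stands in
       for the unrolled loop the benchmark uses.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Build a ring of pointers inside buf (len bytes).  The slot at each offset
 * holds the address of the slot one stride lower; the lowest slot wraps
 * around to the highest.  Returns the highest slot as the starting point,
 * or NULL if the arguments cannot form a ring.
 *
 * Assumption: buf is aligned for pointer storage (e.g. it came from malloc).
 */
static char **build_ring(char *buf, size_t len, size_t stride)
{
    assert(buf != NULL);

    /* Every slot must be able to hold an aligned pointer. */
    if (stride < sizeof(char *) || stride % sizeof(char *) != 0 || len < stride)
        return NULL;

    size_t nslots = len / stride;
    size_t last = (nslots - 1) * stride;

    /* Each slot points backward one stride. */
    for (size_t off = last; off >= stride; off -= stride)
        *(char **)(buf + off) = buf + off - stride;

    /* Close the ring: the lowest slot points at the highest. */
    *(char **)buf = buf + last;

    return (char **)(buf + last);
}

/*
 * Perform nloads dependent loads around the ring.  Returning the final
 * position gives the caller a value to use, so the loads are not dead code.
 */
static char **chase(char **p, size_t nloads)
{
    assert(p != NULL);
    while (nloads-- > 0)
        p = (char **)*p;
    return p;
}
```

       With a 1024-byte buffer and a stride of 128 the ring has eight slots, so eight loads
       starting from the returned slot lead back to it.  Because each load depends on the
       previous one, timing a long chase and dividing by the number of loads approximates the
       per-load latency.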

       The size of the array varies from 512 bytes to (typically) eight megabytes.  For the small
       sizes,  the  cache  will  have an effect, and the loads will be much faster.  This becomes
       much more apparent when the data is plotted.

       Since this benchmark uses fixed-stride offsets in the pointer chain, it may be  vulnerable
       to smart, stride-sensitive cache prefetching policies.  Older machines were typically able
       to prefetch for sequential access patterns, and some were able  to  prefetch  for  strided
       forward  access  patterns,  but  only  a few could prefetch for backward strided patterns.
       These capabilities are becoming more widespread in newer processors.


       Output format is intended as input to xgraph or some similar program (we use a perl script
       that  produces pic input).  There is a set of data produced for each stride.  The data set
       title is the stride size and the data points are the array  size  in  megabytes  (floating
       point value) and the load latency over all points in that array.
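
       As an illustration of the data points described above, the sketch below converts an array
       size in bytes to megabytes and formats it next to a latency.  The function name
       format_point and the exact precision ("%.5f %.3f") are assumptions made for this example,
       not a statement of what lat_mem_rd prints.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Format one data point as "<size in MB> <latency in ns>\n".
 * The field widths are an assumption chosen for illustration.
 * Returns the value of snprintf: the number of characters that the
 * full line needs, excluding the terminating NUL.
 */
static int format_point(char *out, size_t outlen,
                        size_t array_bytes, double latency_ns)
{
    assert(out != NULL && outlen > 0);

    /* Array sizes are reported in megabytes as a floating point value. */
    double size_mb = (double)array_bytes / (1024.0 * 1024.0);

    return snprintf(out, outlen, "%.5f %.3f\n", size_mb, latency_ns);
}
```

       Under this assumed format a 1024-byte array is written as 0.00098 and a 128-kilobyte
       array as 0.12500, which is why such small fractional sizes appear in the data.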


       The output is best examined as a graph, which typically shows four plateaus.  The graph
       should be plotted with log base 2 of the array size on the X axis and the latency on the
       Y axis.  Each stride is then plotted as a curve.  The plateaus that appear correspond to
       the onboard cache (if present), external cache (if present), main memory latency, and TLB
       miss latency.

       As  a  rough  guide,  you  may  be  able  to extract the latencies of the various parts as
       follows, but you should really look at the graphs, since  these  rules  of  thumb  do  not
       always work (some systems do not have onboard cache, for example).

       onboard cache   Try stride of 128 and array size of .00098.

       external cache  Try stride of 128 and array size of .125.

       main memory     Try stride of 128 and array size of 8.

       TLB miss        Try the largest stride and the largest array.
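
       The rules of thumb above amount to picking one data point per level from the output.  A
       sketch of that lookup follows, assuming the output has already been parsed into an array;
       struct point and nearest_point are hypothetical names introduced for this example.

```c
#include <assert.h>
#include <stddef.h>

/* One parsed data point: the stride of its data set, array size, latency. */
struct point {
    size_t stride;      /* stride in bytes (the data set title) */
    double size_mb;     /* array size in megabytes */
    double latency_ns;  /* load latency in nanoseconds */
};

/*
 * Return the point with the given stride whose array size is closest to
 * want_mb, or NULL if no point has that stride.
 */
static const struct point *nearest_point(const struct point *pts, size_t n,
                                         size_t stride, double want_mb)
{
    assert(pts != NULL || n == 0);

    const struct point *best = NULL;
    double best_gap = 0.0;

    for (size_t i = 0; i < n; i++) {
        if (pts[i].stride != stride)
            continue;

        /* Absolute distance between this size and the requested size. */
        double gap = pts[i].size_mb > want_mb ? pts[i].size_mb - want_mb
                                              : want_mb - pts[i].size_mb;

        if (best == NULL || gap < best_gap) {
            best = &pts[i];
            best_gap = gap;
        }
    }
    return best;
}
```

       Looking up stride 128 at sizes .00098, .125 and 8 then yields candidate latencies for the
       onboard cache, external cache and main memory.  The results should still be checked
       against the graph, since the plateaus do not fall at these sizes on every system.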


       This  program is dependent on the correct operation of mhz(8).  If you are getting numbers
       that seem off, check that mhz(8) is giving you a clock rate that you believe.


       Funding for the development of  this  tool  was  provided  by  Sun  Microsystems  Computer


       lmbench(8), tlb(8), cache(8), line(8).


       Carl Staelin and Larry McVoy

       Comments, suggestions, and bug reports are always welcome.

(c)1994 Larry McVoy                           $Date$                                LAT_MEM_RD(8)