Provided by: trafficserver_7.1.2+ds-3_amd64 bug

NAME

       metrics.config  -  this  configuration file is used to define dynamic metrics on Traffic Server activity.
       Metrics defined here are available through all normal means of metrics reporting, including  traffic_line
       and admin-plugins-stats-over-http

FORMAT

       The  configuration  file itself is a Lua script. As with normal Lua code, comments begin with --, you may
       declare your own functions, and you may define global variables.

METRIC DEFINITIONS

       Metrics are defined by calling the supplied metric generator functions. There is one for  each  supported
       type, and their parameters are identical:

          <typefn> '<name>' [[
            <metric generating function body>
          ]]

       In practice, this will look like:

          float 'proxy.node.useful_metric' [[
              return math.random()
          ]]

       With  perhaps something more useful in the body of the metric generator. The string containing the metric
       generating function's body (everything between [[ and ]], which is a multiline literal string in Lua)  is
       stored and then evaluated as an anonymous function, which will receive a single argument: the name of the
       metric  (in  the  example  above: proxy.node.useful_metric). If necessary, you can capture this parameter
       using the ... operator, which returns the remaining parameters of the enclosing function.

   Metric Types
   float
       A gauge style metric which  will  return  floating  point  numbers.  Floating  point  gauge  metrics  are
       appropriate  for  values  which  may increase or decrease arbitrarily (e.g. disk usage, cache hit ratios,
       average document sizes, and so on).

   integer
       A gauge style metric which will return integers. Integer gauge metrics are appropriate for  values  which
       may increase or descrease arbitrarily, and do not need any decimal components.

   counter
       A metric which will supply integer only values used almost exclusively to report on the number of events,
       whatever  they may be, that have occurred.  Frequent uses are the number of requests served, responses by
       specific HTTP status codes, the number of failed DNS lookups, and so on.

   Metric Scopes
       All dynamic metrics, like their built-in counterparts, exist within a scope which determines whether they
       reflect the state of the current Traffic Server node, or the state of the entire Traffic  Server  cluster
       for which the current node is a member.

       The  scope  of  a  metric is derived from its name. All metric names begin with proxy. followed by either
       node. or cluster..

       Thus, proxy.node.active_origin_connections might be used for the number of  open  connections  to  origin
       servers   on  just  the  current  node,  whereas  proxy.cluster.active_origin_connections  would  be  the
       counterpart for the total open connections to origin  servers  from  all  Traffic  Server  nodes  in  the
       cluster,  including  the  current node. (Note that these names are contrived, and you are advised to both
       pick as clear and detailed a metric name as possible and  also  to  ensure  there  is  no  conflict  with
       existing metric names).

   Support Functions
       Several  supporting  functions  are  defined in the default configuration file.  Existing dynamic metrics
       shipped with metrics.config make extensive use of these functions, and your own  custom  metrics  may  as
       necessary, too.

   cluster(name)
       Returns  the sum of metric name for the entire cluster of which the current node is a member. Memoization
       is used to avoid additional cost from calling this function multiple times within a single metrics  pass.
       The name must be a metric within the node scope.

   mbits(bytes)
       Converts and returns bytes as megabits (bytes * 8 / 1000000).

   mbytes(bytes)
       Converts and returns bytes as mebibytes (bytes / (1024 * 1024)).

   now()
       Returns the current node's time in milliseconds-from-epoch.

   rate_of(msec, key, fn)
       Returns  the  rate  of change over a period of msec milliseconds for the metric value of key (obtained by
       invoking the function fn).

       This is accomplished by effectively snapshotting the value of the metric at the beginning and end of  the
       given period expressed by msec, multiplying their difference by 1,000 and dividing that by msec.

   rate_of_10s(key, fn)
       Returns the rate of change for the past 10 seconds for the metric key, as calculated by function fn. This
       function simply wraps rate_of and supplies an msec value of 10 * 1000.

   Definition Examples
       For  illustrative  purposes,  a few of the dynamic metric definitions you may find in your Traffic Server
       installation's default metrics.config are explained  here.   The  actual  file  will  contain  many  more
       definitions, and of course you may add your own, as well.

   Returning a single value
       The  simplest  example  is  a  dynamic node metric which does nothing but return the current value for an
       underlying process metric:

          counter 'proxy.node.http.user_agents_total_documents_served' [[
            return proxy.process.http.incoming_requests
          ]]

       This uses the built-in function counter, which takes two parameters: the name of the  dynamic  metric  to
       create  followed  by  the function used to calculate the value. In this case, the function body is just a
       return of the named, underlying process statistic. No calculations, aggregates, or other  processing  are
       performed.

   Returning a rate-of-change
       Slightly  more  involved  than just returning a point-in-time value from a given statistic is calculating
       the rate of change:

          integer 'proxy.node.dns.lookups_per_second' [[
            local self = ...

            return rate_of_10s(self,
              function() return proxy.process.dns.total_dns_lookups end
            )
          ]]

       Similar to the previous example, we are returning another metric's value, but  in  this  case  we  do  so
       within  a  function  that we're passing into rate_of_10s. This function, explained earlier, wraps rate_of
       which tracks the given metric over a specific interval and returns the average per-second rate of change,
       obtaining the values it uses to calculate this rate by invoking the function passed to it.

   Calculating a rate-of-change's delta
       A more complicated example involves calculating the variance in the  rate  of  change  of  an  underlying
       statistic  over  a  given  period  of time. This is not an average of a statistic, nor is it just the raw
       delta between two samplings of that statistic, and while inappropriate to know how much of an  event  has
       occurred, it is useful to know how erratic or unstable the frequency of that event occurring is.

       In  other words, a large absolute value indicates a deviance from the usual pattern of behavior/activity.
       For example, if your Traffic Server cache (using the example dynamic metric function below) sees  between
       10,000  and  10,250  HostDB  hits  every 10 seconds, the value returned by this metric will remain fairly
       small. If all of a sudden 50,000 hits make it to HostDB in the span of that same averaging interval, this
       value will increase significantly. This could then be used to trigger various alerts that something might
       be up with HostDB lookups on the Traffic Server cluster.

          integer 'proxy.node.hostdb.total_hits_avg_10s' [[
            local self = ...

            return interval_delta_of_10s(self,
              function() return proxy.process.hostdb.total_hits end
            )
          ]]

       The catch is that if the dramatic increase is actually the new norm, the metric will return  to  emitting
       small  absolute values again - even though the statistic underneath is now consistently and significantly
       higher or lower than it used to be. If what you are trying to measure, though,  is  the  stability  of  a
       metric that's, long-term, a good thing.

   Converting a metric to a ratio
       Using  a  very  simplified version of the Traffic Server cache hit reporting, we can demonstrate taking a
       metric which expresses the occurrence of one type of event within a set of possibilities  and  converting
       its absolute value into a ratio of that set's total.

       In  this  example,  we assume we have three cache hit states (misses, hits, and revalidates) and they are
       tracked in the metrics proxy.node.cache.<state>.  These are not the real metric names in Traffic  Server,
       and there are much finer grained reporting states available, but we'll use these for brevity.

          float 'proxy.node.cache.hits_ratio' [[
            return
              proxy.node.cache.hits /
              ( proxy.node.cache.hits +
                proxy.node.cache.misses +
                proxy.node.cache.revalidates
              )
          ]]

   Summing across a cluster
       When  running  a Traffic Server cluster of multiple nodes, there are many metrics which are useful to see
       at both the node and cluster level. Dynamic metrics make it very easy to collect the metric's  value  for
       every node in the cluster and return the sum, as seen here with cache connections:

          counter 'proxy.cluster.http.cache_current_connections_count' [[
            return cluster('proxy.node.http.cache_current_connections_count')
          ]]

FURTHER READING

       The following resources may be useful when writing dynamic metrics:

       • Lua Documentation

COPYRIGHT

       2018, dev@trafficserver.apache.org

7.1                                               Feb 16, 2018                                 METRICS.CONFIG(5)