Ubuntu Manpage: metrics.config - this configuration file is used to define dynamic metrics on Traffic

Provided by: trafficserver_7.1.2+ds-3_amd64

NAME

       metrics.config  -  this  configuration  file  is used to define dynamic metrics on Traffic
       Server activity.  Metrics defined here are available through all normal means  of  metrics
       reporting, including traffic_line and the stats_over_http plugin.

FORMAT

       The  configuration  file  itself  is a Lua script. As with normal Lua code, comments begin
       with --, you may declare your own functions, and you may define global variables.

METRIC DEFINITIONS

       Metrics are defined by calling the supplied metric generator functions. There is  one  for
       each supported type, and their parameters are identical:

          <typefn> '<name>' [[
            <metric generating function body>
          ]]

       In practice, this will look like:

          float 'proxy.node.useful_metric' [[
              return math.random()
          ]]

       With  perhaps  something  more  useful  in  the  body  of the metric generator. The string
       containing the metric generating function's body (everything between [[ and ]], which is a
       multiline  literal  string  in Lua) is stored and then evaluated as an anonymous function,
       which will receive a single argument: the name  of  the  metric  (in  the  example  above:
       proxy.node.useful_metric). If necessary, you can capture this parameter using the ...
       vararg expression, which evaluates to the arguments passed to the enclosing function.
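
       As a rough illustration of the two Lua mechanisms described above (a call written
       with two string literals, and the ... expression), the following standalone sketch
       mimics a metric generator in plain Lua. The names sketch_float and registry are
       hypothetical and exist only for this example; how Traffic Server stores and evaluates
       metric bodies internally is not shown here and may differ.

```lua
-- Lua 5.1 and LuaJIT compile strings with loadstring; Lua 5.2+ use load.
local compile = loadstring or load

-- Hypothetical table holding the compiled metric functions by name.
local registry = {}

-- Hypothetical stand-in for a generator such as float. Lua parses
--   sketch_float 'name' [[body]]
-- as sketch_float('name')([[body]]), so the first call returns a
-- function that receives the body string.
local function sketch_float(name)
  return function(body)
    registry[name] = assert(compile(body))
  end
end

sketch_float 'proxy.node.example_metric' [[
  local self = ...   -- the metric name passed in by the caller
  return 'evaluated for ' .. self
]]

-- Evaluate the stored function, passing the metric name as its only argument.
local name = 'proxy.node.example_metric'
print(registry[name](name))
```

       The body string is compiled once and can then be called on every metrics pass, which
       is why the metric name arrives as an argument rather than being baked into the body.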

   Metric Types
   float
       A gauge style metric which will  return  floating  point  numbers.  Floating  point  gauge
       metrics  are  appropriate for values which may increase or decrease arbitrarily (e.g. disk
       usage, cache hit ratios, average document sizes, and so on).

   integer
       A gauge style metric which will return integers. Integer gauge metrics are appropriate for
       values which may increase or decrease arbitrarily and do not need any decimal
       components.

   counter
       A metric which supplies integer-only values, used almost exclusively to report the
       number of events, whatever they may be, that have occurred. Frequent uses are the number
       of requests served, responses by specific HTTP status codes, the number of failed DNS
       lookups, and so on.

   Metric Scopes
       All  dynamic  metrics,  like  their  built-in  counterparts,  exist  within  a scope which
       determines whether they reflect the state of the current Traffic Server node, or the state
       of the entire Traffic Server cluster for which the current node is a member.

       The scope of a dynamic metric is derived from its name. All dynamic metric names begin
       with proxy., followed by either node. or cluster. (each prefix ends with a dot).

       Thus,  proxy.node.active_origin_connections  might  be  used  for  the  number   of   open
       connections    to    origin    servers    on    just    the    current    node,    whereas
       proxy.cluster.active_origin_connections would  be  the  counterpart  for  the  total  open
       connections  to origin servers from all Traffic Server nodes in the cluster, including the
       current node. (Note that these names are contrived, and you are advised to  both  pick  as
       clear  and detailed a metric name as possible and also to ensure there is no conflict with
       existing metric names).
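
       The naming rule above can be checked mechanically. The following sketch is plain Lua
       and is not part of metrics.config; scope_of is a hypothetical helper name introduced
       only for illustration.

```lua
-- Return 'node' or 'cluster' for a name that follows the dynamic metric
-- naming rule, or nil for anything else.
local function scope_of(name)
  -- Capture the component between "proxy." and the next dot.
  local scope = name:match('^proxy%.(%a+)%.')
  if scope == 'node' or scope == 'cluster' then
    return scope
  end
  return nil
end

print(scope_of('proxy.node.active_origin_connections'))
print(scope_of('proxy.cluster.active_origin_connections'))
print(scope_of('proxy.process.http.incoming_requests'))
```

       The last name is an underlying process statistic rather than a dynamic metric, so the
       helper reports no scope for it.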

   Support Functions
       Several supporting functions are defined in the default configuration file. Existing
       dynamic metrics shipped with metrics.config make extensive use of these functions, and
       your own custom metrics may use them as needed, too.

   cluster(name)
       Returns the sum of metric name for the entire cluster of  which  the  current  node  is  a
       member.  Memoization  is used to avoid additional cost from calling this function multiple
       times within a single metrics pass. The name must be a metric within the node scope.
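
       To make the memoization idea concrete, here is a standalone sketch in plain Lua. It is
       not the shipped cluster function: node_values, cluster_sum, begin_pass, and the
       numbers are all hypothetical, and a real cluster would obtain per-node values by other
       means.

```lua
-- Hypothetical per-node values standing in for what each node would report.
local metric = 'proxy.node.http.cache_current_connections_count'
local node_values = {
  node_a = { [metric] = 12 },
  node_b = { [metric] = 7 },
  node_c = { [metric] = 3 },
}

local cache = {}      -- memoized sums for the current metrics pass
local computed = 0    -- how many times a sum was actually computed

-- Sum a node-scoped metric across all nodes, at most once per pass.
local function cluster_sum(name)
  if cache[name] == nil then
    computed = computed + 1
    local total = 0
    for _, metrics in pairs(node_values) do
      total = total + (metrics[name] or 0)
    end
    cache[name] = total
  end
  return cache[name]
end

-- Discard memoized sums so the next pass sees fresh values.
local function begin_pass()
  cache = {}
end

print(cluster_sum(metric), cluster_sum(metric), computed)
```

       Calling cluster_sum twice in the same pass performs the summation only once; calling
       begin_pass between passes forces a recomputation.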

   mbits(bytes)
       Converts and returns bytes as megabits (bytes * 8 / 1000000).

   mbytes(bytes)
       Converts and returns bytes as mebibytes (bytes / (1024 * 1024)).
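
       Written out in Lua, the two conversions amount to the following. This is a sketch
       transcribed from the formulas given above; the definitions in your installed
       metrics.config may differ in detail.

```lua
-- Bytes to megabits: 8 bits per byte, 1,000,000 bits per megabit.
local function mbits(bytes)
  return bytes * 8 / 1000000
end

-- Bytes to mebibytes: 1024 * 1024 bytes per mebibyte.
local function mbytes(bytes)
  return bytes / (1024 * 1024)
end

print(mbits(125000))    -- 125000 bytes is exactly one megabit
print(mbytes(1048576))  -- 1048576 bytes is exactly one mebibyte
```

       Note that the two helpers use different bases: mbits is decimal (powers of 10), while
       mbytes is binary (powers of 2).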

   now()
       Returns the current node's time in milliseconds-from-epoch.

   rate_of(msec, key, fn)
       Returns the rate of change over a period of msec milliseconds for the metric value of  key
       (obtained by invoking the function fn).

       This  is accomplished by effectively snapshotting the value of the metric at the beginning
       and end of the given period expressed by msec, multiplying their difference by  1,000  and
       dividing that by msec.
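
       The arithmetic described above can be sketched as a pure function. The helper name
       rate_per_second and the sample values are hypothetical; the real rate_of also has to
       remember the earlier snapshot between metrics passes, which this sketch leaves out.

```lua
-- Per-second rate of change between two snapshots taken msec milliseconds apart.
local function rate_per_second(start_value, end_value, msec)
  return (end_value - start_value) * 1000 / msec
end

-- A counter that moved from 10000 to 10250 over a 10 second period
-- changed by 250 in total, or 25 per second.
print(rate_per_second(10000, 10250, 10 * 1000))
```

       Multiplying by 1,000 before dividing by msec converts the per-millisecond change into
       a per-second rate.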

   rate_of_10s(key, fn)
       Returns  the  rate  of change for the past 10 seconds for the metric key, as calculated by
       function fn. This function simply wraps rate_of and supplies an msec value of 10 * 1000.

   Definition Examples
       For illustrative purposes, a few of the dynamic metric definitions you may  find  in  your
       Traffic  Server installation's default metrics.config are explained here.  The actual file
       will contain many more definitions, and of course you may add your own, as well.

   Returning a single value
       The simplest example is a dynamic node metric which does nothing but  return  the  current
       value for an underlying process metric:

          counter 'proxy.node.http.user_agents_total_documents_served' [[
            return proxy.process.http.incoming_requests
          ]]

       This  uses  the  built-in  function  counter,  which takes two parameters: the name of the
       dynamic metric to create followed by the function used to calculate  the  value.  In  this
       case,  the  function  body is just a return of the named, underlying process statistic. No
       calculations, aggregates, or other processing are performed.

   Returning a rate-of-change
       Slightly more involved than just returning a point-in-time value from a given statistic is
       calculating the rate of change:

          integer 'proxy.node.dns.lookups_per_second' [[
            local self = ...

            return rate_of_10s(self,
              function() return proxy.process.dns.total_dns_lookups end
            )
          ]]

       Similar to the previous example, we are returning another metric's value, but in this case
       we do so within a function that we're passing into rate_of_10s. This  function,  explained
       earlier,  wraps rate_of which tracks the given metric over a specific interval and returns
       the average per-second rate of change, obtaining the values it uses to calculate this rate
       by invoking the function passed to it.

   Calculating a rate-of-change's delta
       A more complicated example involves calculating the variance in the rate of change of an
       underlying statistic over a given period of time. This is not an average of a statistic,
       nor is it just the raw delta between two samplings of that statistic. It is not suited
       to telling you how many times an event has occurred, but it is useful for telling you
       how erratic or unstable the frequency of that event is.

       In other words, a large absolute value indicates a deviation from the usual pattern of
       behavior/activity. For example, if your Traffic Server cache (using  the  example  dynamic
       metric  function  below)  sees between 10,000 and 10,250 HostDB hits every 10 seconds, the
       value returned by this metric will remain fairly small. If all of  a  sudden  50,000  hits
       make  it  to  HostDB in the span of that same averaging interval, this value will increase
       significantly. This could then be used to trigger various alerts that something  might  be
       up with HostDB lookups on the Traffic Server cluster.

          integer 'proxy.node.hostdb.total_hits_avg_10s' [[
            local self = ...

            return interval_delta_of_10s(self,
              function() return proxy.process.hostdb.total_hits end
            )
          ]]

       The catch is that if the dramatic increase is actually the new norm, the metric will
       return to emitting small absolute values again - even though the statistic underneath is
       now consistently and significantly higher or lower than it used to be. If what you are
       trying to measure is the long-term stability of a metric, though, that is a good thing.
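
       The behavior described above can be illustrated with a few numbers. The sketch below
       is plain Lua with made-up samples. It is not the implementation of
       interval_delta_of_10s; it only assumes that the value of interest is the change in
       per-interval activity from one interval to the next. The helper name differences is
       hypothetical.

```lua
-- Cumulative hit counter sampled at the end of successive 10 second
-- intervals (made-up values).
local samples = { 0, 10000, 20250, 70250, 120250 }

-- Return the differences between consecutive elements of a list.
local function differences(values)
  local result = {}
  for i = 2, #values do
    result[#result + 1] = values[i] - values[i - 1]
  end
  return result
end

local counts = differences(samples)  -- hits per interval: 10000, 10250, 50000, 50000
local deltas = differences(counts)   -- change between intervals: 250, 39750, 0

print(table.concat(counts, ', '))
print(table.concat(deltas, ', '))
```

       The jump to 50,000 hits produces one large delta, but once 50,000 hits per interval
       becomes the norm the delta falls back to zero, which matches the caveat above.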

   Converting a metric to a ratio
       Using a very simplified version  of  the  Traffic  Server  cache  hit  reporting,  we  can
       demonstrate  taking  a metric which expresses the occurrence of one type of event within a
       set of possibilities and converting its absolute value into a ratio of that set's total.

       In this example, we assume we have three cache hit states (misses, hits, and  revalidates)
       and  they  are  tracked  in  the metrics proxy.node.cache.<state>.  These are not the real
       metric names in Traffic  Server,  and  there  are  much  finer  grained  reporting  states
       available, but we'll use these for brevity.

          float 'proxy.node.cache.hits_ratio' [[
            return
              proxy.node.cache.hits /
              ( proxy.node.cache.hits +
                proxy.node.cache.misses +
                proxy.node.cache.revalidates
              )
          ]]
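
       One thing to keep in mind with a ratio like this is the case where all three
       counters are still zero, since the division then has a zero denominator. The
       standalone sketch below shows one possible guard. The function name hits_ratio and
       the choice to report 0 in that case are assumptions made for this example, not
       something the definition above does.

```lua
-- Ratio of hits to all cache lookups, reporting 0 when nothing has been
-- counted yet instead of dividing by zero.
local function hits_ratio(hits, misses, revalidates)
  local total = hits + misses + revalidates
  if total == 0 then
    return 0
  end
  return hits / total
end

print(hits_ratio(75, 20, 5))  -- 75 hits out of 100 lookups
print(hits_ratio(0, 0, 0))    -- no lookups yet
```

       Inside a metric definition, the same check could be applied to the three metric
       values before returning the quotient.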

   Summing across a cluster
       When  running a Traffic Server cluster of multiple nodes, there are many metrics which are
       useful to see at both the node and cluster level. Dynamic metrics make  it  very  easy  to
       collect  the metric's value for every node in the cluster and return the sum, as seen here
       with cache connections:

          counter 'proxy.cluster.http.cache_current_connections_count' [[
            return cluster('proxy.node.http.cache_current_connections_count')
          ]]

COPYRIGHT

       2018, dev@trafficserver.apache.org
