Ubuntu Manpage: Catmandu::Exporter::Stat

Provided by: libcatmandu-stat-perl_0.13-2_all

NAME

       Catmandu::Exporter::Stat - a statistical export

SYNOPSIS

           # Calculate statistics on the availability of the ISBN fields in the dataset
           cat data.json | catmandu convert -v JSON to Stat --fields isbn

           # Export the statistics as YAML
           cat data.json | catmandu convert -v JSON to Stat --fields isbn --as YAML

DESCRIPTION

       The Catmandu::Stat package can be used to calculate statistics on the availability of
       fields in a data file. Use this exporter to count the availability of fields or count the
       number of duplicate values. For each field the exporter calculates the following
       statistics:

         * name    : the name of a field
         * count   : the number of occurrences of a field in all records
         * zeros   : the number of records without a field
         * zeros%  : the percentage of records without a field
         * min     : the minimum number of occurrences of a field in any record
         * max     : the maximum number of occurrences of a field in any record
         * mean    : the mean number of occurrences of a field in all records
         * variance : the variance of the number of occurrences of a field per record
         * stdev   : the standard deviation of the number of occurrences of a field per record
         * uniq~   : the estimated number of unique values
         * uniq%   : the estimated percentage of unique values
         * entropy : the minimum and maximum entropy in the field values (estimated value)
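
       The occurrence-based statistics in this list can be illustrated with a minimal Python
       sketch. It is not the module's implementation: the function name field_stats is
       hypothetical, and the use of the population variance (dividing by the number of
       records) is an assumption, since the convention used by the exporter is not stated here.

```python
import math

def field_stats(occurrences):
    """Summarise per-record occurrence counts of one field.

    `occurrences` holds one integer per record: how often the field
    appears in that record (0 when the field is missing).
    """
    n = len(occurrences)
    count = sum(occurrences)
    zeros = sum(1 for c in occurrences if c == 0)
    mean = count / n
    # Assumption: population variance (divide by n), not sample variance.
    variance = sum((c - mean) ** 2 for c in occurrences) / n
    return {
        "count": count,
        "zeros": zeros,
        "zeros%": 100.0 * zeros / n,
        "min": min(occurrences),
        "max": max(occurrences),
        "mean": mean,
        "variance": variance,
        "stdev": math.sqrt(variance),
    }

# Four records: the field occurs 1, 0, 3 and 0 times respectively.
stats = field_stats([1, 0, 3, 0])
print(stats)
```

       In this example half of the records lack the field, so zeros is 2 and zeros% is 50.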

       Details:

         * entropy is an indication of the variation in field values (do some values occur more often than others)
         * entropy values are displayed as : minimum/maximum entropy
         * when the minimum entropy = 0, then all the field values are equal
         * when the minimum and maximum entropy are equal, then all the field values are different
         * the 'uniq%' and 'entropy' fields are estimated and are normally within 1% of the
           correct value (this is done to keep the memory requirements of this module low)
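
       The two boundary cases above match the behaviour of Shannon entropy, which the
       following Python sketch computes exactly. Treat it as an illustration only: the
       exporter reports estimates, and reading the displayed pair as "observed entropy /
       log2(number of values)" is an assumption. The function names are hypothetical.

```python
import math
from collections import Counter

def shannon_entropy(values):
    """Exact Shannon entropy (in bits) of a list of field values."""
    total = len(values)
    counts = Counter(values)
    # p * log2(1/p) summed over the distinct values, with p = c / total.
    return sum((c / total) * math.log2(total / c) for c in counts.values())

def max_entropy(values):
    """Upper bound, reached when every value is different."""
    return math.log2(len(values))

all_equal = ["x", "x", "x", "x"]
all_different = ["a", "b", "c", "d"]
mixed = ["a", "a", "b", "c"]

for sample in (all_equal, all_different, mixed):
    print(sample, shannon_entropy(sample), max_entropy(sample))
```

       For all_equal the entropy is 0, for all_different it equals the upper bound of 2 bits,
       and for mixed it lies in between.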

       Each statistical report contains one row named '#' (hash), which contains the total
       number of records.

CONFIGURATION

v
Verbose output. Show the processing speed.

fix FIX
A fix or a fix file containing one or more fixes applied to the input data before the
statistics are calculated.

fields KEY[,KEY,...]
One or more fields in the data for which statistics need to be calculated. Deeply
nested fields are not allowed. The exporter will collect statistics on the availability of
a field in all records. For instance, the following record contains one 'title' field,
zero 'isbn' fields and three 'author' fields:

---
title: ABCDEF
author:
- Davis, Miles
- Parker, Charly
- Mingus, Charles
year: 1950
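
How the occurrences in this record add up can be sketched in Python. This is an illustrative assumption about the counting rule (a list counts once per element, any other value counts once, a missing field counts zero), not the exporter's own code; the function name occurrences is hypothetical.

```python
def occurrences(record, field):
    """Count how often a top-level field occurs in one record."""
    if field not in record:
        return 0
    value = record[field]
    # A list holds one occurrence per element; anything else counts once.
    return len(value) if isinstance(value, list) else 1

# The example record from above, as a Python dictionary.
record = {
    "title": "ABCDEF",
    "author": ["Davis, Miles", "Parker, Charly", "Mingus, Charles"],
    "year": 1950,
}
counts = {f: occurrences(record, f) for f in ("title", "isbn", "author")}
print(counts)  # {'title': 1, 'isbn': 0, 'author': 3}
```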

Examples of operation:

# Calculate statistics on the number of records that contain a 'title'
cat data.json | catmandu convert JSON to Stat --fields title

# Calculate statistics on the number of records that contain a 'title', 'isbn' or 'subject' field
cat data.json | catmandu convert JSON to Stat --fields title,isbn,subject

# The next example will not work: no deeply nested fields allowed
cat data.json | catmandu convert JSON to Stat --fields foo.bar.x.y

When no fields parameter is given, the field names are taken from the first input
record.

as Table | CSV | YAML | JSON | ...
By default the statistics are exported in a Table format. Use the 'as' option to
change the export format.

topk NUMBER
To calculate the entropy, an estimate of the probability distribution of the data set
needs to be calculated. Topk is the expected lower bound on the number of field values
which have repeated entries. By default it is set to 100. If more field values have
duplicates, then this number needs to be increased.

hll NUMBER
The Algorithm::HyperLogLog parameter used to estimate the cardinality (uniqueness) of
a data set. The HLL register parameter, which should be between 4 and 16, determines
the precision of the calculation. The bigger the number, the better the precision, but
the more memory is used. Default: 14.
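
The trade-off between the register parameter and precision can be estimated with the usual HyperLogLog rule of thumb, a relative standard error of about 1.04 / sqrt(m) for m registers. The Python sketch below assumes that the hll value is the base-2 logarithm of the number of registers (m = 2**hll); the function name is hypothetical and the result is a typical error, not a guarantee.

```python
import math

def hll_relative_error(register_bits):
    """Typical relative standard error of HyperLogLog with 2**bits registers."""
    registers = 2 ** register_bits
    return 1.04 / math.sqrt(registers)

for bits in (4, 10, 14, 16):
    error = hll_relative_error(bits)
    print(f"hll={bits:2d}  registers={2 ** bits:6d}  error~{error:.2%}")
```

Under these assumptions the default of 14 gives an error of roughly 0.8%, while every increase by one doubles the number of registers.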
