Provided by: datamash_1.0.7-1_amd64
NAME
datamash - command-line calculations
SYNOPSIS
datamash [OPTION] op [col] [op col ...]
DESCRIPTION
Performs numeric/string operations on input from stdin. 'op' is the operation to perform; For grouping,per-line operations 'col' is the input field to use; 'col' can be a number (1=first field), or a column name when using -H or --header-in options. File operations: transpose, reverse Line-Filtering operations: rmdup Per-Line operations: base64, debase64, md5, sha1, sha256, sha512 Numeric Grouping operations: sum, min, max, absmin, absmax Textual/Numeric Grouping operations: count, first, last, rand, unique, collapse, countunique Statistical Grouping operations: mean, median, q1, q3, iqr, mode, antimode, pstdev, sstdev, pvar svar, mad, madraw, pskew, sskew, pkurt, skurt, dpo, jarque
OPTIONS
Grouping Options: -f, --full print entire input line before op results (default: print only the grouped keys) -g, --group=X[,Y,Z] group via fields X,[Y,Z] --header-in first input line is column headers --header-out print column headers as first line -H, --headers same as '--header-in --header-out' -i, --ignore-case ignore upper/lower case when comparing text; this affects grouping, and string operations -s, --sort sort the input before grouping; this removes the need to manually pipe the input through 'sort' File Operation Options: --no-strict allow lines with varying number of fields --filler=X fill missing values with X (default %s) General Options: -t, --field-separator=X use X instead of TAB as field delimiter --narm skip NA/NaN values -W, --whitespace use whitespace (one or more spaces and/or tabs) for field delimiters -z, --zero-terminated end lines with 0 byte, not newline --help display this help and exit --version output version information and exit
AVAILABLE OPERATIONS
File operations: transpose transpose rows, columns of the input file reverse reverse field order in each line Line-Filtering operations: rmdup remove lines with duplicated key value Per-Line operations: base64 Encode the field as base64 debase64 Decode the field as base64, exit with error if invalid base64 string md5/sha1/sha256/sha512 Calculate md5/sha1/sha256/sha512 hash of the field value reverse reverse field order in each line Numeric Grouping operations sum sum the of values min minimum value max maximum value absmin minimum of the absolute values absmax maximum of the absolute values Textual/Numeric Grouping operations count count number of elements in the group first the first value of the group last the last value of the group rand one random value from the group unique comma-separated sorted list of unique values collapse comma-separated list of all input values countunique number of unique/distinct values Statistical Grouping operations mean mean of the values median median value q1 1st quartile value q3 3rd quartile value iqr inter-quartile range mode mode value (most common value) antimode anti-mode value (least common value) pstdev population standard deviation sstdev sample standard deviation pvar population variance svar sample variance mad median absolute deviation, scaled by constant 1.4826 for normal distributions madraw median absolute deviation, unscaled sskew skewness of the (sample) group pskew skewness of the (population) group values x reported by 'sskew' and 'pskew' operations: x > 0 - positively skewed / skewed right 0 > x - negatively skewed / skewed left x > 1 - highly skewed right 1 > x > 0.5 - moderately skewed right 0.5 > x > -0.5 - approximately symmetric -0.5 > x > -1 - moderately skewed left -1 > x - highly skewed left skurt excess Kurtosis of the (sample) group pkurt excess Kurtosis of the (population) group jarque p-value of the Jarque-Beta test for normality dpo p-value of the D'Agostino-Pearson Omnibus test for normality; for 'jarque' and 'dpo' operations: null hypothesis is normality; low p-Values indicate non-normal data; high p-Values indicate null-hypothesis cannot be rejected.
EXAMPLES
Print the sum and the mean of values from column 1: $ seq 10 | datamash sum 1 mean 1 55 5.5 Group input based on field 1, and sum values (per group) on field 2: $ cat example.txt A 10 A 5 B 9 B 11 $ datamash -g 1 sum 2 < example.txt A 15 B 20 Unsorted input must be sorted (with '-s'): $ cat example.txt A 10 C 4 B 9 C 1 A 5 B 11 $ datamash -s -g1 sum 2 < example.txt A 15 B 20 C 5 Which is equivalent to: $ cat example.txt | sort -k1,1 | datamash -g 1 sum 2 Use -h (--headers) if the input file has a header line: # Given a file with student name, field, test score... $ head -n5 scores_h.txt Name Major Score Shawn Engineering 47 Caleb Business 87 Christian Business 88 Derek Arts 60 # Calculate the mean and standard devian for each major $ datamash --sort --headers --group 2 mean 3 pstdev 3 < scores_h.txt (or use short form) $ datamash -sH -g2 mean 3 pstdev 3 < scores_h.txt (or use named columns) $ datamash -sH -g Major mean Score pstdev Score < scores_h.txt GroupBy(Major) mean(Score) pstdev(Score) Arts 68.9 10.1 Business 87.3 4.9 Engineering 66.5 19.1 Health-Medicine 90.6 8.8 Life-Sciences 55.3 19.7 Social-Sciences 60.2 16.6 Reverse field order in each line: $ seq 6 | paste - - | datamash reverse 2 1 4 3 6 5 Transpose rows, columns: $ seq 6 | paste - - | datamash transpose 1 3 5 2 4 6 Remove lines with duplicate key value from column 1 (Unlike first,last operations, rmdup is much faster and does not require sorting the file with -s): # Given a list of files and sample IDs: $ cat INPUT SampleID File 2 cc.txt 3 dd.txt 1 ab.txt 2 ee.txt 3 ff.txt # Remove lines with duplicated Sample-ID (column 1): $ datamash rmdup 1 < INPUT (or used named column) $ datamash -H rmdup SampleID < INPUT SampleID File 2 cc.txt 3 dd.txt 1 ab.txt Calculate the sha1 hash value of each TXT file, after calculating the sha1 value of each file's content: $ sha1sum *.txt | datamash -Wf sha1 2
ADDITIONAL INFORMATION
See GNU Datamash Website (http://www.gnu.org/software/datamash)
AUTHOR
Written by Assaf Gordon.
COPYRIGHT
Copyright © 2015 Assaf Gordon License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>. This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law.
SEE ALSO
The full documentation for datamash is maintained as a Texinfo manual. If the info and datamash programs are properly installed at your site, the command info datamash should give you access to the complete manual.