Provided by: ent_1.2debian-1build1_amd64 bug


       ent - pseudorandom number sequence test


       ent [options] [file]


       ENT Logo

       ent  performs  a  variety of tests on the stream of bytes in file (or standard input if no
       file is specified) and produces output on standard output; for example:

       Entropy = 7.980627 bits per character.

       Optimum compression would reduce the size
       of this 51768 character file by 0 percent.

       Chi square distribution for 51768 samples is 1542.26, and randomly
       would exceed this value 0.01 percent of the times.

       Arithmetic mean value of data bytes is 125.93 (127.5 = random).
       Monte Carlo value for Pi is 3.169834647 (error 0.90 percent).
       Serial correlation coefficient is 0.004249 (totally uncorrelated = 0.0).

       The values calculated are as follows:

       The information density of the contents of the file, expressed as a  number  of  bits  per
       character. The results above, which resulted from processing an image file compressed with
       JPEG, indicate that the file is extremely dense in information—essentially random.  Hence,
       compression  of the file is unlikely to reduce its size. By contrast, the C source code of
       the program has  entropy  of  about  4.9  bits  per  character,  indicating  that  optimal
       compression of the file would reduce its size by 38%. [Hamming, pp. 104-108]

       The  chi-square  test  is  the  most commonly used test for the randomness of data, and is
       extremely  sensitive  to  errors  in  pseudorandom  sequence  generators.  The  chi-square
       distribution  is  calculated  for  the  stream  of  bytes  in the file and expressed as an
       absolute number and a percentage which indicates how frequently a  truly  random  sequence
       would  exceed the value calculated. We interpret the percentage as the degree to which the
       sequence tested is suspected of being non-random. If the percentage is greater than 99% or
       less  than  1%,  the sequence is almost certainly not random. If the percentage is between
       99% and 95% or between 1% and 5%, the sequence is suspect.  Percentages  between  90%  and
       95%  and  5%  and  10% indicate the sequence is "almost suspect". Note that our JPEG file,
       while very dense in information, is far from random as revealed by the chi-square test.

       Applying  this  test  to  the  output  of  various  pseudorandom  sequence  generators  is
       interesting.  The  low-order  8  bits  returned by the standard Unix rand(1) function, for
       example, yields:

       Chi square distribution for 500000 samples is 0.01, and randomly
       would exceed this value 99.99 percent of the times.

       While an improved generator [Park & Miller] reports:

       Chi square distribution for 500000 samples is 212.53, and randomly
       would exceed this value 95.00 percent of the times.

       Thus, the standard Unix generator  (or  at  least  the  low-order  bytes  it  returns)  is
       unacceptably   non-random,   while  the  improved  generator  is  much  better  but  still
       sufficiently non-random to cause concern for demanding  applications.   Contrast  both  of
       these  software generators with the chi-square result of a genuine random sequence created
       by timing radioactive decay events[1]:

       Chi square distribution for 32768 samples is 237.05, and randomly
       would exceed this value 75.00 percent of the times.

       See [Knuth, pp. 35-40] for more information on the chi-square test.  An  interactive  chi-
       square calculator[2] is available at this site.

       This is simply the result of summing all the bytes (bits if the -b option is specified) in
       the file and dividing by the file length. If the data are close to random, this should  be
       about  127.5  (0.5  for -b option output). If the mean departs from this value, the values
       are consistently high or low.

       Each successive sequence of six bytes is used as 24 bit  X  and  Y  coordinates  within  a
       square.  If  the  distance  of  the  randomly-generated point is less than the radius of a
       circle inscribed within the square, the six-byte  sequence  is  considered  a  "hit".  The
       percentage  of hits can be used to calculate the value of Pi. For very large streams (this
       approximation converges very slowly), the value will approach the correct value of  Pi  if
       the sequence is close to random. A 32768 byte file created by radioactive decay yielded:

       Monte Carlo value for Pi is 3.139648438 (error 0.06 percent).

       This quantity measures the extent to which each byte in the file depends upon the previous
       byte. For random sequences, this value (which  can  be  positive  or  negative)  will,  of
       course, be close to zero. A non-random byte stream such as a C program will yield a serial
       correlation coefficient on the order of 0.5. Wildly predictable data such as  uncompressed
       bitmaps will exhibit serial correlation coefficients approaching 1. See [Knuth, pp. 64-65]
       for more details.


       -b     The input is treated as a stream of bits rather than  of  8-bit  bytes.  Statistics
              reported reflect the properties of the bitstream.

       -c     Print a table of the number of occurrences of each possible byte (or bit, if the -b
              option is also specified) value, and the fraction of the overall file  made  up  by
              that  value.   Printable  characters  in the ISO-8859-1 (Latin-1) character set are
              shown along with their decimal byte values. In non-terse output mode,  values  with
              zero occurrences are not printed.

       -f     Fold  upper case letters to lower case before computing statistics. Folding is done
              based on the ISO-8859-1 (Latin-1) character set, with  accented  letters  correctly

       -t     Terse  mode:  output is written in Comma Separated Value (CSV) format, suitable for
              loading into a spreadsheet and easily read by any programming language.  See  Terse
              Mode Output Format below for additional details.

       -u     Print how-to-call information.


       If  no  file  is  specified,  ent obtains its input from standard input.  Output is always
       written to standard output.


       Terse mode is selected by specifying the -t option on the command line. Terse mode  output
       is  written  in Comma Separated Value (CSV) format, which can be directly loaded into most
       spreadsheet programs and is easily read by any programming language. Each  record  in  the
       CSV  file  begins  with a record type field, which identifies the content of the following
       fields. If the -c option is not specified, the terse  mode  output  will  consist  of  two
       records, as follows:


       where  the  italicised  values  in  the  type  1  record  are the numerical values for the
       quantities named in the type 0 column title record. If the_ -b_ option is  specified,  the
       second field of the type 0 record will be "File-bits", and the file_length field in type 1
       record will be given in bits instead of bytes. If the -c option is  specified,  additional
       records are appended to the terse mode output which contain the character counts:


       If  the -b option is specified, only two type 3 records will appear for the two bit values
       v=0 and v=1. Otherwise, 256 type 3 records are included, one for each possible byte value.
       The  second  field of a type 3 record indicates how many bytes (or bits) of value v appear
       in the input, and fraction gives the decimal fraction of the file which has value v (which
       is  equal to the count value of this record divided by the file_length field in the type 1


       Note that the "optimal compression" shown for the file is computed from the byte- or  bit-
       stream  entropy  and  thus reflects compressibility based on a reading frame of the chosen
       width (8-bit bytes or individual bits if the -b option is specified). Algorithms which use
       a  larger  reading  frame,  such  as  the Lempel-Ziv [Lempel & Ziv] algorithm, may achieve
       greater compression if the file contains repeated sequences of multiple bytes.


       This software is in the public domain. Permission to use,  copy,  modify,  and  distribute
       this  software  and  its  documentation for any purpose and without fee is hereby granted,
       without any conditions or restrictions. This software is provided "as is" without  express
       or implied warranty.

       Original text and program by John Walker[3] October 20th, 1998

       Modifications  by  Wesley  J.  Landaker  <⟩ >,
       released under the same terms as above.


       Introduction to Probability and Statistics[4]

              Hamming, Richard W. Coding and Information Theory. Englewood Cliffs  NJ:  Prentice-
              Hall, 1980.

              Knuth,  Donald  E.  The  Art  of  Computer  Programming,  Volume  2 / Seminumerical
              Algorithms. Reading MA: Addison-Wesley, 1969.  ISBN 0-201-89684-2.

       [Lempel & Ziv]
              Ziv J. and A. Lempel. "A Universal Algorithm for Sequential Data Compression". IEEE
              Transactions on Information Theory 23, 3, pp. 337-343.

       [Park & Miller]
              Park, Stephen K. and Keith W. Miller. "Random Number Generators: Good Ones Are Hard
              to Find". Communications of the ACM, October 1988, p. 1192.

       [1]    ⟨⟩

       [2]    ⟨⟩

       [3]    ⟨⟩

       [4]    ⟨⟩

                                           3 April 2018                                    ent(1)