NAME

       gztool - extract random-positioned data from gzip files, even like `tail -f`

SYNOPSIS

       gztool
        [ [-[abLnsv] #] [-[1..9]AcCdDeEfFhilpPrRStTwWxXz|u[cCdD]] [-I <INDEX>] ] "files" ...

       Note that actions `-bcStT` create an index file (if none exists) INTERLEAVED with the data
       flow. Because data flow and index creation happen at the same time, no time is wasted. You
       can also interrupt any action at any moment: the partial index file will be reused (and
       completed if necessary) on the next gztool run over the same data.

DESCRIPTION

       gztool is a gzip file indexer, compressor and data retriever. It can create small indexes
       for gzipped files and use them for quick, random-positioned data extraction.

       gztool can extract random-positioned data from gzip files with no penalty, and can also
       tail a gzip file as `tail -f` does.

       gztool creates an index file (.gzi) for every gzip file it processes, and this action is
       interleaved with compression/decompression so no time is wasted. Any action can be
       interrupted at any moment, and the partial index will be reused on later runs.

       Extraction  is  possible  from  any byte (or line) position in the uncompressed data using
       `-b` (or `-L` for lines).

       If the uncompressed file is a text file, then using the `-x` modifier the index will  take
       care  of  lines,  so  later gztool can be requested to extract data from a particular line
       with `-L`.

       See the full list of capabilities in INTERNALS.

OPTIONS

       files  One or more files. If no file is indicated, standard input is used.

       -[1..9]
              compression factor to use  with  `-[c|u[cC]]`,  from  best  speed  (`-1`)  to  best
              compression (`-9`). Default is `-6`.

       -a #   Wait # seconds between reads when using `-[ST]` or `-Ec`. Default is 4 s.

       -A     modifier  for  `-[rR]`  to  indicate  the  range of bytes/lines in absolute values,
              instead of the default incremental values.

       -b #   extract data from indicated uncompressed byte position of gzip  file  (creating  or
              reusing  an  index file) to STDOUT.  Accepts '0', '0x', and suffixes 'kmgtpe' (^10)
              or 'KMGTPE' (^2).

       -C     always create a 'Complete' index file, ignoring possible errors.

       -c     compress a file like with gzip, creating an index at the same time.

       -d     decompress a file like with gzip.

       -D     do not delete original file when using `-[cd]`.

       -e     if multiple files are indicated, continue on error (if any).

       -E     end processing on first GZIP end of file marker at  EOF.   Nonetheless  with  `-c`,
              `-E` waits for more data even at EOF.

       -f     force file overwriting if destination file already exists.

       -F     force  index creation/completion first, and then action: if `-F` is not used, index
              is created interleaved with actions.

       -h     print brief help; `-hh` prints this help.

       -i     create index for indicated gzip file (For 'file.gz' the  default  index  file  name
              will be 'file.gzi'). This is the default action.

       -I string
              index file name will be the indicated string.

       -l     check  and  list info contained in indicated index file.  `-ll` and `-lll` increase
              the level of index checking detail.

       -L #   extract data from indicated uncompressed line position of gzip  file  (creating  or
              reusing  an  index file) to STDOUT.  Accepts '0', '0x', and suffixes 'kmgtpe' (^10)
              or 'KMGTPE' (^2).

       -n #   indicates that the first byte on compressed input is #, not  1,  and  so  truncated
              compressed inputs can be used if an index exists.

       -p     indicates  that  the  gzip  input  stream  may  be  composed of various incorrectly
              terminated GZIP streams, and so then a careful Patching of the input may be  needed
              to extract correct data.

       -P     like `-p`, but when used with `-[ST]` implies that checking for errors in stream is
              made as quick as possible as the gzip file grows. Warning: this may  lead  to  some
              errors not being patched.

       -r #   (range):  Number  of  bytes  to extract when using `-[bL]`.  Accepts '0', '0x', and
              suffixes 'kmgtpe' (^10) 'KMGTPE' (^2).

       -R #   (Range): Number of lines to extract when using `-[bL]`.   Accepts  '0',  '0x',  and
              suffixes 'kmgtpe' (^10) 'KMGTPE' (^2).

       -s #   span in uncompressed MiB between index points when creating the index. Default is
              `10`.

       -S     Supervise indicated file.  Create a growing index, for a still-growing  gzip  file.
              (`-i` is implicit).

       -t     tail (extract last bytes) to STDOUT on indicated gzip file.

       -T     tail  (extract  last  bytes)  to  STDOUT  on indicated still-growing gzip file, and
              continue Supervising & extracting to STDOUT.

       -u [cCdD]
              utility to compress (`-u c`) or decompress (`-u d`) zlib-format  files  to  STDOUT.
              Use `-u C` and `-u D` to manage raw compressed files. No index involved.

       -v #   output verbosity from `0` (none) to `5` (nuts). Default is `1` (normal).

       -w     wait for creation if file doesn't exist, when using `-[cdST]`.

       -W     do not Write index to disk. But if one is already available read and use it. Useful
              if the index is still under a `-S` run.

       -x     create index with line number information (win/*nix compatible).
              Note that gztool's index counts the last line even if the last char isn't a
              newline char, whereas the `wc` command will not count it in that case.
              This is implicit unless `-X` or `-z` are indicated.

       -X     like `-x`, but newline character is '\r' (old mac).

       -z     create index without line number information.
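
       The line-counting rule described under `-x` (and `-X`, with '\r' as the newline) can be
       sketched as follows. This is only an illustration of the rule as stated above, not
       gztool's code; the function names are hypothetical.

```python
def count_lines_index_style(data: bytes, newline: bytes = b"\n") -> int:
    """Count lines as described for the index: an unterminated last line counts."""
    terminated = data.count(newline)
    has_unterminated_tail = bool(data) and not data.endswith(newline)
    return terminated + (1 if has_unterminated_tail else 0)


def count_lines_wc_style(data: bytes) -> int:
    """Count lines as `wc -l` does: only newline characters are counted."""
    return data.count(b"\n")
```

       For the input "one\ntwo" (no trailing newline) the first function counts two lines while
       the `wc`-style count is one, so line numbers near the end of a file can differ by one
       between the two tools.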

QUICK EXAMPLE

       Extract data from byte 1 GiB (byte 2^30) onwards, from `myfile.gz` to the file
       `myfile.txt`. gztool will also create (or reuse, or complete) an index file named
       `myfile.gzi`:

           $ gztool -b 1G myfile.gz > myfile.txt
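
       The numeric arguments of `-b`, `-L`, `-r` and `-R` accept the suffixes 'kmgtpe' (powers
       of 10) and 'KMGTPE' (powers of 2), so `1G` above means 2^30. A minimal sketch of that
       convention follows. The function name `parse_size` is hypothetical, and treating a
       leading '0' as octal and '0x' as hexadecimal (as C's strtol does) is an assumption, since
       the option text only says that '0' and '0x' are accepted.

```python
import re

# Lowercase suffixes are powers of 10, uppercase ones are powers of 2.
_DECIMAL = {s: 10 ** (3 * (i + 1)) for i, s in enumerate("kmgtpe")}
_BINARY = {s: 2 ** (10 * (i + 1)) for i, s in enumerate("KMGTPE")}
_PATTERN = re.compile(
    r"(0[xX][0-9a-fA-F]+|0[0-7]*|[1-9][0-9]*)([kmgtpeKMGTPE]?)"
)


def parse_size(text: str) -> int:
    """Convert strings such as '1G', '99m' or '0x10' to an integer."""
    match = _PATTERN.fullmatch(text)
    if match is None:
        raise ValueError(f"not a valid size: {text!r}")
    digits, suffix = match.groups()
    if digits.lower().startswith("0x"):
        value = int(digits, 16)      # assumption: '0x' prefix means hexadecimal
    elif digits.startswith("0"):
        value = int(digits, 8)       # assumption: leading '0' means octal
    else:
        value = int(digits)
    multiplier = _DECIMAL.get(suffix) or _BINARY.get(suffix) or 1
    return value * multiplier
```

       Under these assumptions `parse_size("1G")` is 2^30 and `parse_size("1m")` is 1000000,
       which matches the byte positions quoted in the examples of this page.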

MORE EXAMPLES

       * Make an index for `test.gz`. The index will be named `test.gzi`:

           $ gztool -i test.gz

       * Make an index for `test.gz` with name `test.index`, using `-I`:

           $ gztool -I test.index test.gz

       * Also `-I` can be used to indicate the complete path to an index  in  another  directory.
       This  way  the  directory  where the gzip file resides could be read-only and the index be
       created in another read-write path:

           $ gztool -I /tmp/test.gzi test.gz

       * Retrieve data from uncompressed byte position 1000000 inside test.gz. Index file will be
       created at the same time (named `test.gzi`):

           $ gztool -b 1m test.gz

       * Supervise a still-growing gzip file and generate the index for it on the fly. The index
       file name will be `openldap.log.gzi` in this case. `gztool` will execute until interrupted
       (it can also stop at first end-of-gzip data with `-E`):

           $ gztool -S openldap.log.gz

       *  The  previous command can be sent to background and with no verbosity, so we can forget
       about it:

           $ gztool -v0 -S openldap.log.gz &

       * Create an index for all "*gz" files in a directory:

           $ gztool -i *gz

       * Extract data from `project.gz` byte at 1 GiB to STDOUT, and use `grep` on  this  output.
       Index file name will be `project.gzi`:

           $ gztool -b 1G project.gz | grep -i "balance = "

       * Note that STDOUT is used for data extraction with `-bcdtT`, so an explicit command line
       redirection is needed if the output is to be stored in a file:

           $ gztool -b 99m project.gz > uncompressed.data

       * Extract data from a gzipped file whose index is still growing under a `gztool -S`
       process that is monitoring the (still-growing) gzip file: in this case `-W` prevents
       `gztool` from updating the index on disk, so the other process is not disturbed. (Note
       that `gztool` always tries to update the index in use if it thinks it's necessary):

           $ gztool -Wb 100k still-growing-gzip-file.gz > mytext

       * Extract data from line 10 million, to STDOUT:

           $ gztool -L 10m compressed_text_file.gz

       * Note, however, that if in the previous example an index was previously created for the
       gzip file without the `-x` parameter (or not using `-L`), it doesn't contain line
       numbering info, so `gztool` will complain and stop. This can be circumvented by telling
       `gztool` to use a new index file name (`-I`), or to use no index at all by combining `-W`
       (do not write index) with an index file name that doesn't exist (in this case `None`; it
       won't be created because of `-W`), so that, this time only, the gzip file is processed
       from the beginning:

           $ gztool -L 10m -WI None compressed_text_file.gz

       * Extract all data from an rsyslog veryRobustZip file
       (//www.rsyslog.com/doc/v8-stable/configuration/modules/omfile.html#veryrobustzip) that
       contains dirty data. These *corrupted-gzip-files* can arise when using rsyslog's
       veryRobustZip omfile option and the process that is logging is abruptly terminated and
       then restarted: this produces an incorrectly terminated gzip stream that is followed by
       another gzip stream **in the same file**. Neither `gzip` nor `zlib` can read these files
       beyond the point of error, but `gztool` can correctly extract all data (and only good
       data) using the `-p` (*patch*) parameter:

           $ gztool -p -b0 compressed_text_file.gz

       This creates, as usual, the index file `compressed_text_file.gzi`. In order to not  create
       it, `-W` (*do not Write index*) can be used:

           $ gztool -pWb0 compressed_text_file.gz

       Note that `-p` can require up to twice the time for decompression, because it performs two
       decompression processes: the usual one, and another one that runs **in advance** of the
       usual one and detects errors, marks them, and finds new entry points to end/begin the
       decompression, circumventing the problems.
       Note also that these *corrupted-gzip-files* should always be decompressed with the `-p`
       parameter, even if a `gztool` index file exists for them, because the index file stores
       entry points but does not store where errors occur in the `gzip` file. That said, if the
       `-[bL]` point of extraction is beyond the point(s) of error in the `gzip` file and an
       index file exists, then decompression can proceed fine without `-p`, as the index points
       stored in the index file are always clean.
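
       To make the problem concrete, the sketch below builds such a file (an incorrectly
       terminated gzip stream followed by a complete one) and recovers what it can with a
       deliberately naive strategy: it restarts decompression at every occurrence of the gzip
       magic bytes. This is NOT the algorithm behind `-p`, which works ahead of the output as
       described above; the function name `naive_recover` is hypothetical, and a magic-byte
       search can misfire if the same three bytes happen to appear inside compressed data.

```python
import zlib

GZIP_MAGIC = b"\x1f\x8b\x08"  # ID1, ID2 and CM=deflate, as defined by RFC 1952


def naive_recover(blob: bytes) -> bytes:
    """Decompress each candidate gzip member separately, keeping partial output."""
    starts = []
    pos = blob.find(GZIP_MAGIC)
    while pos != -1:
        starts.append(pos)
        pos = blob.find(GZIP_MAGIC, pos + 1)

    out = bytearray()
    for begin, end in zip(starts, starts[1:] + [len(blob)]):
        member = zlib.decompressobj(wbits=31)  # 31: expect gzip header and trailer
        try:
            # A truncated member simply yields the bytes decoded so far.
            out += member.decompress(blob[begin:end])
        except zlib.error:
            continue  # false positive or damaged member: skip it
    return bytes(out)
```

       The output of the truncated stream is only a prefix of its original data, which is why a
       tool that wants "all data, and only good data" needs something more careful than this.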

       * When tailing a still-growing gzip file (`-T`) that could contain errors at some point,
       one may still want to obtain output from the gzip stream as soon as possible: this is
       what the patching option `-P` is for (like `-p` but capitalized). With `-p`, `gztool`
       decompresses the stream about 48 kiB ahead of the output that is actually shown/written,
       in order to catch possible gzip-stream errors ahead of output and so always maintain a
       clean output without error-introduced artifacts. This has the side effect that output
       must always wait for those 48 kiB of data to be available in advance, which can take a
       very long time if the file grows slowly. With `-P` the buffer-ahead restriction is relaxed
       to however few bytes are available before reaching end-of-file and waiting for new data,
       so responsiveness is as quick as without `-p`. The side effect of `-P` is that, depending
       on the gzip file, some errors may lead to incorrect output being shown/written, though in
       this case a "PATCHING WARNING" would be shown (on stderr).

           $ gztool -PT application_log.gz

       The same applies to `-S` though in this case there's no output, as only the index is being
       constructed:

           $ gztool -PS application_log.gz

       * To tail a still-growing gzip file to stdout, like `tail -f` (an index file will be
       created with name `still-growing-gzip-file.gzi` in this case):

           $ gztool -T still-growing-gzip-file.gz
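
       The idea of following a gzip file as it grows can be illustrated with incremental
       decompression: each time new compressed bytes appear, they are fed to a decompressor
       that keeps its state between calls. The sketch below shows only that part, without
       indexes, waiting or error patching; it is not gztool's implementation, and the class
       name `IncrementalGunzip` is hypothetical.

```python
import zlib


class IncrementalGunzip:
    """Decompress gzip data that arrives in arbitrary pieces."""

    def __init__(self) -> None:
        self._member = zlib.decompressobj(wbits=31)  # 31: gzip header and trailer

    def feed(self, chunk: bytes) -> bytes:
        """Return whatever can be decompressed from the newly arrived bytes."""
        out = bytearray()
        while chunk:
            out += self._member.decompress(chunk)
            if not self._member.eof:
                break  # member not finished yet: wait for more input
            # The current gzip member ended; leftover bytes start the next one.
            chunk = self._member.unused_data
            self._member = zlib.decompressobj(wbits=31)
        return bytes(out)
```

       A caller would read whatever was appended to the file since the last read, pass it to
       `feed()`, write the result to stdout, and then sleep for a while (compare `-a`).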

       * More on files still being "Supervised" (`-S`) by another  `gztool`  instance:  they  can
       also be tailed à la `tail -f` without updating the index on disk using `-W`:

           $ gztool -WT still-growing-gzip-file.gz

       * Compress (`-c`) a still-growing (`-E`) file: in this case both `still-growing-file.gz`
       and `still-growing-file.gzi` will be created on the fly as the source file grows.
       Note that in order to terminate compression, Ctrl+C must be used to kill gztool: this
       results in an incomplete gzip file as per the GZIP standard, but this is not important as
       it will contain all the source data, and both `gzip` and `gztool` (or any other tool) can
       correctly and completely decompress it:

           $ gztool -Ec still-growing-file

       * If you have an incomplete index file (it lacks only the length of the source data,
       because it didn't finish correctly) and want to complete it so that the length of the
       uncompressed data is stored, just unconditionally complete it with `-C` in a new `-i` run
       over your gzip file. Because the existing index data is used (in this case the file
       `my-incomplete-gzip-data.gzi`), only the last compressed bytes are decompressed to
       complete this action:

           $ gztool -Ci my-incomplete-gzip-data.gz

       * Decompress a file like with gzip (`-d`), but do not delete (`-D`) the original one. The
       decompressed file will be `myfile`. Note that the gzipped file must have a ".gz"
       extension or `gztool` will complain:

           $ gztool -Dd myfile.gz

       * Decompress a file that does not have ".gz" file extension, like with gzip (`-d`):

           $ cat mycompressedfile | gztool -d > my_uncompressed_file

       * Show internals of all index files in this directory. `-e` is used so that the process
       does not stop on the first error if a `*.gzi` file is not a valid gzip index file. The
       `-ll` list option repetition shows data about each index point. `-lll` also decompresses
       each point's window to ensure index integrity:

           $ gztool -ell *.gzi

       If `gztool` finds the gzip file companion of the index file, some statistics are shown,
       like the index/gzip size ratio, or the compression ratio of the gzip file. Also, if the
       gzip is complete, the size of the uncompressed data is shown. This number is interesting
       if the gzip file is bigger than 4 GiB, in which case `gunzip -l` cannot correctly
       calculate it as it is limited to a 32 bit counter (see
       //tools.ietf.org/html/rfc1952#page-5), or if the gzip file is in `bgzip` format, in which
       case `gunzip -l` would only show data about the first block (< 64 kiB).
       Note that `gztool -l` tries to guess the companion gzip file of the index by looking for
       a file with the same name, but without the `i` of the `.gzi` file name extension, or
       without the `.gzi`. The gzip file name can also be indicated directly with this format:

           $ gztool -l -I index_filename gzip_filename

       In this latter case only a pair of index+gzip filenames can be indicated with each use.
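
       The 32 bit counter mentioned above is the ISIZE field that RFC 1952 places in the last
       four bytes of a gzip member: the uncompressed size modulo 2^32, stored little-endian. The
       sketch below reads that field and shows the wrap-around arithmetic; the function names
       are hypothetical and the code only looks at the trailer of a single-member file.

```python
import struct


def stored_isize(gz_bytes: bytes) -> int:
    """Return the ISIZE field: the last 4 bytes of a gzip member, little-endian."""
    return struct.unpack("<I", gz_bytes[-4:])[0]


def isize_for(uncompressed_length: int) -> int:
    """Value that ends up in ISIZE for a given uncompressed length."""
    return uncompressed_length % 2 ** 32
```

       For 5 GiB of uncompressed data `isize_for(5 * 2**30)` gives 2^30, that is, a tool relying
       only on the trailer would report 1 GiB.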

       * Use a truncated gzip file (the first 100000 bytes are removed, not zeroed; if they are
       zeroed the same cautions apply, but `-n` is not needed) to extract from byte 20 MiB, using
       a previously generated index: as long as the `-b` parameter refers to a byte after an
       index point (see `-ll`) and `-n` is less than that first needed index point, this is
       always possible. In this case `-I gzip_filename.gzi` is implicit:

           $ gztool -n 100001 -b 20M gzip_filename.gz

       Take into account that, as shown, the first byte of the truncated `gzip_filename.gz` file
       is numbered **100001**, that is, the bytes retain the order number in which they appear in
       the original file (that's the reason why it is not byte *1*).
       Note that index point positions in the index file may also require the previous byte to
       be available in the truncated gzip file, as a gzip stream is not byte-aligned but a stream
       of pure bits. Thus, if you're thinking of truncating a gzip file, always do it at least
       one byte before the indicated index point in the gzip: as said, it may not be needed, but
       in 7 of 8 cases it is.
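
       The numbering convention can be checked with a little arithmetic. In the sketch below
       the function names are hypothetical, and it is an assumption that index point positions
       use the same 1-based byte numbering as `-n`.

```python
def truncated_offset(original_byte_number: int, first_byte_number: int) -> int:
    """Map a 1-based byte number of the original gzip file to a 0-based offset
    in the truncated file, whose first byte keeps its original number (-n)."""
    if original_byte_number < first_byte_number:
        raise ValueError("that byte was removed by the truncation")
    return original_byte_number - first_byte_number


def keeps_previous_byte(first_byte_number: int, index_point_byte: int) -> bool:
    """True if the byte just before an index point survived the truncation.

    Assumption: index points are numbered like the bytes passed to -n.
    """
    return first_byte_number <= index_point_byte - 1
```

       With `-n 100001`, original byte 100001 sits at offset 0 of the truncated file, and an
       index point is safely usable only if it lies at byte 100002 or later.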

       * Since v1.5.0, using `-[fW]` (`-f`: force index overwriting; `-W`: do not write index)
       with `-[ST]` (`-S`: create index on still-growing gzip file; `-T`: tail and continue
       decompressing to stdout) tells `gztool` to continue operating even after the source file
       is overwritten. If using `-f`, the index file will be overwritten. For example:

           $ gztool -WT log_filename.gz
           ...
           File overwriting detected and restarting decompression...
           Processing 'log_filename.gz'...

INTERNALS

       By default, gzip-compressed files cannot be accessed in random mode: reading any byte at
       position N requires the complete gzip file to be decompressed from the beginning up to
       byte N. Nonetheless, Mark Adler, the author of zlib (//github.com/madler/zlib), provided
       years ago a cryptic file named `zran.c`
       (//github.com/madler/zlib/blob/master/examples/zran.c) that creates an "index" of
       "windows" filled with 32 kiB of uncompressed data at different positions along the
       un/compressed file, which can be used to initialize the zlib library and make it behave as
       if the compressed data began there.
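
       The general idea of an index of access points can be sketched with a simpler, related
       technique. zran.c indexes an existing stream and must store a 32 kiB window per point;
       the sketch below instead creates the access points while compressing, using zlib's full
       flush so that each point is byte-aligned and needs no window. It is therefore not what
       zran.c does, nor necessarily what gztool does; the names `compress_with_index` and
       `extract` are hypothetical, and the data is raw deflate without a gzip header.

```python
import zlib


def compress_with_index(data: bytes, span: int):
    """Compress non-empty data, recording an access point every `span` bytes."""
    comp = zlib.compressobj(level=6, wbits=-15)  # -15: raw deflate stream
    out = bytearray()
    index = []  # (uncompressed_offset, compressed_offset) pairs
    for start in range(0, len(data), span):
        index.append((start, len(out)))
        out += comp.compress(data[start:start + span])
        # A full flush byte-aligns the output and forgets the history, so
        # decompression can start from the next recorded offset.
        out += comp.flush(zlib.Z_FULL_FLUSH)
    out += comp.flush()  # finish the stream
    return bytes(out), index


def extract(compressed: bytes, index, offset: int, length: int) -> bytes:
    """Return `length` uncompressed bytes starting at `offset`."""
    # Pick the last access point at or before the requested offset.
    point_unc, point_comp = max(p for p in index if p[0] <= offset)
    skip = offset - point_unc
    inflater = zlib.decompressobj(wbits=-15)
    chunk = inflater.decompress(compressed[point_comp:], skip + length)
    return chunk[skip:skip + length]
```

       Only the data between the chosen access point and the requested position has to be
       decompressed, which is the same benefit an index gives for an ordinary gzip file; the
       span plays the role of `-s`.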

       gztool builds upon zran.c to provide a useful command line tool. Also, some optimizations
       have been made:

       *  gztool  can  correctly read incomplete gzip-concatenated-files (using `-p`), that is, a
       gzip composed of a concatenation  of  `gzip`  files,  some  of  which  are  not  correctly
       terminated. This can happen, for example, when using rsyslog's veryRobustZip omfile option
       (//www.rsyslog.com/doc/v8-stable/configuration/modules/omfile.html#veryrobustzip) and  the
       process that is logging is abruptly terminated and then restarted.

       *  gztool  can  store  line numbering information in the index (use only if source data is
       text!), and retrieve data from a specific line number using  `-L`.  (Using  `-[xXz]`  when
       creating  the index selects Unix new line format (default), old Mac new line format, or no
       line information respectively.)

       * gztool can Supervise a still-growing gzip file (for example, a log created by rsyslog
       directly in gzip format) and generate the index on the fly, thus reducing the index
       creation time to practically zero. See `-S`.

       * extraction of data and index creation are interleaved, so there's no waste of  time  for
       the index creation.

       * index files are reusable, so they can be stopped at any time and reused and/or completed
       later.

       * an ex novo index file format has been created to store the index

       * span between index points is raised by default from 1 MiB to 10 MiB, and can be adjusted
       with `-s` (span).

       * windows are compressed in file

       *  windows  are  not  loaded  in  memory  unless they're needed, so the application memory
       footprint is fairly low (< 1 MiB)

       * gztool can compress files (`-c`) and at the same time generate an index  that  is  about
       10-100  times  smaller  than  if  the  index  is generated after the file has already been
       compressed with gzip.

       * Compatible with `bgzip` files (//www.htslib.org/doc/bgzip.html)

       * Compatible with complete `gzip` concatenated files

       * Compatible  with  rsyslog's  veryRobustZip  omfile  option  (variable-short-uncompressed
       complete-gzip-block sizes)

       * data can be provided from/to stdin/stdout

       * gztool can be used to remotely retrieve just a small part of a bigger gzip compressed
       file and successfully decompress it locally. See
       //unix.stackexchange.com/questions/429197/#541903 . Just note that the gztool index file
       must also be available.

PROJECT HOME PAGE

       //github.com/circulosmeos/gztool

AUTHOR

       This program was written by Roberto S. Galende <roberto.s.galende@gmail.com>, building on
       Mark Adler's zlib (examples/zran.c), and is copyrighted under the zlib licence terms.
