
Provided by: libfuntools-dev_1.4.8-1.1build2_amd64

NAME

       Funtext - Support for Column-based Text Files

SYNOPSIS

       This document contains a summary of the options for processing column-based text files.

DESCRIPTION

       Funtools will automatically sense and process "standard" column-based text files as if
       they were FITS binary tables without any change in Funtools syntax. In particular, you can
       filter text files using the same syntax as FITS binary tables:

         fundisp foo.txt'[cir 512 512 .1]'
         fundisp -T foo.txt > foo.rdb
         funtable foo.txt'[pha=1:10,cir 512 512 10]' foo.fits

       The first example displays a filtered selection of a text file.  The second example
       converts a text file to an RDB file.  The third example converts a filtered selection of a
       text file to a FITS binary table.

       Text files can also be used in Funtools image programs. In this case, you must provide
       binning parameters (as with raw event files), using the bincols keyword specifier:

         bincols=([xname[:tlmin[:tlmax[:binsiz]]]],[yname[:tlmin[:tlmax[:binsiz]]]])

       For example:

         funcnts foo'[bincols=(x:1024,y:1024)]' "ann 512 512 0 10 n=10"

       Standard Text Files

       Standard text files have the following characteristics:

       •   Optional comment lines start with #

       •   Optional blank lines are considered comments

       •   An optional table header consists of the following (in order):

           •   a single line of alpha-numeric column names

           •   an optional line of unit strings containing the same number of cols

           •   an optional line of dashes containing the same number of cols

       •   Data lines follow the optional header and (for the present) consist of the same number
           of columns as the header.

       •   Standard delimiters such as space, tab, comma, semi-colon, and bar.

       Examples:

         # rdb file
         foo1  foo2    foo3    foos
         ----  ----    ----    ----
         1     2.2     3       xxxx
         10    20.2    30      yyyy

         # multiple consecutive whitespace and dashes
         foo1   foo2    foo3 foos
         ---    ----    ---- ----
            1    2.2    3    xxxx
           10   20.2    30   yyyy

         # comma delims and blank lines
         foo1,foo2,foo3,foos

         1,2.2,3,xxxx
         10,20.2,30,yyyy

         # bar delims with null values
         foo1|foo2|foo3|foos
         1||3|xxxx
         10|20.2||yyyy

         # header-less data
         1     2.2   3 xxxx
         10    20.2 30 yyyy

       The default set of token delimiters consists of spaces, tabs, commas, semi-colons, and
       vertical bars. Several parsers are used simultaneously to analyze a line of text in
       different ways.  One way of analyzing a line is to allow a combination of spaces, tabs,
       and commas to be squashed into a single delimiter (no null values between consecutive
       delimiters). Another way is to allow tab, semi-colon, and vertical bar delimiters to
       support null values, i.e. two consecutive delimiters imply a null value (e.g. RDB file).
       A successful parser is one which returns a consistent number of columns for all rows, with
       each column having a consistent data type.  More than one parser can be successful. For
       now, it is assumed that successful parsers all return the same tokens for a given line.
       (Theoretically, there are pathological cases, which will be taken care of as needed). Bad
       parsers are discarded on the fly.
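
       The two ways of analyzing a line described above can be sketched in Python. This is
       only an illustration of the idea, not the Funtools implementation: the function names
       are hypothetical, and the real parsers also check that each column keeps a consistent
       data type, which this sketch omits.

```python
import re

def split_squashing(line):
    """Runs of spaces, tabs, and commas count as one delimiter (no nulls)."""
    return [tok for tok in re.split(r"[ \t,]+", line.strip()) if tok]

def split_with_nulls(line):
    """Each tab, semi-colon, or bar is a delimiter; empty fields are nulls."""
    return [tok.strip() for tok in re.split(r"[\t;|]", line)]

def consistent(split, lines):
    """True when `split` yields the same number of tokens for every line."""
    return len({len(split(line)) for line in lines}) == 1

print(split_squashing("   1    2.2    3    xxxx"))  # ['1', '2.2', '3', 'xxxx']
print(split_with_nulls("1||3|xxxx"))                # ['1', '', '3', 'xxxx']
print(consistent(split_with_nulls, ["1||3|xxxx", "10|20.2||yyyy"]))  # True
```

       A parser whose token count changes from row to row (consistent() returning False in
       this sketch) corresponds to a "bad" parser that would be discarded.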

       If the header does not exist, then names "col1", "col2", etc.  are assigned to the columns
       to allow filtering.  Furthermore, data types for each column are determined by the data
       types found in the columns of the first data line, and can be one of the following:
       string, int, or double. Thus, all of the above examples return the following display:

         fundisp foo'[foo1>5]'
               FOO1                  FOO2       FOO3         FOOS
         ---------- --------------------- ---------- ------------
                 10           20.20000000         30         yyyy
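
       As a rough illustration of how default names and data types could be derived from the
       first data line, consider the following Python sketch. The helper names are
       hypothetical, and the numeric tests rely on Python's own int() and float() parsing,
       which may differ in detail from the rules Funtools applies.

```python
def infer_type(token):
    """Classify one token as 'int', 'double', or 'string'."""
    for name, cast in (("int", int), ("double", float)):
        try:
            cast(token)
            return name
        except ValueError:
            pass
    return "string"

def describe_columns(first_row, header=None):
    """Pair each column name (col1, col2, ... if no header) with its type."""
    names = header or [f"col{i}" for i in range(1, len(first_row) + 1)]
    return list(zip(names, (infer_type(tok) for tok in first_row)))

print(describe_columns(["10", "20.2", "30", "yyyy"]))
# [('col1', 'int'), ('col2', 'double'), ('col3', 'int'), ('col4', 'string')]
```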

       Comments Convert to Header Params

       Comments which precede data rows are converted into header parameters and will be written
       out as such using funimage or funhead. Two styles of comments are recognized:

       1. FITS-style comments have an equal sign "=" between the keyword and value and an
       optional slash "/" to signify a comment. The strict FITS rules on column positions are not
       enforced. In addition, strings only need to be quoted if they contain whitespace. For
       example, the following are valid FITS-style comments:

         # fits0 = 100
         # fits1 = /usr/local/bin
         # fits2 = "/usr/local/bin /opt/local/bin"
         # fits3c = /usr/local/bin /opt/local/bin /usr/bin
         # fits4c = "/usr/local/bin /opt/local/bin" / path dir

       Note that the fits3c comment is not quoted and therefore its value is the single token
       "/usr/local/bin" and the comment is "opt/local/bin /usr/bin".  This is different from the
       quoted comment in fits4c.
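
       The FITS-style rules above (keyword, "=", a quoted or single-token value, and an
       optional "/" comment) can be approximated with a regular expression. The following
       Python sketch is an assumption about how such a line might be split, not the parser
       Funtools uses; parse_fits_comment is a hypothetical name.

```python
import re

FITS_COMMENT = re.compile(
    r'''^\#\s*(?P<key>\w+)\s*=\s*                 # keyword and equal sign
        (?:"(?P<quoted>[^"]*)"|(?P<bare>\S+))     # quoted string or one token
        \s*(?:/\s*(?P<comment>.*))?$              # optional slash comment
    ''',
    re.VERBOSE,
)

def parse_fits_comment(line):
    """Return (keyword, value, comment) or None if the line does not match."""
    m = FITS_COMMENT.match(line.strip())
    if m is None:
        return None
    value = m.group("quoted") if m.group("quoted") is not None else m.group("bare")
    return m.group("key"), value, m.group("comment")

print(parse_fits_comment("# fits3c = /usr/local/bin /opt/local/bin /usr/bin"))
# ('fits3c', '/usr/local/bin', 'opt/local/bin /usr/bin')
print(parse_fits_comment('# fits4c = "/usr/local/bin /opt/local/bin" / path dir'))
# ('fits4c', '/usr/local/bin /opt/local/bin', 'path dir')
```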

       2. Free-form comments can have an optional colon separator between the keyword and value.
       In the absence of quotes, all tokens after the keyword are part of the value, i.e. no
       comment is allowed. If a string is quoted, then slash "/" after the string will signify a
       comment.  For example:

         # com1 /usr/local/bin
         # com2 "/usr/local/bin /opt/local/bin"
         # com3 /usr/local/bin /opt/local/bin /usr/bin
         # com4c "/usr/local/bin /opt/local/bin" / path dir

         # com11: /usr/local/bin
         # com12: "/usr/local/bin /opt/local/bin"
         # com13: /usr/local/bin /opt/local/bin /usr/bin
         # com14c: "/usr/local/bin /opt/local/bin" / path dir

       Note that com3 and com13 are not quoted, so the whole string is part of the value, while
       com4c and com14c are quoted and have comments following the values.
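
       The free-form rules can be sketched in the same way. Again this is only an
       illustration with a hypothetical function name, and it assumes the line has already
       been rejected as a FITS-style comment (a line containing "=" would otherwise be
       absorbed into the value here).

```python
import re

FREE_COMMENT = re.compile(
    r'''^\#\s*(?P<key>\w+):?\s+                              # keyword, optional colon
        (?:"(?P<quoted>[^"]*)"\s*(?:/\s*(?P<comment>.*))?    # quoted value, comment
          |(?P<bare>[^"].*))$                                # or the rest of the line
    ''',
    re.VERBOSE,
)

def parse_free_comment(line):
    """Return (keyword, value, comment) or None if the line does not match."""
    m = FREE_COMMENT.match(line.strip())
    if m is None:
        return None
    value = m.group("quoted") if m.group("quoted") is not None else m.group("bare")
    return m.group("key"), value, m.group("comment")

print(parse_free_comment("# com13: /usr/local/bin /opt/local/bin /usr/bin"))
# ('com13', '/usr/local/bin /opt/local/bin /usr/bin', None)
print(parse_free_comment('# com4c "/usr/local/bin /opt/local/bin" / path dir'))
# ('com4c', '/usr/local/bin /opt/local/bin', 'path dir')
```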

       Some text files have column name and data type information in the header.  You can specify
       the format of column information contained in the header using the "hcolfmt="
       specification. See below for a detailed description.

       Multiple Tables in a Single File

       Multiple tables are supported in a single file. If an RDB-style file is sensed, then a ^L
       (form feed) will signify end of table. Otherwise, an end of table is sensed when a new
       header (i.e., all alphanumeric columns) is found. (Note that this heuristic does not work
       for single column tables where the column type is ASCII and the table that follows also
       has only one column.) You also can specify characters that signal an end of table
       condition using the eot= keyword. See below for details.

       You can access the nth table (starting from 1) in a multi-table file by enclosing the
       table number in brackets, as with a FITS extension:

         fundisp foo'[2]'

       The above example will display the second table in the file.  (Index values start at 1 in
       order to maintain logical compatibility with FITS files, where extension numbers also start
       at 1).

       TEXT() Specifier

       As with ARRAY() and EVENTS() specifiers for raw image arrays and raw event lists
       respectively, you can use TEXT() on text files to pass key=value options to the parsers.
       An empty set of keywords is equivalent to not having TEXT() at all, that is:

         fundisp foo
         fundisp foo'[TEXT()]'

       are equivalent. A multi-table index number is placed before the TEXT() specifier as the
       first token, when indexing into a multi-table:

         fundisp foo'[2,TEXT(...)]'

       The filter specification is placed after the TEXT() specifier, separated by a comma, or in
       an entirely separate bracket:

         fundisp foo'[TEXT(...),circle 512 512 .1]'
         fundisp foo'[2,TEXT(...)][circle 512 512 .1]'

       Text() Keyword Options

       The following is a list of keywords that can be used within the TEXT() specifier (the
       first three are the most important):

       •   delims="[delims]"

           Specify token delimiters for this file. Only a single parser having these delimiters
           will be used to process the file.

             fundisp foo.fits'[TEXT(delims="!")]'
             fundisp foo.fits'[TEXT(delims="\t%")]'

       •   comchars="[comchars]"

           Specify comment characters. You must include "\n" to allow blank lines.  These comment
           characters will be used for all standard parsers (unless delims are also specified).

             fundisp foo.fits'[TEXT(comchars="!\n")]'

       •   cols="[name1:type1 ...]"

           Specify names and data type of columns. This overrides header names and/or data types
           in the first data row or default names and data types for header-less tables.

             fundisp foo.fits'[TEXT(cols="x:I,y:I,pha:I,pi:I,time:D,dx:E,dy:e")]'

           If the column specifier is the only keyword, then the cols= is not required (in
           analogy with EVENTS()):

             fundisp foo.fits'[TEXT(x:I,y:I,pha:I,pi:I,time:D,dx:E,dy:e)]'

           Of course, an index is allowed in this case:

             fundisp foo.fits'[2,TEXT(x:I,y:I,pha:I,pi:I,time:D,dx:E,dy:e)]'

       •   eot="[eot delim]"

           Specify end of table string specifier for multi-table files. RDB files support ^L. The
           end of table specifier is a string and the whole string must be found alone on a line
           to signify EOT. For example:

             fundisp foo.fits'[TEXT(eot="END")]'

           will end the table when a line containing "END" is found. Multiple lines are supported,
           so that:

             fundisp foo.fits'[TEXT(eot="END\nGAME")]'

           will end the table when a line contains "END" followed by a line containing "GAME".

           In the absence of an EOT delimiter, a new table will be sensed when a new header (all
           alphanumeric columns) is found.

       •   null1="[datatype]"

           Specify data type of a single null value in row 1.  Since column data types are
           determined by the first row, a null value in that row will result in an error and a
           request to specify names and data types using cols=. If you have only one null in
           row 1, you don't need to specify all names and columns. Instead, use null1="type" to
           specify its data type.

       •   alen=[n]

           Specify size in bytes for ASCII type columns.  FITS binary tables only support fixed
           length ASCII columns, so a size value must be specified. The default is 16 bytes.

       •   nullvalues=["true"|"false"]

           Specify whether to expect null values.  Give the parsers a hint as to whether null
           values should be allowed. The default is to try to determine this from the data.

       •   whitespace=["true"|"false"]

           Specify whether surrounding white space should be kept as part of string tokens.  By
           default surrounding white space is removed from tokens.

       •   header=["true"|"false"]

           Specify whether to require a header.  This is needed by tables containing all string
           columns (and with no row containing dashes), in order to be able to tell whether the
           first row is a header or part of the data. The default is false, meaning that the
           first row will be data. If a row of dashes is present, the previous row is considered
           the column name row.

       •   units=["true"|"false"]

           Specify whether to require a units line.  Give the parsers a hint as to whether a row
           specifying units should be allowed. The default is to try to determine this from the
           data.

       •   i2f=["true"|"false"]

           Specify whether to allow int to float conversions.  If a column in row 1 contains an
           integer value, the data type for that column will be set to int. If a subsequent row
           contains a float in that same column, an error will be signaled. This flag specifies
           that, instead of an error, the float should be silently truncated to int. Usually, you
           will want an error to be signaled, so that you can specify the data type using cols=
           (or by changing the value of the column in row 1).

       •   comeot=["true"|"false"|0|1|2]

           Specify whether a comment signifies end of table.  If comeot is 0 or false, then
           comments do not signify end of table and can be interspersed with data rows. If the
           value is true or 1 (the default for standard parsers), then non-blank comment lines
           (e.g. lines beginning with '#') signify end of table but blank lines are allowed
           between rows. If the value is 2, then all comments, including blank lines, signify
           end of table.

       •   lazyeot=["true"|"false"]

           Specify whether "lazy" end of table should be permitted (default is true for standard
           formats, except rdb format where explicit ^L is required between tables). A lazy EOT
           can occur when a new table starts directly after an old one, with no special EOT
           delimiter. A check for this EOT condition is begun when a given row contains all
           string tokens. If, in addition, there is a mismatch between the number of tokens in
           the previous row and this row, or a mismatch between the number of string tokens in
           the previous row and this row, a new table is assumed to have been started. For example:

             ival1 sval3
             ----- -----
             1     two
             3     four

             jval1 jval2 tval3
             ----- ----- ------
             10    20    thirty
             40    50    sixty

           Here the line "jval1 ..." contains all string tokens.  In addition, the number of
           tokens in this line (3) differs from the number of tokens in the previous line (2).
           Therefore a new table is assumed to have started. Similarly:

             ival1 ival2 sval3
             ----- ----- -----
             1     2     three
             4     5     six

             jval1 jval2 tval3
             ----- ----- ------
             10    20    thirty
             40    50    sixty

           Again, the line "jval1 ..." contains all string tokens. The number of string tokens in
           the previous row (1) differs from the number of string tokens in the current row (3).
           We therefore assume a new table has been started. This lazy EOT test is not performed
           if lazyeot is explicitly set to false.

       •   hcolfmt=[header column format]

           Some text files have column name and data type information in the header.  For
           example, VizieR catalogs have headers containing both column names and data types:

             #Column e_Kmag  (F6.3)  ?(k_msigcom) K total magnitude uncertainty (4)  [ucd=ERROR]
             #Column Rflg    (A3)    (rd_flg) Source of JHK default mag (6)  [ucd=REFER_CODE]
             #Column Xflg    (I1)    [0,2] (gal_contam) Extended source contamination (10) [ucd=CODE_MISC]

           while Sextractor files have headers containing column names alone:

             #   1 X_IMAGE         Object position along x                         [pixel]
             #   2 Y_IMAGE         Object position along y                         [pixel]
             #   3 ALPHA_J2000     Right ascension of barycenter (J2000)           [deg]
             #   4 DELTA_J2000     Declination of barycenter (J2000)               [deg]

           The hcolfmt specification allows you to describe which header lines contain column
           name and data type information. It consists of a string defining the format of the
           column line, using "$col" (or "$name") to specify placement of the column name, "$fmt"
           to specify placement of the data format, and "$skip" to specify tokens to ignore. You
           also can specify tokens explicitly (or, for those users familiar with how sscanf
           works, you can specify scanf skip specifiers using "%*").  For example, the VizieR
           hcolfmt above might be specified in several ways:

             Column $col ($fmt)    # explicit specification of "Column" string
             $skip  $col ($fmt)    # skip one token
             %*s $col  ($fmt)      # skip one string (using scanf format)

           while the Sextractor format might be specified using:

             $skip $col            # skip one token
             %*d $col              # skip one int (using scanf format)

           You must ensure that the hcolfmt statement only senses actual column definitions, with
           no false positives or negatives.  For example, the first Sextractor specification,
           "$skip $col", will consider any header line containing two tokens to be a column name
           specifier, while the second one, "%*d $col", requires an integer to be the first
           token. In general, it is preferable to specify formats as explicitly as possible.

           Note that the VizieR-style header info is sensed automatically by the funtools
           standard VizieR-like parser, using the hcolfmt "Column $col ($fmt)".  There is no need
           for explicit use of hcolfmt in this case.

       •   debug=["true"|"false"]

           Display debugging information during parsing.
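
       The lazy end-of-table test described under lazyeot above can be sketched in Python as
       follows. This is a simplified reading of the rule, with hypothetical function names;
       it treats anything Python's float() rejects as a string token, which may not match
       the Funtools classification exactly.

```python
def is_string(token):
    """A token is a string if it cannot be read as a number."""
    try:
        float(token)
        return False
    except ValueError:
        return True

def starts_new_table(prev_tokens, cur_tokens):
    """Apply the lazy EOT test to the current row, given the previous row."""
    # The check only begins when the current row is made up of string tokens.
    if not all(is_string(tok) for tok in cur_tokens):
        return False
    # Mismatch in the number of tokens between the two rows.
    if len(prev_tokens) != len(cur_tokens):
        return True
    # Mismatch in the number of string tokens (all current tokens are strings).
    prev_strings = sum(is_string(tok) for tok in prev_tokens)
    return prev_strings != len(cur_tokens)

print(starts_new_table(["3", "four"], ["jval1", "jval2", "tval3"]))     # True
print(starts_new_table(["4", "5", "six"], ["jval1", "jval2", "tval3"]))  # True
print(starts_new_table(["1", "2", "three"], ["4", "5", "six"]))          # False
```

       Note that a table whose data rows are all strings, followed by a header with the
       same number of columns, is not detected by this test, which is why an explicit eot=
       or header= setting can be needed for such files.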

       Environment Variables

       Environment variables are defined to allow many of these TEXT() values to be set without
       having to include them in TEXT() every time a file is processed:

         keyword       environment variable
         -------       --------------------
         delims        TEXT_DELIMS
         comchars      TEXT_COMCHARS
         cols          TEXT_COLUMNS
         eot           TEXT_EOT
         null1         TEXT_NULL1
         alen          TEXT_ALEN
         bincols       TEXT_BINCOLS
         hcolfmt       TEXT_HCOLFMT
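
       When Funtools programs are driven from a script, the table above can be used to
       build the environment for a child process. The Python sketch below only constructs
       the environment dictionary; the helper name is hypothetical, and it assumes each
       variable takes the same value you would otherwise write after the keyword.

```python
# Keyword to environment variable mapping, copied from the table above.
ENV_NAMES = {
    "delims": "TEXT_DELIMS",
    "comchars": "TEXT_COMCHARS",
    "cols": "TEXT_COLUMNS",
    "eot": "TEXT_EOT",
    "null1": "TEXT_NULL1",
    "alen": "TEXT_ALEN",
    "bincols": "TEXT_BINCOLS",
    "hcolfmt": "TEXT_HCOLFMT",
}

def text_environment(options, base=None):
    """Return a copy of `base` with the given keyword options added."""
    env = dict(base) if base is not None else {}
    for key, value in options.items():
        env[ENV_NAMES[key]] = str(value)  # KeyError for unsupported keywords
    return env

print(text_environment({"delims": "|", "alen": 32}))
# {'TEXT_DELIMS': '|', 'TEXT_ALEN': '32'}
```

       The result could be passed as the env argument of subprocess.run(), typically with
       os.environ as the base so that PATH and other settings are preserved.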

       Restrictions and Problems

       As with raw event files, the '+' (copy extensions) specifier is not supported for programs
       such as funtable.

       String to int and int to string data conversions are allowed by the text parsers. This is
       done more by force of circumstance than by conviction: these transitions often happen
       with VizieR catalogs, which we want to support fully. One consequence of allowing these
       transitions is that the text parsers can get confused by columns which contain a valid
       integer in the first row and then switch to a string. Consider the following table:

         xxx   yyy     zzz
         ----  ----    ----
         111   aaa     bbb
         ccc   222     ddd

       The xxx column has an integer value in row one and a string in row two, while the yyy column
       has the reverse. The parser will erroneously treat the first column as having data type
       int:

         fundisp foo.tab
                XXX          YYY          ZZZ
         ---------- ------------ ------------
                111        'aaa'        'bbb'
         1667457792        '222'        'ddd'

       while the second column is processed correctly. This situation can be avoided in any
       number of ways, all of which force the data type of the first column to be a string. For
       example, you can edit the file and explicitly quote the first row of the column:

         xxx   yyy     zzz
         ----  ----    ----
         "111" aaa     bbb
         ccc   222     ddd

         [sh] fundisp foo.tab
                  XXX          YYY          ZZZ
         ------------ ------------ ------------
                '111'        'aaa'        'bbb'
                'ccc'        '222'        'ddd'

       You can edit the file and explicitly set the data type of the first column:

         xxx:3A   yyy  zzz
         ------   ---- ----
         111      aaa  bbb
         ccc      222  ddd

         [sh] fundisp foo.tab
                  XXX          YYY          ZZZ
         ------------ ------------ ------------
                '111'        'aaa'        'bbb'
                'ccc'        '222'        'ddd'

       You also can explicitly set the column names and data types of all columns, without
       editing the file:

         [sh] fundisp foo.tab'[TEXT(xxx:3A,yyy:3A,zzz:3a)]'
                  XXX          YYY          ZZZ
         ------------ ------------ ------------
                '111'        'aaa'        'bbb'
                'ccc'        '222'        'ddd'

       The issue of data type transitions (which to allow and which to disallow) is still under
       discussion.
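
       The value 1667457792 shown for the xxx column in the erroneous display above is
       consistent with the bytes of the string "ccc" being read as a 4-byte integer. The
       following Python sketch reproduces the number under two assumptions that the text
       does not state: the string is padded with a NUL byte, and the bytes are read in
       big-endian order (the byte order used by FITS).

```python
import struct

def string_as_int32(text):
    """Reinterpret up to 4 ASCII characters, NUL-padded, as a big-endian int."""
    raw = text.encode("ascii")[:4].ljust(4, b"\x00")
    return struct.unpack(">i", raw)[0]

print(string_as_int32("ccc"))  # 1667457792, i.e. 0x63636300
```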

SEE ALSO

       See funtools(7) for a list of Funtools help pages