Provided by: afnix_2.8.1-1_amd64

NAME

       txt - standard text processing module

STANDARD TEXT PROCESSING MODULE

       The Standard Text Processing module is an original implementation of an object collection
       dedicated to text processing. Although text scanning is the most common operation performed
       in the field of text processing, the module also provides specialized objects to store and
       index text data. Text sorting and transliteration are also part of this module.

       Scanning concepts
       Text scanning is the ability to extract lexical elements, or lexemes, from a stream. A
       scanner, or lexical analyzer, is the principal object used to perform this task. A scanner
       is built by adding special objects that act as pattern matchers. When a pattern is
       matched, a special object called a lexeme is returned.

       Pattern object
       A Pattern object is a special object that acts as a model for the string to match. There
       are several ways to build a pattern. The simplest way is with a regular expression.
       Another type of pattern is a balanced pattern. In its first form, a pattern object can be
       created with a regular expression object.

       # create a pattern object
       const pat (afnix:txt:Pattern "$d+")

       In this example, the pattern object is built to detect integer objects.

       pat:check "123" # true
       pat:match "123" # 123

       The check method returns true if the input string matches the pattern. The match method
       returns the string that matches the pattern. Since the pattern object can also operate
       with a stream object, the match method is the appropriate one to extract a particular
       string. The pattern object is, as usual, available with the appropriate predicate.

       afnix:txt:pattern-p pat # true

       Another form of pattern object is the balanced pattern. A balanced pattern is determined
       by a starting string and an ending string. There are two types of balanced pattern: the
       single balanced pattern and the recursive balanced pattern. The single balanced pattern is
       appropriate for lexical elements that are delimited by a character. For example, the
       classical C-string is a single balanced pattern with the double quote character.

       # create a balanced pattern
       const pat (afnix:txt:Pattern "ELEMENT" "<" ">")
       pat:check "<xml>" # true
       pat:match "<xml>" # xml

       In the case of the C-string, the pattern might  be  more  appropriately  defined  with  an
       additional  escape  character.  Such  character  is  used  by  the pattern matcher to grab
       characters that might be part of the pattern definition.

       # create a balanced pattern
       const pat (afnix:txt:Pattern "STRING" "'" '\\')
       pat:check "'hello'" # true
       pat:match "'hello'" # "hello"
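
       The behaviour of a single balanced pattern with an escape character can be modelled
       outside of AFNIX. The following is a minimal Python sketch, not the module implementation:
       the function name match_balanced is hypothetical, and keeping the escape character in the
       returned body is an assumption of the sketch.

```python
def match_balanced(text, start, end, escape=None):
    """Return the text enclosed by start/end, or None if text is not an exact match."""
    if not text.startswith(start):
        return None
    i = len(start)
    body = []
    while i < len(text):
        if escape is not None and text[i] == escape and i + 1 < len(text):
            # the escaped character is grabbed even if it looks like a delimiter
            # (assumption: the escape character itself stays in the body)
            body.append(text[i:i + 2])
            i += 2
            continue
        if text.startswith(end, i):
            # exact match only if the closing delimiter ends the input
            return "".join(body) if i + len(end) == len(text) else None
        body.append(text[i])
        i += 1
    return None


print(match_balanced("<xml>", "<", ">"))             # xml
print(match_balanced("'hello'", "'", "'", "\\"))     # hello
```

       Without the escape character, a delimiter inside the string would end the match early and
       the exact match would fail.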

       In this form, a balanced pattern with an escape character is created. The same string is
       used as both the starting and the ending string. Another constructor that takes two
       strings can be used if the starting and ending strings are different. The last pattern
       form is the balanced recursive form. In this form, a starting and an ending string are
       used to delimit the pattern. However, in this mode, a recursive use of the starting and
       ending strings is allowed. In order to have an exact match, the number of starting strings
       must equal the number of ending strings. For example, the C-comment pattern can be viewed
       as a recursive balanced pattern.

       # create a c-comment pattern
       const pat (afnix:txt:Pattern "STRING" "/*" "*/" )
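
       The counting rule for the recursive form (as many starting strings as ending strings) can
       be illustrated with a short Python sketch. The function name match_recursive is
       hypothetical and the code only models the rule described above; it is not the AFNIX
       implementation.

```python
def match_recursive(text, start, end):
    """Return the body of a recursively balanced text, or None if it is not balanced."""
    if not text.startswith(start):
        return None
    depth = 0
    i = 0
    while i < len(text):
        if text.startswith(start, i):
            depth += 1                      # one more opening delimiter
            i += len(start)
        elif text.startswith(end, i):
            depth -= 1                      # one more closing delimiter
            i += len(end)
            if depth == 0:
                # balanced: exact match only if nothing follows
                if i == len(text):
                    return text[len(start):i - len(end)]
                return None
        else:
            i += 1
    return None                             # ran out of input while still nested


print(match_recursive("/* a /* b */ c */", "/*", "*/"))   # prints " a /* b */ c "
print(match_recursive("/* a /* b */", "/*", "*/"))        # None
```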

       Lexeme object
       The Lexeme object is the object built by a scanner that contains the matched string. A
       lexeme is therefore a tagged string. A lexeme can also carry additional information such
       as a source name and a source index.

       # create an empty lexeme
       const lexm (afnix:txt:Lexeme)
       afnix:txt:lexeme-p lexm # true

       The default lexeme is created without any value. A value can be set with the set-value
       method and retrieved with the get-value method.

       lexm:set-value "hello"
       lexm:get-value # hello

       The set-tag and get-tag methods are similar but operate with an integer. The source name
       and the source index are set and retrieved with the same kind of methods.

       # check for the source
       lexm:set-source "world"
       lexm:get-source # world
       # check for the source index
       lexm:set-index 2000
       lexm:get-index # 2000

       Text scanning
       Text scanning is the ability to extract lexical elements or lexemes from an input stream.
       Generally, the lexemes are the results of a matching operation which is defined by a
       pattern object. As a result, the definition of a scanner object is the object itself plus
       one or several pattern objects.

       Scanner construction
       By default, a scanner is created without pattern objects. The length method returns the
       number of pattern objects. As usual, a predicate is associated with the scanner object.

       # the default scanner
       const scanner (afnix:txt:Scanner)
       afnix:txt:scanner-p scanner # true
       # the length method
       scanner:length # 0

       The  scanner  construction proceeds by adding pattern objects. Each pattern can be created
       independently, and later added to the scanner. For example, a scanner that reads real
       numbers, integers and strings can be defined as follows:

       # create the scanner pattern
       const REAL    (
         afnix:txt:Pattern "REAL"    [$d+.$d*])
       const STRING  (
         afnix:txt:Pattern "STRING"  "\"" '\\')
       const INTEGER (
         afnix:txt:Pattern "INTEGER" [$d+|"0x"$x+])
       # add the pattern to the scanner
       scanner:add INTEGER REAL STRING

       The order in which the patterns are added defines the priority with which a token is
       recognized. Binding each pattern to a symbol is optional, since the functional programming
       style permits the creation of patterns directly as arguments. The named style shown here,
       however, makes the scanner definition easier to read.
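
       The effect of the insertion order can be illustrated with a small Python model. The
       sketch below is not AFNIX code: the Python regular expressions are approximate
       translations of the AFNIX ones, and the check function and the overlapping NUMBER pattern
       are hypothetical names introduced for the illustration.

```python
import re

# (name, regex) pairs in priority order; Python regexes stand in for the AFNIX ones
PATTERNS = [
    ("INTEGER", re.compile(r"\d+|0x[0-9a-fA-F]+")),
    ("REAL", re.compile(r"\d+\.\d*")),
    ("STRING", re.compile(r'"(?:\\.|[^"\\])*"')),
]


def check(text, patterns=PATTERNS):
    """Return the name of the first pattern matching the whole text, or None."""
    for name, regex in patterns:
        if regex.fullmatch(text):
            return name
    return None


# a hypothetical pattern that overlaps with INTEGER
NUMBER = ("NUMBER", re.compile(r"\d+(?:\.\d*)?"))

print(check("42", PATTERNS + [NUMBER]))   # INTEGER, since it was added first
print(check("42", [NUMBER] + PATTERNS))   # NUMBER, since it was added first
```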

       Using the scanner
       Once constructed, the scanner can be used as is. A stream is generally  the  best  way  to
       operate.  If  the  scanner reaches the end-of-stream or cannot recognize a lexeme, the nil
       object is returned. With a loop, it is easy to get all lexemes.

       while (trans valid (is:valid-p)) {
         # try to get the lexeme
         trans lexm (scanner:scan is)
         # check for nil lexeme and print the value
         if (not (nil-p lexm)) (println (lexm:get-value))
         # update the valid flag
         valid:= (and (is:valid-p) (not (nil-p lexm)))
       }

       In this loop, it is first necessary to check for the end of the stream. This is done with
       the help of the special loop construct that initializes the valid symbol. As soon as the
       lexeme is built, it can be used. The lexeme holds the value as well as its tag.
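
       The control flow of such a scanning loop can be sketched in Python. Everything in this
       block is a hypothetical model, not the AFNIX classes: the scanner works on a string and a
       position instead of an input stream, it skips blanks between lexemes, and it uses the
       index of the matching pattern as the lexeme tag. All three points are assumptions of the
       sketch.

```python
import re
from dataclasses import dataclass


@dataclass
class Lexeme:
    value: str   # the matched string
    tag: int     # here: index of the pattern that matched (assumption)


class Scanner:
    def __init__(self):
        self.patterns = []

    def add(self, *regexes):
        # patterns are tried in the order in which they were added
        for regex in regexes:
            self.patterns.append(re.compile(regex))

    def length(self):
        return len(self.patterns)

    def scan(self, text, pos):
        """Return (lexeme, next_pos), or (None, pos) if nothing is recognized."""
        while pos < len(text) and text[pos].isspace():
            pos += 1                       # skip blanks (assumption of this sketch)
        for tag, regex in enumerate(self.patterns):
            m = regex.match(text, pos)
            if m and m.end() > pos:
                return Lexeme(m.group(), tag), m.end()
        return None, pos


def scan_all(scanner, text):
    """Collect lexemes until the end of input or the first unrecognized text."""
    lexemes = []
    pos = 0
    while pos < len(text):
        lexm, pos = scanner.scan(text, pos)
        if lexm is None:
            break                          # same role as the nil check in the loop above
        lexemes.append(lexm)
    return lexemes


scanner = Scanner()
scanner.add(r"\d+\.\d*", r"\d+", r'"[^"]*"')   # real, integer, string
for lexm in scan_all(scanner, '12 3.5 "hi" ?'):
    print(lexm.tag, lexm.value)
```

       In this model the real pattern is added before the integer pattern so that 3.5 is not
       split into an integer followed by unrecognized text.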

       Text sorting
       Sorting is one of the primary functions implemented inside the text processing module.
       There are three sorting functions available in the module.

       Ascending and descending order sorting
       The sort-ascent function operates with a vector object and sorts the elements in ascending
       order. Any kind of object can be sorted as long as it supports a comparison method. The
       elements are sorted in place by using a quick sort algorithm.

       # create an unsorted vector
       const v-i (Vector 7 5 3 4 1 8 0 9 2 6)
       # sort the vector in place
       afnix:txt:sort-ascent v-i
       # print the vector
       for (e) (v-i) (println e)

       The sort-descent function is similar to the sort-ascent function except that the objects
       are sorted in descending order.

       Lexical sorting
       The sort-lexical function operates with a vector object and sorts the elements in
       ascending order using a lexicographic ordering relation. Objects in the vector must be
       literal objects or an exception is raised.
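
       The difference between ascending and lexicographic ordering shows up as soon as numbers
       are involved. The following Python sketch models the three functions with in-place list
       sorting. The underscore names are hypothetical stand-ins, and comparing the string
       representation of each element is an assumption about what the lexicographic relation
       does.

```python
def sort_ascent(v):
    v.sort()                    # in place, natural comparison


def sort_descent(v):
    v.sort(reverse=True)        # in place, reversed natural comparison


def sort_lexical(v):
    v.sort(key=str)             # in place, compares string representations (assumption)


v = [10, 9, 2, 33]
sort_ascent(v)
print(v)    # [2, 9, 10, 33]
sort_lexical(v)
print(v)    # [10, 2, 33, 9]
```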

       Transliteration
       Transliteration is the process of changing characters by mapping one to another one. The
       transliteration process operates with a character source and produces a target character
       with the help of a mapping table. The transliteration process is not necessarily
       reversible, as often indicated in the literature.

       Literate object
       The Literate object is a transliteration object that is bound by default to the identity
       mapping function. As usual, a predicate is associated with the object.

       # create a transliterate object
       const tl (afnix:txt:Literate)
       # check the object
       afnix:txt:literate-p tl # true

       The transliteration process can also operate with an escape character in order to map a
       two-character sequence into a single character, as usually found in programming languages.

       # create a transliterate object by escape
       const tl (afnix:txt:Literate '\\')

       Transliteration configuration
       The set-map method configures the transliteration mapping table while the set-escape-map
       method configures the escape mapping table. The mapping is done by setting the source
       character and the target character. For instance, if one wants to map the tabulation
       character to a white space, the mapping table is set as follows:

       tl:set-map '\t' ' '

       The escape mapping table operates the same way. It should be noted that the mapping
       algorithm first translates the input character, possibly yielding an escape character,
       and only then does the escape mapping take place. Note also that the set-escape method can
       be used to set the escape character.

       tl:set-map '' ' '

       Transliteration process
       The transliteration process is done either with a string or an input stream. In the first
       case, the translate method operates with a string and returns a translated string. On the
       other hand, the read method returns a character when operating with a stream.

       # set the mapping characters
       tl:set-map 'h' 'w'
       tl:set-map 'e' 'o'
       tl:set-map 'l' 'r'
       tl:set-map 'o' 'd'
       # translate a string
       tl:translate "helo" # word
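
       The two-step mapping (regular table first, escape table when the result is the escape
       character) can be modelled in a few lines of Python. The class below is a hypothetical
       sketch that mirrors the method names with underscores; it is not the AFNIX implementation,
       and leaving a dangling escape character unchanged is an assumption.

```python
class Literate:
    def __init__(self, escape=None):
        self.escape = escape        # escape character, or None
        self.map = {}               # regular mapping table (identity by default)
        self.escape_map = {}        # escape mapping table (identity by default)

    def set_map(self, source, target):
        self.map[source] = target

    def set_escape_map(self, source, target):
        self.escape_map[source] = target

    def translate(self, text):
        out = []
        chars = iter(text)
        for c in chars:
            c = self.map.get(c, c)              # step 1: regular mapping
            if self.escape is not None and c == self.escape:
                nxt = next(chars, None)         # step 2: consume a second character
                if nxt is None:
                    out.append(c)               # dangling escape kept (assumption)
                else:
                    out.append(self.escape_map.get(nxt, nxt))
                continue
            out.append(c)
        return "".join(out)


tl = Literate()
for source, target in zip("helo", "word"):
    tl.set_map(source, target)
print(tl.translate("helo"))    # word
```

       Each input character is mapped once only: the letter e becomes o, and that o is not
       mapped again to d.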

STANDARD TEXT PROCESSING REFERENCE

       Pattern
       The Pattern class is a pattern matching class based either on a regular expression or on
       balanced strings. In the regex mode, the pattern is defined with a regex and a match is
       said to occur when a regex match is achieved. In the balanced string mode, the pattern is
       defined with a start string and an end string. The balanced mode can be single or
       recursive. Additionally, an escape character can be associated with the class. A name and
       a tag are also bound to the pattern object as a means to ease the integration within a
       scanner.

       Predicate

              pattern-p

       Inheritance

              Object

       Constructors

              Pattern (none)
              The Pattern constructor creates an empty pattern.

              Pattern (String|Regex)
              The Pattern constructor creates a pattern object associated with a regular
              expression. The argument can be either a string or a regular expression object. If
              the argument is a string, it is converted into a regular expression object.

              Pattern (String String)
              The Pattern constructor creates a balanced pattern. The first argument is the start
              balanced string. The second argument is the end balanced string.

              Pattern (String String Character)
              The Pattern constructor creates a balanced pattern with an escape character. The
              first argument is the start balanced string. The second argument is the end balanced
              string. The third argument is the escape character.

              Pattern (String String Boolean)
              The Pattern constructor creates a recursive balanced pattern. The first argument is
              the start balanced string. The second argument is the end balanced string.

       Constants

              REGEX
              The REGEX constant indicates that the pattern is a regular expression.

              BALANCED
              The BALANCED constant indicates that the pattern is a balanced pattern.

              RECURSIVE
              The RECURSIVE constant indicates that the pattern is a recursive balanced pattern.

       Methods

              check -> Boolean (String)
              The check method checks the pattern against the input string. If the verification
              is successful, the method returns true, and false otherwise.

              match -> String (String|InputStream)
              The match method attempts to match an input string or an input stream. If a match
              occurs, the matching string is returned. If the input is a string, the end of the
              string is used as the end condition. If the input is a stream, the end of the
              stream is used as the end condition.

              set-tag -> none (Integer)
              The set-tag method sets the pattern tag. The tag can be further used inside a
              scanner.

              get-tag -> Integer (none)
              The get-tag method returns the pattern tag.

              set-name -> none (String)
              The set-name method sets the pattern name. The name is a symbol identifier for that
              pattern.

              get-name -> String (none)
              The get-name method returns the pattern name.

              set-regex -> none (String|Regex)
              The set-regex method sets the pattern regex either with a string or with a regex
              object. If the method completes successfully, the pattern type is switched to the
              REGEX type.

              set-escape -> none (Character)
              The set-escape method sets the pattern escape character. The escape character is
              used only in balanced mode.

              get-escape -> Character (none)
              The get-escape method returns the escape character.

              set-balanced -> none (String|String String)
              The set-balanced method sets the pattern balanced strings. With one argument, the
              same balanced string is used for starting and ending. With two arguments, the first
              argument is the starting string and the second is the ending string.

       Lexeme
       The Lexeme class is a literal object that is designed to hold the result of a pattern
       match. A lexeme consists of a string (i.e. the lexeme value), a tag and optionally a
       source name (i.e. a file name) and a source index (i.e. a line number).

       Predicate

              lexeme-p

       Inheritance

              Literal

       Constructors

              Lexeme (none)
              The Lexeme constructor creates an empty lexeme.

              Lexeme (String)
              The Lexeme constructor creates a lexeme by value. The string argument is the lexeme
              value.

       Methods

              set-tag -> none (Integer)
              The set-tag method sets the lexeme tag. The tag can be further used inside a
              scanner.

              get-tag -> Integer (none)
              The get-tag method returns the lexeme tag.

              set-value -> none (String)
              The set-value method sets the lexeme value. The lexeme value is generally the
              result of a matching operation.

              get-value -> String (none)
              The get-value method returns the lexeme value.

              set-index -> none (Integer)
              The set-index method sets the lexeme source index. The lexeme source index can be,
              for instance, the source line number.

              get-index -> Integer (none)
              The get-index method returns the lexeme source index.

              set-source -> none (String)
              The set-source method sets the lexeme source name. The lexeme source name can be,
              for instance, the source file name.

              get-source -> String (none)
              The get-source method returns the lexeme source name.

       Scanner
       The Scanner class is a text scanner or lexical analyzer that operates on an input stream
       and matches one or several patterns. The scanner is built by adding patterns to the
       scanner object. With an input stream, the scanner object attempts to build a buffer that
       matches at least one pattern. When such a match occurs, a lexeme is built. When building a
       lexeme, the pattern tag is used to mark the lexeme.

       Predicate

              scanner-p

       Inheritance

              Object

       Constructors

              Scanner (none)
              The Scanner constructor creates an empty scanner.

       Methods

              add -> none (Pattern*)
              The add method adds zero or more pattern objects to the scanner. The priority of a
              pattern is determined by the order in which the patterns are added.

              length -> Integer (none)
              The length method returns the number of pattern objects in this scanner.

              get -> Pattern (Integer)
              The get method returns a pattern object by index.

              check -> Lexeme (String)
              The check method checks that a string is matched by the scanner and returns the
              associated lexeme.

              scan -> Lexeme (InputStream)
              The scan method scans an input stream until a pattern is matched. When a match
              occurs, the associated lexeme is returned.

       Literate
       The Literate class is a transliteration mapping class. Transliteration is the process of
       changing characters by mapping one to another one. The transliteration process operates
       with a character source and produces a target character with the help of a mapping table.
       This transliteration object can also operate with an escape table. In the presence of an
       escape character, an escape mapping table is used instead of the regular one.

       Predicate

              literate-p

       Inheritance

              Object

       Constructors

              Literate (none)
              The Literate constructor creates a default transliteration object.

              Literate (Character)
              The Literate constructor creates a default transliteration object with an escape
              character. The argument is the escape character.

       Methods

              read -> Character (InputStream)
              The read method reads a character from the input stream and translates it with the
              help of the mapping table. A second character might be consumed from the stream if
              the first character is an escape character.

              getu -> Character (InputStream)
              The getu method reads a Unicode character from the input stream and translates it
              with the help of the mapping table. A second character might be consumed from the
              stream if the first character is an escape character.

              reset -> none (none)
              The reset method resets all the mapping tables and installs a default identity
              mapping.

              set-map -> none (Character Character)
              The set-map method sets an entry in the mapping table by using a source and a
              target character. The first character is the source character. The second character
              is the target character.

              get-map -> Character (Character)
              The get-map method returns the target character mapped to a source character. The
              source character is the argument.

              translate -> String (String)
              The translate method translates a string by transliteration and returns a new
              string.

              set-escape -> none (Character)
              The set-escape method sets the escape character.

              get-escape -> Character (none)
              The get-escape method returns the escape character.

              set-escape-map -> none (Character Character)
              The set-escape-map method sets an entry in the escape mapping table by using a
              source and a target character. The first character is the source character. The
              second character is the target character.

              get-escape-map -> Character (Character)
              The get-escape-map method returns the escape target character mapped to a source
              character. The source character is the argument.

       Functions

              sort-ascent -> none (Vector)
              The sort-ascent function sorts the vector argument in ascending order. The vector
              is sorted in place.

              sort-descent -> none (Vector)
              The sort-descent function sorts the vector argument in descending order. The vector
              is sorted in place.

              sort-lexical -> none (Vector)
              The sort-lexical function sorts the vector argument in lexicographic order. The
              vector is sorted in place.