Provided by: afnix_3.5.0-3_amd64

NAME

       txt - standard text processing module

STANDARD TEXT PROCESSING MODULE

       The Standard Text Processing module is an original implementation of an object collection
       dedicated to text processing. Although text scanning is the most common operation
       performed in the field of text processing, the module also provides specialized objects to
       store and index text data. Text sorting and transliteration are also part of this module.

       Scanning concepts
       Text scanning is the ability to extract lexical elements, or lexemes, from a stream. A
       scanner, or lexical analyzer, is the principal object used to perform this task. A scanner
       is created by adding special objects that act as pattern matchers. When a pattern is
       matched, a special object called a lexeme is returned.

       Pattern object
       A Pattern object is a special object that acts as a model for the string to match. There
       are several ways to build a pattern. The simplest way is with a regular expression.
       Another type of pattern is a balanced pattern. In its first form, a pattern object can be
       created with a regular expression object.

       # create a pattern object
       const pat (afnix:txt:Pattern "$d+")

       In this example, the pattern object is built to detect integer objects.

       pat:check "123" # true
       pat:match "123" # 123

       The check method returns true if the input string matches the pattern. The match method
       returns the string that matches the pattern. Since the pattern object can also operate
       with a stream object, the match method is the appropriate way to retrieve the matched
       string. As usual, a predicate is available for the pattern object.

       afnix:txt:pattern-p pat # true

       Another form of pattern object is the balanced pattern. A balanced pattern is determined
       by a starting string and an ending string. There are two types of balanced pattern: the
       single balanced pattern and the recursive balanced pattern. The single balanced pattern is
       appropriate for lexical elements that are delimited by a character. For example, the
       classical C-string is a single balanced pattern with the double quote character.

       # create a balanced pattern
       const pat (afnix:txt:Pattern "ELEMENT" "<" ">")
       pat:check "<xml>" # true
       pat:match "<xml>" # xml

       In the case of the C-string, the pattern might be more appropriately defined with an
       additional escape character. Such a character is used by the pattern matcher to accept
       characters that would otherwise be interpreted as part of the pattern definition, for
       instance a delimiter that appears inside the string.

       # create a balanced pattern
       const pat (afnix:txt:Pattern "STRING" "'" '\\')
       pat:check "'hello'" # true
       pat:match "'hello'" # "hello"

       In this form, a balanced pattern with an escape character is created. The same string is
       used for both the starting and ending string. Another constructor that takes two strings
       can be used if the starting and ending strings are different. The last pattern form is the
       balanced recursive form. In this form, a starting and an ending string are used to delimit
       the pattern. However, in this mode, a recursive use of the starting and ending strings is
       allowed. In order to have an exact match, the number of starting strings must equal the
       number of ending strings. For example, the C-comment pattern can be viewed as a recursive
       balanced pattern.

       # create a c-comment pattern
       const pat (afnix:txt:Pattern "STRING" "/*" "*/" )
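
       The difference between the single and the recursive balanced forms can be illustrated
       outside AFNIX. The Python sketch below is not the AFNIX implementation; the function
       name match_balanced and its parameters are hypothetical, and it assumes that the match
       result is the text between the delimiters, as the "<xml>" example above suggests.

```python
def match_balanced(text, start, end, escape=None, recursive=False):
    """Return the text enclosed by balanced delimiters, or None if there is no match.

    Hypothetical helper, not an AFNIX call. The text must begin with `start`.
    """
    if not text.startswith(start):
        return None
    i = len(start)
    depth = 1          # number of currently open delimiters
    out = []
    while i < len(text):
        # an escape character protects the next character from interpretation
        if escape is not None and text[i] == escape and i + 1 < len(text):
            out.append(text[i + 1])
            i += 2
            continue
        # test the ending string first so identical delimiters can close
        if text.startswith(end, i):
            depth -= 1
            if depth == 0:
                return "".join(out)
            out.append(end)
            i += len(end)
            continue
        # in recursive mode a nested starting string opens a new level
        if recursive and text.startswith(start, i):
            depth += 1
            out.append(start)
            i += len(start)
            continue
        out.append(text[i])
        i += 1
    return None        # the delimiters were never balanced


element = match_balanced("<xml>", "<", ">")
quoted = match_balanced('"a\\"b"', '"', '"', escape="\\")
nested = match_balanced("(a(b)c)", "(", ")", recursive=True)
single = match_balanced("(a(b)c)", "(", ")")
```

       In this sketch the recursive call yields the whole nested content, while the single
       form stops at the first ending string. Whether AFNIX strips the escape character from
       the result, as done here, is an assumption.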

       Lexeme object
       The  Lexeme  object  is  the object built by a scanner that contains the matched string. A
       lexeme is  therefore  a  tagged  string.  Additionally,  a  lexeme  can  carry  additional
       information like a source name and index.

       # create an empty lexeme
       const lexm (afnix:txt:Lexeme)
       afnix:txt:lexeme-p lexm # true

       The default lexeme is created without any value. A value can be set with the set-value
       method and retrieved with the get-value method.

       lexm:set-value "hello"
       lexm:get-value # hello

       The set-tag and get-tag methods are similar and operate with an integer. The source name
       and index are set and retrieved with analogous methods.

       # check for the source
       lexm:set-source "world"
       lexm:get-source # world
       # check for the source index
       lexm:set-index 2000
       lexm:get-index # 2000

       Text scanning
       Text scanning is the ability to extract lexical elements or lexemes from an input stream.
       Generally, the lexemes are the results of a matching operation which is defined by a
       pattern object. As a result, a scanner is defined by the scanner object itself plus one or
       several pattern objects.

       Scanner construction
       By default, a scanner is created without pattern objects. The length  method  returns  the
       number of pattern objects. As usual, a predicate is associated with the scanner object.

       # the default scanner
       const scan (afnix:txt:Scanner)
       afnix:txt:scanner-p scan # true
       # the length method
       scan:length # 0

       The scanner construction proceeds by adding pattern objects. Each pattern can be created
       independently, and later added to the scanner. For example, a scanner that reads reals,
       integers and strings can be defined as follows:

       # create the scanner pattern
       const REAL    (
         afnix:txt:Pattern "REAL"    [$d+.$d*])
       const STRING  (
         afnix:txt:Pattern "STRING"  "\"" '\\')
       const INTEGER (
         afnix:txt:Pattern "INTEGER" [$d+|"0x"$x+])
       # add the pattern to the scanner
       scan:add INTEGER REAL STRING

       The order in which the patterns are added defines the priority with which a token is
       recognized. The symbol name for each pattern is optional, since the functional style
       permits patterns to be created directly where they are used. This writing style makes the
       scanner definition easier to read.
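
       The effect of the insertion order can be sketched in Python. This is only a model of
       the priority rule stated above, not the AFNIX scanner: the helper first_match is
       hypothetical, it assumes that the first pattern matching at the start of the input
       wins, and it uses Python regular expressions instead of AFNIX regex literals.

```python
import re


def first_match(patterns, text):
    """Return (tag, matched_text) for the first pattern, in list order,
    that matches at the start of text, or None if no pattern matches.

    Hypothetical helper used to illustrate priority by insertion order.
    """
    for tag, regex in patterns:
        m = re.match(regex, text)
        if m:
            return tag, m.group(0)
    return None


INTEGER = ("INTEGER", r"\d+")
REAL = ("REAL", r"\d+\.\d*")

# same input, two different priority orders
int_first = first_match([INTEGER, REAL], "3.14")
real_first = first_match([REAL, INTEGER], "3.14")
```

       With the integer pattern first, only the leading digit is recognized; with the real
       pattern first, the whole number is. How AFNIX resolves such overlaps beyond the order
       of addition is not described here.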

       Using the scanner
       Once constructed, the scanner can be used as is. A stream is generally  the  best  way  to
       operate.  If  the  scanner reaches the end-of-stream or cannot recognize a lexeme, the nil
       object is returned. With a loop, it is easy to get all lexemes.

       while (trans valid (is:valid-p)) {
         # try to get the lexeme
         trans lexm (scan:scan is)
         # check for nil lexeme and print the value
         if (not (nil-p lexm)) (println (lexm:get-value))
         # update the valid flag
         valid:= (and (is:valid-p) (not (nil-p lexm)))
       }

       In this loop, it is necessary first to check for the end of the stream. This is done with
       the help of the special loop construct that initializes the valid symbol. As soon as the
       lexeme is built, it can be used. The lexeme holds the value as well as its tag.
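
       The same control flow can be modeled in Python: keep scanning until the stream is
       exhausted or a lexeme cannot be recognized. Everything in this sketch is an assumption
       made for illustration (the Lexeme dataclass, the tag numbers, the whitespace-delimited
       tokens); it does not reproduce how the AFNIX scanner buffers its input.

```python
import io
import re
from dataclasses import dataclass


@dataclass
class Lexeme:
    value: str   # the matched string
    tag: int     # the tag of the pattern that matched


# hypothetical tags and patterns, tried in order
PATTERNS = [(1, re.compile(r"\d+")), (2, re.compile(r"[A-Za-z_]\w*"))]


def scan(stream):
    """Read one whitespace-delimited word from the stream and tag it.

    Returns None at end of stream or when no pattern matches the word.
    """
    word = []
    while True:
        ch = stream.read(1)
        if ch == "":            # end of stream
            break
        if ch.isspace():
            if word:            # a blank ends the current word
                break
            continue            # skip leading blanks
        word.append(ch)
    text = "".join(word)
    if not text:
        return None
    for tag, rx in PATTERNS:
        if rx.fullmatch(text):
            return Lexeme(text, tag)
    return None


def scan_all(stream):
    """Collect lexemes until scan returns None, as in the loop above."""
    lexemes = []
    while True:
        lexm = scan(stream)
        if lexm is None:
            break
        lexemes.append(lexm)
    return lexemes


result = scan_all(io.StringIO("abc 42 x1 ?? 7"))
```

       Note that the loop stops at the unrecognized word "??", so the trailing "7" is never
       scanned; a None result alone does not distinguish end of stream from an unrecognized
       lexeme.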

       Text sorting
       Sorting is one of the primary functions implemented inside the text processing module.
       There are three sorting functions available in the module.

       Ascending and descending order sorting
       The sort-ascent function operates with a vector object and sorts the elements in ascending
       order. Any kind of object can be sorted as long as it supports a comparison method. The
       elements are sorted in place by using a quick sort algorithm.

       # create an unsorted vector
       const v-i (Vector 7 5 3 4 1 8 0 9 2 6)
       # sort the vector in place
       afnix:txt:sort-ascent v-i
       # print the vector
       for (e) (v-i) (println e)

       The sort-descent function is similar to the sort-ascent function except that the objects
       are sorted in descending order.

       Lexical sorting
       The sort-lexical function operates  with  a  vector  object  and  sorts  the  elements  in
       ascending  order  using  a  lexicographic ordering relation. Objects in the vector must be
       literal objects or an exception is raised.
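
       The three orderings can be compared with a short Python sketch. The function names
       mirror the AFNIX ones but are plain Python stand-ins; the sketch assumes that the
       lexicographic order compares the string representation of each literal, and it relies
       on Python's built-in sort rather than a quick sort.

```python
# types treated as "literal" in this sketch (an assumption)
LITERAL_TYPES = (bool, int, float, str)


def sort_ascent(v):
    """Sort the list in place in ascending order."""
    v.sort()


def sort_descent(v):
    """Sort the list in place in descending order."""
    v.sort(reverse=True)


def sort_lexical(v):
    """Sort the list in place by string representation.

    Raises TypeError if an element is not a literal.
    """
    for item in v:
        if not isinstance(item, LITERAL_TYPES):
            raise TypeError("sort_lexical expects literal objects")
    v.sort(key=str)


numbers = [7, 5, 3, 4, 1, 8, 0, 9, 2, 6]
sort_ascent(numbers)

mixed = [10, 9, 100, 2]
sort_lexical(mixed)
```

       After the calls, numbers is in numeric order while mixed is ordered as the strings
       "10", "100", "2", "9", which shows how a lexicographic order differs from a numeric
       one.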

       Transliteration
       Transliteration is the process of changing characters by mapping one to another one. The
       transliteration process operates with a character source and produces a target character
       with the help of a mapping table. The transliteration process is not necessarily
       reversible as often indicated in the literature.

       Literate object
       The Literate object is a transliteration object that is bound by default with the identity
       function mapping. As usual, a predicate is associated with the object.

       # create a transliterate object
       const tl (afnix:txt:Literate)
       # check the object
       afnix:txt:literate-p tl # true

       The transliteration process can also operate with an escape character in order to map a
       two-character sequence into a single character, as usually found in programming languages.

       # create a transliterate object by escape
       const tl (afnix:txt:Literate '\\')

       Transliteration configuration
       The set-map method configures the transliteration mapping table while the set-escape-map
       method configures the escape mapping table. The mapping is done by setting the source
       character and the target character. For instance, if one wants to map the tabulation
       character to a white space, the mapping table is set as follows:

       tl:set-map '\t' ' '

       The escape mapping table operates the same way. It should be noted that the mapping
       algorithm first translates the input character, possibly yielding an escape character,
       and then the escape mapping takes place. Note also that the set-escape method can be used
       to set the escape character.

       tl:set-map '\t' ' '

       Transliteration process
       The  transliteration process is done either with a string or an input stream. In the first
       case, the translate method operates with a string and returns a translated string. On  the
       other hand, the read method returns a character when operating with a stream.

       # set the mapping characters
       tl:set-map 'h' 'w'
       tl:set-map 'e' 'o'
       tl:set-map 'l' 'r'
       tl:set-map 'o' 'd'
       # translate a string
       tl:translate "helo" # word
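
       The two-table mechanism can be sketched in Python. This is a model written for
       illustration, not the AFNIX implementation: the class below only borrows the method
       names, and it assumes that an escaped character without an escape mapping is passed
       through unchanged. The character mapping used is one that is consistent with the
       "helo" to "word" result shown above.

```python
class Literate:
    """Minimal transliteration model with a regular and an escape table."""

    def __init__(self, escape=None):
        self.escape = escape        # escape character, or None
        self.table = {}             # regular mapping (identity by default)
        self.escape_table = {}      # mapping applied after an escape

    def set_map(self, src, dst):
        self.table[src] = dst

    def set_escape_map(self, src, dst):
        self.escape_table[src] = dst

    def translate(self, text):
        out = []
        chars = iter(text)
        for ch in chars:
            # the regular mapping is applied first
            ch = self.table.get(ch, ch)
            if self.escape is not None and ch == self.escape:
                # an escape consumes the next character, if any
                nxt = next(chars, None)
                if nxt is None:
                    out.append(ch)  # dangling escape kept as is (assumption)
                else:
                    out.append(self.escape_table.get(nxt, nxt))
            else:
                out.append(ch)
        return "".join(out)


tl = Literate()
for src, dst in (("h", "w"), ("e", "o"), ("l", "r"), ("o", "d")):
    tl.set_map(src, dst)
word = tl.translate("helo")

esc = Literate("\\")
esc.set_escape_map("n", "\n")
line = esc.translate("a\\nb")
```

       The second object turns the two-character sequence made of a backslash and the letter
       n into a single newline character, which is the kind of escape mapping described
       above.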

STANDARD TEXT PROCESSING REFERENCE

       Pattern
       The Pattern class is a pattern matching class based either on a regular expression or on
       a balanced string. In the regex mode, the pattern is defined with a regex and a match is
       said to occur when a regex match is achieved. In the balanced string mode, the pattern is
       defined with a start string and an end string. The balanced mode can be single or
       recursive. Additionally, an escape character can be associated with the class. A name and
       a tag are also bound to the pattern object as a means to ease integration within a
       scanner.

       Predicate

              pattern-p

       Inheritance

              Object

       Constructors

              Pattern (none)
              The Pattern constructor creates an empty pattern.

              Pattern (String|Regex)
              The  Pattern  constructor  creates  a  pattern  object  associated  with  a regular
              expression. The argument can be either a string or a regular expression object.  If
              the argument is a string, it is converted into a regular expression object.

              Pattern (String String)
              The Pattern constructor creates a balanced pattern. The first argument is the start
              pattern string. The second argument is the end balanced string.

              Pattern (String String Character)
              The Pattern constructor creates a balanced pattern with an  escape  character.  The
              first argument is the start pattern string. The second argument is the end balanced
              string. The third argument is the escape character.

              Pattern (String String Boolean)
              The Pattern constructor creates a recursive balanced pattern. The first argument is
              the start pattern string. The second argument is the end balanced string. The third
              argument is the boolean recursive flag.

       Constants

              REGEX
              The REGEX constant indicates that the pattern is a regular expression.

              BALANCED
              The BALANCED constant indicates that the pattern is a balanced pattern.

              RECURSIVE
              The RECURSIVE constant indicates that the pattern is a recursive balanced pattern.

       Methods

              check -> Boolean (String)
              The  check  method checks the pattern against the input string. If the verification
              is successful, the method returns true, false otherwise.

              match -> String (String|InputStream)
              The match method attempts to match an input string or an input stream. If a match
              occurs, the matching string is returned. If the input is a string, the end of the
              string is used as the end condition. If the input is a stream, the end of the
              stream is used as the end condition.

              set-tag -> none (Integer)
              The  set-tag  method  sets  the  pattern  tag. The tag can be further used inside a
              scanner.

              get-tag -> Integer (none)
              The get-tag method returns the pattern tag.

              set-name -> none (String)
              The set-name method sets the pattern name. The name is a symbol identifier for that
              pattern.

              get-name -> String (none)
              The get-name method returns the pattern name.

              set-regex -> none (String|Regex)
              The  set-regex  method  sets the pattern regex either with a string or with a regex
              object. If the method is successfully completed, the pattern type  is  switched  to
              the REGEX type.

              set-escape -> none (Character)
              The  set-escape  method  sets the pattern escape character. The escape character is
              used only in balanced mode.

              get-escape -> Character (none)
              The get-escape method returns the escape character.

              set-balanced -> none (String|String String)
              The set-balanced method sets the pattern balanced string. With  one  argument,  the
              same balanced string is used for starting and ending. With two arguments, the first
              argument is the starting string and the second is the ending string.

       Lexeme
       The Lexeme class is a literal object that is designed to hold the string matched by a
       pattern. A lexeme consists of a string (i.e. the lexeme value), a tag and optionally a
       source name (e.g. a file name) and a source index (e.g. a line number).

       Predicate

              lexeme-p

       Inheritance

              Literal

       Constructors

              Lexeme (none)
              The Lexeme constructor creates an empty lexeme.

              Lexeme (String)
              The Lexeme constructor creates a lexeme by value. The string argument is the lexeme
              value.

       Methods

              set-tag -> none (Integer)
              The  set-tag  method  sets  the  lexeme  tag.  The tag can be further used inside a
              scanner.

              get-tag -> Integer (none)
              The get-tag method returns the lexeme tag.

              set-value -> none (String)
              The set-value method sets the lexeme value.  The  lexeme  value  is  generally  the
              result of a matching operation.

              get-value -> String (none)
              The get-value method returns the lexeme value.

              set-index -> none (Integer)
              The  set-index  method sets the lexeme source index. The lexeme source index can be
              for instance the source line number.

              get-index -> Integer (none)
              The get-index method returns the lexeme source index.

              set-source -> none (String)
              The set-source method sets the lexeme source name. The lexeme source  name  can  be
              for instance the source file name.

              get-source -> String (none)
              The get-source method returns the lexeme source name.

       Scanner
       The Scanner class is a text scanner or lexical analyzer that operates on an input stream
       and matches one or several patterns. The scanner is built by adding patterns to the
       scanner object. With an input stream, the scanner object attempts to build a buffer that
       matches at least one pattern. When such a match occurs, a lexeme is built. When building
       a lexeme, the pattern tag is used to mark the lexeme.

       Predicate

              scanner-p

       Inheritance

              Object

       Constructors

              Scanner (none)
              The Scanner constructor creates an empty scanner.

       Methods

              add -> none (Pattern*)
              The  add  method adds 0 or more pattern objects to the scanner. The priority of the
              pattern is determined by the order in which the patterns are added.

              length -> Integer (none)
              The length method returns the number of pattern objects in this scanner.

              get -> Pattern (Integer)
              The get method returns a pattern object by index.

              check -> Lexeme (String)
              The check method checks that a string is matched by the  scanner  and  returns  the
              associated lexeme.

              scan -> Lexeme (InputStream)
              The  scan  method scans an input stream until a pattern is matched. When a matching
              occurs, the associated lexeme is returned.

       Literate
       The Literate class is a transliteration mapping class. Transliteration is the process of
       changing characters by mapping one to another one. The transliteration process operates
       with a character source and produces a target character with the help of a mapping table.
       This transliteration object can also operate with an escape table. In the presence of an
       escape character, an escape mapping table is used instead of the regular one.

       Predicate

              literate-p

       Inheritance

              Object

       Constructors

              Literate (none)
              The Literate constructor creates a default transliteration object.

              Literate (Character)
              The Literate constructor creates a default transliteration object  with  an  escape
              character. The argument is the escape character.

       Methods

              read -> Character (InputStream)
              The read method reads a character from the input stream and translates it with the
              help of the mapping table. A second character might be consumed from the stream if
              the first character is an escape character.

              getu -> Character (InputStream)
              The getu method reads a Unicode character from the input stream and translates it
              with the help of the mapping table. A second character might be consumed from the
              stream if the first character is an escape character.

              reset -> none (none)
              The reset method resets all the mapping tables and installs a default identity
              mapping.

              set-map -> none (Character Character)
              The set-map method sets the mapping table by using a source and target character.
              The first character is the source character. The second character is the target
              character.

              get-map -> Character (Character)
              The get-map method returns the mapping character by character. The source character
              is the argument.

              translate -> String (String)
              The translate method translates a string by transliteration and returns a new
              string.

              set-escape -> none (Character)
              The set-escape method sets the escape character.

              get-escape -> Character (none)
              The get-escape method returns the escape character.

              set-escape-map -> none (Character Character)
              The set-escape-map method sets the escape mapping table by using a source and
              target character. The first character is the source character. The second character
              is the target character.

              get-escape-map -> Character (Character)
              The  get-escape-map  method  returns the escape mapping character by character. The
              source character is the argument.

       Functions

              sort-ascent -> none (Vector)
              The sort-ascent function sorts in ascending order the vector argument.  The  vector
              is sorted in place.

              sort-descent -> none (Vector)
              The sort-descent function sorts in descending order the vector argument. The vector
              is sorted in place.

              sort-lexical -> none (Vector)
              The sort-lexical function sorts in lexicographic order  the  vector  argument.  The
              vector is sorted in place.