Provided by: afnix_2.9.2-2build1_amd64

NAME

       txt - standard text processing module

STANDARD TEXT PROCESSING MODULE

       The Standard Text Processing module is an original implementation of an object collection dedicated to
       text processing. Although text scanning is the most common operation performed in the field of text
       processing, the module also provides specialized objects to store and index text data. Text sorting and
       transliteration are also part of this module.

       Scanning concepts
       Text scanning is the ability to extract lexical elements, or lexemes, from a stream. A scanner, or lexical
       analyzer, is the principal object used to perform this task. A scanner is built by adding special objects
       that act as pattern matchers. When a pattern is matched, a special object called a lexeme is returned.

       Pattern object
       A Pattern object is a special object that acts as a model for the string to match. There are several ways
       to build a pattern. The simplest way is with a regular expression. Another type of pattern is a balanced
       pattern. In its first form, a pattern object is created with a regular expression object.

       # create a pattern object
       const pat (afnix:txt:Pattern "$d+")

       In this example, the pattern object is built to detect integer objects.

       pat:check "123" # true
       pat:match "123" # 123

       The check method returns true if the input string matches the pattern. The match method returns the string
       that matches the pattern. Since the pattern object can also operate with a stream object, the match method
       is the appropriate way to extract the matching string from the input. As usual, a predicate is available
       for the pattern object.

       afnix:txt:pattern-p pat # true

       Another form of pattern object is the balanced pattern. A balanced pattern is determined by a starting
       string and an ending string. There are two types of balanced pattern: the single balanced pattern and
       the recursive balanced pattern. The single balanced pattern is appropriate for lexical elements that are
       delimited by a character. For example, the classical C-string is a single balanced pattern delimited by
       the double quote character.

       # create a balanced pattern
       const pat (afnix:txt:Pattern "ELEMENT" "<" ">")
       pat:check "<xml>" # true
       pat:match "<xml>" # xml

       In the case of the C-string, the pattern might be more appropriately defined with an additional escape
       character. The pattern matcher uses this character to accept characters that would otherwise be read as
       part of the pattern definition, such as a delimiter inside the string.

       # create a balanced pattern
       const pat (afnix:txt:Pattern "STRING" "'" '\')
       pat:check "'hello'" # true
       pat:match "'hello'" # "hello"
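
       The role of the escape character in a single balanced pattern can be illustrated outside AFNIX. The
       following Python sketch is only a model of the behaviour described above, not the module's
       implementation. The function name match_balanced is hypothetical, and dropping the escape character
       from the returned text is an assumption.

```python
def match_balanced(text, start, end, escape=None):
    """Return the text enclosed between start and end, or None if there is no match.

    Hypothetical helper: it models a single (non-recursive) balanced pattern.
    """
    if not text.startswith(start):
        return None
    i = len(start)
    body = []
    while i < len(text):
        if escape is not None and text[i] == escape and i + 1 < len(text):
            # Assumption: the escaped character is taken literally,
            # even when it is a delimiter, and the escape itself is dropped.
            body.append(text[i + 1])
            i += 2
        elif text.startswith(end, i):
            # First unescaped ending string closes the pattern.
            return "".join(body)
        else:
            body.append(text[i])
            i += 1
    return None  # the ending string was never found
```

       With this model, match_balanced("<xml>", "<", ">") yields xml, and an escaped quote inside a quoted
       string does not terminate the match.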

       In this form, a balanced pattern with an escape character is created. The same string is used for both
       the starting and ending strings. Another constructor that takes two strings can be used if the starting
       and ending strings are different. The last pattern form is the recursive balanced form. In this form, a
       starting and an ending string are used to delimit the pattern. However, in this mode, a recursive use of
       the starting and ending strings is allowed. In order to have an exact match, the number of starting
       strings must equal the number of ending strings. For example, the C-comment pattern can be viewed as a
       recursive balanced pattern.

       # create a c-comment pattern
       const pat (afnix:txt:Pattern "STRING" "/*" "*/" )
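
       The counting rule of the recursive form (as many starting strings as ending strings) can be sketched
       with a depth counter. This Python model is an illustration under stated assumptions, not the AFNIX
       implementation: the name match_recursive is hypothetical, the delimiters are assumed to be distinct,
       and the delimiters are kept in the returned text.

```python
def match_recursive(text, start, end):
    """Return the balanced prefix of text, delimiters included, or None.

    Hypothetical helper: start and end are assumed to be different strings.
    """
    if not text.startswith(start):
        return None
    depth = 0
    i = 0
    while i < len(text):
        if text.startswith(start, i):
            depth += 1          # one more opening string to close
            i += len(start)
        elif text.startswith(end, i):
            depth -= 1
            i += len(end)
            if depth == 0:      # every opening string has been closed
                return text[:i]
        else:
            i += 1
    return None  # unbalanced input
```

       For instance, match_recursive("(a(b)c)d", "(", ")") returns (a(b)c), while an input with a missing
       ending string returns None.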

       Lexeme object
       The Lexeme object is the object built by a scanner that contains the matched string. A lexeme is therefore
       a tagged string. Additionally, a lexeme can carry extra information such as a source name and index.

       # create an empty lexeme
       const lexm (afnix:txt:Lexeme)
       afnix:txt:lexeme-p lexm # true

       The default lexeme is created without any value. A value can be set with the set-value method and
       retrieved with the get-value method.

       lexm:set-value "hello"
       lexm:get-value # hello

       The set-tag and get-tag methods are similar and operate with an integer. The source name and index are
       set and retrieved in the same way.

       # check for the source
       lexm:set-source "world"
       lexm:get-source # world
       # check for the source index
       lexm:set-index 2000
       lexm:get-index # 2000

       Text scanning
       Text scanning is the ability to extract lexical elements, or lexemes, from an input stream. Generally, the
       lexemes are the results of a matching operation which is defined by a pattern object. As a result, a
       scanner is defined by the scanner object itself plus one or several pattern objects.

       Scanner construction
       By default, a scanner is created without pattern objects. The length method returns the number of pattern
       objects. As usual, a predicate is associated with the scanner object.

       # the default scanner
       const scanner (afnix:txt:Scanner)
       afnix:txt:scanner-p scanner # true
       # the length method
       scanner:length # 0

       The scanner construction proceeds by adding pattern objects. Each pattern can be created independently
       and later added to the scanner. For example, a scanner that reads reals, integers and strings can be
       defined as follows:

       # create the scanner pattern
       const REAL    (
         afnix:txt:Pattern "REAL"    [$d+.$d*])
       const STRING  (
         afnix:txt:Pattern "STRING"  """ '\')
       const INTEGER (
         afnix:txt:Pattern "INTEGER" [$d+|"0x"$x+])
       # add the pattern to the scanner
       scanner:add INTEGER REAL STRING

       The order in which patterns are added defines the priority with which a token is recognized. The symbol
       name for each pattern is optional, since functional programming permits the creation of patterns
       directly. This writing style makes the scanner definition easier to read.
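
       The effect of the insertion order can be illustrated with a small Python model. It is a sketch, not
       the AFNIX scanner: the KEYWORD and WORD patterns and the first_match function are hypothetical, and
       the "first pattern that matches wins" rule is an assumption drawn from the priority statement above.

```python
import re

# (tag name, compiled regex) pairs; the list order is the priority order.
# Both patterns are hypothetical and deliberately overlap on "if".
PATTERNS = [
    ("KEYWORD", re.compile(r"if|while")),
    ("WORD", re.compile(r"[a-z]+")),
]

def first_match(text, patterns=PATTERNS):
    """Return (tag, value) for the first pattern matching the whole text, else None."""
    for tag, regex in patterns:
        m = regex.fullmatch(text)
        if m:
            return tag, m.group(0)
    return None
```

       With the order above, first_match("if") is tagged KEYWORD. With the list reversed, the same input is
       tagged WORD, which is why the more specific patterns are usually added first.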

       Using the scanner
       Once constructed, the scanner can be used as is. A stream is generally the best input to operate on. If
       the scanner reaches the end of the stream or cannot recognize a lexeme, the nil object is returned. With a
       loop, it is easy to get all lexemes.

       while (trans valid (is:valid-p)) {
         # try to get the lexeme
         trans lexm (scanner:scan is)
         # check for nil lexeme and print the value
         if (not (nil-p lexm)) (println (lexm:get-value))
         # update the valid flag
         valid:= (and (is:valid-p) (not (nil-p lexm)))
       }

       In this loop, it is first necessary to check for the end of the stream. This is done with the help of the
       special loop construct that initializes the valid symbol. As soon as the lexeme is built, it can be
       used. The lexeme holds the value as well as its tag.
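
       The same control flow (scan until the end of the stream or until nothing is recognized) can be modelled
       in Python. This is a sketch under assumptions: the INTEGER and WORD patterns and the scan_all function
       are hypothetical, white space is skipped between lexemes, and the stream is read in one call for
       simplicity, which the AFNIX scanner is not documented to do.

```python
import io
import re

# Hypothetical patterns, tried in priority order.
TOKEN_PATTERNS = [
    ("INTEGER", re.compile(r"\d+")),
    ("WORD", re.compile(r"[a-z]+")),
]

def scan_all(stream):
    """Collect (tag, value) pairs until end of input or an unrecognized character."""
    text = stream.read()
    pos = 0
    lexemes = []
    while pos < len(text):
        if text[pos].isspace():
            pos += 1            # assumption: white space separates lexemes
            continue
        for tag, regex in TOKEN_PATTERNS:
            m = regex.match(text, pos)
            if m:
                lexemes.append((tag, m.group(0)))
                pos = m.end()
                break
        else:
            break               # no pattern matched: stop, like a nil lexeme
    return lexemes
```

       Calling scan_all(io.StringIO("abc 12 ?z")) collects the word and the integer, then stops at the
       question mark because no pattern recognizes it.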

       Text sorting
       Sorting is one of the primary functions implemented inside the text processing module. There are three
       sorting functions available in the module.

       Ascending and descending order sorting
       The sort-ascent function operates with a vector object and sorts the elements in ascending order. Any kind
       of object can be sorted as long as it supports a comparison method. The elements are sorted in place
       by using a quick sort algorithm.

       # create an unsorted vector
       const v-i (Vector 7 5 3 4 1 8 0 9 2 6)
       # sort the vector in place
       afnix:txt:sort-ascent v-i
       # print the vector
       for (e) (v-i) (println e)

       The sort-descent function is similar to the sort-ascent function except that the objects are sorted in
       descending order.

       Lexical sorting
       The sort-lexical function operates with a vector object and sorts the elements in ascending order using a
       lexicographic ordering relation. Objects in the vector must be literal objects or an exception is raised.
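
       The difference between the three orderings is easiest to see on numbers. The Python sketch below is an
       analogy only: it assumes that a lexicographic ordering compares the textual form of each element, which
       this page does not state explicitly for sort-lexical.

```python
values = [10, 9, 2, 33, 1]

# Ascending and descending order compare the values themselves.
ascending = sorted(values)
descending = sorted(values, reverse=True)

# Lexicographic order (assumption: compare the textual form, character by character).
lexical = sorted(values, key=str)
```

       Here ascending is [1, 2, 9, 10, 33] while lexical is [1, 10, 2, 33, 9], because "10" comes before "2"
       when compared character by character.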

       Transliteration
       Transliteration is the process of changing characters by mapping one to another. The transliteration
       process operates with a source character and produces a target character with the help of a mapping
       table. The transliteration process is not necessarily reversible as often indicated in the literature.

       Literate object
       The Literate object is a transliteration object that is bound by default to the identity mapping
       function. As usual, a predicate is associated with the object.

       # create a transliterate object
       const tl (afnix:txt:Literate)
       # check the object
       afnix:txt:literate-p tl # true

       The transliteration process can also operate with an escape character in order to map a two-character
       sequence into a single character, as is usually found in programming languages.

       # create a transliterate object by escape
       const tl (afnix:txt:Literate '\')

       Transliteration configuration
       The set-map method configures the transliteration mapping table while the set-escape-map method configures
       the escape mapping table. The mapping is done by setting the source character and the target character.
       For instance, to map the tabulation character to a white space, the mapping table is set as follows:

       tl:set-map '\t' ' '

       The escape mapping table operates the same way. It should be noted that the mapping algorithm first
       translates the input character, possibly yielding an escape character, and then the escape mapping takes
       place. Note also that the set-escape method can be used to set the escape character.

       tl:set-map '' ' '

       Transliteration process
       The transliteration process is done either with a string or an input stream. In the first case, the
       translate method operates with a string and returns a translated string. On the other hand, the read
       method returns a character when operating with a stream.

       # set the mapping characters
       tl:set-map 'h' 'w'
       tl:set-map 'e' 'o'
       tl:set-map 'l' 'r'
       tl:set-map 'o' 'd'
       # translate a string
       tl:translate "helo" # word
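
       The two-stage algorithm (regular mapping first, then escape mapping when the result is the escape
       character) can be modelled in a few lines of Python. This is a sketch, not the Literate class: the
       translate function and its parameters are hypothetical, unmapped characters are assumed to map to
       themselves, and a trailing escape character with nothing after it is assumed to be kept as is.

```python
def translate(text, char_map, escape=None, escape_map=None):
    """Transliterate text with a mapping table and an optional escape mapping table."""
    escape_map = escape_map or {}
    out = []
    i = 0
    while i < len(text):
        # Stage 1: regular mapping; unmapped characters map to themselves.
        c = char_map.get(text[i], text[i])
        i += 1
        # Stage 2: if the result is the escape character, consume one more
        # character and translate it with the escape mapping table instead.
        if escape is not None and c == escape and i < len(text):
            nxt = text[i]
            i += 1
            c = escape_map.get(nxt, nxt)
        out.append(c)
    return "".join(out)
```

       With the table {"h": "w", "e": "o", "l": "r", "o": "d"}, translate("helo", ...) gives word, and with
       a backslash escape and an escape table mapping n to a newline, the two characters \n in the input
       become a single newline character.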

STANDARD TEXT PROCESSING REFERENCE

       Pattern
       The Pattern class is a pattern matching class based either on a regular expression or on balanced
       strings. In the regex mode, the pattern is defined with a regex and a match is said to occur when a regex
       match is achieved. In the balanced string mode, the pattern is defined with a start string and an end
       string. The balanced mode can be single or recursive. Additionally, an escape character can be
       associated with the class. A name and a tag are also bound to the pattern object as a means to ease the
       integration within a scanner.

       Predicate

              pattern-p

       Inheritance

              Object

       Constructors

              Pattern (none)
              The Pattern constructor creates an empty pattern.

              Pattern (String|Regex)
              The Pattern constructor creates a pattern object associated with a regular expression. The
              argument can be either a string or a regular expression object. If the argument is a string, it is
              converted into a regular expression object.

              Pattern (String String)
              The Pattern constructor creates a balanced pattern. The first argument is the start pattern
              string. The second argument is the end balanced string.

              Pattern (String String Character)
              The Pattern constructor creates a balanced pattern with an escape character. The first argument is
              the start pattern string. The second argument is the end balanced string. The third argument is
              the escape character.

              Pattern (String String Boolean)
              The Pattern constructor creates a recursive balanced pattern. The first argument is the start
              pattern string. The second argument is the end balanced string.

       Constants

              REGEX
              The REGEX constant indicates that the pattern is a regular expression.

              BALANCED
              The BALANCED constant indicates that the pattern is a balanced pattern.

              RECURSIVE
              The RECURSIVE constant indicates that the pattern is a recursive balanced pattern.

       Methods

              check -> Boolean (String)
              The check method checks the pattern against the input string. If the verification is successful,
              the method returns true, false otherwise.

              match -> String (String|InputStream)
              The match method attempts to match an input string or an input stream. If a match occurs, the
              matching string is returned. If the input is a string, the end of the string is used as the end
              condition. If the input is a stream, the end of the stream is used as the end condition.

              set-tag -> none (Integer)
              The set-tag method sets the pattern tag. The tag can be further used inside a scanner.

              get-tag -> Integer (none)
              The get-tag method returns the pattern tag.

              set-name -> none (String)
              The set-name method sets the pattern name. The name is a symbol identifier for that pattern.

              get-name -> String (none)
              The get-name method returns the pattern name.

              set-regex -> none (String|Regex)
              The set-regex method sets the pattern regex either with a string or with a regex object. If the
              method completes successfully, the pattern type is switched to the REGEX type.

              set-escape -> none (Character)
              The set-escape method sets the pattern escape character. The escape character is used only in
              balanced mode.

              get-escape -> Character (none)
              The get-escape method returns the escape character.

              set-balanced -> none (String|String String)
              The set-balanced method sets the pattern balanced strings. With one argument, the same balanced
              string is used for starting and ending. With two arguments, the first argument is the starting
              string and the second is the ending string.

       Lexeme
       The Lexeme class is a literal object that is designed to hold the string matched by a pattern. A lexeme
       consists of a string (i.e. the lexeme value), a tag and optionally a source name (i.e. file name) and a
       source index (i.e. line number).

       Predicate

              lexeme-p

       Inheritance

              Literal

       Constructors

              Lexeme (none)
              The Lexeme constructor creates an empty lexeme.

              Lexeme (String)
              The Lexeme constructor creates a lexeme by value. The string argument is the lexeme value.

       Methods

              set-tag -> none (Integer)
              The set-tag method sets the lexeme tag. The tag can be further used inside a scanner.

              get-tag -> Integer (none)
              The get-tag method returns the lexeme tag.

              set-value -> none (String)
              The set-value method sets the lexeme value. The lexeme value is generally the result of a matching
              operation.

              get-value -> String (none)
              The get-value method returns the lexeme value.

              set-index -> none (Integer)
              The set-index method sets the lexeme source index. The lexeme source index can be, for instance,
              the source line number.

              get-index -> Integer (none)
              The get-index method returns the lexeme source index.

              set-source -> none (String)
              The set-source method sets the lexeme source name. The lexeme source name can be, for instance,
              the source file name.

              get-source -> String (none)
              The get-source method returns the lexeme source name.

       Scanner
       The Scanner class is a text scanner or lexical analyzer that operates on an input stream and makes it
       possible to match one or several patterns. The scanner is built by adding patterns to the scanner
       object. With an input stream, the scanner object attempts to build a buffer that matches at least one
       pattern. When such a match occurs, a lexeme is built. When building a lexeme, the pattern tag is used to
       mark the lexeme.

       Predicate

              scanner-p

       Inheritance

              Object

       Constructors

              Scanner (none)
              The Scanner constructor creates an empty scanner.

       Methods

              add -> none (Pattern*)
              The add method adds zero or more pattern objects to the scanner. The priority of a pattern is
              determined by the order in which the patterns are added.

              length -> Integer (none)
              The length method returns the number of pattern objects in this scanner.

              get -> Pattern (Integer)
              The get method returns a pattern object by index.

              check -> Lexeme (String)
              The check method checks that a string is matched by the scanner and returns the associated lexeme.

              scan -> Lexeme (InputStream)
              The scan method scans an input stream until a pattern is matched. When a match occurs, the
              associated lexeme is returned.

       Literate
       The Literate class is a transliteration mapping class. Transliteration is the process of changing
       characters by mapping one to another. The transliteration process operates with a source character and
       produces a target character with the help of a mapping table. This transliteration object can also
       operate with an escape table. In the presence of an escape character, an escape mapping table is used
       instead of the regular one.

       Predicate

              literate-p

       Inheritance

              Object

       Constructors

              Literate (none)
              The Literate constructor creates a default transliteration object.

              Literate (Character)
              The Literate constructor creates a default transliteration object with an escape character. The
              argument is the escape character.

       Methods

              read -> Character (InputStream)
              The read method reads a character from the input stream and translates it with the help of the
              mapping table. A second character might be consumed from the stream if the first character is an
              escape character.

              getu -> Character (InputStream)
              The getu method reads a Unicode character from the input stream and translates it with the help
              of the mapping table. A second character might be consumed from the stream if the first character
              is an escape character.

              reset -> none (none)
              The reset method resets all the mapping tables and installs a default identity mapping.

              set-map -> none (Character Character)
              The set-map method sets an entry in the mapping table by using a source and a target character.
              The first character is the source character. The second character is the target character.

              get-map -> Character (Character)
              The get-map method returns the mapping character for a source character. The source character is
              the argument.

              translate -> String (String)
              The translate method translates a string by transliteration and returns a new string.

              set-escape -> none (Character)
              The set-escape method sets the escape character.

              get-escape -> Character (none)
              The get-escape method returns the escape character.

              set-escape-map -> none (Character Character)
              The set-escape-map method sets an entry in the escape mapping table by using a source and a target
              character. The first character is the source character. The second character is the target
              character.

              get-escape-map -> Character (Character)
              The get-escape-map method returns the escape mapping character for a source character. The source
              character is the argument.

       Functions

              sort-ascent -> none (Vector)
              The sort-ascent function sorts the vector argument in ascending order. The vector is sorted in
              place.

              sort-descent -> none (Vector)
              The sort-descent function sorts the vector argument in descending order. The vector is sorted in
              place.

              sort-lexical -> none (Vector)
              The sort-lexical function sorts the vector argument in lexicographic order. The vector is sorted
              in place.

AFNIX                                              2020-03-22                                             txt(3)