Provided by: libpcre3-dev_8.12-4_amd64

NAME

       PCRE - Perl-compatible regular expressions

INTRODUCTION


       The  PCRE library is a set of functions that implement regular expression pattern matching
       using the same syntax and semantics as Perl, with just a few differences. Some features
       that appeared in Python and PCRE before they appeared in Perl are also available using the
       Python syntax. There is also some support for one or two .NET and Oniguruma syntax items,
       and there is an option for requesting some minor changes that give better JavaScript
       compatibility.

       The current implementation of PCRE corresponds approximately  with  Perl  5.12,  including
       support  for UTF-8 encoded strings and Unicode general category properties. However, UTF-8
       and Unicode support has to be explicitly enabled; it  is  not  the  default.  The  Unicode
       tables correspond to Unicode release 5.2.0.

       In  addition  to  the  Perl-compatible  matching  function,  PCRE  contains an alternative
       function that  matches  the  same  compiled  patterns  in  a  different  way.  In  certain
       circumstances,  the alternative function has some advantages.  For a discussion of the two
       matching algorithms, see the pcrematching page.

       PCRE is written in C and released as a C library. A number of people have written wrappers
       and interfaces of various kinds. In particular, Google Inc.  have provided a comprehensive
       C++ wrapper. This is now included as part of the PCRE distribution. The pcrecpp  page  has
       details  of  this  interface.  Other  people's  contributions  can be found in the Contrib
       directory at the primary FTP site, which is:

       ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre

       Details of exactly which Perl regular expression features are and  are  not  supported  by
       PCRE are given in separate documents. See the pcrepattern and pcrecompat pages. There is a
       syntax summary in the pcresyntax page.

       Some features of PCRE can be included, excluded, or changed when the library is built. The
       pcre_config()  function  makes  it  possible  for  a client to discover which features are
       available. The features themselves are described in the pcrebuild page. Documentation
       about building PCRE for various operating systems can be found in the README and
       NON-UNIX-USE files in the source distribution.

       The library contains a number of undocumented internal functions and data tables that  are
       used  by  more than one of the exported external functions, but which are not intended for
       use by external callers. Their names all begin with "_pcre_",  which  hopefully  will  not
       provoke  any  name clashes. In some environments, it is possible to control which external
       symbols are exported when a shared library is built, and in these cases  the  undocumented
       symbols are not exported.

USER DOCUMENTATION


       The  user  documentation  for  PCRE comprises a number of different sections. In the "man"
       format, each of these is a separate "man page". In the HTML format,  each  is  a  separate
       page,  linked  from the index page. In the plain text format, all the sections, except the
       pcredemo section, are concatenated, for ease of searching. The sections are as follows:

         pcre              this document
         pcre-config       show PCRE installation configuration information
         pcreapi           details of PCRE's native C API
         pcrebuild         options for building PCRE
         pcrecallout       details of the callout feature
         pcrecompat        discussion of Perl compatibility
         pcrecpp           details of the C++ wrapper
         pcredemo          a demonstration C program that uses PCRE
         pcregrep          description of the pcregrep command
         pcrematching      discussion of the two matching algorithms
         pcrepartial       details of the partial matching facility
         pcrepattern       syntax and semantics of supported
                             regular expressions
         pcreperform       discussion of performance issues
         pcreposix         the POSIX-compatible C API
         pcreprecompile    details of saving and re-using precompiled patterns
         pcresample        discussion of the pcredemo program
         pcrestack         discussion of stack usage
         pcresyntax        quick syntax reference
         pcretest          description of the pcretest testing command

       In addition, in the "man" and HTML formats, there is a  short  page  for  each  C  library
       function, listing its arguments and results.

LIMITATIONS


       There  are  some size limitations in PCRE but it is hoped that they will never in practice
       be relevant.

       The maximum length of a compiled pattern is 65539 (sic) bytes if PCRE is compiled with the
       default  internal  linkage  size of 2. If you want to process regular expressions that are
       truly enormous, you can compile PCRE with an internal linkage size of  3  or  4  (see  the
       README  file  in  the source distribution and the pcrebuild documentation for details). In
       these cases the limit is substantially larger.  However, the speed of execution is slower.

       All values in repeating quantifiers must be less than 65536.

       There is no limit to the number of parenthesized subpatterns, but there  can  be  no  more
       than 65535 capturing subpatterns.

       The maximum length of a name for a named subpattern is 32 characters, and the maximum
       number of named subpatterns is 10000.

       The maximum length of a subject string is the largest positive number that an integer
       variable can hold (INT_MAX, because the length is passed as an int). However, when using
       the traditional matching function, PCRE uses recursion to handle subpatterns and
       indefinite repetition. This means that the available stack space may limit the size of a
       subject string that can be processed by certain patterns. For a discussion of stack
       issues, see the pcrestack documentation.

UTF-8 AND UNICODE PROPERTY SUPPORT


       From release 3.3, PCRE has had some support for character strings  encoded  in  the  UTF-8
       format.  For  release 4.0 this was greatly extended to cover most common requirements, and
       in release 5.0 additional support for Unicode general category properties was added.

       In order to process UTF-8 strings, you must build PCRE to include UTF-8 support in the code,
       and,  in  addition,  you  must  call pcre_compile() with the PCRE_UTF8 option flag, or the
       pattern must start with the sequence (*UTF8). When either of these is the case,  both  the
       pattern  and  any subject strings that are matched against it are treated as UTF-8 strings
       instead of strings of 1-byte characters.

       If you compile PCRE with UTF-8 support, but do not use it at run time, the library will be
       a  bit  bigger,  but  the additional run time overhead is limited to testing the PCRE_UTF8
       flag occasionally, so should not be very big.

       If PCRE is built with Unicode character property support (which  implies  UTF-8  support),
       the  escape sequences \p{..}, \P{..}, and \X are supported.  The available properties that
       can be tested are limited to the general category properties such as Lu for an upper  case
       letter or Nd for a decimal number, the Unicode script names such as Arabic or Han, and the
       derived properties Any and L&. A full list is given in the pcrepattern documentation. Only
       the  short  names  for  properties are supported. For example, \p{L} matches a letter. Its
       Perl synonym, \p{Letter}, is not supported.  Furthermore, in  Perl,  many  properties  may
       optionally  be  prefixed  by  "Is", for compatibility with Perl 5.6. PCRE does not support
       this.

   Validity of UTF-8 strings

       When you set the PCRE_UTF8 flag, the strings passed  as  patterns  and  subjects  are  (by
       default)  checked  for  validity  on  entry to the relevant functions. From release 7.3 of
       PCRE, the check is according to the rules of RFC 3629, which are themselves derived from the
       Unicode  specification.  Earlier  releases  of  PCRE followed the rules of RFC 2279, which
       allows the full range of 31-bit values (0 to 0x7FFFFFFF). The current  check  allows  only
       values in the range U+0 to U+10FFFF, excluding U+D800 to U+DFFF.

       The  excluded  code  points  are the "Low Surrogate Area" of Unicode, of which the Unicode
       Standard says this: "The Low Surrogate Area does not contain  any  character  assignments,
       consequently  no character code charts or namelists are provided for this area. Surrogates
       are reserved for use with UTF-16 and then must be used in pairs." The code points that are
       encoded  by  UTF-16  pairs are available as independent code points in the UTF-8 encoding.
       (In other words, the whole surrogate thing is  a  fudge  for  UTF-16  which  unfortunately
       messes up UTF-8.)
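
       The two ranges described above can be written down as a pair of small C predicates. This
       is only an illustrative sketch: the function names cp_valid_rfc3629 and cp_valid_rfc2279
       are hypothetical and are not part of the PCRE API, and the code is not taken from PCRE's
       own validity check.

```c
/* Illustrative only: these helpers are NOT part of PCRE. */

/* Returns 1 if cp is allowed under the RFC 3629 rules described above:
   U+0 to U+10FFFF, excluding U+D800 to U+DFFF. Returns 0 otherwise. */
int cp_valid_rfc3629(unsigned long cp)
{
    if (cp > 0x10FFFFUL) return 0;                   /* beyond the Unicode range */
    if (cp >= 0xD800UL && cp <= 0xDFFFUL) return 0;  /* excluded surrogate range */
    return 1;
}

/* Returns 1 if cp fits the older RFC 2279 range of 31-bit values
   (0 to 0x7FFFFFFF). Returns 0 otherwise. */
int cp_valid_rfc2279(unsigned long cp)
{
    return cp <= 0x7FFFFFFFUL;
}
```

       A value such as 0x110000 passes the second predicate but fails the first, which is the
       kind of string that earlier releases accepted and that the current check rejects.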

       If  an  invalid  UTF-8  string  is passed to PCRE, an error return (PCRE_ERROR_BADUTF8) is
       given. In some situations, you may already know that your strings are valid, and therefore
       want   to   skip   these   checks  in  order  to  improve  performance.  If  you  set  the
       PCRE_NO_UTF8_CHECK flag at compile time or at run time, PCRE assumes that the  pattern  or
       subject  it is given (respectively) contains only valid UTF-8 codes. In this case, it does
       not diagnose an invalid UTF-8 string.

       If you pass an invalid UTF-8 string when PCRE_NO_UTF8_CHECK is set, what  happens  depends
       on why the string is invalid. If the string conforms to the "old" definition of UTF-8 (RFC
       2279), it is processed as a string of characters in the range 0 to  0x7FFFFFFF.  In  other
       words,  apart  from  the  initial validity test, PCRE (when in UTF-8 mode) handles strings
       according to the more liberal rules of RFC 2279. However, if  the  string  does  not  even
       conform to RFC 2279, the result is undefined. Your program may crash.

       If  you  want to process strings of values in the full range 0 to 0x7FFFFFFF, encoded in a
       UTF-8-like manner as per the old RFC, you can set PCRE_NO_UTF8_CHECK to  bypass  the  more
       restrictive  test.  However,  in  this situation, you will have to apply your own validity
       check.
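
       As a starting point for such a check, the sketch below decodes one character using the
       old RFC 2279 byte layout (1 to 6 bytes, values 0 to 0x7FFFFFFF). The function name
       decode_rfc2279 is hypothetical and the code is not taken from PCRE. It assumes that
       rejecting stray continuation bytes, truncated sequences, and the bytes 0xFE and 0xFF is
       enough for the caller; it does not reject overlong encodings.

```c
#include <stddef.h>

/* Illustrative only: this helper is NOT part of PCRE.
   Decode one character from s (len bytes available) using the RFC 2279
   layout. On success, store the value in *out and return the number of
   bytes consumed. Return 0 if the bytes do not fit the RFC 2279 layout. */
size_t decode_rfc2279(const unsigned char *s, size_t len, unsigned long *out)
{
    size_t n, i;
    unsigned long v;

    if (len == 0) return 0;

    /* The lead byte determines the sequence length and the first value bits. */
    if (s[0] < 0x80)      { n = 1; v = s[0]; }
    else if (s[0] < 0xC0) { return 0; }            /* stray continuation byte */
    else if (s[0] < 0xE0) { n = 2; v = s[0] & 0x1F; }
    else if (s[0] < 0xF0) { n = 3; v = s[0] & 0x0F; }
    else if (s[0] < 0xF8) { n = 4; v = s[0] & 0x07; }
    else if (s[0] < 0xFC) { n = 5; v = s[0] & 0x03; }
    else if (s[0] < 0xFE) { n = 6; v = s[0] & 0x01; }
    else                  { return 0; }            /* 0xFE and 0xFF are never valid */

    if (n > len) return 0;                         /* truncated sequence */

    /* Each continuation byte has the form 10xxxxxx and adds six bits. */
    for (i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;
        v = (v << 6) | (unsigned long)(s[i] & 0x3F);
    }

    *out = v;
    return n;
}
```

       A caller would loop over the subject, advancing by the returned byte count, and treat a
       return value of 0 as an invalid string that must not be passed to PCRE with
       PCRE_NO_UTF8_CHECK set.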

   General comments about UTF-8 mode

       1. An unbraced hexadecimal escape  sequence  (such  as  \xb3)  matches  a  two-byte  UTF-8
       character if the value is greater than 127.

       2. Octal numbers up to \777 are recognized, and match two-byte UTF-8 characters for values
       greater than \177.

       3. Repeat quantifiers apply to complete UTF-8 characters, not  to  individual  bytes,  for
       example: \x{100}{3}.

       4. The dot metacharacter matches one UTF-8 character instead of a single byte.

       5.  The  escape  sequence \C can be used to match a single byte in UTF-8 mode, but its use
       can lead to some strange effects. This  facility  is  not  available  in  the  alternative
       matching function, pcre_dfa_exec().

       6.  The  character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly test characters of
       any code value, but, by default, the characters that PCRE recognizes as digits, spaces, or
       word characters remain the same set as before, all with values less than 256. This remains
       true even when PCRE is built to include Unicode property support, because to do  otherwise
       would  slow down PCRE in many common cases. Note in particular that this applies to \b and
       \B, because they are defined in terms of \w and \W. If you really want to test for a wider
       sense  of,  say,  "digit",  you  can  use  explicit Unicode property tests such as \p{Nd}.
       Alternatively, if you set the PCRE_UCP option, the way that the character escapes work  is
       changed so that Unicode properties are used to determine which characters match. There are
       more details in the section on generic character types in the pcrepattern documentation.

       7. Similarly, characters that match the POSIX named character classes are  all  low-valued
       characters, unless the PCRE_UCP option is set.

       8.  However,  the horizontal and vertical whitespace matching escapes (\h, \H, \v, and \V)
       do match all the appropriate Unicode characters, whether or not PCRE_UCP is set.

       9. Case-insensitive matching applies only to characters whose values are  less  than  128,
       unless  PCRE is built with Unicode property support. Even when Unicode property support is
       available, PCRE still uses its own character tables when checking the case  of  low-valued
       characters,  so  as  not to degrade performance.  The Unicode property information is used
       only for characters  with  higher  values.  Furthermore,  PCRE  supports  case-insensitive
       matching  only  when  there  is a one-to-one mapping between a letter's cases. There are a
       small number of many-to-one mappings in Unicode; these are not supported by PCRE.
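
       Items 1 to 4 above follow from how UTF-8 lays out characters as bytes. The sketch below
       shows why a value such as \xb3 or \777 occupies two bytes, and why a repeat such as
       \x{100}{3} covers three characters but six bytes. The function names utf8_encode and
       utf8_char_count are hypothetical; neither is part of the PCRE API. The encoder assumes
       its input is a valid code point in the range 0 to 0x10FFFF.

```c
#include <stddef.h>

/* Illustrative only: these helpers are NOT part of PCRE. */

/* Encode cp (assumed to be in the range 0 to 0x10FFFF) as UTF-8 into buf,
   which must have room for 4 bytes. Returns the number of bytes written. */
size_t utf8_encode(unsigned long cp, unsigned char *buf)
{
    if (cp < 0x80UL) {                       /* 1 byte: 0xxxxxxx */
        buf[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800UL) {                      /* 2 bytes: 110xxxxx 10xxxxxx */
        buf[0] = (unsigned char)(0xC0 | (cp >> 6));
        buf[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000UL) {                    /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        buf[0] = (unsigned char)(0xE0 | (cp >> 12));
        buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        buf[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    /* 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
    buf[0] = (unsigned char)(0xF0 | (cp >> 18));
    buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    buf[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Count characters (not bytes) in a valid UTF-8 string by counting every
   byte that is not a continuation byte of the form 10xxxxxx. */
size_t utf8_char_count(const unsigned char *s, size_t len)
{
    size_t i, count = 0;
    for (i = 0; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) count++;
    }
    return count;
}
```

       For example, utf8_encode(0xB3, buf) writes the two bytes 0xC2 0xB3, and a subject made of
       three copies of the encoding of 0x100 (0xC4 0x80) is six bytes long but counts as three
       characters.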

AUTHOR


       Philip Hazel
       University Computing Service
       Cambridge CB2 3QH, England.

       Putting an actual email address here seems to have been a spam magnet, so  I've  taken  it
       away.  If you want to email me, use my two initials, followed by the two digits 10, at the
       domain cam.ac.uk.

REVISION


       Last updated: 13 November 2010
       Copyright (c) 1997-2010 University of Cambridge.

                                                                                          PCRE(3)