Provided by: libpcre3-dev_7.2-1ubuntu2_i386 bug
 

NAME

        PCRE - Perl-compatible regular expressions
 

INTRODUCTION

 
        The  PCRE  library is a set of functions that implement regular expres‐
        sion pattern matching using the same syntax and semantics as Perl, with
        just  a  few differences. (Certain features that appeared in Python and
        PCRE before they appeared in Perl are also available using  the  Python
        syntax.)
 
        The  current  implementation of PCRE (release 7.x) corresponds approxi‐
        mately with Perl 5.10, including support for UTF-8 encoded strings  and
        Unicode general category properties. However, UTF-8 and Unicode support
        has to be explicitly enabled; it is not the default. The Unicode tables
        correspond to Unicode release 5.0.0.
 
        In  addition to the Perl-compatible matching function, PCRE contains an
        alternative matching function that matches the same  compiled  patterns
        in  a different way. In certain circumstances, the alternative function
        has some advantages. For a discussion of the two  matching  algorithms,
        see the pcrematching page.
 
        PCRE  is  written  in C and released as a C library. A number of people
        have written wrappers and interfaces of various kinds.  In  particular,
        Google  Inc.   have  provided  a comprehensive C++ wrapper. This is now
        included as part of the PCRE distribution. The pcrecpp page has details
        of  this  interface.  Other  people’s contributions can be found in the
        Contrib directory at the primary FTP site, which is:
 
        ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre
 
        Details of exactly which Perl regular expression features are  and  are
        not supported by PCRE are given in separate documents. See the pcrepat     
        tern and pcrecompat pages.
 
        Some features of PCRE can be included, excluded, or  changed  when  the
        library  is  built.  The pcre_config() function makes it possible for a
        client to discover which features are  available.  The  features  them‐
        selves  are described in the pcrebuild page. Documentation about build‐
        ing PCRE for various operating systems can be found in the README  file
        in the source distribution.
 
        The  library  contains  a number of undocumented internal functions and
        data tables that are used by more than one  of  the  exported  external
        functions,  but  which  are  not  intended for use by external callers.
        Their names all begin with "_pcre_", which hopefully will  not  provoke
        any name clashes. In some environments, it is possible to control which
        external symbols are exported when a shared library is  built,  and  in
        these cases the undocumented symbols are not exported.
        The  user  documentation  for PCRE comprises a number of different sec‐
        tions. In the "man" format, each of these is a separate "man page".  In
        the  HTML  format, each is a separate page, linked from the index page.
        In the plain text format, all the sections are concatenated,  for  ease
        of searching. The sections are as follows:
 
          pcre              this document
          pcre-config       show PCRE installation configuration information
          pcreapi           details of PCRE’s native C API
          pcrebuild         options for building PCRE
          pcrecallout       details of the callout feature
          pcrecompat        discussion of Perl compatibility
          pcrecpp           details of the C++ wrapper
          pcregrep          description of the pcregrep command
          pcrematching      discussion of the two matching algorithms
          pcrepartial       details of the partial matching facility
          pcrepattern       syntax and semantics of supported
                              regular expressions
          pcreperform       discussion of performance issues
          pcreposix         the POSIX-compatible C API
          pcreprecompile    details of saving and re-using precompiled patterns
          pcresample        discussion of the sample program
          pcrestack         discussion of stack usage
          pcretest          description of the pcretest testing command
 
        In addition, in the "man" and HTML formats, there is a short  page  for
        each C library function, listing its arguments and results.
 

LIMITATIONS

 
        There  are some size limitations in PCRE but it is hoped that they will
        never in practice be relevant.
 
        The maximum length of a compiled pattern is 65539 (sic) bytes  if  PCRE
        is compiled with the default internal linkage size of 2. If you want to
        process regular expressions that are truly enormous,  you  can  compile
        PCRE  with  an  internal linkage size of 3 or 4 (see the README file in
        the source distribution and the pcrebuild documentation  for  details).
        In  these  cases the limit is substantially larger.  However, the speed
        of execution is slower.
 
        All values in repeating quantifiers must be less than 65536. The  maxi‐
        mum  compiled  length  of  subpattern  with an explicit repeat count is
        30000 bytes. The maximum number of capturing subpatterns is 65535.
 
        There is no limit to the number of parenthesized subpatterns, but there
        can be no more than 65535 capturing subpatterns.
 
        The maximum length of name for a named subpattern is 32 characters, and
        the maximum number of named subpatterns is 10000.
 
        The maximum length of a subject string is the largest  positive  number
        that  an integer variable can hold. However, when using the traditional
        matching function, PCRE uses recursion to handle subpatterns and indef‐
        inite  repetition.  This means that the available stack space may limit
        the size of a subject string that can be processed by certain patterns.
        For a discussion of stack issues, see the pcrestack documentation.
        From  release  3.3,  PCRE  has  had  some support for character strings
        encoded in the UTF-8 format. For release 4.0 this was greatly  extended
        to  cover  most common requirements, and in release 5.0 additional sup‐
        port for Unicode general category properties was added.
 
        In order process UTF-8 strings, you must build PCRE  to  include  UTF-8
        support  in  the  code,  and, in addition, you must call pcre_compile()
        with the PCRE_UTF8 option flag. When you do this, both the pattern  and
        any  subject  strings  that are matched against it are treated as UTF-8
        strings instead of just strings of bytes.
 
        If you compile PCRE with UTF-8 support, but do not use it at run  time,
        the  library will be a bit bigger, but the additional run time overhead
        is limited to testing the PCRE_UTF8 flag occasionally, so should not be
        very big.
 
        If PCRE is built with Unicode character property support (which implies
        UTF-8 support), the escape sequences \p{..}, \P{..}, and  \X  are  sup‐
        ported.  The available properties that can be tested are limited to the
        general category properties such as Lu for an upper case letter  or  Nd
        for  a  decimal number, the Unicode script names such as Arabic or Han,
        and the derived properties Any and L&. A full  list  is  given  in  the
        pcrepattern documentation. Only the short names for properties are sup‐
        ported. For example, \p{L} matches a letter. Its Perl synonym,  \p{Let‐
        ter},  is  not  supported.   Furthermore,  in Perl, many properties may
        optionally be prefixed by "Is", for compatibility with Perl  5.6.  PCRE
        does not support this.
 
        The following comments apply when PCRE is running in UTF-8 mode:
 
        1.  When you set the PCRE_UTF8 flag, the strings passed as patterns and
        subjects are checked for validity on entry to the  relevant  functions.
        If an invalid UTF-8 string is passed, an error return is given. In some
        situations, you may already know  that  your  strings  are  valid,  and
        therefore want to skip these checks in order to improve performance. If
        you set the PCRE_NO_UTF8_CHECK flag at compile time  or  at  run  time,
        PCRE  assumes  that  the  pattern or subject it is given (respectively)
        contains only valid UTF-8 codes. In this case, it does not diagnose  an
        invalid  UTF-8 string. If you pass an invalid UTF-8 string to PCRE when
        PCRE_NO_UTF8_CHECK is set, the results are undefined. Your program  may
        crash.
 
        2.  An  unbraced  hexadecimal  escape sequence (such as \xb3) matches a
        two-byte UTF-8 character if the value is greater than 127.
 
        3. Octal numbers up to \777 are recognized, and  match  two-byte  UTF-8
        characters for values greater than \177.
 
        4.  Repeat quantifiers apply to complete UTF-8 characters, not to indi‐
        vidual bytes, for example: \x{100}{3}.
 
        5. The dot metacharacter matches one UTF-8 character instead of a  sin‐
        gle byte.
 
        6.  The  escape sequence \C can be used to match a single byte in UTF-8
        mode, but its use can lead to some strange effects.  This  facility  is
        not available in the alternative matching function, pcre_dfa_exec().
 
        7.  The  character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly
        test characters of any code value, but the characters that PCRE  recog‐
        nizes  as  digits,  spaces,  or  word characters remain the same set as
        before, all with values less than 256. This remains true even when PCRE
        includes  Unicode  property support, because to do otherwise would slow
        down PCRE in many common cases. If you really want to test for a  wider
        sense  of,  say,  "digit",  you must use Unicode property tests such as
        \p{Nd}.
 
        8. Similarly, characters that match the POSIX named  character  classes
        are all low-valued characters.
 
        9.  However,  the Perl 5.10 horizontal and vertical whitespace matching
        escapes (\h, \H, \v, and \V) do match all the appropriate Unicode char‐
        acters.
 
        10.  Case-insensitive  matching applies only to characters whose values
        are less than 128, unless PCRE is built with Unicode property  support.
        Even  when  Unicode  property support is available, PCRE still uses its
        own character tables when checking the case of  low-valued  characters,
        so  as not to degrade performance.  The Unicode property information is
        used only for characters with higher values. Even when Unicode property
        support is available, PCRE supports case-insensitive matching only when
        there is a one-to-one mapping between a letter’s  cases.  There  are  a
        small  number  of  many-to-one  mappings in Unicode; these are not sup‐
        ported by PCRE.
 

AUTHOR

 
        Philip Hazel
        University Computing Service
        Cambridge CB2 3QH, England.
 
        Putting an actual email address here seems to have been a spam  magnet,
        so  I’ve  taken  it away. If you want to email me, use my two initials,
        followed by the two digits 10, at the domain cam.ac.uk.
 

REVISION

 
        Last updated: 13 June 2007
        Copyright (c) 1997-2007 University of Cambridge.
 
                                                                        PCRE(3)