Provided by: libpcre2-dev_10.34-7ubuntu0.1_amd64

NAME

       PCRE2 - Perl-compatible regular expressions (revised API)

INTRODUCTION

       PCRE2  is  the  name  used  for  a  revised  API  for  the PCRE library, which is a set of
       functions, written in C, that implement regular expression pattern matching using the same
       syntax  and  semantics as Perl, with just a few differences. After nearly two decades, the
       limitations of the original API were making development increasingly  difficult.  The  new
       API  is  more  extensible,  and  it  was  simplified  by  abolishing  the separate "study"
       optimizing function; in PCRE2, patterns are automatically optimized where possible.  Since
       forking from PCRE1, the code has been extensively refactored and new features introduced.

       As  well  as Perl-style regular expression patterns, some features that appeared in Python
       and the original PCRE before they appeared in Perl are available using the Python  syntax.
       There  is  also some support for one or two .NET and Oniguruma syntax items, and there are
       options for requesting some minor changes that give  better  ECMAScript  (aka  JavaScript)
       compatibility.

       The  source code for PCRE2 can be compiled to support 8-bit, 16-bit, or 32-bit code units,
       which means that up to three separate libraries may be installed.  The  original  work  to
       extend  PCRE  to  16-bit  and  32-bit  code units was done by Zoltan Herczeg and Christian
       Persch, respectively. In all three  cases,  strings  can  be  interpreted  either  as  one
       character  per  code  unit,  or  as  UTF-encoded Unicode, with support for Unicode general
       category properties. Unicode support is optional at  build  time  (but  is  the  default).
       However,  processing strings as UTF code units must be enabled explicitly at run time. The
       version of Unicode in use can be discovered by running

         pcre2test -C

       The three libraries contain identical sets of functions, with names ending in _8, _16,  or
       _32,    respectively    (for    example,    pcre2_compile_8()).   However,   by   defining
       PCRE2_CODE_UNIT_WIDTH to be 8, 16, or 32, a program that uses just one code unit width can
       be  written  using generic names such as pcre2_compile(), and the documentation is written
       assuming that this is the case.

       In addition to the  Perl-compatible  matching  function,  PCRE2  contains  an  alternative
       function  that  matches  the  same  compiled  patterns  in  a  different  way.  In certain
       circumstances, the alternative function has some advantages.  For a discussion of the  two
       matching algorithms, see the pcre2matching page.

       Details  of  exactly  which  Perl regular expression features are and are not supported by
       PCRE2 are given in separate documents. See the pcre2pattern and pcre2compat  pages.  There
       is a syntax summary in the pcre2syntax page.

       Some  features  of  PCRE2 can be included, excluded, or changed when the library is built.
       The pcre2_config() function makes it possible for a client to discover which features  are
       available.  The  features  themselves  are described in the pcre2build page. Documentation
       about building PCRE2 for various operating systems can be found in the README and
       NON-AUTOTOOLS-BUILD files in the source distribution.

       The libraries contain a number of undocumented internal functions and data tables that
       are used by more than one of the exported external functions, but which are  not  intended
       for use by external callers. Their names all begin with "_pcre2", which hopefully will not
       provoke any name clashes. In some environments, it is possible to control  which  external
       symbols  are  exported when a shared library is built, and in these cases the undocumented
       symbols are not exported.

SECURITY CONSIDERATIONS

       If you are using PCRE2 in a non-UTF application that permits  users  to  supply  arbitrary
       patterns  for  compilation,  you should be aware of a feature that allows users to turn on
       UTF support from within a pattern. For example, an 8-bit pattern that begins with "(*UTF)"
       turns on UTF-8 mode, which interprets patterns and subjects as strings of UTF-8 code units
       instead of individual 8-bit characters. This causes both the pattern and any data against
       which it is matched to be checked for UTF-8 validity. If the data string is very long,
       such a check might consume enough resources to degrade your application's performance.

       One  way  of guarding against this possibility is to use the pcre2_pattern_info() function
       to check the compiled pattern's options for PCRE2_UTF.  Alternatively,  you  can  set  the
       PCRE2_NEVER_UTF  option  when calling pcre2_compile(). This causes a compile time error if
       the pattern contains a UTF-setting sequence.

       The use of Unicode properties for character types such as \d  can  also  be  enabled  from
       within  the pattern, by specifying "(*UCP)". This feature can be disallowed by setting the
       PCRE2_NEVER_UCP option.

       If your application is one that supports UTF, be aware that  validity  checking  can  take
       time.   If  the  same  data  string  is  to  be  matched  many  times,  you  can  use  the
       PCRE2_NO_UTF_CHECK option for the second and subsequent matches to avoid running redundant
       checks.

       The  use  of  the  \C  escape  sequence in a UTF-8 or UTF-16 pattern can lead to problems,
       because it may leave the current  matching  point  in  the  middle  of  a  multi-code-unit
       character.  The  PCRE2_NEVER_BACKSLASH_C  option can be used by an application to lock out
       the use of \C, causing a compile-time error if it is encountered. It is also  possible  to
       build PCRE2 with the use of \C permanently disabled.

       Performance can also suffer when a pattern that has a very large search tree is run
       against a string that will never match. Nested unlimited repeats in a pattern
       are   a   common   example.   PCRE2   provides  some  protection  against  this:  see  the
       pcre2_set_match_limit() function in the pcre2api page. There is a similar function  called
       pcre2_set_depth_limit() that can be used to restrict the amount of memory that is used.

USER DOCUMENTATION

       The  user  documentation  for PCRE2 comprises a number of different sections. In the "man"
       format, each of these is a separate "man page". In the HTML format,  each  is  a  separate
       page,  linked  from  the  index  page.  In  the plain text format, the descriptions of the
       pcre2grep and pcre2test programs are in  files  called  pcre2grep.txt  and  pcre2test.txt,
       respectively. The remaining sections, except for the pcre2demo section (which is a program
       listing), and the short pages for individual functions, are concatenated in pcre2.txt, for
       ease of searching. The sections are as follows:

         pcre2              this document
         pcre2-config       show PCRE2 installation configuration information
         pcre2api           details of PCRE2's native C API
         pcre2build         building PCRE2
         pcre2callout       details of the pattern callout feature
         pcre2compat        discussion of Perl compatibility
         pcre2convert       details of pattern conversion functions
         pcre2demo          a demonstration C program that uses PCRE2
         pcre2grep          description of the pcre2grep command (8-bit only)
         pcre2jit           discussion of just-in-time optimization support
         pcre2limits        details of size and other limits
         pcre2matching      discussion of the two matching algorithms
         pcre2partial       details of the partial matching facility
         pcre2pattern       syntax and semantics of supported regular
                              expression patterns
         pcre2perform       discussion of performance issues
         pcre2posix         the POSIX-compatible C API for the 8-bit library
         pcre2sample        discussion of the pcre2demo program
         pcre2serialize     details of pattern serialization
         pcre2syntax        quick syntax reference
         pcre2test          description of the pcre2test command
         pcre2unicode       discussion of Unicode and UTF support

       In  the  "man"  and  HTML formats, there is also a short page for each C library function,
       listing its arguments and results.

AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge, England.

       Putting an actual email address here is a spam magnet. If you want to email me, use my two
       initials, followed by the two digits 10, at the domain cam.ac.uk.

REVISION

       Last updated: 17 September 2018
       Copyright (c) 1997-2018 University of Cambridge.