Provided by: libpcre2-dev_10.31-2_amd64

NAME

       PCRE2 - Perl-compatible regular expressions (revised API)

UNICODE AND UTF SUPPORT


       When  PCRE2  is  built  with  Unicode  support (which is the default), it has knowledge of
       Unicode character properties and can process text strings  in  UTF-8,  UTF-16,  or  UTF-32
       format  (depending  on  the  code unit width). However, by default, PCRE2 assumes that one
       code unit is one character. To process a pattern as a UTF string, where  a  character  may
       require  more  than one code unit, you must call pcre2_compile() with the PCRE2_UTF option
       flag, or the pattern must start with the sequence (*UTF). When  either  of  these  is  the
       case,  both the pattern and any subject strings that are matched against it are treated as
       UTF strings instead of strings of individual one-code-unit characters.
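
       The difference between code units and characters can be illustrated  without  any  PCRE2
       calls. The following is a minimal sketch for the 8-bit case; utf8_char_count() is a
       hypothetical helper, not part of the PCRE2 API, and it assumes its input is already valid
       UTF-8.

```c
#include <stddef.h>

/* Count the characters in a NUL-terminated UTF-8 string by counting every
   byte that is not a continuation byte (binary 10xxxxxx). Hypothetical
   helper for illustration; assumes the input is valid UTF-8. */
size_t utf8_char_count(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

       For the string "caf\xc3\xa9" this gives 4 characters, although the string occupies 5 code
       units. Without PCRE2_UTF or (*UTF), each of those 5 code units is treated as a separate
       character.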

       If you do not need Unicode support you can build PCRE2  without  it,  in  which  case  the
       library will be smaller.

UNICODE PROPERTY SUPPORT


       When  PCRE2 is built with Unicode support, the escape sequences \p{..}, \P{..}, and \X can
       be used. The Unicode properties that can be tested are limited  to  the  general  category
       properties  such  as  Lu  for an upper case letter or Nd for a decimal number, the Unicode
       script names such as Arabic or Han, and the derived properties Any and L&. Full lists  are
       given  in  the  pcre2pattern  and  pcre2syntax  documentation.  Only  the  short names for
       properties are  supported.  For  example,  \p{L}  matches  a  letter.  Its  Perl  synonym,
       \p{Letter},  is  not  supported.   Furthermore, in Perl, many properties may optionally be
       prefixed by "Is", for compatibility with Perl 5.6. PCRE2 does not support this.

WIDE CHARACTERS AND UTF MODES


       Code points less than 256 can be specified  in  patterns  by  either  braced  or  unbraced
       hexadecimal  escape  sequences  (for  example,  \x{b3} or \xb3). Larger values have to use
       braced sequences. Unbraced octal code points up to \777 are also recognized;  larger  ones
       can be coded using \o{...}.

       In  UTF modes, repeat quantifiers apply to complete UTF characters, not to individual code
       units.

       In UTF modes, the dot metacharacter matches one UTF character instead  of  a  single  code
       unit.

       The  escape sequence \C can be used to match a single code unit in a UTF mode, but its use
       can lead to some strange effects because it  breaks  up  multi-unit  characters  (see  the
       description of \C in the pcre2pattern documentation).

       The use of \C is not supported by the alternative matching function pcre2_dfa_match() when
       in UTF-8 or UTF-16 mode, that is, when a character may consist of more than one code unit.
       The  use of \C in these modes provokes a match-time error. Also, the JIT optimization does
       not support \C in these modes. If JIT optimization is requested  for  a  UTF-8  or  UTF-16
       pattern  that  contains  \C, it will not succeed, and so when pcre2_match() is called, the
       matching will be carried out by the normal interpretive function.

       The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly test characters of  any
       code  value,  but,  by default, the characters that PCRE2 recognizes as digits, spaces, or
       word characters remain the same set as in non-UTF mode, all with  code  points  less  than
       256.  This remains true even when PCRE2 is built to include Unicode support, because to do
       otherwise would slow down matching in many common cases. Note that this also applies to \b
       and  \B,  because  they are defined in terms of \w and \W. If you want to test for a wider
       sense of, say, "digit", you can use  explicit  Unicode  property  tests  such  as  \p{Nd}.
       Alternatively, if you set the PCRE2_UCP option, the way that the character escapes work is
       changed so that Unicode properties are used to determine which characters match. There are
       more details in the section on generic character types in the pcre2pattern documentation.

       Similarly,  characters  that  match  the  POSIX named character classes are all low-valued
       characters, unless the PCRE2_UCP option is set.

       However, the special horizontal and vertical white space matching escapes (\h, \H, \v, and
       \V) do match all the appropriate Unicode characters, whether or not PCRE2_UCP is set.

CASE-EQUIVALENCE IN UTF MODES


       Case-insensitive  matching  in  a  UTF  mode  makes  use  of Unicode properties except for
       characters whose code points are less than 128 and that have at most  two  case-equivalent
       values.  For these, a direct table lookup is used for speed. A few Unicode characters such
       as Greek sigma have more than two code points that  are  case-equivalent,  and  these  are
       treated as such.

VALIDITY OF UTF STRINGS


       When  the  PCRE2_UTF  option  is  set, the strings passed as patterns and subjects are (by
       default) checked for validity on entry to the  relevant  functions.   If  an  invalid  UTF
       string  is  passed,  a  negative  error  code  is  returned.  The  code unit offset to the
       offending  character  can  be  extracted  from   the   match   data   block   by   calling
       pcre2_get_startchar(), which is used for this purpose after a UTF error.

       UTF-16 and UTF-32 strings can indicate their endianness by a special code  known  as  a
       byte-order mark (BOM). The PCRE2 functions do not handle this, expecting strings to be in
       host byte order.
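
       Because a BOM is not handled by the library, a caller that reads UTF-16 from an  external
       source may want to normalize it first. The following is one possible sketch, not something
       PCRE2 provides: utf16_to_host_order() is a hypothetical helper that assumes the data is
       already loaded into an array of 16-bit units and that a leading 0xFFFE unit means the
       data is in the opposite byte order.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: if the buffer starts with a byte-swapped BOM
   (0xFFFE), swap the bytes of every unit in place. If it then starts with
   a BOM in host order (0xFEFF), skip it. Returns a pointer to the first
   unit after any BOM and updates *len to the remaining unit count. */
uint16_t *utf16_to_host_order(uint16_t *buf, size_t *len)
{
    if (*len == 0)
        return buf;
    if (buf[0] == 0xFFFE) {
        for (size_t i = 0; i < *len; i++)
            buf[i] = (uint16_t)((buf[i] << 8) | (buf[i] >> 8));
    }
    if (buf[0] == 0xFEFF) {
        (*len)--;
        return buf + 1;
    }
    return buf;
}
```

       The returned pointer and length could then be passed on as a subject in host byte order.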

       A  UTF  string  is  checked  before  any  other  processing  takes  place.  In the case of
       pcre2_match() and pcre2_dfa_match() calls with a non-zero starting offset,  the  check  is
       applied  only  to  that  part  of the subject that could be inspected during matching, and
       there is a check that the starting offset points to the first code unit of a character  or
       to the end of the subject. If there are no lookbehind assertions in the pattern, the check
       starts at the starting  offset.  Otherwise,  it  starts  at  the  length  of  the  longest
       lookbehind  before  the  starting  offset, or at the start of the subject if there are not
       that many characters before the starting offset. Note that the sequences  \b  and  \B  are
       one-character lookbehinds.
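
       The rule for where the check starts can be sketched for the 8-bit  case  as  follows.  This
       is an illustration of the description above, not the library's own code; check_start() and
       its parameters are hypothetical names, and the sketch assumes start_offset already points
       to the first byte of a character or to the end of the subject.

```c
#include <stddef.h>

/* Hypothetical helper: step back max_lookbehind characters from
   start_offset in a UTF-8 subject, stopping at offset 0 if there are not
   that many characters. Continuation bytes (binary 10xxxxxx) are skipped
   so that the result lands on the first byte of a character. */
size_t check_start(const unsigned char *subject, size_t start_offset,
                   size_t max_lookbehind)
{
    size_t pos = start_offset;
    while (max_lookbehind > 0 && pos > 0) {
        pos--;
        while (pos > 0 && (subject[pos] & 0xC0) == 0x80)
            pos--;
        max_lookbehind--;
    }
    return pos;
}
```

       With a lookbehind length of 0 the check starts at the starting offset itself; a pattern
       that uses \b or \B corresponds to a lookbehind length of at least 1.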

       In addition to checking the format of the string, there is a check to ensure that all code
       points lie in the range U+0 to U+10FFFF, excluding the surrogate area. The so-called "non-
       character" code points are not excluded because Unicode corrigendum #9 makes it clear that
       they should not be.

       Characters in the "Surrogate Area" of Unicode are reserved for use by UTF-16,  where  they
       are  used  in pairs to encode code points with values greater than 0xFFFF. The code points
       that are encoded by UTF-16 pairs are available  independently  in  the  UTF-8  and  UTF-32
       encodings.  (In  other  words,  the  whole  surrogate  thing  is  a fudge for UTF-16 which
       unfortunately messes up UTF-8 and UTF-32.)
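
       The pairing arithmetic defined by the Unicode Standard can be sketched as below. The
       function names are hypothetical and are not part of PCRE2; the encoder assumes its
       argument lies in the range 0x10000 to 0x10FFFF and the decoder assumes a valid high/low
       pair.

```c
#include <stdint.h>

/* Encode a code point in the range 0x10000..0x10FFFF as a UTF-16
   surrogate pair. The caller is assumed to have checked the range. */
void utf16_encode_pair(uint32_t cp, uint16_t *high, uint16_t *low)
{
    uint32_t v = cp - 0x10000;                 /* 20-bit value */
    *high = (uint16_t)(0xD800 + (v >> 10));    /* top 10 bits */
    *low  = (uint16_t)(0xDC00 + (v & 0x3FF));  /* bottom 10 bits */
}

/* Combine a high surrogate (0xD800..0xDBFF) and a low surrogate
   (0xDC00..0xDFFF) back into a code point. */
uint32_t utf16_decode_pair(uint16_t high, uint16_t low)
{
    return 0x10000 + ((((uint32_t)high - 0xD800) << 10) |
                      ((uint32_t)low - 0xDC00));
}
```

       For example, U+1F600 is encoded as the pair 0xD83D, 0xDE00. In UTF-8 and UTF-32 the same
       code point is encoded directly, which is why the values 0xD800 to 0xDFFF are not valid
       characters there.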

       In some situations, you may already know that your strings are valid, and  therefore  want
       to  skip  these  checks in order to improve performance, for example in the case of a long
       subject string that is being scanned repeatedly.  If you set the PCRE2_NO_UTF_CHECK option
       at  compile  time  or at match time, PCRE2 assumes that the pattern or subject it is given
       (respectively) contains only valid UTF code unit sequences.

       Passing PCRE2_NO_UTF_CHECK to pcre2_compile() just disables the check for the pattern;  it
       does  not  also  apply  to subject strings. If you want to disable the check for a subject
       string you must pass this option to pcre2_match() or pcre2_dfa_match().

       If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the result is  undefined
       and your program may crash or loop indefinitely.

       Note  that  setting  PCRE2_NO_UTF_CHECK at compile time does not disable the error that is
       given if an escape sequence for an invalid  Unicode  code  point  is  encountered  in  the
       pattern.  If  you want to allow escape sequences such as \x{d800} (a surrogate code point)
       you can  set  the  PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES  extra  option.  However,  this  is
       possible  only  in  UTF-8  and UTF-32 modes, because these values are not representable in
       UTF-16.

   Errors in UTF-8 strings

       The following negative error codes are given for invalid UTF-8 strings:

         PCRE2_ERROR_UTF8_ERR1
         PCRE2_ERROR_UTF8_ERR2
         PCRE2_ERROR_UTF8_ERR3
         PCRE2_ERROR_UTF8_ERR4
         PCRE2_ERROR_UTF8_ERR5

       The string ends with a truncated UTF-8 character; the code specifies how  many  bytes  are
       missing  (1  to  5).  Although  RFC 3629 restricts UTF-8 characters to be no longer than 4
       bytes, the encoding scheme (originally defined by RFC 2279) allows for up to 6 bytes,  and
       this is checked first; hence the possibility of 4 or 5 missing bytes.

         PCRE2_ERROR_UTF8_ERR6
         PCRE2_ERROR_UTF8_ERR7
         PCRE2_ERROR_UTF8_ERR8
         PCRE2_ERROR_UTF8_ERR9
         PCRE2_ERROR_UTF8_ERR10

       The  two  most significant bits of the 2nd, 3rd, 4th, 5th, or 6th byte of the character do
       not have the binary value 0b10 (that is, either the most significant bit is 0, or the next
       bit is 1).

         PCRE2_ERROR_UTF8_ERR11
         PCRE2_ERROR_UTF8_ERR12

       A  character  that  is valid by the RFC 2279 rules is either 5 or 6 bytes long; these code
       points are excluded by RFC 3629.

         PCRE2_ERROR_UTF8_ERR13

       A 4-byte character has a value greater than 0x10ffff; these code points  are  excluded  by
       RFC 3629.

         PCRE2_ERROR_UTF8_ERR14

       A  3-byte  character  has a value in the range 0xd800 to 0xdfff; this range of code points
       is reserved by RFC 3629 for use with UTF-16, and so is excluded from UTF-8.

         PCRE2_ERROR_UTF8_ERR15
         PCRE2_ERROR_UTF8_ERR16
         PCRE2_ERROR_UTF8_ERR17
         PCRE2_ERROR_UTF8_ERR18
         PCRE2_ERROR_UTF8_ERR19

       A 2-, 3-, 4-, 5-, or 6-byte character is "overlong", that is, it codes for  a  value  that
       can be represented by fewer bytes, which is invalid. For example, the two bytes 0xc0, 0xae
       give the value 0x2e, whose correct coding uses just one byte.

         PCRE2_ERROR_UTF8_ERR20

       The two most significant bits of the first byte of a character have the binary value  0b10
       (that is, the most significant bit is 1 and the second is 0). Such a byte can only validly
       occur as the second or subsequent byte of a multi-byte character.

         PCRE2_ERROR_UTF8_ERR21

       The first byte of a character has the value 0xfe or 0xff. These values can never occur  in
       a valid UTF-8 string.
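
       The conditions listed above can be collected into a single checking function. What follows
       is a sketch written from these descriptions, not the library's implementation. The name
       utf8_check() is hypothetical, and the value it returns is the number N from the
       matching PCRE2_ERROR_UTF8_ERRN name (or 0 for a valid string), not the actual negative
       constant defined in pcre2.h. The order in which the conditions are tested is an assumption,
       so a string with several faults may be reported differently by PCRE2 itself.

```c
#include <stddef.h>

/* Returns 0 if s[0..len) is valid UTF-8, otherwise the N of the
   PCRE2_ERROR_UTF8_ERRN description that applies (illustrative mapping). */
int utf8_check(const unsigned char *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned c = s[i++];
        if (c < 0x80) continue;          /* one-byte character */
        if (c < 0xC0) return 20;         /* continuation byte in first place */
        if (c >= 0xFE) return 21;        /* 0xfe and 0xff never occur */

        /* Additional bytes expected under the RFC 2279 scheme (1 to 5). */
        size_t ab = c < 0xE0 ? 1 : c < 0xF0 ? 2 : c < 0xF8 ? 3 : c < 0xFC ? 4 : 5;
        size_t left = len - i;
        if (left < ab) return (int)(ab - left);      /* ERR1..ERR5: truncated */

        /* Each additional byte must have the top bits 0b10. */
        for (size_t k = 0; k < ab; k++)
            if ((s[i + k] & 0xC0) != 0x80) return 6 + (int)k;   /* ERR6..ERR10 */

        unsigned d = s[i];               /* second byte of the character */
        switch (ab) {
        case 1:
            if ((c & 0x3E) == 0) return 15;                  /* overlong 2-byte */
            break;
        case 2:
            if (c == 0xE0 && (d & 0x20) == 0) return 16;     /* overlong 3-byte */
            if (c == 0xED && d >= 0xA0) return 14;           /* 0xd800..0xdfff */
            break;
        case 3:
            if (c == 0xF0 && (d & 0x30) == 0) return 17;     /* overlong 4-byte */
            if (c > 0xF4 || (c == 0xF4 && d > 0x8F)) return 13;  /* > 0x10ffff */
            break;
        case 4:
            if (c == 0xF8 && (d & 0x38) == 0) return 18;     /* overlong 5-byte */
            return 11;                                       /* 5 bytes long */
        case 5:
            if (c == 0xFC && (d & 0x3C) == 0) return 19;     /* overlong 6-byte */
            return 12;                                       /* 6 bytes long */
        }
        i += ab;
    }
    return 0;
}
```

       With this sketch, the overlong pair 0xc0, 0xae mentioned above yields 15, and a string
       that stops after the first two bytes of a 3-byte character yields 1.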

   Errors in UTF-16 strings

       The following negative error codes are given for invalid UTF-16 strings:

         PCRE2_ERROR_UTF16_ERR1  Missing low surrogate at end of string
         PCRE2_ERROR_UTF16_ERR2  Invalid low surrogate follows high surrogate
         PCRE2_ERROR_UTF16_ERR3  Isolated low surrogate
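
       The three conditions can be sketched as a single pass over the code units. As  before,
       utf16_check() is a hypothetical name and the return value is the N of the matching
       PCRE2_ERROR_UTF16_ERRN name (0 for a valid string), not the constant from pcre2.h. The
       input is assumed to be in host byte order.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 0 if s[0..len) is valid UTF-16, otherwise 1, 2, or 3 to match
   the descriptions of PCRE2_ERROR_UTF16_ERR1..ERR3 (illustrative mapping). */
int utf16_check(const uint16_t *s, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        uint16_t c = s[i];
        if ((c & 0xF800) != 0xD800) continue;   /* not a surrogate */
        if ((c & 0x0400) != 0) return 3;        /* isolated low surrogate */
        if (i + 1 == len) return 1;             /* high surrogate at end */
        if ((s[i + 1] & 0xFC00) != 0xDC00) return 2;   /* bad low surrogate */
        i++;                                    /* skip the low surrogate */
    }
    return 0;
}
```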

   Errors in UTF-32 strings

       The following negative error codes are given for invalid UTF-32 strings:

         PCRE2_ERROR_UTF32_ERR1  Surrogate character (0xd800 to 0xdfff)
         PCRE2_ERROR_UTF32_ERR2  Code point is greater than 0x10ffff
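
       In UTF-32 every code unit is a complete code point, so the check reduces to a range  test
       on each unit. A minimal sketch, with the hypothetical name utf32_check() and a return
       value that is the N of the matching PCRE2_ERROR_UTF32_ERRN name (0 for a valid string):

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 0 if s[0..len) is valid UTF-32, otherwise 1 or 2 to match the
   descriptions of PCRE2_ERROR_UTF32_ERR1 and ERR2 (illustrative mapping).
   Non-character code points such as U+FFFE are accepted. */
int utf32_check(const uint32_t *s, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (s[i] >= 0xD800 && s[i] <= 0xDFFF) return 1;   /* surrogate */
        if (s[i] > 0x10FFFF) return 2;                    /* out of range */
    }
    return 0;
}
```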

AUTHOR


       Philip Hazel
       University Computing Service
       Cambridge, England.

REVISION


       Last updated: 17 May 2017
       Copyright (c) 1997-2017 University of Cambridge.