Provided by: libpcre2-dev_10.31-2_amd64

NAME

       PCRE2 - Perl-compatible regular expressions

PARTIAL MATCHING IN PCRE2


       In  normal  use  of  PCRE2,  if  the  subject string that is passed to a matching function
       matches  as  far  as  it  goes,  but  is  too  short  to   match   the   entire   pattern,
       PCRE2_ERROR_NOMATCH  is  returned.  There  are  circumstances where it might be helpful to
       distinguish this case from other cases in which there is no match.

       Consider, for example, an application where a human is required to  type  in  data  for  a
       field  with  specific  formatting  requirements.  An  example  might be a date in the form
       ddmmmyy, defined by this pattern:

         ^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$

       If the application sees the user's keystrokes one by one, and can check that what has been
       typed  so  far  is potentially valid, it is able to raise an error as soon as a mistake is
       made, by beeping and not reflecting the character that has been typed, for  example.  This
       immediate  feedback  is  likely to be a better user interface than a check that is delayed
       until the entire string has been entered. Partial matching can also  be  useful  when  the
       subject string is very long and is not all available at once.

       PCRE2  supports partial matching by means of the PCRE2_PARTIAL_SOFT and PCRE2_PARTIAL_HARD
       options, which can be set when calling a matching function.  The  difference  between  the
       two  options  is  whether  or  not a partial match is preferred to an alternative complete
       match, though the details differ between the two  types  of  matching  function.  If  both
       options are set, PCRE2_PARTIAL_HARD takes precedence.

       If  you  want  to  use  partial  matching  with just-in-time optimized code, you must call
       pcre2_jit_compile() with one or both of these options:

         PCRE2_JIT_PARTIAL_SOFT
         PCRE2_JIT_PARTIAL_HARD

       PCRE2_JIT_COMPLETE should also be set if you are going to run non-partial matches  on  the
       same  pattern.  If  the  appropriate JIT mode has not been compiled, interpretive matching
       code is used.

       Setting a partial matching option disables two of PCRE2's  standard  optimizations.  PCRE2
       remembers the last literal code unit in a pattern, and abandons matching immediately if it
       is not present in the subject string. This optimization  cannot  be  used  for  a  subject
       string  that might match only partially. PCRE2 also knows the minimum length of a matching
       string, and does not bother  to  run  the  matching  function  on  shorter  strings.  This
       optimization is also disabled for partial matching.

PARTIAL MATCHING USING pcre2_match()


       A  partial  match occurs during a call to pcre2_match() when the end of the subject string
       is reached successfully, but matching cannot continue because more characters are  needed.
       However,  at  least  one character in the subject must have been inspected. This character
       need not form part of the final matched string; lookbehind assertions and  the  \K  escape
       sequence  provide  ways of inspecting characters before the start of a matched string. The
       requirement for inspecting at least one character  exists  because  an  empty  string  can
       always  be matched; without such a restriction there would always be a partial match of an
       empty string at the end of the subject.

       When a partial match is returned, the first two elements  in  the  ovector  point  to  the
       portion  of  the  subject  that was matched, but the values in the rest of the ovector are
       undefined. The appearance of \K in the pattern has no effect for a partial match. Consider
       this pattern:

         /abc\K123/

       If  it  is  matched against "456abc123xyz" the result is a complete match, and the ovector
       defines the matched string as "123",  because  \K  resets  the  "start  of  match"  point.
       However,  if  a partial match is requested and the subject string is "456abc12", a partial
       match is found for the string "abc12", because all  these  characters  are  needed  for  a
       subsequent re-match with additional characters.

       What happens when a partial match is identified depends on which of the two partial
       matching options is set.

   PCRE2_PARTIAL_SOFT WITH pcre2_match()

       If PCRE2_PARTIAL_SOFT is set when pcre2_match() identifies a partial  match,  the  partial
       match  is  remembered,  but  matching  continues  as normal, and other alternatives in the
       pattern are tried. If no complete match can  be  found,  PCRE2_ERROR_PARTIAL  is  returned
       instead of PCRE2_ERROR_NOMATCH.

       This  option  is "soft" because it prefers a complete match over a partial match.  All the
       various matching items in a pattern  behave  as  if  the  subject  string  is  potentially
       complete.  For  example, \z, \Z, and $ match at the end of the subject, as normal, and for
       \b and \B the end of the subject is treated as a non-alphanumeric.

       If there is more than one partial match, the first one that was found  provides  the  data
       that is returned. Consider this pattern:

         /123\w+X|dogY/

       If  this  is  matched  against  the  subject string "abc123dog", both alternatives fail to
       match, but the end of the subject is reached during matching,  so  PCRE2_ERROR_PARTIAL  is
       returned.  The offsets are set to 3 and 9, identifying "123dog" as the first partial match
       that was found. (In this example, there are two partial matches, because "dog" on its  own
       partially matches the second alternative.)

   PCRE2_PARTIAL_HARD WITH pcre2_match()

       If PCRE2_PARTIAL_HARD is set for pcre2_match(), PCRE2_ERROR_PARTIAL is returned as soon as
       a partial match is found, without continuing to search for possible complete matches. This
       option  is "hard" because it prefers an earlier partial match over a later complete match.
       For this reason, the assumption is made that the end of the supplied  subject  string  may
       not be the true end of the available data, and so, if \z, \Z, \b, \B, or $ are encountered
       at the end of the subject, the result is PCRE2_ERROR_PARTIAL, provided that at  least  one
       character in the subject has been inspected.

   Comparing hard and soft partial matching

       The  difference  between  the two partial matching options can be illustrated by a pattern
       such as:

         /dog(sbody)?/

       This matches either "dog" or "dogsbody", greedily (that is, it prefers the  longer  string
       if possible). If it is matched against the string "dog" with PCRE2_PARTIAL_SOFT, it yields
       a complete match  for  "dog".  However,  if  PCRE2_PARTIAL_HARD  is  set,  the  result  is
       PCRE2_ERROR_PARTIAL.  On  the  other  hand,  if the pattern is made ungreedy the result is
       different:

         /dog(sbody)??/

       In this case the result is always a complete  match  because  that  is  found  first,  and
       matching never continues after finding a complete match. It might be easier to follow this
       explanation by thinking of the two patterns like this:

         /dog(sbody)?/    is the same as  /dogsbody|dog/
         /dog(sbody)??/   is the same as  /dog|dogsbody/

       The second pattern will never match "dogsbody", because it will always  find  the  shorter
       match first.

PARTIAL MATCHING USING pcre2_dfa_match()


       The  DFA  functions  move  along  the  subject  string  character  by  character,  without
       backtracking, searching for all possible matches simultaneously. If the end of the subject
       is  reached  before  the  end of the pattern, there is the possibility of a partial match,
       again provided that at least one character has been inspected.

       When PCRE2_PARTIAL_SOFT is set, PCRE2_ERROR_PARTIAL is returned only if there have been no
       complete   matches.   Otherwise,   the   complete   matches  are  returned.   However,  if
       PCRE2_PARTIAL_HARD is set, a partial match takes precedence over any complete matches. The
       portion  of the string that was matched when the longest partial match was found is set as
       the first matching string.

       Because the DFA functions always  search  for  all  possible  matches,  and  there  is  no
       difference  between  greedy and ungreedy repetition, their behaviour is different from the
       standard functions when PCRE2_PARTIAL_HARD is  set.  Consider  the  string  "dog"  matched
       against the ungreedy pattern shown above:

         /dog(sbody)??/

       Whereas  the standard function stops as soon as it finds the complete match for "dog", the
       DFA function also finds the partial  match  for  "dogsbody",  and  so  returns  that  when
       PCRE2_PARTIAL_HARD is set.

PARTIAL MATCHING AND WORD BOUNDARIES


       If a pattern ends with one of the sequences \b or \B, which test for word boundaries,
       partial matching with PCRE2_PARTIAL_SOFT can give counter-intuitive results. Consider
       this pattern:

         /\bcat\b/

       This matches "cat", provided there is a word boundary at either end. If the subject string
       is "the cat", the comparison of the final "t"  with  a  following  character  cannot  take
       place, so a partial match is found. However, normal matching carries on, and \b matches at
       the end of the subject when the last character is a letter, so a complete match is  found.
       The  result,  therefore, is not PCRE2_ERROR_PARTIAL. Using PCRE2_PARTIAL_HARD in this case
       does yield PCRE2_ERROR_PARTIAL, because then the partial match takes precedence.

EXAMPLE OF PARTIAL MATCHING USING PCRE2TEST


       If  the  partial_soft  (or  ps)  modifier  is  present  on  a  pcre2test  data  line,  the
       PCRE2_PARTIAL_SOFT option is used for the match.  Here is a run of pcre2test that uses the
       date example quoted above:

           re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
         data> 25jun04\=ps
          0: 25jun04
          1: jun
         data> 23dec3\=ps
         Partial match: 23dec3
         data> 3ju\=ps
         Partial match: 3ju
         data> 3juj\=ps
         No match
         data> j\=ps
         No match

       The first data string is matched completely, so pcre2test shows  the  matched  substrings.
       The  remaining  four  strings  do  not  match  the complete pattern, but the first two are
       partial matches. Similar output is obtained if DFA matching is used.

       If  the  partial_hard  (or  ph)  modifier  is  present  on  a  pcre2test  data  line,  the
       PCRE2_PARTIAL_HARD option is set for the match.

MULTI-SEGMENT MATCHING WITH pcre2_dfa_match()


       When  a  partial  match  has  been  found using a DFA matching function, it is possible to
       continue the match by providing additional subject data and  calling  the  function  again
       with the same compiled regular expression, this time setting the PCRE2_DFA_RESTART option.
       You must pass the same working space as before, because  this  is  where  details  of  the
       previous partial match are stored. Here is an example using pcre2test:

           re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
         data> 23ja\=dfa,ps
         Partial match: 23ja
         data> n05\=dfa,dfa_restart
          0: n05

       The  first  call has "23ja" as the subject, and requests partial matching; the second call
       has "n05" as the subject for the continued (restarted) match.  Notice that when the  match
       is  complete, only the last part is shown; PCRE2 does not retain the previously partially-
       matched string. It is up to the calling program to do that if it needs to.

       That means that, for an unanchored pattern, if a continued match fails, it is not possible
       to  try again at a new starting point. All this facility is capable of doing is continuing
       with the previous match attempt. In the previous example, if the second  set  of  data  is
       "ug23"  the  result  is  no  match,  even though there would be a match for "aug23" if the
       entire string were given at once. Depending on the application, this may  or  may  not  be
       what  you  want.   The  only  way  to allow for starting again at the next character is to
       retain the matched part of the subject and try a new complete match.

       You can set the PCRE2_PARTIAL_SOFT or PCRE2_PARTIAL_HARD options with PCRE2_DFA_RESTART to
       continue  partial  matching over multiple segments. This facility can be used to pass very
       long subject strings to the DFA matching functions.

MULTI-SEGMENT MATCHING WITH pcre2_match()


       With pcre2_match(), unlike the DFA function, it is not possible to restart the previous
       match with a new segment of data. Instead, new data must be added to the previous
       subject string, and the entire match re-run, starting from the point where the partial
       match occurred. Earlier data can be discarded.

       It  is best to use PCRE2_PARTIAL_HARD in this situation, because it does not treat the end
       of a segment as the end of the subject when matching \z, \Z, \b, \B, and  $.  Consider  an
       unanchored pattern that matches dates:

           re> /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/
         data> The date is 23ja\=ph
         Partial match: 23ja

       At  this  stage,  an application could discard the text preceding "23ja", add on text from
       the next segment, and call the matching function again. Unlike the DFA matching  function,
       the  entire  matching  string  must always be available, and the complete matching process
       occurs for each call, so more memory and more processing time are needed.

ISSUES WITH MULTI-SEGMENT MATCHING


       Certain types of pattern may give problems with multi-segment matching, whichever matching
       function is used.

       1. If the pattern contains a test for the beginning of a line, you need to pass the
       PCRE2_NOTBOL option when the subject string for any call does not start at the
       beginning of a line. There is also a PCRE2_NOTEOL option, but in practice when doing
       multi-segment matching you should be using PCRE2_PARTIAL_HARD, which includes the
       effect of PCRE2_NOTEOL.

       2.  If a pattern contains a lookbehind assertion, characters that precede the start of the
       partial  match  may  have  been  inspected  during  the  matching  process.   When   using
       pcre2_match(),  sufficient characters must be retained for the next match attempt. You can
       ensure that enough characters are retained by doing the following:

       Before doing any matching, find the length of the longest lookbehind  in  the  pattern  by
       calling  pcre2_pattern_info()  with  the  PCRE2_INFO_MAXLOOKBEHIND  option.  Note that the
       resulting count is in characters, not code units. After a partial match, moving back  from
       the  ovector[0]  offset  in  the subject by the number of characters given for the maximum
       lookbehind gets you to the earliest character that must be retained. In  a  non-UTF  or  a
       32-bit  situation,  moving  back is just a subtraction, but in UTF-8 or UTF-16 you have to
       count characters while moving back through the code units.

       Characters before the point you have now reached can be  discarded,  and  after  the  next
       segment  has  been  added  to  what  is  retained,  you should run the next match with the
       startoffset argument set so that the match begins at the same point as before.

       For example, if  the  pattern  "(?<=123)abc"  is  partially  matched  against  the  string
       "xx123ab",  the  ovector offsets are 5 and 7 ("ab"). The maximum lookbehind count is 3, so
       all characters before offset 2 can be discarded. The value of  startoffset  for  the  next
       match  should  be  3. When pcre2test displays a partial match, it indicates the lookbehind
       characters with '<' characters:

           re> "(?<=123)abc"
         data> xx123ab\=ph
         Partial match: 123ab
                        <<<
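
       The "moving back" step for a UTF-8 subject can be sketched in plain C as below.
       retain_from_utf8 is a hypothetical helper, not a PCRE2 function; it assumes the subject
       is valid UTF-8 and that max_lookbehind holds the value obtained from
       pcre2_pattern_info() with PCRE2_INFO_MAXLOOKBEHIND. For a non-UTF subject the result is
       simply the match start minus the lookbehind count, limited to zero.

```c
#include <stddef.h>

/* Starting at byte offset match_start (ovector[0] of the partial match), move
   back max_lookbehind characters in a UTF-8 subject and return the byte
   offset of the earliest character that must be retained. Continuation bytes
   (binary 10xxxxxx) are skipped so that whole characters are counted. */
static size_t retain_from_utf8(const unsigned char *subject,
                               size_t match_start, size_t max_lookbehind)
{
    size_t pos = match_start;
    while (max_lookbehind > 0 && pos > 0) {
        pos--;                                        /* step into previous char */
        while (pos > 0 && (subject[pos] & 0xC0) == 0x80)
            pos--;                                    /* back to its first byte */
        max_lookbehind--;
    }
    return pos;
}
```

       If keep is the returned offset, the bytes before keep can be discarded, and the
       startoffset for the next call is match_start - keep. For the example above this gives
       an offset of 2 and a startoffset of 3.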

       3. Because a partial match must always contain at  least  one  character,  what  might  be
       considered  a  partial  match  of  an empty string actually gives a "no match" result. For
       example:

           re> /c(?<=abc)x/
         data> ab\=ps
         No match

       If the next segment begins "cx", a match should be found, but this  will  only  happen  if
       characters  from  the  previous segment are retained. For this reason, a "no match" result
       should be interpreted as "partial match of an empty  string"  when  the  pattern  contains
       lookbehinds.

       4.  Matching  a subject string that is split into multiple segments may not always produce
       exactly the same  result  as  matching  over  one  single  long  string,  especially  when
       PCRE2_PARTIAL_SOFT  is  used.  The  section  "Partial  Matching and Word Boundaries" above
       describes an issue that arises if the  pattern  ends  with  \b  or  \B.  Another  kind  of
       difference  may  occur  when  there  are  multiple  matching  possibilities,  because (for
       PCRE2_PARTIAL_SOFT) a partial match result is given  only  when  there  are  no  completed
       matches.  This  means that as soon as the shortest match has been found, continuation to a
       new subject segment is no longer possible. Consider this pcre2test example:

           re> /dog(sbody)?/
         data> dogsb\=ps
          0: dog
         data> do\=ps,dfa
         Partial match: do
         data> gsb\=ps,dfa,dfa_restart
          0: g
         data> dogsbody\=dfa
          0: dogsbody
          1: dog

       The first data line passes the string "dogsb" to a standard matching function, setting the
       PCRE2_PARTIAL_SOFT  option.  Although  the  string  is a partial match for "dogsbody", the
       result is not PCRE2_ERROR_PARTIAL, because the shorter string "dog" is a  complete  match.
       Similarly, when the subject is presented to a DFA matching function in several parts ("do"
       and "gsb" being the first two) the match stops when "dog" has been found, and  it  is  not
       possible to continue.  On the other hand, if "dogsbody" is presented as a single string, a
       DFA matching function finds both matches.

       Because of these problems, it is best  to  use  PCRE2_PARTIAL_HARD  when  matching  multi-
       segment data. The example above then behaves differently:

           re> /dog(sbody)?/
         data> dogsb\=ph
         Partial match: dogsb
         data> do\=ps,dfa
         Partial match: do
         data> gsb\=ph,dfa,dfa_restart
         Partial match: gsb

       5.  Patterns  that  contain  alternatives at the top level which do not all start with the
       same pattern item may not work as expected when PCRE2_DFA_RESTART is  used.  For  example,
       consider this pattern:

         1234|3789

       If  the first part of the subject is "ABC123", a partial match of the first alternative is
       found at offset 3. There is no partial match for the second alternative,  because  such  a
       match  does not start at the same point in the subject string. Attempting to continue with
       the string "7890" does not yield a match because only those alternatives that match at one
       point  in  the  subject are remembered. The problem arises because the start of the second
       alternative matches within the first  alternative.  There  is  no  problem  with  anchored
       patterns or patterns such as:

         1234|ABCD

       where  no  string can be a partial match for both alternatives. This is not a problem if a
       standard matching function is used, because the entire match has to be rerun each time:

           re> /1234|3789/
         data> ABC123\=ph
         Partial match: 123
         data> 1237890
          0: 3789

       Of course, instead of using PCRE2_DFA_RESTART, the same technique of re-running the entire
       match can also be used with the DFA matching function. Another possibility is to work with
       two buffers. If a partial match at offset n in the first buffer is followed by "no  match"
       when PCRE2_DFA_RESTART is used on the second buffer, you can then try a new match starting
       at offset n+1 in the first buffer.

AUTHOR


       Philip Hazel
       University Computing Service
       Cambridge, England.

REVISION


       Last updated: 22 December 2014
       Copyright (c) 1997-2014 University of Cambridge.