NAME

       PCRE2 - Perl-compatible regular expressions

PARTIAL MATCHING IN PCRE2

In normal use of PCRE2, if the subject string that is passed to a matching function
matches as far as it goes, but is too short to match the entire pattern,
PCRE2_ERROR_NOMATCH is returned. There are circumstances where it might be helpful to
distinguish this case from other cases in which there is no match.

Consider, for example, an application where a human is required to type in data for a
field with specific formatting requirements. An example might be a date in the form
ddmmmyy, defined by this pattern:

  ^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$

If the application sees the user's keystrokes one by one, and can check that what has been
typed so far is potentially valid, it is able to raise an error as soon as a mistake is
made, by beeping and not reflecting the character that has been typed, for example. This
immediate feedback is likely to be a better user interface than a check that is delayed
until the entire string has been entered. Partial matching can also be useful when the
subject string is very long and is not all available at once.

PCRE2 supports partial matching by means of the PCRE2_PARTIAL_SOFT and PCRE2_PARTIAL_HARD
options, which can be set when calling a matching function. The difference between the
two options is whether or not a partial match is preferred to an alternative complete
match, though the details differ between the two types of matching function. If both
options are set, PCRE2_PARTIAL_HARD takes precedence.

If you want to use partial matching with just-in-time optimized code, you must call
pcre2_jit_compile() with one or both of these options:

  PCRE2_JIT_PARTIAL_SOFT
  PCRE2_JIT_PARTIAL_HARD

PCRE2_JIT_COMPLETE should also be set if you are going to run non-partial matches on the
same pattern. If the appropriate JIT mode has not been compiled, interpretive matching
code is used.

Setting a partial matching option disables two of PCRE2's standard optimizations. PCRE2
remembers the last literal code unit in a pattern, and abandons matching immediately if it
is not present in the subject string. This optimization cannot be used for a subject
string that might match only partially. PCRE2 also knows the minimum length of a matching
string, and does not bother to run the matching function on shorter strings. This
optimization is also disabled for partial matching.

PARTIAL MATCHING USING pcre2_match()


       A  partial  match occurs during a call to pcre2_match() when the end of the subject string
       is reached successfully, but matching cannot continue because more characters are  needed.
       However,  at  least  one character in the subject must have been inspected. This character
       need not form part of the final matched string; lookbehind assertions and  the  \K  escape
       sequence  provide  ways of inspecting characters before the start of a matched string. The
       requirement for inspecting at least one character  exists  because  an  empty  string  can
       always  be matched; without such a restriction there would always be a partial match of an
       empty string at the end of the subject.

       When a partial match is returned, the first two elements  in  the  ovector  point  to  the
       portion  of  the  subject  that was matched, but the values in the rest of the ovector are
       undefined. The appearance of \K in the pattern has no effect for a partial match. Consider
       this pattern:

         /abc\K123/

       If  it  is  matched against "456abc123xyz" the result is a complete match, and the ovector
       defines the matched string as "123",  because  \K  resets  the  "start  of  match"  point.
       However,  if  a partial match is requested and the subject string is "456abc12", a partial
       match is found for the string "abc12", because all  these  characters  are  needed  for  a
       subsequent re-match with additional characters.

       What  happens  when  a  partial  match  is  identified depends on which of the two partial
       matching options is set.

   PCRE2_PARTIAL_SOFT WITH pcre2_match()

       If PCRE2_PARTIAL_SOFT is set when pcre2_match() identifies a partial  match,  the  partial
       match  is  remembered,  but  matching  continues  as normal, and other alternatives in the
       pattern are tried. If no complete match can  be  found,  PCRE2_ERROR_PARTIAL  is  returned
       instead of PCRE2_ERROR_NOMATCH.

       This  option  is "soft" because it prefers a complete match over a partial match.  All the
       various matching items in a pattern  behave  as  if  the  subject  string  is  potentially
       complete.  For  example, \z, \Z, and $ match at the end of the subject, as normal, and for
       \b and \B the end of the subject is treated as a non-alphanumeric.

       If there is more than one partial match, the first one that was found  provides  the  data
       that is returned. Consider this pattern:

         /123\w+X|dogY/

       If  this  is  matched  against  the  subject string "abc123dog", both alternatives fail to
       match, but the end of the subject is reached during matching,  so  PCRE2_ERROR_PARTIAL  is
       returned.  The offsets are set to 3 and 9, identifying "123dog" as the first partial match
       that was found. (In this example, there are two partial matches, because "dog" on its  own
       partially matches the second alternative.)

   PCRE2_PARTIAL_HARD WITH pcre2_match()

       If PCRE2_PARTIAL_HARD is set for pcre2_match(), PCRE2_ERROR_PARTIAL is returned as soon as
       a partial match is found, without continuing to search for possible complete matches. This
       option  is "hard" because it prefers an earlier partial match over a later complete match.
       For this reason, the assumption is made that the end of the supplied  subject  string  may
       not be the true end of the available data, and so, if \z, \Z, \b, \B, or $ are encountered
       at the end of the subject, the result is PCRE2_ERROR_PARTIAL, provided that at  least  one
       character in the subject has been inspected.

   Comparing hard and soft partial matching

       The  difference  between  the two partial matching options can be illustrated by a pattern
       such as:

         /dog(sbody)?/

       This matches either "dog" or "dogsbody", greedily (that is, it prefers the  longer  string
       if possible). If it is matched against the string "dog" with PCRE2_PARTIAL_SOFT, it yields
       a complete match  for  "dog".  However,  if  PCRE2_PARTIAL_HARD  is  set,  the  result  is
       PCRE2_ERROR_PARTIAL.  On  the  other  hand,  if the pattern is made ungreedy the result is
       different:

         /dog(sbody)??/

       In this case the result is always a complete  match  because  that  is  found  first,  and
       matching never continues after finding a complete match. It might be easier to follow this
       explanation by thinking of the two patterns like this:

         /dog(sbody)?/    is the same as  /dogsbody|dog/
         /dog(sbody)??/   is the same as  /dog|dogsbody/

       The second pattern will never match "dogsbody", because it will always  find  the  shorter
       match first.

PARTIAL MATCHING USING pcre2_dfa_match()


       The  DFA  functions  move  along  the  subject  string  character  by  character,  without
       backtracking, searching for all possible matches simultaneously. If the end of the subject
       is  reached  before  the  end of the pattern, there is the possibility of a partial match,
       again provided that at least one character has been inspected.

       When PCRE2_PARTIAL_SOFT is set, PCRE2_ERROR_PARTIAL is returned only if there have been no
       complete   matches.   Otherwise,   the   complete   matches  are  returned.   However,  if
       PCRE2_PARTIAL_HARD is set, a partial match takes precedence over any complete matches. The
       portion  of the string that was matched when the longest partial match was found is set as
       the first matching string.

       Because the DFA functions always  search  for  all  possible  matches,  and  there  is  no
       difference  between  greedy and ungreedy repetition, their behaviour is different from the
       standard functions when PCRE2_PARTIAL_HARD is  set.  Consider  the  string  "dog"  matched
       against the ungreedy pattern shown above:

         /dog(sbody)??/

       Whereas  the standard function stops as soon as it finds the complete match for "dog", the
       DFA function also finds the partial  match  for  "dogsbody",  and  so  returns  that  when
       PCRE2_PARTIAL_HARD is set.

PARTIAL MATCHING AND WORD BOUNDARIES


       If a pattern ends with one of the sequences \b or \B, which test for word boundaries, partial
       matching  with  PCRE2_PARTIAL_SOFT  can  give  counter-intuitive  results.  Consider  this
       pattern:

         /\bcat\b/

       This matches "cat", provided there is a word boundary at either end. If the subject string
       is "the cat", the comparison of the final "t"  with  a  following  character  cannot  take
       place, so a partial match is found. However, normal matching carries on, and \b matches at
       the end of the subject when the last character is a letter, so a complete match is  found.
       The  result,  therefore, is not PCRE2_ERROR_PARTIAL. Using PCRE2_PARTIAL_HARD in this case
       does yield PCRE2_ERROR_PARTIAL, because then the partial match takes precedence.

EXAMPLE OF PARTIAL MATCHING USING PCRE2TEST


       If  the  partial_soft  (or  ps)  modifier  is  present  on  a  pcre2test  data  line,  the
       PCRE2_PARTIAL_SOFT option is used for the match.  Here is a run of pcre2test that uses the
       date example quoted above:

           re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
         data> 25jun04\=ps
          0: 25jun04
          1: jun
         data> 23dec3\=ps
         Partial match: 23dec3
         data> 3ju\=ps
         Partial match: 3ju
         data> 3juj\=ps
         No match
         data> j\=ps
         No match

       The first data string is matched completely, so pcre2test shows  the  matched  substrings.
       The  remaining  four  strings  do  not  match  the complete pattern, but the first two are
       partial matches. Similar output is obtained if DFA matching is used.

       If  the  partial_hard  (or  ph)  modifier  is  present  on  a  pcre2test  data  line,  the
       PCRE2_PARTIAL_HARD option is set for the match.

MULTI-SEGMENT MATCHING WITH pcre2_dfa_match()

When a partial match has been found using a DFA matching function, it is possible to
continue the match by providing additional subject data and calling the function again
with the same compiled regular expression, this time setting the PCRE2_DFA_RESTART option.
You must pass the same working space as before, because this is where details of the
previous partial match are stored. Here is an example using pcre2test:

    re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
  data> 23ja\=dfa,ps
  Partial match: 23ja
  data> n05\=dfa,dfa_restart
   0: n05

The first call has "23ja" as the subject, and requests partial matching; the second call
has "n05" as the subject for the continued (restarted) match. Notice that when the match
is complete, only the last part is shown; PCRE2 does not retain the previously partially-
matched string. It is up to the calling program to do that if it needs to.

That means that, for an unanchored pattern, if a continued match fails, it is not possible
to try again at a new starting point. All this facility is capable of doing is continuing
with the previous match attempt. In the previous example, if the second set of data is
"ug23" the result is no match, even though there would be a match for "aug23" if the
entire string were given at once. Depending on the application, this may or may not be
what you want. The only way to allow for starting again at the next character is to
retain the matched part of the subject and try a new complete match.

You can set the PCRE2_PARTIAL_SOFT or PCRE2_PARTIAL_HARD options with PCRE2_DFA_RESTART to
continue partial matching over multiple segments. This facility can be used to pass very
long subject strings to the DFA matching functions.

MULTI-SEGMENT MATCHING WITH pcre2_match()


       Unlike with the DFA function, it is not possible to restart the previous match with a new
       segment  of data when using pcre2_match(). Instead, new data must be added to the previous
       subject string, and the entire match re-run, starting from the  point  where  the  partial
       match occurred. Earlier data can be discarded.

       It  is best to use PCRE2_PARTIAL_HARD in this situation, because it does not treat the end
       of a segment as the end of the subject when matching \z, \Z, \b, \B, and  $.  Consider  an
       unanchored pattern that matches dates:

           re> /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/
         data> The date is 23ja\=ph
         Partial match: 23ja

       At  this  stage,  an application could discard the text preceding "23ja", add on text from
       the next segment, and call the matching function again. Unlike with the DFA matching
       function, the entire matching string must always be available, and the complete matching
       process occurs for each call, so more memory and more processing time are needed.

ISSUES WITH MULTI-SEGMENT MATCHING

Certain types of pattern may give problems with multi-segment matching, whichever matching
function is used.

1. If the pattern contains a test for the beginning of a line, you need to pass the
PCRE2_NOTBOL option when the subject string for any call does not start at the beginning of a
line. There is also a PCRE2_NOTEOL option, but in practice when doing multi-segment
matching you should be using PCRE2_PARTIAL_HARD, which includes the effect of
PCRE2_NOTEOL.

2. If a pattern contains a lookbehind assertion, characters that precede the start of the
partial match may have been inspected during the matching process. When using
pcre2_match(), sufficient characters must be retained for the next match attempt. You can
ensure that enough characters are retained by doing the following:

Before doing any matching, find the length of the longest lookbehind in the pattern by
calling pcre2_pattern_info() with the PCRE2_INFO_MAXLOOKBEHIND option. Note that the
resulting count is in characters, not code units. After a partial match, moving back from
the ovector[0] offset in the subject by the number of characters given for the maximum
lookbehind gets you to the earliest character that must be retained. In a non-UTF or a
32-bit situation, moving back is just a subtraction, but in UTF-8 or UTF-16 you have to
count characters while moving back through the code units.

Characters before the point you have now reached can be discarded, and after the next
segment has been added to what is retained, you should run the next match with the
startoffset argument set so that the match begins at the same point as before.

For example, if the pattern "(?<=123)abc" is partially matched against the string
"xx123ab", the ovector offsets are 5 and 7 ("ab"). The maximum lookbehind count is 3, so
all characters before offset 2 can be discarded. The value of startoffset for the next
match should be 3. When pcre2test displays a partial match, it indicates the lookbehind
characters with '<' characters:

    re> "(?<=123)abc"
  data> xx123ab\=ph
  Partial match: 123ab
                 <<<

3. Because a partial match must always contain at least one character, what might be
considered a partial match of an empty string actually gives a "no match" result. For
example:

    re> /c(?<=abc)x/
  data> ab\=ps
  No match

If the next segment begins "cx", a match should be found, but this will only happen if
characters from the previous segment are retained. For this reason, a "no match" result
should be interpreted as "partial match of an empty string" when the pattern contains
lookbehinds.

4. Matching a subject string that is split into multiple segments may not always produce
exactly the same result as matching over one single long string, especially when
PCRE2_PARTIAL_SOFT is used. The section "Partial Matching and Word Boundaries" above
describes an issue that arises if the pattern ends with \b or \B. Another kind of
difference may occur when there are multiple matching possibilities, because (for
PCRE2_PARTIAL_SOFT) a partial match result is given only when there are no completed
matches. This means that as soon as the shortest match has been found, continuation to a
new subject segment is no longer possible. Consider this pcre2test example:

    re> /dog(sbody)?/
  data> dogsb\=ps
   0: dog
  data> do\=ps,dfa
  Partial match: do
  data> gsb\=ps,dfa,dfa_restart
   0: g
  data> dogsbody\=dfa
   0: dogsbody
   1: dog

The first data line passes the string "dogsb" to a standard matching function, setting the
PCRE2_PARTIAL_SOFT option. Although the string is a partial match for "dogsbody", the
result is not PCRE2_ERROR_PARTIAL, because the shorter string "dog" is a complete match.
Similarly, when the subject is presented to a DFA matching function in several parts ("do"
and "gsb" being the first two) the match stops when "dog" has been found, and it is not
possible to continue. On the other hand, if "dogsbody" is presented as a single string, a
DFA matching function finds both matches.

Because of these problems, it is best to use PCRE2_PARTIAL_HARD when matching multi-
segment data. The example above then behaves differently:

    re> /dog(sbody)?/
  data> dogsb\=ph
  Partial match: dogsb
  data> do\=ps,dfa
  Partial match: do
  data> gsb\=ph,dfa,dfa_restart
  Partial match: gsb

5. Patterns that contain alternatives at the top level which do not all start with the
same pattern item may not work as expected when PCRE2_DFA_RESTART is used. For example,
consider this pattern:

  1234|3789

If the first part of the subject is "ABC123", a partial match of the first alternative is
found at offset 3. There is no partial match for the second alternative, because such a
match does not start at the same point in the subject string. Attempting to continue with
the string "7890" does not yield a match because only those alternatives that match at one
point in the subject are remembered. The problem arises because the start of the second
alternative matches within the first alternative. There is no problem with anchored
patterns or patterns such as:

  1234|ABCD

where no string can be a partial match for both alternatives. This is not a problem if a
standard matching function is used, because the entire match has to be rerun each time:

    re> /1234|3789/
  data> ABC123\=ph
  Partial match: 123
  data> 1237890
   0: 3789

Of course, instead of using PCRE2_DFA_RESTART, the same technique of re-running the entire
match can also be used with the DFA matching function. Another possibility is to work with
two buffers. If a partial match at offset n in the first buffer is followed by "no match"
when PCRE2_DFA_RESTART is used on the second buffer, you can then try a new match starting
at offset n+1 in the first buffer.
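
For issue 2 above, the step of moving back by a number of characters in a UTF-8 subject
can be sketched in plain C. The helper name retain_from_utf8 is an assumption made for
illustration. In a real application the character count would come from
pcre2_pattern_info() with PCRE2_INFO_MAXLOOKBEHIND and the starting offset from
ovector[0]; the subject is assumed to be valid UTF-8.

```c
#include <stddef.h>

/* Hypothetical helper: starting at byte offset `start` in a UTF-8 subject,
   step back over at most `nchars` characters and return the byte offset of
   the earliest character that must be retained. */
size_t retain_from_utf8(const unsigned char *subject, size_t start,
                        size_t nchars)
{
    size_t pos = start;

    while (nchars > 0 && pos > 0) {
        pos--;
        /* Continuation bytes have the form 10xxxxxx; skip back over them
           to reach the first byte of the character. */
        while (pos > 0 && (subject[pos] & 0xC0) == 0x80)
            pos--;
        nchars--;
    }
    return pos;
}
```

For the subject "xx123ab" with a partial match at offset 5 and a maximum lookbehind of 3,
this gives offset 2, so after the first two characters are discarded the next match would
use a startoffset of 3, as described above.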

AUTHOR


       Philip Hazel
       University Computing Service
       Cambridge, England.

REVISION


       Last updated: 22 December 2014
       Copyright (c) 1997-2014 University of Cambridge.