
PCRE - Perl-compatible regular expressions
The syntax and semantics of the regular expressions that are supported
by PCRE are described in detail below. There is a quick-reference
syntax summary in the pcresyntax page. Perl’s regular expressions are
described in its own documentation, and regular expressions in general
are covered in a number of books, some of which have copious examples.
Jeffrey Friedl’s "Mastering Regular Expressions", published by
O’Reilly, covers regular expressions in great detail. This description
of PCRE’s regular expressions is intended as reference material.
The original operation of PCRE was on strings of one-byte characters.
However, there is now also support for UTF-8 character strings. To use
this, you must build PCRE to include UTF-8 support, and then call
pcre_compile() with the PCRE_UTF8 option. How this affects pattern
matching is mentioned in several places below. There is also a summary
of UTF-8 features in the section on UTF-8 support in the main pcre
page.
The remainder of this document discusses the patterns that are
supported by PCRE when its main matching function, pcre_exec(), is
used. From release 6.0, PCRE offers a second matching function,
pcre_dfa_exec(), which matches using a different algorithm that is not
Perl-compatible. Some of the features discussed below are not available
when pcre_dfa_exec() is used. The advantages and disadvantages of the
alternative function, and how it differs from the normal function, are
discussed in the pcrematching page.
PCRE supports five different conventions for indicating line breaks in
strings: a single CR (carriage return) character, a single LF
(linefeed) character, the two-character sequence CRLF, any of the three
preceding, or any Unicode newline sequence. The pcreapi page has
further discussion about newlines, and shows how to set the newline
convention in the options arguments for the compiling and matching
functions.
It is also possible to specify a newline convention by starting a
pattern string with one of the following five sequences:
  (*CR)        carriage return
  (*LF)        linefeed
  (*CRLF)      carriage return, followed by linefeed
  (*ANYCRLF)   any of the three above
  (*ANY)       all Unicode newline sequences
These override the default and the options given to pcre_compile(). For
example, on a Unix system where LF is the default newline sequence, the
pattern
  (*CR)a.b
changes the convention to CR. That pattern matches "a\nb" because LF is
no longer a newline. Note that these special settings, which are not
Perl-compatible, are recognized only at the very start of a pattern,
and that they must be in upper case. If more than one of them is
present, the last one is used.
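The rules for these leading sequences (start of pattern only, upper case
only, last one wins) can be sketched as follows. This is an illustration
in Python, not PCRE's own parsing code: the helper name
split_newline_prefix and its default argument are hypothetical, and the
sketch handles only the five newline sequences listed above.

```python
import re

# Matches exactly one of the five upper-case newline settings.
_NEWLINE_SETTING = re.compile(r"\(\*(CR|LF|CRLF|ANYCRLF|ANY)\)")


def split_newline_prefix(pattern, default="LF"):
    """Return (convention, rest_of_pattern) for a pattern string.

    Hypothetical helper: consumes newline settings only at the very
    start of the pattern; if several are present, the last one is used.
    """
    convention = default
    pos = 0
    while True:
        m = _NEWLINE_SETTING.match(pattern, pos)
        if m is None:
            break
        convention = m.group(1)
        pos = m.end()
    return convention, pattern[pos:]


first = split_newline_prefix("(*CR)a.b")
last_wins = split_newline_prefix("(*CR)(*LF)a.b")
lower_case = split_newline_prefix("(*cr)a.b")
not_at_start = split_newline_prefix("a(*CR)b")
```

In this sketch a lower-case or misplaced sequence is simply left in the
pattern text and the default convention is kept.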
The newline convention does not affect what the \R escape sequence
matches. By default, this is any Unicode newline sequence, for Perl
compatibility. However, this can be changed; see the description of \R
in the section entitled "Newline sequences" below. A change of \R
setting can be combined with a change of newline convention.
A regular expression is a pattern that is matched against a subject
string from left to right. Most characters stand for themselves in a
pattern, and match the corresponding characters in the subject. As a
trivial example, the pattern
  The quick brown fox
matches a portion of a subject string that is identical to itself. When
caseless matching is specified (the PCRE_CASELESS option), letters are
matched independently of case. In UTF-8 mode, PCRE always understands
the concept of case for characters whose values are less than 128, so
caseless matching is always possible. For characters with higher
values, the concept of case is supported if PCRE is compiled with
Unicode property support, but not otherwise. If you want to use
caseless matching for characters 128 and above, you must ensure that
PCRE is compiled with Unicode property support as well as with UTF-8
support.
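As a small illustration of literal and caseless matching, the following
sketch uses Python's standard re module as a stand-in. Python's re is
not PCRE; it is assumed here only because it behaves the same way for a
plain literal pattern, with re.IGNORECASE playing the role that
PCRE_CASELESS plays for pcre_compile().

```python
import re

pattern = "The quick brown fox"
subject = "the QUICK brown fox jumps over the lazy dog"

# Without a caseless option, letters must match exactly.
case_sensitive = re.search(pattern, subject)

# With the caseless option, letters match independently of case.
caseless = re.search(pattern, subject, re.IGNORECASE)

print(case_sensitive)        # None
print(caseless.group(0))     # the QUICK brown fox
```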
The power of regular expressions comes from the ability to include
alternatives and repetitions in the pattern. These are encoded in the
pattern by the use of metacharacters, which do not stand for themselves
but instead are interpreted in some special way.
There are two different sets of metacharacters: those that are
recognized anywhere in the pattern except within square brackets, and
those that are recognized within square brackets. Outside square
brackets, the metacharacters are as follows:
  \      general escape character with several uses
  ^      assert start of string (or line, in multiline mode)
  $      assert end of string (or line, in multiline mode)
  .      match any character except newline (by default)
  [      start character class definition
  |      start of alternative branch
  (      start subpattern
  )      end subpattern
  ?      extends the meaning of (
         also 0 or 1 quantifier
         also quantifier minimizer
  *      0 or more quantifier
  +      1 or more quantifier
         also "possessive quantifier"
  {      start min/max quantifier
Part of a pattern that is in square brackets is called a "character
class". In a character class the only metacharacters are:
  \      general escape character
  ^      negate the class, but only if the first character
  -      indicates character range
  [      POSIX character class (only if followed by POSIX
         syntax)
  ]      terminates the character class
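The difference between the two sets of metacharacters can be seen in a
short sketch. It uses Python's standard re module as a stand-in for
PCRE (an assumption of this example; the two engines agree on the basic
class syntax shown here, but not on everything described in this
document).

```python
import re

# Outside a class, "." is a metacharacter and matches any character.
dot_outside = re.fullmatch(r"a.c", "abc")

# Inside a class, "." is an ordinary character.
dot_inside_abc = re.fullmatch(r"a[.]c", "abc")
dot_inside_literal = re.fullmatch(r"a[.]c", "a.c")

# "^" negates the class only when it is the first character.
negated = re.findall(r"[^0-9]", "a1b2")
caret_literal = re.findall(r"[0-9^]", "a^1")

# "-" indicates a range; a backslash escapes "]" inside the class.
in_range = re.findall(r"[a-c]", "abcd")
escaped_bracket = re.findall(r"[\]x]", "x]y")
```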
The following sections describe the use of each of the metacharacters.
The backslash character has several uses. Firstly, if it is followed by
a non-alphanumeric character, it takes away any special meaning that
character may have. This use of backslash as an escape character
applies both inside and outside character classes.
For example, if you want to match a * character, you write \* in the
pattern. This escaping action applies whether or not the following
character would otherwise be interpreted as a metacharacter, so it is
always safe to precede a non-alphanumeric with backslash to specify
that it stands for itself. In particular, if you want to match a
backslash, you write \\.
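The escaping rule can be tried out with a brief sketch. Python's
standard re module is used here as a stand-in for PCRE (an assumption
of this example); it follows the same rule for a backslash before a
non-alphanumeric character. Raw string literals keep Python itself from
interpreting the backslashes.

```python
import re

# \* matches a literal asterisk rather than acting as a quantifier.
star = re.search(r"\*", "2 * 3")

# Escaping a non-alphanumeric with no special meaning is harmless.
at_sign = re.search(r"\@", "user@host")

# Two backslashes in the pattern match one backslash in the subject.
backslash = re.search(r"C:\\temp", r"C:\temp")

print(star.group(0))        # *
print(at_sign.group(0))     # @
print(backslash.group(0))   # C:\temp
```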
If a pattern is compiled with the PCRE_EXTENDED option, whitespace in
the pattern (other than in a character class) and characters between a
# outside a character class and the next newline are ignored. An
escaping backslash can be used to include a whitespace or # character
as part of the pattern.
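The effect of extended mode can be sketched with Python's standard re
module, whose re.VERBOSE flag is assumed here as the nearest analogue
of PCRE_EXTENDED. The pattern itself is made up for illustration.

```python
import re

pattern = re.compile(
    r"""
    \d{4}    # four digits; this comment and the spaces are ignored
    -        # a literal hyphen
    \d{2}    # two digits
    \ end    # an escaped space is part of the pattern
    [ #]     # inside a class, space and hash stand for themselves
    """,
    re.VERBOSE,
)

with_hash = pattern.fullmatch("2024-05 end#")
with_space = pattern.fullmatch("2024-05 end ")
missing_space = pattern.fullmatch("2024-05end#")
```

The first two subjects match; the third does not, because the escaped
space before "end" is required.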
If you want to remove the special meaning from a sequence of
characters, you can do so by putting them between \Q and \E. This is
different from Perl in that $ and @ are handled as literals in \Q...\E
sequences in PCRE, whereas in Perl, $ and @ cause variable
interpolation. Note the following examples:
  Pattern            PCRE matches   Perl matches

  \Qabc$xyz\E        abc$xyz        abc followed by the
                                      contents of $xyz
  \Qabc\$xyz\E       abc\$xyz       abc\$xyz
  \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz
The \Q...\E sequence is recognized both inside and outside character
classes.
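The PCRE column of the table can be reproduced with a rough emulation.
Python's standard re module does not recognize \Q...\E, so this sketch
substitutes re.escape() for each quoted run; the helper name
expand_quoting is hypothetical. It is a simplification: it does not
notice a \Q that is itself preceded by an escaping backslash.

```python
import re


def expand_quoting(pattern):
    """Rewrite \\Q...\\E runs as escaped literals (hypothetical helper)."""
    out = []
    pos = 0
    while True:
        start = pattern.find(r"\Q", pos)
        if start == -1:
            out.append(pattern[pos:])
            break
        out.append(pattern[pos:start])
        end = pattern.find(r"\E", start + 2)
        if end == -1:
            # No closing \E: quote to the end of the pattern.
            out.append(re.escape(pattern[start + 2:]))
            break
        out.append(re.escape(pattern[start + 2:end]))
        pos = end + 2
    return "".join(out)


row1 = re.fullmatch(expand_quoting(r"\Qabc$xyz\E"), "abc$xyz")
row2 = re.fullmatch(expand_quoting(r"\Qabc\$xyz\E"), r"abc\$xyz")
row3 = re.fullmatch(expand_quoting(r"\Qabc\E\$\Qxyz\E"), "abc$xyz")

# Without quoting, "$" asserts end of string and the match fails.
unquoted = re.search(r"abc$xyz", "abc$xyz")
```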
Non-printing characters
A second use of backslash provides a way of encoding non-printing
characters in patterns in a visible manner. There is no restriction on
the appearance of non-printing characters, apart from the binary zero
that terminates a pattern, but when a pattern is being prepared by text
editing, it is usually easier to use one of the following escape
sequences than the binary character it represents:
  \a        alarm, that is, the BEL character (hex 07)