mirror of
https://github.com/PCRE2Project/pcre2.git
synced 2025-10-22 07:31:15 +08:00
Updates to the README and some documentation (#681)
@@ -50,7 +50,7 @@ please consult the man page, in case the conversion went wrong.
 <li><a name="TOC35" href="#SEC35">REVISION</a>
 </ul>
 <br><a name="SEC1" href="#TOC1">PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY</a><br>
-<P>
+<p>
 The full syntax and semantics of the regular expression patterns that are
 supported by PCRE2 are described in the
 <a href="pcre2pattern.html"><b>pcre2pattern</b></a>
@@ -59,9 +59,9 @@ syntax followed by the syntax of replacement strings in substitution function.
 The full description of the latter is in the
 <a href="pcre2api.html"><b>pcre2api</b></a>
 documentation.
-</P>
+</p>
 <br><a name="SEC2" href="#TOC1">QUOTING</a><br>
-<P>
+<p>
 <pre>
 \x where x is non-alphanumeric is a literal x
 \Q...\E treat enclosed characters as literal
@@ -71,9 +71,9 @@ PCRE2_EXTENDED is set, causing most other white space to be ignored. Note also
 that PCRE2's handling of \Q...\E has some differences from Perl's. See the
 <a href="pcre2pattern.html"><b>pcre2pattern</b></a>
 documentation for details.
-</P>
+</p>
 <br><a name="SEC3" href="#TOC1">BRACED ITEMS</a><br>
-<P>
+<p>
 With one exception, wherever brace characters { and } are required to enclose
 data for constructions such as \g{2} or \k{name}, space and/or horizontal tab
 characters that follow { or precede } are allowed and are ignored. In the case
@@ -81,9 +81,9 @@ of quantifiers, they may also appear before or after the comma. The exception
 is \u{...} which is not Perl-compatible and is recognized only when
 PCRE2_EXTRA_ALT_BSUX is set. This is an ECMAScript compatibility feature, and
 follows ECMAScript's behaviour.
-</P>
+</p>
 <br><a name="SEC4" href="#TOC1">ESCAPED CHARACTERS</a><br>
-<P>
+<p>
 This table applies to ASCII and Unicode environments. An unrecognized escape
 sequence causes an error.
 <pre>
@@ -104,8 +104,8 @@ sequence causes an error.
 \N{U+hh..} is synonymous with \x{hh..} but is not supported in environments
 that use EBCDIC code (mainly IBM mainframes). Note that \N not followed by an
 opening curly bracket has a different meaning (see below).
-</P>
-<P>
+</p>
+<p>
 If PCRE2_ALT_BSUX or PCRE2_EXTRA_ALT_BSUX is set ("ALT_BSUX mode"), the
 following are also recognized:
 <pre>
@@ -119,8 +119,8 @@ recognized as a hexadecimal escape; otherwise it matches a literal "x".
 Likewise, if \u (in ALT_BSUX mode) is not followed by four hexadecimal digits
 or (in EXTRA_ALT_BSUX mode) a sequence of hex digits in curly brackets, it
 matches a literal "u".
-</P>
-<P>
+</p>
+<p>
 Note that \0dd is always an octal code. The treatment of backslash followed by
 a non-zero digit is complicated; for details see the section
 <a href="pcre2pattern.html#digitsafterbackslash">"Non-printing characters"</a>
@@ -128,9 +128,9 @@ in the
 <a href="pcre2pattern.html"><b>pcre2pattern</b></a>
 documentation, where details of escape processing in EBCDIC environments are
 also given.
-</P>
+</p>
 <br><a name="SEC5" href="#TOC1">CHARACTER TYPES</a><br>
-<P>
+<p>
 <pre>
 . any character except newline;
 in dotall mode, any character whatsoever
@@ -155,8 +155,8 @@ also given.
 of a UTF-8 or UTF-16 character. The application can lock out the use of \C by
 setting the PCRE2_NEVER_BACKSLASH_C option. It is also possible to build PCRE2
 with the use of \C permanently disabled.
-</P>
-<P>
+</p>
+<p>
 By default, \d, \s, and \w match only ASCII characters, even in UTF-8 mode
 or in the 16-bit and 32-bit libraries. However, if locale-specific matching is
 happening, \s and \w may also match characters with code points in the range
@@ -164,15 +164,15 @@ happening, \s and \w may also match characters with code points in the range
 sequences is changed to use Unicode properties and they match many more
 characters, but there are some option settings that can restrict individual
 sequences to matching only ASCII characters.
-</P>
-<P>
+</p>
+<p>
 Property descriptions in \p and \P are matched caselessly; hyphens,
 underscores, and ASCII white space characters are ignored, in accordance with
 Unicode's "loose matching" rules. For example, \p{Bidi_Class=al} is the same
 as \p{ bidi class = AL }.
-</P>
+</p>
 <br><a name="SEC6" href="#TOC1">GENERAL CATEGORY PROPERTIES FOR \p and \P</a><br>
-<P>
+<p>
 <pre>
 C Other
 Cc Control
@@ -222,9 +222,9 @@ as \p{ bidi class = AL }.
 </pre>
 From release 10.45, when caseless matching is set, Ll, Lu, and Lt are all
 equivalent to Lc.
-</P>
+</p>
 <br><a name="SEC7" href="#TOC1">PCRE2 SPECIAL CATEGORY PROPERTIES FOR \p and \P</a><br>
-<P>
+<p>
 <pre>
 Xan Alphanumeric: union of properties L and N
 Xps POSIX space: property Z or tab, NL, VT, FF, CR
@@ -235,27 +235,27 @@ equivalent to Lc.
 </pre>
 Perl and POSIX space are now the same. Perl added VT to its space character set
 at release 5.18.
-</P>
+</p>
 <br><a name="SEC8" href="#TOC1">BINARY PROPERTIES FOR \p AND \P</a><br>
-<P>
+<p>
 Unicode defines a number of binary properties, that is, properties whose only
 values are true or false. You can obtain a list of those that are recognized by
 \p and \P, along with their abbreviations, by running this command:
 <pre>
 pcre2test -LP
-</PRE>
-</P>
+</pre>
+</p>
 <br><a name="SEC9" href="#TOC1">SCRIPT MATCHING WITH \p AND \P</a><br>
-<P>
+<p>
 Many script names and their 4-letter abbreviations are recognized in
 \p{sc:...} or \p{scx:...} items, or on their own with \p (and also \P of
 course). You can obtain a list of these scripts by running this command:
 <pre>
 pcre2test -LS
-</PRE>
-</P>
+</pre>
+</p>
 <br><a name="SEC10" href="#TOC1">THE BIDI_CLASS PROPERTY FOR \p AND \P</a><br>
-<P>
+<p>
 <pre>
 \p{Bidi_Class:<class>} matches a character with the given class
 \p{BC:<class>} matches a character with the given class
@@ -285,10 +285,10 @@ The recognized classes are:
 RLO right-to-left override
 S segment separator
 WS white space
-</PRE>
-</P>
+</pre>
+</p>
 <br><a name="SEC11" href="#TOC1">CHARACTER CLASSES</a><br>
-<P>
+<p>
 <pre>
 [...] positive character class
 [^...] negative character class
@@ -314,8 +314,8 @@ The recognized classes are:
 In PCRE2, POSIX character set names recognize only ASCII characters by default,
 but some of them use Unicode properties if PCRE2_UCP is set. You can use
 \Q...\E inside a character class.
-</P>
-<P>
+</p>
+<p>
 When PCRE2_ALT_EXTENDED_CLASS is set, UTS#18 extended character classes may be
 used, allowing nested character classes, combined using set operators.
 <pre>
@@ -326,10 +326,10 @@ used, allowing nested character classes, combined using set operators.
 x--y set difference (AND NOT)
 x~~y set symmetric difference (XOR)
 
-</PRE>
-</P>
+</pre>
+</p>
 <br><a name="SEC12" href="#TOC1">PERL EXTENDED CHARACTER CLASSES</a><br>
-<P>
+<p>
 <pre>
 (?[...]) Perl extended character class
 (?[\p{Thai} & \p{Nd}]) operators; whitespace ignored
@@ -352,9 +352,9 @@ as an ordinary character class. Outside of a nested [...], the only items
 permitted are backslash-escapes, POSIX sets, operators, and parentheses. Inside
 a nested ordinary class, ^ has its usual meaning (inverts the class when used
 as the first character); outside of a nested class, ^ is the XOR operator.
-</P>
+</p>
 <br><a name="SEC13" href="#TOC1">QUANTIFIERS</a><br>
-<P>
+<p>
 <pre>
 ? 0 or 1, greedy
 ?+ 0 or 1, possessive
@@ -375,10 +375,10 @@ as the first character); outside of a nested class, ^ is the XOR operator.
 {,m} zero up to m, greedy
 {,m}+ zero up to m, possessive
 {,m}? zero up to m, lazy
-</PRE>
-</P>
+</pre>
+</p>
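The difference between the greedy and lazy forms in the table above can be seen in a short sketch. It uses Python's `re` module as a stand-in, which is an assumption made for convenience: Python is not PCRE2, but it shares the greedy and lazy quantifier syntax and the `{,m}` form. Possessive quantifiers are left out because older Python versions reject them. The sample text and variable names are illustrative only.

```python
import re

text = "<a><b>"

# Greedy: .+ takes as much as it can and gives back only what is needed.
greedy = re.match(r"<.+>", text).group(0)

# Lazy: .+? takes as little as it can and extends only when forced to.
lazy = re.match(r"<.+?>", text).group(0)

# {,m} with the minimum omitted means zero up to m repetitions.
within_limit = re.fullmatch(r"a{,2}", "aa") is not None
over_limit = re.fullmatch(r"a{,2}", "aaa") is not None
```

Here `greedy` spans both tags while `lazy` stops at the first closing bracket.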
 <br><a name="SEC14" href="#TOC1">ANCHORS AND SIMPLE ASSERTIONS</a><br>
-<P>
+<p>
 <pre>
 \b word boundary
 \B not a word boundary
@@ -393,10 +393,10 @@ as the first character); outside of a nested class, ^ is the XOR operator.
 also before newline at end of subject
 \z end of subject
 \G first matching position in subject
-</PRE>
-</P>
+</pre>
+</p>
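As a rough illustration of the two word-boundary assertions in this table, the sketch below uses Python's `re` module, which is assumed here only because it treats `\b` and `\B` the same way for ASCII text. It does not cover `\z` or `\G`, which Python spells differently or lacks. The sample string is made up.

```python
import re

text = "cat concat cats"

# \b on both sides: "cat" must stand alone as a word.
whole_word_starts = [m.start() for m in re.finditer(r"\bcat\b", text)]

# \B before "cat": the match must begin inside a word, not at its edge.
inner_starts = [m.start() for m in re.finditer(r"\Bcat", text)]
```

Only the standalone word at offset 0 satisfies `\bcat\b`, and only the occurrence inside "concat" satisfies `\Bcat`.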
 <br><a name="SEC15" href="#TOC1">REPORTED MATCH POINT SETTING</a><br>
-<P>
+<p>
 <pre>
 \K set reported start of match
 </pre>
@@ -404,15 +404,15 @@ From release 10.38 \K is not permitted by default in lookaround assertions,
 for compatibility with Perl. However, if the PCRE2_EXTRA_ALLOW_LOOKAROUND_BSK
 option is set, the previous behaviour is re-enabled. When this option is set,
 \K is honoured in positive assertions, but ignored in negative ones.
-</P>
+</p>
 <br><a name="SEC16" href="#TOC1">ALTERNATION</a><br>
-<P>
+<p>
 <pre>
 expr|expr|expr...
-</PRE>
-</P>
+</pre>
+</p>
 <br><a name="SEC17" href="#TOC1">CAPTURING</a><br>
-<P>
+<p>
 <pre>
 (...) capture group
 (?<name>...) named capture group (Perl)
@@ -425,22 +425,22 @@ option is set, the previous behaviour is re-enabled. When this option is set,
 In non-UTF modes, names may contain underscores and ASCII letters and digits;
 in UTF modes, any Unicode letters and Unicode decimal digits are permitted. In
 both cases, a name must not start with a digit.
-</P>
+</p>
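The naming rule above can be tried out with a small sketch. It uses Python's `re` module and the Python-style `(?P<name>...)` spelling, which PCRE2 also accepts; Python itself does not accept the Perl `(?<name>...)` form shown in the table. Python's rule for valid names is similar but not identical to PCRE2's, so treat this as an analogy. The helper `compiles` and the sample pattern are hypothetical.

```python
import re

# Named groups: letters, digits and underscores, not starting with a digit.
m = re.match(r"(?P<year>\d{4})-(?P<month>\d{2})", "2024-11")
fields = m.groupdict()


def compiles(pattern):
    """Return True if the pattern compiles, False if the engine rejects it."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# A name that starts with a digit is rejected at compile time.
digit_first_ok = compiles(r"(?P<1st>x)")
underscore_ok = compiles(r"(?P<_first1>x)")
```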
 <br><a name="SEC18" href="#TOC1">ATOMIC GROUPS</a><br>
-<P>
+<p>
 <pre>
 (?>...) atomic non-capture group
 (*atomic:...) atomic non-capture group
-</PRE>
-</P>
+</pre>
+</p>
 <br><a name="SEC19" href="#TOC1">COMMENT</a><br>
-<P>
+<p>
 <pre>
 (?#....) comment (not nestable)
-</PRE>
-</P>
+</pre>
+</p>
 <br><a name="SEC20" href="#TOC1">OPTION SETTING</a><br>
-<P>
+<p>
 Changes of these options within a group are automatically cancelled at the end
 of the group.
 <pre>
@@ -465,15 +465,15 @@ of the group.
 (?aP) implies (?aT) as well, though this has no additional effect. However, it
 means that (?-aP) also implies (?-aT) and disables all ASCII restrictions for
 POSIX classes.
-</P>
-<P>
+</p>
 Unsetting x or xx unsets both. Several options may be set at once, and a
 mixture of setting and unsetting such as (?i-x) is allowed, but there may be
 only one hyphen. Setting (but no unsetting) is allowed after (?^ for example
 (?^in). An option setting may appear at the start of a non-capture group, for
 example (?i:...).
-</P>
-<P>
+</p>
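The scoping of an option set at the start of a non-capture group, as in (?i:...), can be sketched as follows. The example uses Python's `re` module, which is an assumption of convenience: it supports the same `(?i:...)` form, but not PCRE2-only settings such as (?^...) or (?aP). The pattern and inputs are illustrative.

```python
import re

pattern = re.compile(r"(?i:abc)def")

# Caseless matching applies only inside the group.
inside_group = pattern.fullmatch("ABCdef") is not None

# Outside the group the option is cancelled, so "DEF" does not match "def".
outside_group = pattern.fullmatch("ABCDEF") is not None
```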
+<p>
 The following are recognized only at the very start of a pattern or after one
 of the newline or \R sequences or options with similar syntax. More than one
 of them may appear. For the first three, d is a decimal number.
@@ -497,9 +497,9 @@ the limits set by the caller of <b>pcre2_match()</b> or <b>pcre2_dfa_match()</b>
 not increase them. LIMIT_RECURSION is an obsolete synonym for LIMIT_DEPTH. The
 application can lock out the use of (*UTF) and (*UCP) by setting the
 PCRE2_NEVER_UTF or PCRE2_NEVER_UCP options, respectively, at compile time.
-</P>
+</p>
 <br><a name="SEC21" href="#TOC1">NEWLINE CONVENTION</a><br>
-<P>
+<p>
 These are recognized only at the very start of the pattern or after option
 settings with a similar syntax.
 <pre>
@@ -509,19 +509,19 @@ settings with a similar syntax.
 (*ANYCRLF) all three of the above
 (*ANY) any Unicode newline sequence
 (*NUL) the NUL character (binary zero)
-</PRE>
-</P>
+</pre>
+</p>
 <br><a name="SEC22" href="#TOC1">WHAT \R MATCHES</a><br>
-<P>
+<p>
 These are recognized only at the very start of the pattern or after option
 setting with a similar syntax.
 <pre>
 (*BSR_ANYCRLF) CR, LF, or CRLF
 (*BSR_UNICODE) any Unicode newline sequence
-</PRE>
-</P>
+</pre>
+</p>
 <br><a name="SEC23" href="#TOC1">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a><br>
-<P>
+<p>
 <pre>
 (?=...) )
 (*pla:...) ) positive lookahead
@@ -545,9 +545,9 @@ the maximum for each branch is limited to a value set by the caller of
 <b>pcre2_compile()</b> or defaulted. The default is set when PCRE2 is built
 (ultimate default 255). If every branch matches a fixed number of characters,
 the limit for each branch is 65535 characters.
-</P>
+</p>
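A brief sketch of the symbolic lookaround forms, using Python's `re` module as a stand-in. This is an assumption: Python supports `(?=...)`, `(?!...)` and fixed-length `(?<=...)`, but not the alphabetic synonyms such as (*pla:...) or variable-length lookbehind, and its limits differ from the PCRE2 limits described above. Sample strings are invented.

```python
import re

# Positive lookahead: digits only when "px" follows; "px" is not consumed.
pixel_sizes = re.findall(r"\d+(?=px)", "12px 3em 40px")

# Negative lookahead: "foo" only when "bar" does not follow.
foo_starts = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]

# Positive lookbehind with a fixed-length branch: digits preceded by "$".
dollar_amounts = re.findall(r"(?<=\$)\d+", "$5 and 7 and $30")
```

In each case the asserted text is checked but does not become part of the match.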
 <br><a name="SEC24" href="#TOC1">NON-ATOMIC LOOKAROUND ASSERTIONS</a><br>
-<P>
+<p>
 These assertions are specific to PCRE2 and are not Perl-compatible.
 <pre>
 (?*...) )
@@ -557,10 +557,10 @@ These assertions are specific to PCRE2 and are not Perl-compatible.
 (?<*...) )
 (*naplb:...) ) synonyms
 (*non_atomic_positive_lookbehind:...) )
-</PRE>
-</P>
+</pre>
+</p>
 <br><a name="SEC25" href="#TOC1">SUBSTRING SCAN ASSERTION</a><br>
-<P>
+<p>
 This feature is not Perl-compatible.
 <pre>
 (*scan_substring:(grouplist)...) scan captured substring
@@ -574,20 +574,20 @@ The comma-separated list may identify groups in any of the following ways:
 <name> name
 'name' name
 
-</PRE>
-</P>
+</pre>
+</p>
 <br><a name="SEC26" href="#TOC1">SCRIPT RUNS</a><br>
-<P>
+<p>
 <pre>
 (*script_run:...) ) script run, can be backtracked into
 (*sr:...) )
 
 (*atomic_script_run:...) ) atomic script run
 (*asr:...) )
-</PRE>
-</P>
+</pre>
+</p>
 <br><a name="SEC27" href="#TOC1">BACKREFERENCES</a><br>
-<P>
+<p>
 <pre>
 \n reference by number (can be ambiguous)
 \gn reference by number
@@ -601,10 +601,10 @@ The comma-separated list may identify groups in any of the following ways:
 \g{name} reference by name (Perl)
 \k{name} reference by name (.NET)
 (?P=name) reference by name (Python)
-</PRE>
-</P>
+</pre>
+</p>
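The Python-style `(?P=name)` form and the plain numeric `\n` form from this table can be exercised directly in Python's `re` module, which is used here as a stand-in for PCRE2. The other spellings (`\g{name}`, `\k{name}`) are not accepted by Python, so they are not shown. The pattern and sample strings are illustrative.

```python
import re

# (?P=q) must match exactly what group "q" captured: the same quote character.
quoted = re.compile(r"(?P<q>['\"])(.*?)(?P=q)")
matched_body = quoted.search("say 'hi' now").group(2)
mismatched = quoted.search("'oops\"")

# \1 refers back to the first capture group by number: a doubled word.
doubled_word = re.search(r"\b(\w+) \1\b", "it is is fine").group(1)
```

A string that opens with one quote character and closes with the other does not match, because the backreference compares against the captured text, not the group's pattern.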
 <br><a name="SEC28" href="#TOC1">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a><br>
-<P>
+<p>
 <pre>
 (?R) recurse whole pattern
 (?n) call subroutine by absolute number
@@ -620,10 +620,10 @@ The comma-separated list may identify groups in any of the following ways:
 \g'+n' call subroutine by relative number (PCRE2 extension)
 \g<-n> call subroutine by relative number (PCRE2 extension)
 \g'-n' call subroutine by relative number (PCRE2 extension)
-</PRE>
-</P>
+</pre>
+</p>
 <br><a name="SEC29" href="#TOC1">CONDITIONAL PATTERNS</a><br>
-<P>
+<p>
 <pre>
 (?(condition)yes-pattern)
 (?(condition)yes-pattern|no-pattern)
@@ -644,9 +644,9 @@ The comma-separated list may identify groups in any of the following ways:
 Note the ambiguity of (?(R) and (?(Rn) which might be named reference
 conditions or recursion tests. Such a condition is interpreted as a reference
 condition if the relevant named group exists.
-</P>
+</p>
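The two conditional forms can be sketched with a group-reference condition, the one kind of condition that Python's `re` module also supports. Python is assumed here only as a stand-in; recursion tests such as (?(R)...) and the other PCRE2 conditions are not available in it. The patterns are illustrative.

```python
import re

# (?(1)yes): require ">" only if group 1 (the optional "<") took part.
one_branch = re.compile(r"(<)?\w+(?(1)>)")
bracketed = one_branch.fullmatch("<a>") is not None
bare = one_branch.fullmatch("a") is not None
unclosed = one_branch.fullmatch("<a") is not None

# (?(1)yes|no): ")" if "(" was matched, otherwise ";".
two_branch = re.compile(r"(\()?\d+(?(1)\)|;)")
with_paren = two_branch.fullmatch("(12)") is not None
with_semicolon = two_branch.fullmatch("12;") is not None
mixed = two_branch.fullmatch("(12;") is not None
```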
 <br><a name="SEC30" href="#TOC1">BACKTRACKING CONTROL</a><br>
-<P>
+<p>
 All backtracking control verbs may be in the form (*VERB:NAME). For (*MARK) the
 name is mandatory, for the others it is optional. (*SKIP) changes its behaviour
 if :NAME is present. The others just set a name for passing back to the caller,
@@ -671,9 +671,9 @@ pattern is not anchored.
 </pre>
 The effect of one of these verbs in a group called as a subroutine is confined
 to the subroutine call.
-</P>
+</p>
 <br><a name="SEC31" href="#TOC1">CALLOUTS</a><br>
-<P>
+<p>
 <pre>
 (?C) callout (assumed number 0)
 (?Cn) callout with numerical data n
@@ -682,9 +682,9 @@ to the subroutine call.
 The allowed string delimiters are ` ' " ^ % # $ (which are the same for the
 start and the end), and the starting delimiter { matched with the ending
 delimiter }. To encode the ending delimiter within the string, double it.
-</P>
+</p>
 <br><a name="SEC32" href="#TOC1">REPLACEMENT STRINGS</a><br>
-<P>
+<p>
 If the PCRE2_SUBSTITUTE_LITERAL option is set, a replacement string for
 <b>pcre2_substitute()</b> is not interpreted. Otherwise, by default, the only
 special character is the dollar character in one of the following forms:
@@ -700,8 +700,8 @@ special character is the dollar character in one of the following forms:
 </pre>
 For ${n}, n can be a name or a number. If PCRE2_SUBSTITUTE_EXTENDED is set,
 there is additional interpretation:
-</P>
-<P>
+</p>
+<p>
 1. Backslash is an escape character, and the forms described in "ESCAPED
 CHARACTERS" above are recognized. Also:
 <pre>
@@ -719,8 +719,8 @@ CHARACTERS" above are recognized. Also:
 2. The Python form \g<n>, where the angle brackets are part of the syntax and
 <i>n</i> is either a group name or a number, is recognized as an alternative way
 of inserting the contents of a group, for example \g<3>.
-</P>
-<P>
+</p>
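Since item 2 describes the Python form, it can be shown in Python's own `re.sub`, where `\g<n>` is always recognized. In PCRE2 the same form applies only when PCRE2_SUBSTITUTE_EXTENDED is set, as stated above. This sketch does not call <b>pcre2_substitute()</b>, and the patterns and inputs are invented for illustration.

```python
import re

# \g<number>: insert the contents of a group by number.
swapped = re.sub(r"(\d+)-(\d+)", r"\g<2>-\g<1>", "10-20")

# \g<name>: insert the contents of a group by name.
flipped = re.sub(r"(?P<key>\w+)=(?P<val>\w+)", r"\g<val>=\g<key>", "a=b")
```

The angle brackets remove the ambiguity that a bare numeric reference has when the next replacement character is itself a digit.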
+<p>
 3. Capture substitution supports the following additional forms:
 <pre>
 ${n:-string} default for unset group
@@ -728,23 +728,23 @@ of inserting the contents of a group, for example \g<3>.
 </pre>
 The substitution strings themselves are expanded. Backslash can be used to
 escape colons and closing curly brackets.
-</P>
+</p>
 <br><a name="SEC33" href="#TOC1">SEE ALSO</a><br>
-<P>
+<p>
 <b>pcre2pattern</b>(3), <b>pcre2api</b>(3), <b>pcre2callout</b>(3),
 <b>pcre2matching</b>(3), <b>pcre2</b>(3).
-</P>
+</p>
 <br><a name="SEC34" href="#TOC1">AUTHOR</a><br>
-<P>
+<p>
 Philip Hazel
 <br>
 Retired from University Computing Service
 <br>
 Cambridge, England.
 <br>
-</P>
+</p>
 <br><a name="SEC35" href="#TOC1">REVISION</a><br>
-<P>
+<p>
 Last updated: 27 November 2024
 <br>
 Copyright © 1997-2024 University of Cambridge.