Source and document file tidies for 10.20-RC1.

Philip.Hazel
2015-06-18 16:39:25 +00:00
parent a68ddd48b5
commit 07a8fdce25
40 changed files with 677 additions and 439 deletions

@@ -15,7 +15,7 @@ please consult the man page, in case the conversion went wrong.
<ul>
<li><a name="TOC1" href="#SEC1">PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY</a>
<li><a name="TOC2" href="#SEC2">QUOTING</a>
<li><a name="TOC3" href="#SEC3">CHARACTERS</a>
<li><a name="TOC3" href="#SEC3">ESCAPED CHARACTERS</a>
<li><a name="TOC4" href="#SEC4">CHARACTER TYPES</a>
<li><a name="TOC5" href="#SEC5">GENERAL CATEGORY PROPERTIES FOR \p and \P</a>
<li><a name="TOC6" href="#SEC6">PCRE2 SPECIAL CATEGORY PROPERTIES FOR \p and \P</a>
@@ -55,11 +55,12 @@ documentation. This document contains a quick-reference summary of the syntax.
\Q...\E treat enclosed characters as literal
</PRE>
</P>
<br><a name="SEC3" href="#TOC1">CHARACTERS</a><br>
<br><a name="SEC3" href="#TOC1">ESCAPED CHARACTERS</a><br>
<P>
This table applies to ASCII and Unicode environments.
<pre>
\a alarm, that is, the BEL character (hex 07)
\cx "control-x", where x is any ASCII character
\cx "control-x", where x is any ASCII printing character
\e escape (hex 1B)
\f form feed (hex 0C)
\n newline (hex 0A)
@@ -68,18 +69,32 @@ documentation. This document contains a quick-reference summary of the syntax.
\0dd character with octal code 0dd
\ddd character with octal code ddd, or backreference
\o{ddd..} character with octal code ddd..
\U "U" if PCRE2_ALT_BSUX is set (otherwise is an error)
\uhhhh character with hex code hhhh (if PCRE2_ALT_BSUX is set)
\xhh character with hex code hh
\x{hhh..} character with hex code hhh..
</pre>
Note that \0dd is always an octal code, and that \8 and \9 are the literal
characters "8" and "9".
Note that \0dd is always an octal code. The treatment of backslash followed by
a non-zero digit is complicated; for details see the section
<a href="pcre2pattern.html#digitsafterbackslash">"Non-printing characters"</a>
in the
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
documentation, where details of escape processing in EBCDIC environments are
also given.
</P>
<P>
When \x is not followed by {, from zero to two hexadecimal digits are read,
but if PCRE2_ALT_BSUX is set, \x must be followed by two hexadecimal digits to
be recognized as a hexadecimal escape; otherwise it matches a literal "x".
Likewise, if \u (in ALT_BSUX mode) is not followed by four hexadecimal digits,
it matches a literal "u".
</P>
<br><a name="SEC4" href="#TOC1">CHARACTER TYPES</a><br>
<P>
<pre>
. any character except newline;
in dotall mode, any character whatsoever
\C one data unit, even in UTF mode (best avoided)
\C one code unit, even in UTF mode (best avoided)
\d a decimal digit
\D a character that is not a decimal digit
\h a horizontal white space character
@@ -96,6 +111,11 @@ characters "8" and "9".
\W a "non-word" character
\X a Unicode extended grapheme cluster
</pre>
The application can lock out the use of \C by setting the
PCRE2_NEVER_BACKSLASH_C option. It is dangerous because it may leave the
current matching point in the middle of a UTF-8 or UTF-16 character.
</P>
<P>
By default, \d, \s, and \w match only ASCII characters, even in UTF-8 mode
or in the 16-bit and 32-bit libraries. However, if locale-specific matching is
happening, \s and \w may also match characters with code points in the range
@@ -348,13 +368,14 @@ but some of them use Unicode properties if PCRE2_UCP is set. You can use
\b word boundary
\B not a word boundary
^ start of subject
also after internal newline in multiline mode
also after an internal newline in multiline mode
(after any newline if PCRE2_ALT_CIRCUMFLEX is set)
\A start of subject
$ end of subject
also before newline at end of subject
also before internal newline in multiline mode
also before newline at end of subject
also before internal newline in multiline mode
\Z end of subject
also before newline at end of subject
also before newline at end of subject
\z end of subject
\G first matching position in subject
</PRE>
@@ -423,7 +444,9 @@ appear.
(*UCP) set PCRE2_UCP (use Unicode properties for \d etc)
</pre>
Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of the
limits set by the caller of pcre2_match(), not increase them.
limits set by the caller of pcre2_match(), not increase them. The application
can lock out the use of (*UTF) and (*UCP) by setting the PCRE2_NEVER_UTF or
PCRE2_NEVER_UCP options, respectively, at compile time.
</P>
<br><a name="SEC17" href="#TOC1">NEWLINE CONVENTION</a><br>
<P>
@@ -539,9 +562,9 @@ pattern is not anchored.
(?Cn) callout with numerical data n
(?C"text") callout with string data
</pre>
The allowed string delimiters are ` ' " ^ % # $ (which are the same for the
start and the end), and the starting delimiter { matched with the ending
delimiter }. To encode the ending delimiter within the string, double it.
The allowed string delimiters are ` ' " ^ % # $ (which are the same for the
start and the end), and the starting delimiter { matched with the ending
delimiter }. To encode the ending delimiter within the string, double it.
</P>
<br><a name="SEC25" href="#TOC1">SEE ALSO</a><br>
<P>
@@ -559,7 +582,7 @@ Cambridge, England.
</P>
<br><a name="SEC27" href="#TOC1">REVISION</a><br>
<P>
Last updated: 15 March 2015
Last updated: 13 June 2015
<br>
Copyright &copy; 1997-2015 University of Cambridge.
<br>