Source and document file tidies for 10.20-RC1.

2025-10-21 14:41:52 +08:00 · 2015-06-18 16:39:25 +00:00
parent a68ddd48b5
commit 07a8fdce25
40 changed files with 677 additions and 439 deletions
--- a/doc/html/pcre2syntax.html
+++ b/doc/html/pcre2syntax.html
@@ -15,7 +15,7 @@ please consult the man page, in case the conversion went wrong.
 <ul>
 <li><a name="TOC1" href="#SEC1">PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY</a>
 <li><a name="TOC2" href="#SEC2">QUOTING</a>
-<li><a name="TOC3" href="#SEC3">CHARACTERS</a>
+<li><a name="TOC3" href="#SEC3">ESCAPED CHARACTERS</a>
 <li><a name="TOC4" href="#SEC4">CHARACTER TYPES</a>
 <li><a name="TOC5" href="#SEC5">GENERAL CATEGORY PROPERTIES FOR \p and \P</a>
 <li><a name="TOC6" href="#SEC6">PCRE2 SPECIAL CATEGORY PROPERTIES FOR \p and \P</a>
@@ -55,11 +55,12 @@ documentation. This document contains a quick-reference summary of the syntax.
  \Q...\E    treat enclosed characters as literal
 </PRE>
 </P>
-<br><a name="SEC3" href="#TOC1">CHARACTERS</a><br>
+<br><a name="SEC3" href="#TOC1">ESCAPED CHARACTERS</a><br>
 <P>
+This table applies to ASCII and Unicode environments.
 <pre>
  \a         alarm, that is, the BEL character (hex 07)
-  \cx        "control-x", where x is any ASCII character
+  \cx        "control-x", where x is any ASCII printing character
  \e         escape (hex 1B)
  \f         form feed (hex 0C)
  \n         newline (hex 0A)
@@ -68,18 +69,32 @@ documentation. This document contains a quick-reference summary of the syntax.
  \0dd       character with octal code 0dd
  \ddd       character with octal code ddd, or backreference
  \o{ddd..}  character with octal code ddd..
+  \U         "U" if PCRE2_ALT_BSUX is set (otherwise is an error)
+  \uhhhh     character with hex code hhhh (if PCRE2_ALT_BSUX is set)
  \xhh       character with hex code hh
  \x{hhh..}  character with hex code hhh..
 </pre>
-Note that \0dd is always an octal code, and that \8 and \9 are the literal
-characters "8" and "9".
+Note that \0dd is always an octal code. The treatment of backslash followed by
+a non-zero digit is complicated; for details see the section
+<a href="pcre2pattern.html#digitsafterbackslash">"Non-printing characters"</a>
+in the
+<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
+documentation, where details of escape processing in EBCDIC environments are
+also given.
+</P>
+<P>
+When \x is not followed by {, from zero to two hexadecimal digits are read,
+but if PCRE2_ALT_BSUX is set, \x must be followed by two hexadecimal digits to
+be recognized as a hexadecimal escape; otherwise it matches a literal "x".
+Likewise, if \u (in ALT_BSUX mode) is not followed by four hexadecimal digits,
+it matches a literal "u".
 </P>
 <br><a name="SEC4" href="#TOC1">CHARACTER TYPES</a><br>
 <P>
 <pre>
  .          any character except newline;
               in dotall mode, any character whatsoever
-  \C         one data unit, even in UTF mode (best avoided)
+  \C         one code unit, even in UTF mode (best avoided)
  \d         a decimal digit
  \D         a character that is not a decimal digit
  \h         a horizontal white space character
@@ -96,6 +111,11 @@ characters "8" and "9".
  \W         a "non-word" character
  \X         a Unicode extended grapheme cluster
 </pre>
+The application can lock out the use of \C by setting the
+PCRE2_NEVER_BACKSLASH_C option. It is dangerous because it may leave the
+current matching point in the middle of a UTF-8 or UTF-16 character.
+</P>
+<P>
 By default, \d, \s, and \w match only ASCII characters, even in UTF-8 mode
 or in the 16-bit and 32-bit libraries. However, if locale-specific matching is
 happening, \s and \w may also match characters with code points in the range
@@ -348,13 +368,14 @@ but some of them use Unicode properties if PCRE2_UCP is set. You can use
  \b          word boundary
  \B          not a word boundary
  ^           start of subject
-               also after internal newline in multiline mode
+                also after an internal newline in multiline mode
+                (after any newline if PCRE2_ALT_CIRCUMFLEX is set)
  \A          start of subject
  $           end of subject
-               also before newline at end of subject
-               also before internal newline in multiline mode
+                also before newline at end of subject
+                also before internal newline in multiline mode
  \Z          end of subject
-               also before newline at end of subject
+                also before newline at end of subject
  \z          end of subject
  \G          first matching position in subject
 </PRE>
@@ -423,7 +444,9 @@ appear.
  (*UCP)          set PCRE2_UCP (use Unicode properties for \d etc)
 </pre>
 Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of the
-limits set by the caller of pcre2_match(), not increase them.
+limits set by the caller of pcre2_match(), not increase them. The application
+can lock out the use of (*UTF) and (*UCP) by setting the PCRE2_NEVER_UTF or
+PCRE2_NEVER_UCP options, respectively, at compile time.
 </P>
 <br><a name="SEC17" href="#TOC1">NEWLINE CONVENTION</a><br>
 <P>
@@ -539,9 +562,9 @@ pattern is not anchored.
  (?Cn)           callout with numerical data n
  (?C"text")      callout with string data
 </pre>
-The allowed string delimiters are ` ' " ^ % # $ (which are the same for the 
-start and the end), and the starting delimiter { matched with the ending 
-delimiter }. To encode the ending delimiter within the string, double it.   
+The allowed string delimiters are ` ' " ^ % # $ (which are the same for the
+start and the end), and the starting delimiter { matched with the ending
+delimiter }. To encode the ending delimiter within the string, double it.
 </P>
 <br><a name="SEC25" href="#TOC1">SEE ALSO</a><br>
 <P>
@@ -559,7 +582,7 @@ Cambridge, England.
 </P>
 <br><a name="SEC27" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 15 March 2015
+Last updated: 13 June 2015
 <br>
 Copyright &copy; 1997-2015 University of Cambridge.
 <br>
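
Note on the new \x wording in the ESCAPED CHARACTERS hunk: the rule it describes can be sketched in C as below. This is only an illustration of the documented behaviour, not PCRE2's actual parser. The names `hex_value` and `parse_backslash_x` are hypothetical, the `\x{hhh..}` form is deliberately not handled, and treating "zero digits read" as code point 0 is an assumption about what the documentation means.

```c
#include <ctype.h>

/* Hypothetical helper (not part of the PCRE2 API): value of one hex digit.
   The caller must already have checked isxdigit(c). */
static unsigned int hex_value(char c)
{
    if (c >= '0' && c <= '9')
        return (unsigned int)(c - '0');
    return 10u + (unsigned int)(tolower((unsigned char)c) - 'a');
}

/* Hypothetical sketch of the documented rule for \x when it is NOT
   followed by '{'.  `p` points at the character just after "\x".

   Default mode:   read zero to two hex digits; *out is their value
                   (assumed to be 0 when no digit is present).
   ALT_BSUX mode:  exactly two hex digits are required; otherwise the
                   escape is a literal 'x' and no digits are consumed.

   Returns the number of digit characters consumed from `p`. */
static int parse_backslash_x(const char *p, int alt_bsux, unsigned int *out)
{
    int n = 0;
    unsigned int value = 0;
    int i;

    /* Count up to two leading hexadecimal digits. */
    while (n < 2 && isxdigit((unsigned char)p[n]))
        n++;

    if (alt_bsux && n < 2) {
        *out = (unsigned int)'x';   /* literal "x" */
        return 0;
    }

    for (i = 0; i < n; i++)
        value = value * 16u + hex_value(p[i]);

    *out = value;
    return n;
}
```

For example, under this sketch `parse_backslash_x("4Z", 0, &v)` consumes one digit and yields 4, while the same input with `alt_bsux` set consumes nothing and yields the literal character `x`.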