Documentation update for new PCRE2_EXTRA caseless and ASCII options

2025-10-22 16:08:34 +08:00 · 2023-02-04 17:19:56 +00:00
parent 9c905ce0c1
commit 6bf8045997
18 changed files with 2797 additions and 2538 deletions
--- a/doc/html/pcre2unicode.html
+++ b/doc/html/pcre2unicode.html
@@ -118,21 +118,22 @@ and \B, because they are defined in terms of \w and \W. If you want
 to test for a wider sense of, say, "digit", you can use explicit Unicode
 property tests such as \p{Nd}. Alternatively, if you set the PCRE2_UCP option,
 the way that the character escapes work is changed so that Unicode properties
-are used to determine which characters match. There are more details in the
-section on
+are used to determine which characters match, though there are some options
+that suppress this for individual escapes. For details see the section on
 <a href="pcre2pattern.html#genericchartypes">generic character types</a>
 in the
 <a href="pcre2pattern.html"><b>pcre2pattern</b></a>
 documentation.
 </P>
 <P>
-Similarly, characters that match the POSIX named character classes are all
-low-valued characters, unless the PCRE2_UCP option is set.
+Like the escapes, characters that match the POSIX named character classes are
+all low-valued characters unless the PCRE2_UCP option is set, but there is an
+option to override this.
 </P>
 <P>
-However, the special horizontal and vertical white space matching escapes (\h,
-\H, \v, and \V) do match all the appropriate Unicode characters, whether or
-not PCRE2_UCP is set.
+In contrast to the character escapes and character classes, the special
+horizontal and vertical white space escapes (\h, \H, \v, and \V) do match
+all the appropriate Unicode characters, whether or not PCRE2_UCP is set.
 </P>
 <br><b>
 UNICODE CASE-EQUIVALENCE
@@ -145,6 +146,14 @@ lookup is used for speed. A few Unicode characters such as Greek sigma have
 more than two code points that are case-equivalent, and these are treated
 specially. Setting PCRE2_UCP without PCRE2_UTF allows Unicode-style case
 processing for non-UTF character encodings such as UCS-2.
+</P>
+<P>
+There are two ASCII characters (S and K) that, in addition to their ASCII lower
+case equivalents, have a non-ASCII one as well (long S and Kelvin sign).
+Recognition of these non-ASCII characters as case-equivalent to their ASCII
+counterparts can be disabled by setting the PCRE2_EXTRA_CASELESS_RESTRICT
+option. When this is set, all characters in a case equivalence must be either
+all ASCII or all non-ASCII; there can be no mixing.
 <a name="scriptruns"></a></P>
 <br><b>
 SCRIPT RUNS
@@ -501,7 +510,7 @@ Cambridge, England.
 REVISION
 </b><br>
 <P>
-Last updated: 20 January 2023
+Last updated: 04 February 2023
 <br>
 Copyright &copy; 1997-2023 University of Cambridge.
 <br>