diff --git a/ChangeLog b/ChangeLog index 676f0a08..b027003d 100644 --- a/ChangeLog +++ b/ChangeLog @@ -69,15 +69,19 @@ for example (?<=AB|CD?). Now all branches are checked for variability. 11. Matching with pcre2_match() could give an incorrect result if a variable-length lookbehind was used as the condition in a conditional group. The condition could erroneously be treated as true if a branch matched but -overran the current position. This bug was in the interpreter only; matching +overran the current position. This bug was in the interpreter only; matching with JIT was correct. 12. Add a new error code (PCRE2_ERROR_JIT_UNSUPPORTED) which is yielded for unsupported jit features. -13. Add a new experimental feature called scan substring. This feature -is a new type of assertion which matches the content of a capturing block -to a sub pattern. +13. Add a new experimental feature called scan substring. This feature is a new +type of assertion which matches the content of a capturing block to a sub +pattern. + +14. Item 43 of 10.43 was incomplete because it addressed only \z and not \Z, +which was still misbehaving when matching fragments inside invalid UTF strings. + Version 10.44 07-June-2024 diff --git a/HACKING b/HACKING index 2ed86381..58ff1f4c 100644 --- a/HACKING +++ b/HACKING @@ -259,14 +259,16 @@ META_COND_RNAME (?(R&name) META_COND_RNUMBER (?(Rdigits) META_RECURSE_BYNAME (?&name) META_BACKREF_BYNAME \k'name' or \k or \k{name} or \g{name} +META_SCS_NAME (*scs:()...) META_COND_RNUMBER is used for names that start with R and continue with digits, because this is an ambiguous case. It could be a back reference to a group with that name, or it could be a recursion test on a numbered group. -This one is followed by an offset, for use in error messages, then a number: +These are followed by an offset, for use in error messages, then a number: META_COND_NUMBER (?([+-]digits) +META_SCS_NUMBER (*scs:(digits)...) 
The following is followed just by an offset, for use in error messages: @@ -752,6 +754,10 @@ In ASCII or UTF-32 mode, the character counts in OP_REVERSE and OP_VREVERSE are also the number of code units, but in UTF-8/16 mode each character may occupy more than one code unit. +The "scan substring" assertion compiles as OP_ASSERT_SCS. What follows takes +the same form as a conditional subpattern with a back reference condition (see +next section). + Conditional subpatterns ----------------------- @@ -859,4 +865,4 @@ The file maint/README contains additional information. Philip Hazel -June 2024 +August 2024 diff --git a/doc/html/pcre2_jit_compile.html b/doc/html/pcre2_jit_compile.html index 626297a0..791dd0c3 100644 --- a/doc/html/pcre2_jit_compile.html +++ b/doc/html/pcre2_jit_compile.html @@ -34,11 +34,12 @@ documentation.

The availability of JIT support can be tested by calling -pcre2_compile_jit() with a NULL first argument and the single option -PCRE2_JIT_TEST_ALLOC. Such a call returns zero if JIT is available and has a -working allocator. Otherwise it returns PCRE2_ERROR_NOMEMORY if JIT is -available but cannot allocate executable memory, or PCRE2_ERROR_NULL if JIT -support is not compiled. +pcre2_jit_compile() with the single option PCRE2_JIT_TEST_ALLOC (the +code argument is ignored, so a NULL value is accepted). Such a call +returns zero if JIT is available and has a working allocator. Otherwise +it returns PCRE2_ERROR_NOMEMORY if JIT is available but cannot allocate +executable memory, or PCRE2_ERROR_JIT_UNSUPPORTED if JIT support is not +compiled.

Otherwise, the first argument must be a pointer that was returned by a @@ -59,7 +60,8 @@ for success, or a negative error code otherwise. In particular, PCRE2_ERROR_JIT_BADOPTION is returned if JIT is not supported or if an unknown bit is set in options. The function can also return PCRE2_ERROR_NOMEMORY if JIT is unable to allocate executable memory for the compiler, even if it was -because of a system security restriction. +because of a system security restriction. In a few cases, the function may +return with PCRE2_ERROR_JIT_UNSUPPORTED for unsupported features.

There is a complete description of the PCRE2 native API in the diff --git a/doc/html/pcre2compat.html b/doc/html/pcre2compat.html index 7ed2360f..62530db3 100644 --- a/doc/html/pcre2compat.html +++ b/doc/html/pcre2compat.html @@ -236,6 +236,10 @@ and condition references such as (?(4)...). PCRE2 supports relative group numbers such as +2 and -4 in all three cases. Perl supports both plus and minus for subroutine calls, but only minus for back references, and no relative numbering at all for conditions. +
+
+(m) The scan substring assertion (syntax (*scs:(n)...)) is a PCRE2 extension +that is not available in Perl.

20. Perl has different limits than PCRE2. See the @@ -272,7 +276,7 @@ Cambridge, England. REVISION

-Last updated: 12 August 2024 +Last updated: 30 August 2024
Copyright © 1997-2024 University of Cambridge.
diff --git a/doc/html/pcre2jit.html b/doc/html/pcre2jit.html index 3a139c88..5f1373a0 100644 --- a/doc/html/pcre2jit.html +++ b/doc/html/pcre2jit.html @@ -78,11 +78,11 @@ be used.

As of release 10.45 there is a more informative way to test for JIT support. If -pcre2_compile_jit() is called with a NULL first argument and the single -option PCRE2_JIT_TEST_ALLOC, it returns zero if JIT is available and has a -working allocator. Otherwise it returns PCRE2_ERROR_NOMEMORY if JIT is -available but cannot allocate executable memory, or PCRE2_ERROR_NULL if JIT -support is not compiled. +pcre2_jit_compile() is called with the single option PCRE2_JIT_TEST_ALLOC, +it returns zero if JIT is available and has a working allocator. Otherwise it +returns PCRE2_ERROR_NOMEMORY if JIT is available but cannot allocate executable +memory, or PCRE2_ERROR_JIT_UNSUPPORTED if JIT support is not compiled. The +code argument is ignored, so it can be a NULL value.

A simple program does not need to check availability in order to use JIT when diff --git a/doc/html/pcre2matching.html b/doc/html/pcre2matching.html index 3b8b6293..4d023250 100644 --- a/doc/html/pcre2matching.html +++ b/doc/html/pcre2matching.html @@ -27,7 +27,7 @@ please consult the man page, in case the conversion went wrong. This document describes the two different algorithms that are available in PCRE2 for matching a compiled regular expression against a given subject string. The "standard" algorithm is the one provided by the pcre2_match() -function. This works in the same as Perl's matching function, and provide a +function. This works in the same as Perl's matching function, and provides a Perl-compatible matching operation. The just-in-time (JIT) optimization that is described in the pcre2jit @@ -42,7 +42,7 @@ these are described below.

When there is only one possible way in which a given subject string can match a pattern, the two algorithms give the same answer. A difference arises, however, -when there are multiple possibilities. For example, if the pattern +when there are multiple possibilities. For example, if the anchored pattern

   ^<.*>
 
@@ -115,9 +115,9 @@ algorithm after the first match (which is necessarily the shortest) is found.

Note that the size of vector needed to contain all the results depends on the -number of simultaneous matches, not on the number of parentheses in the -pattern. Using pcre2_match_data_create_from_pattern() to create the match -data block is therefore not advisable when doing DFA matching. +number of simultaneous matches, not on the number of capturing parentheses in +the pattern. Using pcre2_match_data_create_from_pattern() to create the +match data block is therefore not advisable when doing DFA matching.

Note also that all the matches that are found start at the same point in the @@ -166,37 +166,43 @@ possibilities, and PCRE2's implementation of this algorithm does not attempt to do this. This means that no captured substrings are available.

-3. Because no substrings are captured, backreferences within the pattern are -not supported. +3. Because no substrings are captured, a number of related features are not +available: +
+
+(a) Backreferences; +
+
+(b) Conditional expressions that use a backreference as the condition or test +for a specific group recursion; +
+
+(c) Script runs; +
+
+(d) Scan substring assertions.

-4. For the same reason, conditional expressions that use a backreference as the -condition or test for a specific group recursion are not supported. -

-

-5. Again for the same reason, script runs are not supported. -

-

-6. Because many paths through the tree may be active, the \K escape sequence, +4. Because many paths through the tree may be active, the \K escape sequence, which resets the start of the match when encountered (but may be on some paths and not on others), is not supported.

-7. Callouts are supported, but the value of the capture_top field is +5. Callouts are supported, but the value of the capture_top field is always 1, and the value of the capture_last field is always 0.

-8. The \C escape sequence, which (in the standard algorithm) always matches a -single code unit, even in a UTF mode, is not supported in these modes, because +6. The \C escape sequence, which (in the standard algorithm) always matches a +single code unit, even in a UTF mode, is not supported in UTF modes because the alternative algorithm moves through the subject string one character (not code unit) at a time, for all active paths through the tree.

-9. Except for (*FAIL), the backtracking control verbs such as (*PRUNE) are not +7. Except for (*FAIL), the backtracking control verbs such as (*PRUNE) are not supported. (*FAIL) is supported, and behaves like a failing negative assertion.

-10. The PCRE2_MATCH_INVALID_UTF option for pcre2_compile() is not +8. The PCRE2_MATCH_INVALID_UTF option for pcre2_compile() is not supported by pcre2_dfa_match().


ADVANTAGES OF THE ALTERNATIVE ALGORITHM
@@ -223,15 +229,18 @@ because it has to search for all possible matches, but is also because it is less susceptible to optimization.

-2. Capturing parentheses, backreferences, script runs, and matching within -invalid UTF string are not supported. +2. Capturing parentheses and other features such as backreferences that rely on +them are not supported.

-3. Although atomic groups are supported, their use does not provide the +3. Matching within invalid UTF strings is not supported. +

+

+4. Although atomic groups are supported, their use does not provide the performance advantage that it does for the standard algorithm.

-4. JIT optimization is not supported. +5. JIT optimization is not supported.


AUTHOR

@@ -244,7 +253,7 @@ Cambridge, England.


REVISION

-Last updated: 19 January 2024 +Last updated: 30 August 2024
Copyright © 1997-2024 University of Cambridge.
diff --git a/doc/html/pcre2pattern.html b/doc/html/pcre2pattern.html index 4886cd03..415f07fa 100644 --- a/doc/html/pcre2pattern.html +++ b/doc/html/pcre2pattern.html @@ -34,17 +34,18 @@ please consult the man page, in case the conversion went wrong.

  • BACKREFERENCES
  • ASSERTIONS
  • NON-ATOMIC ASSERTIONS -
  • SCRIPT RUNS -
  • CONDITIONAL GROUPS -
  • COMMENTS -
  • RECURSIVE PATTERNS -
  • GROUPS AS SUBROUTINES -
  • ONIGURUMA SUBROUTINE SYNTAX -
  • CALLOUTS -
  • BACKTRACKING CONTROL -
  • SEE ALSO -
  • AUTHOR -
  • REVISION +
  • SCAN SUBSTRING ASSERTIONS +
  • SCRIPT RUNS +
  • CONDITIONAL GROUPS +
  • COMMENTS +
  • RECURSIVE PATTERNS +
  • GROUPS AS SUBROUTINES +
  • ONIGURUMA SUBROUTINE SYNTAX +
  • CALLOUTS +
  • BACKTRACKING CONTROL +
  • SEE ALSO +
  • AUTHOR +
  • REVISION
    PCRE2 REGULAR EXPRESSION DETAILS

    @@ -406,11 +407,11 @@ a character class, this causes an error, because the character class is then not terminated by a closing square bracket.

    -Another difference from Perl is that any appearance of \Q or \E inside what -might otherwise be a quantifier causes PCRE2 not to recognize the sequence as a -quantifier. Perl recognizes a quantifier if (redundantly) either of the numbers -is inside \Q...\E, but not if the separating comma is. When not recognized as -a quantifier a sequence such as {\Q1\E,2} is treated as the literal string +Another difference from Perl is that any appearance of \Q or \E inside what +might otherwise be a quantifier causes PCRE2 not to recognize the sequence as a +quantifier. Perl recognizes a quantifier if (redundantly) either of the numbers +is inside \Q...\E, but not if the separating comma is. When not recognized as +a quantifier a sequence such as {\Q1\E,2} is treated as the literal string "{1,2}".


    @@ -851,7 +852,7 @@ examples are all equivalent:
       \p{bidiclass=al}
       \p{BC=al}
    -  \p{ Bidi_Class : AL } 
    +  \p{ Bidi_Class : AL }
       \p{ Bi-di class = Al }
       \P{ ^ Bi-di class = Al }
     
    @@ -1099,7 +1100,7 @@ explicitly. These properties are: Xan matches characters that have either the L (letter) or the N (number) property. Xps matches the characters tab, linefeed, vertical tab, form feed, or -carriage return, and any other character that has the Z (separator) property +carriage return, and any other character that has the Z (separator) property (this includes the space character). Xsp is the same as Xps; in PCRE1 it used to exclude vertical tab, for Perl compatibility, but Perl changed. Xwd matches the same characters as Xan, plus those that match Mn (non-spacing mark) or Pc @@ -2411,21 +2412,30 @@ as normal.


    ASSERTIONS

    -An assertion is a test on the characters following or preceding the current -matching point that does not consume any characters. The simple assertions -coded as \b, \B, \A, \G, \Z, \z, ^ and $ are described +An assertion is a test that does not consume any characters. The test must +succeed for the match to continue. The simple assertions coded as \b, \B, +\A, \G, \Z, \z, ^ and $ are described above.

    -More complicated assertions are coded as parenthesized groups. There are two -kinds: those that look ahead of the current position in the subject string, and -those that look behind it, and in each case an assertion may be positive (must -match for the assertion to be true) or negative (must not match for the -assertion to be true). An assertion group is matched in the normal way, -and if it is true, matching continues after it, but with the matching position +More complicated assertions are coded as parenthesized groups. If matching such +a group succeeds, matching continues after it, but with the matching position in the subject string reset to what it was before the assertion was processed.

    +A special kind of assertion, called a "scan substring" assertion, matches a +subpattern against a previously captured substring. This is described in the +section entitled +"Scan substring assertions" +below. It is a PCRE2 extension, not compatible with Perl. +

    +

    +The other group-based assertions are of two kinds: those that look ahead of the +current position in the subject string, and those that look behind it, and in +each case an assertion may be positive (must match for the assertion to be +true) or negative (must not match for the assertion to be true). +

    +

    The Perl-compatible lookaround assertions are atomic. If an assertion is true, but there is a subsequent matching failure, there is no backtracking into the assertion. However, there are some cases where non-atomic assertions can be @@ -2709,8 +2719,62 @@ contain any control verbs such as (*ACCEPT). (This may change in future). Note that assertions that appear as conditions for conditional groups (see below) must be atomic. +

    +
    SCAN SUBSTRING ASSERTIONS
    +

    +A special kind of assertion, not compatible with Perl, makes it possible to +check the contents of a captured substring by matching it with a subpattern. +Because this involves capturing, this feature is not supported by +pcre2_dfa_match().

    -
    SCRIPT RUNS
    +

    +A scan substring assertion starts with the sequence (*scan_substring: or +(*scs: which is followed by a substring number (absolute or relative) or a +substring name enclosed in single quotes or angle brackets, within parentheses. +The rest of the item is the subpattern that is applied to the substring, as +shown here: +

    +  (*scan_substring:(1)...)
    +  (*scs:(-2)...)
    +  (*scs:('AB')...)
    +
    +The pattern match on the substring is always anchored, that is, it must match +from the start of the substring. There is no "bumpalong" if it does not match +at the start. The end of the subject is temporarily reset to be the end of the +substring, so \Z, \z, and $ will match there. However, the start of the +subject is not reset. This means that ^ matches only if the substring is +actually at the start of the main subject, but it also means that lookbehind +assertions into what precedes the substring are possible. +

    +

    +Here is a very simple example: find a word that contains the rare (in English) +sequence of letters "rh" not at the start: +

    +  \b(\w++)(*scs:(1).+rh)
    +
    +The first group captures a word which is then scanned by the second group. +This example does not actually need this heavyweight feature; the same match +can be achieved with: +
    +  \b\w+?rh\w*\b
    +
    +When things are more complicated, however, this feature can be useful. For +example, there is a rather complicated pattern in the PCRE2 test data that +checks whether an entire subject string is a palindrome, that is, whether the +sequence of letters is the same in both directions. Suppose you want to search +for individual words of two or more characters such as "level" that are +palindromes: +
    +  (\b\w{2,}+\b)(*scs:(1)...palindrome-matching-pattern...)
    +
    +A scan substring assertion fails if the group it references has not been set. +Within the subpattern, references to other groups work as normal. Other +capturing groups may appear, and will retain their values during ongoing +matching if the assertion succeeds. When PCRE2_DUPNAMES is set and there are +ambiguous group names, if the assertion scans a group by name, it is the lowest +numbered set group of the groups with that name that is scanned. +
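The "rh" example above can be approximated outside PCRE2 to make the semantics concrete. The following is only a sketch in Python's re module, which has no scan substring assertion: the helper name scan_substring_search is hypothetical, and it emulates the assertion in two steps (capture a word, then apply an anchored match to the captured text alone). It assumes the capturing pattern matches whole words, so that skipping to the next word is equivalent to PCRE2's retry at later positions. It does not reproduce the behaviour of ^ or of lookbehinds into the text preceding the substring.

```python
import re


def scan_substring_search(subject, capture, scan):
    """Rough emulation of  capture(*scs:(1)scan)  for word-like captures.

    'capture' must contain one capturing group; 'scan' is the subpattern
    applied to the captured text. Both parameter names are illustrative.
    """
    scan_re = re.compile(scan)
    for m in re.finditer(capture, subject):
        captured = m.group(1)
        # re.match is anchored at the start of the captured text (no
        # "bumpalong"), and the captured text ends where the substring
        # ends, mirroring the temporary reset of the end of the subject.
        if scan_re.match(captured):
            return m
    return None


# A word containing "rh" that is not at the start of the word.
found = scan_substring_search("the rhythm of myrrh", r"\b(\w+)", r".+rh")
simple = re.search(r"\b\w+?rh\w*\b", "the rhythm of myrrh")
```

Both approaches select "myrrh" and skip "rhythm", where "rh" occurs only at the start of the word. The two-step form is the one that generalizes to checks such as the palindrome case, where the test on the captured text is awkward to fold into a single pattern.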

    +
    SCRIPT RUNS

    In concept, a script run is a sequence of characters that are all from the same Unicode script such as Latin or Greek. However, because some scripts are @@ -2772,7 +2836,7 @@ parentheses. should not be used within a script run group, because it causes an immediate exit from the group, bypassing the script run checking.

    -
    CONDITIONAL GROUPS
    +
    CONDITIONAL GROUPS

    It is possible to cause the matching process to obey a pattern fragment conditionally or to choose between two alternative fragments, depending on @@ -2973,7 +3037,7 @@ positive and negative assertions, because matching always continues after the assertion, whether it succeeds or fails. (Compare non-conditional assertions, for which captures are retained only for positive assertions that succeed.)

    -
    COMMENTS
    +
    COMMENTS

    There are two ways of including comments in patterns that are processed by PCRE2. In both cases, the start of the comment must not be in a character @@ -3003,7 +3067,7 @@ a newline in the pattern. The sequence \n is still literal at this stage, so it does not terminate the comment. Only an actual character with the code value 0x0a (the default newline) does so.

    -
    RECURSIVE PATTERNS
    +
    RECURSIVE PATTERNS

    Consider the problem of matching a string in parentheses, allowing for unlimited nested parentheses. Without the use of recursion, the best that can @@ -3191,7 +3255,7 @@ alternative matches "a" and then recurses. In the recursion, \1 does now match "b" and so the whole match succeeds. This match used to fail in Perl, but in later versions (I tried 5.024) it now works.

    -
    GROUPS AS SUBROUTINES
    +
    GROUPS AS SUBROUTINES

    If the syntax for a recursive group call (either by number or by name) is used outside the parentheses to which it refers, it operates a bit like a subroutine @@ -3239,7 +3303,7 @@ in groups when called as subroutines is described in the section entitled "Backtracking verbs in subroutines" below.

    -
    ONIGURUMA SUBROUTINE SYNTAX
    +
    ONIGURUMA SUBROUTINE SYNTAX

    For compatibility with Oniguruma, the non-Perl syntax \g followed by a name or a number enclosed either in angle brackets or single quotes, is an alternative @@ -3257,7 +3321,7 @@ plus or a minus sign it is taken as a relative reference. For example: Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are not synonymous. The former is a backreference; the latter is a subroutine call.

    -
    CALLOUTS
    +
    CALLOUTS

    Perl has a feature whereby using the sequence (?{...}) causes arbitrary Perl code to be obeyed in the middle of matching a regular expression. This makes it @@ -3333,7 +3397,7 @@ example: The doubling is removed before the string is passed to the callout function.

    -
    BACKTRACKING CONTROL
    +
    BACKTRACKING CONTROL

    There are a number of special "Backtracking Control Verbs" (to use Perl's terminology) that modify the behaviour of backtracking during matching. They @@ -3856,12 +3920,12 @@ enclosing group that has alternatives (its normal behaviour). However, if there is no such group within the subroutine's group, the subroutine match fails and there is a backtrack at the outer level.

    -
    SEE ALSO
    +
    SEE ALSO

    pcre2api(3), pcre2callout(3), pcre2matching(3), pcre2syntax(3), pcre2(3).

    -
    AUTHOR
    +
    AUTHOR

    Philip Hazel
    @@ -3870,9 +3934,9 @@ Retired from University Computing Service Cambridge, England.

    -
    REVISION
    +
    REVISION

    -Last updated: 12 August 2024 +Last updated: 30 August 2024
    Copyright © 1997-2024 University of Cambridge.
    diff --git a/doc/html/pcre2syntax.html b/doc/html/pcre2syntax.html index fa3b275d..c042ab36 100644 --- a/doc/html/pcre2syntax.html +++ b/doc/html/pcre2syntax.html @@ -36,15 +36,16 @@ please consult the man page, in case the conversion went wrong.

  • WHAT \R MATCHES
  • LOOKAHEAD AND LOOKBEHIND ASSERTIONS
  • NON-ATOMIC LOOKAROUND ASSERTIONS -
  • SCRIPT RUNS -
  • BACKREFERENCES -
  • SUBROUTINE REFERENCES (POSSIBLY RECURSIVE) -
  • CONDITIONAL PATTERNS -
  • BACKTRACKING CONTROL -
  • CALLOUTS -
  • SEE ALSO -
  • AUTHOR -
  • REVISION +
  • SUBSTRING SCAN ASSERTION +
  • SCRIPT RUNS +
  • BACKREFERENCES +
  • SUBROUTINE REFERENCES (POSSIBLY RECURSIVE) +
  • CONDITIONAL PATTERNS +
  • BACKTRACKING CONTROL +
  • CALLOUTS +
  • SEE ALSO +
  • AUTHOR +
  • REVISION
    PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY

    @@ -60,7 +61,7 @@ documentation. This document contains a quick-reference summary of the syntax. \Q...\E treat enclosed characters as literal Note that white space inside \Q...\E is always treated as literal, even if -PCRE2_EXTENDED is set, causing most other white space to be ignored. Note also +PCRE2_EXTENDED is set, causing most other white space to be ignored. Note also that PCRE2's handling of \Q...\E has some differences from Perl's. See the pcre2pattern documentation for details. @@ -509,7 +510,19 @@ These assertions are specific to PCRE2 and are not Perl-compatible. (*non_atomic_positive_lookbehind:...) )

    -
    SCRIPT RUNS
    +
    SUBSTRING SCAN ASSERTION
    +

    +This feature is not Perl-compatible. +

    +  (*scs:(n)...)       scan substring by absolute reference
    +  (*scs:(+n)...)      scan substring by relative reference
    +  (*scs:(-n)...)      scan substring by relative reference
    +  (*scs:(<name>)...)  scan substring by name
    +  (*scs:('name')...)  scan substring by name
    +
    +The full name "scan_substring" may be used instead of "scs". +

    +
    SCRIPT RUNS

       (*script_run:...)           ) script run, can be backtracked into
    @@ -519,7 +532,7 @@ These assertions are specific to PCRE2 and are not Perl-compatible.
       (*asr:...)                  )
     

    -
    BACKREFERENCES
    +
    BACKREFERENCES

       \n              reference by number (can be ambiguous)
    @@ -536,7 +549,7 @@ These assertions are specific to PCRE2 and are not Perl-compatible.
       (?P=name)       reference by name (Python)
     

    -
    SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)
    +
    SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)

       (?R)            recurse whole pattern
    @@ -555,7 +568,7 @@ These assertions are specific to PCRE2 and are not Perl-compatible.
       \g'-n'          call subroutine by relative number (PCRE2 extension)
     

    -
    CONDITIONAL PATTERNS
    +
    CONDITIONAL PATTERNS

       (?(condition)yes-pattern)
    @@ -578,7 +591,7 @@ Note the ambiguity of (?(R) and (?(Rn) which might be named reference
     conditions or recursion tests. Such a condition is interpreted as a reference
     condition if the relevant named group exists.
     

    -
    BACKTRACKING CONTROL
    +
    BACKTRACKING CONTROL

    All backtracking control verbs may be in the form (*VERB:NAME). For (*MARK) the name is mandatory, for the others it is optional. (*SKIP) changes its behaviour @@ -605,7 +618,7 @@ pattern is not anchored. The effect of one of these verbs in a group called as a subroutine is confined to the subroutine call.

    -
    CALLOUTS
    +
    CALLOUTS

       (?C)            callout (assumed number 0)
    @@ -616,12 +629,12 @@ The allowed string delimiters are ` ' " ^ % # $ (which are the same for the
     start and the end), and the starting delimiter { matched with the ending
     delimiter }. To encode the ending delimiter within the string, double it.
     

    -
    SEE ALSO
    +
    SEE ALSO

    pcre2pattern(3), pcre2api(3), pcre2callout(3), pcre2matching(3), pcre2(3).

    -
    AUTHOR
    +
    AUTHOR

    Philip Hazel
    @@ -630,9 +643,9 @@ Retired from University Computing Service Cambridge, England.

    -
    REVISION
    +
    REVISION

    -Last updated: 12 August 2024 +Last updated: 30 August 2024
    Copyright © 1997-2024 University of Cambridge.
    diff --git a/doc/pcre2.txt b/doc/pcre2.txt index 510ad4ee..59b751f8 100644 --- a/doc/pcre2.txt +++ b/doc/pcre2.txt @@ -188,8 +188,8 @@ REVISION PCRE2 10.38 27 August 2021 PCRE2(3) ------------------------------------------------------------------------------ - - + + PCRE2API(3) Library Functions Manual PCRE2API(3) @@ -4030,8 +4030,8 @@ REVISION PCRE2 10.45 04 August 2024 PCRE2API(3) ------------------------------------------------------------------------------ - - + + PCRE2BUILD(3) Library Functions Manual PCRE2BUILD(3) @@ -4656,8 +4656,8 @@ REVISION PCRE2 10.44 15 April 2024 PCRE2BUILD(3) ------------------------------------------------------------------------------ - - + + PCRE2CALLOUT(3) Library Functions Manual PCRE2CALLOUT(3) @@ -5089,8 +5089,8 @@ REVISION PCRE2 10.43 19 January 2024 PCRE2CALLOUT(3) ------------------------------------------------------------------------------ - - + + PCRE2COMPAT(3) Library Functions Manual PCRE2COMPAT(3) @@ -5293,20 +5293,23 @@ DIFFERENCES BETWEEN PCRE2 AND PERL Perl supports both plus and minus for subroutine calls, but only minus for back references, and no relative numbering at all for conditions. + (m) The scan substring assertion (syntax (*scs:(n)...)) is a PCRE2 ex- + tension that is not available in Perl. + 20. Perl has different limits than PCRE2. See the pcre2limit documenta- tion for details. Perl went with 5.10 from recursion to iteration keep- ing the intermediate matches on the heap, which is ~10% slower but does - not fall into any stack-overflow limit. PCRE2 made a similar change at - release 10.30, and also has many build-time and run-time customizable + not fall into any stack-overflow limit. PCRE2 made a similar change at + release 10.30, and also has many build-time and run-time customizable limits. - 21. Unlike Perl, PCRE2 doesn't have character set modifiers and spe- - cially no way to set characters by context just like Perl's "/d". A + 21. 
Unlike Perl, PCRE2 doesn't have character set modifiers and spe- + cially no way to set characters by context just like Perl's "/d". A regular expression using PCRE2_UTF and PCRE2_UCP will use similar rules - to Perl's "/u"; something closer to "/a" could be selected by adding + to Perl's "/u"; something closer to "/a" could be selected by adding other PCRE2_EXTRA_ASCII* options on top. - 22. Some recursive patterns that Perl diagnoses as infinite recursions + 22. Some recursive patterns that Perl diagnoses as infinite recursions can be handled by PCRE2, either by the interpreter or the JIT. An exam- ple is /(?:|(?0)abcd)(?(R)|\z)/, which matches a sequence of any number of repeated "abcd" substrings at the end of the subject. @@ -5321,14 +5324,14 @@ AUTHOR REVISION - Last updated: 12 August 2024 + Last updated: 30 August 2024 Copyright (c) 1997-2024 University of Cambridge. -PCRE2 10.45 12 August 2024 PCRE2COMPAT(3) +PCRE2 10.45 30 August 2024 PCRE2COMPAT(3) ------------------------------------------------------------------------------ - - + + PCRE2JIT(3) Library Functions Manual PCRE2JIT(3) @@ -5384,11 +5387,12 @@ AVAILABILITY OF JIT SUPPORT be used. As of release 10.45 there is a more informative way to test for JIT - support. If pcre2_compile_jit() is called with a NULL first argument - and the single option PCRE2_JIT_TEST_ALLOC, it returns zero if JIT is - available and has a working allocator. Otherwise it returns PCRE2_ER- - ROR_NOMEMORY if JIT is available but cannot allocate executable memory, - or PCRE2_ERROR_NULL if JIT support is not compiled. + support. If pcre2_compile_jit() is called with the single option + PCRE2_JIT_TEST_ALLOC it returns zero if JIT is available and has a + working allocator. Otherwise it returns PCRE2_ERROR_NOMEMORY if JIT is + available but cannot allocate executable memory, or PCRE2_ERROR_JIT_UN- + SUPPORTED if JIT support is not compiled. The code argument is ignored, + so it can be a NULL value. 
A simple program does not need to check availability in order to use JIT when possible. The API is implemented in a way that falls back to @@ -5781,8 +5785,8 @@ REVISION PCRE2 10.45 23 July 2024 PCRE2JIT(3) ------------------------------------------------------------------------------ - - + + PCRE2LIMITS(3) Library Functions Manual PCRE2LIMITS(3) @@ -5864,8 +5868,8 @@ REVISION PCRE2 10.43 1 August 2023 PCRE2LIMITS(3) ------------------------------------------------------------------------------ - - + + PCRE2MATCHING(3) Library Functions Manual PCRE2MATCHING(3) @@ -5879,7 +5883,7 @@ PCRE2 MATCHING ALGORITHMS in PCRE2 for matching a compiled regular expression against a given subject string. The "standard" algorithm is the one provided by the pcre2_match() function. This works in the same as Perl's matching func- - tion, and provide a Perl-compatible matching operation. The just-in- + tion, and provides a Perl-compatible matching operation. The just-in- time (JIT) optimization that is described in the pcre2jit documentation is compatible with this function. @@ -5891,7 +5895,7 @@ PCRE2 MATCHING ALGORITHMS When there is only one possible way in which a given subject string can match a pattern, the two algorithms give the same answer. A difference arises, however, when there are multiple possibilities. For example, if - the pattern + the anchored pattern ^<.*> @@ -5967,83 +5971,86 @@ THE ALTERNATIVE MATCHING ALGORITHM first match (which is necessarily the shortest) is found. Note that the size of vector needed to contain all the results depends - on the number of simultaneous matches, not on the number of parentheses - in the pattern. Using pcre2_match_data_create_from_pattern() to create - the match data block is therefore not advisable when doing DFA match- - ing. + on the number of simultaneous matches, not on the number of capturing + parentheses in the pattern. 
Using pcre2_match_data_create_from_pat- + tern() to create the match data block is therefore not advisable when + doing DFA matching. - Note also that all the matches that are found start at the same point + Note also that all the matches that are found start at the same point in the subject. If the pattern cat(er(pillar)?)? - is matched against the string "the caterpillar catchment", the result - is the three strings "caterpillar", "cater", and "cat" that start at - the fifth character of the subject. The algorithm does not automati- + is matched against the string "the caterpillar catchment", the result + is the three strings "caterpillar", "cater", and "cat" that start at + the fifth character of the subject. The algorithm does not automati- cally move on to find matches that start at later positions. PCRE2's "auto-possessification" optimization usually applies to charac- - ter repeats at the end of a pattern (as well as internally). For exam- + ter repeats at the end of a pattern (as well as internally). For exam- ple, the pattern "a\d+" is compiled as if it were "a\d++" because there - is no point even considering the possibility of backtracking into the - repeated digits. For DFA matching, this means that only one possible - match is found. If you really do want multiple matches in such cases, - either use an ungreedy repeat ("a\d+?") or set the PCRE2_NO_AUTO_POS- + is no point even considering the possibility of backtracking into the + repeated digits. For DFA matching, this means that only one possible + match is found. If you really do want multiple matches in such cases, + either use an ungreedy repeat ("a\d+?") or set the PCRE2_NO_AUTO_POS- SESS option when compiling. - There are a number of features of PCRE2 regular expressions that are - not supported or behave differently in the alternative matching func- + There are a number of features of PCRE2 regular expressions that are + not supported or behave differently in the alternative matching func- tion. 
Those that are not supported cause an error if encountered. - 1. Because the algorithm finds all possible matches, the greedy or un- - greedy nature of repetition quantifiers is not relevant (though it may - affect auto-possessification, as just described). During matching, - greedy and ungreedy quantifiers are treated in exactly the same way. + 1. Because the algorithm finds all possible matches, the greedy or un- + greedy nature of repetition quantifiers is not relevant (though it may + affect auto-possessification, as just described). During matching, + greedy and ungreedy quantifiers are treated in exactly the same way. However, possessive quantifiers can make a difference when what follows - could also match what is quantified, for example in a pattern like + could also match what is quantified, for example in a pattern like this: ^a++\w! - This pattern matches "aaab!" but not "aaa!", which would be matched by - a non-possessive quantifier. Similarly, if an atomic group is present, - it is matched as if it were a standalone pattern at the current point, - and the longest match is then "locked in" for the rest of the overall + This pattern matches "aaab!" but not "aaa!", which would be matched by + a non-possessive quantifier. Similarly, if an atomic group is present, + it is matched as if it were a standalone pattern at the current point, + and the longest match is then "locked in" for the rest of the overall pattern. 2. When dealing with multiple paths through the tree simultaneously, it - is not straightforward to keep track of captured substrings for the - different matching possibilities, and PCRE2's implementation of this + is not straightforward to keep track of captured substrings for the + different matching possibilities, and PCRE2's implementation of this algorithm does not attempt to do this. This means that no captured sub- strings are available. - 3. Because no substrings are captured, backreferences within the pat- - tern are not supported. + 3. 
Because no substrings are captured, a number of related features are + not available: - 4. For the same reason, conditional expressions that use a backrefer- - ence as the condition or test for a specific group recursion are not - supported. + (a) Backreferences; - 5. Again for the same reason, script runs are not supported. + (b) Conditional expressions that use a backreference as the condition + or test for a specific group recursion; - 6. Because many paths through the tree may be active, the \K escape se- - quence, which resets the start of the match when encountered (but may + (c) Script runs; + + (d) Scan substring assertions. + + 4. Because many paths through the tree may be active, the \K escape se- + quence, which resets the start of the match when encountered (but may be on some paths and not on others), is not supported. - 7. Callouts are supported, but the value of the capture_top field is + 5. Callouts are supported, but the value of the capture_top field is always 1, and the value of the capture_last field is always 0. - 8. The \C escape sequence, which (in the standard algorithm) always - matches a single code unit, even in a UTF mode, is not supported in - these modes, because the alternative algorithm moves through the sub- - ject string one character (not code unit) at a time, for all active - paths through the tree. + 6. The \C escape sequence, which (in the standard algorithm) always + matches a single code unit, even in a UTF mode, is not supported in UTF + modes because the alternative algorithm moves through the subject + string one character (not code unit) at a time, for all active paths + through the tree. - 9. Except for (*FAIL), the backtracking control verbs such as (*PRUNE) + 7. Except for (*FAIL), the backtracking control verbs such as (*PRUNE) are not supported. (*FAIL) is supported, and behaves like a failing negative assertion. - 10. The PCRE2_MATCH_INVALID_UTF option for pcre2_compile() is not sup- + 8. 
The PCRE2_MATCH_INVALID_UTF option for pcre2_compile() is not sup- ported by pcre2_dfa_match(). @@ -6068,13 +6075,15 @@ DISADVANTAGES OF THE ALTERNATIVE ALGORITHM partly because it has to search for all possible matches, but is also because it is less susceptible to optimization. - 2. Capturing parentheses, backreferences, script runs, and matching - within invalid UTF string are not supported. + 2. Capturing parentheses and other features such as backreferences that + rely on them are not supported. - 3. Although atomic groups are supported, their use does not provide the + 3. Matching within invalid UTF strings is not supported. + + 4. Although atomic groups are supported, their use does not provide the performance advantage that it does for the standard algorithm. - 4. JIT optimization is not supported. + 5. JIT optimization is not supported. AUTHOR @@ -6086,14 +6095,14 @@ AUTHOR REVISION - Last updated: 19 January 2024 + Last updated: 30 August 2024 Copyright (c) 1997-2024 University of Cambridge. -PCRE2 10.43 19 January 2024 PCRE2MATCHING(3) +PCRE2 10.45 30 August 2024 PCRE2MATCHING(3) ------------------------------------------------------------------------------ - - + + PCRE2PARTIAL(3) Library Functions Manual PCRE2PARTIAL(3) @@ -6475,8 +6484,8 @@ REVISION PCRE2 10.34 04 September 2019 PCRE2PARTIAL(3) ------------------------------------------------------------------------------ - - + + PCRE2PATTERN(3) Library Functions Manual PCRE2PATTERN(3) @@ -8693,19 +8702,25 @@ BACKREFERENCES ASSERTIONS - An assertion is a test on the characters following or preceding the - current matching point that does not consume any characters. The simple - assertions coded as \b, \B, \A, \G, \Z, \z, ^ and $ are described - above. + An assertion is a test that does not consume any characters. The test + must succeed for the match to continue. The simple assertions coded as + \b, \B, \A, \G, \Z, \z, ^ and $ are described above. 
- More complicated assertions are coded as parenthesized groups. There - are two kinds: those that look ahead of the current position in the - subject string, and those that look behind it, and in each case an as- - sertion may be positive (must match for the assertion to be true) or - negative (must not match for the assertion to be true). An assertion - group is matched in the normal way, and if it is true, matching contin- - ues after it, but with the matching position in the subject string re- - set to what it was before the assertion was processed. + More complicated assertions are coded as parenthesized groups. If + matching such a group succeeds, matching continues after it, but with + the matching position in the subject string reset to what it was before + the assertion was processed. + + A special kind of assertion, called a "scan substring" assertion, + matches a subpattern against a previously captured substring. This is + described in the section entitled "Scan substring assertions" below. It + is a PCRE2 extension, not compatible with Perl. + + The other group-based assertions are of two kinds: those that look ahead + of the current position in the subject string, and those that look be- + hind it, and in each case an assertion may be positive (must match for + the assertion to be true) or negative (must not match for the assertion + to be true). The Perl-compatible lookaround assertions are atomic. If an assertion is true, but there is a subsequent matching failure, there is no back- @@ -8972,6 +8987,61 @@ NON-ATOMIC ASSERTIONS groups (see below) must be atomic. +SCAN SUBSTRING ASSERTIONS + + A special kind of assertion, not compatible with Perl, makes it possi- + ble to check the contents of a captured substring by matching it with a + subpattern. Because this involves capturing, this feature is not sup- + ported by pcre2_dfa_match().
+ + A scan substring assertion starts with the sequence (*scan_substring: + or (*scs: which is followed by a substring number (absolute or rela- + tive) or a substring name enclosed in single quotes or angle brackets, + within parentheses. The rest of the item is the subpattern that is ap- + plied to the substring, as shown here: + + (*scan_substring:(1)...) + (*scs:(-2)...) + (*scs:('AB')...) + + The pattern match on the substring is always anchored, that is, it must + match from the start of the substring. There is no "bumpalong" if it + does not match at the start. The end of the subject is temporarily re- + set to be the end of the substring, so \Z, \z, and $ will match there. + However, the start of the subject is not reset. This means that ^ + matches only if the substring is actually at the start of the main sub- + ject, but it also means that lookbehind assertions into what precedes + the substring are possible. + + Here is a very simple example: find a word that contains the rare (in + English) sequence of letters "rh" not at the start: + + \b(\w++)(*scs:(1).+rh) + + The first group captures a word which is then scanned by the second + group. This example does not actually need this heavyweight feature; + the same match can be achieved with: + + \b\w+?rh\w*\b + + When things are more complicated, however, this feature can be useful. + For example, there is a rather complicated pattern in the PCRE2 test + data that checks an entire subject string for a palindrome, that is, that the + sequence of letters is the same in both directions. Suppose you want to + search for individual words of two or more characters such as "level" + that are palindromes: + + (\b\w{2,}+\b)(*scs:(1)...palindrome-matching-pattern...) + + A scan substring assertion fails if the group it references has not + been set. Within the subpattern, references to other groups work as + normal.
Other capturing groups may appear, and will retain their values + during ongoing matching if the assertion succeeds. When PCRE2_DUPNAMES + is set and there are ambiguous group names, if the assertion scans a + group by name, it is the lowest numbered set group of the groups with + that name that is scanned. + + SCRIPT RUNS In concept, a script run is a sequence of characters that are all from @@ -10061,14 +10131,14 @@ AUTHOR REVISION - Last updated: 12 August 2024 + Last updated: 30 August 2024 Copyright (c) 1997-2024 University of Cambridge. -PCRE2 10.45 12 August 2024 PCRE2PATTERN(3) +PCRE2 10.45 30 August 2024 PCRE2PATTERN(3) ------------------------------------------------------------------------------ - - + + PCRE2PERFORM(3) Library Functions Manual PCRE2PERFORM(3) @@ -10322,8 +10392,8 @@ REVISION PCRE2 10.41 27 July 2022 PCRE2PERFORM(3) ------------------------------------------------------------------------------ - - + + PCRE2POSIX(3) Library Functions Manual PCRE2POSIX(3) @@ -10680,8 +10750,8 @@ REVISION PCRE2 10.43 19 January 2024 PCRE2POSIX(3) ------------------------------------------------------------------------------ - - + + PCRE2SAMPLE(3) Library Functions Manual PCRE2SAMPLE(3) @@ -10964,8 +11034,8 @@ REVISION PCRE2 10.32 27 June 2018 PCRE2SERIALIZE(3) ------------------------------------------------------------------------------ - - + + PCRE2SYNTAX(3) Library Functions Manual PCRE2SYNTAX(3) @@ -11429,6 +11499,18 @@ NON-ATOMIC LOOKAROUND ASSERTIONS (*non_atomic_positive_lookbehind:...) ) +SUBSTRING SCAN ASSERTION + This feature is not Perl-compatible. + + (*scs:(n)...) scan substring by absolute reference + (*scs:(+n)...) scan substring by relative reference + (*scs:(-n)...) scan substring by relative reference + (*scs:(<name>)...) scan substring by name + (*scs:('name')...) scan substring by name + + The full name "scan_substring" may be used instead of "scs". + + SCRIPT RUNS (*script_run:...)
) script run, can be backtracked into @@ -11550,14 +11632,14 @@ AUTHOR REVISION - Last updated: 12 August 2024 + Last updated: 30 August 2024 Copyright (c) 1997-2024 University of Cambridge. -PCRE2 10.45 12 August 2024 PCRE2SYNTAX(3) +PCRE2 10.45 30 August 2024 PCRE2SYNTAX(3) ------------------------------------------------------------------------------ - - + + PCRE2UNICODE(3) Library Functions Manual PCRE2UNICODE(3) @@ -12026,5 +12108,5 @@ REVISION PCRE2 10.45 22 July 2024 PCRE2UNICODE(3) ------------------------------------------------------------------------------ - - + + diff --git a/doc/pcre2compat.3 b/doc/pcre2compat.3 index 145cec50..360d2f63 100644 --- a/doc/pcre2compat.3 +++ b/doc/pcre2compat.3 @@ -1,4 +1,4 @@ -.TH PCRE2COMPAT 3 "12 August 2024" "PCRE2 10.45" +.TH PCRE2COMPAT 3 "30 August 2024" "PCRE2 10.45" .SH NAME PCRE2 - Perl-compatible regular expressions (revised API) .SH "DIFFERENCES BETWEEN PCRE2 AND PERL" @@ -201,6 +201,9 @@ and condition references such as (?(4)...). PCRE2 supports relative group numbers such as +2 and -4 in all three cases. Perl supports both plus and minus for subroutine calls, but only minus for back references, and no relative numbering at all for conditions. +.sp +(m) The scan substring assertion (syntax (*scs:(n)...)) is a PCRE2 extension +that is not available in Perl. .P 20. Perl has different limits than PCRE2. See the .\" HREF @@ -236,6 +239,6 @@ Cambridge, England. .rs .sp .nf -Last updated: 12 August 2024 +Last updated: 30 August 2024 Copyright (c) 1997-2024 University of Cambridge. .fi diff --git a/doc/pcre2demo.3 b/doc/pcre2demo.3 index 4dcf77d5..ace051f5 100644 --- a/doc/pcre2demo.3 +++ b/doc/pcre2demo.3 @@ -1,4 +1,4 @@ -.TH PCRE2DEMO 3 "12 August 2024" "PCRE2 10.44" +.TH PCRE2DEMO 3 "30 August 2024" "PCRE2 10.44" .\"AUTOMATICALLY GENERATED BY PrepareRelease - do not EDIT! 
.SH NAME PCRE2DEMO - A demonstration C program for PCRE2 diff --git a/doc/pcre2matching.3 b/doc/pcre2matching.3 index 96800eff..7a203e94 100644 --- a/doc/pcre2matching.3 +++ b/doc/pcre2matching.3 @@ -1,4 +1,4 @@ -.TH PCRE2MATCHING 3 "19 January 2024" "PCRE2 10.43" +.TH PCRE2MATCHING 3 "30 August 2024" "PCRE2 10.45" .SH NAME PCRE2 - Perl-compatible regular expressions (revised API) .SH "PCRE2 MATCHING ALGORITHMS" @@ -7,7 +7,7 @@ PCRE2 - Perl-compatible regular expressions (revised API) This document describes the two different algorithms that are available in PCRE2 for matching a compiled regular expression against a given subject string. The "standard" algorithm is the one provided by the \fBpcre2_match()\fP -function. This works in the same as Perl's matching function, and provide a +function. This works in the same way as Perl's matching function, and provides a Perl-compatible matching operation. The just-in-time (JIT) optimization that is described in the .\" HREF @@ -22,7 +22,7 @@ these are described below. .P When there is only one possible way in which a given subject string can match a pattern, the two algorithms give the same answer. A difference arises, however, -when there are multiple possibilities. For example, if the pattern +when there are multiple possibilities. For example, if the anchored pattern .sp ^<.*> .sp @@ -96,9 +96,9 @@ the output vector in decreasing order of length. There is an option to stop the algorithm after the first match (which is necessarily the shortest) is found. .P Note that the size of vector needed to contain all the results depends on the -number of simultaneous matches, not on the number of parentheses in the -pattern. Using \fBpcre2_match_data_create_from_pattern()\fP to create the match -data block is therefore not advisable when doing DFA matching. +number of simultaneous matches, not on the number of capturing parentheses in +the pattern.
Using \fBpcre2_match_data_create_from_pattern()\fP to create the +match data block is therefore not advisable when doing DFA matching. .P Note also that all the matches that are found start at the same point in the subject. If the pattern @@ -141,30 +141,34 @@ straightforward to keep track of captured substrings for the different matching possibilities, and PCRE2's implementation of this algorithm does not attempt to do this. This means that no captured substrings are available. .P -3. Because no substrings are captured, backreferences within the pattern are -not supported. +3. Because no substrings are captured, a number of related features are not +available: +.sp +(a) Backreferences; +.sp +(b) Conditional expressions that use a backreference as the condition or test +for a specific group recursion; +.sp +(c) Script runs; +.sp +(d) Scan substring assertions. .P -4. For the same reason, conditional expressions that use a backreference as the -condition or test for a specific group recursion are not supported. -.P -5. Again for the same reason, script runs are not supported. -.P -6. Because many paths through the tree may be active, the \eK escape sequence, +4. Because many paths through the tree may be active, the \eK escape sequence, which resets the start of the match when encountered (but may be on some paths and not on others), is not supported. .P -7. Callouts are supported, but the value of the \fIcapture_top\fP field is +5. Callouts are supported, but the value of the \fIcapture_top\fP field is always 1, and the value of the \fIcapture_last\fP field is always 0. .P -8. The \eC escape sequence, which (in the standard algorithm) always matches a -single code unit, even in a UTF mode, is not supported in these modes, because +6. 
The \eC escape sequence, which (in the standard algorithm) always matches a +single code unit, even in a UTF mode, is not supported in UTF modes because the alternative algorithm moves through the subject string one character (not code unit) at a time, for all active paths through the tree. .P -9. Except for (*FAIL), the backtracking control verbs such as (*PRUNE) are not +7. Except for (*FAIL), the backtracking control verbs such as (*PRUNE) are not supported. (*FAIL) is supported, and behaves like a failing negative assertion. .P -10. The PCRE2_MATCH_INVALID_UTF option for \fBpcre2_compile()\fP is not +8. The PCRE2_MATCH_INVALID_UTF option for \fBpcre2_compile()\fP is not supported by \fBpcre2_dfa_match()\fP. . . @@ -194,13 +198,15 @@ The alternative algorithm suffers from a number of disadvantages: because it has to search for all possible matches, but is also because it is less susceptible to optimization. .P -2. Capturing parentheses, backreferences, script runs, and matching within -invalid UTF string are not supported. +2. Capturing parentheses and other features such as backreferences that rely on +them are not supported. .P -3. Although atomic groups are supported, their use does not provide the +3. Matching within invalid UTF strings is not supported. +.P +4. Although atomic groups are supported, their use does not provide the performance advantage that it does for the standard algorithm. .P -4. JIT optimization is not supported. +5. JIT optimization is not supported. . . .SH AUTHOR @@ -217,6 +223,6 @@ Cambridge, England. .rs .sp .nf -Last updated: 19 January 2024 +Last updated: 30 August 2024 Copyright (c) 1997-2024 University of Cambridge. 
.fi diff --git a/doc/pcre2pattern.3 b/doc/pcre2pattern.3 index ed05a6e6..5fdb47a0 100644 --- a/doc/pcre2pattern.3 +++ b/doc/pcre2pattern.3 @@ -1,4 +1,4 @@ -.TH PCRE2PATTERN 3 "12 August 2024" "PCRE2 10.45" +.TH PCRE2PATTERN 3 "30 August 2024" "PCRE2 10.45" .SH NAME PCRE2 - Perl-compatible regular expressions (revised API) .SH "PCRE2 REGULAR EXPRESSION DETAILS" @@ -382,11 +382,11 @@ the pattern (that is, \eE is assumed at the end). If the isolated \eQ is inside a character class, this causes an error, because the character class is then not terminated by a closing square bracket. .P -Another difference from Perl is that any appearance of \eQ or \eE inside what -might otherwise be a quantifier causes PCRE2 not to recognize the sequence as a -quantifier. Perl recognizes a quantifier if (redundantly) either of the numbers -is inside \eQ...\eE, but not if the separating comma is. When not recognized as -a quantifier a sequence such as {\eQ1\eE,2} is treated as the literal string +Another difference from Perl is that any appearance of \eQ or \eE inside what +might otherwise be a quantifier causes PCRE2 not to recognize the sequence as a +quantifier. Perl recognizes a quantifier if (redundantly) either of the numbers +is inside \eQ...\eE, but not if the separating comma is. When not recognized as +a quantifier a sequence such as {\eQ1\eE,2} is treated as the literal string "{1,2}". . . @@ -845,7 +845,7 @@ examples are all equivalent: .sp \ep{bidiclass=al} \ep{BC=al} - \ep{ Bidi_Class : AL } + \ep{ Bidi_Class : AL } \ep{ Bi-di class = Al } \eP{ ^ Bi-di class = Al } .sp @@ -1088,7 +1088,7 @@ explicitly. These properties are: .sp Xan matches characters that have either the L (letter) or the N (number) property. Xps matches the characters tab, linefeed, vertical tab, form feed, or -carriage return, and any other character that has the Z (separator) property +carriage return, and any other character that has the Z (separator) property (this includes the space character). 
Xsp is the same as Xps; in PCRE1 it used to exclude vertical tab, for Perl compatibility, but Perl changed. Xwd matches the same characters as Xan, plus those that match Mn (non-spacing mark) or Pc @@ -2419,22 +2419,32 @@ as normal. .SH ASSERTIONS .rs .sp -An assertion is a test on the characters following or preceding the current -matching point that does not consume any characters. The simple assertions -coded as \eb, \eB, \eA, \eG, \eZ, \ez, ^ and $ are described +An assertion is a test that does not consume any characters. The test must +succeed for the match to continue. The simple assertions coded as \eb, \eB, +\eA, \eG, \eZ, \ez, ^ and $ are described .\" HTML .\" above. .\" .P -More complicated assertions are coded as parenthesized groups. There are two -kinds: those that look ahead of the current position in the subject string, and -those that look behind it, and in each case an assertion may be positive (must -match for the assertion to be true) or negative (must not match for the -assertion to be true). An assertion group is matched in the normal way, -and if it is true, matching continues after it, but with the matching position +More complicated assertions are coded as parenthesized groups. If matching such +a group succeeds, matching continues after it, but with the matching position in the subject string reset to what it was before the assertion was processed. .P +A special kind of assertion, called a "scan substring" assertion, matches a +subpattern against a previously captured substring. This is described in the +section entitled +.\" HTML +.\" +"Scan substring assertions" +.\" +below. It is a PCRE2 extension, not compatible with Perl. +.P +The other group-based assertions are of two kinds: those that look ahead of the +current position in the subject string, and those that look behind it, and in +each case an assertion may be positive (must match for the assertion to be +true) or negative (must not match for the assertion to be true).
+.P The Perl-compatible lookaround assertions are atomic. If an assertion is true, but there is a subsequent matching failure, there is no backtracking into the assertion. However, there are some cases where non-atomic assertions can be @@ -2726,6 +2736,61 @@ conditional groups (see below) must be atomic. . . +.\" HTML +.SH "SCAN SUBSTRING ASSERTIONS" +.rs +.sp +A special kind of assertion, not compatible with Perl, makes it possible to +check the contents of a captured substring by matching it with a subpattern. +Because this involves capturing, this feature is not supported by +\fBpcre2_dfa_match()\fP. +.P +A scan substring assertion starts with the sequence (*scan_substring: or +(*scs: which is followed by a substring number (absolute or relative) or a +substring name enclosed in single quotes or angle brackets, within parentheses. +The rest of the item is the subpattern that is applied to the substring, as +shown here: +.sp + (*scan_substring:(1)...) + (*scs:(-2)...) + (*scs:('AB')...) +.sp +The pattern match on the substring is always anchored, that is, it must match +from the start of the substring. There is no "bumpalong" if it does not match +at the start. The end of the subject is temporarily reset to be the end of the +substring, so \eZ, \ez, and $ will match there. However, the start of the +subject is \fInot\fP reset. This means that ^ matches only if the substring is +actually at the start of the main subject, but it also means that lookbehind +assertions into what precedes the substring are possible. +.P +Here is a very simple example: find a word that contains the rare (in English) +sequence of letters "rh" not at the start: +.sp + \eb(\ew++)(*scs:(1).+rh) +.sp +The first group captures a word which is then scanned by the second group. +This example does not actually need this heavyweight feature; the same match +can be achieved with: +.sp + \eb\ew+?rh\ew*\eb +.sp +When things are more complicated, however, this feature can be useful. 
For +example, there is a rather complicated pattern in the PCRE2 test data that +checks an entire subject string for a palindrome, that is, that the sequence of +letters is the same in both directions. Suppose you want to search for +individual words of two or more characters such as "level" that are +palindromes: +.sp + (\eb\ew{2,}+\eb)(*scs:(1)...palindrome-matching-pattern...) +.sp +A scan substring assertion fails if the group it references has not been set. +Within the subpattern, references to other groups work as normal. Other +capturing groups may appear, and will retain their values during ongoing +matching if the assertion succeeds. When PCRE2_DUPNAMES is set and there are +ambiguous group names, if the assertion scans a group by name, it is the lowest +numbered set group of the groups with that name that is scanned. +. +. .SH "SCRIPT RUNS" .rs .sp @@ -3916,6 +3981,6 @@ Cambridge, England. .rs .sp .nf -Last updated: 12 August 2024 +Last updated: 30 August 2024 Copyright (c) 1997-2024 University of Cambridge. .fi diff --git a/doc/pcre2syntax.3 b/doc/pcre2syntax.3 index d69e0508..dee18b0c 100644 --- a/doc/pcre2syntax.3 +++ b/doc/pcre2syntax.3 @@ -1,4 +1,4 @@ -.TH PCRE2SYNTAX 3 "12 August 2024" "PCRE2 10.45" +.TH PCRE2SYNTAX 3 "30 August 2024" "PCRE2 10.45" .SH NAME PCRE2 - Perl-compatible regular expressions (revised API) .SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY" @@ -19,7 +19,7 @@ documentation. This document contains a quick-reference summary of the syntax. \eQ...\eE treat enclosed characters as literal .sp Note that white space inside \eQ...\eE is always treated as literal, even if -PCRE2_EXTENDED is set, causing most other white space to be ignored. Note also +PCRE2_EXTENDED is set, causing most other white space to be ignored. Note also that PCRE2's handling of \eQ...\eE has some differences from Perl's. See the .\" HREF \fBpcre2pattern\fP @@ -490,6 +490,19 @@ These assertions are specific to PCRE2 and are not Perl-compatible.
(*non_atomic_positive_lookbehind:...) ) . . +.SH "SUBSTRING SCAN ASSERTION" +.rs +This feature is not Perl-compatible. +.sp + (*scs:(n)...) scan substring by absolute reference + (*scs:(+n)...) scan substring by relative reference + (*scs:(-n)...) scan substring by relative reference + (*scs:(<name>)...) scan substring by name + (*scs:('name')...) scan substring by name +.sp +The full name "scan_substring" may be used instead of "scs". +. +. .SH "SCRIPT RUNS" .rs .sp @@ -622,6 +635,6 @@ Cambridge, England. .rs .sp .nf -Last updated: 12 August 2024 +Last updated: 30 August 2024 Copyright (c) 1997-2024 University of Cambridge. .fi diff --git a/src/pcre2_match.c b/src/pcre2_match.c index a363fdfd..b54dfdcf 100644 --- a/src/pcre2_match.c +++ b/src/pcre2_match.c @@ -5645,12 +5645,12 @@ fprintf(stderr, "++ %2ld op=%3d %s\n", Fecode - mb->start_code, *Fecode, Lsaved_end_subject = mb->end_subject; Ltrue_end_extra = mb->true_end_subject - mb->end_subject; Lsaved_eptr = Feptr; - Lsaved_moptions = mb->moptions; + Lsaved_moptions = mb->moptions; Feptr = mb->start_subject + Fovector[offset]; mb->true_end_subject = mb->end_subject = mb->start_subject + Fovector[offset + 1]; - mb->moptions &= ~PCRE2_NOTEOL; + mb->moptions &= ~PCRE2_NOTEOL; Lframe_type = GF_NOCAPTURE | Fop; for (;;) diff --git a/src/pcre2_match_data.c b/src/pcre2_match_data.c index 5206f881..100e7c9d 100644 --- a/src/pcre2_match_data.c +++ b/src/pcre2_match_data.c @@ -77,8 +77,8 @@ return yield; * Create a match data block using pattern data * *************************************************/ -/* If no context is supplied, use the memory allocator from the code. This code -assumes that a general context contains nothing other than a memory allocator. +/* If no context is supplied, use the memory allocator from the code. This code +assumes that a general context contains nothing other than a memory allocator. If that ever changes, this code will need fixing.
*/ PCRE2_EXP_DEFN pcre2_match_data * PCRE2_CALL_CONVENTION diff --git a/src/pcre2_pattern_info.c b/src/pcre2_pattern_info.c index 75cbb104..28c780d8 100644 --- a/src/pcre2_pattern_info.c +++ b/src/pcre2_pattern_info.c @@ -230,7 +230,7 @@ switch(what) break; case PCRE2_INFO_NAMETABLE: - *((PCRE2_SPTR *)where) = (PCRE2_SPTR)((const char *)re + + *((PCRE2_SPTR *)where) = (PCRE2_SPTR)((const char *)re + sizeof(pcre2_real_code)); break; diff --git a/src/pcre2_study.c b/src/pcre2_study.c index dcdf4c2a..f7da869e 100644 --- a/src/pcre2_study.c +++ b/src/pcre2_study.c @@ -1135,7 +1135,7 @@ do case OP_ASSERTBACK_NOT: case OP_ASSERT_NA: case OP_ASSERTBACK_NA: - case OP_ASSERT_SCS: + case OP_ASSERT_SCS: ncode += GET(ncode, 1); while (*ncode == OP_ALT) ncode += GET(ncode, 1); ncode += 1 + LINK_SIZE; @@ -1261,7 +1261,7 @@ do case OP_ASSERTBACK: case OP_ASSERTBACK_NOT: case OP_ASSERTBACK_NA: - case OP_ASSERT_SCS: + case OP_ASSERT_SCS: do tcode += GET(tcode, 1); while (*tcode == OP_ALT); tcode += 1 + LINK_SIZE; break;