* Add a --colour flag to pcre2test to colourise the output.
* Comments from the input file are in grey (but not those entered
interactively)
* All other input is in green
* Messages related to PCRE2 API errors are in magenta
* Messages related to errors with using pcre2test itself are in red
* Timing and memory usage information is in blue
* Normal output is in your terminal's default foreground colour
---------
Co-authored-by: Nicholas Wilson <nicholas@nicholaswilson.me.uk>
* Check for pattern/subject/offset/option changes when using PCRE2_SUBSTITUTE_MATCHED.
* Return PCRE2_ERROR_DFA_UFUNC if using PCRE2_SUBSTITUTE_MATCHED after a call to
pcre2_dfa_match().
* Add new error codes to pcre2_substitute when using PCRE2_SUBSTITUTE_MATCHED.
* Change the behaviour of the matching methods so that the match_data fields are populated
on all matches with "(rc >= 0 || rc==NO_MATCH || rc==PARTIAL)". We had previously ensured that
every call to a match method sets the rc field on the match_data.
* Add modifiers to pcre2test to better exercise these pcre2_substitute conditions
---------
Co-authored-by: Isaac Oscar Gariano <IsaacOscar@live.com.au>
* The primary purpose of pcre2_next_match() is to make it much easier for
PCRE2 clients to iterate over matches, without needing an advanced knowledge
of regular expressions.
* Secondly, we can simplify our own code by merging the three duplicate
implementations of the /g global match behaviour: pcre2demo, pcre2_substitute,
and pcre2test.
* Thirdly, as I look closely at the issue, I can improve the documentation.
* Fourthly, I would like to actually simplify the logic, removing a complex loop
which makes several match attempts, swallows duplicate matches, and more.
We can have identical behaviour with a simple retry using
PCRE2_NOTEMPTY_ATSTART.
An additional testing argument, `-malloc`, is added to pcre2test and to RunTest.
The ManyConfig tests run this now in CI.
We exercise each malloc failure in the core code by counting how many mallocs are done, then repeating compilation and matching with a failure on each successive malloc.
The pcre2test utility needs quite a few changes to accommodate this.
It is simpler to add a new mode to it than to make it fully
EBCDIC-native. On an ASCII system, pcre2test performs ASCII I/O, but
translates the input when passing it to the fully-EBCDIC-supporting
library.
Fixes #564
The previous API was not extensible to handle multi-character case rules, and it required a fair bit of reworking to accommodate them. I had to defer the casing transformations by buffering up the string to transform, and then allowing the callback to do an in-place transformation on the entire buffered input.
* Move some existing character class code into pcre2_compile_class.c
* Add a new flag PCRE2_ALT_EXTENDED_CLASS to change the behaviour of
parsing [...] character classes, to emit new META codes, and new
OP_ECLASS codes for nested character classes with operators
* Document the behaviour relative to the UTS#18 standard
* No JIT support; it falls back to the interpreter. DFA is supported.
* pcre2test: tighten \N{U+hh...} support
When \N{U+hh...} was added it was meant to support all Unicode
characters that can be encoded by pcre2test and Perl, but its
use outside what is officially considered valid can be confusing
so print a warning for those cases.
* perltest: add support for hex modifier
The use of \xhh can be ambiguous when used together with the utf modifier,
so allow for describing code points individually in the pattern using hex,
with the same syntax that is already supported by pcre2test.
When providing escaped values in the subject, the syntax can be
ambiguous, so add support for a new escape that is always meant
to refer to a Unicode character and that is already supported
by the library in utf mode.
While at it, refactor the code to support octal escapes and fix
bugs with overlong numbers, as well as simplify the logic that
decides if an escape is encoded as a code unit or as a Unicode
character that could require multiple code units.
New flag: PCRE2_EXTRA_TURKISH_CASING, and pre-pattern flag
(*TURKISH_CASING).
Also added a pre-pattern flag (*CASELESS_RESTRICT) for this existing
flag.
Even though it is documented that invalid escapes will be reported,
the code would fall back in that case and result in a NUL being
generated whenever an incomplete \x{ escape was being parsed.
Refactor the code to report the error instead and fix the logic used
for overlong numbers so that the truncation doesn't result in an
unexpected value being used.
There was an old (from PCRE 4.0) test that was affected but which is
no longer relevant, because it could only be triggered with invalid
UTF (which isn't supported), and that was therefore removed as a
result.
Additionally, it was found that the same syntax error was affecting
perltest so correct that as well by reporting syntax errors in the
subject lines.
While at it, update the related documentation on Perl compatibility.
It is anticipated that over time, more and more optimizations will be
added to PCRE2, and we want to be able to switch optimizations off/on,
both for testing purposes and to be able to work around bugs in a
released library version.
The number of free bits left in the compile options word is very small.
Hence, we will start putting all optimization enable/disable flags in
a separate word. To switch these off/on, the new API function
pcre2_set_optimization() will be used.
The values which can be passed to pcre2_set_optimization() are
different from the internal flag bit values. The values accepted by
pcre2_set_optimization() are contiguous integers, so there is no
danger of ever running out of them. This means in the future, the
internal representation can be changed at any time without breaking
backwards compatibility. Further, the 'directives' passed to
pcre2_set_optimization() are not restricted to control a single,
specific optimization. As an example, passing PCRE2_OPTIMIZATION_FULL
will turn on all optimizations supported by whatever version of
PCRE2 the client program happens to be linked with.
Co-authored-by: Carlo Marcelo Arenas Belón <carenas@gmail.com>
Co-authored-by: Zoltan Herczeg <hzmester@freemail.hu>
Since a608946 (Additional PCRE2_EXTRA_ASCII_xxx code, 2023-02-01)
PCRE2_EXTRA_ASCII_BSD could be used to restrict \d to ASCII causing
the following inconsistent behaviour in UCP mode.
PCRE2 version 10.43-DEV 2023-01-15
re> /\d/utf,ucp,ascii_bsd
data> ٣
No match
data>
re> /[[:digit:]]/utf,ucp,ascii_bsd
data> ٣
0: \x{663}
It has been suggested[1] that the change to match \p{Nd} when Unicode
is enabled for [:digit:] might have been unintentional and a bug, as
[:digit:] should be able to be POSIX compatible, so add a new flag
PCRE2_EXTRA_ASCII_DIGIT to avoid changing its definition in UCP mode.
[1] https://lore.kernel.org/git/CANgJU+U+xXsh9psd0z5Xjr+Se5QgdKkjQ7LUQ-PdUULSN3n4+g@mail.gmail.com/
Since PCRE2 10.41, the match data contains a pointer to a vector of
frames that is allocated on the heap and used by pcre2_match()
when doing non-JIT matches.
There is, though, no outside visibility of its size, and therefore
the memory it uses is locked away until match_data itself is freed.
Add an API that allows getting that value, so an application could
decide based on its own experienced memory pressure to keep reusing
that match_data or not.
While at it, update the documentation of other related functions for
clarity.
* doc: fix incorrect use of JOIN and typo
Signed-off-by: Carlo Marcelo Arenas Belón <carenas@gmail.com>
* doc: reformat of pcre2_substitute to align options
includes some rewording to fit better in an 80 char wide troff output.
Signed-off-by: Carlo Marcelo Arenas Belón <carenas@gmail.com>
* doc: update names to pcre2