Fixes #564
The previous API could not be extended to handle multi-character case rules, and accommodating them required a fair bit of reworking. I had to defer the casing transformations: the string to transform is now buffered up, and the callback then does an in-place transformation on the entire buffered input.
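To illustrate why a whole-buffer callback is needed, here is a minimal sketch of an in-place transformation that can grow the text. It is not the PCRE2 callback interface: the function name, its signature, and the choice of Latin-1 input (where sharp s is the single byte 0xDF and uppercases to "SS") are all assumptions made for the example.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical callback shape (not the PCRE2 API): uppercase buf[0..len)
   in place, where the buffer has room for cap bytes. Returns the length
   of the transformed text. If that exceeds cap, nothing is written and
   the caller can retry with a larger buffer. */
static size_t upper_latin1(unsigned char *buf, size_t len, size_t cap)
{
    size_t out = len;
    for (size_t i = 0; i < len; i++)
        if (buf[i] == 0xDF) out++;      /* sharp s expands to "SS" */
    if (out > cap) return out;

    /* Fill from the end, so expansion never overwrites unread input. */
    size_t w = out;
    for (size_t i = len; i-- > 0; ) {
        if (buf[i] == 0xDF) {
            buf[--w] = 'S';
            buf[--w] = 'S';
        } else {
            buf[--w] = (unsigned char)toupper(buf[i]);
        }
    }
    return out;
}
```

A per-character callback cannot express this kind of rule, because one input character produces two output characters; handing over the whole buffered string lets the callback report the size it needs.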
I reckon callers assume that the PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option calculates the entire memory requirement in one go, so that two calls are always sufficient (rather than needing to loop with a gradually increasing buffer size).
However, with a substitution callout this is not true. If you call once with PCRE2_SUBSTITUTE_OVERFLOW_LENGTH, the buffer length returned might still not be sufficient for the second call to succeed.
This is because the callout might not be invoked on the first call, whereas on the second call it is invoked and can affect control flow in a way that requires even more buffer space. This occurs even if the callout is completely stateless, idempotent and well-behaved.
This fix ensures that when we skip a callout (due to overflow), we still request enough buffer space for whichever option the callout might return.
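The idea can be sketched as a simplified model; this is not the PCRE2 code, and every name below is hypothetical. It assumes that, for each match, the callout chooses between keeping the expanded replacement and keeping the original matched text, so the sizing pass reserves the larger of the two whenever the callout is skipped.

```c
#include <stddef.h>

/* Hypothetical per-match sizes for the model. */
struct match_sizes {
    size_t matched_len;      /* length of the original matched text */
    size_t replacement_len;  /* length of the expanded replacement */
};

/* Sizing pass: the callout is not invoked, so reserve room for
   whichever outcome is larger. literal_len is the unmatched text
   copied through unchanged; +1 is for the trailing zero. */
static size_t estimate_needed(const struct match_sizes *m, size_t n,
                              size_t literal_len)
{
    size_t total = literal_len + 1;
    for (size_t i = 0; i < n; i++)
        total += m[i].replacement_len > m[i].matched_len
                     ? m[i].replacement_len
                     : m[i].matched_len;
    return total;
}

/* Real pass: a nonzero keep_original[i] models the callout declining
   the substitution for match i. */
static size_t actual_needed(const struct match_sizes *m,
                            const int *keep_original, size_t n,
                            size_t literal_len)
{
    size_t total = literal_len + 1;
    for (size_t i = 0; i < n; i++)
        total += keep_original[i] ? m[i].matched_len
                                  : m[i].replacement_len;
    return total;
}
```

In this model the estimate is an upper bound for every combination of callout decisions, so a second call with a buffer of that size cannot overflow. The cost is that the reported length may be larger than what is finally used.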
* Add details on new maintainership
* Remove checked-in autoconf outputs
* Sync & cleanup files with Detrail
* Add CI job for ensuring PrepareRelease is run
* Add Ubuntu-20.04 autoconf runner
* Make CMake installed files match autoconf
* Update acknowledgements
Avoid a crash introduced by recent changes to the substitute code, and clarify what the expected offset value should be when the provided buffer overflows.
---------
Co-authored-by: Carlo Marcelo Arenas Belón <carenas@gmail.com>
I haven't tackled any controversial steps in this PR - simply tidying the formatting.
I have used the `gersemi` tool, which simply "does its thing". I have additionally renamed a few variables to match standard casing conventions (but I am aware that some lowercased variables are used, for example in package-config files, and have left those alone).
* Move some existing character class code into pcre2_compile_class.c
* Add a new flag PCRE2_ALT_EXTENDED_CLASS to change the behaviour of
  parsing [...] character classes, to emit new META codes, and new
  OP_ECLASS codes for nested character classes with operators
* Document the behaviour relative to the UTS#18 standard
* No JIT support; it falls back to the interpreter. DFA is supported.
New flag: PCRE2_EXTRA_TURKISH_CASING, and pre-pattern flag
(*TURKISH_CASING).
Also added a pre-pattern flag (*CASELESS_RESTRICT) for the existing
PCRE2_EXTRA_CASELESS_RESTRICT flag.
Perl does not use $0 anymore to refer to the text of the matched subject
and `pcre2_substitute()` was recently updated to also provide that value
using the variable Perl prefers: `$&`.
In a similar context, either as part of the formatted output from a match
or during the processing of a callback, teach pcre2grep to also populate
$&.
While at it, update the ChangeLog with recent changes.
It is anticipated that over time, more and more optimizations will be
added to PCRE2, and we want to be able to switch optimizations off/on,
both for testing purposes and to be able to work around bugs in a
released library version.
The number of free bits left in the compile options word is very small.
Hence, we will start putting all optimization enable/disable flags in
a separate word. To switch these off/on, the new API function
pcre2_set_optimization() will be used.
The values which can be passed to pcre2_set_optimization() are
different from the internal flag bit values. The values accepted by
pcre2_set_optimization() are contiguous integers, so there is no
danger of ever running out of them. This means in the future, the
internal representation can be changed at any time without breaking
backwards compatibility. Further, the 'directives' passed to
pcre2_set_optimization() are not restricted to control a single,
specific optimization. As an example, passing PCRE2_OPTIMIZATION_FULL
will turn on all optimizations supported by whatever version of
PCRE2 the client program happens to be linked with.
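The separation between public directives and internal flag bits can be sketched as follows. This is an illustrative model only: the directive names, their values, and the flag bits are hypothetical and are not the definitions used by PCRE2.

```c
#include <stdint.h>

/* Hypothetical internal flag bits; free to change between releases. */
#define OPTIM_A   0x1u
#define OPTIM_B   0x2u
#define OPTIM_ALL (OPTIM_A | OPTIM_B)

/* Hypothetical public directives: contiguous integers, not bit masks. */
enum directive {
    DIR_NONE,   /* switch every optimization off */
    DIR_FULL,   /* switch on everything this build supports */
    DIR_A_ON,
    DIR_A_OFF,
    DIR_B_ON,
    DIR_B_OFF
};

/* Translate a directive into a change to the internal flag word.
   Returns 0 on success, or -1 for an unknown directive, in which
   case the flags are left untouched. */
static int apply_directive(uint32_t *flags, int directive)
{
    switch (directive) {
    case DIR_NONE:  *flags = 0;          return 0;
    case DIR_FULL:  *flags = OPTIM_ALL;  return 0;
    case DIR_A_ON:  *flags |= OPTIM_A;   return 0;
    case DIR_A_OFF: *flags &= ~OPTIM_A;  return 0;
    case DIR_B_ON:  *flags |= OPTIM_B;   return 0;
    case DIR_B_OFF: *flags &= ~OPTIM_B;  return 0;
    default:        return -1;
    }
}
```

Because callers only ever see the directive numbers, a later release could add a third flag bit or renumber the existing ones, and a directive such as DIR_FULL would pick up the new optimization without any change to client code.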
Co-authored-by: Carlo Marcelo Arenas Belón <carenas@gmail.com>
Co-authored-by: Zoltan Herczeg <hzmester@freemail.hu>
Historically, pcre2grep has done minor processing of the patterns that
were read through the `-f` option.
The end result is that some patterns give different results depending
on whether they were provided through `-e`, through `-f`, or as a
parameter on the command line.
Add a flag that skips that processing, so that the same pattern file
used with other grep implementations can be used directly with the
same result.