Additionally, I have attempted to clean up some CMake issues to make the
package's build interface cleaner. In particular, this avoids polluting the
parent directory's include path with our config.h file (if PCRE2 is being
included as a subdirectory).
This re-adds changes from Theodore's commit:
def175f4a9
and partially reverts changes from Carlo's commit:
92d56a1f7c
---------
Co-authored-by: Theodore Tsirpanis <teo@tsirpanis.gr>
* The primary purpose of pcre2_next_match() is to make it much easier for
PCRE2 clients to iterate over matches, without needing an advanced knowledge
of regular expressions.
* Secondly, we can simplify our own code by merging the three duplicate
implementations of the /g global match behaviour: pcre2demo, pcre2_substitute,
and pcre2test.
* Thirdly, as I look closely at the issue, I can improve the documentation.
* Fourthly, I would like to actually simplify the logic, removing a complex loop
which makes several match attempts, swallows duplicate matches, and more.
We can have identical behaviour with a simple retry using
PCRE2_NOTEMPTY_ATSTART.
We won't implement more advanced/alternative global replacement strategies, but we can at least write a few sentences explaining how to do it in application code.
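The retry using PCRE2_NOTEMPTY_ATSTART can be sketched in C. This is a minimal illustration and not the pcre2_next_match() implementation: toy_match() is a hypothetical stand-in for pcre2_match() that only understands the unanchored pattern a*, and TOY_NOTEMPTY_ATSTART is a hypothetical stand-in for PCRE2_NOTEMPTY_ATSTART. It assumes that each attempt resumes at the end of the previous match, and it ignores complications such as \K and multi-unit characters.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for PCRE2_NOTEMPTY_ATSTART. */
#define TOY_NOTEMPTY_ATSTART 1u

/* Toy unanchored matcher for the fixed pattern a*.
   Returns 1 and fills ovector on a match, or 0 if there is no match. */
static int toy_match(const char *subject, size_t length, size_t start,
  unsigned options, size_t ovector[2])
{
  for (size_t pos = start; pos <= length; pos++)
    {
    size_t end = pos;
    while (end < length && subject[end] == 'a') end++;

    /* Reject an empty match at the starting offset when asked to. */
    if (end == pos && pos == start && (options & TOY_NOTEMPTY_ATSTART) != 0)
      continue;

    ovector[0] = pos;
    ovector[1] = end;
    return 1;
    }
  return 0;
}

/* Collects up to max matches, /g style. After an empty match, the next
   attempt starts at the same offset but must not be empty there. */
static size_t toy_match_all(const char *subject, size_t ovectors[][2],
  size_t max)
{
  size_t length = strlen(subject);
  size_t start = 0;
  size_t count = 0;
  unsigned options = 0;
  size_t ov[2];

  while (count < max && toy_match(subject, length, start, options, ov))
    {
    ovectors[count][0] = ov[0];
    ovectors[count][1] = ov[1];
    count++;
    options = (ov[0] == ov[1]) ? TOY_NOTEMPTY_ATSTART : 0;
    start = ov[1];
    }
  return count;
}
```

For the subject "abaa", toy_match_all() reports four matches: "a" at offsets 0 to 1, an empty match at offset 1, "aa" at offsets 2 to 4, and an empty match at offset 4.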
Both the Autoconf and CMake build systems are updated to detect linker support for symbol versioning.
Currently, Linux, Solaris, and FreeBSD are tested and working. Windows (COFF) and macOS (Mach-O) have no symbol versioning.
There is an Autoconf/CMake flag to opt out of the versioning behaviour.
We have four files which have .c extensions, but which are actually #included rather than compiled as their own compilation units.
This goes against conventions - Autotools, CMake, and Bazel all assume that the .h/.c distinction indicates which files are compilation units.
pcre2_jit_match.c -> _inc.h
pcre2_jit_misc.c -> _inc.h
pcre2_printint.c -> _inc.h
pcre2_ucptables.c -> _inc.h
An additional testing argument, `-malloc`, is added to pcre2test and to RunTest.
The ManyConfig tests run this now in CI.
We exercise each malloc failure in the core code by counting how many mallocs are done, then repeating compilation and matching with a failure on each successive malloc.
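The counting approach can be sketched as follows. This is a self-contained illustration and not the pcre2test code: failing_allocator, fa_malloc(), and dup_pair() are hypothetical names, and dup_pair() merely stands in for the compile and match calls being exercised.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical allocator state. A fail_at of 0 means "never fail". */
typedef struct {
  unsigned long count;    /* allocations requested so far */
  unsigned long fail_at;  /* 1-based index of the allocation to fail */
} failing_allocator;

static void *fa_malloc(failing_allocator *fa, size_t size)
{
  fa->count++;
  if (fa->fail_at != 0 && fa->count == fa->fail_at) return NULL;
  return malloc(size);
}

/* Toy code under test: copies two strings, cleaning up on failure.
   Returns 0 on success or -1 if an allocation failed. */
static int dup_pair(failing_allocator *fa, const char *a, const char *b,
  char **out_a, char **out_b)
{
  char *x = fa_malloc(fa, strlen(a) + 1);
  if (x == NULL) return -1;
  char *y = fa_malloc(fa, strlen(b) + 1);
  if (y == NULL) { free(x); return -1; }
  strcpy(x, a);
  strcpy(y, b);
  *out_a = x;
  *out_b = y;
  return 0;
}

/* Counts the allocations in a successful run, then repeats the run once per
   allocation with that allocation failing. Returns the number of failure
   points exercised, or -1 if any run behaved unexpectedly. */
static long exercise_malloc_failures(void)
{
  failing_allocator fa = { 0, 0 };
  char *a = NULL;
  char *b = NULL;

  if (dup_pair(&fa, "foo", "bar", &a, &b) != 0) return -1;
  free(a);
  free(b);

  unsigned long total = fa.count;
  for (unsigned long i = 1; i <= total; i++)
    {
    fa.count = 0;
    fa.fail_at = i;
    if (dup_pair(&fa, "foo", "bar", &a, &b) != -1) return -1;
    }
  return (long)total;
}
```

Running such a loop under a leak checker or sanitizer is what makes it useful: each failure path has to release whatever was allocated before the failing call.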
Add a simple hash code for group names to improve search speed.
Ignore duplicates when group names are searched.
Improve finding of duplicates (they have the same name pointer).
Improve creating name table (duplicates are handled in one step).
Create a new file for name management.
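As an illustration of why a stored hash speeds up the search, here is a minimal sketch. It does not reproduce the code in this change: name_entry, name_hash(), make_entry(), and find_group() are hypothetical names, and the hash function is an arbitrary choice. The point is only that a cheap integer comparison filters out most entries before any memcmp().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical name table entry with a precomputed hash. */
typedef struct {
  const char *name;
  size_t length;
  uint16_t hash;
  int group_number;
} name_entry;

/* Arbitrary simple hash, folded to 16 bits (an assumption, not the
   function used by the library). */
static uint16_t name_hash(const char *name, size_t length)
{
  uint32_t h = 0;
  for (size_t i = 0; i < length; i++)
    h = h * 31u + (unsigned char)name[i];
  return (uint16_t)(h ^ (h >> 16));
}

static name_entry make_entry(const char *name, int group_number)
{
  name_entry e;
  e.name = name;
  e.length = strlen(name);
  e.hash = name_hash(name, e.length);
  e.group_number = group_number;
  return e;
}

/* Returns the group number of the first entry with this name, or -1.
   Later duplicates of the same name are never examined. */
static int find_group(const name_entry *table, size_t count,
  const char *name, size_t length)
{
  uint16_t h = name_hash(name, length);
  for (size_t i = 0; i < count; i++)
    {
    /* Cheap rejection first; compare the bytes only on a hash hit. */
    if (table[i].hash != h || table[i].length != length) continue;
    if (memcmp(table[i].name, name, length) == 0)
      return table[i].group_number;
    }
  return -1;
}
```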
The pcre2test utility needs quite a few changes to accommodate this.
It is simpler to add a new mode to it than to make it fully
EBCDIC-native. On an ASCII system, pcre2test performs ASCII I/O, but
translates the input when passing it to the fully-EBCDIC-supporting
library.
Debian's "lintian" picked this up - line 950 in the man page starts
with a ' which is how you start a roff request. You can reproduce the
warning thus:
```
LC_ALL=C.UTF-8 MANROFFSEQ='' MANWIDTH=80 \
man --warnings -E UTF-8 -l -Tutf8 -Z doc/pcre2grep.1 >/dev/null
```
The fix is to add a zero-width space (`\&`) to the start of the
relevant line (indeed `groff_man(7)` suggests exactly this use for \&).
---------
Co-authored-by: Matthew Vernon <matthew@debian.org>
Fixes #564
The previous API was not extensible to handle multi-character case rules, and it required a fair bit of reworking to accommodate them. The casing transformations are now deferred: the string to transform is buffered up, and the callback then does an in-place transformation on the entire buffered input.
I reckon that callers are assuming that when you use the PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option, it will calculate the entire memory requirement in one go. Just two calls should be sufficient (rather than needing to loop with a gradually-increasing buffer size).
However, with a substitution callout this is not true. If you call once with PCRE2_SUBSTITUTE_OVERFLOW_LENGTH, the buffer length returned might still not be sufficient for the second call to succeed.
This is because the callout might not be called the first time, but the second time it will be called and can affect control flow, by requiring even more buffer to be used. This occurs even if the callout is completely stateless, idempotent and well-behaved.
This fix ensures that when we skip a callout (due to overflow), we still request enough buffer size for either option that the callout might return.
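The idea can be sketched with a toy substitution routine. This is not pcre2_substitute(): toy_substitute(), toy_callout, always_replace(), and always_keep() are hypothetical, every occurrence of a single target character counts as a "match", and the sketch assumes the callout has exactly two outcomes (accept the replacement, or keep the matched text). Once the buffer has overflowed the callout is skipped, and the larger of the two outcomes is reserved.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical callout: returns 0 to accept the replacement, non-zero to
   keep the matched text unchanged. */
typedef int (*toy_callout)(size_t match_offset);

static int always_replace(size_t match_offset) { (void)match_offset; return 0; }
static int always_keep(size_t match_offset) { (void)match_offset; return 1; }

/* Replaces each occurrence of target in subject. On entry, *outlen is the
   buffer capacity. Returns 0 and sets *outlen to the result length (without
   the terminating zero), or returns -1 and sets *outlen to a capacity that
   is sufficient for a second call (including the terminating zero). */
static int toy_substitute(const char *subject, char target,
  const char *replacement, toy_callout callout, char *out, size_t *outlen)
{
  size_t capacity = *outlen;
  size_t used = 0;
  size_t rlen = strlen(replacement);
  int overflowed = 0;

  for (size_t i = 0; subject[i] != '\0'; i++)
    {
    const char *piece = subject + i;
    size_t plen = 1;

    if (subject[i] == target)
      {
      if (overflowed)
        {
        /* The callout is skipped, so reserve room for whichever of its
           outcomes needs more space. */
        if (rlen > plen) plen = rlen;
        }
      else if (callout(i) == 0)
        {
        piece = replacement;
        plen = rlen;
        }
      }

    if (!overflowed && used + plen + 1 > capacity) overflowed = 1;
    if (!overflowed) memcpy(out + used, piece, plen);
    used += plen;
    }

  if (overflowed || used + 1 > capacity)
    {
    *outlen = used + 1;
    return -1;
    }
  out[used] = '\0';
  *outlen = used;
  return 0;
}
```

With the subject "a-b-c", target '-', and replacement "<->", a first call with a zero-length buffer reports a required size of 10 whichever callout is used. A second call with that size then succeeds for both always_replace (result length 9) and always_keep (result length 5); the estimate is an upper bound rather than an exact figure.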
* Add details on new maintainership
* Remove checked-in autoconf outputs
* Sync & clean up files with Detrail
* Add CI job for ensuring PrepareRelease is run
* Add Ubuntu-20.04 autoconf runner
* Make CMake installed files match autoconf
* Update acknowledgements
Avoid a crash introduced by recent changes to the substitute code, and clarify what the expected offset value should be when the provided buffer overflows.
---------
Co-authored-by: Carlo Marcelo Arenas Belón <carenas@gmail.com>
I haven't tackled any controversial steps in this PR - simply tidying the formatting.
I have used the `gersemi` tool, which simply "does its thing". I have additionally renamed a few variables to match standard casing conventions (but I am aware that some lowercased variables are used, for example in package-config files, and have left those alone).
* Move some existing character class code into pcre2_compile_class.c
* Add a new flag PCRE2_ALT_EXTENDED_CLASS to change the behaviour of
parsing [...] character classes, to emit new META codes, and new
OP_ECLASS codes for nested character classes with operators
* Document the behaviour relative to the UTS#18 standard
* No JIT support; it falls back to the interpreter. DFA is supported.
Change the minimum framesize value to match what the code can
support. While at it, refactor some of the conditionals used
so that extracting the framesize is more reliable (as the
assert is polymorphic), and update other seemingly unrelated bits.
* pcre2test: tighten \N{U+hh...} support
When \N{U+hh...} was added, it was meant to support all Unicode
characters that can be encoded by pcre2test and Perl, but its
use outside what is officially considered valid can be confusing,
so print a warning for those cases.
* perltest: add support for hex modifier
The use of \xhh can be ambiguous when used together with the utf modifier,
so allow for describing code points individually in the pattern using hex,
with the same syntax that is already supported by pcre2test.
When providing escaped values in the subject, the syntax can be
ambiguous, so add support for a new escape that is always meant
to refer to a Unicode character and that is already supported
by the library in utf mode.
While at it, refactor the code to support octal escapes and fix
bugs with overlong numbers, as well as simplify the logic that
decides if an escape is encoded as a code unit or as a Unicode
character that could require multiple code units.
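The distinction between a code unit and a Unicode character can be illustrated with a short C sketch (perltest itself is a Perl script, so this is not its code). It assumes the 8-bit library and a rule modelled on the behaviour described for pcre2test: an escape that names a character is emitted as UTF-8 in utf mode, and anything else is emitted as a single byte. The names utf8_encode() and encode_escape() are hypothetical, and surrogates are not rejected.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Writes the UTF-8 encoding of cp to out (at least 4 bytes) and returns
   the number of code units used, or 0 if cp is above U+10FFFF. */
static size_t utf8_encode(uint32_t cp, unsigned char *out)
{
  if (cp < 0x80)
    {
    out[0] = (unsigned char)cp;
    return 1;
    }
  if (cp < 0x800)
    {
    out[0] = (unsigned char)(0xC0 | (cp >> 6));
    out[1] = (unsigned char)(0x80 | (cp & 0x3F));
    return 2;
    }
  if (cp < 0x10000)
    {
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
    }
  if (cp <= 0x10FFFF)
    {
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
    }
  return 0;
}

/* Assumed rule: is_character is non-zero for escapes that always name a
   character (for example \N{U+hh...}); in utf mode those become UTF-8.
   Everything else must fit in one 8-bit code unit. Returns the number of
   code units written, or 0 if the value cannot be represented. */
static size_t encode_escape(uint32_t value, int is_character, int utf,
  unsigned char *out)
{
  if (is_character && utf) return utf8_encode(value, out);
  if (value > 0xFF) return 0;
  out[0] = (unsigned char)value;
  return 1;
}
```

Under these assumptions the value 0xE9 is a single byte when written as a code unit, but becomes the two bytes C3 A9 when written as a character in utf mode.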