This ensures the data store stays aligned even when the range is repeated.
Furthermore, character lists are stored once regardless of repeats.
Co-authored-by: Zoltan Herczeg <hzmester@freemail.hu>
* Move some existing character class code into pcre2_compile_class.c
* Add a new flag PCRE2_ALT_EXTENDED_CLASS to change the behaviour of
parsing [...] character classes, to emit new META codes, and new
OP_ECLASS codes for nested character classes with operators
* Document the behaviour relative to the UTS#18 standard
* No JIT support; it falls back to the interpreter. DFA is supported.
Change the minimum framesize value to match what the code can
support. While at it, refactor some of the conditionals used
so that extracting the framesize is more reliable (as the
assert is polymorphic) and update other seemingly unrelated bits.
Use a syntax similar to pcre2test's to set a per-pattern locale, and
teach pcre2test to recognize the modifier as Perl compatible.
While at it, update tests and fix a recent regression that wasn't
covered by them.
* pcre2test: tighten \N{U+hh...} support
When \N{U+hh...} was added it was meant to support all Unicode
characters that can be encoded by pcre2test and Perl, but its
use outside what is officially considered valid can be confusing
so print a warning for those cases.
* perltest: add support for hex modifier
The use of \xhh can be ambiguous when used together with the utf modifier,
so allow for describing code points individually in the pattern using hex,
with the same syntax that is already supported by pcre2test.
When providing escaped values in the subject, the syntax can be
ambiguous, so add support for a new escape that is always meant
to refer to a Unicode character and that is already supported
by the library in utf mode.
While at it, refactor the code to support octal escapes and fix
bugs with overlong numbers, as well as simplify the logic that
decides if an escape is encoded as a code unit or as a Unicode
character, which could require multiple code units.
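To illustrate why a Unicode character may need several code units, here
is a minimal sketch of UTF-8 encoding. `encode_utf8()` is a hypothetical
helper written for this note, not code from pcre2test, and rejecting
surrogates and values above U+10FFFF is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not taken from pcre2test: encode one code point as
   UTF-8. Returns the number of code units written (1 to 4), or 0 if the
   value is a surrogate or lies above U+10FFFF. */
size_t encode_utf8(uint32_t cp, unsigned char out[4])
{
  assert(out != NULL);
  if (cp < 0x80) {
    out[0] = (unsigned char)cp;
    return 1;
  }
  if (cp < 0x800) {
    out[0] = (unsigned char)(0xC0 | (cp >> 6));
    out[1] = (unsigned char)(0x80 | (cp & 0x3F));
    return 2;
  }
  if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* surrogate range */
  if (cp < 0x10000) {
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
  }
  if (cp <= 0x10FFFF) {
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
  }
  return 0;
}
```

A value below 0x80 fits in one code unit, while anything larger needs
two to four, which is why the escape handling has to decide between the
two encodings.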
New flag: PCRE2_EXTRA_TURKISH_CASING, and pre-pattern flag
(*TURKISH_CASING).
Also added a pre-pattern flag (*CASELESS_RESTRICT) for this existing
flag.
Perl does not use $0 anymore to refer to the text of the matched subject
and `pcre2_substitute()` was recently updated to also provide that value
using the variable Perl prefers: `$&`.
In a similar context, either as part of the formatted output from a match
or during the processing of a callback, teach pcre2grep to also populate
$&.
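As a rough illustration of what populating `$&` means for formatted
output, the sketch below expands `$&` (and `$0`) in a format string
with the matched text. `expand_match()` is a hypothetical helper, not
pcre2grep's implementation, and it ignores every other `$` escape.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch, not pcre2grep's code: copy fmt into out, replacing
   each "$&" or "$0" with the matched text. Returns 0 on success, or -1 if
   the result (including the terminating NUL) does not fit in outsize. */
int expand_match(const char *fmt, const char *match, char *out, size_t outsize)
{
  size_t mlen, used = 0;
  assert(fmt != NULL && match != NULL && out != NULL);
  if (outsize == 0) return -1;
  mlen = strlen(match);
  while (*fmt != '\0') {
    if (fmt[0] == '$' && (fmt[1] == '&' || fmt[1] == '0')) {
      /* Need room for the match plus the final NUL. */
      if (mlen >= outsize - used) return -1;
      memcpy(out + used, match, mlen);
      used += mlen;
      fmt += 2;
    } else {
      /* Need room for this character plus the final NUL. */
      if (outsize - used < 2) return -1;
      out[used++] = *fmt++;
    }
  }
  out[used] = '\0';
  return 0;
}
```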
While at it, update the ChangeLog with recent changes.
UCD 16 makes a lot of changes to scripts, so make sure that we have
sufficient coverage by keeping the original autogenerated tests in
addition.
Complete the code updates for changes to ScriptExtensions.txt, which
is no longer sorted by script, and allow for multiple Unicode property
test files, depending on Unicode version.
The original asserts weren't very useful in debug mode as they
were lacking information on where they were being triggered and
were also unreliable and dangerous as they could result in
important code being removed and trigger crashes (even in non
debug mode).
Instead of implementing one generic assert for both modes, build
a more useful one for each one, so PCRE2_UNREACHABLE() could be
also used in non debug builds to help with optimization.
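The split between the two modes could look roughly like the sketch
below. `SKETCH_DEBUG`, `SKETCH_UNREACHABLE()` and `bytes_per_unit()`
are hypothetical names chosen for illustration; the real PCRE2 macros
may well be defined differently.

```c
#include <stdio.h>
#include <stdlib.h>

/* SKETCH_DEBUG and SKETCH_UNREACHABLE are hypothetical names used for
   illustration; they are not the macros defined by PCRE2. */
#ifdef SKETCH_DEBUG
/* Debug builds: report where the assertion fired, then abort. */
#define SKETCH_UNREACHABLE() do { \
  fprintf(stderr, "Execution reached unexpected point at %s:%d\n", \
          __FILE__, __LINE__); \
  abort(); \
  } while (0)
#elif defined(__GNUC__) || defined(__clang__)
/* Non-debug builds: only a hint for the optimizer, so the point must
   really be unreachable. */
#define SKETCH_UNREACHABLE() __builtin_unreachable()
#else
/* Unknown compiler: expand to nothing. */
#define SKETCH_UNREACHABLE() do { } while (0)
#endif

enum unit_width { WIDTH_8 = 8, WIDTH_16 = 16, WIDTH_32 = 32 };

int bytes_per_unit(enum unit_width w)
{
  switch (w) {
    case WIDTH_8:  return 1;
    case WIDTH_16: return 2;
    case WIDTH_32: return 4;
  }
  SKETCH_UNREACHABLE();
  /* Keeps a defined return value in builds where the macro expands to
     nothing. */
  return -1;
}
```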
Reinstate all original assertions to use the new versions, which
will have the side effect of fixing indentation issues introduced
in the original, and include additional asserts that were provided
as the original ones were being audited for safety. Note that during
such audit the use of the original asserts might have been refactored,
so it also includes all those relevant code changes.
While at it, update cmake and CI to help with testing as well as
other documentation.
Co-authored-by: Alex Dowad <alexinbeijing@gmail.com>
Even though it is documented that invalid escapes will be reported,
the code would fall back in that case and result in a NUL being
generated whenever an incomplete \x{ escape was being parsed.
Refactor the code to report the error instead and fix the logic used
for overlong numbers so that the truncation doesn't result in an
unexpected value being used.
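A minimal sketch of the intended behaviour, assuming a length-delimited
input: an incomplete escape is reported as an error, and an overlong
number is rejected before it can be truncated. `parse_braced_hex()`
and its return convention are hypothetical, not the pcre2test code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not the pcre2test implementation. p points just
   after "\x{" and end is one past the last readable byte. On success it
   stores the value in *value and returns the number of bytes consumed,
   including the closing '}'. It returns 0 for a syntax error (no digits,
   a non-hex character, or a missing '}') and -1 when the number exceeds
   max. */
int parse_braced_hex(const char *p, const char *end, uint32_t max,
                     uint32_t *value)
{
  const char *start = p;
  uint32_t v = 0;
  assert(p != NULL && end != NULL && value != NULL && p <= end);
  while (p < end && *p != '}') {
    uint32_t digit;
    char c = *p;
    if (c >= '0' && c <= '9') digit = (uint32_t)(c - '0');
    else if (c >= 'a' && c <= 'f') digit = (uint32_t)(c - 'a' + 10);
    else if (c >= 'A' && c <= 'F') digit = (uint32_t)(c - 'A' + 10);
    else return 0;
    /* Check before accumulating so an overlong number is rejected rather
       than silently truncated. */
    if (digit > max || v > (max - digit) / 16) return -1;
    v = v * 16 + digit;
    p++;
  }
  if (p == end || p == start) return 0;  /* missing '}' or empty "\x{}" */
  *value = v;
  return (int)(p - start) + 1;
}
```

With a 32-bit limit, the input `100000041}` is rejected here, whereas
naive accumulation would wrap around and yield 0x41.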
There was an old (from PCRE 4.0) test that was affected but which is
no longer relevant, because it could only be triggered with invalid
UTF (which isn't supported), and that was therefore removed as a
result.
Additionally, it was found that the same syntax error was affecting
perltest so correct that as well by reporting syntax errors in the
subject lines.
While at it, update related documentation for Perl's compatibility.
Update the Unicode files to Unicode 16, and adjust the tests and the
scripts used to parse UCD to adapt to minor formatting differences
in UCD 16.
The `GenerateTest26.py` and `GenerateCommon.py` scripts had a regexp to
extract properties from the `ScriptExtensions.txt` file. Previously,
all property lines had one space after the space-separated list of
scripts. In UCD 16, this list is right-padded, which throws off
the parser.
This commit adjusts the regexps to ignore padding spaces.
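The same idea can be sketched in C with POSIX `<regex.h>` instead of
Python. The pattern and the `script_list()` helper are hypothetical,
and the line layout (`code ; scripts # comment`) is an assumption
based on the description above; this is not the regexp used by the
scripts.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical C analogue of the Python regexps, using POSIX <regex.h>.
   Copies the script list of one ScriptExtensions.txt-style line into out
   without any padding. Returns 0 on success, or -1 if the line does not
   match or the list does not fit in outsize. */
int script_list(const char *line, char *out, size_t outsize)
{
  /* "[ ]*" before '#' absorbs any amount of right-padding. */
  static const char *pattern =
    "^[0-9A-F]+(\\.\\.[0-9A-F]+)?[ ]*;[ ]*([A-Za-z]+( [A-Za-z]+)*)[ ]*#";
  regex_t re;
  regmatch_t m[3];
  int rc = -1;
  assert(line != NULL && out != NULL);
  if (regcomp(&re, pattern, REG_EXTENDED) != 0) return -1;
  if (regexec(&re, line, 3, m, 0) == 0 && m[2].rm_so != -1) {
    size_t len = (size_t)(m[2].rm_eo - m[2].rm_so);
    if (len < outsize) {
      memcpy(out, line + m[2].rm_so, len);
      out[len] = '\0';
      rc = 0;
    }
  }
  regfree(&re);
  return rc;
}
```

Because the captured group must end in a letter, a line with one
trailing space and a line with many yield the same script list.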
JIT has several good uses of unsigned integer wraparounds, that
clang's UBSAN doesn't like (which is controversial, because it is
clearly not undefined behaviour), but since it is usually good to
know when they happen by accident it makes sense to make sure the
rest of the PCRE2 codebase benefits from checking it.
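For context, a typical intentional use of unsigned wraparound is a
single-comparison range check. The sketch below is illustrative only
and is not code from the JIT; `in_range()` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative only, not code from the JIT: one comparison replaces two.
   If c is below lo, c - lo wraps around to a large value (well defined
   for unsigned types), so the test fails. Requires lo <= hi. */
int in_range(uint32_t c, uint32_t lo, uint32_t hi)
{
  assert(lo <= hi);
  return (c - lo) <= (hi - lo);
}
```

A sanitizer that checks unsigned overflow would flag `c - lo` whenever
`c < lo`, even though the result is exactly what the code relies on.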
While at it, upgrade the version of the base image to use a newer
OS as a canary for when the rest of the jobs upgrade themselves
and be a little more strict to catch other constructs that are not
welcomed in our codebase.
Since 4f6c43d (Add assertion macros, use new PCRE2_UNREACHABLE
assertion at unreachable points in code (#446), 2024-08-28) and
then again after 04dc664 (Implement PCRE2_UNREACHABLE assertion
for MS Visual C++ (#465), 2024-09-10), this API could return
random values on failure.
Remove the assertion until it can be added back in a way that
wouldn't trigger a crash in non debug builds or result in the
function returning without an API expected value.
As reported recently by ef218fb (Guard against out-of-bounds memory
access when parsing LIMIT_HEAP et al (#463), 2024-09-07), a malformed
pattern could result in reading 1 byte past its end.
Fix a similar issue that affects all VERBs and add test cases to
ensure the original bug and all its siblings are no longer an issue.
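The kind of guard involved can be sketched as follows, assuming a
pattern that is length-delimited rather than NUL-terminated.
`match_verb()` is a hypothetical helper, not PCRE2's parser.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not PCRE2's parser. ptr points just after "(*" in
   a pattern that is length-delimited rather than NUL-terminated, and end
   is one past its last byte. Returns the number of bytes consumed if the
   verb name is present and followed by ')', or 0 otherwise. */
size_t match_verb(const char *ptr, const char *end, const char *name)
{
  size_t len;
  assert(ptr != NULL && end != NULL && name != NULL && ptr <= end);
  len = strlen(name);
  /* The name plus the closing ')' must fit before any byte is read. */
  if ((size_t)(end - ptr) < len + 1) return 0;
  if (memcmp(ptr, name, len) != 0 || ptr[len] != ')') return 0;
  return len + 1;
}
```

Checking the remaining length first means a pattern that ends right
after the verb name is rejected without touching the byte past its end.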
While at it, fix the wording of the related documentation.
Make the documentation of the new API more useful at a first glance
by providing the list of values that can be used while keeping all
details in pcre2api.3, just as is done in other similar pages.
While at it, reorder the entries for the directives in pcre2api.3 so
the order is more natural and matches the one used here.
__builtin_unreachable() implementation is not defined and has
been known to not trigger failures under some compilers.
Default instead to using assert(), which has the added benefit
of printing a descriptive message and it is also likely more
portable as it is part of ANSI C.
While at it, really allow configuring builtins with cmake.