mirror of https://github.com/PCRE2Project/pcre2.git (synced 2025-10-17 23:57:23 +08:00)
pcre2grep: add --posix-pattern-file for compatibility with other grep (#428)
Historically, pcre2grep has done minor processing of the patterns read through the `-f` option. As a result, some patterns give different results depending on whether they were provided through `-e`, through `-f`, or as a command-line parameter. Add a flag that skips that processing, so that a pattern file used with other grep implementations can be used directly with the same result.
committed by GitHub
parent 3b90149f3c
commit c63d7c992e
10
ChangeLog
@@ -7,14 +7,18 @@ there is also the log of commit messages.
 Version 10.45 xx-xxx-2024
 -------------------------
 
-1. Change 6 of 10.44 broke 32-bit compiles because pcre2test's reporting of
-memory size was changed to the entire compiled data block, instead of just the
-pattern and tables data, so as to align with the new length restriction.
+1. Change 6 of 10.44 broke 32-bit tests because pcre2test's reporting of
+memory size was changed to the entire compiled data block, instead of just the
+pattern and tables data, so as to align with the new length restriction.
 Because the block's header contains pointers, this meant the pcre2test output
 was different in 32-bit mode. A patch by Carlo reverts to the preevious state
 and makes sure that any limit set by pcre2_set_max_pattern_compiled_length()
 also avoids the internal struct overhead.
 
+2. Add --posix-pattern-file to pcre2grep to allow processing of empty patterns
+through the -f option, as well as patterns that end in space characters for
+compatibility with other grep tools.
+
 
 Version 10.44 07-June-2024
 --------------------------
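ChangeLog item 1 rests on the observation that a block header containing pointers changes size with the pointer width, so a size that includes the header differs between 32-bit and 64-bit builds. A minimal sketch of that effect, assuming a made-up layout: `sketch_header`, `whole_block_size()` and `reported_size()` are illustrative names only and do not reflect PCRE2's internal structures.

```c
#include <stddef.h>

/* Hypothetical header for a compiled block; NOT PCRE2's real layout.
The two pointer members make sizeof(sketch_header) depend on the
platform's pointer width. */

typedef struct sketch_header {
  void *memctl;            /* placeholder pointer member */
  const void *tables;      /* placeholder pointer member */
  unsigned int flags;
} sketch_header;

/* Size of the whole block: differs between 32-bit and 64-bit builds
because it includes the header. */

static size_t
whole_block_size(size_t pattern_and_tables)
{
return sizeof(sketch_header) + pattern_and_tables;
}

/* Size with the header overhead taken out again: for a given payload it
no longer depends on the pointer width. */

static size_t
reported_size(size_t whole_block)
{
return whole_block - sizeof(sketch_header);
}
```

Under these assumptions, a test that prints `reported_size()` gives the same output on every platform, whereas one that prints `whole_block_size()` does not.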
29
RunGrepTest
@@ -861,6 +861,35 @@ echo "---------------------------- Test 153 -----------------------------" >>tes
 (cd $srcdir; $valgrind $vjs $pcre2grep -nA3 --no-group-separator 'four' ./testdata/grepinputx) >>testtrygrep
 echo "RC=$?" >>testtrygrep
 
+echo "---------------------------- Test 154 -----------------------------" >>testtrygrep
+>testtemp1grep
+(cd $srcdir; $valgrind $vjs $pcre2grep -f $builddir/testtemp1grep ./testdata/grepinputv) >>testtrygrep
+echo "RC=$?" >>testtrygrep
+
+echo "---------------------------- Test 155 -----------------------------" >>testtrygrep
+echo "" >testtemp1grep
+(cd $srcdir; $valgrind $vjs $pcre2grep -f $builddir/testtemp1grep ./testdata/grepinputv) >>testtrygrep
+echo "RC=$?" >>testtrygrep
+
+echo "---------------------------- Test 156 -----------------------------" >>testtrygrep
+echo "" >testtemp1grep
+(cd $srcdir; $valgrind $vjs $pcre2grep --posix-pattern-file --file $builddir/testtemp1grep ./testdata/grepinputv) >>testtrygrep
+echo "RC=$?" >>testtrygrep
+
+echo "---------------------------- Test 157 -----------------------------" >>testtrygrep
+echo "spaces " >testtemp1grep
+(cd $srcdir; $valgrind $vjs $pcre2grep -o --posix-pattern-file --file=$builddir/testtemp1grep ./testdata/grepinputv >testtemp2grep && $valgrind $vjs $pcre2grep -q "s " testtemp2grep) >>testtrygrep
+echo "RC=$?" >>testtrygrep
+
+echo "---------------------------- Test 158 -----------------------------" >>testtrygrep
+echo "spaces." >testtemp1grep
+(cd $srcdir; $valgrind $vjs $pcre2grep -f $builddir/testtemp1grep ./testdata/grepinputv) >>testtrygrep
+echo "RC=$?" >>testtrygrep
+
+echo "---------------------------- Test 159 -----------------------------" >>testtrygrep
+printf "spaces.\015\012" >testtemp1grep
+(cd $srcdir; $valgrind $vjs $pcre2grep --posix-pattern-file -f$builddir/testtemp1grep ./testdata/grepinputv) >>testtrygrep
+echo "RC=$?" >>testtrygrep
 
 # Now compare the results.
 
@@ -27,9 +27,9 @@ DESCRIPTION
 </b><br>
 <P>
 This function sets, in a compile context, the maximum size (in bytes) for the
-memory needed to hold the compiled version of a pattern that is compiled with
-this context. The result is always zero. If a pattern that is passed to
-<b>pcre2_compile()</b> with this context needs more memory, an error is
+memory needed to hold the compiled version of a pattern that is using this
+context. The result is always zero. If a pattern that is passed to
+<b>pcre2_compile()</b> referencing this context needs more memory, an error is
 generated. The default is the largest number that a PCRE2_SIZE variable can
 hold, which is effectively unlimited.
 </P>
@@ -391,9 +391,10 @@ Read patterns from the file, one per line. As is the case with patterns on the
 command line, no delimiters should be used. What constitutes a newline when
 reading the file is the operating system's default interpretation of \n. The
 <b>--newline</b> option has no effect on this option. Trailing white space is
-removed from each line, and blank lines are ignored. An empty file contains no
+removed from each line, and blank lines are ignored unless the
+<b>--posix-pattern-file</b> option is also provided. An empty file contains no
 patterns and therefore matches nothing. Patterns read from a file in this way
-may contain binary zeros, which are treated as ordinary data characters.
+may contain binary zeros, which are treated as ordinary character literals.
 <br>
 <br>
 If this option is given more than once, all the specified files are read. A
@@ -808,6 +809,15 @@ when in UCP mode, the sequence (?aP) restricts [:word:] to ASCII letters, while
 allowing \w to match Unicode letters and digits.
 </P>
 <P>
+<b>--posix-pattern-file</b>
+When patterns are provided with the <b>-f</b> option, do not trim trailing
+spaces or ignore empty lines in a similar way than other grep tools. To keep
+the behaviour consistent with older versions, if the pattern read was
+terminated with CRLF (as character literals) then both characters won't be
+included as part of it, so if you really need to have pattern ending in '\r',
+use a escape sequence or provide it by a different method.
+</P>
+<P>
 <b>-q</b>, <b>--quiet</b>
 Work quietly, that is, display nothing except error messages. The exit
 status indicates whether or not any matches were found.
@@ -337,9 +337,10 @@ Read patterns from the file, one per line. As is the case with patterns on the
 command line, no delimiters should be used. What constitutes a newline when
 reading the file is the operating system's default interpretation of \en. The
 \fB--newline\fP option has no effect on this option. Trailing white space is
-removed from each line, and blank lines are ignored. An empty file contains no
+removed from each line, and blank lines are ignored unless the
+\fB--posix-pattern-file\fP option is also provided. An empty file contains no
 patterns and therefore matches nothing. Patterns read from a file in this way
-may contain binary zeros, which are treated as ordinary data characters.
+may contain binary zeros, which are treated as ordinary character literals.
 .sp
 If this option is given more than once, all the specified files are read. A
 data line is output if any of the patterns match it. A file name can be given
@@ -701,6 +702,14 @@ option settings within patterns that affect individual classes. For example,
 when in UCP mode, the sequence (?aP) restricts [:word:] to ASCII letters, while
 allowing \ew to match Unicode letters and digits.
 .TP
+\fB--posix-pattern-file\fP
+When patterns are provided with the \fB-f\fP option, do not trim trailing
+spaces or ignore empty lines in a similar way than other grep tools. To keep
+the behaviour consistent with older versions, if the pattern read was
+terminated with CRLF (as character literals) then both characters won't be
+included as part of it, so if you really need to have pattern ending in '\er',
+use a escape sequence or provide it by a different method.
+.TP
 \fB-q\fP, \fB--quiet\fP
 Work quietly, that is, display nothing except error messages. The exit
 status indicates whether or not any matches were found.
@@ -145,7 +145,8 @@ sure both macros are undefined; an emulation function will then be used. */
 /* Define to 1 if you have the <unistd.h> header file. */
 #undef HAVE_UNISTD_H
 
-/* Define to 1 if the compiler supports simple visibility declarations. */
+/* Define to 1 if the compiler supports GCC compatible visibility
+declarations. */
 #undef HAVE_VISIBILITY
 
 /* Define to 1 if you have the <wchar.h> header file. */
@@ -290,6 +290,7 @@ static BOOL show_total_count = FALSE;
 static BOOL silent = FALSE;
 static BOOL utf = FALSE;
 static BOOL posix_digit = FALSE;
+static BOOL posix_pattern_file = FALSE;
 
 static uint8_t utf8_buffer[8];
 
@@ -428,6 +429,7 @@ used to identify them. */
 #define N_POSIX_DIGIT (-26)
 #define N_GROUP_SEPARATOR (-27)
 #define N_NO_GROUP_SEPARATOR (-28)
+#define N_POSIX_PATFILE (-29)
 
 static option_item optionlist[] = {
 { OP_NODATA, N_NULL, NULL, "", "terminate options" },
@@ -449,6 +451,7 @@ static option_item optionlist[] = {
 { OP_PATLIST, 'e', &match_patdata, "regex(p)=pattern", "specify pattern (may be used more than once)" },
 { OP_NODATA, 'F', NULL, "fixed-strings", "patterns are sets of newline-separated strings" },
 { OP_FILELIST, 'f', &pattern_files_data, "file=path", "read patterns from file" },
+{ OP_NODATA, N_POSIX_PATFILE, NULL, "posix-pattern-file", "use POSIX semantics for pattern files" },
 { OP_FILELIST, N_FILE_LIST, &file_lists_data, "file-list=path","read files to search from file" },
 { OP_NODATA, N_FOFFSETS, NULL, "file-offsets", "output file offsets, not text" },
 { OP_STRING, N_GROUP_SEPARATOR, &group_separator, "group-separator=text", "set separator between groups of lines" },
@@ -1448,7 +1451,34 @@ while ((c = fgetc(f)) != EOF)
 return yield;
 }
 
+/*************************************************
+* Read one pattern from file *
+*************************************************/
+
+/* Wrap around read_one_line() to make sure any terminating '\n' is not
+included in the pattern and empty patterns are correctly identified.
+
+Arguments:
+buffer the buffer to read into
+length maximum number of characters to read and report how many were
+f the file
+
+Returns: TRUE if a pattern was read into buffer
+*/
 
+static BOOL
+read_pattern(char *buffer, PCRE2_SIZE *length, FILE *f)
+{
+*buffer = '\0';
+*length = read_one_line(buffer, *length, f);
+if (*length > 0 && buffer[*length-1] == '\n') *length = *length - 1;
+if (posix_pattern_file && *length > 0 && buffer[*length-1] == '\r')
+{
+*length = *length - 1;
+if (*length == 0) return TRUE;
+}
+return (*length > 0 || *buffer == '\n');
+}
 
 /*************************************************
 * Find end of line *
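The terminator handling in read_pattern() can be tried in isolation. The sketch below is a simplified stand-in, not the function from the patch: `strip_line_terminator()` is a hypothetical name, it works on an in-memory buffer instead of calling read_one_line(), and the `posix_mode` parameter replaces the `posix_pattern_file` global.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the terminator handling in read_pattern().
buf holds len bytes as read from the pattern file, possibly ending in
"\n" or "\r\n". Returns the pattern length without the terminator. */

static size_t
strip_line_terminator(const char *buf, size_t len, bool posix_mode)
{
/* A trailing '\n' is never part of the pattern. */
if (len > 0 && buf[len - 1] == '\n') len--;

/* In POSIX mode a trailing '\r' is dropped here as well, so that a
CRLF-terminated line gives the same pattern as an LF-terminated one. */
if (posix_mode && len > 0 && buf[len - 1] == '\r') len--;

return len;
}
```

Without POSIX mode the '\r' is still present after this step; in the patch it is removed later, when the caller trims trailing white space.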
@@ -3598,6 +3628,7 @@ switch(letter)
 case N_NOJIT: use_jit = FALSE; break;
 case N_ALLABSK: extra_options |= PCRE2_EXTRA_ALLOW_LOOKAROUND_BSK; break;
 case N_NO_GROUP_SEPARATOR: group_separator = NULL; break;
+case N_POSIX_PATFILE: posix_pattern_file = TRUE; break;
 case 'a': binary_files = BIN_TEXT; break;
 case 'c': count_only = TRUE; break;
 case N_POSIX_DIGIT: posix_digit = TRUE; break;
@@ -3808,11 +3839,15 @@ else
 filename = name;
 }
 
-while ((patlen = read_one_line(buffer, sizeof(buffer), f)) > 0)
+while ((patlen = sizeof(buffer)) && read_pattern(buffer, &patlen, f))
 {
-while (patlen > 0 && isspace((unsigned char)(buffer[patlen-1]))) patlen--;
+if (!posix_pattern_file)
+{
+while (patlen > 0 && isspace((unsigned char)(buffer[patlen-1]))) patlen--;
+}
+
 linenumber++;
-if (patlen == 0) continue; /* Skip blank lines */
+if (!posix_pattern_file && patlen == 0) continue; /* Skip blank lines */
 
 /* Note: this call to add_pattern() puts a pointer to the local variable
 "buffer" into the pattern chain. However, that pointer is used only when
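The effect of the new loop body on a single line can be sketched as a standalone helper. This is an illustration only: `effective_pattern_length()` and its `skip` output parameter are hypothetical names, and `posix_mode` stands in for the `posix_pattern_file` flag.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not part of pcre2grep. line holds len bytes with
the line terminator already removed. Returns the length of the pattern
that would be added, and sets *skip when the line would be ignored. */

static size_t
effective_pattern_length(const char *line, size_t len, bool posix_mode,
  bool *skip)
{
/* Default behaviour: trim trailing white space. */
if (!posix_mode)
  {
  while (len > 0 && isspace((unsigned char)line[len - 1])) len--;
  }

/* Default behaviour: a line that is now empty is skipped. In POSIX mode
an empty line is kept as an empty pattern. */
*skip = (!posix_mode && len == 0);
return len;
}
```

Under these assumptions the line "spaces " (the pattern used in Test 157) yields a 6-byte pattern by default and a 7-byte pattern in POSIX mode, while an empty line is skipped by default and kept as an empty pattern in POSIX mode, which is what lets Test 156 match every input line.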
1
testdata/grepinputv
vendored
@@ -7,3 +7,4 @@ The word is cat in this line
 The caterpillar sat on the mat
 The snowcat is not an animal
 A buried feline in the syndicate
+trailing spaces
26
testdata/grepoutput
vendored
@@ -464,6 +464,7 @@ The word is cat in this line
 The caterpillar sat on the mat
 The snowcat is not an animal
 A buried feline in the syndicate
+trailing spaces
 RC=0
 ---------------------------- Test 52 ------------------------------
 fox [1;31mjumps[0m
@@ -1169,6 +1170,7 @@ The word is cat in this line
 The word is cat in this line
 The caterpillar sat on the mat
 The snowcat is not an animal
 A buried feline in the syndicate
+trailing spaces
 
 RC=0
@@ -1253,3 +1255,27 @@ RC=0
 34:fourteen
 35-fifteen
 36-sixteen
 37-seventeen
 RC=0
+---------------------------- Test 154 -----------------------------
+RC=1
+---------------------------- Test 155 -----------------------------
+RC=1
+---------------------------- Test 156 -----------------------------
+The quick brown
+fox jumps
+over the lazy dog.
+This time it jumps and jumps and jumps.
+This line contains \E and (regex) *meta* [characters].
+The word is cat in this line
+The caterpillar sat on the mat
+The snowcat is not an animal
+A buried feline in the syndicate
+trailing spaces
+RC=0
+---------------------------- Test 157 -----------------------------
+RC=0
+---------------------------- Test 158 -----------------------------
+trailing spaces
+RC=0
+---------------------------- Test 159 -----------------------------