pcre2grep: add --posix-pattern-file for compatibility with other grep (#428)

Historically, pcre2grep has lightly preprocessed the patterns read
through the `-f` option: trailing white space is trimmed from each line
and blank lines are skipped.

As a result, some patterns give different results depending on whether
they are provided through `-e`, through `-f`, or as a parameter on the
command line.

Add a flag that skips that processing, so that a pattern file used with
other grep implementations can be used directly with the same result.
This commit is contained in:
Carlo Marcelo Arenas Belón
2024-06-18 07:45:13 -07:00
committed by GitHub
parent 3b90149f3c
commit c63d7c992e
9 changed files with 129 additions and 14 deletions


@@ -7,14 +7,18 @@ there is also the log of commit messages.
Version 10.45 xx-xxx-2024
-------------------------
1. Change 6 of 10.44 broke 32-bit compiles because pcre2test's reporting of
memory size was changed to the entire compiled data block, instead of just the
pattern and tables data, so as to align with the new length restriction.
1. Change 6 of 10.44 broke 32-bit tests because pcre2test's reporting of
memory size was changed to the entire compiled data block, instead of just the
pattern and tables data, so as to align with the new length restriction.
Because the block's header contains pointers, this meant the pcre2test output
was different in 32-bit mode. A patch by Carlo reverts to the previous state
and makes sure that any limit set by pcre2_set_max_pattern_compiled_length()
also avoids the internal struct overhead.
2. Add --posix-pattern-file to pcre2grep to allow processing of empty patterns
through the -f option, as well as patterns that end in space characters for
compatibility with other grep tools.
Version 10.44 07-June-2024
--------------------------


@@ -861,6 +861,35 @@ echo "---------------------------- Test 153 -----------------------------" >>tes
(cd $srcdir; $valgrind $vjs $pcre2grep -nA3 --no-group-separator 'four' ./testdata/grepinputx) >>testtrygrep
echo "RC=$?" >>testtrygrep
echo "---------------------------- Test 154 -----------------------------" >>testtrygrep
>testtemp1grep
(cd $srcdir; $valgrind $vjs $pcre2grep -f $builddir/testtemp1grep ./testdata/grepinputv) >>testtrygrep
echo "RC=$?" >>testtrygrep
echo "---------------------------- Test 155 -----------------------------" >>testtrygrep
echo "" >testtemp1grep
(cd $srcdir; $valgrind $vjs $pcre2grep -f $builddir/testtemp1grep ./testdata/grepinputv) >>testtrygrep
echo "RC=$?" >>testtrygrep
echo "---------------------------- Test 156 -----------------------------" >>testtrygrep
echo "" >testtemp1grep
(cd $srcdir; $valgrind $vjs $pcre2grep --posix-pattern-file --file $builddir/testtemp1grep ./testdata/grepinputv) >>testtrygrep
echo "RC=$?" >>testtrygrep
echo "---------------------------- Test 157 -----------------------------" >>testtrygrep
echo "spaces " >testtemp1grep
(cd $srcdir; $valgrind $vjs $pcre2grep -o --posix-pattern-file --file=$builddir/testtemp1grep ./testdata/grepinputv >testtemp2grep && $valgrind $vjs $pcre2grep -q "s " testtemp2grep) >>testtrygrep
echo "RC=$?" >>testtrygrep
echo "---------------------------- Test 158 -----------------------------" >>testtrygrep
echo "spaces." >testtemp1grep
(cd $srcdir; $valgrind $vjs $pcre2grep -f $builddir/testtemp1grep ./testdata/grepinputv) >>testtrygrep
echo "RC=$?" >>testtrygrep
echo "---------------------------- Test 159 -----------------------------" >>testtrygrep
printf "spaces.\015\012" >testtemp1grep
(cd $srcdir; $valgrind $vjs $pcre2grep --posix-pattern-file -f$builddir/testtemp1grep ./testdata/grepinputv) >>testtrygrep
echo "RC=$?" >>testtrygrep
# Now compare the results.


@@ -27,9 +27,9 @@ DESCRIPTION
</b><br>
<P>
This function sets, in a compile context, the maximum size (in bytes) for the
memory needed to hold the compiled version of a pattern that is compiled with
this context. The result is always zero. If a pattern that is passed to
<b>pcre2_compile()</b> with this context needs more memory, an error is
memory needed to hold the compiled version of a pattern that is using this
context. The result is always zero. If a pattern that is passed to
<b>pcre2_compile()</b> referencing this context needs more memory, an error is
generated. The default is the largest number that a PCRE2_SIZE variable can
hold, which is effectively unlimited.
</P>


@@ -391,9 +391,10 @@ Read patterns from the file, one per line. As is the case with patterns on the
command line, no delimiters should be used. What constitutes a newline when
reading the file is the operating system's default interpretation of \n. The
<b>--newline</b> option has no effect on this option. Trailing white space is
removed from each line, and blank lines are ignored. An empty file contains no
removed from each line, and blank lines are ignored unless the
<b>--posix-pattern-file</b> option is also provided. An empty file contains no
patterns and therefore matches nothing. Patterns read from a file in this way
may contain binary zeros, which are treated as ordinary data characters.
may contain binary zeros, which are treated as ordinary character literals.
<br>
<br>
If this option is given more than once, all the specified files are read. A
@@ -808,6 +809,15 @@ when in UCP mode, the sequence (?aP) restricts [:word:] to ASCII letters, while
allowing \w to match Unicode letters and digits.
</P>
<P>
<b>--posix-pattern-file</b>
When patterns are provided with the <b>-f</b> option, do not trim trailing
spaces or ignore empty lines, in a similar way to other grep tools. To keep
the behaviour consistent with older versions, if the pattern read was
terminated with CRLF (as character literals) then neither character is
included as part of it, so if you really need a pattern ending in '\r',
use an escape sequence or provide it by a different method.
</P>
<P>
<b>-q</b>, <b>--quiet</b>
Work quietly, that is, display nothing except error messages. The exit
status indicates whether or not any matches were found.


@@ -337,9 +337,10 @@ Read patterns from the file, one per line. As is the case with patterns on the
command line, no delimiters should be used. What constitutes a newline when
reading the file is the operating system's default interpretation of \en. The
\fB--newline\fP option has no effect on this option. Trailing white space is
removed from each line, and blank lines are ignored. An empty file contains no
removed from each line, and blank lines are ignored unless the
\fB--posix-pattern-file\fP option is also provided. An empty file contains no
patterns and therefore matches nothing. Patterns read from a file in this way
may contain binary zeros, which are treated as ordinary data characters.
may contain binary zeros, which are treated as ordinary character literals.
.sp
If this option is given more than once, all the specified files are read. A
data line is output if any of the patterns match it. A file name can be given
@@ -701,6 +702,14 @@ option settings within patterns that affect individual classes. For example,
when in UCP mode, the sequence (?aP) restricts [:word:] to ASCII letters, while
allowing \ew to match Unicode letters and digits.
.TP
\fB--posix-pattern-file\fP
When patterns are provided with the \fB-f\fP option, do not trim trailing
spaces or ignore empty lines, in a similar way to other grep tools. To keep
the behaviour consistent with older versions, if the pattern read was
terminated with CRLF (as character literals) then neither character is
included as part of it, so if you really need a pattern ending in '\er',
use an escape sequence or provide it by a different method.
.TP
\fB-q\fP, \fB--quiet\fP
Work quietly, that is, display nothing except error messages. The exit
status indicates whether or not any matches were found.


@@ -145,7 +145,8 @@ sure both macros are undefined; an emulation function will then be used. */
/* Define to 1 if you have the <unistd.h> header file. */
#undef HAVE_UNISTD_H
/* Define to 1 if the compiler supports simple visibility declarations. */
/* Define to 1 if the compiler supports GCC compatible visibility
declarations. */
#undef HAVE_VISIBILITY
/* Define to 1 if you have the <wchar.h> header file. */


@@ -290,6 +290,7 @@ static BOOL show_total_count = FALSE;
static BOOL silent = FALSE;
static BOOL utf = FALSE;
static BOOL posix_digit = FALSE;
static BOOL posix_pattern_file = FALSE;
static uint8_t utf8_buffer[8];
@@ -428,6 +429,7 @@ used to identify them. */
#define N_POSIX_DIGIT (-26)
#define N_GROUP_SEPARATOR (-27)
#define N_NO_GROUP_SEPARATOR (-28)
#define N_POSIX_PATFILE (-29)
static option_item optionlist[] = {
{ OP_NODATA, N_NULL, NULL, "", "terminate options" },
@@ -449,6 +451,7 @@ static option_item optionlist[] = {
{ OP_PATLIST, 'e', &match_patdata, "regex(p)=pattern", "specify pattern (may be used more than once)" },
{ OP_NODATA, 'F', NULL, "fixed-strings", "patterns are sets of newline-separated strings" },
{ OP_FILELIST, 'f', &pattern_files_data, "file=path", "read patterns from file" },
{ OP_NODATA, N_POSIX_PATFILE, NULL, "posix-pattern-file", "use POSIX semantics for pattern files" },
{ OP_FILELIST, N_FILE_LIST, &file_lists_data, "file-list=path","read files to search from file" },
{ OP_NODATA, N_FOFFSETS, NULL, "file-offsets", "output file offsets, not text" },
{ OP_STRING, N_GROUP_SEPARATOR, &group_separator, "group-separator=text", "set separator between groups of lines" },
@@ -1448,7 +1451,34 @@ while ((c = fgetc(f)) != EOF)
return yield;
}
/*************************************************
* Read one pattern from file *
*************************************************/
/* Wrap around read_one_line() to make sure any terminating '\n' is not
included in the pattern and empty patterns are correctly identified.
Arguments:
buffer the buffer to read into
length on entry, the maximum number of characters to read; on exit, the number read
f the file
Returns: TRUE if a pattern was read into buffer
*/
static BOOL
read_pattern(char *buffer, PCRE2_SIZE *length, FILE *f)
{
*buffer = '\0';
*length = read_one_line(buffer, *length, f);
if (*length > 0 && buffer[*length-1] == '\n') *length = *length - 1;
if (posix_pattern_file && *length > 0 && buffer[*length-1] == '\r')
{
*length = *length - 1;
if (*length == 0) return TRUE;
}
return (*length > 0 || *buffer == '\n');
}
/*************************************************
* Find end of line *
@@ -3598,6 +3628,7 @@ switch(letter)
case N_NOJIT: use_jit = FALSE; break;
case N_ALLABSK: extra_options |= PCRE2_EXTRA_ALLOW_LOOKAROUND_BSK; break;
case N_NO_GROUP_SEPARATOR: group_separator = NULL; break;
case N_POSIX_PATFILE: posix_pattern_file = TRUE; break;
case 'a': binary_files = BIN_TEXT; break;
case 'c': count_only = TRUE; break;
case N_POSIX_DIGIT: posix_digit = TRUE; break;
@@ -3808,11 +3839,15 @@ else
filename = name;
}
while ((patlen = read_one_line(buffer, sizeof(buffer), f)) > 0)
while ((patlen = sizeof(buffer)) && read_pattern(buffer, &patlen, f))
{
while (patlen > 0 && isspace((unsigned char)(buffer[patlen-1]))) patlen--;
if (!posix_pattern_file)
{
while (patlen > 0 && isspace((unsigned char)(buffer[patlen-1]))) patlen--;
}
linenumber++;
if (patlen == 0) continue; /* Skip blank lines */
if (!posix_pattern_file && patlen == 0) continue; /* Skip blank lines */
/* Note: this call to add_pattern() puts a pointer to the local variable
"buffer" into the pattern chain. However, that pointer is used only when

1
testdata/grepinputv vendored

@@ -7,3 +7,4 @@ The word is cat in this line
The caterpillar sat on the mat
The snowcat is not an animal
A buried feline in the syndicate
trailing spaces

26
testdata/grepoutput vendored

@@ -464,6 +464,7 @@ The word is cat in this line
The caterpillar sat on the mat
The snowcat is not an animal
A buried feline in the syndicate
trailing spaces
RC=0
---------------------------- Test 52 ------------------------------
fox jumps
@@ -1169,6 +1170,7 @@ The word is cat in this line
The word is cat in this line
The caterpillar sat on the mat
The snowcat is not an animal
A buried feline in the syndicate
trailing spaces
RC=0
@@ -1253,3 +1255,27 @@ RC=0
34:fourteen
35-fifteen
36-sixteen
37-seventeen
RC=0
---------------------------- Test 154 -----------------------------
RC=1
---------------------------- Test 155 -----------------------------
RC=1
---------------------------- Test 156 -----------------------------
The quick brown
fox jumps
over the lazy dog.
This time it jumps and jumps and jumps.
This line contains \E and (regex) *meta* [characters].
The word is cat in this line
The caterpillar sat on the mat
The snowcat is not an animal
A buried feline in the syndicate
trailing spaces
RC=0
---------------------------- Test 157 -----------------------------
RC=0
---------------------------- Test 158 -----------------------------
trailing spaces
RC=0
---------------------------- Test 159 -----------------------------