This commit attempts to surface binary filtering in a slightly more
user friendly way. Namely, before, ripgrep would silently stop
searching a file if it detected a NUL byte, even if it had previously
printed a match. This can lead to the user quite reasonably assuming
that there are no more matches, since a partial search is fairly
unintuitive. (ripgrep has this behavior by default because it really
wants to NOT search binary files at all, just like it doesn't search
gitignored or hidden files.)
With this commit, if a match has already been printed and ripgrep detects
a NUL byte, then it will print a warning message indicating that the search
stopped prematurely.
Moreover, this commit adds a new flag, --binary, which causes ripgrep to
stop filtering binary files, but in a way that still avoids dumping
binary data into terminals. That is, the --binary flag makes ripgrep
behave more like grep's default behavior.
For files explicitly specified in a search, e.g., `rg foo some-file`,
then no binary filtering is applied (just like no gitignore and no
hidden file filtering is applied). Instead, ripgrep behaves as if you
gave the --binary flag for all explicitly given files.
This was a fairly invasive change, and potentially increases the UX
complexity of ripgrep around binary files. (Before, there were two
binary modes, where as now there are three.) However, ripgrep is now a
bit louder with warning messages when binary file detection might
otherwise be hiding potential matches, so hopefully this is a net
improvement.
Finally, the `-uuu` convenience now maps to `--no-ignore --hidden
--binary`, since this is closer to the actualy intent of the
`--unrestricted` flag, i.e., to reduce ripgrep's smart filtering. As a
consequence, `rg -uuu foo` should now search roughly the same number of
bytes as `grep -r foo`, and `rg -uuua foo` should search roughly the
same number of bytes as `grep -ra foo`. (The "roughly" weasel word is
used because grep's and ripgrep's binary file detection might differ
somewhat---perhaps based on buffer sizes---which can impact exactly what
is and isn't searched.)
See the numerous tests in tests/binary.rs for intended behavior.
Fixes#306, Fixes#855
We initially did not have this impl because the first revision of the Sink
trait was much more complicated. In particular, each method was
parameterized over a Matcher. But not every Sink impl actually needs a
Matcher, and it is just as easy to borrow a Matcher explicitly, so the
added parameterization wasn't holding its own.
This does permit Sink implementations to be used as trait objects. One
key use case here is to reduce compile times, since there is quite a bit
of code inside grep-searcher that is parameterized on Sink. Unfortunately,
that code is *also* parameterized on Matcher, and the various printers in
grep-printer are also parameterized on Matcher, which means Sink trait
objects are necessary but no sufficient for a major reduction in compile
times. Unfortunately, the path to making Matcher object safe isn't quite
clear. Extension traits maybe? There's also stuff in the Serde ecosystem
that might help, but the type shenanigans can get pretty gnarly.
libripgrep is not any one library, but rather, a collection of libraries
that roughly separate the following key distinct phases in a grep
implementation:
1. Pattern matching (e.g., by a regex engine).
2. Searching a file using a pattern matcher.
3. Printing results.
Ultimately, both (1) and (3) are defined by de-coupled interfaces, of
which there may be multiple implementations. Namely, (1) is satisfied by
the `Matcher` trait in the `grep-matcher` crate and (3) is satisfied by
the `Sink` trait in the `grep2` crate. The searcher (2) ties everything
together and finds results using a matcher and reports those results
using a `Sink` implementation.
Closes#162