Commit Graph

5 Commits

Author SHA1 Message Date
Andrew Gallant
813c676eca searcher: fix roll buffer bug
This commit fixes a subtle bug in how the line buffer was rolling its
contents. Specifically, when ripgrep searches without memory maps,
it uses a "roll" buffer for incremental line oriented search without
needing to read the entire file into memory at once. The roll buffer
works by reading a chunk of bytes from the file into memory, and then
searching everything in that buffer up to the last `\n` byte. The bytes
*after* the last `\n` byte are preserved, since they likely correspond
to *part* of the next line. Once ripgrep is done searching the buffer,
it "rolls" the buffer such that the start of the next line is at the
beginning of the buffer, and then ripgrep reads more data into the
buffer starting at the (possibly) partial end of that line.

The implication of this strategy, necessarily so, is that a buffer must
be big enough to fit a single line in memory. This is because the regex
engine needs a contiguous block of memory to search, so there is no way
to search anything smaller than a single line. So if a file contains a
single line with 7.5 million bytes, then the buffer will grow to be at
least that size. (Many files have super long lines like this, but they
tend to be *binary* files, which ripgrep will detect and stop searching
unless the user forces it with the `-a/--text` flag. So in practice,
they aren't usually a problem. However, in this case, #1335 found a case
where a plain text file had a line with 7.5 million bytes.)

Now, for performance reasons, ripgrep reuses these buffers across its
search. Typically, it will create `N` of these line buffers when it
starts (where `N` is the number of threads it is using), and then reuse
them without creating any new ones as it searches through files.

This means that if you search a file with a very long line, that buffer
will expand to be big enough to store that line. ripgrep never contracts
these buffers, so once it searches the next file, ripgrep will continue
to use this large buffer. While it might be prudent to contract these
buffers in some circumstances, this isn't otherwise inherently a
problem. The memory has already been allocated, and there isn't much
cost to using it, other than the fact that ripgrep hangs on to it and
never gives it back to the OS.

However, the `roll` implementation described above had a really
important bug in it that was impacted by the size of the buffer.
Specifically, it used the following to "roll" the partial line at the
end of the buffer to the beginning:

    self.buf.copy_within_str(self.pos.., 0);

Which means that if the buffer is very large, ripgrep will copy
*everything* from `self.pos` (which might be very small, e.g., for small
files) to the end of the buffer, and move it to the beginning of the
buffer. This will happen repeatedly each time the buffer is used to
search small files, which winds up being quite a large slow down if the
line was exceptionally large (say, megabytes).

It turns out that copying everything is completely unnecessary. We only
need to copy the remainder of the last read to the beginning of the
buffer. Everything *after* the last read in the buffer is just free
space that can be filled for the next read. So, all we need to do is
copy just those bytes:

    self.buf.copy_within_str(self.pos..self.end, 0);

... which is typically much much smaller than the rest of the buffer.

This was likely also causing small performance losses in other cases as
well. For example, when searching a lot of small files, ripgrep would
likely do a lot more copying than necessary. Although, given that the
default buffer size is 8KB, this extra copying was likely pretty small,
and was thus harder to observe.

Fixes #1335
2019-08-02 07:23:27 -04:00
Andrew Gallant
b93762ea7a bstr: update everything to bstr 0.2 2019-06-26 16:47:33 -04:00
Andrew Gallant
a7d26c8f14 binary: rejigger ripgrep's handling of binary files
This commit attempts to surface binary filtering in a slightly more
user friendly way. Namely, before, ripgrep would silently stop
searching a file if it detected a NUL byte, even if it had previously
printed a match. This can lead to the user quite reasonably assuming
that there are no more matches, since a partial search is fairly
unintuitive. (ripgrep has this behavior by default because it really
wants to NOT search binary files at all, just like it doesn't search
gitignored or hidden files.)

With this commit, if a match has already been printed and ripgrep detects
a NUL byte, then it will print a warning message indicating that the search
stopped prematurely.

Moreover, this commit adds a new flag, --binary, which causes ripgrep to
stop filtering binary files, but in a way that still avoids dumping
binary data into terminals. That is, the --binary flag makes ripgrep
behave more like grep's default behavior.

For files explicitly specified in a search, e.g., `rg foo some-file`,
then no binary filtering is applied (just like no gitignore and no
hidden file filtering is applied). Instead, ripgrep behaves as if you
gave the --binary flag for all explicitly given files.

This was a fairly invasive change, and potentially increases the UX
complexity of ripgrep around binary files. (Before, there were two
binary modes, where as now there are three.) However, ripgrep is now a
bit louder with warning messages when binary file detection might
otherwise be hiding potential matches, so hopefully this is a net
improvement.

Finally, the `-uuu` convenience now maps to `--no-ignore --hidden
--binary`, since this is closer to the actualy intent of the
`--unrestricted` flag, i.e., to reduce ripgrep's smart filtering. As a
consequence, `rg -uuu foo` should now search roughly the same number of
bytes as `grep -r foo`, and `rg -uuua foo` should search roughly the
same number of bytes as `grep -ra foo`. (The "roughly" weasel word is
used because grep's and ripgrep's binary file detection might differ
somewhat---perhaps based on buffer sizes---which can impact exactly what
is and isn't searched.)

See the numerous tests in tests/binary.rs for intended behavior.

Fixes #306, Fixes #855
2019-04-14 19:29:27 -04:00
Andrew Gallant
7dcbff9a9b searcher: partially migrate to bstr
This commit causes grep-searcher to use byte strings internally for its
line buffer support. We manage to remove a use of `unsafe` by doing this
(by pushing it down into `bstr`).

We stop short of using byte strings everywhere else because we rely
heavily on the `impl ops::Index<[u8]> for grep_matcher::Match` impl,
which isn't available for byte strings. (It is premature to make bstr a
public dep of a core crate like grep-matcher, but maybe some day.)
2019-04-05 23:24:08 -04:00
Andrew Gallant
d9ca529356 libripgrep: initial commit introducing libripgrep
libripgrep is not any one library, but rather, a collection of libraries
that roughly separate the following key distinct phases in a grep
implementation:

  1. Pattern matching (e.g., by a regex engine).
  2. Searching a file using a pattern matcher.
  3. Printing results.

Ultimately, both (1) and (3) are defined by de-coupled interfaces, of
which there may be multiple implementations. Namely, (1) is satisfied by
the `Matcher` trait in the `grep-matcher` crate and (3) is satisfied by
the `Sink` trait in the `grep2` crate. The searcher (2) ties everything
together and finds results using a matcher and reports those results
using a `Sink` implementation.

Closes #162
2018-08-20 07:10:19 -04:00