10 Commits

Author SHA1 Message Date
Andrew Gallant
8c6595c215 searcher: fix performance bug with -A/--after-context when searching stdin
This was a crazy subtle bug where ripgrep could slow down exponentially
as increasingly larger values of `-A/--after-context` were used. But,
interestingly, this would only occur when searching `stdin` and _not_
when searching the same data as a regular file.

This confounded me because ripgrep, pretty early on, erases the
difference between searching a single file and `stdin`. So it wasn't
like there were different code paths. And I mistakenly assumed that they
would otherwise behave the same as they are just treated as streams.

But... it turns out that running `read` on `stdin` versus a regular
file seems to behave differently. At least on my Linux system, with
`stdin`, `read` never seems to fill the buffer with more than 64K. But
with a regular file, `read` pretty reliably fills the caller's buffer
to its full declared capacity.

Of course, it is expected that `read` doesn't *have* to fill up the
caller's buffer, and ripgrep is generally fine with that. But when
`-A/--after-context` is used with a very large value---big enough that
the default buffer capacity is too small---then more heap memory needs
to be allocated to correctly handle all cases. This can result in
passing buffers bigger than 64K to `read`.

While we *correctly* handle `read` calls that don't fill the buffer,
it turns out that if we don't fill the buffer, then we get into a
pathological case where we aren't processing as many bytes as we could.
That is, because of the `-A/--after-context` causing us to keep a lot of
bytes around while we roll the buffer and because reading from `stdin`
gives us fewer bytes than normal, we weren't amortizing our `read` calls
as well as we should have been. Indeed, our buffer capacity increases
specifically take this amortization into account, but we weren't taking
advantage of it.

We fix this by putting `read` into an inner loop that ensures our
buffer gets filled up. This fixes the performance bug:

```
$ (time rg ZQZQZQZQZQ bigger.txt --no-mmap -A9999) | wc -l

real    1.330
user    0.767
sys     0.559
maxmem  29 MB
faults  0
10000

$ cat bigger.txt | (time rg ZQZQZQZQZQ --no-mmap -A9999) | wc -l

real    2.355
user    0.860
sys     0.613
maxmem  29 MB
faults  0
10000

$ (time rg ZQZQZQZQZQ bigger.txt --no-mmap -A99999) | wc -l

real    3.636
user    3.091
sys     0.537
maxmem  29 MB
faults  0
100000

$ cat bigger.txt | (time rg ZQZQZQZQZQ --no-mmap -A99999) | wc -l

real    4.918
user    3.236
sys     0.710
maxmem  29 MB
faults  0
100000

$ (time rg ZQZQZQZQZQ bigger.txt --no-mmap -A999999) | wc -l

real    5.430
user    4.666
sys     0.750
maxmem  51 MB
faults  0
1000000

$ cat bigger.txt | (time rg ZQZQZQZQZQ --no-mmap -A999999) | wc -l

real    6.894
user    4.907
sys     0.850
maxmem  51 MB
faults  0
1000000
```

For comparison, here is GNU grep:

```
$ cat bigger.txt | (time grep ZQZQZQZQZQ -A9999) | wc -l

real    1.466
user    0.159
sys     0.839
maxmem  29 MB
faults  0
10000

$ cat bigger.txt | (time grep ZQZQZQZQZQ -A99999) | wc -l

real    1.663
user    0.166
sys     0.941
maxmem  29 MB
faults  0
100000

$ cat bigger.txt | (time grep ZQZQZQZQZQ -A999999) | wc -l

real    1.631
user    0.204
sys     0.910
maxmem  29 MB
faults  0
1000000
```

GNU grep is still notably faster. We'll fix that in the next commit.
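
As an illustration of the inner `read` loop described above, here is a minimal
sketch. It is not the actual ripgrep code: `read_until_full`, `Chunked`, and
`demo` are hypothetical names, and the 64K per-call cap only mimics the
`stdin` behavior reported in this message.

```rust
use std::io::{self, Read};

/// Hypothetical helper (not ripgrep's real routine): call `read` repeatedly
/// until `buf` is completely full or the reader reports EOF.
fn read_until_full<R: Read>(rdr: &mut R, buf: &mut [u8]) -> io::Result<usize> {
    let mut filled = 0;
    while filled < buf.len() {
        match rdr.read(&mut buf[filled..]) {
            // A zero-length read means EOF, so stop early.
            Ok(0) => break,
            Ok(n) => filled += n,
            // Retry if the call was interrupted by a signal.
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        }
    }
    Ok(filled)
}

/// Hypothetical reader that returns at most `max` bytes per `read` call,
/// standing in for a pipe that never hands back more than 64K at once.
struct Chunked<R> {
    inner: R,
    max: usize,
}

impl<R: Read> Read for Chunked<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let n = buf.len().min(self.max);
        self.inner.read(&mut buf[..n])
    }
}

/// Returns (bytes from one plain `read`, bytes from the inner loop) when
/// filling a 150,000 byte buffer from a reader capped at 64K per call.
fn demo() -> io::Result<(usize, usize)> {
    let data = vec![b'a'; 200_000];
    let mut buf = vec![0u8; 150_000];

    let mut rdr = Chunked { inner: &data[..], max: 65_536 };
    let single = rdr.read(&mut buf)?;

    let mut rdr = Chunked { inner: &data[..], max: 65_536 };
    let looped = read_until_full(&mut rdr, &mut buf)?;

    Ok((single, looped))
}
```

With the capped reader, a single `read` leaves most of the enlarged buffer
empty, while the loop fills it before the caller processes anything, which is
the amortization the larger buffer capacity was meant to provide.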

Fixes #3184
2025-10-14 14:27:43 -04:00
Riccardo Attilio Galli
c93fc793a0 searcher: add more tests for replace_bytes
... and add a comment explaining an optimization.

Closes #2729
2025-09-19 21:08:19 -04:00
Andrew Gallant
59212d08d3 style: fix new lints
The Rust compiler seems to have gotten smarter at finding unused or
redundant imports.
2024-03-07 09:37:48 -05:00
Andrew Gallant
9ed7565fcb cli: error when searching for NUL
Basically, unless the -a/--text flag is given, it is generally always an
error to search for an explicit NUL byte because the binary detection
will prevent it from matching.

Fixes #1838
2023-11-25 15:03:53 -05:00
Andrew Gallant
f7ff34fdf9 searcher: simplify 'replace_bytes' routine
I did this in the course of trying to optimize it. I don't believe I
made it any faster, but the refactoring led to code that I think is
more readable.
2023-10-09 20:29:52 -04:00
Andrew Gallant
d53b7310ee searcher: polish
This updates some dependencies and brings code style in line with my
current practice.
2023-10-09 20:29:52 -04:00
Andrew Gallant
ac8fecbbf2 deps: upgrade bstr to 1.1
2023-01-05 08:21:15 -05:00
aricha1940
1c3eebefec searcher: update outdated comment for buffer size
Looks like this was accidentally left set to 8 in commit 46fb77c.

PR #1839
2021-03-31 08:18:38 -04:00
Andrew Gallant
46fb77c20c searcher: bump buffer size
This increases the initial buffer size from 8KB to 64KB. This actually
leads to a reasonably noticeable improvement in at least one work-load,
and is unlikely to regress in any other case. Also, since Rust programs
(at least on Linux) seem to always use a minimum of 6-8MB of memory,
adding an extra 56KB is negligible.

Before:

    $ hyperfine -i "rg 'zqzqzqzq' OpenSubtitles2018.raw.en --no-mmap"
    Benchmark #1: rg 'zqzqzqzq' OpenSubtitles2018.raw.en --no-mmap
      Time (mean ± σ):      2.109 s ±  0.012 s    [User: 565.5 ms, System: 1541.6 ms]
      Range (min … max):    2.094 s …  2.128 s    10 runs

After:

    $ hyperfine -i "rg 'zqzqzqzq' OpenSubtitles2018.raw.en --no-mmap"
    Benchmark #1: rg 'zqzqzqzq' OpenSubtitles2018.raw.en --no-mmap
      Time (mean ± σ):      1.802 s ±  0.006 s    [User: 462.3 ms, System: 1337.9 ms]
      Range (min … max):    1.795 s …  1.814 s    10 runs
2021-03-23 17:45:02 -04:00
Andrew Gallant
fdd8510fdd repo: move all source code in crates directory
The top-level listing was just getting a bit too long for my taste. So
put all of the code in one directory and shrink the large top-level mess
to a small top-level mess.

NOTE: This commit only contains renames. The subsequent commit will
actually make ripgrep build again. We do it this way with the naive hope
that this will make it easier for git history to track the renames.
Sigh.
2020-02-17 19:24:53 -05:00