llama-bench : clarify benchmarked parts of the computation (#16823)
@@ -82,6 +82,9 @@ Using the `-d <n>` option, each test can be run at a specified context depth, pr
 
 For a description of the other options, see the [main example](../main/README.md).
 
+> [!NOTE]
+> The measurements with `llama-bench` do not include the times for tokenization and for sampling.
+
 ## Examples
 
 ### Text generation with different models
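The note added above means the reported tokens-per-second figures cover only the evaluation of the prompt-processing (`pp`) and token-generation (`tg`) compute graphs. A minimal sketch of how the two parts can be benchmarked separately using the `-p` and `-n` flags shown in the hunk below (the model path is a placeholder):

```sh
# Prompt processing only: evaluate a 512-token prompt, generate nothing.
$ ./llama-bench -m models/7B/ggml-model-q4_0.gguf -p 512 -n 0

# Token generation only: skip the prompt, generate 128 tokens.
$ ./llama-bench -m models/7B/ggml-model-q4_0.gguf -p 0 -n 128
```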
@@ -131,7 +134,7 @@ $ ./llama-bench -n 0 -n 16 -p 64 -t 1,2,4,8,16,32
 | llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 16 | pp 64 | 33.52 ± 0.03 |
 | llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 16 | tg 16 | 15.32 ± 0.05 |
 | llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 32 | pp 64 | 59.00 ± 1.11 |
-| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 32 | tg 16 | 16.41 ± 0.79 ||
+| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 32 | tg 16 | 16.41 ± 0.79 |
 
 ### Different numbers of layers offloaded to the GPU
 
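The second hunk's context ends at the heading for the GPU offload examples. That README section presumably sweeps the `-ngl` (number of GPU layers) parameter; a minimal sketch under that assumption, again with a placeholder model path:

```sh
# Run one test per -ngl value to compare partial and full GPU offload.
$ ./llama-bench -m models/7B/ggml-model-q4_0.gguf -ngl 0,8,16,24,32
```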