llama-bench : clarify benchmarked parts of the computation (#16823)

This commit is contained in:
Georgi Gerganov
2025-10-28 19:41:43 +02:00
committed by GitHub
parent 8284efc35c
commit a8ca18b4b8


@@ -82,6 +82,9 @@ Using the `-d <n>` option, each test can be run at a specified context depth, pr
For a description of the other options, see the [main example](../main/README.md).
> [!NOTE]
> The measurements with `llama-bench` do not include the times for tokenization and for sampling.
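A typical invocation combining these options might look as follows; this is a hedged sketch (the model path is hypothetical), showing prompt processing and generation benchmarked at a nonzero context depth, with tokenization and sampling excluded from the reported timings as noted above.

```shell
# Benchmark pp 512 and tg 128 at context depth 4096
# (model path is illustrative; adjust to your local file)
./llama-bench -m models/7B/ggml-model-q4_0.gguf -d 4096 -p 512 -n 128
```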
## Examples
### Text generation with different models
@@ -131,7 +134,7 @@ $ ./llama-bench -n 0 -n 16 -p 64 -t 1,2,4,8,16,32
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 16 | pp 64 | 33.52 ± 0.03 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 16 | tg 16 | 15.32 ± 0.05 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 32 | pp 64 | 59.00 ± 1.11 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 32 | tg 16 | 16.41 ± 0.79 |
### Different numbers of layers offloaded to the GPU