llama-bench: add --devices and --list-devices support (#16039)

* * llama-bench: add --devices support - Support --devices same as llama-server - Provide for benchmarking different device combinations - Include --list-devices like llama-server for convenience * fix: field display ordering restored * fix: integrated the rpc devices - aimed to mimic the server as much as possible * cleanup: defaults for list-devices - handle dup device listing with RPC * cleanup: remove dup device load calls * docs: update llama-bench - added the recently added n-cpu-moe option to the docs while in there * llama-bench: rpc device simplification * rpc servers unify with other devices earlier, simplifying code * --list-devices made stateless and simpler * various cleanup
2025-10-27 08:21:30 +00:00 · 2025-09-19 15:15:21 -07:00
parent f432d8d83e
commit be79d9fdd9
2 changed files with 182 additions and 76 deletions
--- a/tools/llama-bench/README.md
+++ b/tools/llama-bench/README.md
@@ -30,8 +30,10 @@ options:
  --delay <0...N> (seconds)                 delay between each test (default: 0)
  -o, --output <csv|json|jsonl|md|sql>      output format printed to stdout (default: md)
  -oe, --output-err <csv|json|jsonl|md|sql> output format printed to stderr (default: none)
+  --list-devices                            list available devices and exit
  -v, --verbose                             verbose output
  --progress                                print test progress indicators
+  -rpc, --rpc <rpc_servers>                 register RPC devices (comma separated)

 test parameters:
  -m, --model <filename>                    (default: models/7B/ggml-model-q4_0.gguf)
@@ -48,11 +50,12 @@ test parameters:
  --cpu-strict <0|1>                        (default: 0)
  --poll <0...100>                          (default: 50)
  -ngl, --n-gpu-layers <n>                  (default: 99)
-  -rpc, --rpc <rpc_servers>                 (default: none)
+  -ncmoe, --n-cpu-moe <n>                   (default: 0)
  -sm, --split-mode <none|layer|row>        (default: layer)
  -mg, --main-gpu <i>                       (default: 0)
  -nkvo, --no-kv-offload <0|1>              (default: 0)
  -fa, --flash-attn <0|1>                   (default: 0)
+  -dev, --device <dev0/dev1/...>            (default: auto)
  -mmp, --mmap <0|1>                        (default: 1)
  -embd, --embeddings <0|1>                 (default: 0)
  -ts, --tensor-split <ts0/ts1/..>          (default: 0)