Rewrite llama-run to use llama-server

llama-run works fine, but falls well behind llama-server functionality. Integrate llama-server with llama-run.

Signed-off-by: Eric Curtin <ericcurtin17@gmail.com>
2025-10-27 08:21:30 +00:00 · 2025-09-05 10:46:06 +00:00
parent a81283820a
commit 7b717fb4b2
2 changed files with 380 additions and 1228 deletions
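
The updated README text in this commit says that llama-run starts a llama-server process and passes every option except `--port` through to it. The following is a minimal sketch of that filtering step only, not the code from `run.cpp`. The helper name `filter_server_args` is hypothetical, and the sketch assumes the port is given as two separate arguments (`--port <value>`).

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper, not taken from run.cpp: return a copy of the
// command-line arguments with "--port" and the value after it removed,
// so that the remaining options can be handed to llama-server unchanged.
std::vector<std::string> filter_server_args(const std::vector<std::string> & args) {
    std::vector<std::string> out;
    for (std::size_t i = 0; i < args.size(); ++i) {
        if (args[i] == "--port") {
            ++i;      // also skip the port value, if one follows
            continue;
        }
        out.push_back(args[i]);
    }
    return out;
}
```

In this sketch, a wrapper would call `filter_server_args` on its own arguments, append a port of its own choosing, and launch llama-server with the result. How the actual implementation picks the port and spawns the process is not shown here.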
--- a/tools/run/README.md
+++ b/tools/run/README.md
@@ -3,50 +3,30 @@
 The purpose of this example is to demonstrate a minimal usage of llama.cpp for running models.

 ```bash
-llama-run granite3-moe
+llama-run -hf llama.cpp/example/run
 ```

 ```bash
-Description:
-  Runs a llm
+Usage: llama-run [server-options]

-Usage:
-  llama-run [options] model [prompt]
+This tool starts a llama-server process and provides an interactive chat interface.
+All options except --port are passed through to llama-server.

-Options:
-  -c, --context-size <value>
-      Context size (default: 2048)
-  -n, -ngl, --ngl <value>
-      Number of GPU layers (default: 0)
-  --temp <value>
-      Temperature (default: 0.8)
-  -v, --verbose, --log-verbose
-      Set verbosity level to infinity (i.e. log all messages, useful for debugging)
-  -h, --help
-      Show help message
+Common options:
+  -h, --help                  Show this help
+  -m,    --model FNAME        model path (default: `models/$filename` with filename from `--hf-file`
+                              or `--model-url` if set, otherwise models/7B/ggml-model-f16.gguf)
+  -hf,   -hfr, --hf-repo      <user>/<model>[:quant]
+                              Hugging Face model repository; quant is optional, case-insensitive,
+                              default to Q4_K_M, or falls back to the first file in the repo if
+                              Q4_K_M doesn't exist.
+                              mmproj is also downloaded automatically if available. to disable, add
+                              --no-mmproj
+                              example: unsloth/phi-4-GGUF:q4_k_m
+                              (default: unused)
+  -c, --ctx-size N            Context size
+  -n, --predict N             Number of tokens to predict
+  -t, --threads N             Number of threads

-Commands:
-  model
-      Model is a string with an optional prefix of
-      huggingface:// (hf://), ollama://, https:// or file://.
-      If no protocol is specified and a file exists in the specified
-      path, file:// is assumed, otherwise if a file does not exist in
-      the specified path, ollama:// is assumed. Models that are being
-      pulled are downloaded with .partial extension while being
-      downloaded and then renamed as the file without the .partial
-      extension when complete.
-
-Examples:
-  llama-run llama3
-  llama-run ollama://granite-code
-  llama-run ollama://smollm:135m
-  llama-run hf://QuantFactory/SmolLM-135M-GGUF/SmolLM-135M.Q2_K.gguf
-  llama-run huggingface://bartowski/SmolLM-1.7B-Instruct-v0.2-GGUF/SmolLM-1.7B-Instruct-v0.2-IQ3_M.gguf
-  llama-run ms://QuantFactory/SmolLM-135M-GGUF/SmolLM-135M.Q2_K.gguf
-  llama-run modelscope://bartowski/SmolLM-1.7B-Instruct-v0.2-GGUF/SmolLM-1.7B-Instruct-v0.2-IQ3_M.gguf
-  llama-run https://example.com/some-file1.gguf
-  llama-run some-file2.gguf
-  llama-run file://some-file3.gguf
-  llama-run --ngl 999 some-file4.gguf
-  llama-run --ngl 999 some-file5.gguf Hello World
+For all server options, run: llama-server --help
 ```
--- a/tools/run/run.cpp
+++ b/tools/run/run.cpp