mirror of https://github.com/ggml-org/llama.cpp.git
Rewrite llama-run to use llama-server
llama-run works fine, but falls well behind llama-server in functionality. Integrate llama-server with llama-run.

Signed-off-by: Eric Curtin <ericcurtin17@gmail.com>
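The rewritten tools/run/run.cpp is suppressed further down as too large to display, but the idea can be approximated from the shell. The sketch below is not taken from the commit: the port, the health-polling loop, and the curl call are assumptions; the model repo and the "Hello World" prompt are reused from the README examples in the diff.

```bash
# Rough shell equivalent of what the rewritten llama-run does:
# start llama-server, wait for it, then chat over its OpenAI-compatible API.
llama-server -hf unsloth/phi-4-GGUF:q4_k_m --port 8080 &
server_pid=$!

# /health returns 503 while the model is loading and 200 once it is ready.
until curl -sf http://localhost:8080/health > /dev/null; do
    sleep 1
done

# One chat turn; llama-run wraps this request/response exchange
# in an interactive prompt.
curl -s http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "Hello World"}]}'

kill "$server_pid"
```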
@@ -3,50 +3,30 @@
 The purpose of this example is to demonstrate a minimal usage of llama.cpp for running models.

 ```bash
-llama-run granite3-moe
+llama-run -hf llama.cpp/example/run
 ```

 ```bash
-Description:
-  Runs a llm
+Usage: llama-run [server-options]

-Usage:
-  llama-run [options] model [prompt]
+This tool starts a llama-server process and provides an interactive chat interface.
+All options except --port are passed through to llama-server.

-Options:
-  -c, --context-size <value>
-        Context size (default: 2048)
-  -n, -ngl, --ngl <value>
-        Number of GPU layers (default: 0)
-  --temp <value>
-        Temperature (default: 0.8)
-  -v, --verbose, --log-verbose
-        Set verbosity level to infinity (i.e. log all messages, useful for debugging)
-  -h, --help
-        Show help message
+Common options:
+  -h,  --help              Show this help
+  -m,  --model FNAME       model path (default: `models/$filename` with filename from `--hf-file`
+                           or `--model-url` if set, otherwise models/7B/ggml-model-f16.gguf)
+  -hf, -hfr, --hf-repo <user>/<model>[:quant]
+                           Hugging Face model repository; quant is optional, case-insensitive,
+                           default to Q4_K_M, or falls back to the first file in the repo if
+                           Q4_K_M doesn't exist.
+                           mmproj is also downloaded automatically if available. to disable, add
+                           --no-mmproj
+                           example: unsloth/phi-4-GGUF:q4_k_m
+                           (default: unused)
+  -c,  --ctx-size N        Context size
+  -n,  --predict N         Number of tokens to predict
+  -t,  --threads N         Number of threads

-Commands:
-  model
-        Model is a string with an optional prefix of
-        huggingface:// (hf://), ollama://, https:// or file://.
-        If no protocol is specified and a file exists in the specified
-        path, file:// is assumed, otherwise if a file does not exist in
-        the specified path, ollama:// is assumed. Models that are being
-        pulled are downloaded with .partial extension while being
-        downloaded and then renamed as the file without the .partial
-        extension when complete.
-
-Examples:
-  llama-run llama3
-  llama-run ollama://granite-code
-  llama-run ollama://smollm:135m
-  llama-run hf://QuantFactory/SmolLM-135M-GGUF/SmolLM-135M.Q2_K.gguf
-  llama-run huggingface://bartowski/SmolLM-1.7B-Instruct-v0.2-GGUF/SmolLM-1.7B-Instruct-v0.2-IQ3_M.gguf
-  llama-run ms://QuantFactory/SmolLM-135M-GGUF/SmolLM-135M.Q2_K.gguf
-  llama-run modelscope://bartowski/SmolLM-1.7B-Instruct-v0.2-GGUF/SmolLM-1.7B-Instruct-v0.2-IQ3_M.gguf
-  llama-run https://example.com/some-file1.gguf
-  llama-run some-file2.gguf
-  llama-run file://some-file3.gguf
-  llama-run --ngl 999 some-file4.gguf
-  llama-run --ngl 999 some-file5.gguf Hello World
+For all server options, run: llama-server --help
 ```
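Since the new README states that every option except --port is forwarded to llama-server, server flags can be combined with the Hugging Face shortcut in a single invocation. The example below is illustrative rather than part of the commit: the repo string comes from the README above, while the context size, prediction length, and thread count are arbitrary values.

```bash
# All of these flags are passed through to the underlying llama-server;
# only --port is reserved by llama-run itself.
llama-run -hf unsloth/phi-4-GGUF:q4_k_m --ctx-size 4096 --predict 256 --threads 8
```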
1548  tools/run/run.cpp
File diff suppressed because it is too large