This commit adds support for using an externally started llama-server instance for the server tests. This can be enabled by setting the DEBUG_EXTERNAL environment variable. The motivation for this is to allow debugging of the server itself when investigating a test failure. Instructions for how to do this are added to the README.md file in the tests directory.
# Server tests
Python-based server test scenarios using pytest.
Tests target GitHub workflows job runners with 4 vCPU.

Note: If the host's inference speed is faster than that of the GitHub runners, the parallel scenarios may randomly fail. To mitigate this, you can increase the values of `n_predict` and `kv_size`.
## Install dependencies
```shell
pip install -r requirements.txt
```
## Run tests

- Build the server:

  ```shell
  cd ../../..
  cmake -B build
  cmake --build build --target llama-server
  ```

- Start the test:

  ```shell
  ./tests.sh
  ```
Some scenario step values can be overridden with environment variables:
| variable | description |
|---|---|
| `PORT` | `context.server_port` to set the listening port of the server during the scenario, default: `8080` |
| `LLAMA_SERVER_BIN_PATH` | to change the server binary path, default: `../../../build/bin/llama-server` |
| `DEBUG` | to enable steps and server verbose mode `--verbose` |
| `N_GPU_LAYERS` | number of model layers to offload to VRAM (`-ngl`, `--n-gpu-layers`) |
| `LLAMA_CACHE` | by default the server tests re-download models to the `tmp` subfolder; set this to your cache directory (e.g. `$HOME/Library/Caches/llama.cpp` on Mac or `$HOME/.cache/llama.cpp` on Unix) to avoid this |
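For example, to run the suite against a server on a different listening port with all model layers offloaded to the GPU (the specific values here are only illustrative):

```shell
PORT=8081 N_GPU_LAYERS=99 ./tests.sh
```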
To run slow tests (will download many models, make sure to set `LLAMA_CACHE` if needed):

```shell
SLOW_TESTS=1 ./tests.sh
```
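For example, pointing `LLAMA_CACHE` at an existing cache directory (adjust the path to your platform) avoids re-downloading the models:

```shell
LLAMA_CACHE=$HOME/.cache/llama.cpp SLOW_TESTS=1 ./tests.sh
```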
To run with stdout/stderr displayed in real time (verbose output, but useful for debugging):

```shell
DEBUG=1 ./tests.sh -s -v -x
```
To run all the tests in a file:

```shell
./tests.sh unit/test_chat_completion.py -v -x
```
To run a single test:

```shell
./tests.sh unit/test_chat_completion.py::test_invalid_chat_completion_req
```
Hint: You can compile and run the tests in a single command, which is useful for local development:

```shell
cmake --build build -j --target llama-server && ./tools/server/tests/tests.sh
```
To see all available arguments, please refer to the pytest documentation.
## Debugging external llama-server
It can sometimes be useful to run the server in a debugger when investigating test failures. To do this, set the environment variable `DEBUG_EXTERNAL=1`, which causes the tests to skip starting a llama-server instance themselves. Instead, the server can be started manually, for example in a debugger.
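For breakpoints to resolve to source lines, the server binary should carry debug symbols. One way to get them (using the standard CMake build type option; adjust to your local build setup) is:

```shell
cmake -B build -DCMAKE_BUILD_TYPE=Debug
cmake --build build --target llama-server
```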
Example using gdb:

```console
$ gdb --args ../../../build/bin/llama-server \
    --host 127.0.0.1 --port 8080 \
    --temp 0.8 --seed 42 \
    --hf-repo ggml-org/models --hf-file tinyllamas/stories260K.gguf \
    --batch-size 32 --no-slots --alias tinyllama-2 --ctx-size 512 \
    --parallel 2 --n-predict 64
```
A breakpoint can then be set before running:

```console
(gdb) br server.cpp:4604
(gdb) r
main: server is listening on http://127.0.0.1:8080 - starting the main loop
srv  update_slots: all slots are idle
```
The test in question can then be run in another terminal:

```console
(venv) $ env DEBUG_EXTERNAL=1 ./tests.sh unit/test_chat_completion.py -v -x
```

This should trigger the breakpoint and allow the server state to be inspected in the debugger terminal.