Server tests

Python-based server test scenarios using pytest.

Tests target GitHub workflow job runners with 4 vCPUs.

Note: if the host machine's inference speed is faster than that of the GitHub runners, parallel scenarios may fail intermittently. To mitigate this, increase the values of n_predict and kv_size (see the sketch below).
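
For orientation, here is a minimal sketch of what a test in this suite can look like. It assumes the helpers currently provided by utils.py (ServerPreset, start, make_request); treat the exact names and attributes as assumptions and check utils.py if they have changed.

import pytest
from utils import ServerPreset

# assumption: ServerPreset.tinyllama2() is the small-model preset used by the existing tests
server = ServerPreset.tinyllama2()

@pytest.fixture(autouse=True)
def create_server():
    global server
    server = ServerPreset.tinyllama2()
    # raising n_predict is one way to make parallel scenarios less timing-sensitive on fast hosts
    server.n_predict = 128

def test_completion_returns_content():
    server.start()
    res = server.make_request("POST", "/completion", data={
        "prompt": "I believe the meaning of life is",
        "n_predict": 8,
    })
    assert res.status_code == 200
    assert len(res.body["content"]) > 0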

Install dependencies

pip install -r requirements.txt

Run tests

  1. Build the server

cd ../../..
cmake -B build
cmake --build build --target llama-server

  2. Start the tests: ./tests.sh

It's possible to override some scenario step values with environment variables:

PORT: sets context.server_port, the listening port of the server during the scenario (default: 8080)
LLAMA_SERVER_BIN_PATH: overrides the server binary path (default: ../../../build/bin/llama-server)
DEBUG: enables verbose mode for the test steps and the server (--verbose)
N_GPU_LAYERS: number of model layers to offload to VRAM (-ngl, --n-gpu-layers)
LLAMA_CACHE: by default the server tests re-download models into the tmp subfolder; set this to your cache directory (e.g. $HOME/Library/Caches/llama.cpp on Mac or $HOME/.cache/llama.cpp on Unix) to avoid re-downloading
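
As a rough illustration of the mechanism only, and not a copy of the framework's actual code, these overrides amount to plain environment lookups with the defaults listed above:

import os

# Variable names and defaults come from the list above; reading them via
# os.environ as shown here is an assumption about how the framework consumes them.
server_port = int(os.environ.get("PORT", "8080"))
server_bin = os.environ.get("LLAMA_SERVER_BIN_PATH", "../../../build/bin/llama-server")
verbose = "DEBUG" in os.environ
n_gpu_layers = int(os.environ.get("N_GPU_LAYERS", "0"))  # assumed default of 0
llama_cache = os.environ.get("LLAMA_CACHE")  # optional model cache directory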

To run slow tests (this will download many models; make sure to set LLAMA_CACHE if needed):

SLOW_TESTS=1 ./tests.sh

To run with stdout/stderr displayed in real time (verbose output, but useful for debugging):

DEBUG=1 ./tests.sh -s -v -x

To run all the tests in a file:

./tests.sh unit/test_chat_completion.py -v -x

To run a single test:

./tests.sh unit/test_chat_completion.py::test_invalid_chat_completion_req

Hint: You can compile and run the tests in a single command, which is useful for local development:

cmake --build build -j --target llama-server && ./tools/server/tests/tests.sh

To see all available arguments, please refer to the pytest documentation.

Debugging external llama-server

It can sometimes be useful to run the server in a debugger when investigating test failures. To do this, set the environment variable DEBUG_EXTERNAL=1, which causes the tests to skip starting a llama-server themselves. Instead, the server can be started manually in a debugger.

Example using gdb:

$ gdb --args ../../../build/bin/llama-server \
    --host 127.0.0.1 --port 8080 \
    --temp 0.8 --seed 42 \
    --hf-repo ggml-org/models --hf-file tinyllamas/stories260K.gguf \
    --batch-size 32 --no-slots --alias tinyllama-2 --ctx-size 512 \
    --parallel 2 --n-predict 64

A breakpoint can be set before running:

(gdb) br server.cpp:4604
(gdb) r
main: server is listening on http://127.0.0.1:8080 - starting the main loop
srv  update_slots: all slots are idle

And then the test in question can be run in another terminal:

(venv) $ env DEBUG_EXTERNAL=1 ./tests.sh unit/test_chat_completion.py -v -x

And this should trigger the breakpoint and allow inspection of the server state in the debugger terminal.