* requirements : update transformers/torch for Embedding Gemma
This commit updates the requirements to support converting
Embedding Gemma 300m models.
The motivation for this change is that during development I used a
local copy of the transformers package to convert the models. This was
a mistake on my part; I should also have updated my transformers
version to the official release.
I had checked the requirements/requirements-convert_legacy_llama.txt
file and noted that the version was >=4.45.1,<5.0.0, and came to the
conclusion that no update would be needed. This assumed that
Embedding Gemma would be in a transformers release by the time
Commit fb15d649ed ("llama : add support
for EmbeddingGemma 300m (#15798)") was merged, so anyone wanting to
convert the models themselves would be able to do so. However,
Embedding Gemma is currently only available in a preview release of
transformers, and this commit updates the requirements to use that
preview release.
* resolve additional python dependencies
* fix pyright errors in tokenizer test and remove unused import
Server tests
Python-based server test scenarios using pytest.
Tests target GitHub workflow job runners with 4 vCPUs.
Note: If inference on the host is faster than on the GitHub runners, the parallel scenario may randomly fail.
To mitigate this, you can increase the values of n_predict and kv_size.
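For illustration, here is a minimal sketch of what a more generous generation budget can look like from a test's point of view, using plain requests against a locally running llama-server started with a larger context. The endpoint path, field names, and values are assumptions for this sketch, not the suite's own fixtures:

```python
# Minimal sketch (not the suite's own helpers): query a locally running
# llama-server and allow more predicted tokens so slower hosts can finish.
# Assumes the server was started with a larger context, e.g. --ctx-size 4096.
import os
import requests

SERVER_URL = f"http://127.0.0.1:{os.environ.get('PORT', '8080')}"

def test_completion_with_generous_budget():
    resp = requests.post(
        f"{SERVER_URL}/completion",
        json={
            "prompt": "Write a short haiku about testing.",
            "n_predict": 128,   # larger prediction budget than strictly needed
        },
        timeout=120,            # generous timeout for slower hosts
    )
    assert resp.status_code == 200
    assert "content" in resp.json()
```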
Install dependencies
pip install -r requirements.txt
Run tests
- Build the server
cd ../../..
cmake -B build
cmake --build build --target llama-server
- Start the test:
./tests.sh
It's possible to override some scenario step values with environment variables:
| variable | description |
|---|---|
| PORT | context.server_port to set the listening port of the server during the scenario, default: 8080 |
| LLAMA_SERVER_BIN_PATH | to change the server binary path, default: ../../../build/bin/llama-server |
| DEBUG | to enable steps and server verbose mode --verbose |
| N_GPU_LAYERS | number of model layers to offload to VRAM -ngl --n-gpu-layers |
| LLAMA_CACHE | by default server tests re-download models to the tmp subfolder. Set this to your cache (e.g. $HOME/Library/Caches/llama.cpp on Mac or $HOME/.cache/llama.cpp on Unix) to avoid this |
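As a rough sketch of how a harness might pick these variables up (illustrative only; the actual test utilities have their own configuration code):

```python
# Illustrative only: one plausible way to read the variables above.
# The real test suite has its own configuration handling.
import os

server_port  = int(os.environ.get("PORT", "8080"))
server_bin   = os.environ.get("LLAMA_SERVER_BIN_PATH", "../../../build/bin/llama-server")
debug        = "DEBUG" in os.environ
n_gpu_layers = int(os.environ.get("N_GPU_LAYERS", "0"))
cache_dir    = os.environ.get("LLAMA_CACHE", "tmp")

# Assemble server arguments from the environment.
server_args = [server_bin, "--port", str(server_port), "-ngl", str(n_gpu_layers)]
if debug:
    server_args.append("--verbose")
```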
To run slow tests (will download many models, make sure to set LLAMA_CACHE if needed):
SLOW_TESTS=1 ./tests.sh
To run with stdout/stderr display in real time (verbose output, but useful for debugging):
DEBUG=1 ./tests.sh -s -v -x
To run all the tests in a file:
./tests.sh unit/test_chat_completion.py -v -x
To run a single test:
./tests.sh unit/test_chat_completion.py::test_invalid_chat_completion_req
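To give a feel for what such a unit test checks, here is an illustrative sketch of an invalid-request test written with plain requests and pytest. The real test_invalid_chat_completion_req uses the suite's own server fixtures, and the payloads below are just examples of malformed input:

```python
# Illustrative sketch only; the real test uses the suite's server fixtures.
# Sends malformed chat completion requests and expects an error response.
import os
import pytest
import requests

SERVER_URL = f"http://127.0.0.1:{os.environ.get('PORT', '8080')}"

@pytest.mark.parametrize("bad_messages", [
    None,             # missing messages entirely
    "not-a-list",     # wrong type
    [123],            # list of non-objects
])
def test_invalid_chat_completion_req_sketch(bad_messages):
    resp = requests.post(
        f"{SERVER_URL}/v1/chat/completions",
        json={"messages": bad_messages},
        timeout=30,
    )
    assert resp.status_code >= 400
```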
Hint: You can compile and run the tests in a single command, which is useful for local development:
cmake --build build -j --target llama-server && ./tools/server/tests/tests.sh
To see all available arguments, please refer to the pytest documentation.