# Server benchmark tools

The benchmark is based on k6.
## Install k6

Follow the installation instructions at: https://k6.io/docs/get-started/installation/
Example for Ubuntu:

```shell
snap install k6
```
## Download a dataset
This dataset was originally proposed in vLLM benchmarks.
```shell
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
```
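The dataset is a JSON export of ShareGPT conversations. As a hedged illustration only (the `conversations`, `from`, and `value` field names follow the ShareGPT layout, and the selection logic is an assumption, not necessarily what `script.js` does), a k6 script can load it once in the init context and keep the first human turn of each conversation as the prompt:

```js
import { SharedArray } from 'k6/data';

// Sketch: load the ShareGPT dataset once and share it across virtual users.
const prompts = new SharedArray('prompts', function () {
    return JSON.parse(open('./ShareGPT_V3_unfiltered_cleaned_split.json'))
        // keep only conversations that start with a human message
        .filter(entry => entry.conversations && entry.conversations.length > 0
            && entry.conversations[0].from === 'human')
        // use the first human message as the prompt
        .map(entry => entry.conversations[0].value);
});

export default function () {
    // pick a random prompt for this iteration
    const prompt = prompts[Math.floor(Math.random() * prompts.length)];
}
```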
## Download a model
Example for PHI-2:

```shell
../../../scripts/hf.sh --repo ggml-org/models --file phi-2/ggml-model-q4_0.gguf
```
## Start the server
The server must answer OAI Chat completion requests on `http://localhost:8080/v1` or on the URL set in the `SERVER_BENCH_URL` environment variable.
Example:

```shell
server --host localhost --port 8080 \
  --model ggml-model-q4_0.gguf \
  --cont-batching \
  --metrics \
  --parallel 8 \
  --batch-size 512 \
  --ctx-size 4096 \
  --log-format text \
  -ngl 33
```
## Run the benchmark

For 500 chat completion requests with 8 concurrent users over a maximum of 10 minutes, run:
```shell
k6 run script.js --duration 10m --iterations 500 --vus 8
```
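Under the hood, each k6 virtual user (VU) sends one chat completion request per iteration. The snippet below is a minimal sketch of such a request, not the actual `script.js` (which adds dataset sampling and custom metrics); the hard-coded prompt and model alias are placeholders:

```js
import http from 'k6/http';
import { check } from 'k6';

// URL prefix, overridable as described below.
const server_url = __ENV.SERVER_BENCH_URL || 'http://localhost:8080/v1';

export default function () {
    const payload = {
        model: 'my-model', // placeholder model alias
        messages: [{ role: 'user', content: 'Write a haiku about llamas.' }],
        max_tokens: 512,
    };
    const res = http.post(`${server_url}/chat/completions`, JSON.stringify(payload), {
        headers: { 'Content-Type': 'application/json' },
    });
    check(res, { 'completion is successful': r => r.status === 200 });
}
```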
The benchmark values can be overridden with:
- `SERVER_BENCH_URL`: server URL prefix for chat completions, default: `http://localhost:8080/v1`
- `SERVER_BENCH_N_PROMPTS`: total prompts to randomly select in the benchmark, default: `480`
- `SERVER_BENCH_MODEL_ALIAS`: model alias to pass in the completion request, default: `my-model`
- `SERVER_BENCH_MAX_TOKENS`: max tokens to predict, default: `512`
- `SERVER_BENCH_DATASET`: path to the benchmark dataset file
- `SERVER_BENCH_MAX_PROMPT_TOKENS`: maximum prompt tokens to filter out in the dataset, default: `1024`
- `SERVER_BENCH_MAX_CONTEXT`: maximum context size of the completion requests to filter out in the dataset (prompt + predicted tokens), default: `2048`
Note: the local tokenizer is just a plain whitespace split (see the sketch below), so the real number of tokens will differ.
Or with k6 options:

```shell
SERVER_BENCH_N_PROMPTS=500 k6 run script.js --duration 10m --iterations 500 --vus 8
```
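To make the tokenizer note concrete, the fragments below sketch how such variables might be read via `__ENV` and how a plain whitespace split can serve as a rough token estimate to filter prompts; this is an illustration under those assumptions, not a copy of the `script.js` internals:

```js
// Defaults mirror the values documented above; __ENV values are strings, hence Number().
const max_prompt_tokens = Number(__ENV.SERVER_BENCH_MAX_PROMPT_TOKENS || 1024);
const max_context       = Number(__ENV.SERVER_BENCH_MAX_CONTEXT || 2048);
const max_tokens        = Number(__ENV.SERVER_BENCH_MAX_TOKENS || 512);

// Rough token estimate: a plain whitespace split, so the real token count will differ.
function estimateTokens(prompt) {
    return prompt.split(' ').length;
}

// Keep only prompts that fit both the prompt budget and the overall context budget.
function keepPrompt(prompt) {
    const n = estimateTokens(prompt);
    return n <= max_prompt_tokens && n + max_tokens <= max_context;
}
```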
To debug HTTP requests, use `--http-debug="full"`.
## Metrics

The following metrics are available, computed from the OAI chat completions response `usage`:
- `llamacpp_tokens_second`: Trend of `usage.total_tokens` / request duration
- `llamacpp_prompt_tokens`: Trend of `usage.prompt_tokens`
- `llamacpp_prompt_tokens_total_counter`: Counter of `usage.prompt_tokens`
- `llamacpp_completion_tokens`: Trend of `usage.completion_tokens`
- `llamacpp_completion_tokens_total_counter`: Counter of `usage.completion_tokens`
- `llamacpp_completions_truncated_rate`: Rate of completions truncated, i.e. `finish_reason === 'length'`
- `llamacpp_completions_stop_rate`: Rate of completions stopped by the model, i.e. `finish_reason === 'stop'`
The script will fail if too many completions are truncated, see `llamacpp_completions_truncated_rate`.
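These are k6 custom metrics declared in the script and fed from each response. The fragments below sketch how that can be done with `Trend`, `Counter`, and `Rate` from `k6/metrics`, together with a threshold that fails the run when too many completions are truncated; the 10% threshold and the `recordUsage` helper are assumptions for illustration, not the exact `script.js` code:

```js
import { Trend, Counter, Rate } from 'k6/metrics';

// Custom metrics matching the names listed above.
const tokens_second           = new Trend('llamacpp_tokens_second');
const prompt_tokens           = new Trend('llamacpp_prompt_tokens');
const prompt_tokens_total     = new Counter('llamacpp_prompt_tokens_total_counter');
const completion_tokens       = new Trend('llamacpp_completion_tokens');
const completion_tokens_total = new Counter('llamacpp_completion_tokens_total_counter');
const truncated_rate          = new Rate('llamacpp_completions_truncated_rate');
const stop_rate               = new Rate('llamacpp_completions_stop_rate');

// Fail the run if more than 10% of completions are truncated (the exact threshold is an assumption).
export const options = {
    thresholds: {
        llamacpp_completions_truncated_rate: ['rate<0.1'],
    },
};

// Hypothetical helper: feed the metrics from a parsed chat completion response
// and the request duration in seconds.
export function recordUsage(body, duration_s) {
    const usage = body.usage;
    const finish_reason = body.choices[0].finish_reason;

    tokens_second.add(usage.total_tokens / duration_s);
    prompt_tokens.add(usage.prompt_tokens);
    prompt_tokens_total.add(usage.prompt_tokens);
    completion_tokens.add(usage.completion_tokens);
    completion_tokens_total.add(usage.completion_tokens);
    truncated_rate.add(finish_reason === 'length');
    stop_rate.add(finish_reason === 'stop');
}
```

Calling `recordUsage(res.json(), res.timings.duration / 1e3)` from a request like the sketch shown earlier would populate these metrics in the k6 summary.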
The k6 metrics can be compared against the server's own metrics, with:

```shell
curl http://localhost:8080/metrics
```