Mirror of https://github.com/ggml-org/llama.cpp.git (synced 2025-10-31 08:51:55 +00:00)

Commit 525213d2f5
* server: tests: init scenarios
  - health and slots endpoints
  - completion endpoint
  - OAI compatible chat completion requests w/ and without streaming
  - completion multi users scenario
  - multi users scenario on OAI compatible endpoint with streaming
  - multi users with total number of tokens to predict exceeds the KV Cache size
  - server wrong usage scenario, like in Infinite loop of "context shift" #3969
  - slots shifting
  - continuous batching
  - embeddings endpoint
  - multi users embedding endpoint: Segmentation fault #5655
  - OpenAI-compatible embeddings API
  - tokenize endpoint
  - CORS and api key scenario
* server: CI GitHub workflow

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
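The scenarios above drive the server over plain HTTP. As a rough illustration of what the health and slots checks amount to (not part of the test suite), here is a minimal Python sketch; it assumes a llama.cpp server is already running on localhost:8080 and that the default /health and /slots routes are enabled.

```python
# Minimal sketch, not part of the test suite: the kind of check the
# "health and slots endpoints" scenarios perform. Assumes a llama.cpp server
# is already running on localhost:8080 and that the default /health and
# /slots routes are enabled.
import requests

BASE_URL = "http://localhost:8080"  # same host/port the feature file uses

# Health endpoint: the tests wait on this until the server reports it is ready.
health = requests.get(f"{BASE_URL}/health", timeout=5)
print("health:", health.status_code, health.json())

# Slots endpoint: one slot per concurrent completion the server can process,
# which the multi-user scenarios rely on.
slots = requests.get(f"{BASE_URL}/slots", timeout=5)
print("slots:", slots.json())
```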
		
			
				
	
	
		
22 lines · 711 B · Gherkin
# run with ./test.sh --tags wrong_usage
@wrong_usage
Feature: Wrong usage of llama.cpp server

  #3969 The user must always set --n-predict option
  # to cap the number of tokens any completion request can generate
  # or pass n_predict/max_tokens in the request.
  Scenario: Infinite loop
    Given a server listening on localhost:8080
    And   a model file stories260K.gguf
    # Uncomment below to fix the issue
    #And   64 server max tokens to predict
    Then  the server is starting
    Given a prompt:
      """
      Go to: infinite loop
      """
    # Uncomment below to fix the issue
    #And   128 max tokens to predict
    Given concurrent completion requests
    Then all prompts are predicted
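
As the comments in the scenario point out, the infinite loop is avoided either by starting the server with --n-predict or by capping generation per request. Below is a hedged client-side sketch of the per-request fix, assuming a server on localhost:8080; the n_predict and max_tokens field names come from the comments above, while the /v1/chat/completions path is an assumption not spelled out in the feature file.

```python
# Sketch of the per-request fix described in the comments above: cap the number
# of tokens a completion may generate so a runaway prompt cannot loop forever.
# Assumes a server on localhost:8080; n_predict / max_tokens come from the
# comments in the scenario, and the /v1/chat/completions path is an assumption.
import requests

BASE_URL = "http://localhost:8080"

# Native completion endpoint: n_predict limits how many tokens are generated.
resp = requests.post(
    f"{BASE_URL}/completion",
    json={"prompt": "Go to: infinite loop", "n_predict": 128},
    timeout=60,
)
print(resp.json().get("content"))

# OpenAI-compatible chat endpoint: max_tokens plays the same role.
resp = requests.post(
    f"{BASE_URL}/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "Go to: infinite loop"}],
        "max_tokens": 128,
    },
    timeout=60,
)
print(resp.json())
```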