mirror of https://github.com/ggml-org/llama.cpp.git (synced 2025-10-30 08:42:00 +00:00)

Commit 9e359a4f47
* server: #5655 - continue to update other slots on embedding concurrent request.
* server: tests: add multi users embeddings as fixed
* server: tests: adding OAI compatible embedding concurrent endpoint
* server: tests: adding OAI compatible embedding with multiple inputs

83 lines · 2.9 KiB · Gherkin
@llama.cpp
Feature: llama.cpp server

  Background: Server startup
    Given a server listening on localhost:8080
    And   a model file stories260K.gguf
    And   a model alias tinyllama-2
    And   42 as server seed
      # KV Cache corresponds to the total amount of tokens
      # that can be stored across all independent sequences: #4130
      # see --ctx-size and #5568
    And   32 KV cache size
    And   1 slots
    And   embeddings extraction
    And   32 server max tokens to predict
    Then  the server is starting
    Then  the server is healthy

  Scenario: Health
    Then the server is ready
    And  all slots are idle
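
The Background above boots the test server and the Health scenario then polls it until it is ready and idle. As a quick manual check of the same thing, here is a minimal sketch that queries the server's health route, assuming the server is already running on localhost:8080 and exposes the /health endpoint documented in the llama.cpp server README (the exact response fields are not specified by this file):

    import requests

    # Assumption: the server configured in the Background is running on localhost:8080.
    resp = requests.get("http://localhost:8080/health")
    resp.raise_for_status()
    print(resp.json())  # expected to report a healthy/ready status once model loading is done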

  Scenario Outline: Completion
    Given a prompt <prompt>
    And   <n_predict> max tokens to predict
    And   a completion request with no api error
    Then  <n_predicted> tokens are predicted matching <re_content>

    Examples: Prompts
      | prompt                           | n_predict | re_content                   | n_predicted |
      | I believe the meaning of life is | 8         | read                         | 8           |
      | Write a joke about AI            | 64        | (park<or>friends<or>scared)+ | 32          |
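
Each row of the Examples table drives one plain completion request against the server. A rough equivalent of the first row as a raw HTTP call, assuming the /completion route and the "prompt"/"n_predict" field names from the llama.cpp server README:

    import requests

    payload = {
        "prompt": "I believe the meaning of life is",
        "n_predict": 8,  # n_predict column of the first Examples row
    }
    resp = requests.post("http://localhost:8080/completion", json=payload)
    resp.raise_for_status()
    # The scenario asserts the generated text matches the re_content pattern ("read")
    # and that the expected number of tokens was predicted.
    print(resp.json().get("content"))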

  Scenario Outline: OAI Compatibility
    Given a model <model>
    And   a system prompt <system_prompt>
    And   a user prompt <user_prompt>
    And   <max_tokens> max tokens to predict
    And   streaming is <enable_streaming>
    Given an OAI compatible chat completions request with no api error
    Then  <n_predicted> tokens are predicted matching <re_content>

    Examples: Prompts
      | model        | system_prompt               | user_prompt                          | max_tokens | re_content                 | n_predicted | enable_streaming |
      | llama-2      | Book                        | What is the best book                | 8          | (Mom<or>what)+             | 8           | disabled         |
      | codellama70b | You are a coding assistant. | Write the fibonacci function in c++. | 64         | (thanks<or>happy<or>bird)+ | 32          | enabled          |
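
The OAI Compatibility outline goes through the OpenAI-style chat completions route instead of the native one. A sketch of the first table row as a request, assuming the standard /v1/chat/completions schema (model, messages, max_tokens, stream):

    import requests

    payload = {
        "model": "llama-2",
        "messages": [
            {"role": "system", "content": "Book"},
            {"role": "user", "content": "What is the best book"},
        ],
        "max_tokens": 8,
        "stream": False,  # the first row runs with streaming disabled
    }
    resp = requests.post("http://localhost:8080/v1/chat/completions", json=payload)
    resp.raise_for_status()
    print(resp.json()["choices"][0]["message"]["content"])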

  Scenario: Embedding
    When embeddings are computed for:
    """
    What is the capital of Bulgaria ?
    """
    Then embeddings are generated
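
The Embedding scenario uses the server's native embedding route, which only works because the Background enabled embeddings extraction. A sketch, assuming the /embedding endpoint with a "content" request field as described in the llama.cpp server README (the response shape is an assumption):

    import requests

    resp = requests.post(
        "http://localhost:8080/embedding",
        json={"content": "What is the capital of Bulgaria ?"},
    )
    resp.raise_for_status()
    embedding = resp.json()["embedding"]  # assumption: a flat list of floats
    print(len(embedding))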

  Scenario: OAI Embeddings compatibility
    Given a model tinyllama-2
    When an OAI compatible embeddings computation request for:
    """
    What is the capital of Spain ?
    """
    Then embeddings are generated
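
This scenario targets the OpenAI-compatible embeddings route that the commit message refers to. A sketch of the single-input case, assuming the standard /v1/embeddings request shape (model, input) and an OpenAI-style response:

    import requests

    payload = {
        "model": "tinyllama-2",  # the alias set in the Background
        "input": "What is the capital of Spain ?",
    }
    resp = requests.post("http://localhost:8080/v1/embeddings", json=payload)
    resp.raise_for_status()
    data = resp.json()["data"]  # OpenAI-style list of embedding objects
    print(len(data), len(data[0]["embedding"]))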

  Scenario: OAI Embeddings compatibility with multiple inputs
    Given a model tinyllama-2
    Given a prompt:
      """
      In which country Paris is located ?
      """
    And a prompt:
      """
      Is Madrid the capital of Spain ?
      """
    When an OAI compatible embeddings computation request for multiple inputs
    Then embeddings are generated
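
The multi-input scenario corresponds to the "OAI compatible embedding with multiple inputs" item in the commit message: the "input" field carries a list and one embedding is expected per entry. A sketch under the same assumptions as the previous example:

    import requests

    payload = {
        "model": "tinyllama-2",
        "input": [
            "In which country Paris is located ?",
            "Is Madrid the capital of Spain ?",
        ],
    }
    resp = requests.post("http://localhost:8080/v1/embeddings", json=payload)
    resp.raise_for_status()
    for item in resp.json()["data"]:
        # assumption: OpenAI-style "index" and "embedding" fields per input
        print(item["index"], len(item["embedding"]))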

  Scenario: Tokenize / Detokenize
    When tokenizing:
    """
    What is the capital of France ?
    """
    Then tokens can be detokenize
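
The last scenario round-trips a string through the tokenizer. A sketch, assuming the /tokenize and /detokenize routes with "content"/"tokens" fields as documented in the llama.cpp server README:

    import requests

    base = "http://localhost:8080"
    text = "What is the capital of France ?"

    tokens = requests.post(f"{base}/tokenize", json={"content": text}).json()["tokens"]
    detok = requests.post(f"{base}/detokenize", json={"tokens": tokens}).json()["content"]

    # The scenario only checks that detokenization succeeds; leading whitespace may differ.
    print(tokens)
    print(detok)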