@llama.cpp
@server
Feature: llama.cpp server

  Background: Server startup
    Given a server listening on localhost:8080
    And   a model file tinyllamas/stories260K.gguf from HF repo ggml-org/models
    And   a model file test-model.gguf
    And   a model alias tinyllama-2
    And   BOS token is 1
    And   42 as server seed
      # KV Cache corresponds to the total amount of tokens
      # that can be stored across all independent sequences: #4130
      # see --ctx-size and #5568
    And   256 KV cache size
    And   32 as batch size
    And   2 slots
    And   64 server max tokens to predict
    And   prometheus compatible metrics exposed
    Then  the server is starting
    Then  the server is healthy
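    # Illustrative note (not an executable step): the background above is
    # roughly what the test harness passes to the server binary, e.g.
    #   server -m stories260K.gguf --alias tinyllama-2 --seed 42 \
    #          --ctx-size 256 --batch-size 32 --parallel 2 --n-predict 64 --metrics
    # With --ctx-size 256 shared across --parallel 2 slots, each slot gets a
    # 128-token context (see #4130 / #5568); the exact flags are an assumption.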

  Scenario: Health
    Then the server is ready
    And  all slots are idle
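    # Illustrative note: readiness and slot state are observable over HTTP;
    # a manual check could be:  curl http://localhost:8080/health
    # The status reported there is what "ready" and "idle" assert against.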

  Scenario Outline: Completion
    Given a prompt <prompt>
    And   <n_predict> max tokens to predict
    And   a completion request with no api error
    Then  <n_predicted> tokens are predicted matching <re_content>
    And   the completion is <truncated> truncated
    And   <n_prompt> prompt tokens are processed
    And   prometheus metrics are exposed
    And   metric llamacpp:tokens_predicted is <n_predicted>

    Examples: Prompts
      | prompt                                                                    | n_predict | re_content                                  | n_prompt | n_predicted | truncated |
      | I believe the meaning of life is                                          | 8         | (read\|going)+                              | 18       | 8           | not       |
      | Write a joke about AI from a very long prompt which will not be truncated | 256       | (princesses\|everyone\|kids\|Anna\|forest)+ | 46       | 64          | not       |
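    # Illustrative note: "a completion request" corresponds to the server's
    # /completion endpoint; a hand-rolled equivalent of the first example row
    # might look like (request shape assumed from the server docs):
    #   curl http://localhost:8080/completion \
    #     -d '{"prompt": "I believe the meaning of life is", "n_predict": 8}'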

  Scenario: Completion prompt truncated
    Given a prompt:
    """
    Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
    Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
    Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.
    Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
    """
    And   a completion request with no api error
    Then  64 tokens are predicted matching fun|Annaks|popcorns|pictry|bowl
    And   the completion is  truncated
    And   109 prompt tokens are processed
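    # Illustrative note: truncation follows from the background numbers above.
    # A 256-token KV cache split across 2 slots leaves 128 tokens of context
    # per slot; the prompt plus 64 tokens to predict does not fit, so the
    # server truncates the prompt, and 109 prompt tokens end up processed.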

  Scenario Outline: OAI Compatibility
    Given a model <model>
    And   a system prompt <system_prompt>
    And   a user prompt <user_prompt>
    And   <max_tokens> max tokens to predict
    And   streaming is <enable_streaming>
    Given an OAI compatible chat completions request with no api error
    Then  <n_predicted> tokens are predicted matching <re_content>
    And   <n_prompt> prompt tokens are processed
    And   the completion is <truncated> truncated

    Examples: Prompts
      | model        | system_prompt               | user_prompt                          | max_tokens | re_content                        | n_prompt | n_predicted | enable_streaming | truncated |
      | llama-2      | Book                        | What is the best book                | 8          | (Here\|what)+                     | 77       | 8           | disabled         | not       |
      | codellama70b | You are a coding assistant. | Write the fibonacci function in c++. | 128        | (thanks\|happy\|bird\|Annabyear)+ | -1       | 64          | enabled          |           |
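    # Illustrative note: these steps target the OpenAI-compatible endpoint
    # /v1/chat/completions; a sketch of the first example row as a raw
    # request (field names follow the OpenAI chat schema):
    #   curl http://localhost:8080/v1/chat/completions \
    #     -H "Content-Type: application/json" \
    #     -d '{"model": "llama-2",
    #          "messages": [{"role": "system", "content": "Book"},
    #                       {"role": "user", "content": "What is the best book"}],
    #          "max_tokens": 8, "stream": false}'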

  Scenario Outline: OAI Compatibility w/ response format
    Given a model test
    And   a system prompt test
    And   a user prompt test
    And   a response format <response_format>
    And   10 max tokens to predict
    Given an OAI compatible chat completions request with no api error
    Then  <n_predicted> tokens are predicted matching <re_content>

    Examples: Prompts
      | response_format                                                     | n_predicted | re_content             |
      | {"type": "json_object", "schema": {"const": "42"}}                  | 5           | "42"                   |
      | {"type": "json_object", "schema": {"items": [{"type": "integer"}]}} | 10          | \[ -300 \]             |
      | {"type": "json_object"}                                             | 10          | \{ " Jacky.            |

  Scenario: Tokenize / Detokenize
    When tokenizing:
    """
    What is the capital of France ?
    """
    Then tokens can be detokenized
    And  tokens do not begin with BOS
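    # Illustrative note: tokenization round-trips through two endpoints,
    # e.g. (request bodies assumed from the server docs):
    #   curl http://localhost:8080/tokenize   -d '{"content": "What is the capital of France ?"}'
    #   curl http://localhost:8080/detokenize -d '{"tokens": [...]}'
    # By default no special tokens are added, hence no leading BOS.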

  Scenario: Tokenize w/ BOS
    Given adding special tokens
    When  tokenizing:
    """
    What is the capital of Germany?
    """
    Then  tokens begin with BOS
    Given first token is removed
    Then  tokens can be detokenized
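    # Illustrative note: "adding special tokens" is assumed to set
    # "add_special": true on the /tokenize request, so the tokenizer
    # pre-inserts BOS (token 1 per the background); dropping that first
    # token should therefore detokenize back to the original text.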

  Scenario: Models available
    Given available models
    Then  1 models are supported
    Then  model 0 is identified by tinyllama-2
    Then  model 0 is trained on 128 tokens context
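    # Illustrative note: this is assumed to map to the OpenAI-compatible
    # model listing, e.g. curl http://localhost:8080/v1/models; the single
    # entry should carry the alias tinyllama-2 from the background, and the
    # 128-token figure would come from the model's training-context metadata.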