@llama.cpp
@slotsave
Feature: llama.cpp server slot management

  Background: Server startup
    Given a server listening on localhost:8080
    And   a model file tinyllamas/stories260K.gguf from HF repo ggml-org/models
    And   prompt caching is enabled
    And   2 slots
    And   . as slot save path
    And   2048 KV cache size
    And   42 as server seed
    And   24 max tokens to predict
    Then  the server is starting
    Then  the server is healthy
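
  # The Background above corresponds to llama-server startup options. A rough
  # sketch of the equivalent invocation (flag spellings assumed, verify with
  # `llama-server --help`):
  #   llama-server -m stories260K.gguf --host localhost --port 8080 \
  #                --parallel 2 --slot-save-path . --ctx-size 2048 \
  #                --seed 42 --n-predict 24
  # "prompt caching is enabled" is per-request rather than a server flag: the
  # test harness is assumed to send "cache_prompt": true with each completion
  # so that matching prompt prefixes are reused from the KV cache.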

  Scenario: Save and Restore Slot
    # First prompt in slot 1 should be fully processed
    Given a user prompt "What is the capital of France?"
    And   using slot id 1
    And   a completion request with no api error
    Then  24 tokens are predicted matching (Lily|cake)
    And   22 prompt tokens are processed
    When  the slot 1 is saved with filename "slot1.bin"
    Then  the server responds with status code 200
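    # Saving drives the server's slots endpoint, roughly:
    #   POST /slots/1?action=save   with JSON body {"filename": "slot1.bin"}
    # The snapshot file is written under the configured slot save path
    # ("." in the Background above).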
    # Since we have cache, this should only process the last tokens
    Given a user prompt "What is the capital of Germany?"
    And   a completion request with no api error
    Then  24 tokens are predicted matching (Thank|special)
    And   7 prompt tokens are processed
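    # Only 7 tokens this time: the France and Germany prompts share the
    # prefix "What is the capital of ", so with prompt caching the server
    # reuses the cached prefix and evaluates only the differing tail tokens.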
    # Loading the original cache into slot 0,
    # we should only be processing 1 prompt token and get the same output
    When  the slot 0 is restored with filename "slot1.bin"
    Then  the server responds with status code 200
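    # Restoring uses the same endpoint with a different action, roughly:
    #   POST /slots/0?action=restore   with JSON body {"filename": "slot1.bin"}
    # The check below expects 1 processed token rather than 0 because the
    # final prompt token is always re-evaluated to produce fresh logits.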
    Given a user prompt "What is the capital of France?"
    And   using slot id 0
    And   a completion request with no api error
    Then  24 tokens are predicted matching (Lily|cake)
    And   1 prompt tokens are processed
    # To verify that slot 1 was not corrupted by the slot 0 restore,
    # repeat its request and expect the same fully-cached behaviour
    Given a user prompt "What is the capital of Germany?"
    And   using slot id 1
    And   a completion request with no api error
    Then  24 tokens are predicted matching (Thank|special)
    And   1 prompt tokens are processed

  Scenario: Erase Slot
    Given a user prompt "What is the capital of France?"
    And   using slot id 1
    And   a completion request with no api error
    Then  24 tokens are predicted matching (Lily|cake)
    And   22 prompt tokens are processed
    When  the slot 1 is erased
    Then  the server responds with status code 200
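    # Erasing discards slot 1's KV cache server-side, roughly:
    #   POST /slots/1?action=erase
    # With the cache gone, resending the identical prompt must process all
    # 22 prompt tokens again, which the steps below check.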
    Given a user prompt "What is the capital of France?"
    And   a completion request with no api error
    Then  24 tokens are predicted matching (Lily|cake)
    And   22 prompt tokens are processed
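
# A minimal Python sketch of the same save/restore round trip against a
# running server. This is illustrative client code, not part of the test
# suite; the completion field for slot selection is assumed to be "id_slot"
# (older server builds spelled it "slot_id"):
#
#   import requests
#
#   base = "http://localhost:8080"
#   requests.post(f"{base}/completion",
#                 json={"prompt": "What is the capital of France?",
#                       "id_slot": 1, "cache_prompt": True, "n_predict": 24})
#   requests.post(f"{base}/slots/1", params={"action": "save"},
#                 json={"filename": "slot1.bin"})
#   requests.post(f"{base}/slots/0", params={"action": "restore"},
#                 json={"filename": "slot1.bin"})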