@llama.cpp
@results
Feature: Results

  Background: Server startup
    Given a server listening on localhost:8080
    And   a model file tinyllamas/split/stories15M-00001-of-00003.gguf from HF repo ggml-org/models
    And   a model file test-model-00001-of-00003.gguf
    And   128 as batch size
    And   1024 KV cache size
    And   128 max tokens to predict
    And   continuous batching

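  # With temperature 1.0 sampling is stochastic, so identical completions are
  # only expected when each request supplies the same RNG seed.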
  Scenario Outline: consistent results with same seed
    Given <n_slots> slots
    And   1.0 temperature
    Then  the server is starting
    Then  the server is healthy

    Given 4 prompts "Title: Little Red Riding Hood But In Space\n\nSummary:" with seed 42

    Given concurrent completion requests
    Then the server is busy
    Then the server is idle
    And  all slots are idle
    Then all predictions are equal
    Examples:
      | n_slots |
      | 1       |
      # FIXME: unified KV cache nondeterminism
      # | 2       |

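  # Conversely, the same prompt with four distinct seeds should yield four
  # distinct completions; this would fail if the per-request seed were ignored.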
  Scenario Outline: different results with different seed
    Given <n_slots> slots
    And   1.0 temperature
    Then  the server is starting
    Then  the server is healthy

    Given 1 prompts "Title: Little Red Riding Hood But In Space\n\nSummary:" with seed 42
    Given 1 prompts "Title: Little Red Riding Hood But In Space\n\nSummary:" with seed 43
    Given 1 prompts "Title: Little Red Riding Hood But In Space\n\nSummary:" with seed 44
    Given 1 prompts "Title: Little Red Riding Hood But In Space\n\nSummary:" with seed 45

    Given concurrent completion requests
    Then the server is busy
    Then the server is idle
    And  all slots are idle
    Then all predictions are different
    Examples:
      | n_slots |
      | 1       |
      | 2       |

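  # A baseline completion is generated alone, then the identical request is
  # repeated with <n_parallel> concurrent copies; with a fixed seed the output
  # should not depend on how the requests happen to be batched.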
  Scenario Outline: consistent results with same seed and varying batch size
    Given 4 slots
    And   <temp> temperature
    # And   0 as draft
    Then  the server is starting
    Then  the server is healthy

    Given 1 prompts "Write a very long story about AI." with seed 42
    And   concurrent completion requests
    # Then the server is busy # Not all slots will be utilized.
    Then  the server is idle
    And   all slots are idle

    Given <n_parallel> prompts "Write a very long story about AI." with seed 42
    And   concurrent completion requests
    # Then the server is busy # Not all slots will be utilized.
    Then the server is idle
    And  all slots are idle

    Then all predictions are equal
    Examples:
      | n_parallel | temp |
      | 1          | 0.0  |
      | 1          | 1.0  |
      # FIXME: unified KV cache nondeterminism
      # See https://github.com/ggerganov/whisper.cpp/issues/1941#issuecomment-1986923227
      # and https://github.com/ggerganov/llama.cpp/pull/6122#discussion_r1531405574
      # and https://github.com/ggerganov/llama.cpp/pull/7347 .
      # | 2          | 0.0  |
      # | 4          | 0.0  |
      # | 2          | 1.0  |
      # | 4          | 1.0  |

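  # Same approach as above, but comparing the returned per-token probabilities
  # rather than the sampled text, which exposes smaller numerical divergences.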
  Scenario Outline: consistent token probs with same seed and prompt
    Given <n_slots> slots
    And   <n_kv> KV cache size
    And   1.0 temperature
    And   <n_predict> max tokens to predict
    Then  the server is starting
    Then  the server is healthy

    Given 1 prompts "The meaning of life is" with seed 42
    And   concurrent completion requests
    # Then the server is busy # Not all slots will be utilized.
    Then  the server is idle
    And   all slots are idle

    Given <n_parallel> prompts "The meaning of life is" with seed 42
    And   concurrent completion requests
    # Then the server is busy # Not all slots will be utilized.
    Then the server is idle
    And  all slots are idle

    Then all token probabilities are equal
    Examples:
      | n_slots | n_kv | n_predict | n_parallel |
      | 4       | 1024 | 1         | 1          |
      # FIXME: unified KV cache nondeterminism
      # See https://github.com/ggerganov/whisper.cpp/issues/1941#issuecomment-1986923227
      # and https://github.com/ggerganov/llama.cpp/pull/6122#discussion_r1531405574
      # and https://github.com/ggerganov/llama.cpp/pull/7347 .
      # | 4       | 1024 | 1         | 4          |
      # | 4       | 1024 | 100       | 1          |
      # This test still fails even with the above patches; the first token probabilities are already different.
      # | 4       | 1024 | 100       | 4          |
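
  # To check seed determinism by hand (a sketch, assuming the /completion
  # endpoint documented in examples/server/README.md):
  #   curl -s http://localhost:8080/completion \
  #     -d '{"prompt": "The meaning of life is", "seed": 42, "temperature": 1.0, "n_predict": 16}'
  # Running the same request twice should return identical "content".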