@llama.cpp
@infill
Feature: llama.cpp server

  # The current model was made by adding FIM tokens to the existing stories260K model
  # We may want to use a better model in the future, maybe something like SmolLM 360M

  Background: Server startup
    Given a server listening on localhost:8080
    And   a model file tinyllamas/stories260K-infill.gguf from HF repo ggml-org/models
    And   a model file test-model-infill.gguf
    And   a model alias tinyllama-infill
    And   42 as server seed
    And   1024 as batch size
    And   1024 as ubatch size
    And   2048 KV cache size
    And   64 max tokens to predict
    And   0.0 temperature
    Then  the server is starting
    Then  the server is healthy
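
  # Roughly equivalent standalone launch for this background, shown as a sketch only:
  # the test harness prepares test-model-infill.gguf and applies these settings itself,
  # so the exact flag spellings below are assumptions, not what the harness executes.
  #
  #   llama-server -m test-model-infill.gguf --alias tinyllama-infill \
  #       --host localhost --port 8080 --seed 42 -b 1024 -ub 1024 -c 2048 \
  #       -n 64 --temp 0.0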

  Scenario: Infill without input_extra
    Given a prompt "Complete this"
    And   an infill input extra none none
    And   an infill input prefix "#include <cstdio>\n#include \"llama.h\"\n\nint main() {\n    int n_threads = llama_"
    And   an infill input suffix "}\n"
    And   an infill request with no api error
    Then  64 tokens are predicted matching One|day|she|saw|big|scary|bird
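
  # The scenario above corresponds to a POST /infill request shaped roughly like the
  # JSON below (a sketch: field names follow the server's /infill endpoint, values are
  # abbreviated, and n_predict/temperature come from the background settings; treat it
  # as an approximation of what the test harness actually sends).
  #
  #   {
  #     "prompt":       "Complete this",
  #     "input_prefix": "#include <cstdio>\n#include \"llama.h\"\n\nint main() {\n    int n_threads = llama_",
  #     "input_suffix": "}\n",
  #     "n_predict":    64,
  #     "temperature":  0.0
  #   }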

  Scenario: Infill with input_extra
    Given a prompt "Complete this"
    And   an infill input extra "llama.h" "LLAMA_API int32_t llama_n_threads();\n"
    And   an infill input prefix "#include <cstdio>\n#include \"llama.h\"\n\nint main() {\n    int n_threads = llama_"
    And   an infill input suffix "}\n"
    And   an infill request with no api error
    Then  64 tokens are predicted matching cuts|Jimmy|mom|came|into|the|room
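
  # Same request shape as the sketch above, plus extra context chunks; each
  # input_extra entry is assumed to pair a filename with its text, e.g.:
  #
  #   "input_extra": [
  #     { "filename": "llama.h", "text": "LLAMA_API int32_t llama_n_threads();\n" }
  #   ]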