Mirror of https://github.com/ggml-org/llama.cpp.git, synced 2025-11-02 09:12:03 +00:00
* server : add lora hotswap endpoint * handle lora_no_apply * fix build * update docs * clean up struct def * fix build * add LoRA test * fix style
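The feature file below exercises the new endpoint. As a rough illustration of what the hotswap call might look like from a client, here is a minimal sketch; the /lora-adapters route and its {id, scale} payload are assumptions inferred from the commit message, not a contract verified against the server code.

```python
# Minimal sketch (assumed API): toggle a LoRA adapter on a running
# llama.cpp server via the hotswap endpoint added in this change.
# The /lora-adapters path and its JSON schema are guesses based on the
# commit message, not a verified contract.
import requests

BASE_URL = "http://localhost:8080"

# Assumed: GET lists the adapters the server was started with.
print(requests.get(f"{BASE_URL}/lora-adapters").json())

# Assumed: POST takes a list of {id, scale}; scale 0.0 disables adapter 0,
# a non-zero scale re-enables it without restarting the server.
requests.post(f"{BASE_URL}/lora-adapters", json=[{"id": 0, "scale": 0.0}])
requests.post(f"{BASE_URL}/lora-adapters", json=[{"id": 0, "scale": 1.0}])
```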
		
			
				
	
	
		
@llama.cpp
@lora
Feature: llama.cpp server

  Background: Server startup
    Given a server listening on localhost:8080
    And   a model url https://huggingface.co/ggml-org/stories15M_MOE/resolve/main/stories15M_MOE-F16.gguf
    And   a model file stories15M_MOE-F16.gguf
    And   a model alias stories15M_MOE
    And   a lora adapter file from https://huggingface.co/ggml-org/stories15M_MOE/resolve/main/moe_shakespeare15M.gguf
    And   42 as server seed
    And   1024 as batch size
    And   1024 as ubatch size
    And   2048 KV cache size
    And   64 max tokens to predict
    And   0.0 temperature
    Then  the server is starting
    Then  the server is healthy

  Scenario: Completion LoRA disabled
    Given switch off lora adapter 0
    Given a prompt:
    """
    Look in thy glass
    """
    And   a completion request with no api error
    Then  64 tokens are predicted matching little|girl|three|years|old

  Scenario: Completion LoRA enabled
    Given switch on lora adapter 0
    Given a prompt:
    """
    Look in thy glass
    """
    And   a completion request with no api error
    Then  64 tokens are predicted matching eye|love|glass|sun
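For reference, the two scenarios above can be replayed by hand against a server started with the Background settings. The sketch below reuses the assumed /lora-adapters route from the note above together with the server's /completion endpoint (prompt, n_predict and temperature fields, with the generated text returned in "content"); the expected substrings are taken verbatim from the scenarios.

```python
# Hedged sketch: replay the two LoRA scenarios manually. The /lora-adapters
# schema is assumed; /completion with prompt/n_predict/temperature and a
# "content" field in the response follows the llama.cpp server examples.
import re
import requests

BASE_URL = "http://localhost:8080"

def complete(prompt: str) -> str:
    # 0.0 temperature and 64 predicted tokens, matching the Background.
    resp = requests.post(
        f"{BASE_URL}/completion",
        json={"prompt": prompt, "n_predict": 64, "temperature": 0.0},
    )
    resp.raise_for_status()
    return resp.json()["content"]

# Scenario: Completion LoRA disabled -> base (stories) model behaviour.
requests.post(f"{BASE_URL}/lora-adapters", json=[{"id": 0, "scale": 0.0}])
assert re.search(r"little|girl|three|years|old", complete("Look in thy glass"))

# Scenario: Completion LoRA enabled -> Shakespeare-flavoured behaviour.
requests.post(f"{BASE_URL}/lora-adapters", json=[{"id": 0, "scale": 1.0}])
assert re.search(r"eye|love|glass|sun", complete("Look in thy glass"))
```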