	server : update doc to clarify n_keep when there is bos token (#8619)
@@ -444,7 +444,7 @@ node index.js
 
     `n_predict`: Set the maximum number of tokens to predict when generating text. **Note:** May exceed the set limit slightly if the last token is a partial multibyte character. When 0, no tokens will be generated but the prompt is evaluated into the cache. Default: `-1`, where `-1` is infinity.
 
-    `n_keep`: Specify the number of tokens from the prompt to retain when the context size is exceeded and tokens need to be discarded.
+    `n_keep`: Specify the number of tokens from the prompt to retain when the context size is exceeded and tokens need to be discarded. The number excludes the BOS token.
     By default, this value is set to `0`, meaning no tokens are kept. Use `-1` to retain all tokens from the prompt.
 
     `stream`: It allows receiving each predicted token in real-time instead of waiting for the completion to finish. To enable this, set to `true`.
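The fields documented above (`n_predict`, `n_keep`, `stream`) are request parameters of the server's HTTP completion API. As a rough sketch only, not part of this commit, the following shows how they might be combined in a request; the endpoint path, port, and concrete values are assumptions based on the server's defaults and should be adjusted to your setup.

```python
# Minimal sketch: POST a completion request to a locally running llama.cpp server.
# The URL below (host, port, /completion path) is an assumption, not taken from this commit.
import json
import urllib.request

payload = {
    "prompt": "Building a website can be done in 10 simple steps:",
    "n_predict": 128,   # generate at most 128 tokens
    "n_keep": 32,       # on context overflow, keep the first 32 prompt tokens (BOS excluded)
    "stream": False,    # return the full completion in a single response
}

req = urllib.request.Request(
    "http://127.0.0.1:8080/completion",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["content"])
```

With these settings, once the context window fills up the server discards older tokens but retains the first 32 prompt tokens, and per the clarification added in this commit that count does not include the BOS token.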