Radoslav Gerganov 
						2b6b55a59f 
server : include usage statistics only when the user requests them (#16052)
* server : include usage statistics only when the user requests them
When serving the OpenAI-compatible API, we should check whether
{"stream_options": {"include_usage": true}} is set in the request when
deciding whether to send usage statistics.
Closes #16048
* add unit test
						2025-09-18 10:36:57 +00:00 
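For context, a minimal sketch of how a client opts into usage statistics under this change, assuming a llama.cpp server listening on localhost:8080 (the model name is a placeholder; llama.cpp serves whatever model it was started with):

```python
# Minimal sketch: opt into usage statistics on a streamed request.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key-required")

stream = client.chat.completions.create(
    model="local",  # placeholder; llama.cpp ignores it, but the SDK requires it
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
    # Without this option, the server now omits usage statistics entirely.
    stream_options={"include_usage": True},
)

for chunk in stream:
    if chunk.usage is not None:  # populated only on the final chunk
        print("prompt:", chunk.usage.prompt_tokens,
              "completion:", chunk.usage.completion_tokens)
```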
								Xuan-Son Nguyen 
						61bdfd5298 
server : implement prompt processing progress report in stream mode (#15827)
* server : implement `return_progress`
* add timings.cache_n
* add progress.time_ms
* add test
* fix test for chat/completions
* readme: add docs on timings
* use ggml_time_us
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
						2025-09-06 13:35:04 +02:00 
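A rough sketch of how a client might consume these progress reports, assuming a llama.cpp server on localhost:8080; the `prompt_progress` field name here is a guess from the commit's `progress.*` additions, so check the server README for the real response layout:

```python
# Rough sketch: request prompt-processing progress in stream mode.
import json
import requests

payload = {
    "messages": [{"role": "user", "content": "Summarize this long prompt..."}],
    "stream": True,
    "return_progress": True,  # field added by this change
}

with requests.post("http://localhost:8080/v1/chat/completions",
                   json=payload, stream=True) as resp:
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data: ") or line == b"data: [DONE]":
            continue
        chunk = json.loads(line[len(b"data: "):])
        # Hypothetical field name; the server docs define the actual layout.
        progress = chunk.get("prompt_progress")
        if progress:
            print("progress:", progress)
```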
								Xuan-Son Nguyen 
						a68d914426 
server: add exceed_context_size_error type (#15780)
						* server: add exceed_context_size_error type
* change error code to 400 
						2025-09-04 11:50:23 +02:00 
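A short sketch of handling the new error type on the client side, assuming llama.cpp's usual JSON error envelope (an `error` object with `code`, `type`, and `message` fields):

```python
# Sketch: detect the new error type when a prompt overflows the context.
import requests

payload = {"messages": [{"role": "user", "content": "x" * 1_000_000}]}
resp = requests.post("http://localhost:8080/v1/chat/completions", json=payload)

if resp.status_code == 400:  # this commit pins the error to HTTP 400
    err = resp.json().get("error", {})
    if err.get("type") == "exceed_context_size_error":
        print("prompt too long:", err.get("message"))
```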
								teo 
						1bc664a26a 
server: fix OpenAI API compatibility for usage statistics in chat streams (#15444)
						2025-08-21 00:10:08 +02:00 
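For reference, a sketch of the OpenAI streaming convention this fix targets: content-bearing chunks carry `choices` and no usage, and when usage is requested one final extra chunk arrives with `usage` set and an empty `choices` array:

```python
# Sketch of the OpenAI convention for usage in streams.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key-required")
stream = client.chat.completions.create(
    model="local", stream=True, stream_options={"include_usage": True},
    messages=[{"role": "user", "content": "Hi"}],
)

for chunk in stream:
    if chunk.choices:                 # normal content-bearing chunks
        print(chunk.choices[0].delta.content or "", end="")
    elif chunk.usage:                 # final extra chunk: empty choices, usage set
        print("\ntotal tokens:", chunk.usage.total_tokens)
```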
								Lukas Straub 
						a9f77a8be3 
server : add openai-style logit_bias support (#14946)
Signed-off-by: Lukas Straub <lukasstraub2@web.de>
						2025-07-31 14:08:23 +02:00 
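A minimal sketch of the OpenAI-style parameter this adds: `logit_bias` maps token IDs (as strings) to a bias in [-100, 100], where the extremes effectively ban or force a token. The token IDs below are placeholders; look them up with the loaded model's tokenizer:

```python
# Minimal sketch: OpenAI-style logit_bias against a llama.cpp server.
import requests

payload = {
    "messages": [{"role": "user", "content": "Name a color."}],
    "logit_bias": {
        "15339": -100,  # placeholder ID; -100 effectively bans this token
        "9906": 5,      # placeholder ID; mildly encourages this token
    },
}
resp = requests.post("http://localhost:8080/v1/chat/completions", json=payload)
print(resp.json()["choices"][0]["message"]["content"])
```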
								Sigbjørn Skjæret 
						ddef99522d 
server : fix assistant prefilling when content is an array (#14360)
						2025-07-05 09:17:14 +02:00 
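A sketch of the case this fixes: a conversation ending with an assistant message whose content is an array of parts rather than a plain string, which the server should treat as a prefill to continue from:

```python
# Sketch: assistant prefill where content is an array of parts.
import requests

payload = {
    "messages": [
        {"role": "user", "content": "Write a haiku about autumn."},
        # Trailing assistant message = prefill; note the array-of-parts form.
        {"role": "assistant", "content": [{"type": "text", "text": "Crisp leaves"}]},
    ],
}
resp = requests.post("http://localhost:8080/v1/chat/completions", json=payload)
# The completion should continue from "Crisp leaves" rather than restart.
print(resp.json()["choices"][0]["message"]["content"])
```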
								Olivier Chafik 
						f5cd27b71d 
server: streaming of tool calls and thoughts when --jinja is on (#12379)
						* add common_json w/ support for truncated json healing
* add common_chat_msg_diff
* partial common_chat_parse
* refactor parser w/ optionals
* server: wire chat diffs in stream mode
* fix trigger of thinking models (must happen after thoughts are closed)
* fix functionary v3.2 raw python!
* rename: common_chat_syntax (now contains format)
* rm common_regex.at_start
* don't return empty <think></think>
* accommodate yet another deepseek r1 distill fantasy syntax (`<|tool▁calls|>`)
* fix QwQ 32B tool call parsing after thoughts (hermes2)
* better logs for grammar triggers
* consume spaces after parse_json_tool_calls
* fix required tool calls w/ thinking models that have pre-opened thinking tags
* fix thinking model's initial trigger + test qwq's template
* run most test_tool_call tests in stream + non-stream modes
* make functionary v3.2 parsing more strict (differentiate first match from others)
* send final diff from server, to close off raw python arguments
* support partial content streaming in Generic mode
* tool-call: allow content prelude before hermes2 tool calls (for Qwen2.5)
* Update function-calling.md
* Update tool_bench.py
* chat-parser: remove input from exception (llm output may contain PII)
---------
Co-authored-by: ochafik <ochafik@google.com>
Co-authored-by: Olivier Chafik <ochafik@users.noreply.github.com>
						2025-05-25 01:48:08 +01:00 
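With this change, tool calls arrive incrementally as standard OpenAI-style deltas when the server runs with --jinja. A sketch of accumulating them on the client; the tool definition and model name are placeholders:

```python
# Sketch: accumulate OpenAI-style streamed tool-call deltas.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key-required")
tools = [{"type": "function", "function": {
    "name": "get_weather",  # placeholder tool
    "parameters": {"type": "object",
                   "properties": {"city": {"type": "string"}}}}}]

stream = client.chat.completions.create(
    model="local", stream=True, tools=tools,
    messages=[{"role": "user", "content": "Weather in Paris?"}],
)

calls = {}  # index -> {"name": str, "arguments": str}
for chunk in stream:
    if not chunk.choices:
        continue
    for tc in chunk.choices[0].delta.tool_calls or []:
        call = calls.setdefault(tc.index, {"name": "", "arguments": ""})
        if tc.function and tc.function.name:
            call["name"] = tc.function.name
        if tc.function and tc.function.arguments:
            call["arguments"] += tc.function.arguments  # JSON arrives in fragments

print(calls)
```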
								Dorin-Andrei Geman 
						42158ae2e8 
server : fix first message identification (#13634)
* server : fix first message identification
When using the OpenAI SDK (https://github.com/openai/openai-node/blob/master/src/lib/ChatCompletionStream.ts#L623-L626) we noticed that the expected assistant role was missing from the first streaming message. Fix this by correctly checking for the first message.
Co-authored-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
Signed-off-by: Dorin Geman <dorin.geman@docker.com>
* server : fix checks for the first role message when stream=True
Co-authored-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
Signed-off-by: Dorin Geman <dorin.geman@docker.com>
---------
Signed-off-by: Dorin Geman <dorin.geman@docker.com>
Co-authored-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
						2025-05-21 15:07:57 +02:00 
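The behavior under test, sketched from the client side: the very first streamed delta should carry `role == "assistant"`, which SDKs such as openai-node's ChatCompletionStream rely on:

```python
# Sketch: verify the first streamed delta carries the assistant role.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key-required")
stream = client.chat.completions.create(
    model="local", stream=True,
    messages=[{"role": "user", "content": "Hello"}],
)

saw_role = False
for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    if not saw_role:
        assert delta.role == "assistant"  # guaranteed by this fix
        saw_role = True
```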
								Diego Devesa 
						1d36b3670b 
llama : move end-user examples to tools directory (#13249)
* llama : move end-user examples to tools directory
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
						2025-05-02 20:27:13 +02:00