Xuan-Son Nguyen 
							
						 
					 
					
						
						
							
						
						3c3635d2f2 
					 
					
						
						
							
							server : speed up tests ( #15836 )  
						
						... 
						
						
						
						* server : speed up tests
* clean up
* restore timeout_seconds in some places
* flake8
* explicit offline 
						
						
					 
					
						2025-09-06 14:45:24 +02:00 
						 
				 
			
				
					
						
							
							
								Johannes Gäßler 
							
						 
					 
					
						
						
							
						
						e81b8e4b7f 
					 
					
						
						
							
							llama: use FA + max. GPU layers by default ( #15434 )  
						
						... 
						
						
						
						* llama: use max. GPU layers by default, auto -fa
* ggml-backend: abort instead of segfault 
						
						
					 
					
						2025-08-30 16:32:10 +02:00 
						 
				 
			
				
					
						
							
							
								Olivier Chafik 
							
						 
					 
					
						
						
							
						
						f5cd27b71d 
					 
					
						
						
							
							server: streaming of tool calls and thoughts when --jinja is on (#12379 )  
						
						... 
						
						
						
						* add common_json w/ support for truncated json healing
* add common_chat_msg_diff
* partial common_chat_parse
* refactor parser w/ optionals
* server: wire chat diffs in stream mode
* fix trigger of thinking models (must happen after thoughts are closed)
* fix functionary v3.2 raw python!
* rename: common_chat_syntax (now contains format)
* rm common_regex.at_start
* don't return empty <think></think>
* accommodate yet another deepseek r1 distill fantasy syntax (`<|tool▁calls|>`)
* fix QwQ 32B tool call parsing after thoughts (hermes2)
* better logs for grammar triggers
* consume spaces after parse_json_tool_calls
* fix required tool calls w/ thinking models that have pre-opened thinking tags
* fix thinking model's initial trigger + test qwq's template
* run most test_tool_call tests in stream + non-stream modes
* make functionary v3.2 parsing more strict (differentiate first match from others)
* send final diff from server, to close off raw python arguments
* support partial content streaming in Generic mode
* tool-call: allow content prelude before hermes2 tool calls (for Qwen2.5)
* Update function-calling.md
* Update tool_bench.py
* chat-parser: remove input from exception (llm output may contain PII)
---------
Co-authored-by: ochafik <ochafik@google.com >
Co-authored-by: Olivier Chafik <ochafik@users.noreply.github.com > 
						
						
					 
					
						2025-05-25 01:48:08 +01:00 
						 
				 
			
				
					
						
							
							
								Diego Devesa 
							
						 
					 
					
						
						
							
						
						1d36b3670b 
					 
					
						
						
							
							llama : move end-user examples to tools directory ( #13249 )  
						
						... 
						
						
						
						* llama : move end-user examples to tools directory
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co > 
						
						
					 
					
						2025-05-02 20:27:13 +02:00 
						 
				 
			
				
					
						
							
							
								Olivier Chafik 
							
						 
					 
					
						
						
							
						
						669912d9a5 
					 
					
						
						
							
							tool-call: fix Qwen 2.5 Coder support, add micro benchmarks, support trigger patterns for lazy grammars (#12034 )  
						
						... 
						
						
						
						* sampler: turn lazy grammar trigger words to regexes
* add scripts/tool_bench.sh & .py
* constrain llama json output regardless of function name if matches at beginning
* update relaxed newline space rule in grammar tests
* support add_generation_prompt query parameter (useful for /apply_template)
* Update src/llama-grammar.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com >
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com > 
						
						
					 
					
						2025-03-05 13:05:13 +00:00