llama.cpp

mirror of https://github.com/ggml-org/llama.cpp.git synced 2025-11-20 12:07:33 +00:00

Author	SHA1	Message	Date
hksdpc255	1920345c3b	common : Generalized XML-style tool-call parsing with streaming support (GLM 4.5/4.6 + MiniMax M2 + SeedOSS + Kimi-K2 + Qwen3-Coder + Apriel-1.5 + Xiaomi-MiMo) (#16932) * Add files via upload * fix unit test * fix crashes for --reasoning-format=none * Patch buggy official MiniMax-M2 chat template * add upstream minja fix: https://github.com/ochafik/minja/pull/7 * Fix <think> token not generated * add test copied from https://github.com/ggml-org/llama.cpp/pull/16946 * cleanup * Hopes to fix the compilation error on CI * Delete chat template patching since it’s fixed by upstream Minja * Remove undeeded Minimax-M2 template patch https://github.com/ochafik/minja/pull/7#issuecomment-3480356100 * Add proper handling of optional parameters with test merged tests from: `23d4bb75c4` * Fix making all tool parameters optional * Move xml tool parser to separate file * cleanup & add tests for GLM4.5 * add streaming tests & enhancement & cleanups Add streaming test for both GLM 4.5 and minimax-m2. Cleanup for preserved_tokens. Cleanup for grammar rule name. Enhance the parser's stability. * cleanup & add support for Kimi-K2 Qwen3-Coder Apriel-1.5 Xiaomi-MiMo * apply suggestions from reviewers * fix a misuse for data.grammar_lazy * fix grammar when tool have no argument * Fix `no triggers set for lazy grammar!` for GLM4.5/4.6. Insert additional stops for Kimi-K2 * update chat.cpp * fix grammar for GLM 4.5/4.6 * Try fix Jinja template for GLM * Try fix GLM-4.6.jinja * Update common/chat-parser-xml-toolcall.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update tests/test-chat.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * improve chat template for GLM, rename Kimi-K2 template to Kimi-K2-Thinking * Improve Kimi-K2 chat template * Fix unit test * Fix "Invalid tool call arguments passed" in a rare case. In a rare case, the model may emit a raw string that begins with a valid JSON string. This commit adds unit tests to cover that scenario and fixes the regression introduced during the Kimi-K2 adaptation. --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-11-18 18:54:15 +01:00
Piotr Wilkin (ilintar)	34fcc5a4ac	model : Apertus model implementation (#15852) * First attempt * No permute during convert (fixes qk tensors), proper norm application. * RoPE = NeoX * Coherence! * Migrate xielu params from tensors to hyperparameters * Simple CUDA kernel * Revert stupid LLM refactorings * Chat template support * configchecker / flake8 errors * Reorder unary.cu * I do conclude that LLMs are, in fact, stupid. * Fix after merge * Final newline * Make xIELU an UNARY_OP * Final newline * Correctly account for parameter shift * Argh. * Update ggml/src/ggml-cpu/unary-ops.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Refactor: remove unused methods, inline and factorize softplus, add const modifiers * Revert CUDA changes, implement xIELU as a separate OP * Pesky newline * Add float2half / half2float for F16 inputs/outputs * CUDA variants, attempt 2 * Actually, attempt 3 * Update ggml/src/ggml-cuda/unary.cu Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Missing convert header * Proper formula and reference for xIELU in the comments. * Modify unary-ops.cpp to add the functor-based logic besides the template system to retain optimizations * Apply suggestions from code review Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Add tensor mappings for Apertus to global list instead * Fix lazy on scalars * Update ggml/src/ggml-cuda/unary.cu Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Add comment about the constraints on positive/negative alpha * Change `softplus` to `ggml_softplus` --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Johannes Gäßler <johannesg@5d6.de> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-10-02 20:43:22 +03:00
Piotr	3cb203c89f	llama-chat : Do not throw when tool parsing fails (#14012) Currently when a model generates output which looks like a tool call, but is invalid an exception is thrown and not handled, causing the cli or llama-server to bail. Instead, handle the chat parser exception and simply return the generated text in such cases. Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>	2025-06-14 17:25:15 +01:00
Georgi Gerganov	53f925074d	sync : vendor (#13901) * sync : vendor ggml-ci * cont : fix httplib version ggml-ci * cont : fix lint * cont : fix lint * vendor : move to common folder /vendor ggml-ci * cont : fix lint * cont : move httplib to /vendor + use json_fwd.hpp ggml-ci * cont : fix server build ggml-ci * cont : add missing headers ggml-ci * cont : header clean-up ggml-ci	2025-05-30 16:25:45 +03:00
Olivier Chafik	03f582ae8f	server: fix streaming crashes (#13786) * add preludes to content on partial regex match * allow all parsers to parse non-tool-call content. * tweak order of <|python_tag|> vs <function= parsing for functionary v3.1 format. still not ideal but hopefully less prone to crash	2025-05-26 16:03:57 +01:00
Olivier Chafik	f5cd27b71d	`server`: streaming of tool calls and thoughts when `--jinja` is on (#12379) * add common_json w/ support for truncated json healing * add common_chat_msg_diff * partial common_chat_parse * refactor parser w/ optionals * server: wire chat diffs in stream mode * fix trigger of thinking models (must happen after thoughts are closed) * fix functionary v3.2 raw python! * rename: common_chat_syntax (now contains format) * rm common_regex.at_start * don't return empty <think></think> * accommodate yet another deepseek r1 distill fantasy syntax (`<｜tool▁calls｜>`) * fix QwQ 32B tool call parsing after thoughts (hermes2) * better logs for grammar triggers * consume spaces after parse_json_tool_calls * fix required tool calls w/ thinking models that have pre-opened thinking tags * fix thinking model's initial trigger + test qwq's template * run most test_tool_call tests in stream + non-stream modes * make functionary v3.2 parsing more strict (differentiate first match from others) * send final diff from server, to close off raw python arguments * support partial content streaming in Generic mode * tool-call: allow content prelude before hermes2 tool calls (for Qwen2.5) * Update function-calling.md * Update tool_bench.py * chat-parser: remove input from exception (llm output may contain PII) --------- Co-authored-by: ochafik <ochafik@google.com> Co-authored-by: Olivier Chafik <ochafik@users.noreply.github.com>	2025-05-25 01:48:08 +01:00

6 Commits