llama.cpp

mirror of https://github.com/ggml-org/llama.cpp.git synced 2025-10-27 08:21:30 +00:00

Author	SHA1	Message	Date
Sigbjørn Skjæret	835b2b915c	model : add GroveMoE support (#15510 ) * add GroveMoE support * remove constexpr that fails on certain compilers * revert crude scalar div implementation, use cast * build_attn_inp_kv_unified -> build_attn_inp_kv * fix build_attn * re-apply ffn_exps regex changes	2025-09-25 19:50:28 +02:00
Douglas Hanley	b5bd037832	llama : add support for qwen3 reranker (#15824 )	2025-09-25 11:53:09 +03:00
Johannes Gäßler	e789095502	llama: print memory breakdown on exit (#15860 ) * llama: print memory breakdown on exit	2025-09-24 16:53:48 +02:00
Uilian Ries	152729f884	common : add missing chrono header for common.cpp (#16211 ) Signed-off-by: Uilian Ries <uilianries@gmail.com>	2025-09-24 09:53:47 +03:00
Adrien Gallouët	37a23c17bd	common : enable `--offline` mode without curl support (#16137 ) * common : use the json parser Signed-off-by: Adrien Gallouët <angt@huggingface.co> * common : enable --offline mode without CURL support This change refactors the download logic to properly support offline mode even when the project is built without CURL. Without this commit, using `--offline` would give the following error: error: built without CURL, cannot download model from the internet even if all the files are already cached. Signed-off-by: Adrien Gallouët <angt@huggingface.co> --------- Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-09-22 15:13:51 +03:00
Haiyue Wang	d05affbab7	common : remove unused local variables (#16140 ) These two local variables 'arg' and 'arg_prefix' have been overriden by: 1. for (const auto & arg : opt.args) 2. for (int i = 1; i < argc; i++) { const std::string arg_prefix = "--"; std::string arg = argv[i];	2025-09-22 11:48:42 +03:00
shun095	f432d8d83e	chat: Fix streaming parser for granite models (#15682 ) * fix(chat): fix streaming parser for granite models * tests: add test cases for Granite models chat parser	2025-09-19 09:57:30 -06:00
Xuan-Son Nguyen	4b8560ab56	chat : fix build on arm64 (#16101 )	2025-09-19 13:02:51 +07:00
Eric Curtin	4ca088b036	Add resumable downloads for llama-server model loading (#15963 ) - Implement resumable downloads in common_download_file_single function - Add detection of partial download files (.downloadInProgress) - Check server support for HTTP Range requests via Accept-Ranges header - Implement HTTP Range request with "bytes=<start>-" header - Open files in append mode when resuming vs create mode for new downloads Signed-off-by: Eric Curtin <eric.curtin@docker.com>	2025-09-18 16:22:50 +01:00
David Ribeiro Alves	cd08fc3ecc	common : Fix corrupted memory error on json grammar initialization (#16038 ) Initalizing RESERVED_NAME in is_reserved_name() is not thread safe and leads to corrupted memory when used from multiple threads as can be seen in the asan trace below. This fixes the initialization to make it thread-safe. #0 0x000100abd018 in std::__1::pair<std::__1::__hash_iterator<std::__1::__hash_node<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, void>>, bool> std::__1::__hash_table<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::hash<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::equal_to<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>>::__emplace_unique_key_args<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&>(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&) __hash_table:1565 #1 0x000100ab0320 in SchemaConverter::visit(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&) json-schema-to-grammar.cpp:802 #2 0x000100aafc48 in std::__1::__function::__func<build_grammar(std::__1::function<void (common_grammar_builder const&)> const&, common_grammar_options const&)::$_2, std::__1::allocator<build_grammar(std::__1::function<void (common_grammar_builder const&)> const&, common_grammar_options const&)::$_2>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> (std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&)>::operator()(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&) function.h:319 #3 0x000100a2c938 in std::__1::__function::__func<common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool)::$_0::operator()(common_grammar_builder const&) const::'lambda'(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&), std::__1::allocator<common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool)::$_0::operator()(common_grammar_builder const&) const::'lambda'(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&)>, void (nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&)>::operator()(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&) function.h:319 #4 0x000100a139f8 in foreach_function(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&, std::__1::function<void (nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&)> const&) chat.cpp:762 #5 0x000100a2a7f4 in std::__1::__function::__func<common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool)::$_0, std::__1::allocator<common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool)::$_0>, void (common_grammar_builder const&)>::operator()(common_grammar_builder const&) function.h:319 #6 0x000100aa98f4 in build_grammar(std::__1::function<void (common_grammar_builder const&)> const&, common_grammar_options const&) json-schema-to-grammar.cpp:982 #7 0x0001009c9314 in common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool) chat.cpp:1110 #8 0x0001009b8afc in common_chat_templates_apply_jinja(common_chat_templates const, common_chat_templates_inputs const&) chat.cpp:1992 #9 0x0001009b533c in common_chat_templates_apply(common_chat_templates const, common_chat_templates_inputs const&) chat.cpp:2074 #10 0x000100810120 in llamacpp_apply_chat_template+0x724 (predict_oai-98384e17fb94e863:arm64+0x100090120) ... ==45482==Register values: x[0] = 0x00006020004147f8 x[1] = 0x00006080000013c8 x[2] = 0x0000000000000000 x[3] = 0x0000604006289738 x[4] = 0x0000000000000002 x[5] = 0x0000000000000001 x[6] = 0x04034000004b4000 x[7] = 0x0000000000000001 x[8] = 0xbebebebebebebebe x[9] = 0x17d7d7d7d7d7d7d7 x[10] = 0x00000c04000828ff x[11] = 0x0000000000000001 x[12] = 0x000000002018d383 x[13] = 0x0000000000000000 x[14] = 0xfa0000000000fafa x[15] = 0x000010700001ffff x[16] = 0x000000019dc012c0 x[17] = 0x00000001021284f8 x[18] = 0x0000000000000000 x[19] = 0x00000001700acdc0 x[20] = 0x0000000000000002 x[21] = 0x000000002018d384 x[22] = 0x16dd16fd2e731151 x[23] = 0x0000007000020000 x[24] = 0x0000000100c69c08 x[25] = 0x0000000100c69c20 x[26] = 0x00006080000013c7 x[27] = 0x0000000100c69c00 x[28] = 0x00000001700acd60 fp = 0x00000001700aceb0 lr = 0x0000000100abce30 sp = 0x00000001700acd60 AddressSanitizer can not provide additional info. SUMMARY: AddressSanitizer: SEGV __hash_table:1565 in std::__1::pair<std::__1::__hash_iterator<std::__1::__hash_node<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, void>>, bool> std::__1::__hash_table<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::hash<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::equal_to<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>>::__emplace_unique_key_args<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&>(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&) Thread T5 created by T0 here: #0 0x0001020b99d4 in pthread_create+0x5c (libclang_rt.asan_osx_dynamic.dylib:arm64e+0x359d4) #1 0x000100873910 in std::sys::pal::unix:🧵:Thread:🆕:h77254fdd87a28e05+0x118 (predict_oai-98384e17fb94e863:arm64+0x1000f3910) #2 0x0001007c7a1c in test::run_test::haeb3c2bcd5ed6cf6+0x76c (predict_oai-98384e17fb94e863:arm64+0x100047a1c) #3 0x0001007aedb0 in test::console::run_tests_console::he9d142d704f3a986+0x149c (predict_oai-98384e17fb94e863:arm64+0x10002edb0) #4 0x0001007c5758 in test::test_main::hf86a5e20735245b9+0x118 (predict_oai-98384e17fb94e863:arm64+0x100045758) #5 0x0001007c5da0 in test::test_main_static::h61ee9c8fd30abca0+0x54 (predict_oai-98384e17fb94e863:arm64+0x100045da0) ... ==45482==ABORTING	2025-09-17 11:08:02 +03:00
jacekpoplawski	8ff206097c	llama-bench: add --n-cpu-moe support (#15952 ) * llama-bench: add --n-cpu-moe support Support --n-cpu-moe in llama-bench the same way it is supported by llama-server.	2025-09-16 16:17:08 +02:00
Aman Gupta	6d758839ff	Add LLaDA-7b-MoE diffusion model (#16003 )	2025-09-16 10:38:28 +08:00
Sigbjørn Skjæret	b8e09f08b9	model : add grok-2 support (#15539 ) * add grok-2 support * type fix * type fix * type fix * "fix" vocab for invalid sequences * fix expert tensor mapping and spaces in vocab * add chat template * fix norm tensor mapping * rename layer_out_norm to ffn_post_norm * ensure ffn_post_norm is mapped * fix experts merging * remove erroneous FFN_GATE entry * concatenate split tensors and add more metadata * process all expert layers and try cat instead of hstack * add support for community BPE vocab * fix expert feed forward length and ffn_down concat * commit this too * add ffn_up/gate/down, unsure if sequence is right * add ffn_gate/down/up to tensor names * correct residual moe (still not working) * mess-- * fix embedding scale being applied twice * add built in chat template * change beta fast for grok if default value * remove spm vocab in favor of community bpe vocab * change attention temp length metadata type to integer * update attention temp length metadata * remove comment * replace M_SQRT2 with std::sqrt(2) * add yarn metadata, move defaults to hparams	2025-09-14 23:00:59 +02:00
Diego Devesa	50f4281a6f	llama : allow using iGPUs with --device (#15951 ) * llama : allow using iGPUs with --device * mtmd : allow iGPU * rpc-server : allow iGPU	2025-09-13 16:49:49 +02:00
Eric Curtin	4bf5549269	Add docker protocol support for llama-server model loading (#15790 ) To pull and run models via: llama-server -dr gemma3 Add some validators and sanitizers for Docker Model urls and metadata Signed-off-by: Eric Curtin <eric.curtin@docker.com>	2025-09-12 16:31:50 +01:00
Georgi Gerganov	f088b6a84f	server : adjust prompt similarity thold + add logs (#15913 ) ggml-ci	2025-09-12 17:02:55 +03:00
Aldehir Rojas	7057faf64b	json : support `enum` values within `allOf` (#15830 )	2025-09-08 16:14:32 -05:00
Jesse	88021565f0	chat : Deepseek V3.1 reasoning and tool calling support (OpenAI Style) (#15533 ) * Add DeepSeek V3.1 thinking mode support - Added COMMON_CHAT_FORMAT_DEEPSEEK_V3_1 enum value - Created common_chat_params_init_deepseek_v3_1() function (currently uses R1 implementation) - Created common_chat_parse_deepseek_v3_1() function that handles V3.1 thinking format: - Extracts reasoning content before '</think>' tag into reasoning_content - Extracts regular content after '</think>' tag into content - No opening '<think>' tag in V3.1 format - Added detection logic for V3.1 templates based on pattern: 'message['prefix'] is defined and message['prefix'] and thinking' - Added V3.1 case to parsing switch statement This addresses the issue where V3.1 outputs reasoning content followed by '</think>' and then regular content without the opening '<think>' tag. * Another attempt by V3.1 non-thinking * Fix test, but it's not asserting anything. * Ignore vim swap files in tests dir * Update the test * Try using try_find_literal instead of regex * passing test * Revert "Try using try_find_literal instead of regex" This reverts commit `c50d887ec2`. * Remove unnecessary change * Remove comment * Add code to handle non-thinking mode. * Try to set message['prefix'] when thinking is enabled. * This fixes reasoning, but breaks normal content. We need state in the chat parser. * DeepSeek V3.1 thinking is now the default. Disable with `--reasoning-budget 0`. * Simplify (DeepSeek V3.1 reasoning) * Fix sign inversion bug * Add some tool calling code (not working). * Tool calls working in non-reasoning mode. * Attempt a unit test for tool call parsing. * Passing test * Add tests for both happy path and broken fenced DeepSeek V3.1 tool call variants. * Passing DeepSeek V3.1 tool call tests, but model is not working. * Revert assistance response prefill change. Not my monkeys. * Add fenced_thinking unit test variant. Passes, but thinking tool calling still isn't working for some reason. * Tests pass in reasoning mode. Also e2e tool test passes. * Make a copy of the parse_json_tool_calls function for deepseek-v3.1 so as to not accidentally introduce regressions. * Fix thinking_forced_open logic. tool calling broken. Need to add another test case. * That's what I get for cargo culting a newline. * Add multi tool call test for deepseek v3.1 non-reasoning * Move test, remove .gitignore change * Place deepseek-v3.1 reasoning test directly into existing reasoning function per CISC's request. * Address whitespace CI failure. * Merge two assert_equals per CISC's request. * Add DeepSeek-V3.1 tests to tests/test-chat.cpp per CISC's request. * Merge deepseek V3.1 and regular parse_json_tool_calls() function behaviors by adding optional update_cursor argument. * Update tests/test-chat-parser.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update tests/test-chat-parser.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update tests/test-chat-parser.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update tests/test-chat-parser.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update tests/test-chat-parser.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update tests/test-chat-parser.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update tests/test-chat-parser.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update tests/test-chat-parser.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update tests/test-chat-parser.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * DeepSeek V3.1 fix reasoning_format none * Strip grammar down to strictly what we expect based on model card. Throw out parts we cargo culted from R1 that don't make sense. * Update tests/test-chat-parser.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * DeepSeek V3.1 - Add edge case where thinking is forced open, there is tool calling in the reasoning content, but then the model just stops the output without closing the </think> tag, so it's not a partial. In this case, use the tool call in the reasoning content. * DeepSeek V3.1 - simplify update_cursor * Update common/chat.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update common/chat.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update common/chat.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Fix indent --------- Co-authored-by: openhands <openhands@all-hands.dev> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-09-08 16:59:48 +02:00
Gabe Goodhart	5fac79cbc7	Thinking model disabled assistant prefill (#15404 ) * feat: Set enable_thinking IFF not disabled and supported Branch: gabe-l-hart/thinking-model-disabled-agent-prefill Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix inverted logic condition for prefill error Branch: gabe-l-hart/thinking-model-disabled-agent-prefill Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Always parse the enable_thinking kwarg to overwrite the default value From what I can tell, this started as a Qwen3-specific keyword, but from the use in `chat.cpp` translates this inputs.enable_thinking to the right thinking kwarg for the given model, this is now more of a standardized kwarg, so it should always override the default value when sent as part of the chat_template_kwargs field in the API. Branch: gabe-l-hart/thinking-model-disabled-agent-prefill Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Don't limit tempalte expansion check to jinja With the use_jinja check, non-jinja models would enable thinking and always fail assistant prefill Branch: gabe-l-hart/thinking-model-disabled-agent-prefill Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add the error text to json type errors in json_value Branch: gabe-l-hart/thinking-model-disabled-agent-prefill Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Explicitly reject string values for "enable_thinking" There are too many possible "truthy" / "falsy" strings and too many ambiguous strings that don't have a clear truthy/falsy value, so the simplest thing to do here is to reject the request. Ideally, this would be a 422 (Unprocessable Entity), but right now it's coming back as a 500. Branch: gabe-l-hart/thinking-model-disabled-agent-prefill Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Move logic for detecting template enable_thinking support to common Branch: gabe-l-hart/thinking-model-disabled-agent-prefill Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Use raw pointer for common chat template function Branch: gabe-l-hart/thinking-model-disabled-agent-prefill Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> --------- Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>	2025-09-05 14:31:24 -06:00
Eric Curtin	408ff524b4	Implement --log-colors with always/never/auto (#15792 ) With auto by default Signed-off-by: Eric Curtin <ericcurtin17@gmail.com>	2025-09-05 19:43:59 +01:00
ExtReMLapin	4fd1242bef	chat : fixed crash when Hermes 2 <tool_call> had a newline before it (#15639 ) Co-authored-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>	2025-09-05 01:24:08 +02:00
Piotr Wilkin (ilintar)	b2426e469e	chat : nemotron thinking & toolcalling support (#15676 ) * feat: nemotron thinking & toolcalling support * Trailing whitespaces * Corrected template for Nemotron * Template and parser fixes * Final template and grammar changes * Whitespace * Always do lazy grammar processing since </think> tag will always be there. * Allow extra content after toolcall * Whitespace * New tests: thinking + tools, tools + content, thinking + tools + content (new!) * Whitespace * Remove cURL test script	2025-09-05 01:22:22 +02:00
Eric Curtin	badb80cadb	Document the new max GPU layers default in help (#15771 ) This is a key change, just letting users know. Signed-off-by: Eric Curtin <ericcurtin17@gmail.com>	2025-09-04 10:49:44 +01:00
Johannes Gäßler	c466abe158	llama: -fa 1/0/-1 aliases for -fa on/off/auto (#15746 )	2025-09-02 18:17:26 +02:00
Georgi Gerganov	e92d53b29e	sampling : optimize samplers by reusing bucket sort (#15665 ) * sampling : optimize sorting using bucket sort in more places ggml-ci * sampling : do not sort in dist sampler ggml-ci * sampling : avoid heap allocations for sort buffers ggml-ci * common : add option to sort sampling candidates by probability ggml-ci * sampling : revert the change for preserving sort buffers * sampling : use std::copy instead of memcpy * sampling : clarify purpose of partial sort helpers ggml-ci * cont : remove wrong comment [no ci] * common : update comment Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-08-31 20:41:02 +03:00
Georgi Gerganov	0d161f021a	server : enable /slots by default and make it secure (#15630 ) * server : enable /slots by default and make it secure ggml-ci * server : fix tests to pass `--no-slots` when necessary * server : extend /props with info about enabled endpoints	2025-08-31 20:11:58 +03:00
Johannes Gäßler	e81b8e4b7f	llama: use FA + max. GPU layers by default (#15434 ) * llama: use max. GPU layers by default, auto -fa * ggml-backend: abort instead of segfault	2025-08-30 16:32:10 +02:00
Piotr Wilkin (ilintar)	60e5eee31f	chat : Seed OSS thinking + tool call support (#15552 ) * Reasoning and tool-calling support for Seed OSS * Fix grammar and partial parsing * Whitespace * New chat template * Update common/chat.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update common/chat.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Remove unused 'purge_healing_marker' helper --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-08-29 14:53:41 +02:00
Sigbjørn Skjæret	84ab83cc0b	model : jina-embeddings-v3 support (#13693 ) * initial jina-embeddings-v3 support * initial jina-embeddings-v3 support * initial jina-embeddings-v3 support * fix vocab parsing with only tokenizer.json * set mask token lstrip attribute * additional unk_token_id fallback just in case [no ci] * revert vocab_size() change [no ci] * merge tensor loading into general bert * rope * add lora embedding and loading (non-functional) * export separate lora ggufs instead * add adapter metadata api * use std::string * convert_hf_to_lora compatibility * fix assert * apply suggestions from review * apply suggestion from review	2025-08-28 15:49:50 +02:00
Georgi Gerganov	da54f9f1a2	presets : add qwen3-30B-a3b FIM (#15616 )	2025-08-27 15:48:07 +03:00
Daniel Bevenius	fcca2182a1	common : add -m to bash completion for --model [no ci] (#15591 ) This commit updates the bash completion script to include the -m short option for the --model argument. The motivation for this is that currently tab completion only works the full --model option, and it is nice to have it work for the short option as well.	2025-08-27 10:28:53 +02:00
Aldehir Rojas	32732f2459	model : gpt-oss add response_format support (#15494 )	2025-08-22 11:04:08 -05:00
Georgi Gerganov	9ebebef62f	llama : remove KV cache defragmentation logic (#15473 ) ggml-ci	2025-08-22 12:22:13 +03:00
Diego Devesa	54a241f505	sched : fix possible use of wrong ids tensor when offloading moe prompt processing (#15488 )	2025-08-21 23:09:32 +02:00
Jie Fu (傅杰)	2f3dbffb17	common : fix incorrect print of non-ascii characters in the logging (#15466 ) Signed-off-by: Jie Fu <jiefu@tencent.com>	2025-08-21 11:54:34 +03:00
Daniel Bevenius	657b8a77bd	chat: handle gpt-oss return/end token inconsistency (#15421 ) This commit addresses an inconsistency during inference by adding a new member to the `templates_params` struct to indicate whether the chat is in inference mode. This allows the gpt-oss specific function `common_chat_params_init_gpt_oss` to check this flag and the `add_generation_prompt` flag to determine if it should replace the `<\|return\|>` token with the `<\|end\|>` token in the prompt. The motivation for this change is to ensure that the formatted prompt of past messages in `common_chat_format_single` matches the output of the formatted new message. The issue is that the gpt-oss template returns different end tags: `<\|return\|>` when `add_generation_prompt` is false, and `<\|end\|>` when `add_generation_prompt` is true. This causes the substring function to start at an incorrect position, resulting in tokenization starting with 'tart\|>' instead of '<\|start\|>'. Resolves: https://github.com/ggml-org/llama.cpp/issues/15417	2025-08-20 14:26:01 +02:00
Jie Fu (傅杰)	ec5ab1a36c	common : fix context shift help message (#15448 ) Signed-off-by: Jie Fu <jiefu@tencent.com>	2025-08-20 13:33:30 +03:00
Gian-Carlo Pascutto	1e19f5d462	common : Add top-nsigma sampler to help globally (#15428 ) Fixes #15423.	2025-08-19 19:58:14 +03:00
Georgi Gerganov	d2fcd91cf9	server : disable context shift by default (#15416 ) * server : disable context shift by default ggml-ci * server : make scopr of test parameters local	2025-08-19 16:46:37 +03:00
Xuan-Son Nguyen	e9288e8869	chat : clarify the meaning of reasoning_format (#15408 ) * chat : clarify the meaning of reasoning_format * add link to this PR	2025-08-19 10:29:36 +02:00
Daniel Bevenius	5e6229a840	common : fix double bos, use common_chat_templates for add_bos and add_eos (#15326 ) This commit updates common_chat_templates_apply_jinja to use the the add_bos and add_eos parameters from the chat template instead of the inputs. The motivation for this is that currently if the `add_bos` and `add_eos` from the input parameters are used it is possible to there will be a missmatch between the model and the chat template which can lead to the the removal of duplicate BOS/EOS tokens in chat.cpp `apply` to not happen leading to two BOS tokens being added to the template.	2025-08-15 19:50:52 +02:00
Diego Devesa	f75b830647	chat : include kwargs in template example (#15309 )	2025-08-14 10:28:29 -07:00
Aldehir Rojas	b204a5a234	gpt-oss: implement harmony parsing (#15181 ) * model : add harmony parser for gpt-oss * gpt-oss : fix grammar trigger from causing empty stack * gpt-oss: tweak the grammar trigger again * gpt-oss : add support for recipient in role header * gpt-oss : fix ungrouped tool calls in grammar * gpt-oss : loosen function name matching during parse * gpt-oss : clean up workarounds * gpt-oss : add template tests * gpt-oss : simulate thinking and tool call tags * gpt-oss : undo think tags when reasoning_format is none * gpt-oss : set special tokens back to user defined * gpt-oss : update openai-gpt-oss template * server : filter out harmony thought messages * gpt-oss : simplify parsing	2025-08-14 17:23:11 +03:00
Georgi Gerganov	d32e03f449	server : add SWA checkpoints (#15293 ) * server : add SWA checkpoints ggml-ci * cont : server clean-up * server : handle state restore fails * llama : add extended llama_state_seq_ API * server : do not make checkpoints if --swa-full ggml-ci * llama : remove flags value for NONE * server : configure number of SWA checkpoints with CLI arg ggml-ci * args : fix scope of new argument	2025-08-14 14:59:50 +03:00
Jonathan Graehl	5cdb27e091	finetune: SGD optimizer, more CLI args (#13873 ) * examples/finetune -opt SGD (stochastic gradient descent) memory opt add unit tested GGML_OPT_OPTIMIZER_SGD to ggml - avoids allocating m, v tensors. support finetune.cpp arg -opt SGD (or sgd). (default adamw as before) llama 3.2-1b-F32 result: observed 11gb gpu ram (41 sec/epoch) when using SGD instead of 19gb (55 sec/epoch) using adamw. (wikipedia 100 lines finetune) ( using the same GPU memory, adamw can only do before OOM 512 batch/context, reaching: train: [███████▉] data=0000140/0000140 loss=0.02575±0.00099 acc=99.52±0.03% t=00:00:47 ETA=00:00:00 val: [███████▉] data=0000008/0000008 loss=4.76565±0.28810 acc=41.46±0.77% t=00:00:00 ETA=00:00:00 SGD is superior, though it converges slower, with max before OOM 1728 batch/context (esp see the better validation perf): train: [███████▉] data=0000039/0000039 loss=0.00371±0.00010 acc=99.96±0.01% t=00:00:41 ETA=00:00:00 val: [███████▉] data=0000003/0000003 loss=5.11406±0.76034 acc=48.01±0.69% t=00:00:01 ETA=00:00:00 ) note: when finetuning long enough (or w/ enough -lr), validation accuracy eventually drops ('catastrophic forgetting') -lr-half (halflife) option useful for SGD to avoid oscillation or super slow underdamped learning (makes setting -lr more forgiving). terminal -lr for now is set by lr-halvings i.e. if you want at most 1/8 the inital -lr you set -lr-halvings 3. note: objective loss not directly comparable between adamw, sgd? - check perplexity or accuracy or consider relative improvements for convergence new finetune args -wd 1e-9 to enable weight decay in sgd or adamw, and max -epochs N (default 2 as before) cache (1 - wdalpha) in 'adamw' opt struct - no noticeable perf benefit, disabled (still done for new SGD though) since opt. memory is pre-allocated, the ggml_opt_get_optimizer_params would probably be able to change between SGD and AdamW with each epoch but would need to use adamw for the first (unconfirmed - no cmdline arg to set such a policy yet) test-opt checks adamw as before and now sgd (except for a few disabled tests for sgd only; probably just needs logging values and adding alternate reference values); tolerance on the 'regression' test is broader for sgd (so we don't need many more epochs) Vulkan: Implement GGML_OP_OPT_STEP_SGD * tests: Fix OPT_STEP_SGD test-backend-ops * SGD op param store weight-decay and not 1-alphawd minor + cosmetic changes * fix vulkan sgd * try CI fix --------- Co-authored-by: 0cc4m <picard12@live.de> Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-08-14 12:03:57 +02:00
Sigbjørn Skjæret	b3e16665e1	server : enable -td and -tbd parameters (#15172 )	2025-08-13 15:43:00 +02:00
Copilot	d8914fc47e	common : add --override-tensor-draft, --cpu-moe-draft and --n-cpu-moe-draft parameters (#15191 ) * Checkpoint from VS Code for coding agent session * Initial plan * Fix typo in --override-tensor-draft flag implementation * Add null termination for speculative tensor buffer overrides * Apply suggestions from code review * Apply suggestions from code review * Extract tensor override parsing logic to common function (addresses @slaren's feedback) * Apply suggestions from code review * Apply suggestions --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Diego Devesa <slarengh@gmail.com>	2025-08-13 12:44:40 +02:00
Xuan-Son Nguyen	fba5c0d680	chat : hotfix gpt-oss jinja raising an exception (#15243 ) * chat : hotfix gpt-oss jinja raising an exception * fix	2025-08-11 15:31:35 +02:00
Xuan-Son Nguyen	53d0a12658	server : allow specifying reasoning_format in HTTP request (#15238 )	2025-08-11 14:48:41 +02:00
Sachin Desai	3db4da56a5	chat : support Granite model reasoning and tool call (#14864 )	2025-08-06 20:27:30 +02:00

1 2 3 4 5 ...

572 Commits