llama.cpp

mirror of https://github.com/ggml-org/llama.cpp.git synced 2025-10-27 08:21:30 +00:00

Author	SHA1	Message	Date
Francis Couture-Harpin	833d03c25d	convert : for FP8, use scale type to decide auto type	2025-09-09 14:36:34 -04:00
Francis Couture-Harpin	34680f07d2	gguf-py : handle cross-filesystem file range copies	2025-09-09 14:36:34 -04:00
Francis Couture-Harpin	34f37c283b	convert : better logging of partially reflinkable tensors	2025-09-09 14:36:34 -04:00
Francis Couture-Harpin	2499e47cfd	gguf-py : allow previewing reflinked size on non-Linux platforms	2025-09-09 14:36:34 -04:00
Francis Couture-Harpin	8ef4136b20	convert : remove unused field ModelTensorInfo.src_qtype	2025-09-09 14:36:34 -04:00
Francis Couture-Harpin	be600e2622	convert : more robust default ftype detection	2025-09-09 14:36:34 -04:00
Francis Couture-Harpin	ec07416dcf	gguf-py : improve reflink size logging * gguf-py : move reflinking functions to lazy	2025-09-09 14:36:34 -04:00
Francis Couture-Harpin	cec3449507	convert : allow sharding reflinked models	2025-09-09 14:36:34 -04:00
Francis Couture-Harpin	fb879b40c0	convert : use F32 operations on Mamba A_log This matches the previous behavior for BF16 tensors.	2025-09-09 14:36:34 -04:00
Francis Couture-Harpin	6792f66a93	convert : detect filesystem block size for reflinks * convert : use direct copies when possible Using os.copy_file_range where available, and falling back to shutil.copyfileobj otherwise. * gguf : handle misaligned offset more cleanly	2025-09-09 14:36:34 -04:00
Francis Couture-Harpin	34bd024267	gguf-py : fix flake8 lint	2025-09-09 14:36:34 -04:00
Francis Couture-Harpin	7724bf9e4f	convert : fix reflinks for stacked MoE tensors	2025-09-09 14:36:34 -04:00
Francis Couture-Harpin	f7394cdaf4	convert : use reflinks for faster conversion	2025-09-09 14:36:32 -04:00
Francis Couture-Harpin	e582f1ac63	convert : fix no-lazy dtypes from direct safetensors	2025-09-09 14:33:01 -04:00
Francis Couture-Harpin	0edc189842	gguf-py : order safetensors tensors by name Applies to both local and remote safetensors custom parsing. This matches the behavior of the official safetensors implementation. * convert : rename from_safetensors_meta to from_local_tensor For consistency with from_remote_tensor	2025-09-09 14:33:01 -04:00
Francis Couture-Harpin	ca8f736fe4	convert : parse safetensors directly	2025-09-09 14:33:01 -04:00
Francis Couture-Harpin	0d5cfed596	Merge branch 'master' into compilade/convert-prequant	2025-09-09 14:23:06 -04:00
Jeff Bolz	4f63cd705c	vulkan: Fix OOB accesses in soft_max_back (#15861 ) b6431	2025-09-09 14:41:15 +02:00
Johannes Gäßler	17bc5a815f	HIP: use v_dot2_f32_f16 instruction for FA (#15884 ) b6430	2025-09-09 14:04:43 +02:00
lksj92hs	ed54e32558	Workaround for subgroup arithmetic failing on MoltenVK with AMD GPUs (issue 15846) (#15886 ) b6429	2025-09-09 14:01:15 +02:00
Aman Gupta	a972faebed	CUDA: Add mul_mat_id support for the mmf kernel (#15767 ) * CUDA: Add mul_mat_id support for the mmf Add support for mul_mat_id for bs < 16 * Review: use warp_size, fix should_use_mmf condition * Launch one block per expert, stride along n_expert_used * templatize mul_mat_id * Pad shmem to 16 bytes, add helper function mul_mat_f_switch_ids * Reduce compile times by dividing mmf into f16, bf16 and f32 variants * Divide mmf by ncols_dst * Add missing files * Fix MUSA/HIP builds b6428	2025-09-09 14:38:02 +08:00
Johannes Gäßler	550cf726e1	CUDA: fix GET_ROWS for large tensors (#15882 ) b6427	2025-09-09 08:11:01 +02:00
Georgi Gerganov	c252ce67c4	contrib : add notes about merging PRs (#15881 ) * contrib : add notes about merging PRs * Update CONTRIBUTING.md Co-authored-by: Diego Devesa <slarengh@gmail.com> * Update CONTRIBUTING.md Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Diego Devesa <slarengh@gmail.com> Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-09-09 08:42:10 +03:00
Daniel Bevenius	70cd37dbbe	requirements : update transformers/torch for Embedding Gemma (#15828 ) * requirements : update transformers/torch for Embedding Gemma This commit updates the requirements to support converting Embedding Gemma 300m models. The motivation for this change is that during development I had a local copy of the transformers package which is what I used for converting the models. This was a mistake on my part and I should have also updated my transformers version to the official release. I had checked the requirements/requirements-convert_legacy_llama.txt file and noted that the version was >=4.45.1,<5.0.0 and came to the conclusion that no update would be needed; this assumed that Embedding Gemma would be in a transformers release at the time Commit `fb15d649ed` ("llama : add support for EmbeddingGemma 300m (#15798)") was merged. So anyone wanting to convert themselves would be able to do so. However, Embedding Gemma is a preview release and this commit updates the requirements to use this preview release. * resolve additional python dependencies * fix pyright errors in tokenizer test and remove unused import	2025-09-09 06:06:52 +02:00
Piotr Wilkin (ilintar)	acc1b008cf	model-conversion : add extra debugging support for model conversion (#15877 ) * feat: Extra debugging support for model conversion - added BF16 support for llama-callback-eval and support for dumping intermediate steps in run-org-model.py b6424	2025-09-09 06:05:55 +02:00
Aldehir Rojas	7057faf64b	json : support `enum` values within `allOf` (#15830 ) b6423	2025-09-08 16:14:32 -05:00
j-k	fe1c92cd7b	media : add llama1 icon (#15878 ) Add svg and png based off llama1-icon.svg	2025-09-08 21:57:01 +03:00
Jeff Bolz	e68aa10d8f	vulkan: sort graph to allow more parallel execution (#15850 ) * vulkan: sort graph to allow more parallel execution Add a backend proc to allow the backend to modify the graph. The vulkan implementation looks at which nodes depend on each other and greedily reorders them to group together nodes that don't depend on each other. It only reorders the nodes, doesn't change the contents of any of them. With #15489, this reduces the number of synchronizations needed. * call optimize_graph per-split b6421	2025-09-09 02:10:07 +08:00
Aman Gupta	0a16bf52e6	CUDA: generate_cu_files.py - add missing mxfp4 (#15880 )	2025-09-09 01:23:46 +08:00
Jesse	88021565f0	chat : Deepseek V3.1 reasoning and tool calling support (OpenAI Style) (#15533 ) * Add DeepSeek V3.1 thinking mode support - Added COMMON_CHAT_FORMAT_DEEPSEEK_V3_1 enum value - Created common_chat_params_init_deepseek_v3_1() function (currently uses R1 implementation) - Created common_chat_parse_deepseek_v3_1() function that handles V3.1 thinking format: - Extracts reasoning content before '</think>' tag into reasoning_content - Extracts regular content after '</think>' tag into content - No opening '<think>' tag in V3.1 format - Added detection logic for V3.1 templates based on pattern: 'message['prefix'] is defined and message['prefix'] and thinking' - Added V3.1 case to parsing switch statement This addresses the issue where V3.1 outputs reasoning content followed by '</think>' and then regular content without the opening '<think>' tag. * Another attempt by V3.1 non-thinking * Fix test, but it's not asserting anything. * Ignore vim swap files in tests dir * Update the test * Try using try_find_literal instead of regex * passing test * Revert "Try using try_find_literal instead of regex" This reverts commit `c50d887ec2`. * Remove unnecessary change * Remove comment * Add code to handle non-thinking mode. * Try to set message['prefix'] when thinking is enabled. * This fixes reasoning, but breaks normal content. We need state in the chat parser. * DeepSeek V3.1 thinking is now the default. Disable with `--reasoning-budget 0`. * Simplify (DeepSeek V3.1 reasoning) * Fix sign inversion bug * Add some tool calling code (not working). * Tool calls working in non-reasoning mode. * Attempt a unit test for tool call parsing. * Passing test * Add tests for both happy path and broken fenced DeepSeek V3.1 tool call variants. * Passing DeepSeek V3.1 tool call tests, but model is not working. * Revert assistance response prefill change. Not my monkeys. * Add fenced_thinking unit test variant. 
Passes, but thinking tool calling still isn't working for some reason. * Tests pass in reasoning mode. Also e2e tool test passes. * Make a copy of the parse_json_tool_calls function for deepseek-v3.1 so as to not accidentally introduce regressions. * Fix thinking_forced_open logic. tool calling broken. Need to add another test case. * That's what I get for cargo culting a newline. * Add multi tool call test for deepseek v3.1 non-reasoning * Move test, remove .gitignore change * Place deepseek-v3.1 reasoning test directly into existing reasoning function per CISC's request. * Address whitespace CI failure. * Merge two assert_equals per CISC's request. * Add DeepSeek-V3.1 tests to tests/test-chat.cpp per CISC's request. * Merge deepseek V3.1 and regular parse_json_tool_calls() function behaviors by adding optional update_cursor argument. * Update tests/test-chat-parser.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update tests/test-chat-parser.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update tests/test-chat-parser.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update tests/test-chat-parser.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update tests/test-chat-parser.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update tests/test-chat-parser.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update tests/test-chat-parser.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update tests/test-chat-parser.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update tests/test-chat-parser.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * DeepSeek V3.1 fix reasoning_format none * Strip grammar down to strictly what we expect based on model card. Throw out parts we cargo culted from R1 that don't make sense. 
* Update tests/test-chat-parser.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * DeepSeek V3.1 - Add edge case where thinking is forced open, there is tool calling in the reasoning content, but then the model just stops the output without closing the </think> tag, so it's not a partial. In this case, use the tool call in the reasoning content. * DeepSeek V3.1 - simplify update_cursor * Update common/chat.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update common/chat.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update common/chat.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Fix indent --------- Co-authored-by: openhands <openhands@all-hands.dev> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> b6419	2025-09-08 16:59:48 +02:00
Xuan-Son Nguyen	56920f5665	server : bring back timings_per_token (#15879 ) b6418	2025-09-08 16:50:05 +02:00
Georgi Gerganov	b0d52998b9	cuda : fix supports_op condition for get_rows when number of blocks is too large (#15868 ) * cuda : fix supports_op condition for get_rows when src1->ne2 > 1 ggml-ci * ggml : add comment about ggml_get_rows ggml-ci * cuda : add FIXME [no ci] * cuda : update support condition ggml-ci	2025-09-08 13:56:51 +03:00
Georgi Gerganov	f28d4f4ac9	metal : refactor + optimize (#15857 ) * metal : refactor ggml-ci * cont : refactor FA-vec kernel * cont : print metal library load time * minor : warn to debug + better kernel names ggml-ci * metal : optimize mul_mv q8_0 ggml-ci * metal : simplify FA pipeline creation functions ggml-ci * metal : improve naming consistency * metal : safer function constants offsets ggml-ci * metal : comments ggml-ci b6416	2025-09-08 13:34:56 +03:00
Xuan-Son Nguyen	9fcb29f22f	ggml: allow casting between f32 and i32 (#15783 ) * ggml: allow casting between f32 and i32 * fix cuda * add vulkan * fix CPU non-cont * add non-cont test case * add note * extend test number range * correct note * add cont version for vulkan b6415	2025-09-08 12:33:01 +02:00
Sigbjørn Skjæret	5ef22d281d	CUDA: non-contiguous src0 not supported for PAD (#15869 ) b6414	2025-09-08 12:55:44 +03:00
Daniel Bevenius	233d773d02	convert : force setting sliding_window from original config (#15867 ) * convert : force setting sliding_window from original config This commit modifies the set_gguf_parameters method for EmbeddingGemma so that it reads the sliding_window parameter from the original model config.json and uses that value. The motivation for this change is that the Gemma3TextConfig constructor adjusts the sliding_window value, which can lead to inconsistencies when converting models as we expect this value to match the original model's configuration. Refs: `bb45d3631e/src/transformers/models/gemma3/configuration_gemma3.py (L230)` * fix flake8 error * add link to huggingface PR	2025-09-08 09:44:34 +02:00
Georgi Gerganov	a885dcff11	batched-bench : fix llama_synchronize usage during prompt processing (#15835 ) ggml-ci b6412	2025-09-08 10:27:07 +03:00
Georgi Gerganov	663027fd54	context : fix n_outputs during reserve (#15858 ) ggml-ci	2025-09-08 10:26:36 +03:00
Georgi Gerganov	cf0e3ba150	model : avoid ggml_cont_3d for fused QKV weights (#15662 ) * model : avoid ggml_cont_3d for fused QKV weights ggml-ci * kv-cache : make cpy_k and cpy_v implementation more readable ggml-ci * cont : add comments ggml-ci * cont : minor fix [no ci] * cont : one more fix * cont : clarity ggml-ci * kv-cache : require contiguous heads of k_cur and v_cur ggml-ci	2025-09-08 10:25:33 +03:00
Jeff Bolz	d413dca003	tests: large sizes for get_rows (#15687 ) b6409	2025-09-07 23:23:41 -05:00
Chenguang Li	85ca66a746	CANN: Stream sync between devices for acl_graph (#15809 ) * CANN: Switch to stream synchronization Switch to stream synchronization because events are not effective. Co-authored-by: hipudding <huafengchun@gmail.com> * CANN: add Comments --------- Co-authored-by: hipudding <huafengchun@gmail.com> b6408	2025-09-08 10:03:29 +08:00
Jeff Bolz	3976dfbe00	vulkan: support im2col_3d (#15795 ) b6407	2025-09-07 13:50:26 -05:00
Aaron Teo	d36e61c580	ggml-cpu: clean up s390x SIMD (#15855 ) * ggml-cpu: clean up s390x simd Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> (cherry picked from commit `0da4b6aa07`) Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: fix hsum data types Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> --------- Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> b6406	2025-09-08 02:18:28 +08:00
Jeff Bolz	c97b5e5854	vulkan: Support pad_ext (#15794 ) b6405	2025-09-07 19:00:49 +02:00
Jeff Bolz	267e99867f	vulkan: Use larger loads in scalar/coopmat1 matmul (#15729 ) I think glslang will translate an access like x[i][1].z to OpAccessChain ... x, i, 1, 2 OpLoad float16_t ... rather than loading all of x[i] in a single OpLoad. Change the code to explicitly load the vector/matrix. b6404	2025-09-07 18:53:07 +02:00
Daniel Bevenius	3b15924d71	ggml WebGPU: remove userdata from request adapter callback (#15527 ) * ggml WebGPU: remove userdata from request adapter callback This commit removes the `userdata` parameter from the WebGPU request adapter callback in `ggml-webgpu.cpp`. Instead, the lambda function captures the `webgpu_context` directly. The motivation for this change is to simplify the code and improve readability. * inline the callback lambda into the RequestAdapter call This commit removes the callback lambda variable and inlines it directly into the RequestAdapter call. b6403	2025-09-07 11:19:45 +03:00
Johannes Gäßler	79bc429262	CUDA: faster tile FA (Pascal/AMD), headsize 256 (#15769 ) b6402	2025-09-07 00:26:28 +02:00
Charles Xu	c4df49a42d	kleidiai: generalize compute_forward_kv_cache to compute_forward_fp16 (#15817 ) b6401	2025-09-06 22:08:43 +08:00
Xuan-Son Nguyen	3c3635d2f2	server : speed up tests (#15836 ) * server : speed up tests * clean up * restore timeout_seconds in some places * flake8 * explicit offline	2025-09-06 14:45:24 +02:00
Xuan-Son Nguyen	61bdfd5298	server : implement prompt processing progress report in stream mode (#15827 ) * server : implement `return_progress` * add timings.cache_n * add progress.time_ms * add test * fix test for chat/completions * readme: add docs on timings * use ggml_time_us Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> b6399	2025-09-06 13:35:04 +02:00
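
Note on commit 6792f66a93 ("use direct copies when possible"): the message describes using os.copy_file_range where available and falling back to shutil.copyfileobj otherwise. The sketch below illustrates that general pattern only; it is not the code from the commit. The names copy_range and LimitedReader are hypothetical, and the sketch assumes the destination file already exists.

```python
import os
import shutil


class LimitedReader:
    """Hypothetical helper: exposes at most `limit` bytes of a file object."""

    def __init__(self, fileobj, limit):
        self.fileobj = fileobj
        self.remaining = limit

    def read(self, n=-1):
        if self.remaining <= 0:
            return b""
        if n < 0 or n > self.remaining:
            n = self.remaining
        data = self.fileobj.read(n)
        self.remaining -= len(data)
        return data


def copy_range(src_path, dst_path, src_offset, dst_offset, size):
    """Copy `size` bytes between two files; returns the name of the method used.

    Assumption: dst_path already exists (it is opened in "r+b" mode).
    """
    with open(src_path, "rb") as src, open(dst_path, "r+b") as dst:
        # os.copy_file_range only exists on some platforms (Linux, Python 3.8+).
        if hasattr(os, "copy_file_range"):
            try:
                done = 0
                while done < size:
                    n = os.copy_file_range(
                        src.fileno(), dst.fileno(), size - done,
                        src_offset + done, dst_offset + done,
                    )
                    if n == 0:  # nothing more could be copied
                        break
                    done += n
                if done == size:
                    return "copy_file_range"
            except OSError:
                # For example a cross-filesystem copy or an unsupported
                # filesystem; redo the whole range with the fallback.
                pass
        # Fallback: a plain userspace copy bounded to `size` bytes.
        src.seek(src_offset)
        dst.seek(dst_offset)
        shutil.copyfileobj(LimitedReader(src, size), dst)
        return "copyfileobj"
```

The fallback restarts from the beginning of the range rather than resuming, which keeps the sketch short at the cost of possibly rewriting bytes that were already copied.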


6452 Commits