llama.cpp

mirror of https://github.com/ggml-org/llama.cpp.git synced 2025-10-27 08:21:30 +00:00

Author	SHA1	Message	Date
Georgi Gerganov	b3964c1e89	metal : optimize FA vec for large sequences and BS <= 8 (#15566 ) * metal : optmize FA vec for large heads and sequences * metal : adjust small-batch mul mv kernels ggml-ci * batched-bench : fix total speed computation ggml-ci * cont : add comments ggml-ci b6286	2025-08-26 14:22:14 +03:00
Xuan-Son Nguyen	79a546220c	mtmd : support Kimi VL model (#15458 ) * convert : fix tensor naming conflict for llama 4 vision * convert ok * support kimi vision model * clean up * fix style * fix calc number of output tokens * refactor resize_position_embeddings * add test case * rename build fn * correct a small bug b6285	2025-08-26 12:54:19 +02:00
Georgi Gerganov	85cc1ae998	context : print graph stats for memory-less contexts (#15586 ) ggml-ci b6284	2025-08-26 12:47:00 +03:00
Georgi Gerganov	1d8d83deaa	metal : improve `MUL_MAT_ID` (#15541 ) * metal : mul_mm_id remove hdst * metal : remove mul_mm_id hsrc1 * metal : mul_mm_id simplify + add test * metal : opt mul_mm_id map0 * metal : optimize mul_mm_id id gathering * metal : mul/div opt * metal : optimize mul_mm_id_map0 ggml-ci b6283	2025-08-26 12:46:15 +03:00
tc-mb	c4e9239064	model : support MiniCPM-V 4.5 (#15575 ) b6282	2025-08-26 10:05:55 +02:00
Sigbjørn Skjæret	39842a7f73	gguf-py : remove erroneous FFN_GATE entry (#15583 )	2025-08-26 09:08:08 +02:00
Sigbjørn Skjæret	0fd90db585	metal : remove contiguous assertion for src0 in IM2COL (#15577 ) * remove contiguous assertion for src0 in IM2COL * add contiguous check in supports_op b6280	2025-08-26 09:51:43 +03:00
Yoshi_likes_e4	4c37636b3e	Add a warning for special devices (#15563 ) * Add warning * Print the devices names * Add newlines * Apply suggestions from code review Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Fix vector names --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de> b6279	2025-08-26 08:15:33 +02:00
Jeff Bolz	34bdbbd7c2	vulkan: Remove splitting for mul_mat_id (#15568 ) row_ids only needs to hold the BN rows for the current tile. b6278	2025-08-26 06:42:44 +02:00
Qeeweew	74f52f77f2	CUDA: Accelerate MXFP4 table lookup using `__byte_perm` (#15451 ) * CUDA: optimize get_int_from_table_16 * CUDA: use v_perm_b32 to replace byte_perm on AMD GPUs * revise documentation --------- Co-authored-by: xix <xiapc@outlook.com> Co-authored-by: Johannes Gäßler <johannesg@5d6.de> b6277	2025-08-25 23:21:22 +02:00
lhez	f7207b0415	opencl: fix support ops condition for `rms_norm` (#15560 ) b6276	2025-08-25 14:18:09 -07:00
Ruben Ortlam	4d917cd4f6	vulkan: fix min subgroup 16 condition for mmid subgroup optimization (#15565 ) b6275	2025-08-25 17:56:59 +02:00
Jeff Bolz	886b97a5d6	tests: Generate unique input values for count_equal (#15487 ) This avoids backend-dependent behavior for argmax that leads to intermittent failures. b6274	2025-08-25 10:47:16 -05:00
Ihar Hrachyshka	111f8d06f0	metal: fix regression when no metal devices are present (#15531 ) b6273	2025-08-25 18:27:34 +03:00
Johannes Gäßler	5eff6ec9b1	CUDA: MoE helper in device code, better tile sizes (#15525 ) * CUDA: MoE helper in device code, better tile sizes * reduce superfluous CUDA blocks b6272	2025-08-25 17:23:40 +02:00
Daniel Bevenius	dfd9b5f6c7	model-conversion : set pooling type to none in logits.cpp (#15564 ) This commit explicitly sets the pooling type to 'none' in the logits.cpp to support models that have a pooling type specified. The motivation for this is that some models may have a pooling type set in the model file (.gguf file) and for this specific case where we only want to extract logits, we need to ensure that no pooling is used to so that we are comparing raw logits and not pooled embeddings. b6271	2025-08-25 15:00:43 +02:00
Daniel Bevenius	5a6bc6b1a6	model-conversion : add model card template for embeddings [no ci] (#15557 ) * model-conversion: add model card template for embeddings [no ci] This commit adds a separate model card template (model repository README.md template) for embedding models. The motivation for this is that there server command for the embedding model is a little different and some addition information can be useful in the model card for embedding models which might not be directly relevant for causal models. * squash! model-conversion: add model card template for embeddings [no ci] Fix pyright lint error. * remove --pooling override and clarify embd_normalize usage	2025-08-25 14:25:25 +02:00
Georgi Gerganov	6b64f74b55	batched-bench : fix unified KV cache handling + pp timing (#15562 ) * batched-bench : fix unified KV cache handling + pp timing * cont : run dummy token only with split KV cache b6269	2025-08-25 13:56:43 +03:00
Weizhao Ouyang	0d5a470223	convert : update Ernie 4.5 dense architecture name (#15555 ) Signed-off-by: Weizhao Ouyang <o451686892@gmail.com>	2025-08-25 11:15:06 +02:00
Georgi Gerganov	b0ba31f525	metal : add FA kernels for HS=40 (#15559 ) ggml-ci b6267	2025-08-25 10:14:48 +03:00
RunningLeon	7da9fed0d6	convert : support interns1-mini (#15412 ) * support interns1-mini * fix comment * update	2025-08-25 08:32:16 +02:00
Chenguang Li	c247d06f38	CANN: ROPE cache sin/cos repeat (#15501 ) Signed-off-by: noemotiovon <757486878@qq.com> b6265	2025-08-25 10:32:21 +08:00
Ruben Ortlam	043fb27d38	vulkan: apply MUL_MAT_ID subgroup optimization to non-coopmat devices (#15524 ) * vulkan: use subgroup function for mul_mat_id shader even without coopmat * vulkan: fix compile warnings * vulkan: properly check for subgroup size control and require full subgroups for subgroup mul_mat_id * vulkan: disable subgroup mul_mat_id on devices with subgroups < 16 b6264	2025-08-24 19:36:36 +02:00
Georgi Gerganov	b730706a49	kv-cache : support layer reuse (#15504 ) * kv-cache : support layer reuse ggml-ci * cont : update comments [no ci]	2025-08-24 13:07:07 +03:00
Jeff Bolz	c9a24fb932	vulkan: Support FA with any multiple of 8 head sizes (#15537 ) The scalar FA shader already handled multiples of 8. The coopmat1 FA shader assumed 16x16x16 and the shared memory allocations need the HSK dimensions padded to a multiple of 16. NVIDIA's coopmat2 implementation requires multiples of 16 for N and K, and needs the matrix dimensions padded and loads clamped. Store the FA pipelines in a map, indexed by the pipeline state. b6262	2025-08-24 11:24:25 +02:00
Ruben Ortlam	a9c6ffcbfa	vulkan: enable Conv2D for Apple after MoltenVK fixed the bug (#15526 ) b6261	2025-08-24 10:48:53 +02:00
Jeff Bolz	e78cf0d4b1	vulkan: workaround MoltenVK compile failure in multi_add (#15506 ) * vulkan: workaround MoltenVK compile failure in multi_add * Update ggml/src/ggml-vulkan/vulkan-shaders/multi_add.comp Co-authored-by: 0cc4m <picard12@live.de>	2025-08-24 10:48:21 +02:00
Johannes Gäßler	710dfc465a	CUDA: fix half2 -> half conversion for HIP (#15529 )	2025-08-23 21:37:06 +02:00
Jeff Bolz	611f419cff	vulkan: optimize rms_norm, and allow the work to spread across multiple SMs (#15281 ) * vulkan: optimize rms_norm, and allow the work to spread across multiple SMs There are really two parts to this change: (1) Some optimizations similar to what we have in soft_max, to unroll with different numbers of iterations. (2) A fusion optimization where we detect add followed by rms_norm, and make the add shader atomically accumulate the values^2 into memory. Then the rms_norm shader can just load that sum. This allows the rms_norm to be parallelized across multiple workgroups, it just becomes a simple per-element multiply. The fusion optimization is currently only applied when the rms_norm is on a single vector. This previously always ran on a single SM. It could apply more broadly, but when there are other dimensions the work can already spread across SMs, and there would be some complexity to tracking multiple atomic sums. * Change add+rms_norm optimization to write out an array of partial sums rather than using atomic add, to make it deterministic. The rms_norm shader fetches a subgroup's worth in parallel and uses subgroupAdd to add them up. * complete rebase against fused adds - multi_add shader can also compute partial sums * fix validation errors * disable add_rms_fusion for Intel due to possible driver bug * resolve against #15489, sync after clearing partial sums b6258	2025-08-23 13:16:17 -05:00
Piotr Wilkin (ilintar)	b1afcab804	model : add support for Seed-OSS (#15490 ) * First draft * Fix linter errors * Added missing sinks nullptr * Don't forget the llama-arch! * We're through to the generation stage. * Fix post-attention norm * Apply suggestions from code review Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Fix RoPE type * Fix tensor name and reorder llm_types * Update gguf-py/gguf/constants.py Remove nonexistent FFN_POST_NORM tensor Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-model.h Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Add basic chat template * Add chat template tests * Remake chat template test * Apply suggestions from code review Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-chat.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Reorder llm type descriptions * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> b6257	2025-08-23 15:21:52 +02:00
Johannes Gäßler	9ef536907d	scripts: fix compare-llama-bench.py (#15521 )	2025-08-23 13:58:58 +03:00
LaffeyNyaa	21dc4ddaf2	chat : fix debug build assertion in trim function (#15520 ) b6255	2025-08-23 10:38:30 +02:00
Jeff Bolz	289bf4113e	vulkan: Rewrite synchronization to allow some overlap between nodes (#15489 ) Track a list of nodes that need synchronization, and only sync if the new node depends on them (or overwrites them). This allows some overlap which can improve performance, and centralizes a big chunk of the synchronization logic. The remaining synchronization logic involves writes to memory other than the nodes, e.g. for dequantization or split_k. Each of these allocations has a bool indicating whether they were in use and need to be synced. This should be checked before they are written to, and set to true after they are done being consumed. b6254	2025-08-23 09:33:36 +02:00
R0CKSTAR	b55f06e1aa	vulkan.Dockerfile: install vulkan SDK using tarball (#15282 ) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2025-08-23 08:58:57 +02:00
Acly	0a9b43e507	vulkan : support ggml_mean (#15393 ) * vulkan : support ggml_mean * vulkan : support sum, sum_rows and mean with non-contiguous tensors * vulkan : fix subbuffer size not accounting for misalign offset * tests : add backend-op tests for non-contiguous sum_rows * cuda : require contiguous src for SUM_ROWS, MEAN support * sycl : require contiguous src for SUM, SUM_ROWS, ARGSORT support * require ggml_contiguous_rows in supports_op and expect nb00=1 in the shader	2025-08-23 08:35:21 +02:00
Jeff Bolz	330c3d2d21	vulkan: optimize mul_mat_id loading row ids into shared memory (#15427 ) - Spread the work across the whole workgroup. Using more threads seems to far outweigh the synchronization overhead. - Specialize the code for when the division is by a power of two. b6251	2025-08-23 08:31:54 +02:00
Johannes Gäßler	e92734d51b	test-opt: allow slight inprecision (#15503 ) b6250	2025-08-22 23:47:01 +02:00
Reese Levine	45363632cb	ggml WebGPU: add support for quantization types (#15440 ) * Begin work on set_rows * Work on set rows * Add error buffers for reporting unsupported SET_ROWS indices * Remove extra comments * Work on templating for different types in shaders * Work on shader type generation * Working q4_0 mul_mat and some templating for different types * Add q4_0_f16 matmul and fix device init * Add matmul support for basic quantization types * Add q2_k and q3_k quantization * Add rest of k-quants * Get firt i-quant working * Closer to supporting all i-quants * Support rest of i-quants * Cleanup code * Fix python formatting * debug * Bugfix for memset * Add padding to end of buffers on creation * Simplify bit-shifting * Update usage of StringView b6249	2025-08-22 11:28:03 -07:00
Aldehir Rojas	32732f2459	model : gpt-oss add response_format support (#15494 ) b6248	2025-08-22 11:04:08 -05:00
rmatif	92f7f0a53c	ggml: add `conv3d` op (#15182 ) * add conv3d * bump GGML_OP_COUNT b6247	2025-08-22 15:33:15 +02:00
Yavor Ivanov	b1ab91821f	cuda : add Pad Reflect 1D support (#14659 ) * Add Pad Reflect 1D CUDA support * Update ggml/src/ggml-cuda/pad_reflect_1d.cu Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de> b6246	2025-08-22 13:06:29 +02:00
Georgi Gerganov	9ebebef62f	llama : remove KV cache defragmentation logic (#15473 ) ggml-ci b6245	2025-08-22 12:22:13 +03:00
Aaron Teo	ad5c975c2d	ggml-cpu: Support Q5_0 and Q5_1 on s390x (#15486 ) * ggml-cpu: initial q5_0 impl for s390x Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: updated q5_0 code for better performance Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: use optimised hsum for better performance Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: introduce q5_1 simd + refactor q5_0 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: fix incorrect return type vec_hsum Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: q5_0 incomplete refactor + table_b2b_0 activation Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: refactor q5_1 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: q5_1 update loop unroll to 4 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: update q5_0 unroll to 4 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: update build-s390x docs Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: update unused variables q5_0 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * docs: update the last update date Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> --------- Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> b6244	2025-08-22 16:11:04 +08:00
65a	4afb0a746f	server : Support multimodal completion and embeddings prompts in JSON format (#15108 ) - Use server_tokens in more places in server and util.cpp - Convert most functions that used llama_tokens to server_tokens - Modify input tokenizer to handle JSON objects as subprompts - Break out MTMD prompt parsing into utility function - Support JSON objects with multimodal_data arrays for MTMD prompts along with other existing types - Add capability to model endpoint to indicate if client can send multimodal data - Add tests. b6243	2025-08-22 10:10:14 +02:00
Tarek Dakhran	e288693669	readme : model : mtdm : lfm2 improvements (#15476 ) * Support untied embeddings * Increase number of image tokens to 1024 * Add LFM2-VL to readme * Actually use untied embeddings b6242	2025-08-22 09:29:08 +02:00
Chenguang Li	a0f98dd604	CANN: Optimize RMS_NORM using cache (#15419 ) * [CANN] Optimize RMS_NORM using cache Signed-off-by: noemotiovon <757486878@qq.com> * fix typo Signed-off-by: noemotiovon <757486878@qq.com> * fix review comment Signed-off-by: noemotiovon <757486878@qq.com> * codestyle adjustment Signed-off-by: noemotiovon <757486878@qq.com> --------- Signed-off-by: noemotiovon <757486878@qq.com> b6241	2025-08-22 14:12:07 +08:00
Diego Devesa	54a241f505	sched : fix possible use of wrong ids tensor when offloading moe prompt processing (#15488 ) b6240	2025-08-21 23:09:32 +02:00
Georgi Gerganov	cd36b5e5c7	llama : remove deprecated llama_kv_self API (#15472 ) ggml-ci b6239	2025-08-21 19:13:45 +03:00
Georgi Gerganov	3f196be84b	graph : remove build_attn_with_sinks overload (#15469 ) ggml-ci b6238	2025-08-21 18:44:45 +03:00
Acly	97ae5961a4	vulkan : support conv_2d_dw with f16 weights (#15392 ) b6237	2025-08-21 17:01:51 +02:00

6439 Commits