Commit Graph

6123 Commits

Author SHA1 Message Date
R0CKSTAR
9b8f3c6c77 musa: fix build warnings (unused variable) (#14869)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
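A minimal sketch of the usual pattern for this class of warning: explicitly discard the value. `GGML_UNUSED` is ggml's macro for this; whether this commit uses it for these particular variables is an assumption.

```cpp
// GGML_UNUSED as defined in ggml; reproduced here for a self-contained sketch
#define GGML_UNUSED(x) (void)(x)

static void launch_sketch(int device_id) {
    const int warp_size = 32;  // referenced only in some build configurations
    GGML_UNUSED(warp_size);    // silences -Wunused-variable in the others
    GGML_UNUSED(device_id);
}
```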
b5995
2025-07-26 10:36:02 +08:00
Aaron Teo
c7f3169cd5 ggml-cpu : disable GGML_NNPA by default due to instability (#14880)
* docs: update s390x document for sentencepiece

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
(cherry picked from commit e086c5e3a7)

* docs: update huggingface links + reword

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
(cherry picked from commit 8410b085ea)

* ggml-cpu: disable ggml-nnpa compile flag by default

fixes #14877

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
(cherry picked from commit 412f4c7c88)

* docs: update s390x build docs to reflect nnpa disable

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
(cherry picked from commit c1eeae1d0c)

---------

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
b5994
2025-07-25 19:09:03 +02:00
Gabe Goodhart
793c0d7f46 metal: SSM_SCAN performance (#14743)
* feat: Add s_off as a parameter in the args struct

This may not be necessary, but it more closely mirrors the CUDA kernel

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* perf: Parallelize mamba2 SSM_SCAN metal kernel over d_state

This is a first attempt at optimizing the metal kernel. The changes here
are:

- Launch the kernel with a thread group of size d_state
- Use simd groups and shared memory to do the summation for the y
  computation

When tested with the G4 tiny preview, this shows roughly a 3x speedup on
prefill and a 15% speedup on decode.

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Update logic to correctly do the multi-layer parallel sum

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Correctly size the shared memory buffer and assert expected size relationships

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Compute block offsets once rather than once per token

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Use local variable for state recursion

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
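The two refactors above (hoisting block offsets out of the per-token loop and keeping the recursive state in a local variable) amount to the following pattern, sketched here in plain C++ with an illustrative generic linear recurrence rather than the actual kernel code; all names are made up for the sketch.

```cpp
#include <cstdint>

void ssm_scan_row_sketch(const float * x, const float * dA, const float * dB,
                         float * state, float * y,
                         int64_t row, int64_t n_tokens, int64_t stride_row) {
    // offset computed once per row, not recomputed inside the token loop
    const float * x_row = x + row * stride_row;

    float s = *state;                      // state recursion kept in a register
    for (int64_t t = 0; t < n_tokens; ++t) {
        s = s * dA[t] + x_row[t] * dB[t];  // recurrence reads the local s
        y[t] = s;
    }
    *state = s;                            // written back to memory once
}
```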

* feat: Use a secondary simd_sum instead of a for loop

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add assertion and comment about relationship between simd size and num simd groups

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
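Taken together, the simd-group summation, the secondary `simd_sum`, and the simd-size assertion above describe a two-stage reduction. A minimal sketch of that pattern in Metal Shading Language (a C++ dialect), assuming a simd width of 32; the buffer layout and all names are illustrative, not the actual SSM_SCAN kernel's.

```cpp
#include <metal_stdlib>
using namespace metal;

// one threadgroup reduces one row; launched with threads_per_threadgroup == d_state
kernel void reduce_sketch(
        device const float * x     [[buffer(0)]],
        device       float * y     [[buffer(1)]],
        threadgroup  float * shmem [[threadgroup(0)]],
        uint tid   [[thread_index_in_threadgroup]],
        uint ntg   [[threads_per_threadgroup]],
        uint sgitg [[simdgroup_index_in_threadgroup]],
        uint tiisg [[thread_index_in_simdgroup]]) {
    // stage 1: each simdgroup reduces its lanes in registers
    const float partial = simd_sum(x[tid]);

    // one partial per simdgroup is staged in threadgroup (shared) memory
    if (tiisg == 0) {
        shmem[sgitg] = partial;
    }
    threadgroup_barrier(mem_flags::mem_threadgroup);

    // stage 2: a second simd_sum over the partials replaces a scalar loop;
    // this only works while the number of simdgroups does not exceed the
    // simd width, which is what the assertion above guards
    if (sgitg == 0) {
        const uint nsg = (ntg + 31) / 32;
        const float v  = tiisg < nsg ? shmem[tiisg] : 0.0f;
        const float total = simd_sum(v);
        if (tiisg == 0) {
            y[0] = total;
        }
    }
}
```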

* feat: Parallelize over d_state for mamba-1

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Parallel sum in SSM_CONV

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* Revert "feat: Parallel sum in SSM_CONV"

After discussion with @compilade, the amount of parallelism available here
is not worth the added complexity or the overhead of the parallel for.

https://github.com/ggml-org/llama.cpp/pull/14743#discussion_r2223395357

This reverts commit 16bc059660.

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Simplify shared memory sizing

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-Authored-By: Georgi Gerganov <ggerganov@gmail.com>

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b5993
2025-07-25 10:47:39 -06:00
lhez
ce111d39d6 opencl: add fused rms_norm_mul (#14841)
* opencl: add fused `rms_norm` + `mul`

* opencl: improve workgroup size for `rms_norm_mul`
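For context, a scalar C++ sketch of what the fused op computes per row: fusing the two kernels means the normalized intermediate is never materialized in global memory, only the final product is stored. The `eps` value and all names are illustrative, not the actual OpenCL kernel.

```cpp
#include <cmath>

void rms_norm_mul_row(const float * x, const float * w, float * y,
                      int n, float eps = 1e-6f) {
    float sum = 0.0f;
    for (int i = 0; i < n; ++i) {
        sum += x[i] * x[i];
    }
    const float scale = 1.0f / std::sqrt(sum / n + eps);
    for (int i = 0; i < n; ++i) {
        y[i] = x[i] * scale * w[i]; // norm and mul fused into one store
    }
}
```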
b5992
2025-07-25 17:12:13 +02:00
wooksong
e7fecba934 docs : update HOWTO-add-model.md for ModelBase and new model classes (#14874)
This patch updates the example in docs/development/HOWTO-add-model.md to
reflect recent changes after `TextModel` and `MmprojModel` were introduced.

It replaces the outdated `Model` base class with `TextModel` or `MmprojModel`
and updates the registration example accordingly.

Signed-off-by: Wook Song <wook16.song@samsung.com>
2025-07-25 16:25:05 +02:00
Oliver Simons
e2b7621e7c ggml : remove invalid portPos specifiers from dot files (#14838)
Neither "g" nor "x" are valid portPos specifiers per the official
[graphviz documents](https://graphviz.org/docs/attr-types/portPos/):

> If a compass point is used, it must have the form "n","ne","e","se","s","sw","w","nw","c","_".

I tested locally and confirmed that graphviz falls back to the default
portPos specifier when an invalid one is given. As a consequence, we can
remove the associated code.
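A hypothetical emitter (not ggml's actual dot-dump code) illustrating the constraint: after the second colon in a node reference, only the compass points quoted above are valid.

```cpp
#include <cstdio>

void dump_edge(FILE * fp, const char * src, const char * dst) {
    // valid: tail leaves the south side, head enters the north side
    std::fprintf(fp, "\"%s\":s -> \"%s\":n;\n", src, dst);
    // a specifier like ":g" or ":x" is not in the compass-point set,
    // and graphviz falls back to the default port position
}
```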
b5990
2025-07-25 14:29:57 +03:00
Georgi Gerganov
c1dbea752a context : restore preemptive sched reset when LLAMA_SET_ROWS=0 (#14870)
ggml-ci
b5989
2025-07-25 14:28:06 +03:00
kiwi
749e0d27f0 mtmd : fix 32-bit narrowing issue in export-lora and mtmd clip (#14503)
* [fix] Fix 32-bit narrowing issue in export-lora and mtmd clip

* Update export-lora.cpp

* Update clip.cpp

* Update export-lora.cpp

* format: replace tabs with spaces
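A sketch of the class of bug being fixed: ggml tensor dimensions are 64-bit (`int64_t`), so folding them into an `int` silently narrows on large tensors. The function and names below are illustrative, not the actual export-lora or clip.cpp code.

```cpp
#include <cstdint>

int64_t count_elements(const int64_t ne[4]) {
    // before: int n = ne[0] * ne[1] * ne[2] * ne[3];  // 32-bit narrowing
    // after: keep the arithmetic and the result in 64 bits
    return ne[0] * ne[1] * ne[2] * ne[3];
}
```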
b5988
2025-07-25 13:08:04 +02:00
Chris Rohlf
64bf1c3744 rpc : check for null buffers in get/set/copy tensor endpoints (#14868) b5987 2025-07-25 12:17:02 +02:00
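A minimal sketch of the hardening this describes: an RPC endpoint that looks up a buffer from a client-supplied id should reject a null result before dereferencing it. The registry and names here are hypothetical, not the actual rpc server code.

```cpp
#include <cstdint>
#include <map>

struct buffer_t { /* backing storage elided */ };

// hypothetical registry of buffers keyed by client-supplied id
static std::map<uint64_t, buffer_t *> buffers;

bool rpc_set_tensor_sketch(uint64_t buffer_id) {
    auto it = buffers.find(buffer_id);
    if (it == buffers.end() || it->second == nullptr) {
        return false; // malformed request: fail instead of crashing
    }
    // ... proceed with the tensor write using it->second ...
    return true;
}
```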
Diego Devesa
c12bbde372 sched : fix multiple evaluations of the same graph with pipeline parallelism (#14855)
ggml-ci
b5986
2025-07-25 11:07:26 +03:00
R0CKSTAR
3f4fc97f1d musa: upgrade musa sdk to rc4.2.0 (#14498)
* musa: apply mublas API changes

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* musa: update musa version to 4.2.0

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* musa: restore MUSA graph settings in CMakeLists.txt

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* musa: disable mudnnMemcpyAsync by default

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* musa: switch back to non-mudnn images

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* minor changes

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* musa: restore rc in docker image tag

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
b5985
2025-07-24 20:05:37 +01:00
Georgi Gerganov
2df255da3c sync : ggml
ggml-ci
b5984
2025-07-24 20:27:23 +03:00
Kai Pastor
60f816a79d cmake : fix usage issues (ggml/1257)
* CMake config: Create target only once

Fix error on repeated find_package(ggml).
For simplicity, check only for the top-level ggml::ggml.

* CMake config: Add CUDA link libs

* CMake config: Add OpenCL link libs

* CMake config: Use canonical find_dependency

Use set and append to control link lib variables.
Apply more $<LINK_ONLY...>.

* CMake config: Wire OpenMP dependency
2025-07-24 20:27:23 +03:00
Daniel Bevenius
5592f278b6 ggml-cpu : remove stdlib include from repack.cpp (ggml/1276)
This commit removes the inclusion of `<cstdlib>`.

The motivation for this change is that this source file does not seem to
use any functions from this header, and the comment about `qsort` is a
little misleading/confusing.
2025-07-24 20:27:23 +03:00
Georgi Gerganov
e4868d16d2 context : perform output reorder lazily upon access after sync (#14853)
* context : perform output reorder lazily upon access after sync

ggml-ci

* cont : add TODO
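A sketch of the lazy-reorder idea: instead of reordering outputs eagerly after every synchronization, set a flag and pay the cost only when an accessor actually needs the data. The member names are hypothetical, not llama.cpp's actual ones.

```cpp
struct context_sketch {
    bool output_reorder_pending = false;

    void synchronize() {
        // previously: reorder_outputs() was called here unconditionally
        output_reorder_pending = true;
    }

    float * get_logits() {
        if (output_reorder_pending) {
            reorder_outputs();           // deferred until first access
            output_reorder_pending = false;
        }
        return logits;
    }

    void reorder_outputs() { /* permute rows back into batch order */ }
    float * logits = nullptr;
};
```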
b5981
2025-07-24 16:31:48 +03:00
Xuan-Son Nguyen
820de57d4f chat : fix kimi-k2 chat template (#14852) b5980 2025-07-24 13:59:56 +02:00
Aaron Teo
f263f5d9ae ggml-zdnn: fix missing data transform call
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-07-24 18:30:10 +08:00
Aaron Teo
1c75ed63e5 ggml-zdnn: fix compiler error missing type
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-07-24 18:22:34 +08:00
Aaron Teo
a1d8568c14 ggml-zdnn: impl matmul
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-07-24 18:13:07 +08:00
Alberto Cabrera Pérez
cb4a63aad6 sycl: fixed semantics of block offset calculation (#14814) b5979 2025-07-24 11:09:57 +01:00
yummy
86f5623d90 llama : fix MiniCPM inference after Granite Four changes (#14850)
MiniCPM models use the llm_build_granite constructor, which was changed
in the Granite Four PR to use hparams.rope_finetuned instead of a
use_rope parameter. MiniCPM models need rope enabled by default.

This fixes inference, turning the previous gibberish output back into
correct responses.
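A simplified, hypothetical illustration of the regression's shape: the builder now derives rope usage from the hparams field rather than an explicit parameter, so models that expect rope on must have that field set.

```cpp
struct hparams_t { bool rope_finetuned = false; };

void build_granite_sketch(const hparams_t & hparams) {
    // before: an explicit use_rope parameter, true for MiniCPM
    // after:  derived from hparams; MiniCPM must enable rope_finetuned
    const bool use_rope = hparams.rope_finetuned;
    if (use_rope) {
        // ... apply rotary position embeddings to Q and K ...
    }
}
```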
b5978
2025-07-24 11:50:51 +02:00
Pouya
39cffdf188 docs: add libcurl-dev install hint for Linux distros (#14801)
* docs: add libcurl-dev install hint for Linux distros

Signed-off-by: PouyaGhahramanian <PooyaGhahramanian@gmail.com>

* Update docs/build.md

---------

Signed-off-by: PouyaGhahramanian <PooyaGhahramanian@gmail.com>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
2025-07-24 11:26:44 +02:00
Aaron Teo
59e9805ab0 ggml-zdnn: code clean up
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-07-24 16:26:29 +08:00
Aaron Teo
c1653ab639 ggml-zdnn: fix incorrect ztensor shape, reduce memory padding
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-07-24 16:22:06 +08:00
Aaron Teo
828519659b ggml-zdnn: update supports_op matmul matrix
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-07-24 16:11:37 +08:00
Georgi Gerganov
065908cb09 metal : fix fusion across different encoders (#14849)
* metal : fix fusion across different encoders

ggml-ci

* cont : add assertion

ggml-ci
b5976
2025-07-24 10:24:05 +03:00
Donghyeon Jeong
4ec6291a24 sycl: fix undefined variable in work group size check (#14843) b5975 2025-07-24 12:50:41 +08:00
Aaron Teo
18658b8607 ggml-zdnn: impl init_tensor
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-07-24 12:02:20 +08:00
jacekpoplawski
a12363bbf0 convert : text-only support for GLM-4.1V-9B-Thinking (#14823)
* use language_model part only, ignore visual layers

* fix rope_dim calculation
2025-07-23 23:23:57 +02:00
Johannes Gäßler
a86f52b285 CUDA: fix overflow in FA, tune performance (#14840) b5973 2025-07-23 21:43:25 +02:00
Aaron Teo
da2e0e70ba ggml-zdnn: switch buffers back and set to arbitrary number
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-07-24 02:31:22 +08:00
Aaron Teo
63fbc45ed6 ggml-zdnn: switch to std vector instead of array
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-07-24 01:09:01 +08:00
Aaron Teo
b7f4b6fde3 ggml-zdnn: rework init_tensor to create new buffers
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-07-24 01:03:53 +08:00
Aaron Teo
ee0ed78d54 ggml-zdnn: add check for view tensors to prevent init_tensor
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-07-24 00:56:32 +08:00
Aaron Teo
13c64448bd ggml-zdnn: assign tensor->extra to buffer
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-07-24 00:48:32 +08:00
Aaron Teo
13c05872f2 ggml-zdnn: implement at least 1 op to test
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-07-24 00:44:05 +08:00
Aaron Teo
9e84742e72 ggml-zdnn: test ztensor finding in init_tensor
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-07-24 00:40:22 +08:00
Aaron Teo
af9f4f0039 ggml-zdnn: fix compiler warnings and bugfixes
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-07-24 00:25:41 +08:00
Johannes Gäßler
b284197df4 CUDA: fix compilation with GGML_CUDA_F16 (#14837) b5972 2025-07-23 18:22:30 +02:00
Aaron Teo
ae2f656d7e ggml-zdnn: bugfix new impl
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-07-24 00:18:53 +08:00
Aaron Teo
7c6395f826 ggml-zdnn: rewrite the backend implementation
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-07-24 00:14:45 +08:00
Sigbjørn Skjæret
221c0e0c58 ci : correct label refactor->refactoring (#14832) 2025-07-23 14:27:54 +02:00
Aaron Teo
04ddb2ac95 ggml-zdnn: update op out_prod to use tensor->extra
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-07-23 19:51:37 +08:00
Aaron Teo
77a753297b ggml-zdnn: support op out_prod
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-07-23 19:28:51 +08:00
Johannes Gäßler
07a19e27a2 CUDA: fix quantized KV cache + multiple sequences (#14822)
* CUDA: fix quantized KV cache + multiple sequences

* Update ggml/src/ggml-cuda/fattn-common.cuh

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b5970
2025-07-23 14:08:09 +03:00
Georgi Gerganov
18f3b5ff9e tests : add non-cont K,V FA tests
ggml-ci
2025-07-23 14:08:09 +03:00
l3utterfly
7233358d29 memory : handle saving/loading null layers in recurrent memory (#14675)
* Update llama-memory-recurrent.cpp

handle saving/loading null layers in recurrent memory

* fixed styling issues and updated comments

* fix styling issue

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
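A minimal sketch of one way to save possibly-null layers, assuming a flat file stream: write a presence flag per layer so the loader knows whether state follows. The flag-then-payload layout and all names are hypothetical, not llama.cpp's actual session format.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

struct layer_state { std::vector<float> s; };

void save_layers(FILE * f, const std::vector<layer_state *> & layers) {
    for (const layer_state * l : layers) {
        const uint8_t present = l != nullptr ? 1 : 0;
        std::fwrite(&present, sizeof(present), 1, f); // flag first
        if (present) {
            const uint64_t n = l->s.size();
            std::fwrite(&n, sizeof(n), 1, f);
            std::fwrite(l->s.data(), sizeof(float), n, f);
        }
        // null layers contribute only the flag; the loader skips them
    }
}
```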
b5968
2025-07-23 11:16:41 +03:00
Aaron Teo
11d58d29de ggml-zdnn: add comments to prevent accidentally deleting lines
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-07-23 14:54:44 +08:00
lixing-star
6c88b3bb25 ggml: fix loongarch quantize_row_q8_1 error (#14827) b5967 2025-07-23 09:39:51 +03:00
chen fan
14c28dfc50 CANN: weight format to NZ for Ascend310P3 (#14407)
* weight format to nz for 310p

* remove quant weight format to nz

* clean code

* fix

* make the conditions for converting weights to NZ format consistent

* clean code
b5966
2025-07-23 11:58:00 +08:00