llama.cpp

mirror of https://github.com/ggml-org/llama.cpp.git synced 2025-11-02 09:12:03 +00:00

Author	SHA1	Message	Date
Sigbjørn Skjæret	144a4ce824	vendor : sync minja (#16500 ) * sync minja.hpp Adds Call/EndCall support, used in MiniCPM3 and MiniCPM4-MCP. * remove spurious semicolon * sync from ochafik/minja b6873	2025-10-29 14:09:50 +01:00
Jeff Bolz	f549b0007d	vulkan: Call ggml_vk_buffer_write_2d from ggml_vk_buffer_copy (#16793 ) This lets the copy to the destination device use the host-visible vidmem optimization. b6872	2025-10-29 09:53:04 +01:00
Aman Gupta	9a3ea685b9	CUDA: Fix bug in topk-moe for gpt-oss (#16821 ) * CUDA: Fix bug in topk-moe for gpt-oss When using ggml_can_fuse_subgraph, the output nodes which are passed are wrong. This causes `test-backend-ops` to still fuse ndoes (because the nodes are not used elsewhere in the graph), but it actually doesn't fuse in the actual gpt-oss * fix for qwen3 too * change ifndef to ifdef b6871	2025-10-29 15:55:06 +08:00
YaelLogic	338074c383	sycl: add RMS_NORM_BACK operation support (#16808 ) * sycl: add RMS_NORM_BACK operation support * sycl: rms_norm_back: add dual reduction paths (FP64 and FP32) and savepoint before further changes * sycl: add RMS_NORM_BACK support Implement RMS_NORM_BACK for the SYCL backend using FP32 compensated parallel reduction. Minimal docs updates (ops.md / SYCL.csv). * revert: restore .gitignore and tools/run/CMakeLists.txt to upstream * revert: restore tests/CMakeLists.txt to upstream * sycl: optimize rms_norm_back * fix: restore SYCL.csv to correct state with RMS_NORM_BACK support * Update ggml/src/ggml-sycl/norm.cpp Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com> * fix: remove trailing whitespace and add missing newline (EditorConfig) --------- Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com> b6870	2025-10-29 14:14:39 +08:00
YaelGitAccount	851553ea6b	cuda: add SET operation support (#16804 ) * feat(cuda): add GGML_OP_SET support Implement CUDA kernel for SET operation with f32 support. All tests passing (14598/14598). * cuda(set): add I32 support; keep F32 * refactor(cuda): use ggml_cuda_cpy to unify SET operator logic and remove code duplication * Update ggml/src/ggml-cuda/ggml-cuda.cu Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update ggml/src/ggml-cuda/set.cu Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> b6869	2025-10-28 20:10:28 +01:00
Georgi Gerganov	85a7d8677b	memory : remove KV cache size padding (#16812 ) * memory : remove KV cache size padding * cont : restore padding for n_kv tensor shape * server : use slot context size instead of training context size * server : simplify context limit logic b6868	2025-10-28 20:19:44 +02:00
Georgi Gerganov	a8ca18b4b8	llama-bench : clarify benchmarked parts of the computation (#16823 )	2025-10-28 19:41:43 +02:00
l3utterfly	8284efc35c	initialise buffer.device in ggml_hexagon_session (#16816 ) b6866	2025-10-28 08:16:20 -07:00
Sam Malayek	1c1409e131	embedding: add raw option for --embd-output-format (#16541 ) * Add --embd-output-format raw for plain numeric embedding output This new option outputs embeddings as raw space-separated floats, without JSON or 'embedding N:' prefixes. Useful for downstream vector pipelines and scripting. * Move raw output handling into format handling section * Move raw output handling into else-if block with other format handlers * Use LOG instead of printf for raw embedding output * docs: document 'raw' embedding output format in arg.cpp and README b6865	2025-10-28 12:51:41 +02:00
Johannes Gäßler	7a0e900e36	llama: consistent ctx <-> buf order for KV cache (#16746 ) b6864	2025-10-28 11:23:54 +01:00
Aldehir Rojas	280d97be96	grammar : support array references in json schema (#16792 ) * grammar : support array references in json schema * Update json-schema-to-grammar.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * grammar : improve regex when naming ref derived rules * grammar : replace non-conformant definitions array with anyOf test case --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> b6863	2025-10-28 09:37:52 +01:00
Chenguang Li	3479efd112	CANN: Improve device ID handling and aclnnArange checks (#16752 ) * cann: improve device ID handling and aclnnArange checks - Stop relying on CANN's internal device ID retrieval; use a global variable instead. - Enforce stricter dimension validation in aclnnArange for better compatibility across CANN versions. * cann: use thread local var b6862	2025-10-28 10:54:53 +08:00
Aman Gupta	463bbf20bf	CUDA: add unused vars to mmvf and mmvq (#16807 ) b6861	2025-10-28 10:31:21 +08:00
tamarPal	ad8d36beff	sycl: add SSM_CONV operation support (#16800 ) * feat: Add SYCL backend support for SSM_CONV operator * Implement State Space Model Convolution 1D for SYCL backend * Add optimized GPU kernel with parallel work distribution * Support various tensor dimensions and batch sizes * Full integration with existing SYCL infrastructure * All tests pass with CPU backend equivalence verification * feat: Implement SYCL backend support for SSM_CONV operation - Add ggml-sycl/ssm_conv.cpp and ssm_conv.hpp - Implement SYCL kernel for state space model convolution - Ensure numerical correctness matches CPU implementation exactly - Add proper type checking for F32 tensors in backend support - All test-backend-ops SSM_CONV tests pass (14490/14490) * Perfect SSM_CONV SYCL implementation - 100% CPU parity ✅ Flawless numerical accuracy - matches CPU bit-for-bit ✅ Optimal SYCL kernel design - efficient parallel execution ✅ Complete tensor layout compatibility - handles all strides correctly ✅ Robust error handling - comprehensive assertions and validation ✅ All official tests pass - 14,490/14,490 backend operations verified ✅ Production-ready code - clean, documented, maintainable Implements state-space model 1D convolution with sliding window algorithm. Eliminates blocking queue.wait() for better async performance. * Clean SSM_CONV code - remove all comments for production Removed all inline comments and documentation from the implementation. Clean, minimal code ready for production merge. * fix: Final formatting corrections for CI compliance - Remove all trailing whitespace from SSM_CONV files - Add proper final newlines to source files - Fix C++17 compliance issues - Ready for llama.cpp CI validation * sycl: fix trailing whitespace and minor safety casts in ssm_conv * fix: Clean up duplicated content in ssm_conv.hpp header file --------- Co-authored-by: tamarPal <tamarPal@example.com> b6860	2025-10-28 09:50:33 +08:00
Yuri Khrustalev	c053e18a66	chat: Add LFM2 tool handling (#16763 ) * Add LFM2 tool handling * fmt * Apply suggestion from @ykhrustalev b6859	2025-10-27 23:54:01 +01:00
Xuan-Son Nguyen	e1ab084803	mtmd : fix idefics3 preprocessing (#16806 ) * mtmd : fix idefics3 preprocessing * disable granite test * fix test for granite b6858	2025-10-27 23:12:16 +01:00
Diego Devesa	5a4ff43e7d	llama : disable pipeline parallelism if compute buffer allocation fails (#16748 ) b6857	2025-10-27 21:51:28 +01:00
Acly	10640e31aa	ggml : fix interpolate with align-corners and ne=1 (#16700 ) * ggml : fix interpolate with align-corners and ne=1 * avoid division by zero if one of the spatial dimensions is 1 * cpu, cuda, opencl returned correct result anyway due to clamp * vulkan didn't clamp for align-corners so results were broken * fix clang warning b6856	2025-10-27 21:50:22 +01:00
Johannes Gäßler	80d28f104c	HIP: fix AMDGPU_TARGETS, update documentation (#16803 ) b6855	2025-10-27 21:39:49 +01:00
Xuan-Son Nguyen	c55d53acec	model : add LightOnOCR-1B model (#16764 ) * model : add LightOnOCR-1B model * add test b6854	2025-10-27 16:02:58 +01:00
Johannes Gäßler	945501f5ea	llama: fix leaked buffers for mmap + split files (#16765 ) b6853	2025-10-27 09:17:31 +01:00
Aman Gupta	75cbdd3fce	test-backend-ops: print failed tests at the end (#16785 ) b6852	2025-10-27 09:25:10 +08:00
tamarPal	2b9bd9bf4e	sycl: add ROLL operation support (#16665 ) * sycl: add ROLL operation support - Implement ggml_sycl_roll function for F32 tensors - Add multi-axis roll operation with SYCL kernel - Support all 4 tensor dimensions with proper shift normalization - Add roll.cpp and roll.hpp to SYCL backend - Update backend dispatch and supports_op for GGML_OP_ROLL - Tests: 17662/17662 pass with identical CPU reference results * fix: remove trailing whitespace from roll.cpp - Fix EditorConfig violations in ggml/src/ggml-sycl/roll.cpp - Remove trailing spaces from lines 6, 11, 28, 47, 58, 60 * ci: retrigger * sycl: remove wait() calls from ROLL operation * fix: editorconfig — LF endings + final newline for roll.hpp --------- Co-authored-by: tamarPal <tamarPal@example.com> b6851	2025-10-27 09:20:24 +08:00
shani-f	59fc1ec8e8	sycl: add REPEAT_BACK operation support (#16734 ) * SYCL repeat_back v1 — add core op + switch case * Implement repeat_back SYCL operation and minor fixes * Update ggml/src/ggml-sycl/repeat_back.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update ggml/src/ggml-sycl/repeat_back.hpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update ggml/src/ggml-sycl/ggml-sycl.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> b6850	2025-10-27 09:19:50 +08:00
Aman Gupta	75d33b9302	CUDA: support for weight clamp in top-k norm (#16702 ) b6849	2025-10-27 09:06:16 +08:00
Acly	3470a5c891	ggml-alloc : make gallocr prefer chunks that allow memory reuse (#16788 ) b6848	2025-10-26 23:19:03 +01:00
Sigbjørn Skjæret	bd562fe4f7	cuda : use fast copy when src and dst are of different type and contiguous (#16789 ) * use fast copy when src and dst are contiguous and same shape * use int64_t ne and ignore shape b6847	2025-10-26 21:31:41 +01:00
leejet	bbac6a26b2	ggml: fix cuda kernel launch configuration for k_compute_batched_ptrs to support large batch (#16744 ) * fix k_compute_batched_ptrs * add backend ops test * Update ggml/src/ggml-cuda/ggml-cuda.cu Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * reduce the batch size --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de> b6846	2025-10-26 19:13:31 +01:00
Sigbjørn Skjæret	73a48c9790	convert : enable expert group selection for all models with it (#16691 ) b6845	2025-10-26 17:21:23 +01:00
Sigbjørn Skjæret	f696428ce8	graph : add clamping to ffn_moe_weights_sum to avoid div-by-zero (#16655 ) * add missing norm topk bias * use clamping instead, update number and add comment b6844	2025-10-26 17:20:32 +01:00
Sigbjørn Skjæret	7cce4f8158	model : set res->t_embd in SmallThinker models (#16782 ) b6843	2025-10-26 16:08:52 +01:00
amirai21	8d8862829c	docs : add Jamba to Text-only models list (#16778 )	2025-10-26 13:01:20 +01:00
Aman Gupta	f77c13b91f	CUDA: General GEMV fusion (#16715 ) b6841	2025-10-26 19:28:04 +08:00
Gilad S.	3cfa9c3f12	vulkan: deduplicate Microsoft Direct3D12 devices (#16689 ) * fix: deduplicate and deprioritize Microsoft Direct3D12 vulkan devices from the `vulkan-dozen` driver * style: indent * fix: decrease priority * fix: switch to `\|\|` b6840	2025-10-26 05:37:38 +01:00
Galunid	5d195f17bc	convert : handle mmproj filename/path properly (#16760 ) * convert: handle mmproj model output filename properly * remove redundant commits * Add model_type to gguf utility * Use mmproj- prefix instead of suffix * Apply CISC suggestion Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> b6839	2025-10-25 20:41:36 +02:00
Shunta Saito	226f295f4d	model : set res->t_embd in PLaMo2 models (#16766 ) b6838	2025-10-25 12:26:27 +02:00
Giuseppe Scrivano	f90b4a8efe	vulkan: delete dead code (#16732 ) ggml_vk_create_buffer_temp is not used anywhere, and it is the only caller for ggml_vk_pool_malloc. Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com> b6837	2025-10-25 10:59:54 +02:00
Jeff Bolz	8423d01931	vulkan: Optimize SSM_SCAN (#16645 ) b6836	2025-10-25 07:04:12 +02:00
compilade	5cca2542ac	convert : avoid dequantizing mxfp4 for GPT-OSS (#16756 ) b6835	2025-10-24 20:52:00 -04:00
leejet	55945d2ef5	ggml: fix CUDA grid launch condition for large block_nums.y in binbcast (#16742 ) * Fix CUDA grid launch condition for large block_nums.y * add backend ops test * reduce test repetitions b6834	2025-10-24 21:39:37 +02:00
Aman Gupta	0bcb40b48c	CUDA: use CUB for arbitary size argsort (#16754 ) b6833	2025-10-24 20:46:19 +08:00
Florian Badie	69e9ff0103	webui: support q URL parameter (#16728 ) * webui: support q URL parameter Fixes #16722 I’ve checked that it works with Firefox’s AI tools * webui: apply suggestions from code review Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * chore: update webui static build --------- Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>	2025-10-24 14:10:29 +02:00
Daniel Bevenius	5a91109a5d	model-conversion : add trust_remote_code for orig model run [no ci] (#16751 ) This commit add the trust_remote_code=True argument when loading models using AutoConfig, AutoTokenizer, and AutoModelForCausalLM for the run original model script. The motivation for this is that some models require custom code to be loaded properly, and setting trust_remote_code=True avoids a prompt asking for user confirmation: ```console (venv) $ make causal-run-original-model The repository /path/to/model contains custom code which must be executed to correctly load the model. You can inspect the repository content at /path/to/model. Do you wish to run the custom code? [y/N] N ``` Having this as the default seems like a safe choice as we have to clone or download the models we convert and would be expecting to run any custom code they have.	2025-10-24 12:02:02 +02:00
compilade	f8f071fadd	convert : handle pre-quantized models (#14810 ) * convert : begin handling pre-quantized models * convert : fix conversion from FP8 for Deepseek-V3.1-Base b6830	2025-10-23 16:31:41 -04:00
Johannes Gäßler	0bf47a1dbb	server: add memory breakdown print (#16740 ) b6829	2025-10-23 21:30:17 +02:00
Julien Denize	dd62dcfab9	convert : Make mistral-common dependency optional (#16738 ) * Make mistral-common dependency optional * Fix typing	2025-10-23 15:54:46 +02:00
Xuan-Son Nguyen	d0660f237a	mtmd-cli : allow using --jinja (#16718 ) * mtmd-cli : allow using --jinja * support -sys * implement chat_history * fix clear memory * rm -sys support, added TODO b6827	2025-10-23 15:00:49 +02:00
Prajwal B Mehendarkar	fe6a9882ac	Manually link -lbsd to resolve flock symbol on AIX (#16610 ) b6826	2025-10-23 19:37:31 +08:00
Aman Gupta	061f0eff02	ggml-cuda: use passed ops instead of hardcoded ops (#16712 ) b6825	2025-10-23 19:14:06 +08:00
matteo	8cf6b42d46	server : send partial stop string when <EOG> is reached (#15007 ) b6824	2025-10-23 12:32:24 +03:00

1 2 3 4 5 ...

6873 Commits