llama.cpp

mirror of https://github.com/ggml-org/llama.cpp.git synced 2025-11-07 09:57:00 +00:00

Author	SHA1	Message	Date
Xuan-Son Nguyen	070ff4d535	mtmd: add --image-min/max-tokens (#16921 ) b6935	2025-11-03 11:11:18 +01:00
Xuan-Son Nguyen	bf7b0c9725	mtmd: pad mask for qwen2.5vl (#16954 ) * mtmd: pad mask for qwen2.5vl * improve b6934	2025-11-03 10:25:55 +01:00
Jinyang He	fcfce040e8	ggml : LoongArch fixes (#16958 ) * Fix test-quantize-fns f16 and q4_0 failed when use LSX * Fix LoongArch set float intrinsic when use LSX/LASX b6933	2025-11-03 08:40:02 +02:00
Olivier Chafik	ee3a5a10ad	sync: minja (glm 4.6 & minmax m2 templates) (#16949 ) * sync: minja * Sync https://github.com/ochafik/minja/pull/7 (MinMax M2) b6932	2025-11-03 07:33:56 +02:00
shani-f	7e994168b1	SYCL: optimized repeat_back kernel (3× fewer asm instructions, 2× faster) Feature/sycl repeat back opt (#16869 ) * SYCL repeat_back v1 — add core op + switch case * Implement repeat_back SYCL operation and minor fixes * SYCL: optimize repeat_back kernel * Remove Hebrew comment from repeat_back.cpp * Remove comments for code clarity Removed comments to clean up the code. * Fix formatting in ggml-sycl.cpp * Formatted lambda according to legacy style. No logic changes * Remove blank line in repeat_back.cpp Remove unnecessary blank line before assigning acc to dst_dd. b6931	2025-11-03 09:35:33 +08:00
Sascha Rogmann	bcfa87622a	feat(webui): improve LaTeX rendering with currency detection (#16508 ) * webui : Revised LaTeX formula recognition * webui : Further examples containg amounts * webui : vitest for maskInlineLaTeX * webui: Moved preprocessLaTeX to lib/utils * webui: LaTeX in table-cells * chore: update webui build output (use theirs) * webui: backslash in LaTeX-preprocessing * chore: update webui build output * webui: look-behind backslash-check * chore: update webui build output * Apply suggestions from code review Code maintenance (variable names, code formatting, string handling) Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * webui: Moved constants to lib/constants. * webui: package woff2 inside base64 data * webui: LaTeX-line-break in display formula * chore: update webui build output * webui: Bugfix (font embedding) * webui: Bugfix (font embedding) * webui: vite embeds assets * webui: don't suppress 404 (fonts) * refactor: KaTeX integration with SCSS Moves KaTeX styling to SCSS for better customization and font embedding. This change includes: - Adding `sass` as a dev dependency. - Introducing a custom SCSS file to override KaTeX variables and disable TTF/WOFF fonts, relying solely on WOFF2 for embedding. - Adjusting the Vite configuration to resolve `katex-fonts` alias and inject SCSS variables. * fix: LaTeX processing within blockquotes * webui: update webui build output --------- Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>	2025-11-03 00:41:08 +01:00
Shagun Bera	a2054e3a8f	test-backend-ops : fix segfault in moe-expert-reduce test in support mode and coverage (#16936 ) * tests: fix segfault in moe-expert-reduce test in support mode and --show-coverage * tests: init gf and filter out fusion tests for support mode * tests: filter out fusion cases before calling eval_support * tests: filter out fusion cases from show_test_coverage as well, fix lint b6929	2025-11-03 00:10:30 +01:00
Sigbjørn Skjæret	dd52868050	ci : disable failing riscv cross build (#16952 )	2025-11-02 23:11:21 +01:00
Zhiyong Wang	6b9a52422b	model: add Janus Pro for image understanding (#16906 ) * Add support for Janus Pro * Update gguf-py/gguf/tensor_mapping.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update gguf-py/gguf/tensor_mapping.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Address reviewer suggestions Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Add JANUS_PRO constant * Update clip model handling Co-authored-by: Xuan-Son Nguyen <son@huggingface.co> * Update tools/mtmd/clip.cpp Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com> * Refactor JANUS_PRO handling in clip.cpp Co-authored-by: Xuan-Son Nguyen <son@huggingface.co> * Update tools/mtmd/clip.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * em whitespace --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> Co-authored-by: Xuan-Son Nguyen <son@huggingface.co> Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com> b6927	2025-11-02 22:08:04 +01:00
Georgi Gerganov	2f966b8ed8	clip : use FA (#16837 ) * clip : use FA * cont : add warning about unsupported ops * implement "auto" mode for clip flash attn * clip : print more detailed op support info during warmup * cont : remove obsolete comment [no ci] * improve debugging message * trailing space * metal : remove stray return --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co>	2025-11-02 21:21:48 +01:00
Georgi Gerganov	cd5e3b5754	server : support unified cache across slots (#16736 ) * server : support unified context across slots * cont : fix speculative decoding initialization * context : fix n_ctx_per_seq computation * server : purge slots one by one * tests : add unified cache server tests * llama : update per-seq context computation * test-thread-safety : handle tiny training context of the input model * server : fix server_tokens clear() * server : use 4 slots + unified KV by default * llama : add note about context size queries * cont : update todos [no ci] * context : do not cap the size of the context * tests : adjust parameters to be CI friendlier * context : add warning	2025-11-02 18:14:04 +02:00
Aldehir Rojas	87c9efc3b2	common : move gpt-oss reasoning processing to init params (#16937 ) b6924	2025-11-02 16:56:28 +02:00
Adrian Lundberg	76af40aaaa	docs: remove llama_sampler_accept reference in sampling sample usage (#16920 ) commit `5fb5e24811` (llama : minor sampling refactor (2) (#9386)) moved the llama_sampler_accept call into llama_sampler_sample, but the sampling sample usage in llama.h was forgotten to be updated accordingly. b6923	2025-11-02 11:28:37 +02:00
mnehete32	7db35a7958	CUDA: add FLOOR, CEIL, ROUND, TRUNC unary ops (#16917 ) b6922	2025-11-02 11:12:57 +08:00
Aaron Teo	a864132ba5	devops: fix failing s390x docker build (#16918 )	2025-11-02 08:48:46 +08:00
Aaron Teo	d38d9f0877	ggml: add s390x cpu-feats (#16774 ) b6920	2025-11-02 08:48:23 +08:00
Georgi Gerganov	7fd205a8e8	scripts : add script to bench models (#16894 ) b6919	2025-11-02 00:15:31 +02:00
Pascal	2f68ce7cfd	webui: auto-refresh /props on inference start to resync model metadata (#16784 ) * webui: auto-refresh /props on inference start to resync model metadata - Add no-cache headers to /props and /slots - Throttle slot checks to 30s - Prevent concurrent fetches with promise guard - Trigger refresh from chat streaming for legacy and ModelSelector - Show dynamic serverWarning when using cached data * fix: restore proper legacy behavior in webui by using unified /props refresh Updated assistant message bubbles to show each message's stored model when available, falling back to the current server model only when the per-message value is missing When the model selector is disabled, now fetches /props and prioritizes that model name over chunk metadata, then persists it with the streamed message so legacy mode properly reflects the backend configuration * fix: detect first valid SSE chunk and refresh server props once * fix: removed the slots availability throttle constant and state * webui: purge ai-generated cruft * chore: update webui static build	2025-11-01 19:49:51 +01:00
Pascal	e4a71599e5	webui: add HTML/JS preview support to MarkdownContent with sandboxed iframe (#16757 ) * webui: add HTML/JS preview support to MarkdownContent with sandboxed iframe dialog Extended MarkdownContent to flag previewable code languages, add a preview button alongside copy controls, manage preview dialog state, and share styling for the new button group Introduced CodePreviewDialog.svelte, a sandboxed iframe modal for rendering HTML/JS previews with consistent dialog controls * webui: fullscreen HTML preview dialog using bits-ui * Update tools/server/webui/src/lib/components/app/misc/CodePreviewDialog.svelte Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * Update tools/server/webui/src/lib/components/app/misc/MarkdownContent.svelte Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * webui: pedantic style tweak for CodePreviewDialog close button * webui: remove overengineered preview language logic * chore: update webui static build --------- Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>	2025-11-01 17:14:54 +01:00
Adrien Gallouët	dd5e8cab51	vendor : update cpp-httplib to 0.27.0 (#16846 ) Signed-off-by: Adrien Gallouët <angt@huggingface.co> b6916	2025-11-01 16:52:17 +01:00
Xuan-Son Nguyen	cf659bbb8e	mtmd: refactor preprocessing + support max/min pixels (#16878 ) * mtmd: refactor preprocessing + support max/min pixels * fix mlp type * implement mix/max pixels * improve hparams * better image preproc for qwen * fix * fix out of bound composite * fix (2) * fix token calculation * get_merge_kernel_size() * fix llama4 and lfm2 * gonna fix them all * use simple resize for qwen * qwen: increase min tokens * no resize if dst size == src size * restore to initial min/max tokens value for qwen b6915	2025-11-01 15:51:36 +01:00
Aleksander Grygier	d8b860a219	Add a setting to display message generation statistics (#16901 ) * feat: Add setting to display message generation statistics * chore: build static webui output	2025-11-01 15:35:57 +01:00
Jaromír Hradílek	1ae74882f8	webui: recognize AsciiDoc files as valid text files (#16850 ) * webui: recognize AsciiDoc files as valid text files * webui: add an updated static webui build * webui: add the updated dependency list * webui: re-add an updated static webui build This also reverts commit `742dbb8379`.	2025-11-01 15:02:57 +01:00
Sigbjørn Skjæret	961660b8c3	common : allow --system-prompt-file for diffusion-cli (#16903 ) b6912	2025-11-01 11:01:42 +01:00
Sigbjørn Skjæret	74fef4129f	codeowners : update after refactor (#16905 )	2025-11-01 09:55:25 +02:00
Jeff Bolz	5d8bb900bc	vulkan: Fix multi_add invalid descriptor usage (#16899 ) b6910	2025-11-01 06:52:14 +01:00
Jeff Bolz	2e76e01360	vulkan: fuse mul_mat+add and mul_mat_id+add_id (#16868 ) * vulkan: fuse mul_mat+add and mul_mat_id+add_id The fusion is only applied for the mat-vec mul paths. * Apply suggestions from code review Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * fix 32b build --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> b6909	2025-11-01 06:45:28 +01:00
Oliver Simons	d3dc9dd898	CUDA: Remove unneeded bias/gate dims in fused mmvq (#16858 ) * CUDA: Remove unneeded bias/gate dims in fused mmvq Pointed out [here](https://github.com/ggml-org/llama.cpp/pull/16847#discussion_r2476798989) that only a single value is needed per target col per thread * Apply suggestions from code review Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Fix "Error 991-D: extra braces are nonstandard" during compilation --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de> b6908	2025-11-01 13:13:26 +08:00
Piotr Wilkin (ilintar)	bea04522ff	refactor : llama-model.cpp (#16252 ) * Squashed: llama-model.cpp refactoring * Fix formatting of attn / ffn / ffn_moe calls * Fix import regression / unify spacing in models.h * totally DID NOT miss those! * Add missing qwen3vl(moe) models * Add missing new .cpp files to build * Remove extra semicolons * Editor checker * Update src/models/models.h Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> b6907	2025-10-31 23:40:23 +01:00
Piotr Wilkin (ilintar)	0de0a01576	model : Minimax M2 (#16831 ) * Model: Minimax M2 * Cleanup * Cleanup pt. 2 * Cleanup pt. 3 * Update convert_hf_to_gguf_update.py - merge catch blocks Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Remove vocab models and test * Remove all redundant hparam settings covered by TextModel * Move super to start, don't set block_count * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update gguf-py/gguf/constants.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> b6906	2025-10-31 21:20:47 +01:00
Giuseppe Scrivano	e58d585604	model : add Granite Hybrid nano types (#16896 ) Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com> b6905	2025-10-31 21:20:07 +01:00
Johannes Gäßler	31c511a968	CUDA: Volta tensor core support for MMF (#16843 ) * CUDA: Volta tensor core support for MMF * more generic checks for hardware support * Update ggml/src/ggml-cuda/mmf.cuh Co-authored-by: Aman Gupta <amangupta052@gmail.com> --------- Co-authored-by: Aman Gupta <amangupta052@gmail.com> b6904	2025-10-31 15:57:19 +01:00
Georgi Gerganov	6d39015a74	sync : ggml	2025-10-31 16:26:28 +02:00
Aman Gupta	4146d6a1a6	CUDA: add expert reduce kernel (#16857 ) * CUDA: add expert reduce kernel * contiguous checks, better formatting, use std::vector instead of array * use vector empty instead of size Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-10-31 20:05:07 +08:00
Georgi Gerganov	8da3c0e200	batch : fix consistency checks for the input positions (#16890 ) b6901	2025-10-31 13:50:33 +02:00
Georgi Gerganov	c22473b580	server : don't print user inputs to console (#16871 ) b6900	2025-10-31 10:54:19 +02:00
Daniel Bevenius	0f715b4e75	server : fix typos in server.cpp comments [no ci] (#16883 )	2025-10-31 09:51:26 +01:00
Jeff Bolz	d2d931f173	vulkan: disable spirv-opt for rope shaders (#16872 ) b6898	2025-10-31 08:34:47 +01:00
Masato Nakasaka	2976b0374d	vulkan: Fix crash when FP16 mul_mat accumulation is not supported (#16796 ) * Experimenting crash fix * added assert for aborting and fixed comment * changed to check if a pipeline is empty or not * Moved function in class definition * replaced with is_empty * Modified is_empty to check only unaligned pipelines b6897	2025-10-31 08:18:59 +01:00
Ruben Ortlam	d2a2673dd1	vulkan: fix shmem overrun in mmq id shader (#16873 ) * vulkan: fix shmem overrun in mmq id shader * metal : fix mul_mm_id --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> b6896	2025-10-31 08:14:49 +01:00
l3utterfly	13002a0896	ggml-hexagon: respect input size when getting/setting tensor data (#16836 ) * respect input size when getting/setting tensor data allows partial repacking/copying when get tensor size is smaller than the actual tensor * Removed duplicate repack_mxfp4_mxfp4x4x2 function b6895	2025-10-30 21:46:31 -07:00
Sigbjørn Skjæret	6eb208d17e	ci : enable free-disk-space on cuda docker build (#16877 ) b6894	2025-10-31 00:34:27 +01:00
lhez	9984cbb61d	opencl: fix boundary handling for mul_mm (#16875 )	2025-10-30 16:00:20 -07:00
RodriMora	ce18efeaf1	convert : update transformers requirements (#16866 ) * Update requirements-convert_legacy_llama.txt Updated requirements to support Qwen3-VL in transformers 4.57.1 version * Update requirements/requirements-convert_legacy_llama.txt Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-10-30 23:15:03 +01:00
chansikpark	16724b5b68	server : bump request URI max length to 32768 (#16862 ) b6891	2025-10-30 20:22:23 +02:00
Georgi Gerganov	b52edd2558	server : remove n_past (#16818 ) * server : remove n_past * server : replace slot.n_prompt_tokens() with slot.task->n_tokens() * server : fixes + clean-up * cont : fix context shift * server : add server_tokens::pos_next() Co-authored-by: Xuan-Son Nguyen <son@huggingface.co> * server : fix pos_next() usage Co-authored-by: Xuan-Son Nguyen <son@huggingface.co> --------- Co-authored-by: Xuan-Son Nguyen <son@huggingface.co> b6890	2025-10-30 18:42:57 +02:00
Max Krasnyansky	517b7170e1	cpu: introduce chunking for repack matmuls and enable matmul-id chunking on ARM64 (#16833 ) Very similar implementation to the flash-attention chunking, with similar benefits. b6889	2025-10-30 09:06:13 -07:00
Shagun Bera	835e918d84	common: fix typo in cli help text (#16864 ) b6888	2025-10-30 17:47:31 +02:00
JJJYmmm	d261223d24	model: add support for qwen3vl series (#16780 ) * support qwen3vl series. Co-authored-by: Thireus ☠ <Thireus@users.noreply.github.com> Co-authored-by: yairpatch <yairpatch@users.noreply.github.com> Co-authored-by: LETS-BEE <LETS-BEE@users.noreply.github.com> * bugfix: fix the arch check for qwen3vl-moe. * use build_ffn * optimize deepstack structure * optimize deepstack feature saving * Revert "optimize deepstack feature saving" for temporal fix This reverts commit `f321b9fdf1`. * code clean * use fused qkv in clip * clean up / rm is_deepstack_layers for simplification * add test model * move test model to "big" section * fix imrope check * remove trailing whitespace * fix rope fail * metal : add imrope support * add imrope support for sycl * vulkan: add imrope w/o check * fix vulkan * webgpu: add imrope w/o check * Update gguf-py/gguf/tensor_mapping.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * fix tensor mapping --------- Co-authored-by: Thireus ☠ <Thireus@users.noreply.github.com> Co-authored-by: yairpatch <yairpatch@users.noreply.github.com> Co-authored-by: LETS-BEE <LETS-BEE@users.noreply.github.com> Co-authored-by: Xuan Son Nguyen <son@huggingface.co> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> b6887	2025-10-30 16:19:14 +01:00
Max Krasnyansky	dcca0d3ab8	cpu: introduce chunking for flash attention (#16829 ) Factor out the core FA loop into flash_atten_f16_one_chunk and add an outer loop on top that handles the chunks. b6886	2025-10-30 14:26:05 +02:00


6935 Commits