llama.cpp

mirror of https://github.com/ggml-org/llama.cpp.git synced 2025-11-08 10:07:01 +00:00

Author	SHA1	Message	Date
Aleksander Grygier	e7da30b584	fix: Viewing multiple PDF attachments (#16974 )	2025-11-03 18:53:26 +01:00
Daniel Bevenius	ed8aa63320	model-conversion : pass config to from_pretrained (#16963 ) This commit modifies the script `run-org-model.py` to ensure that the model configuration is explicitly passed to the `from_pretrained` method when loading the model. It also removes a duplicate configuration loading which was a mistake. The motivation for this change is that enables the config object to be modified and then passed to the model loading function, which can be useful when testing new models.	2025-11-03 18:01:59 +01:00
Georgi Gerganov	48bd26501b	server : add props.model_alias (#16943 ) * server : add props.model_alias * webui : npm run format b6937	2025-11-03 14:38:23 +01:00
theo77186	622cd010ff	ggml: CUDA: add head size 72 for flash-attn (#16962 ) b6936	2025-11-03 14:29:11 +01:00
Xuan-Son Nguyen	070ff4d535	mtmd: add --image-min/max-tokens (#16921 ) b6935	2025-11-03 11:11:18 +01:00
Xuan-Son Nguyen	bf7b0c9725	mtmd: pad mask for qwen2.5vl (#16954 ) * mtmd: pad mask for qwen2.5vl * improve b6934	2025-11-03 10:25:55 +01:00
Jinyang He	fcfce040e8	ggml : LoongArch fixes (#16958 ) * Fix test-quantize-fns f16 and q4_0 failed when use LSX * Fix LoongArch set float intrinsic when use LSX/LASX b6933	2025-11-03 08:40:02 +02:00
Olivier Chafik	ee3a5a10ad	sync: minja (glm 4.6 & minmax m2 templates) (#16949 ) * sync: minja * Sync https://github.com/ochafik/minja/pull/7 (MinMax M2) b6932	2025-11-03 07:33:56 +02:00
shani-f	7e994168b1	SYCL: optimized repeat_back kernel (3× fewer asm instructions, 2× faster)Feature/sycl repeat back opt (#16869 ) * SYCL repeat_back v1 — add core op + switch case * Implement repeat_back SYCL operation and minor fixes * SYCL: optimize repeat_back kernel * Remove Hebrew comment from repeat_back.cpp * Remove comments for code clarity Removed comments to clean up the code. * Fix formatting in ggml-sycl.cpp * Formatted lambda according to legacy style. No logic changes * Remove blank line in repeat_back.cpp Remove unnecessary blank line before assigning acc to dst_dd. b6931	2025-11-03 09:35:33 +08:00
Sascha Rogmann	bcfa87622a	feat(webui): improve LaTeX rendering with currency detection (#16508 ) * webui : Revised LaTeX formula recognition * webui : Further examples containg amounts * webui : vitest for maskInlineLaTeX * webui: Moved preprocessLaTeX to lib/utils * webui: LaTeX in table-cells * chore: update webui build output (use theirs) * webui: backslash in LaTeX-preprocessing * chore: update webui build output * webui: look-behind backslash-check * chore: update webui build output * Apply suggestions from code review Code maintenance (variable names, code formatting, string handling) Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * webui: Moved constants to lib/constants. * webui: package woff2 inside base64 data * webui: LaTeX-line-break in display formula * chore: update webui build output * webui: Bugfix (font embedding) * webui: Bugfix (font embedding) * webui: vite embeds assets * webui: don't suppress 404 (fonts) * refactor: KaTeX integration with SCSS Moves KaTeX styling to SCSS for better customization and font embedding. This change includes: - Adding `sass` as a dev dependency. - Introducing a custom SCSS file to override KaTeX variables and disable TTF/WOFF fonts, relying solely on WOFF2 for embedding. - Adjusting the Vite configuration to resolve `katex-fonts` alias and inject SCSS variables. * fix: LaTeX processing within blockquotes * webui: update webui build output --------- Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>	2025-11-03 00:41:08 +01:00
Shagun Bera	a2054e3a8f	test-backend-ops : fix segfault in moe-expert-reduce test in support mode and coverage (#16936 ) * tests: fix segfault in moe-expert-reduce test in support mode and --show-coverage * tests: init gf and filter out fusion tests for support mode * tests: filter out fusion cases before calling eval_support * tests: filter out fusion cases from show_test_coverage as well, fix lint b6929	2025-11-03 00:10:30 +01:00
Sigbjørn Skjæret	dd52868050	ci : disable failing riscv cross build (#16952 )	2025-11-02 23:11:21 +01:00
Zhiyong Wang	6b9a52422b	model: add Janus Pro for image understanding (#16906 ) * Add support for Janus Pro * Update gguf-py/gguf/tensor_mapping.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update gguf-py/gguf/tensor_mapping.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Address reviewer suggestions Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Add JANUS_PRO constant * Update clip model handling Co-authored-by: Xuan-Son Nguyen <son@huggingface.co> * Update tools/mtmd/clip.cpp Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com> * Refactor JANUS_PRO handling in clip.cpp Co-authored-by: Xuan-Son Nguyen <son@huggingface.co> * Update tools/mtmd/clip.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * em whitespace --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> Co-authored-by: Xuan-Son Nguyen <son@huggingface.co> Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com> b6927	2025-11-02 22:08:04 +01:00
Georgi Gerganov	2f966b8ed8	clip : use FA (#16837 ) * clip : use FA * cont : add warning about unsupported ops * implement "auto" mode for clip flash attn * clip : print more detailed op support info during warmup * cont : remove obsolete comment [no ci] * improve debugging message * trailing space * metal : remove stray return --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co>	2025-11-02 21:21:48 +01:00
Georgi Gerganov	cd5e3b5754	server : support unified cache across slots (#16736 ) * server : support unified context across slots * cont : fix speculative decoding initialization * context : fix n_ctx_per_seq computation * server : purge slots one by one * tests : add unified cache server tests * llama : update per-seq context computation * test-thread-safety : handle tiny training context of the input model * server : fix server_tokens clear() * server : use 4 slots + unified KV by default * llama : add note about context size queries * cont : update todos [no ci] * context : do not cap the size of the context * tests : adjust parameters to be CI friendlier * context : add warning	2025-11-02 18:14:04 +02:00
Aldehir Rojas	87c9efc3b2	common : move gpt-oss reasoning processing to init params (#16937 ) b6924	2025-11-02 16:56:28 +02:00
Adrian Lundberg	76af40aaaa	docs: remove llama_sampler_accept reference in sampling sample usage (#16920 ) commit `5fb5e24811` (llama : minor sampling refactor (2) (#9386)) moved the llama_sampler_accept call into llama_sampler_sample, but the sampling sample usage in llama.h was forgotten to be updated accordingly. b6923	2025-11-02 11:28:37 +02:00
mnehete32	7db35a7958	CUDA: add FLOOR, CEIL, ROUND, TRUNC unary ops (#16917 ) b6922	2025-11-02 11:12:57 +08:00
Aaron Teo	a864132ba5	devops: fix failing s390x docker build (#16918 )	2025-11-02 08:48:46 +08:00
Aaron Teo	d38d9f0877	ggml: add s390x cpu-feats (#16774 ) b6920	2025-11-02 08:48:23 +08:00
Georgi Gerganov	7fd205a8e8	scripts : add script to bench models (#16894 ) b6919	2025-11-02 00:15:31 +02:00
Pascal	2f68ce7cfd	webui: auto-refresh /props on inference start to resync model metadata (#16784 ) * webui: auto-refresh /props on inference start to resync model metadata - Add no-cache headers to /props and /slots - Throttle slot checks to 30s - Prevent concurrent fetches with promise guard - Trigger refresh from chat streaming for legacy and ModelSelector - Show dynamic serverWarning when using cached data * fix: restore proper legacy behavior in webui by using unified /props refresh Updated assistant message bubbles to show each message's stored model when available, falling back to the current server model only when the per-message value is missing When the model selector is disabled, now fetches /props and prioritizes that model name over chunk metadata, then persists it with the streamed message so legacy mode properly reflects the backend configuration * fix: detect first valid SSE chunk and refresh server props once * fix: removed the slots availability throttle constant and state * webui: purge ai-generated cruft * chore: update webui static build	2025-11-01 19:49:51 +01:00
Pascal	e4a71599e5	webui: add HTML/JS preview support to MarkdownContent with sandboxed iframe (#16757 ) * webui: add HTML/JS preview support to MarkdownContent with sandboxed iframe dialog Extended MarkdownContent to flag previewable code languages, add a preview button alongside copy controls, manage preview dialog state, and share styling for the new button group Introduced CodePreviewDialog.svelte, a sandboxed iframe modal for rendering HTML/JS previews with consistent dialog controls * webui: fullscreen HTML preview dialog using bits-ui * Update tools/server/webui/src/lib/components/app/misc/CodePreviewDialog.svelte Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * Update tools/server/webui/src/lib/components/app/misc/MarkdownContent.svelte Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * webui: pedantic style tweak for CodePreviewDialog close button * webui: remove overengineered preview language logic * chore: update webui static build --------- Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>	2025-11-01 17:14:54 +01:00
Adrien Gallouët	dd5e8cab51	vendor : update cpp-httplib to 0.27.0 (#16846 ) Signed-off-by: Adrien Gallouët <angt@huggingface.co> b6916	2025-11-01 16:52:17 +01:00
Xuan-Son Nguyen	cf659bbb8e	mtmd: refactor preprocessing + support max/min pixels (#16878 ) * mtmd: refactor preprocessing + support max/min pixels * fix mlp type * implement mix/max pixels * improve hparams * better image preproc for qwen * fix * fix out of bound composite * fix (2) * fix token calculation * get_merge_kernel_size() * fix llama4 and lfm2 * gonna fix them all * use simple resize for qwen * qwen: increase min tokens * no resize if dst size == src size * restore to initial min/max tokens value for qwen b6915	2025-11-01 15:51:36 +01:00
Aleksander Grygier	d8b860a219	Add a setting to display message generation statistics (#16901 ) * feat: Add setting to display message generation statistics * chore: build static webui output	2025-11-01 15:35:57 +01:00
Jaromír Hradílek	1ae74882f8	webui: recognize AsciiDoc files as valid text files (#16850 ) * webui: recognize AsciiDoc files as valid text files * webui: add an updated static webui build * webui: add the updated dependency list * webui: re-add an updated static webui build This also reverts commit `742dbb8379`.	2025-11-01 15:02:57 +01:00
Sigbjørn Skjæret	961660b8c3	common : allow --system-prompt-file for diffusion-cli (#16903 ) b6912	2025-11-01 11:01:42 +01:00
Sigbjørn Skjæret	74fef4129f	codeowners : update after refactor (#16905 )	2025-11-01 09:55:25 +02:00
Jeff Bolz	5d8bb900bc	vulkan: Fix multi_add invalid descriptor usage (#16899 ) b6910	2025-11-01 06:52:14 +01:00
Jeff Bolz	2e76e01360	vulkan: fuse mul_mat+add and mul_mat_id+add_id (#16868 ) * vulkan: fuse mul_mat+add and mul_mat_id+add_id The fusion is only applied for the mat-vec mul paths. * Apply suggestions from code review Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * fix 32b build --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> b6909	2025-11-01 06:45:28 +01:00
Oliver Simons	d3dc9dd898	CUDA: Remove unneded bias/gate dims in fused mmvq (#16858 ) * CUDA: Remove unneded bias/gate dims in fused mmvq Pointed out [here](https://github.com/ggml-org/llama.cpp/pull/16847#discussion_r2476798989) that only a single value is needed per target col per thread * Apply suggestions from code review Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Fix "Error 991-D: extra braces are nonstandard" during compilation --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de> b6908	2025-11-01 13:13:26 +08:00
Piotr Wilkin (ilintar)	bea04522ff	refactor : llama-model.cpp (#16252 ) * Sqashed: llama-model.cpp refactoring * Fix formatting of attn / ffn / ffn_moe calls * Fix import regression / unify spacing in models.h * totally DID NOT miss those! * Add missing qwen3vl(moe) models * Add missing new .cpp files to build * Remove extra semicolons * Editor checker * Update src/models/models.h Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> b6907	2025-10-31 23:40:23 +01:00
Piotr Wilkin (ilintar)	0de0a01576	model : Minimax M2 (#16831 ) * Model: Minimax M2 * Cleanup * Cleanup pt. 2 * Cleanup pt. 3 * Update convert_hf_to_gguf_update.py - merge catch blocks Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Remove vocab models and test * Remove all redundant hparam settings covered by TextModel * Move super to start, don't set block_count * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update gguf-py/gguf/constants.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> b6906	2025-10-31 21:20:47 +01:00
Giuseppe Scrivano	e58d585604	model : add Granite Hybrid nano types (#16896 ) Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com> b6905	2025-10-31 21:20:07 +01:00
Johannes Gäßler	31c511a968	CUDA: Volta tensor core support for MMF (#16843 ) * CUDA: Volta tensor core support for MMF * more generic checks for hardware support * Update ggml/src/ggml-cuda/mmf.cuh Co-authored-by: Aman Gupta <amangupta052@gmail.com> --------- Co-authored-by: Aman Gupta <amangupta052@gmail.com> b6904	2025-10-31 15:57:19 +01:00
Georgi Gerganov	6d39015a74	sync : ggml	2025-10-31 16:26:28 +02:00
Aman Gupta	4146d6a1a6	CUDA: add expert reduce kernel (#16857 ) * CUDA: add expert reduce kernel * contigous checks, better formatting, use std::vector instead of array * use vector empty instead of size Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-10-31 20:05:07 +08:00
Georgi Gerganov	8da3c0e200	batch : fix consistency checks for the input positions (#16890 ) b6901	2025-10-31 13:50:33 +02:00
Georgi Gerganov	c22473b580	server : don't print user inputs to console (#16871 ) b6900	2025-10-31 10:54:19 +02:00
Daniel Bevenius	0f715b4e75	server : fix typos in server.cpp comments [no ci] (#16883 )	2025-10-31 09:51:26 +01:00
Jeff Bolz	d2d931f173	vulkan: disable spirv-opt for rope shaders (#16872 ) b6898	2025-10-31 08:34:47 +01:00
Masato Nakasaka	2976b0374d	vulkan: Fix crash when FP16 mul_mat accumulation is not supported (#16796 ) * Experimenting crash fix * added assert for aborting and fixed comment * changed to check if a pipeline is empty or not * Moved function in class definition * replaced with is_empty * Modified is_empty to check only unaligned pipelines b6897	2025-10-31 08:18:59 +01:00
Ruben Ortlam	d2a2673dd1	vulkan: fix shmem overrun in mmq id shader (#16873 ) * vulkan: fix shmem overrun in mmq id shader * metal : fix mul_mm_id --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> b6896	2025-10-31 08:14:49 +01:00
l3utterfly	13002a0896	ggml-hexagon: respect input size when getting/setting tensor data (#16836 ) * respect input size when getting/setting tensor data allows partial repacking/copying when get tensor size is smaller than the actual tensor * Removed duplicate repack_mxfp4_mxfp4x4x2 function b6895	2025-10-30 21:46:31 -07:00
Sigbjørn Skjæret	6eb208d17e	ci : enable free-disk-space on cuda docker build (#16877 ) b6894	2025-10-31 00:34:27 +01:00
lhez	9984cbb61d	opencl: fix boundary handling for mul_mm (#16875 )	2025-10-30 16:00:20 -07:00
RodriMora	ce18efeaf1	convert : update transformers requirements (#16866 ) * Update requirements-convert_legacy_llama.txt Updated requirements to support Qwen3-VL in transformers 4.57.1 version * Update requirements/requirements-convert_legacy_llama.txt Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-10-30 23:15:03 +01:00
chansikpark	16724b5b68	server : bump request URI max length to 32768 (#16862 ) b6891	2025-10-30 20:22:23 +02:00
Georgi Gerganov	b52edd2558	server : remove n_past (#16818 ) * server : remove n_past * server : replace slot.n_prompt_tokens() with slot.task->n_tokens() * server : fixes + clean-up * cont : fix context shift * server : add server_tokens::pos_next() Co-authored-by: Xuan-Son Nguyen <son@huggingface.co> * server : fix pos_next() usage Co-authored-by: Xuan-Son Nguyen <son@huggingface.co> --------- Co-authored-by: Xuan-Son Nguyen <son@huggingface.co> b6890	2025-10-30 18:42:57 +02:00

1 2 3 4 5 ...

6939 Commits