llama.cpp

mirror of https://github.com/ggml-org/llama.cpp.git synced 2025-10-27 08:21:30 +00:00

Author	SHA1	Message	Date
Jeff Bolz	e78cf0d4b1	vulkan: workaround MoltenVK compile failure in multi_add (#15506 ) * vulkan: workaround MoltenVK compile failure in multi_add * Update ggml/src/ggml-vulkan/vulkan-shaders/multi_add.comp Co-authored-by: 0cc4m <picard12@live.de>	2025-08-24 10:48:21 +02:00
Johannes Gäßler	710dfc465a	CUDA: fix half2 -> half conversion for HIP (#15529 )	2025-08-23 21:37:06 +02:00
Jeff Bolz	611f419cff	vulkan: optimize rms_norm, and allow the work to spread across multiple SMs (#15281 ) * vulkan: optimize rms_norm, and allow the work to spread across multiple SMs There are really two parts to this change: (1) Some optimizations similar to what we have in soft_max, to unroll with different numbers of iterations. (2) A fusion optimization where we detect add followed by rms_norm, and make the add shader atomically accumulate the values^2 into memory. Then the rms_norm shader can just load that sum. This allows the rms_norm to be parallelized across multiple workgroups, it just becomes a simple per-element multiply. The fusion optimization is currently only applied when the rms_norm is on a single vector. This previously always ran on a single SM. It could apply more broadly, but when there are other dimensions the work can already spread across SMs, and there would be some complexity to tracking multiple atomic sums. * Change add+rms_norm optimization to write out an array of partial sums rather than using atomic add, to make it deterministic. The rms_norm shader fetches a subgroup's worth in parallel and uses subgroupAdd to add them up. * complete rebase against fused adds - multi_add shader can also compute partial sums * fix validation errors * disable add_rms_fusion for Intel due to possible driver bug * resolve against #15489, sync after clearing partial sums b6258	2025-08-23 13:16:17 -05:00
Piotr Wilkin (ilintar)	b1afcab804	model : add support for Seed-OSS (#15490 ) * First draft * Fix linter errors * Added missing sinks nullptr * Don't forget the llama-arch! * We're through to the generation stage. * Fix post-attention norm * Apply suggestions from code review Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Fix RoPE type * Fix tensor name and reorder llm_types * Update gguf-py/gguf/constants.py Remove nonexistent FFN_POST_NORM tensor Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-model.h Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Add basic chat template * Add chat template tests * Remake chat template test * Apply suggestions from code review Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-chat.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Reorder llm type descriptions * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> b6257	2025-08-23 15:21:52 +02:00
Johannes Gäßler	9ef536907d	scripts: fix compare-llama-bench.py (#15521 )	2025-08-23 13:58:58 +03:00
LaffeyNyaa	21dc4ddaf2	chat : fix debug build assertion in trim function (#15520 ) b6255	2025-08-23 10:38:30 +02:00
Jeff Bolz	289bf4113e	vulkan: Rewrite synchronization to allow some overlap between nodes (#15489 ) Track a list of nodes that need synchronization, and only sync if the new node depends on them (or overwrites them). This allows some overlap which can improve performance, and centralizes a big chunk of the synchronization logic. The remaining synchronization logic involves writes to memory other than the nodes, e.g. for dequantization or split_k. Each of these allocations has a bool indicating whether they were in use and need to be synced. This should be checked before they are written to, and set to true after they are done being consumed. b6254	2025-08-23 09:33:36 +02:00
R0CKSTAR	b55f06e1aa	vulkan.Dockerfile: install vulkan SDK using tarball (#15282 ) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2025-08-23 08:58:57 +02:00
Acly	0a9b43e507	vulkan : support ggml_mean (#15393 ) * vulkan : support ggml_mean * vulkan : support sum, sum_rows and mean with non-contiguous tensors * vulkan : fix subbuffer size not accounting for misalign offset * tests : add backend-op tests for non-contiguous sum_rows * cuda : require contiguous src for SUM_ROWS, MEAN support * sycl : require contiguous src for SUM, SUM_ROWS, ARGSORT support * require ggml_contiguous_rows in supports_op and expect nb00=1 in the shader	2025-08-23 08:35:21 +02:00
Jeff Bolz	330c3d2d21	vulkan: optimize mul_mat_id loading row ids into shared memory (#15427 ) - Spread the work across the whole workgroup. Using more threads seems to far outweigh the synchronization overhead. - Specialize the code for when the division is by a power of two. b6251	2025-08-23 08:31:54 +02:00
Johannes Gäßler	e92734d51b	test-opt: allow slight inprecision (#15503 ) b6250	2025-08-22 23:47:01 +02:00
Reese Levine	45363632cb	ggml WebGPU: add support for quantization types (#15440 ) * Begin work on set_rows * Work on set rows * Add error buffers for reporting unsupported SET_ROWS indices * Remove extra comments * Work on templating for different types in shaders * Work on shader type generation * Working q4_0 mul_mat and some templating for different types * Add q4_0_f16 matmul and fix device init * Add matmul support for basic quantization types * Add q2_k and q3_k quantization * Add rest of k-quants * Get firt i-quant working * Closer to supporting all i-quants * Support rest of i-quants * Cleanup code * Fix python formatting * debug * Bugfix for memset * Add padding to end of buffers on creation * Simplify bit-shifting * Update usage of StringView b6249	2025-08-22 11:28:03 -07:00
Aldehir Rojas	32732f2459	model : gpt-oss add response_format support (#15494 ) b6248	2025-08-22 11:04:08 -05:00
rmatif	92f7f0a53c	ggml: add `conv3d` op (#15182 ) * add conv3d * bump GGML_OP_COUNT b6247	2025-08-22 15:33:15 +02:00
Yavor Ivanov	b1ab91821f	cuda : add Pad Reflect 1D support (#14659 ) * Add Pad Reflect 1D CUDA support * Update ggml/src/ggml-cuda/pad_reflect_1d.cu Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de> b6246	2025-08-22 13:06:29 +02:00
Georgi Gerganov	9ebebef62f	llama : remove KV cache defragmentation logic (#15473 ) ggml-ci b6245	2025-08-22 12:22:13 +03:00
Aaron Teo	ad5c975c2d	ggml-cpu: Support Q5_0 and Q5_1 on s390x (#15486 ) * ggml-cpu: initial q5_0 impl for s390x Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: updated q5_0 code for better performance Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: use optimised hsum for better performance Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: introduce q5_1 simd + refactor q5_0 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: fix incorrect return type vec_hsum Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: q5_0 incomplete refactor + table_b2b_0 activation Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: refactor q5_1 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: q5_1 update loop unroll to 4 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: update q5_0 unroll to 4 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: update build-s390x docs Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: update unused variables q5_0 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * docs: update the last update date Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> --------- Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> b6244	2025-08-22 16:11:04 +08:00
65a	4afb0a746f	server : Support multimodal completion and embeddings prompts in JSON format (#15108 ) - Use server_tokens in more places in server and util.cpp - Convert most functions that used llama_tokens to server_tokens - Modify input tokenizer to handle JSON objects as subprompts - Break out MTMD prompt parsing into utility function - Support JSON objects with multimodal_data arrays for MTMD prompts along with other existing types - Add capability to model endpoint to indicate if client can send multimodal data - Add tests. b6243	2025-08-22 10:10:14 +02:00
Tarek Dakhran	e288693669	readme : model : mtdm : lfm2 improvements (#15476 ) * Support untied embeddings * Increase number of image tokens to 1024 * Add LFM2-VL to readme * Actually use untied embeddings b6242	2025-08-22 09:29:08 +02:00
Chenguang Li	a0f98dd604	CANN: Optimize RMS_NORM using cache (#15419 ) * [CANN] Optimize RMS_NORM using cache Signed-off-by: noemotiovon <757486878@qq.com> * fix typo Signed-off-by: noemotiovon <757486878@qq.com> * fix review comment Signed-off-by: noemotiovon <757486878@qq.com> * codestyle adjustment Signed-off-by: noemotiovon <757486878@qq.com> --------- Signed-off-by: noemotiovon <757486878@qq.com> b6241	2025-08-22 14:12:07 +08:00
Diego Devesa	54a241f505	sched : fix possible use of wrong ids tensor when offloading moe prompt processing (#15488 ) b6240	2025-08-21 23:09:32 +02:00
Georgi Gerganov	cd36b5e5c7	llama : remove deprecated llama_kv_self API (#15472 ) ggml-ci b6239	2025-08-21 19:13:45 +03:00
Georgi Gerganov	3f196be84b	graph : remove build_attn_with_sinks overload (#15469 ) ggml-ci b6238	2025-08-21 18:44:45 +03:00
Acly	97ae5961a4	vulkan : support conv_2d_dw with f16 weights (#15392 ) b6237	2025-08-21 17:01:51 +02:00
Dong Won Kim	20c2dac8c6	vulkan: add exp operation (#15456 ) Co-authored-by: aeseulgi <kim2h7903@gmail.com> b6236	2025-08-21 17:00:16 +02:00
Jeff Bolz	96452a3fa4	vulkan: Reuse conversion results in prealloc_y (#15410 ) * vulkan: Reuse conversion results in prealloc_y Cache the pipeline and tensor that were most recently used to fill prealloc_y, and skip the conversion if the current pipeline/tensor match. * don't use shared pointer for prealloc_y_last_pipeline_used b6235	2025-08-21 16:55:00 +02:00
Jie Fu (傅杰)	9ad5e60dba	examples : fix some typos in examples/model-conversion/README.md (#15477 ) Signed-off-by: Jie Fu <jiefu@tencent.com>	2025-08-21 16:53:13 +02:00
Georgi Gerganov	715a6db02c	kv-cache : drop the "unified" prefix (#15467 ) * kv-cache : drop the "unified" prefix ggml-ci * cont : fix comment [no ci]	2025-08-21 17:00:33 +03:00
Jie Fu (傅杰)	ad294df03f	examples : install torch-cpu for model conversion tool/example (#15475 ) Signed-off-by: Jie Fu <jiefu@tencent.com>	2025-08-21 15:42:34 +02:00
Ali Tariq	029bb39eb1	ci : enable RVV1.0 native build (#15386 ) * Changed the CI file to hw * Changed the CI file to hw * Added to sudoers for apt * Removed the clone command and used checkout * Added libcurl * Added gcc-14 * Checking gcc --version * added gcc-14 symlink * added CC and C++ variables * Added the gguf weight * Changed the weights path * Added system specification * Removed white spaces * ci: Replace Jenkins riscv native build Cloud-V pipeline with GitHub Actions workflow Removed the legacy .devops/cloud-v-pipeline Jenkins CI configuration and introduced .github/workflows/build-riscv-native.yml for native RISC-V builds using GitHub Actions. * removed trailing whitespaces * Added the trigger at PR creation * Corrected OS name * Added ccache as setup package * Added ccache for self-hosted runner * Added directory for ccache size storage Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Changed the build command and added ccache debug log * Added the base dir for the ccache * Re-trigger CI * Cleanup and refactored ccache steps * Cleanup and refactored ccache steps --------- Co-authored-by: Akif Ejaz <akifejaz40@gmail.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-08-21 14:52:16 +02:00
Georgi Gerganov	30649cab65	ci : continue file download with wget (#15471 ) ggml-ci	2025-08-21 13:42:55 +03:00
Daniel Bevenius	2758fa10da	examples : add model conversion tool/example (#15455 ) * examples : add model conversion tool/example This commit adds an "example/tool" that is intended to help in the process of converting models to GGUF. Currently it supports normal causal models and embedding models. The readme contains instructions and command to guide through the process. The motivation for this to have a structured and repeatable process for model conversions and hopefully with time improve upon it to make the process easier and more reliable. We have started to use this for new model conversions internally and will continue doing so and improve it as we go along. Perhaps with time this should be placed in a different directory than the examples directory, but for now it seems like a good place to keep it while we are still developing it. * squash! examples : add model conversion tool/example Remove dependency on scikit-learn in model conversion example. * squash! examples : add model conversion tool/example Update transformer dep to use non-dev version. And also import `AutoModelForCausalLM` instead of `AutoModel` to ensure compatibility with the latest version. * squash! examples : add model conversion tool/example Remove the logits requirements file from the all requirements file. b6229	2025-08-21 12:16:54 +02:00
Michael Giba	b108e42904	ci : fix -Werror=return-type in clip.cpp so ci/run.sh can run without issue (#15221 ) * Fix -Werror=return-type so ci/run.sh can run * Update tools/mtmd/clip.cpp Co-authored-by: Diego Devesa <slarengh@gmail.com> * Remove false now that we have abort --------- Co-authored-by: Diego Devesa <slarengh@gmail.com> b6228	2025-08-21 12:06:46 +02:00
Copilot	245be739df	ci : add copilot-instructions.md (#15286 ) * Initial plan * Initialize copilot instructions exploration * Add comprehensive .github/copilot-instructions.md file * Update Python environment and tools directory documentation - Add instructions for using .venv Python environment - Include flake8 and pyright linting tools from virtual environment - Add tools/ as core directory in project layout - Reference existing configuration files (.flake8, pyrightconfig.json) * add more python dependencies to .venv * Update copilot instructions: add backend hardware note and server testing * Apply suggestions from code review * Apply suggestions from code review * Replace clang-format with git clang-format to format only changed code * Minor formatting improvements: remove extra blank line and add trailing newline * try installing git-clang-format * try just clang-format * Remove --binary flag from git clang-format and add git-clang-format installation to CI * download 18.x release * typo-- * remove --binary flag --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-08-21 11:47:52 +02:00
Julien Denize	b2caf67db1	convert : make Mistral community chat templates optional via parameter (#15420 ) * Make Mistral community chat templates optional * Change the flag arg to disable instead of enable community chat templates * Improve error message * Improve help message * Tone down the logger messages	2025-08-21 11:19:50 +02:00
Jie Fu (傅杰)	2f3dbffb17	common : fix incorrect print of non-ascii characters in the logging (#15466 ) Signed-off-by: Jie Fu <jiefu@tencent.com> b6225	2025-08-21 11:54:34 +03:00
Xuan-Son Nguyen	945e1f12a6	ggml : fix condition of im2col on Metal backend (#15460 )	2025-08-21 08:32:26 +03:00
stduhpf	1b0db8f6e0	server : fix webui (#15462 ) * Fix webui crash after streaming * build webui	2025-08-21 08:19:22 +03:00
Daniel Bevenius	29f538ac63	examples : remove references to `make` in examples [no ci] (#15457 ) This commit removes references to `make` in the examples, as the build system has been updated to use CMake directly and using `make` will now generate an error since Commit `37f10f955f` ("make : remove make in favor of CMake (#15449)").	2025-08-21 06:12:28 +02:00
R0CKSTAR	8ad038c0fd	musa: add GGML_UNUSED_VARS (#15446 ) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2025-08-21 11:06:05 +08:00
Diego Devesa	5682a3745f	sched : copy only the used experts when offloading prompt processing (#15346 )	2025-08-21 01:35:28 +02:00
teo	1bc664a26a	server: fix OpenAI API compatibility for usage statistics in chat streams (#15444 )	2025-08-21 00:10:08 +02:00
Johannes Gäßler	13aeb7aef2	CUDA: refactor FA support/selection code (#15454 ) b6218	2025-08-20 23:14:14 +02:00
Johannes Gäßler	7a6e91ad26	CUDA: replace GGML_CUDA_F16 with CUDA arch checks (#15433 )	2025-08-20 16:58:49 +02:00
Jeff Bolz	fec9519802	vulkan: shorten pipeline name strings (#15431 ) These detailed strings were causing increased build time on gcc.	2025-08-20 16:33:14 +02:00
Daniel Bevenius	657b8a77bd	chat: handle gpt-oss return/end token inconsistency (#15421 ) This commit addresses an inconsistency during inference by adding a new member to the `templates_params` struct to indicate whether the chat is in inference mode. This allows the gpt-oss specific function `common_chat_params_init_gpt_oss` to check this flag and the `add_generation_prompt` flag to determine if it should replace the `<\|return\|>` token with the `<\|end\|>` token in the prompt. The motivation for this change is to ensure that the formatted prompt of past messages in `common_chat_format_single` matches the output of the formatted new message. The issue is that the gpt-oss template returns different end tags: `<\|return\|>` when `add_generation_prompt` is false, and `<\|end\|>` when `add_generation_prompt` is true. This causes the substring function to start at an incorrect position, resulting in tokenization starting with 'tart\|>' instead of '<\|start\|>'. Resolves: https://github.com/ggml-org/llama.cpp/issues/15417 b6215	2025-08-20 14:26:01 +02:00
Jie Fu (傅杰)	ec5ab1a36c	common : fix context shift help message (#15448 ) Signed-off-by: Jie Fu <jiefu@tencent.com> b6214	2025-08-20 13:33:30 +03:00
xiaobing318	1a99c2d948	cmake : fix target include directories (#15450 ) * Update docker.yml 修改docker.yml文件中的内容使其停止周期性的运行该workflow，如果想要运行该workflow可以手动启动 * feat:Modify the header file include path 1. There's no llava directory in the tools directory. 2. Because the command `target_include_directories(mtmd PUBLIC .)` is used in the `mtmd` CMakeLists.txt file, other targets that link against `mtmd` automatically include the `mtmd` directory as a search path for header files. Therefore, you can remove `target_include_directories(${TARGET} PRIVATE ../llava`` or use `target_include_directories(${TARGET} PRIVATE ../mtmd`` to explicitly require the `llama-server` target to use header files from `mtmd`. * Restore the docker.yml file b6213	2025-08-20 13:32:05 +03:00
Daniel Bevenius	37f10f955f	make : remove make in favor of CMake (#15449 ) This commit removes the content from the Makefile and updates the current deprecation message to information that `make` has been replaced by CMake instead. The message when `make` is invoked will now be the following: ```console $ make Makefile:6: *** Build system changed: The Makefile build has been replaced by CMake. For build instructions see: https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md . Stop. ``` The motivation for this is that many, if not all targets fail to build now, after changes to the system, and `make` has also been deprected for some time now.	2025-08-20 13:31:16 +03:00
Georgi Gerganov	2f37014073	lookahead : add sample command to readme (#15447 ) * lookahead : add sample command to readme * cont : build-agnostic command	2025-08-20 13:30:46 +03:00

1 2 3 4 5 ...

6260 Commits