llama.cpp

mirror of https://github.com/ggml-org/llama.cpp.git synced 2025-10-27 08:21:30 +00:00

Author	SHA1	Message	Date
Aaron Teo	b05a9d650f	vendors: update miniaudio version (#16212 ) * vendor: update miniaudio.h Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * vendor: update miniaudio.h Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> --------- Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> b6585	2025-09-25 23:38:10 +08:00
rtaluyev	27052978e4	readme : update bindings (#16144 ) Link to Java JNA bindings to llama.cpp native libraries	2025-09-25 18:20:34 +03:00
Aman Gupta	077c94d0ca	CUDA: add a fused top-K MoE kernel (#16130 ) * CUDA: add a fused top-K MoE kernel This kernel does the following: 1. softmax over the logits per token [n_experts, n_tokens] 2. argmax reduce over the top-k (n_experts_used) logits 3. write weights + ids to global memory It is intended as fusion of softmax->top-k->get_rows pipeline for MoE models * Refactor into ggml_cuda_should_use_topk_moe * Review: Use better coalescing pattern, use WARP_SIZE, store logits into registers before * Review: format + micro-optimizations * Fix bug: fix tie breakers * Add optional norm + clean-up code * Use smem for final write * Add bounds check * Use better memory pattern for writeback b6583	2025-09-25 16:35:05 +02:00
Daniel Bevenius	aa3ee0eb0b	model-conversion : add embedding prompt file support (#15871 ) This commit adds support for passing a prompt file to the model conversion targets/scripts. It also updates the logits.cpp to print out embedding information in the same format as when running the original embedding model. The motivation for this is that it allows us to pass files of different sizes when running the converted models and validating the logits. This can be particularly important when testing the sliding window functionality of models where the sequence length needs to exceed a certain number of tokens to trigger the sliding window logic. b6582	2025-09-25 12:02:36 +02:00
Daniel Bevenius	d0991da39d	server : add support for external server for tests (#16243 ) This commit adds support for using an externally started llama-server instance for the server tests. This can be enabled by setting the DEBUG_EXTERNAL environment variable. The motivation for this is to allow debugging of the server itself when investigating a test failure. Instructions for how to do this are added to the README.md file in the tests directory.	2025-09-25 11:36:47 +02:00
junchao-zhao	aa719c2f88	ggml : fix loongarch lsx compilation error (#15864 ) b6580	2025-09-25 12:22:55 +03:00
Johannes Gäßler	4cdd0bb453	docs: fix typo [no ci] (#16244 )	2025-09-25 12:12:27 +03:00
Douglas Hanley	b5bd037832	llama : add support for qwen3 reranker (#15824 ) b6578	2025-09-25 11:53:09 +03:00
Georgi Gerganov	dfcd53f7ec	metal : fuse NORM + MUL + ADD, support non-multiples of 4 (#16220 ) * metal : fuse NORM + MUL + ADD * metal : support norms of non-multiple of 4 * cont : fix comment [no ci]	2025-09-25 11:30:16 +03:00
Georgi Gerganov	4ea00794b8	metal : relax reorder conditions (#16216 ) b6576	2025-09-25 11:29:42 +03:00
Georgi Gerganov	02a6a82ae7	metal : restore im2col perf (#16219 ) b6575	2025-09-25 11:29:08 +03:00
Radoslav Gerganov	c498fc82fe	rpc : use ggml logging facilities Use RPC_DEBUG environment variable to enable debug messages. Add helper macro LOG_DBG() which does an early check of the env var before calling GGML_LOG_DEBUG(). Make sure we log a debug message for every server function. b6574	2025-09-25 07:20:02 +00:00
Aaron Teo	e7a5130a20	codeowners: add ownership of zdnn backend [no ci] (#16232 ) add @Andreas-Krebbel to owners of zDNN backend Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>	2025-09-25 08:06:30 +03:00
Eve	bee378e098	ci: run the x64 and arm ci on the github machines instead (#16183 ) * run the x64 ci on regular machines * set up the same thing for arm fix test-quantize-perf just like #12306 * try to disable sve * add another sve run b6572	2025-09-25 08:06:06 +03:00
Aaron Teo	5fb557653b	devops: fix s390x docker release failure (#16231 )	2025-09-25 11:36:30 +08:00
Aaron Teo	4ae88d07d0	codeowners: add ownership of zdnn backend [no ci] (#16229 ) add @AlekseiNikiforovIBM to owners of zDNN backend Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>	2025-09-25 00:25:04 +08:00
Johannes Gäßler	e789095502	llama: print memory breakdown on exit (#15860 ) * llama: print memory breakdown on exit b6569	2025-09-24 16:53:48 +02:00
Acly	f2a789e334	ggml : split graph allocations according to backend max buffer size (#15815 ) * ggml : make gallocr respect the backend's max buffer size * if the graph requires more memory than can fit into a single allocation, split it into multiple backend buffers * vulkan: report the actual max allocation size in buffer type interface * fix missing newline, apple-clang warning * track size of individual chunks in ggml_dyn_tallocr and raise max chunks. revert to use suballocation_block_size as max chunk size for vulkan. * track (chunk, offset) pairs instead of "global" offsets through gallocr. * simpler, don't need loops to map between local/global offsets * touches more code * fix dyn_tallocr_max_size and initialization * fix memory leak when buffers are reused due to same buffer type appearing multiple times * make vbuffer allocation follow the same logic as backend_buffer did before * continue to use leftover unallocated space of previous chunks after a new one has been created * treat free blocks of each chunk as separate list * they're still allocated together, but start/end of each chunk is tracked, and allocate/free iterate over sub-ranges * exhaust freed blocks of all chunks before considering their last blocks with unallocated space * start with 0 chunks/blocks and create chunks as needed * allow the last chunk to grow beyond max size * refactor: move adding new free block and new chunk into separate functions * allocate chunks individually with a separate free-blocks list for each one * needs a bit more memory/allocations/indirections, but code is simpler * fix warnings (missing static) & debug checks b6568	2025-09-24 16:17:49 +02:00
Tarek Dakhran	3a59971967	model : add label for LiquidAI LFM2-2.6B model (#16204 ) * model : add label for LiquidAI LFM2-2.6B model HF link: [LiquidAI/LFM2-2.6B](https://huggingface.co/LiquidAI/LFM2-2.6B). Support for GGUF conversion and inference is added in #14620. However, due to similar `n_embd`, it identifies as a 1.2B model. Fix the label by using `n_ff` to identify the model instead. Output of `llama-bench`: ``` \| model \| size \| params \| backend \| threads \| test \| t/s \| \| ------------------------------ \| ---------: \| ---------: \| ---------- \| ------: \| --------------: \| -------------------: \| \| lfm2 1.2B F16 \| 2.18 GiB \| 1.17 B \| CPU \| 10 \| pp512 \| 223.97 ± 5.32 \| \| lfm2 2.6B F16 \| 4.79 GiB \| 2.57 B \| CPU \| 10 \| pp512 \| 92.53 ± 4.14 \| \| lfm2 350M F16 \| 676.25 MiB \| 354.48 M \| CPU \| 10 \| pp512 \| 725.52 ± 11.70 \| \| lfm2 700M F16 \| 1.38 GiB \| 742.49 M \| CPU \| 10 \| pp512 \| 336.22 ± 12.93 \| ``` * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> b6567	2025-09-24 13:42:26 +02:00
Jie Fu (傅杰)	63b54c81a6	model-conversion : make causal-verify-logits fails with model names containing "." (#16215 ) Signed-off-by: Jie Fu <jiefu@tencent.com>	2025-09-24 10:25:26 +02:00
Uilian Ries	152729f884	common : add missing chrono header for common.cpp (#16211 ) Signed-off-by: Uilian Ries <uilianries@gmail.com> b6565	2025-09-24 09:53:47 +03:00
Sigbjørn Skjæret	c0c59c1157	codeowners : match all requirements files (#16214 )	2025-09-24 08:53:20 +02:00
Jie Fu (傅杰)	7735706b93	model-conversion : run-org-model.py fails to run on mac m1 (#16213 ) Signed-off-by: Jie Fu <jiefu@tencent.com>	2025-09-24 08:46:52 +02:00
Daniel Bevenius	4d9ea03d17	codeowners : use slash prefix for root files [no ci] (#16210 ) This commit adds a leading slash to the paths of root-level files in the CODEOWNERS file. The motivation for this is that these might otherwise match files in subdirectories that have other/additional owners will override them. Refs: https://github.com/ggml-org/llama.cpp/pull/16209#issuecomment-3326434274	2025-09-24 08:10:09 +02:00
Jie Fu (傅杰)	8ba548dae2	model-conversion : fix the make targets in the README.md (#16209 ) Fix two incorrect make targets in the readme. Signed-off-by: Jie Fu <jiefu@tencent.com>	2025-09-24 06:19:23 +02:00
Georgi Gerganov	f505bd83ca	ci : disable AMD workflows + update NVIDIA workflows (#16200 ) * ci : disable AMD workflows + update NVIDIA workflows * cont : fixes * cont : update nvidia vulkan workflows	2025-09-23 20:41:40 +03:00
Georgi Gerganov	0889589dbe	ci : enable Vulkan workflow on Mac (#16194 )	2025-09-23 13:44:25 +03:00
Xiangyan Sun	4e29084ba4	ggml-cpu: Respect cpumask settings (#16164 ) b6558	2025-09-23 11:58:12 +03:00
Sigbjørn Skjæret	f6b4af3d04	ggml : fix uninitialized is_on_grid in quantize_row_iq3_xxs_impl (#15928 ) * fix uninitialized is_on_grid in quantize_row_iq3_xxs_impl * change initialization to true b6557	2025-09-23 10:25:20 +02:00
Aaron Teo	264f1b5187	zdnn: refactor codebase + add docs (#16178 ) * zdnn: initial matmul refactor Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: rm static from funcs Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: update ggml-zdnn.h Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: change header files to hpp Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: switch to common.hpp Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: move mulmat forward around Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: rm inline from utils Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: code cleanup Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * docs: add zDNN docs Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> --------- Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> b6556	2025-09-23 14:53:05 +08:00
Daniel Bevenius	0bc7cc7154	codeowners : add @danbev to model-conversion example [no ci] (#16190 ) This commit adds examples/model-conversion/ to the CODEOWNERS file and assigns myself (@danbev) as the code owner for this directory.	2025-09-23 09:13:22 +03:00
Aaron Teo	4b9f4cb0f8	devops: add s390x containers (#15915 ) * devops: add s390x dockerfile Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: add missing ninja Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: move s390x docker into cpu docker Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: rework s390x docker Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: copy more tools Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: add server build step Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: remove apt clean steps as distroless misses it Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: remove apt commands from distroless Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: fix shared libs in distroless Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: use correct libs path Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: fix shared libs Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: add collector stage Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: fix missing stage ref Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: fix permission issue Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: fix unknown model loading failures Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: attempt at fixing model loading failure Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: fix missing ggml shared object failure to load model Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: remove move shared objects Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: move libggml-cpu and blas into bin Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: finalise hardened server stage Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: add cli target Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: fix typos Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: fix missing shared libraries in base Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: update debian target Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: formalise llama.cpp loc Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * Revert "devops: formalise llama.cpp loc" This reverts commit `0a7664af84`. Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: formalise llama.cpp loc Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> (cherry picked from commit `0a7664af84`) Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: attempt at fixing missing dir Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: attempt at making it cache the build Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: fix copying process Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: make build dir an argument Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * Revert "devops: make build dir an argument" This reverts commit `438698976b`. Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: add build stage for gguf-py Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: move gguf-py installation into build stage Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: break system packages? Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: add rust compiler installer Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: fix rustc not found Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: remove cache mount to allow rustc to persist Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: move rustc installation to another layer Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: move gguf-py installation to full stage, fix copying Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: remove rustc installation in build Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: disable full target for now Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: attempting static build Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: merge s390x dockerfile into cpu for now Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: switch to gcc image for build step Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: remove build essentials Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: install openblas into base target Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: go back to s390x dockerfile Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: remove libggml and libblas Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: add full target Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: add break system packages Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: add libjpeg Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: add missing cmake dep Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: finalise docker images for s390x Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: add custom openblas patch Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: use libopenblas-dev instead of libopenblas-openmp-dev Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: add s390x docker build Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> --------- Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>	2025-09-23 13:59:34 +08:00
Daniel Bevenius	85e72271ba	ggml-cpu : fix typo in gemm comments [no ci] (#16189 )	2025-09-23 05:59:03 +02:00
Gabe Goodhart	1d0125bcf1	feat: Add conversion support in GraniteHybrid for non-hybrid (all attn) (#16177 ) This is a configuration of the hparams in the GraniteHybrid architecture that devolves to the Granite (or GraniteMoe) architecture (ie Granite 3.x). It may be used for some models in the Granite 4 family with the GraniteHybrid architecture acting as a superset arch. Rather than support it directly in the c++ graph, we simply coerce the architecture flag back to the correct "granite" or "granitemoe" architecture. Branch: gabe-l-hart/GraniteNonHybridConversion Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-09-22 20:40:10 +02:00
Haiyue Wang	351f3da39c	clang-tidy : disable warning about performance enum size (#16127 ) Disable 'performance-enum-size' checking: Enum 'llama_token_type' uses a larger base type ('unsigned int', size: 4 bytes) than necessary for its value set, consider using 'std::uint8_t' (1 byte) as the base type to reduce its size.	2025-09-22 19:57:46 +02:00
Sigbjørn Skjæret	3ecb2f671a	ggml : implement set_rows with i32 index (#16159 ) * implement set_rows with i32 index * template fix * test quantized path warnings-- * Apply suggestions from code review Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * forgotten name change * deduplicate cuda/sycl and test-fix * indent++ * vulkan: support set_rows with i32 index type (#16162) * disable i32 index for webgpu for now --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Jeff Bolz <jbolz@nvidia.com> b6550	2025-09-22 19:13:00 +02:00
Georgi Gerganov	432cf4304c	codeowners : update + cleanup (#16174 ) --------- Co-authored-by: slaren <slarengh@gmail.com> b6549	2025-09-22 18:20:21 +03:00
Adrien Gallouët	37a23c17bd	common : enable `--offline` mode without curl support (#16137 ) * common : use the json parser Signed-off-by: Adrien Gallouët <angt@huggingface.co> * common : enable --offline mode without CURL support This change refactors the download logic to properly support offline mode even when the project is built without CURL. Without this commit, using `--offline` would give the following error: error: built without CURL, cannot download model from the internet even if all the files are already cached. Signed-off-by: Adrien Gallouët <angt@huggingface.co> --------- Signed-off-by: Adrien Gallouët <angt@huggingface.co> b6548	2025-09-22 15:13:51 +03:00
Quentin Bramas	138c87ce8b	webui : fix handling incomplete chunks (#16107 )	2025-09-22 11:53:13 +03:00
GideonSerf	c6db9a1027	embedding : fix typos in README (#16171 )	2025-09-22 11:49:58 +03:00
Haiyue Wang	d05affbab7	common : remove unused local variables (#16140 ) These two local variables 'arg' and 'arg_prefix' have been overridden by: 1. for (const auto & arg : opt.args) 2. for (int i = 1; i < argc; i++) { const std::string arg_prefix = "--"; std::string arg = argv[i]; b6545	2025-09-22 11:48:42 +03:00
Georgi Gerganov	4f324a556c	ggml : extend ggml_can_fuse to work with non-sequential nodes (#16123 ) * ggml : extend ggml_can_fuse to work with non-sequential nodes in the graph * cont : fix wrong bounds check condition * cont : remove unnecessary overload b6544	2025-09-22 11:12:37 +03:00
Georgi Gerganov	a71ae3ba7a	ggml : add ggml_op_is_empty (#16122 ) * ggml : add ggml_op_is_empty * ggml : move to ggml-impl.h b6543	2025-09-22 11:12:09 +03:00
Xuan-Son Nguyen	05a2458121	codeowners : update ownership for @ngxson and @allozuar (#16128 )	2025-09-22 11:10:58 +03:00
Shin-myoung-serp	96fdca043b	Vulkan: add conv_transpose_2d operation (#16022 ) * Vulkan: add conv_transpose_2d operation * Vulkan: fix typo in conv_transpose_2d shader(s0mp, s0L, s1mp, s1L) * Vulkan: fix incorrect indentation in conv_transpose_2d shader * Vulkan: add checking the push constants size limit and reuse conv2d_mm.comp for conv_transpose_2d operation * Vulkan: revert the order of the index calculation and bound check in conv_2d shader * Vulkan: explicitly check push constants limit in supports_op() for conv_transpose_2d operation. * Vulkan: remove unnecessary lower bound checks for H/W_idx in the conv_2d shader. b6541	2025-09-22 10:04:01 +02:00
Sigbjørn Skjæret	b2d980fce0	codeowners : claim responsibility for ci, models, gguf-py and convert (#16124 ) * claim responsibility for ci, gguf-py and convert * add myself to various src/llama- files	2025-09-22 10:59:05 +03:00
Georgi Gerganov	5c6106a696	contrib : update roles (#16113 ) * contrib : update roles * contrib : merge PR sections + add link to CI instructions Updated pull request guidelines for contributors and collaborators, and clarified merging practices for maintainers.	2025-09-22 10:58:02 +03:00
Georgi Gerganov	ec65fb52f0	ci : remove vulkaninfo calls (#16169 )	2025-09-22 10:16:05 +03:00
Georgi Gerganov	1d660d2fae	ci : use smaller model (#16168 ) * ci : switch from gemma to qwen3 0.6b * ci : use smaller model for some tests	2025-09-22 09:11:39 +03:00
Jeff Bolz	a20d810d79	vulkan: add RTE variants of exp shader (#16165 ) This fixes some failures on Turing where "round to zero" rounds to the max f16 value but the CPU reference value is infinite. b6536	2025-09-22 07:37:17 +02:00

6585 Commits