llama.cpp

mirror of https://github.com/ggml-org/llama.cpp.git synced 2025-10-27 08:21:30 +00:00

Author	SHA1	Message	Date
junchao-zhao	aa719c2f88	ggml : fix loongarch lsx compilation error (#15864 ) b6580	2025-09-25 12:22:55 +03:00
Johannes Gäßler	4cdd0bb453	docs: fix typo [no ci] (#16244 )	2025-09-25 12:12:27 +03:00
Douglas Hanley	b5bd037832	llama : add support for qwen3 reranker (#15824 ) b6578	2025-09-25 11:53:09 +03:00
Georgi Gerganov	dfcd53f7ec	metal : fuse NORM + MUL + ADD, support non-multiples of 4 (#16220 ) * metal : fuse NORM + MUL + ADD * metal : support norms of non-multiple of 4 * cont : fix comment [no ci]	2025-09-25 11:30:16 +03:00
Georgi Gerganov	4ea00794b8	metal : relax reorder conditions (#16216 ) b6576	2025-09-25 11:29:42 +03:00
Georgi Gerganov	02a6a82ae7	metal : restore im2col perf (#16219 ) b6575	2025-09-25 11:29:08 +03:00
Radoslav Gerganov	c498fc82fe	rpc : use ggml logging facilities Use RPC_DEBUG environment variable to enable debug messages. Add helper macro LOG_DBG() which does an early check of the env var before calling GGML_LOG_DEBUG(). Make sure we log a debug message for every server function. b6574	2025-09-25 07:20:02 +00:00
Aaron Teo	e7a5130a20	codeowners: add ownership of zdnn backend [no ci] (#16232 ) add @Andreas-Krebbel to owners of zDNN backend Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>	2025-09-25 08:06:30 +03:00
Eve	bee378e098	ci: run the x64 and arm ci on the github machines instead (#16183 ) * run the x64 ci on regular machines * set up the same thing for arm fix test-quantize-perf just like #12306 * try to disable sve * add another sve run b6572	2025-09-25 08:06:06 +03:00
Aaron Teo	5fb557653b	devops: fix s390x docker release failure (#16231 )	2025-09-25 11:36:30 +08:00
Aaron Teo	4ae88d07d0	codeowners: add ownership of zdnn backend [no ci] (#16229 ) add @AlekseiNikiforovIBM to owners of zDNN backend Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>	2025-09-25 00:25:04 +08:00
Johannes Gäßler	e789095502	llama: print memory breakdown on exit (#15860 ) * llama: print memory breakdown on exit b6569	2025-09-24 16:53:48 +02:00
Acly	f2a789e334	ggml : split graph allocations according to backend max buffer size (#15815 ) * ggml : make gallocr respect the backend's max buffer size * if the graph requires more memory than can fit into a single allocation, split it into multiple backend buffers * vulkan: report the actual max allocation size in buffer type interface * fix missing newline, apple-clang warning * track size of individual chunks in ggml_dyn_tallocr and raise max chunks. revert to use suballocation_block_size as max chunk size for vulkan. * track (chunk, offset) pairs instead of "global" offsets through gallocr. * simpler, don't need loops to map between local/global offsets * touches more code * fix dyn_tallocr_max_size and initialization * fix memory leak when buffers are reused due to same buffer type appearing multiple times * make vbuffer allocation follow the same logic as backend_buffer did before * continue to use leftover unallocated space of previous chunks after a new one has been created * treat free blocks of each chunk as separate list * they're still allocated together, but start/end of each chunk is tracked, and allocate/free iterate over sub-ranges * exhaust freed blocks of all chunks before considering their last blocks with unallocated space * start with 0 chunks/blocks and create chunks as needed * allow the last chunk to grow beyond max size * refactor: move adding new free block and new chunk into separate functions * allocate chunks individually with a separate free-blocks list for each one * needs a bit more memory/allocations/indirections, but code is simpler * fix warnings (missing static) & debug checks b6568	2025-09-24 16:17:49 +02:00
Tarek Dakhran	3a59971967	model : add label for LiquidAI LFM2-2.6B model (#16204 ) * model : add label for LiquidAI LFM2-2.6B model HF link: [LiquidAI/LFM2-2.6B](https://huggingface.co/LiquidAI/LFM2-2.6B). Support for GGUF conversion and inference is added in #14620. However, due to similar `n_embd`, it identifies as a 1.2B model. Fix the label by using `n_ff` to identify the model instead. Output of `llama-bench`: ``` \| model \| size \| params \| backend \| threads \| test \| t/s \| \| ------------------------------ \| ---------: \| ---------: \| ---------- \| ------: \| --------------: \| -------------------: \| \| lfm2 1.2B F16 \| 2.18 GiB \| 1.17 B \| CPU \| 10 \| pp512 \| 223.97 ± 5.32 \| \| lfm2 2.6B F16 \| 4.79 GiB \| 2.57 B \| CPU \| 10 \| pp512 \| 92.53 ± 4.14 \| \| lfm2 350M F16 \| 676.25 MiB \| 354.48 M \| CPU \| 10 \| pp512 \| 725.52 ± 11.70 \| \| lfm2 700M F16 \| 1.38 GiB \| 742.49 M \| CPU \| 10 \| pp512 \| 336.22 ± 12.93 \| ``` * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> b6567	2025-09-24 13:42:26 +02:00
Jie Fu (傅杰)	63b54c81a6	model-conversion : make causal-verify-logits fails with model names containing "." (#16215 ) Signed-off-by: Jie Fu <jiefu@tencent.com>	2025-09-24 10:25:26 +02:00
Uilian Ries	152729f884	common : add missing chrono header for common.cpp (#16211 ) Signed-off-by: Uilian Ries <uilianries@gmail.com> b6565	2025-09-24 09:53:47 +03:00
Sigbjørn Skjæret	c0c59c1157	codeowners : match all requirements files (#16214 )	2025-09-24 08:53:20 +02:00
Jie Fu (傅杰)	7735706b93	model-conversion : run-org-model.py fails to run on mac m1 (#16213 ) Signed-off-by: Jie Fu <jiefu@tencent.com>	2025-09-24 08:46:52 +02:00
Daniel Bevenius	4d9ea03d17	codeowners : use slash prefix for root files [no ci] (#16210 ) This commit adds a leading slash to the paths of root-level files in the CODEOWNERS file. The motivation for this is that these patterns might otherwise also match files in subdirectories that have other/additional owners, and would override them. Refs: https://github.com/ggml-org/llama.cpp/pull/16209#issuecomment-3326434274	2025-09-24 08:10:09 +02:00
Jie Fu (傅杰)	8ba548dae2	model-conversion : fix the make targets in the README.md (#16209 ) Fix two incorrect make targets in the readme. Signed-off-by: Jie Fu <jiefu@tencent.com>	2025-09-24 06:19:23 +02:00
Georgi Gerganov	f505bd83ca	ci : disable AMD workflows + update NVIDIA workflows (#16200 ) * ci : disable AMD workflows + update NVIDIA workflows * cont : fixes * cont : update nvidia vulkan workflows	2025-09-23 20:41:40 +03:00
Georgi Gerganov	0889589dbe	ci : enable Vulkan workflow on Mac (#16194 )	2025-09-23 13:44:25 +03:00
Xiangyan Sun	4e29084ba4	ggml-cpu: Respect cpumask settings (#16164 ) b6558	2025-09-23 11:58:12 +03:00
Sigbjørn Skjæret	f6b4af3d04	ggml : fix uninitialized is_on_grid in quantize_row_iq3_xxs_impl (#15928 ) * fix uninitialized is_on_grid in quantize_row_iq3_xxs_impl * change initialization to true b6557	2025-09-23 10:25:20 +02:00
Aaron Teo	264f1b5187	zdnn: refactor codebase + add docs (#16178 ) * zdnn: initial matmul refactor Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: rm static from funcs Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: update ggml-zdnn.h Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: change header files to hpp Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: switch to common.hpp Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: move mulmat forward around Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: rm inline from utils Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: code cleanup Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * docs: add zDNN docs Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> --------- Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> b6556	2025-09-23 14:53:05 +08:00
Daniel Bevenius	0bc7cc7154	codeowners : add @danbev to model-conversion example [no ci] (#16190 ) This commit adds examples/model-conversion/ to the CODEOWNERS file and assigns myself (@danbev) as the code owner for this directory.	2025-09-23 09:13:22 +03:00
Aaron Teo	4b9f4cb0f8	devops: add s390x containers (#15915 ) * devops: add s390x dockerfile Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: add missing ninja Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: move s390x docker into cpu docker Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: rework s390x docker Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: copy more tools Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: add server build step Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: remove apt clean steps as distroless misses it Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: remove apt commands from distroless Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: fix shared libs in distroless Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: use correct libs path Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: fix shared libs Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: add collector stage Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: fix missing stage ref Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: fix permission issue Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: fix unknown model loading failures Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: attempt at fixing model loading failure Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: fix missing ggml shared object failure to load model Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: remove move shared objects Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: move libggml-cpu and blas into bin Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: finalise hardened server stage Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: add cli target Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: fix typos Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: fix missing shared libraries in base Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: update debian target Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: formalise llama.cpp loc Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * Revert "devops: formalise llama.cpp loc" This reverts commit `0a7664af84`. Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: formalise llama.cpp loc Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> (cherry picked from commit `0a7664af84`) Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: attempt at fixing missing dir Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: attempt at making it cache the build Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: fix copying process Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: make build dir an argument Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * Revert "devops: make build dir an argument" This reverts commit `438698976b`. Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: add build stage for gguf-py Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: move gguf-py installation into build stage Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: break system packages? Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: add rust compiler installer Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: fix rustc not found Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: remove cache mount to allow rustc to persist Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: move rustc installation to another layer Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: move gguf-py installation to full stage, fix copying Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: remove rustc installation in build Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: disable full target for now Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: attempting static build Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: merge s390x dockerfile into cpu for now Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: switch to gcc image for build step Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: remove build essentials Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: install openblas into base target Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: go back to s390x dockerfile Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: remove libggml and libblas Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: add full target Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: add break system packages Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: add libjpeg Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: add missing cmake dep Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: finalise docker images for s390x Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: add custom openblas patch Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: use libopenblas-dev instead of libopenblas-openmp-dev Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: add s390x docker build Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> --------- Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>	2025-09-23 13:59:34 +08:00
Daniel Bevenius	85e72271ba	ggml-cpu : fix typo in gemm comments [no ci] (#16189 )	2025-09-23 05:59:03 +02:00
Gabe Goodhart	1d0125bcf1	feat: Add conversion support in GraniteHybrid for non-hybrid (all attn) (#16177 ) This is a configuration of the hparams in the GraniteHybrid architecture that devolves to the Granite (or GraniteMoe) architecture (ie Granite 3.x). It may be used for some models in the Granite 4 family with the GraniteHybrid architecture acting as a superset arch. Rather than support it directly in the c++ graph, we simply coerce the architecture flag back to the correct "granite" or "granitemoe" architecture. Branch: gabe-l-hart/GraniteNonHybridConversion Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-09-22 20:40:10 +02:00
Haiyue Wang	351f3da39c	clang-tidy : disable warning about performance enum size (#16127 ) Disable 'performance-enum-size' checking: Enum 'llama_token_type' uses a larger base type ('unsigned int', size: 4 bytes) than necessary for its value set, consider using 'std::uint8_t' (1 byte) as the base type to reduce its size.	2025-09-22 19:57:46 +02:00
Sigbjørn Skjæret	3ecb2f671a	ggml : implement set_rows with i32 index (#16159 ) * implement set_rows with i32 index * template fix * test quantized path warnings-- * Apply suggestions from code review Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * forgotten name change * deduplicate cuda/sycl and test-fix * indent++ * vulkan: support set_rows with i32 index type (#16162) * disable i32 index for webgpu for now --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Jeff Bolz <jbolz@nvidia.com> b6550	2025-09-22 19:13:00 +02:00
Georgi Gerganov	432cf4304c	codeowners : update + cleanup (#16174 ) --------- Co-authored-by: slaren <slarengh@gmail.com> b6549	2025-09-22 18:20:21 +03:00
Adrien Gallouët	37a23c17bd	common : enable `--offline` mode without curl support (#16137 ) * common : use the json parser Signed-off-by: Adrien Gallouët <angt@huggingface.co> * common : enable --offline mode without CURL support This change refactors the download logic to properly support offline mode even when the project is built without CURL. Without this commit, using `--offline` would give the following error: error: built without CURL, cannot download model from the internet even if all the files are already cached. Signed-off-by: Adrien Gallouët <angt@huggingface.co> --------- Signed-off-by: Adrien Gallouët <angt@huggingface.co> b6548	2025-09-22 15:13:51 +03:00
Quentin Bramas	138c87ce8b	webui : fix handling incomplete chunks (#16107 )	2025-09-22 11:53:13 +03:00
GideonSerf	c6db9a1027	embedding : fix typos in README (#16171 )	2025-09-22 11:49:58 +03:00
Haiyue Wang	d05affbab7	common : remove unused local variables (#16140 ) These two local variables 'arg' and 'arg_prefix' have been overridden by: 1. for (const auto & arg : opt.args) 2. for (int i = 1; i < argc; i++) { const std::string arg_prefix = "--"; std::string arg = argv[i]; b6545	2025-09-22 11:48:42 +03:00
Georgi Gerganov	4f324a556c	ggml : extend ggml_can_fuse to work with non-sequential nodes (#16123 ) * ggml : extend ggml_can_fuse to work with non-sequential nodes in the graph * cont : fix wrong bounds check condition * cont : remove unnecessary overload b6544	2025-09-22 11:12:37 +03:00
Georgi Gerganov	a71ae3ba7a	ggml : add ggml_op_is_empty (#16122 ) * ggml : add ggml_op_is_empty * ggml : move to ggml-impl.h b6543	2025-09-22 11:12:09 +03:00
Xuan-Son Nguyen	05a2458121	codeowners : update ownership for @ngxson and @allozuar (#16128 )	2025-09-22 11:10:58 +03:00
Shin-myoung-serp	96fdca043b	Vulkan: add conv_transpose_2d operation (#16022 ) * Vulkan: add conv_transpose_2d operation * Vulkan: fix typo in conv_transpose_2d shader(s0mp, s0L, s1mp, s1L) * Vulkan: fix incorrect indentation in conv_transpose_2d shader * Vulkan: add checking the push constants size limit and reuse conv2d_mm.comp for conv_transpose_2d operation * Vulkan: revert the order of the index calculation and bound check in conv_2d shader * Vulkan: explicitly check push constants limit in supports_op() for conv_transpose_2d operation. * Vulkan: remove unnecessary lower bound checks for H/W_idx in the conv_2d shader. b6541	2025-09-22 10:04:01 +02:00
Sigbjørn Skjæret	b2d980fce0	codeowners : claim responsibility for ci, models, gguf-py and convert (#16124 ) * claim responsibility for ci, gguf-py and convert * add myself to various src/llama- files	2025-09-22 10:59:05 +03:00
Georgi Gerganov	5c6106a696	contrib : update roles (#16113 ) * contrib : update roles * contrib : merge PR sections + add link to CI instructions Updated pull request guidelines for contributors and collaborators, and clarified merging practices for maintainers.	2025-09-22 10:58:02 +03:00
Georgi Gerganov	ec65fb52f0	ci : remove vulkaninfo calls (#16169 )	2025-09-22 10:16:05 +03:00
Georgi Gerganov	1d660d2fae	ci : use smaller model (#16168 ) * ci : switch from gemma to qwen3 0.6b * ci : use smaller model for some tests	2025-09-22 09:11:39 +03:00
Jeff Bolz	a20d810d79	vulkan: add RTE variants of exp shader (#16165 ) This fixes some failures on Turing where "round to zero" rounds to the max f16 value but the CPU reference value is infinite. b6536	2025-09-22 07:37:17 +02:00
Georgi Gerganov	4d0a7cbc61	ci : adjust params for less runtime (#16167 ) * ci : adjust params for less runtime * ci : gate BF16 on some hardware * ci : move extra tests to Arm runner b6535	2025-09-22 08:31:40 +03:00
Ruben Ortlam	9073a73d82	vulkan: vec dot matrix multiplication fix (#16151 ) * vulkan: fix matrix multiplication index calculation for odd m/n and odd k in combination with batching * add odd m/n + odd k test with batching b6534	2025-09-22 07:22:43 +02:00
lhez	51f5a45fbe	opencl: fix concat crash on win arm64 with Adreno (#15944 ) b6533	2025-09-21 16:42:10 -07:00
lhez	c4510dc937	opencl: initial `q8_0` mv support (#15732 ) b6532	2025-09-21 14:48:44 -07:00
Georgi Gerganov	da30ab5f86	ci : add label for the RISC-V runner (#16150 )	2025-09-21 19:00:27 +03:00
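
As an illustration of commit c498fc82fe above ("rpc : use ggml logging facilities"), the early environment check it describes could look roughly like the sketch below. This is not the code from the commit: `DEMO_LOG_DBG`, `demo_log_debug`, `demo_debug_enabled`, `demo_server_function` and `g_debug_lines` are hypothetical names, the sink is a plain `vfprintf` stand-in for `GGML_LOG_DEBUG()`, and treating any set value of `RPC_DEBUG` as "enabled" is an assumption.

```cpp
#include <cstdarg>
#include <cstdio>
#include <cstdlib>

// Hypothetical counter so the effect of the check can be observed; not part of llama.cpp.
static int g_debug_lines = 0;

// Stand-in for a printf-style debug sink such as GGML_LOG_DEBUG() (assumption).
static void demo_log_debug(const char * fmt, ...) {
    va_list args;
    va_start(args, fmt);
    std::vfprintf(stderr, fmt, args);
    va_end(args);
    g_debug_lines++;
}

// Assumption: debug output is enabled whenever RPC_DEBUG is set, whatever its value.
static bool demo_debug_enabled() {
    return std::getenv("RPC_DEBUG") != nullptr;
}

// Check the environment first, so the arguments are not even evaluated
// when debug logging is off.
#define DEMO_LOG_DBG(...)                    \
    do {                                     \
        if (demo_debug_enabled()) {          \
            demo_log_debug(__VA_ARGS__);     \
        }                                    \
    } while (0)

// Hypothetical server function that logs one debug message per call.
static int demo_server_function(int x) {
    DEMO_LOG_DBG("[%s] x = %d\n", __func__, x);
    return x * 2;
}
```

With `RPC_DEBUG` unset, `demo_server_function` returns its result without emitting anything; with it set, each call writes one line to stderr. A real implementation might cache the `getenv` result rather than querying it on every call.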


6830 Commits