Commit Graph

6424 Commits

Xuan-Son Nguyen
945e1f12a6 ggml : fix condition of im2col on Metal backend (#15460) 2025-08-21 08:32:26 +03:00
stduhpf
1b0db8f6e0 server : fix webui (#15462)
* Fix webui crash after streaming

* build webui
2025-08-21 08:19:22 +03:00
Daniel Bevenius
29f538ac63 examples : remove references to make in examples [no ci] (#15457)
This commit removes references to `make` in the examples, as the build
system has been updated to use CMake directly and using `make` will now
generate an error since commit 37f10f955f
("make : remove make in favor of CMake (#15449)").
2025-08-21 06:12:28 +02:00
R0CKSTAR
8ad038c0fd musa: add GGML_UNUSED_VARS (#15446)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-08-21 11:06:05 +08:00
Diego Devesa
5682a3745f sched : copy only the used experts when offloading prompt processing (#15346) 2025-08-21 01:35:28 +02:00
teo
1bc664a26a server: fix OpenAI API compatibility for usage statistics in chat streams (#15444) 2025-08-21 00:10:08 +02:00
Johannes Gäßler
13aeb7aef2 CUDA: refactor FA support/selection code (#15454) b6218 2025-08-20 23:14:14 +02:00
Johannes Gäßler
7a6e91ad26 CUDA: replace GGML_CUDA_F16 with CUDA arch checks (#15433) 2025-08-20 16:58:49 +02:00
Jeff Bolz
fec9519802 vulkan: shorten pipeline name strings (#15431)
These detailed strings were causing increased build time on gcc.
2025-08-20 16:33:14 +02:00
Daniel Bevenius
657b8a77bd chat: handle gpt-oss return/end token inconsistency (#15421)
This commit addresses an inconsistency during inference by adding a new
member to the `templates_params` struct to indicate whether the chat is
in inference mode. This allows the gpt-oss specific function
`common_chat_params_init_gpt_oss` to check this flag and the
`add_generation_prompt` flag to determine if it should replace the
`<|return|>` token with the `<|end|>` token in the prompt.

The motivation for this change is to ensure that the formatted prompt of
past messages in `common_chat_format_single` matches the output of the
formatted new message. The issue is that the gpt-oss template returns
different end tags: `<|return|>` when `add_generation_prompt` is false,
and `<|end|>` when `add_generation_prompt` is true. This causes the
substring function to start at an incorrect position, resulting in
tokenization starting with 'tart|>' instead of '<|start|>'.

Resolves: https://github.com/ggml-org/llama.cpp/issues/15417
b6215
2025-08-20 14:26:01 +02:00
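For illustration, a minimal C++ sketch of the end-tag substitution described above, assuming a hypothetical helper (the actual logic lives in `common_chat_params_init_gpt_oss` and reads these flags from `templates_params`):

```cpp
#include <string>

// Hedged sketch, not the actual implementation: rewrite the trailing
// <|return|> tag to <|end|> so that formatting past messages matches the
// formatting of a new message. `is_inference` and `add_generation_prompt`
// mirror the flags described in the commit message.
static void normalize_gpt_oss_end_tag(std::string & prompt,
                                      bool is_inference,
                                      bool add_generation_prompt) {
    if (!is_inference || add_generation_prompt) {
        return; // the template already emits <|end|> in this case
    }
    const std::string ret_tag = "<|return|>";
    const size_t pos = prompt.rfind(ret_tag);
    if (pos != std::string::npos) {
        prompt.replace(pos, ret_tag.size(), "<|end|>");
    }
}
```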
Jie Fu (傅杰)
ec5ab1a36c common : fix context shift help message (#15448)
Signed-off-by: Jie Fu <jiefu@tencent.com>
b6214
2025-08-20 13:33:30 +03:00
xiaobing318
1a99c2d948 cmake : fix target include directories (#15450)
* Update docker.yml

Modify docker.yml so that the workflow no longer runs periodically; if you want to run the workflow, you can trigger it manually.

* feat: Modify the header file include path

1. There's no llava directory in the tools directory.
2. Because the `mtmd` CMakeLists.txt file uses the command `target_include_directories(mtmd PUBLIC .)`, any target that links against `mtmd` automatically gets the `mtmd` directory as a header search path. Therefore, you can remove `target_include_directories(${TARGET} PRIVATE ../llava)`, or use `target_include_directories(${TARGET} PRIVATE ../mtmd)` to explicitly require the `llama-server` target to use header files from `mtmd`.

* Restore the docker.yml file
b6213
2025-08-20 13:32:05 +03:00
Daniel Bevenius
37f10f955f make : remove make in favor of CMake (#15449)
This commit removes the content from the Makefile and updates the
deprecation message to state that `make` has been replaced by CMake.

The message when `make` is invoked will now be the following:
```console
$ make
Makefile:6: *** Build system changed:
 The Makefile build has been replaced by CMake.

 For build instructions see:
 https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md

.  Stop.
```

The motivation for this is that many, if not all, targets now fail to
build after the recent changes to the build system, and `make` has also
been deprecated for some time.
2025-08-20 13:31:16 +03:00
Georgi Gerganov
2f37014073 lookahead : add sample command to readme (#15447)
* lookahead : add sample command to readme

* cont : build-agnostic command
2025-08-20 13:30:46 +03:00
R0CKSTAR
a094f38143 musa: fix build warnings (#15258)
* musa: fix build warnings

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* fix warning: comparison of integers of different signs: 'const int' and 'unsigned int' [-Wsign-compare]

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
b6210
2025-08-20 10:17:37 +08:00
lhez
fb22dd07a6 opencl: mark argsort unsupported if cols exceed workgroup limit (#15375) b6209 2025-08-19 11:25:51 -07:00
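A hedged sketch of the rule this commit enforces (names illustrative; the real check sits in the OpenCL backend's op-support query, with the workgroup limit queried from the device):

```cpp
#include "ggml.h"

// Sketch: the OpenCL argsort kernel sorts one row per workgroup, so a row
// longer than the device's maximum workgroup size cannot be handled and the
// op must be reported as unsupported.
static bool argsort_supported(const struct ggml_tensor * op, size_t max_wg_size) {
    return (size_t) op->src[0]->ne[0] <= max_wg_size;
}
```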
Georgi Gerganov
9ef6b0b835 model : add gpt-oss type strings (#15424) b6208 2025-08-19 19:58:28 +03:00
Gian-Carlo Pascutto
1e19f5d462 common : Add top-nsigma sampler to help globally (#15428)
Fixes #15423.
b6207
2025-08-19 19:58:14 +03:00
Georgi Gerganov
d2fcd91cf9 server : disable context shift by default (#15416)
* server : disable context shift by default

ggml-ci

* server : make scope of test parameters local
2025-08-19 16:46:37 +03:00
SHUAI YANG
a6d3cfe7fa CANN: optimize rope operator (#15335)
* optimize rope ops

* amendment

* delete trailing whitespace

* change the variable name
b6205
2025-08-19 21:28:22 +08:00
R0CKSTAR
67f09a3a27 musa: handle __hgt2_mask, available starting from MUSA SDK rc4.3.0 (#15413)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
b6204
2025-08-19 12:33:47 +02:00
Marvin Gießing
6424594c56 ggml-cpu: add mxfp4 VSX intrinsics for Power9+ (ppc64le) hardware (#15385)
* Added VSX intrinsics for Power9+ systems

Signed-off-by: mgiessing <marvin.giessing@gmail.com>

* Manual unrolling for minor perf improvement

Signed-off-by: mgiessing <marvin.giessing@gmail.com>

* Update ggml/src/ggml-cpu/arch/powerpc/quants.c

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Signed-off-by: mgiessing <marvin.giessing@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-08-19 11:54:31 +03:00
Xuan-Son Nguyen
e9288e8869 chat : clarify the meaning of reasoning_format (#15408)
* chat : clarify the meaning of reasoning_format

* add link to this PR
b6202
2025-08-19 10:29:36 +02:00
Georgi Gerganov
9d262f4bad server : remove swa_full warning (#15399) b6201 2025-08-19 08:45:26 +03:00
Georgi Gerganov
f0d3c7405c batched-bench : use rand tokens (#15398) 2025-08-19 08:45:12 +03:00
Xuan-Son Nguyen
f08c4c0d8d mtmd : clean up clip_n_output_tokens (#15391) b6199 2025-08-18 22:53:52 +02:00
Georgi Gerganov
6d7f1117e3 codeowners : remove mmv.* 2025-08-18 22:06:44 +03:00
Georgi Gerganov
60212f1ead sync : ggml 2025-08-18 22:06:44 +03:00
Georgi Gerganov
f0c541d315 scripts : update sync scripts 2025-08-18 22:06:44 +03:00
Sigbjørn Skjæret
baa9255a45 llama : merge conts and reshapes and remove unnecessary cont (#15380)
* remove unnecessary conts and merge reshapes

* restore necessary conts

* merge more conts and reshapes

* merge even more conts and reshapes
b6195
2025-08-18 19:30:17 +02:00
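As an illustration of the kind of simplification involved (a generic sketch, not code from the PR): a `ggml_cont` feeding a reshape can often be merged into a single `ggml_cont_2d` node.

```cpp
#include "ggml.h"

// before: two graph nodes, a contiguous copy followed by a reshape
static struct ggml_tensor * reshape_before(struct ggml_context * ctx,
                                           struct ggml_tensor * x,
                                           int64_t ne0, int64_t ne1) {
    return ggml_reshape_2d(ctx, ggml_cont(ctx, x), ne0, ne1);
}

// after: one node that makes the tensor contiguous with the new shape
static struct ggml_tensor * reshape_after(struct ggml_context * ctx,
                                          struct ggml_tensor * x,
                                          int64_t ne0, int64_t ne1) {
    return ggml_cont_2d(ctx, x, ne0, ne1);
}
```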
Georgi Gerganov
3007baf201 readme : update hot topics (#15397) 2025-08-18 18:11:44 +03:00
davidef
d1d8241600 server : fix incoming tasks not processed in order (#15395) b6193 2025-08-18 17:51:42 +03:00
Dobri Danchev
618575c582 Fix broken build: require updated pip to support --break-system-packages (#15357)
* Revert "devops : fix compile bug when the BASE_CUDA_DEV_CONTAINER is based on Ubuntu 24.04 (#15005)"

This reverts commit e4e915912c.

* devops: Allow pip to modify externally-managed python environment (system installation)

- Updated pip install commands to include the --break-system-packages
  flag, ensuring compatibility when working with system-managed Python
  environments (PEP 668).

- Note: The --break-system-packages option was introduced in 2023.
  Ensure pip is updated to a recent version before using this flag.

fixes [#15004](https://github.com/danchev/llama.cpp/issues/15004)
2025-08-18 12:50:48 +02:00
compilade
f44f793172 ggml-quants : fix make_qp_quants NANs and IQ1 assertion errors (#15379)
* ggml-quants : fix make_qp_quants NANs and IQ1 assertion errors

* ggml-quants : avoid division by zero in make_q3_quants
b6191
2025-08-18 09:23:56 +02:00
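A generic, hedged sketch of the failure mode being fixed (illustrative names, not code from ggml-quants.c): when a block is all zeros, a naive scale computation divides by zero and the resulting inf/NaN propagates into the quantized weights.

```cpp
#include <math.h>

// Sketch: derive an inverse quantization scale from the max-magnitude weight
// in a block, guarding the all-zero case that would otherwise divide by zero.
static float block_inv_scale(const float * x, int n, int nmax) {
    float amax = 0.0f, max = 0.0f;
    for (int i = 0; i < n; ++i) {
        if (fabsf(x[i]) > amax) { amax = fabsf(x[i]); max = x[i]; }
    }
    if (amax == 0.0f) {
        return 0.0f; // all-zero block: skip the division entirely
    }
    return (float) nmax / max; // the naive version divides unconditionally
}
```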
Jeff Bolz
ae532eac2c vulkan: disable spirv-opt for bfloat16 shaders (#15352) b6190 2025-08-18 07:56:29 +02:00
Oleksandr Kuvshynov
e5155e6986 server : export max observed n_past value (#15361)
Add tracking for high-watermark cache usage and make it available in the /metrics endpoint.

Use case: track the largest cache usage needed under a realistic workload,
to better understand memory requirements and be able to adjust the
cache size/quantization for the model accordingly.
b6189
2025-08-18 00:28:58 +02:00
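A minimal sketch of this kind of high-watermark tracking, assuming a hypothetical `observe` hook called as slots process tokens (the actual counter is exported through the server's /metrics endpoint):

```cpp
#include <atomic>

// Sketch: keep the maximum n_past value ever observed so it can be exported
// as a gauge; the CAS loop keeps the update safe under concurrent slots.
struct n_past_watermark {
    std::atomic<int> max_value{0};

    void observe(int n_past) {
        int cur = max_value.load(std::memory_order_relaxed);
        while (n_past > cur &&
               !max_value.compare_exchange_weak(cur, n_past,
                                                std::memory_order_relaxed)) {
            // on failure, cur is refreshed with the current maximum; retry
        }
    }
};
```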
Jeff Bolz
21c17b5bef vulkan: Use larger workgroups for mul_mat_vec when M is small (#15355)
* vulkan: Use larger workgroups for mul_mat_vec when M is small

Also use subgroup instructions for (part of) the reduction when supported.
Without this, the more expensive reductions would eat into the benefits of
the larger workgroups.

* update heuristic for amd/intel

Co-authored-by: 0cc4m <picard12@live.de>

---------

Co-authored-by: 0cc4m <picard12@live.de>
b6188
2025-08-17 18:08:57 +02:00
Dong Won Kim
19f4decae0 vulkan: support sqrt (#15370) b6187 2025-08-17 16:03:09 +02:00
Sigbjørn Skjæret
4d196981d4 convert : force patch_embd weights to F16 or F32 to avoid broken GGUFs (#15367)
* force patch_embd weights to f32

* use MmprojModel base tensor_force_quant instead
2025-08-17 14:47:42 +02:00
Sigbjørn Skjæret
b143fbc87a ci : fix hang in windows-hip build/release (#15365)
* fix hang in windows-latest-cmake-hip

* apply fix to release as well
b6185
2025-08-17 13:30:23 +02:00
Jeff Bolz
de5627910d vulkan: Optimize argsort (#15354)
- Launch an appropriate number of invocations (next larger power of two).
32 invocations is common and the barrier is much cheaper there.
- Specialize for "needs bounds checking" vs not.
- Make the code less branchy and [[unroll]] the loops. In the final code,
I see no branches inside the main loop (only predicated stores) when
needs_bounds_check is false.
- Always sort ascending, then apply the ascending vs descending option when
doing the final stores to memory.
- Copy the values into shared memory, which makes them slightly cheaper to access.
b6184
2025-08-17 10:41:45 +02:00
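For the first point, a small hedged sketch of rounding the row length up to the next power of two to choose the invocation count (illustrative; the actual selection happens when the pipeline is dispatched):

```cpp
#include <cstdint>

// Sketch: smallest power of two >= x, so the bitonic-style sorting passes
// line up with the number of invocations launched.
static uint32_t next_pow2(uint32_t x) {
    uint32_t n = 1;
    while (n < x) {
        n <<= 1;
    }
    return n;
}
```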
Tarek Dakhran
65349f26f2 model : support vision LiquidAI LFM2-VL family (#15347)
* wip lfm2 vision model

* Fix conv weight

* Implement dynamic resolution

* Fix cuda

* support LFM2-VL-450M

* happy CI

* Remove extra `ggml_conv` and put others into the right place

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
b6183
2025-08-16 23:33:54 +02:00
Jeff Bolz
1fe00296f5 vulkan: fuse adds (#15252)
* vulkan: fuse adds

Fuse adds that have the same shape, which are common in MoE models.
It will currently fuse up to 6 adds, because we assume no more than
8 descriptors per dispatch. But this could be changed.

* check runtimeDescriptorArray feature

* disable multi_add for Intel due to likely driver bug
b6182
2025-08-16 11:48:22 -05:00
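A hedged sketch of the fusibility condition described above (simplified; the real check lives in the Vulkan backend and also accounts for how the fused nodes are wired together):

```cpp
#include "ggml.h"

// Sketch: two consecutive adds can be fused when their shapes match; the
// chain is capped at 6 because each fused input consumes a descriptor and
// the backend assumes at most 8 descriptors per dispatch.
static bool can_fuse_adds(const struct ggml_tensor * a,
                          const struct ggml_tensor * b,
                          int n_fused) {
    return a->op == GGML_OP_ADD && b->op == GGML_OP_ADD &&
           ggml_are_same_shape(a, b) && n_fused < 6;
}
```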
Jeff Bolz
de2192794f vulkan: Support mul_mat_id with f32 accumulators (#15337)
* vulkan: Add missing bounds checking to scalar/coopmat1 mul_mat_id

* vulkan: Support mul_mat_id with f32 accumulators, but they are not hooked up

- There's no explicit way to request f32 precision for mul_mat_id, but there
probably should be, and this gets the code in place for that.
- A couple fixes to check_results.
- Remove casts to fp16 in coopmat1 FA shader (found by inspection).
b6181
2025-08-16 11:18:31 +02:00
Jeff Bolz
2e2b22ba66 vulkan: Add missing bounds checking to scalar/coopmat1 mul_mat_id (#15334) b6180 2025-08-16 10:58:38 +02:00
rmatif
912ff8c119 OpenCL: add initial FA support (#14987)
* add F16/F16 fa support

* fix kernel init

* use mad instead of fma

* use inline function

* mark FA with sinks as unsupported for now

* add pragma unroll to loops
b6179
2025-08-16 01:05:55 -07:00
Daniel Bevenius
5e6229a840 common : fix double bos, use common_chat_templates for add_bos and add_eos (#15326)
This commit updates common_chat_templates_apply_jinja to use the
add_bos and add_eos parameters from the chat template instead of the
inputs.

The motivation for this is that if the `add_bos` and `add_eos` from the
input parameters are used, there can be a mismatch between the model and
the chat template, which can prevent the removal of duplicate BOS/EOS
tokens in chat.cpp `apply` from happening, leading to two BOS tokens
being added to the template.
b6178
2025-08-15 19:50:52 +02:00
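As a hedged illustration of the precedence change (names assumed, not the actual code): the flags defined by the chat template win over the caller-supplied inputs, so the template and the model cannot disagree.

```cpp
// Sketch: resolve the effective BOS/EOS flags from the template when it
// defines them, falling back to the caller's inputs otherwise.
struct bos_eos_flags {
    bool add_bos;
    bool add_eos;
};

static bos_eos_flags resolve_flags(bool tmpl_defines_flags,
                                   bool tmpl_add_bos, bool tmpl_add_eos,
                                   bool input_add_bos, bool input_add_eos) {
    if (tmpl_defines_flags) {
        return { tmpl_add_bos, tmpl_add_eos };
    }
    return { input_add_bos, input_add_eos };
}
```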
lhez
e2c1bfff53 opencl: add initial mxfp4 support via mv (#15270)
* opencl: add reference `mul_mv_mxfp4_f32`

* opencl: add reference `mul_mv_id` for mxfp4

* Q4_0 transpose fix for Adreno

---------

Co-authored-by: shawngu-quic <shawngu@qti.qualcomm.com>
b6177
2025-08-15 09:52:14 -07:00
Georgi Gerganov
5edf1592fd vulkan : fix out-of-bounds access in argmax kernel (#15342)
ggml-ci
b6176
2025-08-15 16:16:36 +02:00
Georgi Gerganov
db3010bd23 vulkan : fix compile warnings on macos (#15340)
ggml-ci
b6175
2025-08-15 15:28:28 +02:00