llama.cpp

mirror of https://github.com/ggml-org/llama.cpp.git synced 2025-10-27 08:21:30 +00:00

Author	SHA1	Message	Date
Max Krasnyansky	63d2fc46e1	Add experimental ggml-hexagon backend for the Hexagon NPU (#16547 ) * model: add support for extra bufs for all devices * hexagon: add experimental ggml-hexagon backend for the Hexagon NPU This commit introduces a new experimental backend `ggml-hexagon` with support for the Hexagon NPU. Highlights: - Supports Hexagon versions: v73, v75, v79, and v81 - Targets Android devices based on Snapdragon SoCs: Gen3, 8-Elite, and 8-Elite Gen5 - Supports Q4_0, Q8_0, MXFP4, and FP32 data types - Implements core LLM ops: MUL_MAT/MUL_MAT_ID, ADD/SUB/MUL/ADD_ID, RMS_NORM, ROPE, GLU/SWIGLU, SOFTMAX Note: This backend is experimental and may exhibit instability or limited performance across supported devices. It is intended for early testing and feedback from llama.cpp/ggml developer and user community. Co-Authored-By: Rajdeep Ganguly <rganguly@qti.qualcomm.com> Co-Authored-By: Todor Boinovski <todorb@qti.qualcomm.com> * hexagon: fix format checker errors * hexagon: update readme and cmake presets * ci: add android-ndk-build jobs that build plain ARM64 and Snapdragon versions * hexagon: add simple graph optimizer for stacking MUL_MAT ops with the same input * hexagon: move ADB helper scripts into scripts/snapdragon/adb * hexagon: replace all f/printfs with GGML_LOG_... * readme: add hexagon to the list supported backends * hexagon: stack malmuts with quantized inputs only * hexagon: add TODO for fixing issues in hexagon_graph_optimize * hexagon: update to hex-sdk 6.4.0 and add scripts for running on QDC * scripts: fix lint errors * scripts: update qdc pytest script to make linter happy * hexagon: add reduce sum in fp32 * hexagon: reduce number of vector stores in matmul output * hexagon: remove the need for vdelta in reduce-multiply-x8 * hexagon: consistent use of reduce_sum_fp32 for row_sums * hexagon: some more matmul optimizations and comments Optimize cases where tensor dims are not multiple of 1024 (e.g in Qwen models). We've handled those cases already but at a higher overhead. * hexagon: update cmake presets * hexagon: add OPMASK support for run-bench.sh wrapper * hexagon: update to use GGML_BACKEND_API * hexagon: remove unused logic for setting tensor flags for the views * hexagon: add asserts to set/get_tensor to make sure we handle complete tensors Same asserts as the CPU backend. * hexagon: use cpy_tensor slow path for non-host buffers * hexagon: error checks in the buffer allocator * cmake: move include(extProj) under ggml-hexagon * hexagon: don't forget to delete the backend on free * hexagon: set/get_tensor size assert apply only to quantized tensors * hexagon: reintroduce HEX_VERBOSE wrapper for GGML_LOG_DEBUG for now GGML_LOG_DEBUG is always enabled for test-backend-ops and the output gets in the way. Ideally we need a bit more finer log levels. * docs: typos in hexagon developer docs (libggm-...) * hexagon: overhaul error handling in the session/device allocation this should handle all failure paths in the session allocation. * hexagon: update cmake presets to enable fp16 vectors * hexagon: remove unused time_usec function * hexagon: don't forget to release buffer contexts * hexagon: fixed indents in hvx-utils (missed clang-format auto-format failure) * hexagon: remove custom can_repeat function and use ggml_can_repeat --------- Co-authored-by: Rajdeep Ganguly <rganguly@qti.qualcomm.com> Co-authored-by: Todor Boinovski <todorb@qti.qualcomm.com>	2025-10-22 13:47:09 -07:00
Neo Zhang Jianyu	2be72c2b12	SYCL: Update to oneAPI 2025.2 (#16371 ) * update oneapi to 2025.2, use deep-learning-essentials to replace base-tool * update to 2025.2 use deeplearn essi to replace base toolkit * add missed dll * add deep learning essentials * add sycl-ls --------- Co-authored-by: Zhang Jianyu <zhang.jianyu@outlook.com>	2025-10-02 10:16:25 +03:00
Aaron Teo	264f1b5187	zdnn: refactor codebase + add docs (#16178 ) * zdnn: initial matmul refactor Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: rm static from funcs Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: update ggml-zdnn.h Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: change header files to hpp Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: switch to common.hpp Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: move mulmat forward around Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: rm inline from utils Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: code cleanup Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * docs: add zDNN docs Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> --------- Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>	2025-09-23 14:53:05 +08:00
hipudding	c0389dba43	CANN: Disable acl_graph for prefill stage (#15933 ) Since the prefill length is not fixed, graphs constructed for the prefill stage cannot be reused. For this reason, ACL graph execution is disabled by default during prefill.	2025-09-11 15:59:37 +08:00
Chenguang Li	28b5f190ef	CANN: implement LRU cache for ACL graphs (#15814 ) * CANN: implement LRU cache for ACL graphs in CANN backend - Introduce ggml_cann_graph_lru_cache to store multiple ggml_cann_graph objects. - Graphs are loaded on demand and evicted using LRU policy when capacity is exceeded. - Updated push, move_to_front, and clear methods to manage cached graphs efficiently. - Ensures reuse of graphs, reducing graph reconstruction overhead in CANN backend. * fix typo * The LRU cache capacity can be configured via an env variable Signed-off-by: noemotiovon <757486878@qq.com> * refactory acl graph * refactory && fix review comments Signed-off-by: noemotiovon <757486878@qq.com> --------- Signed-off-by: noemotiovon <757486878@qq.com>	2025-09-10 15:29:12 +08:00
hipudding	5421f63ab0	CANN: Fix precision issue on 310I DUO multi-devices (#15784 )	2025-09-04 15:12:30 +08:00
Chenguang Li	2f853687b3	CANN: Support eager execution mode under ACL graph compilation (#15712 ) * [CANN] Support eager execution mode under ACL graph compilation Add support for running operators in eager mode while ACL graph compilation is enabled. This allows bypassing graph execution and directly submitting ops, which is useful for debugging and reducing graph build overhead in certain scenarios. Signed-off-by: noemotiovon <757486878@qq.com> * fix typo Signed-off-by: noemotiovon <757486878@qq.com> * rename to acl_graph_mode Signed-off-by: noemotiovon <757486878@qq.com> --------- Signed-off-by: noemotiovon <757486878@qq.com>	2025-09-02 14:07:48 +08:00
hipudding	11490b3672	CANN: Improve loading efficiency after converting weights to NZ format. (#14985 ) * CANN: Improve loading efficiency after converting weights to NZ format. * CANN: fix typo	2025-07-31 19:47:20 +08:00
Anton Mitkov	2bf9d539dd	sycl: GGML_SYCL_DISABLE_OPT on by default for all Intel Devices (#13973 )	2025-06-25 18:09:55 +02:00
Xinpeng Dou	e21d2d4ae2	CANN: Simplify the environment variable setting(#13104 ) * Simplify the environment variable setting to specify the memory pool type. * Adjust the GGML_CANN_ASYNC_MODE setting to accept yes, enable, 1, or on (case-insensitive) as valid options. * update * fix CI * update * delete whitespace * fix according to review * update CANN.md * update CANN.md	2025-06-09 19:47:39 +08:00
Bizhao Shi	2d38b6e400	CANN: Add the basic supports of Flash Attention kernel (#13627 ) * cann: add the basic FA support * cann: update the readme * cann: update the FlashAttention with PSEShift * cann: update the input parameters in FA * cann: update the alibi with max_bias * cann: add the constrints of softcap * cann: update the docs CANN.md * cann: update the docs CANN.md * cann: fix typo of CANN.md * cann: add some comments and update the CANN.md * cann: update the CANN.md * cann: update the inner precise for fusedInferAttention * cann: update the constraints of flash_attn_ext on ggml-cann.cpp * cann: clean the whitespace * cann: clean the whitespace * cann: add a new endline	2025-05-26 10:20:18 +08:00
Xinpeng Dou	f0adb80bf7	CANN: Update CANN model support (#13162 ) * Update CANN model support status * Update of model support * update * update * update * fix format of CANN.md * fix format of CANN.md * fix format of CANN.md	2025-05-20 11:43:43 +08:00
Alberto Cabrera Pérez	725f23f1f3	sycl : backend documentation review (#13544 ) * sycl: reviewing and updating docs * Updates Runtime error codes * Improves OOM troubleshooting entry * Added a llama 3 sample * Updated supported models * Updated releases table	2025-05-19 14:38:20 +01:00
Łukasz Ślusarczyk	9c404ed54c	sycl: use oneDNN for matrices multiplication (#12972 )	2025-05-15 16:53:41 +02:00
Romain Biessy	8ed71242f4	sycl: update documentation to use -no-cnv (#12845 )	2025-04-09 11:22:04 +02:00
Nicolò Scipione	94148ba330	sycl: allow ggml-sycl configuration and compilation using Visual Studio project/solution (#12625 )	2025-04-04 16:00:46 +02:00
lhez	7d7b1bafa7	opencl: update doc for OpenCL (#12702 ) * opencl: add OpenCL to build.md * opencl: remove fixed issue/TODO * opencl: add link to OPENCL.md * opencl: update doc - refine tools requirement for Windows 11 arm64	2025-04-03 22:18:17 -07:00
Atharva Dubey	2004644b7a	ci : add env variable in ggml-ci and document the same in SYCL.md (#12736 )	2025-04-03 15:12:39 +03:00
Romain Biessy	8293970542	SYCL: Rename oneMKL to oneMath (#12192 ) * Rename oneMKL Interface to oneMath * Use oneMath for Intel vendor * Rename occurences to mkl * clang-format * Silence verbose warnings * Set oneMath HIP_TARGETS * Fix silence warnings * Remove step to build oneMath from build instructions * Use fixed oneMath version * Remove INTEL_CPU * Fold CMake oneDNN conditions * Use Intel oneMKL for Intel devices * Improve CMake message * Link against MKL::MKL_SYCL::BLAS only * Move oneMath documentation to Nvidia and AMD sections	2025-04-01 16:24:29 +08:00
Tei Home	3361e2deba	docs: update: improve the Fedoa CUDA guide (#12536 ) * docs: update fedora-cuda guide - Rename and place into Backend Folder. - Update Host-Supplied Packages. - Expand Recommended Users Section. * docs: improve the flow of CUDA-FEDORA.md	2025-03-24 11:02:26 +00:00
Svetlozar Georgiev	9ffcc9e374	sycl: cleanup oneDNN related code (#12097 )	2025-03-21 10:15:56 +08:00
Łukasz Ślusarczyk	35cae5ba05	SYCL: using graphs is configurable by environment variable and compile option (#12371 ) * alberto changes * enable sycl graphs by env variable * fixed compilation warnings in ggml-sycl.cpp * renamed graph variables * fix markdown in docs/backend/SYCL.md Co-authored-by: Romain Biessy <romain.biessy@codeplay.com> * fix markdown in docs/backend/SYCL.md again * compiling graphs by default, renamed graph_enable to graph_disable --------- Co-authored-by: Romain Biessy <romain.biessy@codeplay.com>	2025-03-18 11:16:31 +01:00
Neo Zhang Jianyu	08d5986290	[SYCL] Optimize mul_mat for Q4_0 on Intel GPU (#12035 ) * opt performance by reorder for Intel GPU * detect hw type and save opt feature, and print opt feature * correct name * support optimize graph once when compute graph, record the opt status in tensor->extra, make CI passed * add env variable GGML_SYCL_DISABLE_OPT for debug * use syclex::architecture replace the custom hw define, update the guide for GGML_SYCL_DISABLE_OPT * add performance data * mv getrows functions to separeted files * fix global variables --------- Co-authored-by: arthw <14088817+arthw@users.noreply.github.com>	2025-02-24 22:33:23 +08:00
Georgi Gerganov	68ff663a04	repo : update links to new url (#11886 ) * repo : update links to new url ggml-ci * cont : more urls ggml-ci	2025-02-15 16:40:57 +02:00
lhez	4078c77f98	docs: add OpenCL (#11697 )	2025-02-11 15:04:13 -07:00
Jafar Uruç	a07c2c8a52	docs : Update readme to build targets for local docker build (#11368 )	2025-01-24 14:30:13 +01:00
Georgi Gerganov	8648c52101	make : deprecate (#10514 ) * make : deprecate ggml-ci * ci : disable Makefile builds ggml-ci * docs : remove make references [no ci] * ci : disable swift build ggml-ci * docs : remove obsolete make references, scripts, examples ggml-ci * basic fix for compare-commits.sh * update build.md * more build.md updates * more build.md updates * more build.md updates * Update Makefile Co-authored-by: Diego Devesa <slarengh@gmail.com> --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-12-02 21:22:53 +02:00
Ruixin Huang	c6bc73951e	CANN: Update cann.md to display correctly in CLion (#10538 )	2024-11-28 15:27:11 +08:00
leo-pony	605fa66c50	CANN: Fix SOC_TYPE compile bug (#10519 ) * CANN: Fix the bug build fail on Ascend310P under two cases: 1) Manual specify SOC_TYPE 2) Under some unusual compile environment * Update the cann backend News content: Support F16 and F32 data type model for Ascend 310P NPU. * fix CANN compile fail bug: the assert in ascend kernel function doesn't supportted on some CANN version	2024-11-28 15:25:24 +08:00
Neo Zhang Jianyu	ad21c9e1f1	update rel to 4040 (#10395 ) Co-authored-by: arthw <14088817+arthw@users.noreply.github.com>	2024-11-20 13:54:25 +08:00
Romain Biessy	2a1507c162	sycl : Add option to set the SYCL architecture for all targets (#10266 ) * Add option to set the SYCL architecture for all targets * Convert GGML_SYCL_HIP_TARGET to the more generic GGML_SYCL_ARCH option * Document that setting GGML_SYCL_ARCH can improve the performance	2024-11-19 08:02:23 +00:00
Romain Biessy	5a54af4d4f	sycl: Use syclcompat::dp4a (#10267 ) * sycl: Use syclcompat::dp4a * Using the syclcompat version allow the compiler to optimize the operation with native function * Update news section * Update CI Windows oneAPI version to 2025.0 * Reword doc * Call syclcompat::dp4a inside dpct::dp4a This reverts commit `90cb61d692`.	2024-11-15 11:09:12 +08:00
Zhiyuan Li	3bcd40b3c5	Optimize RWKV6 Operator Naming and Implement Multi-core CPU/ SYCL Acceleration (#10133 ) * rwkv6: rename to wkv6 * rwkv6: support avx2 avx512 armv8 armv9 * rwkv6: update cuda file name * rwkv6: rename params * wkv on sycl * sycl: add some ops * sycl: Enhance OP support judgment * wkv6: drop armv9 and tranfer to GGML style ggml-ci * sync : ggml * update the function to use appropriate types * fix define error * Update ggml/src/ggml-cpu.c * add appropriate asserts * move element-wise functions outside * put the declaration outside the loop * rewrite to be more inline with the common pattern for distributing threads * use recommended way GGML_TENSOR_LOCALS --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Diego Devesa <slarengh@gmail.com> Co-authored-by: Plamen Minev <pacominev@gmail.com> Co-authored-by: Yuri Khrustalev <ykhrustalev@users.noreply.github.com> Co-authored-by: Meng, Hengyu <airdldl@163.com>	2024-11-07 15:19:10 +08:00
Alberto Cabrera Pérez	f536f4c439	[SYCL] Initial cmake support of SYCL for AMD GPUs (#9658 ) sycl: initial cmake support of SYCL for AMD GPUs	2024-10-02 13:57:18 +01:00
Neo Zhang Jianyu	faf67b3de4	[SYCL]set context default value to avoid memory issue, update guide (#9476 ) * set context default to avoid memory issue, update guide * Update docs/backend/SYCL.md Co-authored-by: Meng, Hengyu <hengyu.meng@intel.com> --------- Co-authored-by: arthw <14088817+arthw@users.noreply.github.com> Co-authored-by: Meng, Hengyu <hengyu.meng@intel.com>	2024-09-18 08:30:31 +08:00
蕭澧邦	cddae4884c	Correct typo run_llama2.sh > run-llama2.sh (#9149 )	2024-08-30 22:10:01 +10:00
luoyu-intel	1731d4238f	[SYCL] Add oneDNN primitive support (#9091 ) * add onednn * add sycl_f16 * add dnnl stream * add engine map * use dnnl for intel only * use fp16fp16fp16 * update doc	2024-08-22 12:50:10 +08:00
wangshuai09	cfac111e2b	cann: add doc for cann backend (#8867 ) Co-authored-by: xuedinge233 <damow890@gmail.com> Co-authored-by: hipudding <huafengchun@gmail.com>	2024-08-19 16:46:38 +08:00
Neo Zhang	a21c6fd450	update guide (#8909 ) Co-authored-by: Neo Zhang <>	2024-08-11 14:07:43 +05:30
Chen Xi	ed67bcb24f	[SYCL] fix multi-gpu issue on sycl (#8554 ) --------- Signed-off-by: Chen Xi <xi2chen@intel.com> Co-authored-by: Meng, Hengyu <hengyu.meng@intel.com>	2024-07-25 19:45:18 +08:00
Xuan Son Nguyen	be20e7f49d	Reorganize documentation pages (#8325 ) * re-organize docs * add link among docs * add link to build docs * fix style * de-duplicate sections	2024-07-05 18:08:32 +02:00

41 Commits