* model: add support for extra bufs for all devices

* hexagon: add experimental ggml-hexagon backend for the Hexagon NPU

  This commit introduces a new experimental backend `ggml-hexagon` with support for the Hexagon NPU.

  Highlights:
  - Supports Hexagon versions: v73, v75, v79, and v81
  - Targets Android devices based on Snapdragon SoCs: Gen3, 8-Elite, and 8-Elite Gen5
  - Supports Q4_0, Q8_0, MXFP4, and FP32 data types
  - Implements core LLM ops: MUL_MAT/MUL_MAT_ID, ADD/SUB/MUL/ADD_ID, RMS_NORM, ROPE, GLU/SWIGLU, SOFTMAX

  **Note:** This backend is experimental and may exhibit instability or limited performance across supported devices. It is intended for early testing and feedback from the llama.cpp/ggml developer and user community.

  Co-Authored-By: Rajdeep Ganguly <rganguly@qti.qualcomm.com>
  Co-Authored-By: Todor Boinovski <todorb@qti.qualcomm.com>

* hexagon: fix format checker errors
* hexagon: update readme and cmake presets
* ci: add android-ndk-build jobs that build plain ARM64 and Snapdragon versions
* hexagon: add simple graph optimizer for stacking MUL_MAT ops with the same input
* hexagon: move ADB helper scripts into scripts/snapdragon/adb
* hexagon: replace all f/printfs with GGML_LOG_...
* readme: add hexagon to the list of supported backends
* hexagon: stack matmuls with quantized inputs only
* hexagon: add TODO for fixing issues in hexagon_graph_optimize
* hexagon: update to hex-sdk 6.4.0 and add scripts for running on QDC
* scripts: fix lint errors
* scripts: update qdc pytest script to make linter happy
* hexagon: add reduce sum in fp32
* hexagon: reduce number of vector stores in matmul output
* hexagon: remove the need for vdelta in reduce-multiply-x8
* hexagon: consistent use of reduce_sum_fp32 for row_sums
* hexagon: some more matmul optimizations and comments
  Optimize cases where tensor dims are not a multiple of 1024 (e.g. in Qwen models). We've handled those cases already but at a higher overhead.
* hexagon: update cmake presets
* hexagon: add OPMASK support for run-bench.sh wrapper
* hexagon: update to use GGML_BACKEND_API
* hexagon: remove unused logic for setting tensor flags for the views
* hexagon: add asserts to set/get_tensor to make sure we handle complete tensors
  Same asserts as the CPU backend.
* hexagon: use cpy_tensor slow path for non-host buffers
* hexagon: error checks in the buffer allocator
* cmake: move include(extProj) under ggml-hexagon
* hexagon: don't forget to delete the backend on free
* hexagon: set/get_tensor size asserts apply only to quantized tensors
* hexagon: reintroduce HEX_VERBOSE wrapper for GGML_LOG_DEBUG for now
  GGML_LOG_DEBUG is always enabled for test-backend-ops and the output gets in the way. Ideally we need a bit finer log levels.
* docs: typos in hexagon developer docs (libggm-...)
* hexagon: overhaul error handling in the session/device allocation
  This should handle all failure paths in the session allocation.
* hexagon: update cmake presets to enable fp16 vectors
* hexagon: remove unused time_usec function
* hexagon: don't forget to release buffer contexts
* hexagon: fixed indents in hvx-utils (missed clang-format auto-format failure)
* hexagon: remove custom can_repeat function and use ggml_can_repeat

---------

Co-authored-by: Rajdeep Ganguly <rganguly@qti.qualcomm.com>
Co-authored-by: Todor Boinovski <todorb@qti.qualcomm.com>

# Hexagon backend developer details

## Backend libraries

The Hexagon backend consists of two parts:

  - `libggml-hexagon`
    This is the regular CPU-side GGML backend library, either shared or statically linked.

  - `libggml-htp-vNN`
    This is the NPU-side (HTP stands for Hexagon Tensor Processor) shared library that contains the Op dispatcher and kernels.
    The correct library is selected automatically at runtime based on the HW version.

Here is an example of the build artifacts:

```
~/src/llama.cpp$ ls -l pkg-adb/llama.cpp/lib/libggml*
pkg-adb/llama.cpp/lib/libggml-base.so
pkg-adb/llama.cpp/lib/libggml-cpu.so
pkg-adb/llama.cpp/lib/libggml-hexagon.so      <<< CPU library
pkg-adb/llama.cpp/lib/libggml-htp-v73.so      <<< HTP op/kernels for Hexagon v73
pkg-adb/llama.cpp/lib/libggml-htp-v75.so
pkg-adb/llama.cpp/lib/libggml-htp-v79.so
pkg-adb/llama.cpp/lib/libggml-htp-v81.so
```
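
With the backend enabled, each NPU session is exposed to applications as a regular GGML backend device (the HTP0/HTP1/... names used later in this document). As a sanity check, the devices can be listed through the public `ggml-backend.h` registry API. The snippet below is an illustrative sketch only, not part of the backend itself; `ggml_backend_load_all()` matters only when the backends are built as dynamically loadable modules.

```
// Illustrative sketch: list the GGML backend devices an application sees.
// With the Hexagon backend enabled, HTP devices are expected to appear here.
#include <stdio.h>

#include "ggml-backend.h"

int main(void) {
    // Loads dynamically built backend modules (e.g. libggml-hexagon.so);
    // statically linked backends are registered automatically.
    ggml_backend_load_all();

    for (size_t i = 0; i < ggml_backend_dev_count(); i++) {
        ggml_backend_dev_t dev = ggml_backend_dev_get(i);
        printf("device %zu: %s (%s)\n", i,
               ggml_backend_dev_name(dev),
               ggml_backend_dev_description(dev));
    }
    return 0;
}
```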

## Memory buffers

The Hexagon NPU backend takes advantage of Snapdragon's unified memory model where all buffers are fully accessible by the CPU and the NPU.
The NPU does have a dedicated tightly-coupled memory called VTCM, but that memory is used only for intermediate data (e.g. dynamically
quantized tensors) or temporary data (chunks of the weight tensors fetched via DMA).

Please note that currently the Hexagon backend does not implement the SET/GET_ROWS Ops because there is no advantage in offloading those
to the NPU at this point.

The backend does allocate non-host buffers for tensors with data types that require repacking: Q4_0, Q8_0, MXFP4.
From the MMU perspective these buffers are still regular buffers (normal access by the CPU); they are marked as non-host simply to force
the repacking.
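
The host vs. non-host distinction is visible through the generic buffer-type API. The snippet below is a rough illustration only: it queries each device's default buffer type and whether that type is host-accessible. The `*-REPACK` buffer types that show up in the load logs later in this document are additional buffer types reported by the backend itself and are not modeled here.

```
// Rough illustration: print each device's default buffer type and whether
// it is host-accessible, using only the public ggml-backend.h API.
#include <stdbool.h>
#include <stdio.h>

#include "ggml-backend.h"

static void print_default_buffer_types(void) {
    for (size_t i = 0; i < ggml_backend_dev_count(); i++) {
        ggml_backend_dev_t         dev  = ggml_backend_dev_get(i);
        ggml_backend_buffer_type_t buft = ggml_backend_dev_buffer_type(dev);
        printf("%s: default buffer type %s (host: %s)\n",
               ggml_backend_dev_name(dev),
               ggml_backend_buft_name(buft),
               ggml_backend_buft_is_host(buft) ? "yes" : "no");
    }
}
```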

## Large model handling

A Hexagon NPU session (aka a Process Domain (PD) in the Hexagon docs) is limited to a memory mapping of around 3.5GB.
In llama.cpp/GGML each Hexagon session is mapped to a single GGML backend device (HTP0, HTP1, etc).

In order to map models larger than 3.5GB we need to allocate multiple devices and split the model.
For this we take advantage of the llama.cpp/GGML multi-GPU layer-splitting support.
Each Hexagon device behaves like a GPU from the offload and model-splitting perspective.
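
Selecting the devices is what the `--device HTP0,HTP1,HTP2,HTP3` flag does in the CLI example below (with `GGML_HEXAGON_NDEV=4` making four sessions available). For reference, here is a rough programmatic sketch of the same idea using the public `llama.h` model parameters; the helper name `load_on_htp` is made up for illustration and error handling is omitted.

```
// Sketch: load a model layer-split across multiple Hexagon devices.
// The HTP0..HTP3 device names follow the convention used in this document;
// in a real program, check the results of ggml_backend_dev_by_name() for NULL.
#include <stddef.h>

#include "ggml-backend.h"
#include "llama.h"

struct llama_model * load_on_htp(const char * gguf_path) {
    llama_backend_init();
    ggml_backend_load_all(); // pick up dynamically built backends, if any

    // NULL-terminated list of devices to offload to
    ggml_backend_dev_t devices[] = {
        ggml_backend_dev_by_name("HTP0"),
        ggml_backend_dev_by_name("HTP1"),
        ggml_backend_dev_by_name("HTP2"),
        ggml_backend_dev_by_name("HTP3"),
        NULL,
    };

    struct llama_model_params mparams = llama_model_default_params();
    mparams.devices      = devices;
    mparams.split_mode   = LLAMA_SPLIT_MODE_LAYER; // split layers across the devices
    mparams.n_gpu_layers = 99;                     // offload everything, as in the CLI example

    return llama_model_load_from_file(gguf_path, mparams);
}
```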

Here is an example of running the GPT-OSS-20B model on a newer Snapdragon device with 16GB of DDR.

```
M=gpt-oss-20b-Q4_0.gguf NDEV=4 D=HTP0,HTP1,HTP2,HTP3 P=surfing.txt scripts/snapdragon/adb/run-cli.sh -no-cnv -f surfing.txt -n 32
...
LD_LIBRARY_PATH=/data/local/tmp/llama.cpp/lib
ADSP_LIBRARY_PATH=/data/local/tmp/llama.cpp/lib
GGML_HEXAGON_NDEV=4 ./bin/llama-cli --no-mmap -m /data/local/tmp/llama.cpp/../gguf/gpt-oss-20b-Q4_0.gguf
      -t 4 --ctx-size 8192 --batch-size 128 -ctk q8_0 -ctv q8_0 -fa on -ngl 99 --device HTP0,HTP1,HTP2,HTP3 -no-cnv -f surfing.txt
...
llama_model_loader: - type  f32:  289 tensors
llama_model_loader: - type q4_0:   96 tensors
llama_model_loader: - type q8_0:    2 tensors
llama_model_loader: - type mxfp4:  72 tensors
...
load_tensors: offloaded 25/25 layers to GPU
load_tensors:          CPU model buffer size =  1182.09 MiB
load_tensors:         HTP1 model buffer size =     6.64 MiB
load_tensors:  HTP1-REPACK model buffer size =  2505.94 MiB
load_tensors:         HTP3 model buffer size =     5.55 MiB
load_tensors:  HTP3-REPACK model buffer size =  2088.28 MiB
load_tensors:         HTP0 model buffer size =     7.75 MiB
load_tensors:  HTP0-REPACK model buffer size =  2923.59 MiB
load_tensors:         HTP2 model buffer size =     6.64 MiB
load_tensors:  HTP2-REPACK model buffer size =  2505.94 MiB
...
llama_context: n_ctx_per_seq (8192) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context:        CPU  output buffer size =     0.77 MiB
llama_kv_cache_iswa: creating non-SWA KV cache, size = 8192 cells
llama_kv_cache:       HTP1 KV buffer size =    25.50 MiB
llama_kv_cache:       HTP3 KV buffer size =    25.50 MiB
llama_kv_cache:       HTP0 KV buffer size =    25.50 MiB
llama_kv_cache:       HTP2 KV buffer size =    25.50 MiB
llama_kv_cache: size =  102.00 MiB (  8192 cells,  12 layers,  1/1 seqs), K (q8_0):   51.00 MiB, V (q8_0):   51.00 MiB
llama_kv_cache_iswa: creating     SWA KV cache, size = 256 cells
llama_kv_cache:       HTP1 KV buffer size =     0.80 MiB
llama_kv_cache:       HTP3 KV buffer size =     0.53 MiB
llama_kv_cache:       HTP0 KV buffer size =     1.06 MiB
llama_kv_cache:       HTP2 KV buffer size =     0.80 MiB
llama_kv_cache: size =    3.19 MiB (   256 cells,  12 layers,  1/1 seqs), K (q8_0):    1.59 MiB, V (q8_0):    1.59 MiB
llama_context:       HTP0 compute buffer size =    16.06 MiB
llama_context:       HTP1 compute buffer size =    16.06 MiB
llama_context:       HTP2 compute buffer size =    16.06 MiB
llama_context:       HTP3 compute buffer size =    16.06 MiB
llama_context:        CPU compute buffer size =    98.19 MiB
...
llama_perf_context_print: prompt eval time =    3843.67 ms /   197 tokens ( 19.51 ms per token, 51.25 tokens per second)
llama_perf_context_print:        eval time =    1686.13 ms /    31 runs   ( 54.39 ms per token, 18.39 tokens per second)
llama_perf_context_print:       total time =    6266.30 ms /   228 tokens
llama_perf_context_print:    graphs reused =         30
llama_memory_breakdown_print: | memory breakdown [MiB] | total   free    self   model   context   compute    unaccounted |
llama_memory_breakdown_print: |   - HTP0 (Hexagon)     |  2048 = 2048 + (   0 =     0 +       0 +       0) +           0 |
llama_memory_breakdown_print: |   - HTP1 (Hexagon)     |  2048 = 2048 + (   0 =     0 +       0 +       0) +           0 |
llama_memory_breakdown_print: |   - HTP2 (Hexagon)     |  2048 = 2048 + (   0 =     0 +       0 +       0) +           0 |
llama_memory_breakdown_print: |   - HTP3 (Hexagon)     |  2048 = 2048 + (   0 =     0 +       0 +       0) +           0 |
llama_memory_breakdown_print: |   - Host               |                 1476 =  1208 +     105 +     162                |
llama_memory_breakdown_print: |   - HTP1-REPACK        |                 2505 =  2505 +       0 +       0                |
llama_memory_breakdown_print: |   - HTP3-REPACK        |                 2088 =  2088 +       0 +       0                |
llama_memory_breakdown_print: |   - HTP0-REPACK        |                 2923 =  2923 +       0 +       0                |
llama_memory_breakdown_print: |   - HTP2-REPACK        |                 2505 =  2505 +       0 +       0                |
```