mirror of https://github.com/ggml-org/llama.cpp.git (synced 2025-11-07 09:57:00 +00:00)

ggml-zdnn: update documentation, prepare for upstream
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
@@ -40,7 +40,7 @@ body:
     attributes:
       label: GGML backends
       description: Which GGML backends do you know to be affected?
-      options: [AMX, BLAS, CPU, CUDA, HIP, Metal, Musa, RPC, SYCL, Vulkan, OpenCL]
+      options: [AMX, BLAS, CPU, CUDA, HIP, Metal, Musa, RPC, SYCL, Vulkan, OpenCL, zDNN]
       multiple: true
     validations:
       required: true

.github/ISSUE_TEMPLATE/011-bug-results.yml (vendored, 2 changes)

@@ -42,7 +42,7 @@ body:
     attributes:
       label: GGML backends
       description: Which GGML backends do you know to be affected?
-      options: [AMX, BLAS, CPU, CUDA, HIP, Metal, Musa, RPC, SYCL, Vulkan, OpenCL]
+      options: [AMX, BLAS, CPU, CUDA, HIP, Metal, Musa, RPC, SYCL, Vulkan, OpenCL, zDNN]
       multiple: true
     validations:
       required: true

.github/labeler.yml (vendored, 5 changes)

@@ -22,6 +22,11 @@ Vulkan:
     - any-glob-to-any-file:
       - ggml/include/ggml-vulkan.h
       - ggml/src/ggml-vulkan/**
+IBM zDNN:
+  - changed-files:
+    - any-glob-to-any-file:
+      - ggml/include/ggml-zdnn.h
+      - ggml/src/ggml-zdnn/**
 documentation:
   - changed-files:
     - any-glob-to-any-file:

@@ -42,14 +42,14 @@ cmake --build build --config Release -j $(nproc)
 cmake --build build --config Release -j $(nproc)
 ```
 
-- By default, NNPA is enabled when available. To disable it (not recommended):
+- By default, NNPA is disabled. To enable it:
 
   ```bash
   cmake -S . -B build \
     -DCMAKE_BUILD_TYPE=Release \
     -DGGML_BLAS=ON \
     -DGGML_BLAS_VENDOR=OpenBLAS \
-    -DGGML_NNPA=OFF
+    -DGGML_NNPA=ON
 
   cmake --build build --config Release -j $(nproc)
   ```
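
For a quick sanity check that an NNPA-enabled build still produces sensible output (see troubleshooting item 5 further down), a minimal sketch; the model path is a placeholder and the default `build/bin` layout is assumed:

```bash
# Assumes the configure/build step above completed and that a Big-Endian GGUF
# model is available at the placeholder path below.
./build/bin/llama-cli \
    -m /path/to/model-name-be.f16.gguf \
    -p "Hello, my name is" \
    -n 32
```
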
@@ -76,6 +76,23 @@ cmake --build build --config Release -j $(nproc)
 cmake --build build --config Release -j $(nproc)
 ```
 
+## IBM zDNN Accelerator
+
+This provides acceleration using the IBM zAIU co-processor located in the Telum I and Telum II processors. Make sure to have the [IBM zDNN library](https://github.com/IBM/zDNN) installed.
+
+#### Compile from source
+
+You may find the official build instructions here: [Building and Installing zDNN](https://github.com/IBM/zDNN?tab=readme-ov-file#building-and-installing-zdnn)
+
+### Compilation
+
+```bash
+cmake -S . -B build \
+    -DCMAKE_BUILD_TYPE=Release \
+    -DGGML_ZDNN=ON
+cmake --build build --config Release -j$(nproc)
+```
+
 ## Getting GGUF Models
 
 All models need to be converted to Big-Endian. You can achieve this in three cases:
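
As a rough way to confirm the zDNN backend was actually compiled in and detected, a hedged sketch using the standard llama.cpp binaries; the exact device naming printed is an assumption, so check the actual output on your system:

```bash
# List the devices the build knows about; a zDNN entry should show up on
# supported hardware (naming is an assumption, not verified here).
./build/bin/llama-cli --list-devices

# Optionally benchmark with a Big-Endian GGUF model (placeholder path).
./build/bin/llama-bench -m /path/to/model-name-be.f16.gguf
```
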
@@ -84,9 +101,9 @@ All models need to be converted to Big-Endian. You can achieve this in three cases:
 
 
 
-   You can find popular models pre-converted and verified at [s390x Ready Models](https://huggingface.co/collections/taronaeo/s390x-ready-models-672765393af438d0ccb72a08).
+   You can find popular models pre-converted and verified at [s390x Verified Models](https://huggingface.co/collections/taronaeo/s390x-verified-models-672765393af438d0ccb72a08) or [s390x Runnable Models](https://huggingface.co/collections/taronaeo/s390x-runnable-models-686e951824198df12416017e).
 
-   These models have already been converted from `safetensors` to `GGUF Big-Endian` and their respective tokenizers verified to run correctly on IBM z15 and later system.
+   These models have already been converted from `safetensors` to `GGUF` Big-Endian and their respective tokenizers verified to run correctly on IBM z15 and later systems.
 
 2. **Convert safetensors model to GGUF Big-Endian directly (recommended)**
 
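
To pull one of those pre-converted models locally, a minimal sketch using the Hugging Face CLI; the repository name below is a placeholder, substitute an actual repository from the linked collections:

```bash
# Placeholder repository name -- pick a real repo from the collections above.
pip3 install -U "huggingface_hub[cli]"
huggingface-cli download <org>/<model>-be-gguf --local-dir ./models
```
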
@@ -94,6 +111,14 @@ All models need to be converted to Big-Endian. You can achieve this in three cases:
 
    The model you are trying to convert must be in `safetensors` file format (for example [IBM Granite 3.3 2B](https://huggingface.co/ibm-granite/granite-3.3-2b-instruct)). Make sure you have downloaded the model repository for this case.
 
+   Ensure that you have installed the required packages in advance:
+
+   ```bash
+   pip3 install -r requirements.txt
+   ```
+
+   Convert the `safetensors` model to `GGUF`:
+
    ```bash
    python3 convert_hf_to_gguf.py \
        --outfile model-name-be.f16.gguf \
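
The conversion command above is cut off in this view; a plausible complete invocation is sketched below. The `--outtype` and `--bigendian` flags and the local model directory are assumptions here, so check `python3 convert_hf_to_gguf.py --help` for the exact options in your checkout:

```bash
# Sketch only: flags beyond --outfile are assumed, and the model directory is a
# placeholder for the downloaded safetensors repository.
python3 convert_hf_to_gguf.py \
    --outfile model-name-be.f16.gguf \
    --outtype f16 \
    --bigendian \
    ./granite-3.3-2b-instruct
```
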
@@ -116,7 +141,7 @@ All models need to be converted to Big-Endian. You can achieve this in three cases:
 
 
 
-   The model you are trying to convert must be in `gguf` file format (for example [IBM Granite 3.3 2B](https://huggingface.co/ibm-granite/granite-3.3-2b-instruct-GGUF)). Make sure you have downloaded the model file for this case.
+   The model you are trying to convert must be in `gguf` file format (for example [IBM Granite 3.3 2B GGUF](https://huggingface.co/ibm-granite/granite-3.3-2b-instruct-GGUF)). Make sure you have downloaded the model file for this case.
 
    ```bash
    python3 gguf-py/gguf/scripts/gguf_convert_endian.py model-name.f16.gguf BIG
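
The endianness script appears to operate on the file in place (only one filename is passed, no output file); if that holds on your setup, it may be worth keeping a copy of the little-endian original, as in this sketch:

```bash
# In-place behaviour is an assumption based on the single-filename invocation
# above; keep a backup so the little-endian original is not lost.
cp model-name.f16.gguf model-name.f16.gguf.le.bak
python3 gguf-py/gguf/scripts/gguf_convert_endian.py model-name.f16.gguf BIG
mv model-name.f16.gguf model-name-be.f16.gguf
```
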
@@ -137,19 +162,19 @@ All models need to be converted to Big-Endian. You can achieve this in three cases:
 
 ### 1. SIMD Acceleration
 
-Only available in IBM z15 or later system with the `-DGGML_VXE=ON` (turned on by default) compile flag. No hardware acceleration is possible with llama.cpp with older systems, such as IBM z14/arch12. In such systems, the APIs can still run but will use a scalar implementation.
+Only available on IBM z15/LinuxONE 3 or later systems with the `-DGGML_VXE=ON` (turned on by default) compile flag. No hardware acceleration is possible with llama.cpp on older systems, such as IBM z14/arch12. On such systems, the APIs can still run but will use a scalar implementation.
 
 ### 2. NNPA Vector Intrinsics Acceleration
 
-Only available in IBM z16 or later system with the `-DGGML_NNPA=ON` (turned on when available) compile flag. No hardware acceleration is possible with llama.cpp with older systems, such as IBM z15/arch13. In such systems, the APIs can still run but will use a scalar implementation.
+Only available on IBM z16/LinuxONE 4 or later systems with the `-DGGML_NNPA=ON` (turned off by default) compile flag. No hardware acceleration is possible with llama.cpp on older systems, such as IBM z15/arch13. On such systems, the APIs can still run but will use a scalar implementation.
 
-### 3. zDNN Accelerator
+### 3. zDNN Accelerator (WIP)
 
-_Only available in IBM z16 or later system. No direction at the moment._
+Only available on IBM z17/LinuxONE 5 or later systems with the `-DGGML_ZDNN=ON` compile flag. No hardware acceleration is possible with llama.cpp on older systems, such as IBM z15/arch13. On such systems, the APIs will fall back to CPU routines.
 
 ### 4. Spyre Accelerator
 
-_No direction at the moment._
+_Only available with IBM z17/LinuxONE 5 or later systems. No support currently available._
 
 ## Performance Tuning
 
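
Since the three compile-time features above are independent CMake options, they can be combined in one configure step on hardware that supports all of them; a minimal sketch using only the flags named in the sections above:

```bash
# All three feature flags come from the sections above; enable only the ones
# your machine generation actually supports.
cmake -S . -B build \
    -DCMAKE_BUILD_TYPE=Release \
    -DGGML_VXE=ON \
    -DGGML_NNPA=ON \
    -DGGML_ZDNN=ON
cmake --build build --config Release -j $(nproc)
```
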
@@ -189,6 +214,26 @@ IBM VXE/VXE2 SIMD acceleration depends on the BLAS implementation.
 
    Answer: Please ensure that your GCC compiler is of minimum GCC 15.1.0 version, and have `binutils` updated to the latest version. If this does not fix the problem, kindly open an issue.
 
+4. Failing to install the `sentencepiece` package using GCC 15+
+
+   Answer: The `sentencepiece` team are aware of this as seen in [this issue](https://github.com/google/sentencepiece/issues/1108).
+
+   As a temporary workaround, please run the installation command with the following environment variable:
+
+   ```bash
+   export CXXFLAGS="-include cstdint"
+   ```
+
+   For example,
+
+   ```bash
+   CXXFLAGS="-include cstdint" pip3 install -r requirements.txt
+   ```
+
+5. `-DGGML_NNPA=ON` generates gibberish output
+
+   Answer: We are aware of this as detailed in [this issue](https://github.com/ggml-org/llama.cpp/issues/14877). Please either try reducing the number of threads, or disable the compile option using `-DGGML_NNPA=OFF`.
+
 ## Getting Help on IBM Z & LinuxONE
 
 1. **Bugs, Feature Requests**
@@ -202,10 +247,11 @@ IBM VXE/VXE2 SIMD acceleration depends on the BLAS implementation.
 ## Appendix A: Hardware Support Matrix
 
 |          | Support | Minimum Compiler Version |
-| ------- | ------- | ------------------------ |
+| -------- | ------- | ------------------------ |
 | IBM z15 | ✅ | |
 | IBM z16 | ✅ | |
 | IBM z17 | ✅ | GCC 15.1.0 |
+| IBM zAIU | ✅ | |
 
 - ✅ - supported and verified to run as intended
 - 🚫 - unsupported, we are unlikely able to provide support
@@ -214,7 +260,7 @@ IBM VXE/VXE2 SIMD acceleration depends on the BLAS implementation.
 
 |            | VX/VXE/VXE2 | NNPA | zDNN | Spyre |
 | ---------- | ----------- | ---- | ---- | ----- |
-| FP32 | ✅ | ✅ | ❓ | ❓ |
+| FP32 | ✅ | ✅ | ✅ | ❓ |
 | FP16 | ✅ | ✅ | ❓ | ❓ |
 | BF16 | 🚫 | 🚫 | ❓ | ❓ |
 | Q4_0 | ✅ | ✅ | ❓ | ❓ |
@@ -244,3 +290,5 @@ IBM VXE/VXE2 SIMD acceleration depends on the BLAS implementation.
 - ✅ - acceleration available
 - 🚫 - acceleration unavailable, will still run using scalar implementation
 - ❓ - acceleration unknown, please contribute if you can test it yourself
+
+Last Updated by **Aaron Teo (aaron.teo1@ibm.com)** on July 31, 2025.