From cf8cdcd3722ab3c31abecade4e99cac56427c96b Mon Sep 17 00:00:00 2001
From: Aaron Teo
Date: Thu, 31 Jul 2025 01:26:30 +0800
Subject: [PATCH] ggml-zdnn: update documentation, prepare for upstream

Signed-off-by: Aaron Teo
---
 .../ISSUE_TEMPLATE/010-bug-compilation.yml |  2 +-
 .github/ISSUE_TEMPLATE/011-bug-results.yml |  2 +-
 .github/labeler.yml                        |  5 ++
 docs/build-s390x.md                        | 80 +++++++++++++++----
 4 files changed, 71 insertions(+), 18 deletions(-)

diff --git a/.github/ISSUE_TEMPLATE/010-bug-compilation.yml b/.github/ISSUE_TEMPLATE/010-bug-compilation.yml
index 95a0b5cc75..feb0d51205 100644
--- a/.github/ISSUE_TEMPLATE/010-bug-compilation.yml
+++ b/.github/ISSUE_TEMPLATE/010-bug-compilation.yml
@@ -40,7 +40,7 @@ body:
     attributes:
       label: GGML backends
      description: Which GGML backends do you know to be affected?
-      options: [AMX, BLAS, CPU, CUDA, HIP, Metal, Musa, RPC, SYCL, Vulkan, OpenCL]
+      options: [AMX, BLAS, CPU, CUDA, HIP, Metal, Musa, RPC, SYCL, Vulkan, OpenCL, zDNN]
       multiple: true
     validations:
       required: true
diff --git a/.github/ISSUE_TEMPLATE/011-bug-results.yml b/.github/ISSUE_TEMPLATE/011-bug-results.yml
index d1034bbb69..c42a14ff83 100644
--- a/.github/ISSUE_TEMPLATE/011-bug-results.yml
+++ b/.github/ISSUE_TEMPLATE/011-bug-results.yml
@@ -42,7 +42,7 @@ body:
     attributes:
       label: GGML backends
       description: Which GGML backends do you know to be affected?
-      options: [AMX, BLAS, CPU, CUDA, HIP, Metal, Musa, RPC, SYCL, Vulkan, OpenCL]
+      options: [AMX, BLAS, CPU, CUDA, HIP, Metal, Musa, RPC, SYCL, Vulkan, OpenCL, zDNN]
       multiple: true
     validations:
       required: true
diff --git a/.github/labeler.yml b/.github/labeler.yml
index df6a7a40ed..c4da4ab4e1 100644
--- a/.github/labeler.yml
+++ b/.github/labeler.yml
@@ -22,6 +22,11 @@ Vulkan:
     - any-glob-to-any-file:
       - ggml/include/ggml-vulkan.h
       - ggml/src/ggml-vulkan/**
+IBM zDNN:
+  - changed-files:
+    - any-glob-to-any-file:
+      - ggml/include/ggml-zdnn.h
+      - ggml/src/ggml-zdnn/**
 documentation:
   - changed-files:
     - any-glob-to-any-file:
diff --git a/docs/build-s390x.md b/docs/build-s390x.md
index 4c9ebb271c..18edb57792 100644
--- a/docs/build-s390x.md
+++ b/docs/build-s390x.md
@@ -42,14 +42,14 @@ cmake --build build --config Release -j $(nproc)
 cmake --build build --config Release -j $(nproc)
 ```
 
-- By default, NNPA is enabled when available. To disable it (not recommended):
+- By default, NNPA is disabled. To enable it:
 
 ```bash
 cmake -S . -B build \
     -DCMAKE_BUILD_TYPE=Release \
     -DGGML_BLAS=ON \
     -DGGML_BLAS_VENDOR=OpenBLAS \
-    -DGGML_NNPA=OFF
+    -DGGML_NNPA=ON
 cmake --build build --config Release -j $(nproc)
 ```
 
@@ -76,6 +76,23 @@ cmake --build build --config Release -j $(nproc)
 cmake --build build --config Release -j $(nproc)
 ```
 
+## IBM zDNN Accelerator
+
+This provides acceleration using the IBM zAIU AI accelerator built into the Telum I and Telum II processors. Make sure to have the [IBM zDNN library](https://github.com/IBM/zDNN) installed.
+
+### Installing zDNN
+
+The official build instructions are available here: [Building and Installing zDNN](https://github.com/IBM/zDNN?tab=readme-ov-file#building-and-installing-zdnn)
+
+### Compilation
+
+```bash
+cmake -S . -B build \
+    -DCMAKE_BUILD_TYPE=Release \
+    -DGGML_ZDNN=ON
+cmake --build build --config Release -j $(nproc)
+```
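+
+As a quick sanity check that the zDNN backend made it into the build, you can verify that the resulting binary links against the zDNN library. This is a minimal sketch, assuming the default CMake output path `build/bin/`:
+
+```bash
+# Prints the resolved libzdnn shared library if the backend was built in;
+# no output means the build fell back to CPU-only.
+ldd build/bin/llama-cli | grep -i zdnn
+```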
+
 ## Getting GGUF Models
 
 All models need to be converted to Big-Endian. You can achieve this in three cases:
 
@@ -84,9 +101,9 @@ All models need to be converted to Big-Endian. You can achieve this in three cas
 
    ![File Type - gguf](https://img.shields.io/badge/File_Type-gguf-fff)
 
-   You can find popular models pre-converted and verified at [s390x Ready Models](https://huggingface.co/collections/taronaeo/s390x-ready-models-672765393af438d0ccb72a08).
+   You can find popular models pre-converted and verified at [s390x Verified Models](https://huggingface.co/collections/taronaeo/s390x-verified-models-672765393af438d0ccb72a08) or [s390x Runnable Models](https://huggingface.co/collections/taronaeo/s390x-runnable-models-686e951824198df12416017e).
 
-   These models have already been converted from `safetensors` to `GGUF Big-Endian` and their respective tokenizers verified to run correctly on IBM z15 and later system.
+   These models have already been converted from `safetensors` to `GGUF` Big-Endian and their respective tokenizers verified to run correctly on IBM z15 and later systems.
 
 2. **Convert safetensors model to GGUF Big-Endian directly (recommended)**
 
@@ -94,6 +111,14 @@ All models need to be converted to Big-Endian. You can achieve this in three cas
   ![File Type - safetensors](https://img.shields.io/badge/File_Type-safetensors-da1e28)
 
    The model you are trying to convert must be in `safetensors` file format (for example [IBM Granite 3.3 2B](https://huggingface.co/ibm-granite/granite-3.3-2b-instruct)). Make sure you have downloaded the model repository for this case.
 
+   Ensure that you have installed the required packages in advance:
+
+   ```bash
+   pip3 install -r requirements.txt
+   ```
+
+   Convert the `safetensors` model to `GGUF`:
+
    ```bash
    python3 convert_hf_to_gguf.py \
    --outfile model-name-be.f16.gguf \
    --outtype f16 \
    --bigendian \
    model-directory/
    ```
 
@@ -116,7 +141,7 @@ All models need to be converted to Big-Endian. You can achieve this in three cas
 
 3. **Convert existing GGUF Little-Endian model to Big-Endian**
 
    ![File Type - gguf](https://img.shields.io/badge/File_Type-gguf-fff)
 
-   The model you are trying to convert must be in `gguf` file format (for example [IBM Granite 3.3 2B](https://huggingface.co/ibm-granite/granite-3.3-2b-instruct-GGUF)). Make sure you have downloaded the model file for this case.
+   The model you are trying to convert must be in `gguf` file format (for example [IBM Granite 3.3 2B GGUF](https://huggingface.co/ibm-granite/granite-3.3-2b-instruct-GGUF)). Make sure you have downloaded the model file for this case.
 
    ```bash
    python3 gguf-py/gguf/scripts/gguf_convert_endian.py model-name.f16.gguf BIG
    ```
 
@@ -137,19 +162,19 @@ All models need to be converted to Big-Endian. You can achieve this in three cas
 
 ### 1. SIMD Acceleration
 
-Only available in IBM z15 or later system with the `-DGGML_VXE=ON` (turned on by default) compile flag. No hardware acceleration is possible with llama.cpp with older systems, such as IBM z14/arch12. In such systems, the APIs can still run but will use a scalar implementation.
+Only available on IBM z15/LinuxONE 3 or later systems with the `-DGGML_VXE=ON` (turned on by default) compile flag. No hardware acceleration is possible with llama.cpp on older systems, such as IBM z14/arch12. On such systems, the APIs can still run but will use a scalar implementation.
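+
+To check which of these facilities your machine actually reports, you can inspect the `features` line in `/proc/cpuinfo`. This is a quick sketch; the exact feature names may vary with your kernel version:
+
+```bash
+# Look for "vx"/"vxe" (SIMD) and "nnpa" (see the next section) among the
+# hardware facilities reported by the kernel.
+grep -m1 features /proc/cpuinfo
+```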
 
 ### 2. NNPA Vector Intrinsics Acceleration
 
-Only available in IBM z16 or later system with the `-DGGML_NNPA=ON` (turned on when available) compile flag. No hardware acceleration is possible with llama.cpp with older systems, such as IBM z15/arch13. In such systems, the APIs can still run but will use a scalar implementation.
+Only available on IBM z16/LinuxONE 4 or later systems with the `-DGGML_NNPA=ON` (turned off by default) compile flag. No hardware acceleration is possible with llama.cpp on older systems, such as IBM z15/arch13. On such systems, the APIs can still run but will use a scalar implementation.
 
-### 3. zDNN Accelerator
+### 3. zDNN Accelerator (WIP)
 
-_Only available in IBM z16 or later system. No direction at the moment._
+Only available on IBM z17/LinuxONE 5 or later systems with the `-DGGML_ZDNN=ON` compile flag. No hardware acceleration is possible with llama.cpp on older systems, such as IBM z15/arch13. On such systems, the APIs will fall back to CPU routines.
 
 ### 4. Spyre Accelerator
 
-_No direction at the moment._
+_Only available on IBM z17/LinuxONE 5 or later systems. No support is currently available._
 
 ## Performance Tuning
 
@@ -189,6 +214,26 @@ IBM VXE/VXE2 SIMD acceleration depends on the BLAS implementation. It is strongl
 
    Answer: Please ensure that your GCC compiler is of minimum GCC 15.1.0 version, and have `binutils` updated to the latest version. If this does not fix the problem, kindly open an issue.
 
+4. Failing to install the `sentencepiece` package using GCC 15+
+
+   Answer: The `sentencepiece` team is aware of this, as seen in [this issue](https://github.com/google/sentencepiece/issues/1108).
+
+   As a temporary workaround, run the installation command with the following environment variable set:
+
+   ```bash
+   export CXXFLAGS="-include cstdint"
+   ```
+
+   For example:
+
+   ```bash
+   CXXFLAGS="-include cstdint" pip3 install -r requirements.txt
+   ```
+
+5. `-DGGML_NNPA=ON` generates gibberish output
+
+   Answer: We are aware of this as detailed in [this issue](https://github.com/ggml-org/llama.cpp/issues/14877). Please either try reducing the number of threads or disable the compile option using `-DGGML_NNPA=OFF`.
+
 ## Getting Help on IBM Z & LinuxONE
 
 1. **Bugs, Feature Requests**
 
@@ -201,11 +246,12 @@ IBM VXE/VXE2 SIMD acceleration depends on the BLAS implementation. It is strongl
 
 ## Appendix A: Hardware Support Matrix
 
-|         | Support | Minimum Compiler Version |
-| ------- | ------- | ------------------------ |
-| IBM z15 | ✅      |                          |
-| IBM z16 | ✅      |                          |
-| IBM z17 | ✅      | GCC 15.1.0               |
+|          | Support | Minimum Compiler Version |
+| -------- | ------- | ------------------------ |
+| IBM z15  | ✅      |                          |
+| IBM z16  | ✅      |                          |
+| IBM z17  | ✅      | GCC 15.1.0               |
+| IBM zAIU | ✅      |                          |
 
 - ✅ - supported and verified to run as intended
 - 🚫 - unsupported, we are unlikely able to provide support
@@ -214,7 +260,7 @@ IBM VXE/VXE2 SIMD acceleration depends on the BLAS implementation. It is strongl
 
 |            | VX/VXE/VXE2 | NNPA | zDNN | Spyre |
 | ---------- | ----------- | ---- | ---- | ----- |
-| FP32       | ✅          | ✅   | ❓   | ❓    |
+| FP32       | ✅          | ✅   | ✅   | ❓    |
 | FP16       | ✅          | ✅   | ❓   | ❓    |
 | BF16       | 🚫          | 🚫   | ❓   | ❓    |
 | Q4_0       | ✅          | ✅   | ❓   | ❓    |
@@ -244,3 +290,5 @@ IBM VXE/VXE2 SIMD acceleration depends on the BLAS implementation. It is strongl
 - ✅ - acceleration available
 - 🚫 - acceleration unavailable, will still run using scalar implementation
 - ❓ - acceleration unknown, please contribute if you can test it yourself
+
+Last Updated by **Aaron Teo (aaron.teo1@ibm.com)** on July 31, 2025.
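+
+For contributors who want to help fill in the ❓ cells above, llama.cpp ships a `test-backend-ops` tool that exercises individual GGML ops on each backend. A minimal sketch, assuming the default `build/bin` output path; the backend and op names here are illustrative:
+
+```bash
+# Run the op test suite against a single backend, restricted to one op.
+./build/bin/test-backend-ops test -b CPU -o MUL_MAT
+```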