This is useful for testing for regressions on GCN with CDNA hardware.
With GGML_HIP_MMQ_MFMA=Off and GGML_CUDA_FORCE_MMQ=On we can conveniently test the GCN code path on CDNA. As CDNA is just GCN renamed with MFMA added and limited use ACC registers, this provides a good alternative for regression testing when GCN hardware is not available.
This commit adds support for MFMA instructions to MMQ. CDNA1/GFX908 CDNA2/GFX90a and CDNA3/GFX942 are supported by the MFMA-enabled code path added by this commit. The code path and stream-k is only enabled on CDNA3 for now as it fails to outperform blas in all cases on the other devices.
Blas is currently only consistently outperformed on CDNA3 due to issues in the amd-provided blas libraries.
This commit also improves the awareness of MMQ towards different warp sizes and as a side effect improves the performance of all quant formats besides q4_0 and q4_1, which regress slightly, on GCN gpus.
* 1. add "integrated" in ggml_cuda_device_info for distinguish whether it is Intergrate_gpu or discrete_gpu
2. Adjust the func:"ggml_backend_cuda_device_supports_buft" for this new feature
* Update ggml/src/ggml-cuda/ggml-cuda.cu
Adjusted code indentation
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Update ggml/src/ggml-cuda/ggml-cuda.cu
Fixed incorrect setting of variable types
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Update ggml/src/ggml-cuda/ggml-cuda.cu
Adjusted the judgment logic
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* add a host_buft assert in case of integrated_cuda_device with func:'evaluate_and_capture_cuda_graph()'
* Update ggml/src/ggml-cuda/ggml-cuda.cu
Add a defensive security assert
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Update ggml/src/ggml-cuda/ggml-cuda.cu
Adjusted the support judgment logic.
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* revoke the suggest commit changes due to it's not applicable in jetson_device
* Update ggml/src/ggml-cuda/ggml-cuda.cu
Add parentheses to enforce operator precedence
Co-authored-by: Diego Devesa <slarengh@gmail.com>
* Update ggml/src/ggml-cuda/ggml-cuda.cu
Fix ci bug: add a spaces
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
---------
Co-authored-by: yangxiao <yang_xl@tju.edu.cn>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: yangxiao <yangxl_zz@qq.com>
Co-authored-by: Diego Devesa <slarengh@gmail.com>
* ggml : remove MSVC warnings pragmas
This commit removes the MSVC-specific pragmas as these are now handled
in ggml/CMakeLists.txt.
* whisper : remove MSVC warning pragmas
This commit removes the MSVC-specific pragmas. These are now handled in
the ggml/CMakeLists.txt file.
* CUDA: Simplify and improve CUDA graphs through use of indirect copy pointers
Previously there was complexity in the CUDA graphs implementation due
frequently changing parameters to copy kernels associated with K and V
cache pointers. This patch simplifies by using indirection to avoid
such parameters frequently changing, avoiding the need for frequent
graph updates.
Fixes#12152
* Addressed comments
* fix HIP builds
* properly sync to stream
* removed ggml_cuda_cpy_fn_ptrs
* move stream sync before free
* guard to only use indirection with graphs
* style fixes
* check for errors
---------
Co-authored-by: slaren <slarengh@gmail.com>
* Enable CUDA Graph on CTK < 12.x
`cudaGraphExecUpdate` API was changed on 12.x. For this reason CUDA graph support was disabled on older CUDA toolkit. This change enables CUDA support in CTK version < 12.x by using older API if CTK < 12.x.
* Fix compilation errors with MUSA
* Disable CUDA Graph for MUSA
refactor mmqv to unify the calculation of nwarps and rows per block between host and device code.
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Adds GGML_HIP_ROCWMMA_FATTN and rocwmma header check
Adds rocWMMA support to fattn-wmma-f16
---
Signed-off-by: Carl Klemm <carl@uvos.xyz>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Ben Jackson <ben@ben.com>
* MUSA: support ARM64 and enable __dp4a .etc
* fix cross entropy loss op for musa
* update
* add cc info log for musa
* add comment for the MUSA .cc calculation block
---------
Co-authored-by: Bodhi Hu <huaishun.hu@mthreads.com>
* CUDA: use mma PTX instructions for FlashAttention
* __shfl_sync workaround for movmatrix
* add __shfl_sync to HIP
Co-authored-by: Diego Devesa <slarengh@gmail.com>
* Renames NVIDIA GPU-architecture flags to avoid name clashes with WinAPI. (e.g. CC_PASCAL, GPU architecture or WinAPI pascal compiler flag?)
* Reverts erroneous rename in SYCL-code.
* Renames GGML_CUDA_MIN_CC_DP4A to GGML_CUDA_CC_DP4A.
* Renames the rest of the compute capability macros for consistency.