deepsek
66906cd82a
HIP: Enable Matrix cores for MMQ Kernels, Enable stream-K for CDNA 3 ( #14624 )
...
This commit adds support for MFMA instructions to MMQ. CDNA1/GFX908, CDNA2/GFX90a, and CDNA3/GFX942 are supported by the MFMA-enabled code path added by this commit. The code path and stream-K are only enabled on CDNA3 for now, as they fail to outperform BLAS in all cases on the other devices.
BLAS is currently only consistently outperformed on CDNA3, due to issues in the AMD-provided BLAS libraries.
This commit also makes MMQ more aware of different warp sizes; as a side effect, it improves the performance of all quant formats on GCN GPUs except q4_0 and q4_1, which regress slightly.
2025-07-27 00:28:14 +02:00
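The gating described above is per-architecture. A minimal sketch of what such compile-time gating can look like on the HIP side, assuming hypothetical macro names (MMQ_MFMA_AVAILABLE, MMQ_USE_STREAM_K) rather than the real ggml-cuda identifiers:

```cpp
// Sketch only: the macro names are assumptions for illustration, not ggml's.
// HIP/Clang defines __gfx908__, __gfx90a__, __gfx942__ when compiling device
// code for the respective AMD targets.

// CDNA1/CDNA2/CDNA3 can all run the MFMA tile kernels...
#if defined(__gfx908__) || defined(__gfx90a__) || defined(__gfx942__)
#    define MMQ_MFMA_AVAILABLE 1
#else
#    define MMQ_MFMA_AVAILABLE 0
#endif

// ...but only CDNA3 turns the path (and stream-K partitioning) on by default,
// since on the older devices it does not beat the vendor BLAS in all cases.
#if defined(__gfx942__)
#    define MMQ_USE_MFMA     1
#    define MMQ_USE_STREAM_K 1
#else
#    define MMQ_USE_MFMA     0
#    define MMQ_USE_STREAM_K 0
#endif
```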
Aman Gupta
55c2646b45
CUDA: add dynamic shared mem to softmax, refactor general usage ( #14497 )
2025-07-03 07:45:11 +08:00
R0CKSTAR
1f73301b63
cuda : remove nrows_x in mul_mat_q_process_tile ( #13325 )
...
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-05-07 09:48:23 +02:00
Johannes Gäßler
15a28ec8c7
CUDA: fix --split-mode row for MMQ ( #13323 )
2025-05-06 08:36:46 +02:00
Johannes Gäßler
93c4e23905
CUDA: fix race condition in MMQ stream-k fixup ( #13299 )
2025-05-04 14:16:39 +02:00
Johannes Gäßler
8afbd96818
CUDA: fix race condition in MMQ ids_dst ( #13294 )
2025-05-04 13:58:38 +02:00
Johannes Gäßler
e1e8e0991f
CUDA: batched+noncont MMQ, refactor bs>1 MoE code ( #13199 )
2025-04-30 23:12:59 +02:00
Johannes Gäßler
b10d8bfdb1
CUDA: use switch statements in constexpr functions ( #13095 )
2025-04-24 15:57:10 +02:00
R0CKSTAR
492d7f1ff7
musa: fix all warnings, re-enable -DLLAMA_FATAL_WARNINGS=ON in ci and update doc ( #12611 )
...
* musa: fix all warnings
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* musa: enable -DLLAMA_FATAL_WARNINGS=ON in run.sh
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* musa: update ci doc (install ccache)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* fix Windows build issue
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* Address review comments
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* Address review comments
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
---------
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-03-30 10:59:38 +02:00
Slobodan Josic
bd40678df7
HIP: Add support for RDNA4 targets ( #12372 )
2025-03-26 23:46:30 +01:00
R0CKSTAR
7ea75035b6
CUDA: Fix clang warnings ( #12540 )
...
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-03-24 11:28:34 +01:00
R0CKSTAR
fac63a3d78
musa: refine compute capability ( #12493 )
...
* musa: refine compute capability
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* Address review comments
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
---------
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-03-22 10:11:37 +01:00
Johannes Gäßler
9c42b1718c
CUDA: fix logic for V100 + GGML_CUDA_FORCE_MMQ ( #12098 )
2025-02-28 09:26:43 +01:00
Johannes Gäßler
73e2ed3ce3
CUDA: use async data loading for FlashAttention ( #11894 )
...
* CUDA: use async data loading for FlashAttention
---------
Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-02-17 14:03:24 +01:00
Johannes Gäßler
b9ab0a4d0b
CUDA: use arch list for compatibility check ( #11775 )
...
* CUDA: use arch list for feature availability check
---------
Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-02-11 00:17:22 +01:00
uvos
4d0598e144
HIP: add GGML_CUDA_CC_IS_* for AMD families, as increasing cc architectures for AMD GPUs are not supersets of each other ( #11601 )
...
This fixes a bug where RDNA1 GPUs other than gfx1010 were not handled correctly.
2025-02-02 22:08:05 +01:00
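The motivation above: AMD compute-capability values in ggml's internal numbering are not ordered supersets, so a plain `cc >= ARCH` comparison cannot tell GPU families apart. A rough sketch of family range checks in that spirit; the constants below are placeholders, not ggml's real GGML_CUDA_CC_* values:

```cpp
// Sketch only: placeholder constants, not the real ggml values. The point is
// that family membership is a range check, not a ">= newest arch" check.
enum : int {
    CC_OFFSET_AMD = 1000000,                // keep AMD ccs apart from NVIDIA ccs
    CC_CDNA1      = CC_OFFSET_AMD +  908,   // gfx908
    CC_RDNA1      = CC_OFFSET_AMD + 1010,   // gfx1010, gfx1012, ...
    CC_RDNA2      = CC_OFFSET_AMD + 1030,   // gfx1030, ...
    CC_RDNA3      = CC_OFFSET_AMD + 1100,   // gfx1100, ...
};

constexpr bool cc_is_cdna (int cc) { return cc >= CC_CDNA1 && cc < CC_RDNA1; }
constexpr bool cc_is_rdna1(int cc) { return cc >= CC_RDNA1 && cc < CC_RDNA2; }
constexpr bool cc_is_rdna2(int cc) { return cc >= CC_RDNA2 && cc < CC_RDNA3; }
constexpr bool cc_is_rdna3(int cc) { return cc >= CC_RDNA3; }

// With a range check, gfx1012 (CC_RDNA1 + 2 in this sketch) is still
// classified as RDNA1, which is exactly the bug case mentioned above.
static_assert(cc_is_rdna1(CC_RDNA1 + 2), "non-gfx1010 RDNA1 must match");
```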
Johannes Gäßler
864a0b67a6
CUDA: use mma PTX instructions for FlashAttention ( #11583 )
...
* CUDA: use mma PTX instructions for FlashAttention
* __shfl_sync workaround for movmatrix
* add __shfl_sync to HIP
Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-02-02 19:31:09 +01:00
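The "__shfl_sync workaround for movmatrix" bullet refers to exchanging matrix-fragment elements between warp lanes with a shuffle when no dedicated transpose instruction is available. A generic warp-shuffle sketch, not the actual FlashAttention kernel:

```cpp
// Sketch only: swap a per-lane value with a partner lane using __shfl_sync.
// Rearranging fragment elements this way is the kind of shuffle-based
// workaround used where a hardware transpose (movmatrix) is not available.
__device__ __forceinline__ float exchange_across_halves(float v) {
    const unsigned mask = 0xFFFFFFFFu;      // all 32 lanes participate
    const int lane      = threadIdx.x % 32;
    const int partner   = lane ^ 16;        // lane 0 <-> 16, 1 <-> 17, ...
    return __shfl_sync(mask, v, partner);   // read v from the partner lane
}
```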
Andreas Kieslinger
750cb3e246
CUDA: rename macros to avoid conflicts with WinAPI ( #10736 )
...
* Renames NVIDIA GPU-architecture flags to avoid name clashes with WinAPI (e.g. is CC_PASCAL a GPU architecture or the WinAPI Pascal compiler flag?).
* Reverts erroneous rename in SYCL-code.
* Renames GGML_CUDA_MIN_CC_DP4A to GGML_CUDA_CC_DP4A.
* Renames the rest of the compute capability macros for consistency.
2024-12-10 18:23:24 +01:00
uvos
3ad5451f3b
Add some minimal optimizations for CDNA ( #10498 )
...
* Add some minimal optimizations for CDNA
* ggml_cuda: set launch bounds also for GCN as it helps there too
2024-11-27 17:10:08 +01:00
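The launch-bounds bullet above refers to `__launch_bounds__`, which caps the block size (and optionally requests a minimum number of resident blocks per multiprocessor) so the compiler can budget registers for that occupancy. A generic sketch with made-up numbers, not ggml's actual kernels or values:

```cpp
// Sketch only: at most 256 threads per block, aim for at least 2 resident
// blocks per multiprocessor; both numbers are placeholders, not ggml's.
__global__ void __launch_bounds__(256, 2) scale_kernel(float * x, float s, int n) {
    const int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) {
        x[i] *= s;
    }
}

// The launch configuration must respect the bound (<= 256 threads per block):
// scale_kernel<<<(n + 255)/256, 256, 0, stream>>>(d_x, 0.5f, n);
```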
Diego Devesa
ae8de6d50a
ggml : build backends as libraries ( #10256 )
...
* ggml : build backends as libraries
---------
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: R0CKSTAR <xiaodong.ye@mthreads.com>
2024-11-14 18:04:35 +01:00
Johannes Gäßler
5af118efda
CUDA: fix --split-mode row race condition ( #9413 )
2024-09-11 10:22:40 +02:00
slaren
2b1f616b20
ggml : reduce hash table reset cost ( #8698 )
...
* ggml : reduce hash table reset cost
* fix unreachable code warnings after GGML_ASSERT(false)
* GGML_ASSERT(false) -> GGML_ABORT("fatal error")
* GGML_ABORT use format string
2024-07-27 04:41:55 +02:00
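The last three bullets swap `GGML_ASSERT(false)` for an abort macro that takes a printf-style format string and is marked noreturn, which also removes the unreachable-code warnings mentioned above. A sketch of such a macro under assumed names, not the real ggml definitions:

```cpp
#include <cstdarg>
#include <cstdio>
#include <cstdlib>

// Sketch only: an always-on abort with file/line and printf-style formatting.
[[noreturn]] inline void fatal_abort_impl(const char * file, int line,
                                          const char * fmt, ...) {
    va_list args;
    va_start(args, fmt);
    fprintf(stderr, "%s:%d: ", file, line);
    vfprintf(stderr, fmt, args);
    fputc('\n', stderr);
    va_end(args);
    abort();
}

// Call sites can give context: FATAL_ABORT("unsupported type %d", type).
// Because the helper is [[noreturn]], the compiler no longer warns about
// "unreachable" code after the call, unlike GGML_ASSERT(false).
#define FATAL_ABORT(...) fatal_abort_impl(__FILE__, __LINE__, __VA_ARGS__)
```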
Johannes Gäßler
69c487f4ed
CUDA: MMQ code deduplication + iquant support ( #8495 )
...
* CUDA: MMQ code deduplication + iquant support
* 1 less parallel job for CI build
2024-07-20 22:25:26 +02:00
Johannes Gäßler
808aba3916
CUDA: optimize and refactor MMQ ( #8416 )
...
* CUDA: optimize and refactor MMQ
* explicit q8_1 memory layouts, add documentation
2024-07-11 16:47:47 +02:00
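The "explicit q8_1 memory layouts" bullet concerns how quantized data is packed. As a rough picture, a q8_1-style block keeps 32 int8 quants together with a half-precision scale and a precomputed per-block sum term that the integer dot-product kernels consume. The struct below is a simplified sketch with assumed field names, not the exact ggml definition and not the MMQ tile layout itself:

```cpp
#include <cstdint>

// Sketch only: field names and the 16-bit stand-in type are assumptions.
#define QK8_1 32
typedef uint16_t half_bits;             // stand-in for a 16-bit float

struct block_q8_1_sketch {
    half_bits d;                        // scale: x[i] ~= d * qs[i]
    half_bits s;                        // scale times the sum of the 32 quants
    int8_t    qs[QK8_1];                // quantized values
};

// Keeping the block tightly packed matters because kernels index blocks by
// byte offset; any padding would silently change the layout.
static_assert(sizeof(block_q8_1_sketch) == 2*sizeof(half_bits) + QK8_1,
              "q8_1 sketch must stay tightly packed");
```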
Johannes Gäßler
8e558309dc
CUDA: MMQ support for iq4_nl, iq4_xs ( #8278 )
2024-07-05 09:06:31 +02:00
Daniele
0a423800ff
CUDA: revert part of the RDNA1 optimizations ( #8309 )
...
The change to launch_bounds was causing a small performance drop of about 25 t/s during perplexity evaluation.
2024-07-05 09:06:09 +02:00
Johannes Gäßler
bcefa03bc0
CUDA: fix MMQ stream-k rounding if ne00 % 128 != 0 ( #8311 )
2024-07-05 09:05:34 +02:00
Daniele
d23287f122
Define and optimize RDNA1 ( #8085 )
2024-07-04 01:02:58 +02:00
Johannes Gäßler
85a267daaa
CUDA: fix MMQ stream-k for --split-mode row ( #8167 )
2024-06-27 16:26:05 +02:00
Georgi Gerganov
f3f65429c4
llama : reorganize source code + improve CMake ( #8006 )
...
* scripts : update sync [no ci]
* files : relocate [no ci]
* ci : disable kompute build [no ci]
* cmake : fixes [no ci]
* server : fix mingw build
ggml-ci
* cmake : minor [no ci]
* cmake : link math library [no ci]
* cmake : build normal ggml library (not object library) [no ci]
* cmake : fix kompute build
ggml-ci
* make,cmake : fix LLAMA_CUDA + replace GGML_CDEF_PRIVATE
ggml-ci
* move public backend headers to the public include directory (#8122 )
* move public backend headers to the public include directory
* nix test
* spm : fix metal header
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* scripts : fix sync paths [no ci]
* scripts : sync ggml-blas.h [no ci]
---------
Co-authored-by: slaren <slarengh@gmail.com>
2024-06-26 18:33:02 +03:00