Johannes Gäßler
a28e0d5eb1
CUDA: add option to compile without FlashAttention ( #12025 )
2025-02-22 20:44:34 +01:00
Johannes Gäßler
5fa07c2f93
CUDA: optimize FA for GQA + large batches ( #12014 )
2025-02-22 12:20:17 +01:00
Gian-Carlo Pascutto
d70908421f
cuda: Add Q5_1, Q5_0, Q4_1 and Q4_0 to F32 conversion support. ( #12000 )
2025-02-22 09:43:24 +01:00
PureJourney
ecc8e3aeff
CUDA: correct the lowest Maxwell supported by CUDA 12 ( #11984 )
...
* CUDA: correct the lowest Maxwell supported by CUDA 12
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-02-21 12:21:05 +01:00
Bodhi
0b3863ff95
MUSA: support ARM64 and enable dp4a, etc. ( #11843 )
...
* MUSA: support ARM64 and enable __dp4a, etc.
* fix cross entropy loss op for musa
* update
* add cc info log for musa
* add comment for the MUSA .cc calculation block
---------
Co-authored-by: Bodhi Hu <huaishun.hu@mthreads.com>
2025-02-21 09:46:23 +02:00
Johannes Gäßler
73e2ed3ce3
CUDA: use async data loading for FlashAttention ( #11894 )
...
* CUDA: use async data loading for FlashAttention
---------
Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-02-17 14:03:24 +01:00
Diego Devesa
94b87f87b5
cuda : add ampere to the list of default architectures ( #11870 )
2025-02-14 15:33:52 +01:00
R0CKSTAR
bd6e55bfd3
musa: bump MUSA SDK version to rc3.1.1 ( #11822 )
...
* musa: Update MUSA SDK version to rc3.1.1
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* musa: Remove workaround in PR #10042
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
---------
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-02-13 13:28:18 +01:00
uvos
5c4284d57b
HIP: Remove GCN from list of devices that avoid MMQ ( #11831 )
2025-02-12 22:25:28 +01:00
uvos
e598697d63
HIP: Switch to std::vector in rocblas version check ( #11820 )
2025-02-12 17:25:03 +01:00
Johannes Gäßler
c3d6af7cd2
CUDA: fix CUDART_VERSION checks ( #11821 )
2025-02-12 13:16:39 +01:00
Johannes Gäßler
b9ab0a4d0b
CUDA: use arch list for compatibility check ( #11775 )
...
* CUDA: use arch list for feature availability check
---------
Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-02-11 00:17:22 +01:00
Johannes Gäßler
d80be897ac
CUDA: fix min. version for movmatrix ( #11751 )
2025-02-08 10:46:07 +01:00
Johannes Gäßler
fa62da9b2d
CUDA: support for mat. mul. with ne03 != ne13 ( #11656 )
2025-02-05 08:58:31 +01:00
Johannes Gäßler
fd08255d0d
CUDA: non-contiguous (RMS) norm support ( #11659 )
...
* CUDA: non-contiguous (RMS) norm support
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-02-04 22:21:42 +01:00
Johannes Gäßler
21c84b5d2d
CUDA: fix Volta FlashAttention logic ( #11615 )
2025-02-03 14:25:56 +02:00
Johannes Gäßler
6eecde3cc8
HIP: fix flash_attn_stream_k_fixup warning ( #11604 )
2025-02-02 23:48:29 +01:00
uvos
396856b400
CUDA/HIP: add support for selectable warp size to mmv ( #11519 )
...
CUDA/HIP: add support for selectable warp size to mmv
2025-02-02 22:40:09 +01:00
uvos
4d0598e144
HIP: add GGML_CUDA_CC_IS_* for AMD families, as increasing cc architectures for AMD GPUs are not supersets of each other ( #11601 )
...
This fixes a bug where RDNA1 GPUs other than gfx1010 were not handled correctly.
2025-02-02 22:08:05 +01:00
Johannes Gäßler
864a0b67a6
CUDA: use mma PTX instructions for FlashAttention ( #11583 )
...
* CUDA: use mma PTX instructions for FlashAttention
* __shfl_sync workaround for movmatrix
* add __shfl_sync to HIP
Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-02-02 19:31:09 +01:00
uvos
6af1ca48cb
HIP: Prepare reduction operators for wave 64
2025-01-30 16:25:44 +01:00
uvos
c300e68ef4
CUDA/HIP: add warp_size to cuda_device_info
2025-01-30 16:25:44 +01:00
uvos
be5ef7963f
HIP: Suppress transformation warning in softmax.cu
...
Loops with bounds not known at compile time cannot be unrolled.
When ncols_template == 0, the bounds of the loop are not constexpr, so LLVM cannot unroll the loops here.
2025-01-28 23:06:32 +01:00
Nikita Sarychev
cae9fb4361
HIP: Only call rocblas_initialize on rocblas versions with the multiple instantiation bug ( #11080 )
...
This disables the workaround on fixed rocblas versions (>= 4.0.0), eliminating the runtime cost and unnecessary VRAM allocation of loading all Tensile objects.
2025-01-28 16:42:20 +01:00
Haus1
d6d24cd9ed
AMD: parse the architecture as supplied by gcnArchName ( #11244 )
...
The value provided by minor doesn't include the stepping for AMD; parse the value returned by gcnArchName instead to retrieve an accurate ID.
2025-01-27 14:58:17 +01:00
uvos
26771a1491
HIP: disable VMM on HIP as it seems that it doesn't work in some configurations ( #11420 )
2025-01-25 21:01:12 +01:00
uvos
5f0db9522f
hip : Add hipGraph and VMM support to ROCM ( #11362 )
...
* Add hipGraph support
* Enable VMM on rocm
2025-01-25 00:02:23 +01:00
Johannes Gäßler
c5d9effb49
CUDA: fix FP16 cuBLAS GEMM ( #11396 )
2025-01-24 21:02:43 +01:00
uvos
9fbadaef4f
rocBLAS: Avoid fp32->fp16->fp32 conversion on cdna ( #11356 )
2025-01-24 17:50:49 +01:00
Johannes Gäßler
8137b4bb2b
CPU/CUDA: fix (GQA) mul mat back, add CUDA support ( #11380 )
2025-01-24 12:38:31 +01:00
Johannes Gäßler
9c8dcefe17
CUDA: backwards pass for misc. ops, add tests ( #11257 )
...
* CUDA: backwards pass for misc. ops, add tests
* remove restrict from pointers
2025-01-16 16:43:38 +01:00
Johannes Gäßler
432df2d5f9
RoPE: fix back, CUDA support for back + noncont. ( #11240 )
...
* RoPE: fix back, CUDA support for back + noncont.
* fix comments reg. non-cont. RoPE support [no-ci]
2025-01-15 12:51:37 +01:00
Andreas Kieslinger
39509fb082
cuda : CUDA Graph Compute Function Refactor (precursor for performance improvements) ( #11042 )
...
* Refactor: Moves cuda graph executable update step to separate function.
* Refactor: Moves cuda graph update check to separate function.
* Refactor: Moves cuda graph maintenance (update or adjusting copy parameters) to separate function for improved readability.
* Fix: Adds missing reference to maintain_cuda_graph() definition.
* Refactor: Improves structure and abstractions by moving CUDA graph evaluation and capture to its own function.
* Refactor: Moves node graph checks and copy ops into individual function for improved readability.
* Refactor: Removes code permanently excluded from compilation to increase readability.
* Style: Adds missing newline
* Style: Consolidates several neighboring '#ifdef USE_CUDA_GRAPH' into a single one
* Refactor: Makes 'cuda_graph_update_required' a local variable
* remove double lines between functions
---------
Co-authored-by: slaren <slarengh@gmail.com>
2025-01-13 16:45:53 +01:00
Molly Sophia
ee7136c6d1
llama: add support for QRWKV6 model architecture ( #11001 )
...
llama: add support for QRWKV6 model architecture (#11001 )
* WIP: Add support for RWKV6Qwen2
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* RWKV: Some graph simplification
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* Add support for RWKV6Qwen2 with cpu and cuda GLA
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* RWKV6[QWEN2]: Concat lerp weights together to reduce cpu overhead
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* Fix some typos
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* code format changes
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* Fix wkv test & add gla test
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* Fix cuda warning
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* Update README.md
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* Update ggml/src/ggml-cuda/gla.cu
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Fix fused lerp weights loading with RWKV6
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* better sanity check skipping for QRWKV6 in llama-quant
thanks @compilade
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
Co-authored-by: compilade <git@compilade.net>
---------
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: compilade <git@compilade.net>
2025-01-10 09:58:08 +08:00
hydai
8d59d91171
fix: add missing msg in static_assert ( #11143 )
...
Signed-off-by: hydai <z54981220@gmail.com>
2025-01-08 20:03:28 +00:00
Johannes Gäßler
46e3556e01
CUDA: add BF16 support ( #11093 )
...
* CUDA: add BF16 support
2025-01-06 02:33:52 +01:00
HimariO
ba1cb19cdd
llama : add Qwen2VL support + multimodal RoPE ( #10361 )
...
* Barebone Qwen2VL LLM convertor
* Add Qwen2VL cli entrypoint
* [WIP] add qwen2vl arch
* Verify m-rope output
* Add vl-rope/2d-rope support for qwen2vl ViT
* update qwen2vl cli tool
* update 5D tensor op workaround
* [WIP] qwen2vl vision model
* make batch and clip utils compatible with qwen2vl
* [WIP] create inference workflow, gguf convert script but fix
* correcting vision-rope behavior, add the missing last layer back to ViT
* add arg parser to qwen2vl_surgery
* replace variable size array with vector
* cuda-gdb cmake preset
* add fp32 mrope, vision rope kernel
* add fp16 support for qwen2vl and m-rope
* add `GGML_ROPE_TYPE_MROPE`, `GGML_ROPE_TYPE_VISION`
* fix rope op mode switching, outdated func args
* update `llama_hparams`
* update to keep up with upstream changes
* resolve linter, test errors
* add makefile entry, update special image padding token
* add mrope unit test, fix few compiler warnings
* rename `mrope` related function, params
* minor updates on debug util, bug fixes
* add `m-rope` testcase to `test-backend-ops`
* Apply suggestions from code review
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* fix trailing whitespace
* store `llama_hparams.rope_sections` with fixed size array
* update position id tensor size check in GGML_OP_ROPE
* minor updates
* update `ggml_backend_*_supports_op` of unsupported backends
* remove old `rope_section` compare operator
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-12-14 14:43:46 +02:00
a3sh
8faa1d4dd4
CUDA: faster non-contiguous concat ( #10760 )
...
* faster non-contiguous concat
* Use a lambda to avoid code duplication
Co-authored-by: Diego Devesa <slarengh@gmail.com>
* Update ggml/src/ggml-cuda/concat.cu
* add constexpr and static assert
---------
Co-authored-by: Diego Devesa <slarengh@gmail.com>
2024-12-12 19:09:50 +01:00
Andreas Kieslinger
750cb3e246
CUDA: rename macros to avoid conflicts with WinAPI ( #10736 )
...
* Renames NVIDIA GPU-architecture flags to avoid name clashes with WinAPI. (e.g. CC_PASCAL, GPU architecture or WinAPI pascal compiler flag?)
* Reverts erroneous rename in SYCL-code.
* Renames GGML_CUDA_MIN_CC_DP4A to GGML_CUDA_CC_DP4A.
* Renames the rest of the compute capability macros for consistency.
2024-12-10 18:23:24 +01:00
Johannes Gäßler
26a8406ba9
CUDA: fix shared memory access condition for mmv ( #10740 )
2024-12-09 20:07:12 +01:00
Djip007
19d8762ab6
ggml : refactor online repacking ( #10446 )
...
* rename ggml-cpu-aarch64.c to .cpp
* reformat extra cpu backend.
- clean Q4_0_N_M and IQ4_0_N_M
- remove from "file" tensor type
- allow only with dynamic repack
- extract cpu extra bufts and convert to C++
- hbm
- "aarch64"
- more generic use of extra buffer
- generalise extra_supports_op
- new API for "cpu-accel":
- amx
- aarch64
* clang-format
* Clean Q4_0_N_M ref
Enable restrict on C++
* add op GGML_OP_MUL_MAT_ID for Q4_0_N_M with runtime repack
* added/corrected control on tensor size for Q4 repacking.
* Update ggml/src/ggml-cpu/ggml-cpu-aarch64.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update ggml/src/ggml-cpu/ggml-cpu-aarch64.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* add debug logs on repacks.
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-12-07 14:37:50 +02:00
mahorozte
e9e661bd59
CUDA: remove unnecessary warp reduce in FA (ggml/1032)
...
* kqmax_new_j is the same in every thread within a warp after the operation at line 199, so this warp reduction can be omitted
* same problem in vec32
---------
Co-authored-by: ZhaoXiaoYu <zhao.xiaoyu@zte.com.cn>
2024-12-03 20:04:49 +02:00
uvos
3ad5451f3b
Add some minimal optimizations for CDNA ( #10498 )
...
* Add some minimal optimizations for CDNA
* ggml_cuda: set launch bounds also for GCN as it helps there too
2024-11-27 17:10:08 +01:00
Georgi Gerganov
ab96610b1e
cmake : enable warnings in llama ( #10474 )
...
* cmake : enable warnings in llama
ggml-ci
* cmake : add llama_get_flags and respect LLAMA_FATAL_WARNINGS
* cmake : get_flags -> ggml_get_flags
* speculative-simple : fix warnings
* cmake : reuse ggml_get_flags
ggml-ci
* speculative-simple : fix compile warning
ggml-ci
2024-11-26 14:18:08 +02:00
Diego Devesa
5931c1f233
ggml : add support for dynamic loading of backends ( #10469 )
...
* ggml : add support for dynamic loading of backends
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-11-25 15:13:39 +01:00
Diego Devesa
a5e47592b6
cuda : optimize argmax ( #10441 )
...
* cuda : optimize argmax
* remove unused parameter
ggml-ci
* fixup : use full warps
ggml-ci
* Apply suggestions from code review
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* fix ub
* ggml : check ne00 <= INT32_MAX in argmax and argsort
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2024-11-21 18:18:50 +01:00
Diego Devesa
3ee6382d48
cuda : fix CUDA_FLAGS not being applied ( #10403 )
2024-11-19 14:29:38 +01:00
Diego Devesa
d3481e6316
cuda : only use native when supported by cmake ( #10389 )
2024-11-18 18:43:40 +01:00
Johannes Gäßler
76e9e58b78
CUDA: fix MMV kernel being used for FP16 src1 ( #10357 )
2024-11-17 23:20:42 +01:00
Johannes Gäßler
ce2e59ba10
CMake: fix typo in comment [no ci] ( #10360 )
2024-11-17 12:59:38 +01:00