* WIP
* added a cpy kernel specific to transposed tensor which uses smem to avoid uncoalesced access; test cases also added shwoing improved memory bandwidth
* added BF16 support
* more strict check to make sure src0 is a transpose
* reformulated to handle more complicated transpose cases
* bring back 2D transpose for higher performance
* allow build on windows
* tranpose copy more shapes
* minor tweak
* final clean up
* restore some test cases
* keep only the kernel for true tranposed case; updated with review suggestions
* make CI happy
* remove headers not needed
* reduced bank conflicts for fp16 and bf16
* add missing const*
* now bank conflicts free
* use padding instead of swizzling
---------
Co-authored-by: bssrdf <bssrdf@gmail.com>
* CUDA: Remove unneded bias/gate dims in fused mmvq
Pointed out
[here](https://github.com/ggml-org/llama.cpp/pull/16847#discussion_r2476798989)
that only a single value is needed per target col per thread
* Apply suggestions from code review
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Fix "Error 991-D: extra braces are nonstandard" during compilation
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* CUDA: Volta tensor core support for MMF
* more generic checks for hardware support
* Update ggml/src/ggml-cuda/mmf.cuh
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
---------
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
This is realised by loading them into registers before computation of
the dot-product, effectively batching them together with said
dot-product. As a lot of threads are alive here, the warp scheduler has
enough threads available to effectively hide the cost of additionally
loading those two floats.
* CUDA: Fix bug in topk-moe for gpt-oss
When using ggml_can_fuse_subgraph, the output nodes which are passed are wrong. This causes `test-backend-ops` to still fuse ndoes (because the nodes are not used elsewhere in the graph),
but it actually doesn't fuse in the actual gpt-oss
* fix for qwen3 too
* change ifndef to ifdef
* ggml : fix interpolate with align-corners and ne=1
* avoid division by zero if one of the spatial dimensions is 1
* cpu, cuda, opencl returned correct result anyway due to clamp
* vulkan didn't clamp for align-corners so results were broken
* fix clang warning
* ggml: add ggml_can_fuse_subgraph
* ggml-cuda: use ggml_can_fuse_subgraph for topk-moe
* format
* 1. remove inputs from signature as they are transient nodes
2. add check for views: view_src should be part of the subgraph
* - combine check into one loop
- check all view_src parents
- other minor review comments
* remove redudant if test
* - rename and other minor review comments
* add assert about count < 32
* optimise GGML_OP_SUM
* add non-contiguous tests by permuting the input
* change tests to require full contiguity of OP_SUM
* cuda : add check GGML_OP_SUM
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* CUDA set scheduling strategy to spinning for cc121
* Using prop.major and prop.minor, include HIP and MUSA
* Exclude HIP and MUSA
* Remove trailing whitespace
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Remove empty line
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* metal : pad K, V and Mask when needed
* cont : simplify
* cuda : add TODO about KV padding requirement
* metal : add comments
* metal : remove mask padding requirement
* First attempt
* No permute during convert (fixes qk tensors), proper norm application.
* RoPE = NeoX
* Coherence!
* Migrate xielu params from tensors to hyperparameters
* Simple CUDA kernel
* Revert stupid LLM refactorings
* Chat template support
* configchecker / flake8 errors
* Reorder unary.cu
* I do conclude that LLMs are, in fact, stupid.
* Fix after merge
* Final newline
* Make xIELU an UNARY_OP
* Final newline
* Correctly account for parameter shift
* Argh.
* Update ggml/src/ggml-cpu/unary-ops.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Refactor: remove unused methods, inline and factorize softplus, add const modifiers
* Revert CUDA changes, implement xIELU as a separate OP
* Pesky newline
* Add float2half / half2float for F16 inputs/outputs
* CUDA variants, attempt 2
* Actually, attempt 3
* Update ggml/src/ggml-cuda/unary.cu
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Missing convert header
* Proper formula and reference for xIELU in the comments.
* Modify unary-ops.cpp to add the functor-based logic besides the template system to retain optimizations
* Apply suggestions from code review
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Add tensor mappings for Apertus to global list instead
* Fix lazy on scalars
* Update ggml/src/ggml-cuda/unary.cu
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Add comment about the constraints on positive/negative alpha
* Change `softplus` to `ggml_softplus`
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* HIP: Disable ROCWMMA fatt on CDNA when compiled against ROCWMMA 2.0.0
rocwmma 2.0.0 includes a bug in the code fakeing fp16 accumulation on CDNA
* CUDA: Fix volta condition in ggml_cuda_should_use_wmma_fattn
* CUDA: mul_mat_id for mmf for bs <= 64 for f16 and bs <= 32 for f32
This commit adds mul_mat_id support for ncols_dst >= 16. It does this by
packing ncols_dst tiles into the blockDim.y.
My tests on a RTX 3090 show that this is faster than the cuBLAS fallback
for f16 till bs=64, and for f32 till bs=32
* Review: refactor if statement