Johannes Gäßler
710dfc465a
CUDA: fix half2 -> half conversion for HIP ( #15529 )
2025-08-23 21:37:06 +02:00
R0CKSTAR
8ad038c0fd
musa: add GGML_UNUSED_VARS ( #15446 )
...
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com >
2025-08-21 11:06:05 +08:00
R0CKSTAR
a094f38143
musa: fix build warnings ( #15258 )
...
* musa: fix build warnings
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com >
* fix warning: comparison of integers of different signs: 'const int' and 'unsigned int' [-Wsign-compare]
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com >
---------
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com >
2025-08-20 10:17:37 +08:00
Aman Gupta
34c9d765bf
CUDA: add attention sinks for tile and wmma ( #15178 )
...
* CUDA: add attention sinks for tile and wmma
* Review: formatting changes + remove syncthreads from tile + remove warp_reduce_max from wmma
2025-08-09 20:00:24 +08:00
Georgi Gerganov
fd1234cb46
llama : add gpt-oss ( #15091 )
...
* oai moe
* compat with new checkpoint
* add attn sink impl
* add rope scaling yarn
* logits match with latest transformers code
* wip chat template
* rm trailing space
* use ggml_scale_bias
* rm redundant is_swa_all
* convert interleaved gate_up
* graph : fix activation function to match reference (#7 )
* vocab : handle o200k_harmony special tokens
* ggml : add attention sinks support (#1 )
* llama : add attn sinks
* ggml : add attn sinks
* cuda : add attn sinks
* vulkan : add support for sinks in softmax
remove unnecessary return
* ggml : add fused swiglu_oai op (#11 )
* ggml : add fused swiglu_oai op
* Update ggml/src/ggml-cpu/ops.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com >
* update CUDA impl
* cont : metal impl
* add vulkan impl
* test-backend-ops : more test cases, clean up
* llama : remove unfused impl
* remove extra lines
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com >
---------
Co-authored-by: slaren <slarengh@gmail.com >
* repack mxfp4 upon conversion
* clean up a bit
* enable thinking
* add quick hack to render only some special tokens
* fix bf16 conversion
* remove vocab hack
* webui ok
* support chat parsing for gpt-oss
* fix webui
* direct mapping mxfp4, FINALLY
* force using mxfp4
* properly use lazy tensor
* ggml : add mxfp4
ggml : use e8m0 conversion instead of powf
Co-authored-by: Diego Devesa <slarengh@gmail.com >
change kvalues_mxfp4 table to match e2m1 (#6 )
metal : remove quantization for now (not used)
cuda : fix disabled CUDA graphs due to ffn moe bias
vulkan : add support for mxfp4
cont : add cm2 dequant
* ggml : add ggml_add_id (#13 )
* ggml : add ggml_add_id
* add cuda impl
* llama : add weight support check for add_id
* perf opt
* add vulkan impl
* rename cuda files
* add metal impl
* allow in-place ggml_add_id
* llama : keep biases on CPU with --cpu-moe
* llama : fix compile error
ggml-ci
* cuda : add fallback for __nv_cvt_e8m0_to_bf16raw
ggml-ci
* cleanup
ggml-ci
* sycl : fix supports_op for MXFP4
ggml-ci
* fix Unknown reasoning format
* ggml-cpu : fix AVX build
ggml-ci
* fix hip build
ggml-ci
* cuda : add mxfp4 dequantization support for cuBLAS
ggml-ci
* ggml-cpu : fix mxfp4 fallback definitions for some architectures
ggml-ci
* cuda : fix version required for __nv_cvt_e8m0_to_bf16raw
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co >
Co-authored-by: slaren <slarengh@gmail.com >
2025-08-05 22:10:36 +03:00
Johannes Gäßler
92b8810ec7
CUDA: skip masked KV slices for all FA kernels ( #14924 )
2025-07-30 15:46:13 +02:00
uvos
aa79524c51
HIP: remove the use of __HIP_PLATFORM_AMD__, explicitly support only AMD targets ( #14945 )
2025-07-29 20:23:04 +02:00
Johannes Gäßler
a86f52b285
CUDA: fix overflow in FA, tune performance ( #14840 )
2025-07-23 21:43:25 +02:00
Georgi Gerganov
225e7a1438
llama : add high-throughput mode ( #14363 )
...
* kv-cache : prepare K/V buffers for separation
ggml-ci
* batched-bench : fix oob write
ggml-ci
* llama : add "virtual sequences"
ggml-ci
* llama : use "stream" vs "virtual sequence"
ggml-ci
* graph : fix stream splitting when KV cache is not used
ggml-ci
* kv-cache : add multi-stream save/load support
ggml-ci
* llama : add "--attn-streams" flag
ggml-ci
* kv-cache : fix handling when find_slot fails
ggml-ci
* kv-cache : restore find_slot impl
ggml-ci
* kv-cache : add comments
* kv-cache : add bounds checks for sequence id
ggml-ci
* cont : add n_seq_max to batch allocr
ggml-ci
* kv-cache : perform stream copies lazily after llama_synchronize
ggml-ci
* kv-cache : avoid throwing exceptions across the C boundary
ggml-ci
* CUDA: 4D FlashAttention support (#14628 )
* CUDA: 4D FlashAttention support
* CUDA: fix WMMA FA kernel
* llama : rename attn_streams -> kv_unified
ggml-ci
* common : rename kv_split -> kv_unified
ggml-ci
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de >
2025-07-16 16:35:42 +03:00
Johannes Gäßler
12a81af45f
CUDA: broadcasting for FlashAttention mask ( #14500 )
2025-07-02 15:48:33 +03:00
Johannes Gäßler
0cf6725e9f
CUDA: FA support for Deepseek (Ampere or newer) ( #13306 )
...
* CUDA: FA support for Deepseek (Ampere or newer)
* do loop unrolling via C++ template
2025-05-09 13:34:58 +02:00
R0CKSTAR
492d7f1ff7
musa: fix all warnings, re-enable -DLLAMA_FATAL_WARNINGS=ON in ci and update doc ( #12611 )
...
* musa: fix all warnings
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com >
* musa: enable -DLLAMA_FATAL_WARNINGS=ON in run.sh
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com >
* musa: update ci doc (install ccache)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com >
* fix Windows build issue
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com >
* Address review comments
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com >
* Address review comments
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com >
---------
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com >
2025-03-30 10:59:38 +02:00
Gaurav Garg
517b5ddbf0
CUDA: Improve flash decoding kernel GPU occupancy for BS=1 case ( #12183 )
...
- Find out active blocks per SM using cudaOccupancyMaxActiveBlocksPerMultiprocessor API. Use this value to determine the optimal parallel_blocks value.
- Prefer vector flash attention kernels over MMA kernel for BS=1
Fixes Issue: #12182
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de >
2025-03-19 20:52:06 +01:00
Johannes Gäßler
a28e0d5eb1
CUDA: app option to compile without FlashAttention ( #12025 )
2025-02-22 20:44:34 +01:00
Johannes Gäßler
5fa07c2f93
CUDA: optimize FA for GQA + large batches ( #12014 )
2025-02-22 12:20:17 +01:00
Johannes Gäßler
864a0b67a6
CUDA: use mma PTX instructions for FlashAttention ( #11583 )
...
* CUDA: use mma PTX instructions for FlashAttention
* __shfl_sync workaround for movmatrix
* add __shfl_sync to HIP
Co-authored-by: Diego Devesa <slarengh@gmail.com >
2025-02-02 19:31:09 +01:00
Diego Devesa
ae8de6d50a
ggml : build backends as libraries ( #10256 )
...
* ggml : build backends as libraries
---------
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com >
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com >
Co-authored-by: R0CKSTAR <xiaodong.ye@mthreads.com >
2024-11-14 18:04:35 +01:00
Johannes Gäßler
fabdc3bda3
ggml/ex: calculate accuracy in graph, adapt MNIST (ggml/980)
2024-10-03 21:17:26 +03:00
Johannes Gäßler
e11bd856d5
CPU/CUDA: Gemma 2 FlashAttention support ( #8542 )
...
* CPU/CUDA: Gemma 2 FlashAttention support
* apply logit_softcap to scale in kernel
* disable logit softcapping tests on Metal
* remove metal check
2024-08-24 21:34:59 +02:00
slaren
2b1f616b20
ggml : reduce hash table reset cost ( #8698 )
...
* ggml : reduce hash table reset cost
* fix unreachable code warnings after GGML_ASSERT(false)
* GGML_ASSERT(false) -> GGML_ABORT("fatal error")
* GGML_ABORT use format string
2024-07-27 04:41:55 +02:00
Georgi Gerganov
f3f65429c4
llama : reorganize source code + improve CMake ( #8006 )
...
* scripts : update sync [no ci]
* files : relocate [no ci]
* ci : disable kompute build [no ci]
* cmake : fixes [no ci]
* server : fix mingw build
ggml-ci
* cmake : minor [no ci]
* cmake : link math library [no ci]
* cmake : build normal ggml library (not object library) [no ci]
* cmake : fix kompute build
ggml-ci
* make,cmake : fix LLAMA_CUDA + replace GGML_CDEF_PRIVATE
ggml-ci
* move public backend headers to the public include directory (#8122 )
* move public backend headers to the public include directory
* nix test
* spm : fix metal header
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com >
* scripts : fix sync paths [no ci]
* scripts : sync ggml-blas.h [no ci]
---------
Co-authored-by: slaren <slarengh@gmail.com >
2024-06-26 18:33:02 +03:00