This is a first attempt at optimizing the Metal kernel. The changes here are:

- Launch the kernel with a threadgroup of size d_state
- Use simdgroups and shared memory to do the summation for the y computation

When tested with G4 tiny preview, this shows roughly a 3x speedup on prefill and a 15% speedup on decode.

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
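For reference, here is a minimal sketch of the reduction pattern the two bullets describe, not the actual kernel code. It assumes a hypothetical kernel name (ssm_sum_sketch), a flat row-major input where each threadgroup sums d_state terms into one output, and that d_state is a multiple of the 32-wide Apple simdgroup; the real kernel's inputs and the full y computation differ.

```metal
#include <metal_stdlib>
using namespace metal;

// Sketch only: one threadgroup per output row, one thread per state
// element (threadgroup size == d_state), assuming d_state % 32 == 0.
kernel void ssm_sum_sketch(
        device const float * x       [[buffer(0)]],      // d_state terms per row
        device       float * y       [[buffer(1)]],      // one sum per row
        constant     uint  & d_state [[buffer(2)]],
        threadgroup  float * partial [[threadgroup(0)]], // one slot per simdgroup
        uint row  [[threadgroup_position_in_grid]],
        uint tid  [[thread_position_in_threadgroup]],
        uint sgid [[simdgroup_index_in_threadgroup]],
        uint lane [[thread_index_in_simdgroup]]) {
    // each thread loads one state element of this row
    float v = x[row * d_state + tid];

    // stage 1: reduce within each simdgroup via SIMD shuffles
    v = simd_sum(v);

    // stage 2: write one partial sum per simdgroup to shared memory
    if (lane == 0) {
        partial[sgid] = v;
    }
    threadgroup_barrier(mem_flags::mem_threadgroup);

    // stage 3: the first simdgroup reduces the partials and writes y
    const uint n_sg = d_state / 32;
    if (sgid == 0) {
        float s = lane < n_sg ? partial[lane] : 0.0f;
        s = simd_sum(s);
        if (lane == 0) {
            y[row] = s;
        }
    }
}
```

On the host side this pattern would be dispatched with a threadgroup size equal to d_state, with setThreadgroupMemoryLength:atIndex: sized to hold one float per simdgroup (d_state / 32 slots). The two-stage shape is what replaces a serial per-thread loop over d_state: simd_sum covers each group of 32 lanes in registers, and threadgroup memory is only touched once per simdgroup.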