* vulkan: change graph_compute to be async and enable get_tensor_async

  This allows some additional CPU/GPU overlap for large prompt-processing (pp)
  workloads. It also seems to help a bit for token generation, maybe by
  getting rid of a small bubble between graph_compute and get_tensor.

  The async set and copy functions seem to be very rarely used, so I didn't
  enable them because I didn't have a good way to test them.

  The async commands need to be ordered against each other, so they are all
  put on the compute queue. The non-async commands still use the transfer
  queue.

  The fence for graph_compute/get_tensor_async is submitted and waited on in
  ggml_vk_synchronize.

* fix thread safety errors

* teardown context cleanly

* Handle async read to non-pinned dst
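
Below is a minimal C++ sketch of the submission pattern the first commit describes: async submissions all go to the compute queue so submission order keeps them serialized, and a single fence is only submitted and waited on in the synchronize step (mirroring what the commit says about ggml_vk_synchronize). All names here (`vk_ctx`, `graph_compute_async`, `get_tensor_async`, `pending_read`) are hypothetical stand-ins for illustration; only `ggml_vk_synchronize` is named above, and the real ggml-vulkan code differs.

```cpp
// Hypothetical sketch of the deferred-fence pattern described above.
// This is NOT the actual ggml-vulkan code; the context fields and helper
// names are illustrative assumptions.
#include <vulkan/vulkan.h>
#include <cstdint>
#include <cstring>
#include <vector>

struct pending_read {
    void *       dst;      // user destination (possibly non-pinned memory)
    const void * staging;  // host-visible staging buffer the GPU copied into
    size_t       size;
};

struct vk_ctx {
    VkDevice device;
    VkQueue  compute_queue;   // all async commands go here so they stay ordered
    VkQueue  transfer_queue;  // non-async copies keep using the transfer queue
    VkFence  fence;           // submitted and waited on only in synchronize
    bool     work_pending = false;
    std::vector<pending_read> reads;  // deferred copies to non-pinned dst
};

// graph_compute: submit the graph's command buffer without waiting, so the
// CPU can overlap with the GPU (the win for large pp batches).
static void graph_compute_async(vk_ctx * ctx, VkCommandBuffer cmd) {
    VkSubmitInfo si { VK_STRUCTURE_TYPE_SUBMIT_INFO };
    si.commandBufferCount = 1;
    si.pCommandBuffers    = &cmd;
    vkQueueSubmit(ctx->compute_queue, 1, &si, VK_NULL_HANDLE);
    ctx->work_pending = true;
}

// get_tensor_async: record the GPU->host copy on the *compute* queue so that
// submission order guarantees it runs after graph_compute. If dst is not
// pinned (host-visible) memory, the GPU copies into a staging buffer and the
// final memcpy is deferred until synchronize.
static void get_tensor_async(vk_ctx * ctx, VkCommandBuffer copy_cmd,
                             void * dst, const void * staging, size_t size) {
    VkSubmitInfo si { VK_STRUCTURE_TYPE_SUBMIT_INFO };
    si.commandBufferCount = 1;
    si.pCommandBuffers    = &copy_cmd;
    vkQueueSubmit(ctx->compute_queue, 1, &si, VK_NULL_HANDLE);
    ctx->reads.push_back({ dst, staging, size });
    ctx->work_pending = true;
}

// synchronize: submit the fence and block until the GPU is done, then finish
// any reads into non-pinned destinations.
static void synchronize(vk_ctx * ctx) {
    if (!ctx->work_pending) {
        return;
    }
    // A zero-batch submit with a fence signals the fence once all previously
    // submitted work on the queue has completed.
    vkQueueSubmit(ctx->compute_queue, 0, nullptr, ctx->fence);
    vkWaitForFences(ctx->device, 1, &ctx->fence, VK_TRUE, UINT64_MAX);
    vkResetFences(ctx->device, 1, &ctx->fence);
    for (const pending_read & r : ctx->reads) {
        std::memcpy(r.dst, r.staging, r.size);
    }
    ctx->reads.clear();
    ctx->work_pending = false;
}
```

Submitting zero batches with a fence is valid Vulkan and signals the fence after all prior work on that queue completes, which is what lets a single fence deferred to the synchronize step cover both graph_compute and get_tensor_async submissions.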