ggml : split graph allocations according to backend max buffer size (#15815)

* ggml : make gallocr respect the backend's max buffer size

* if the graph requires more memory than can fit into a single allocation, split it into multiple backend buffers
* vulkan: report the actual max  allocation size in buffer type  interface

* fix missing newline, apple-clang warning

* track size of individual chunks in ggml_dyn_tallocr and raise max chunks.
revert to use suballocation_block_size as max chunk size for vulkan.

* track (chunk, offset) pairs instead of "global" offsets through gallocr.

* simpler, don't need loops to map between local/global offsets
* touches more code

* fix dyn_tallocr_max_size and initialization

* fix memory leak when buffers are reused due to same buffer type appearing multiple times

* make vbuffer allocation follow the same logic as backend_buffer did before

* continue to use leftover unallocated space of previous chunks after a new one has been created

* treat free blocks of each chunk as separate list
* they're still allocated together, but start/end of each chunk is tracked, and allocate/free iterate over sub-ranges
* exhaust freed blocks of all chunks before considering their last blocks with unallocated space
* start with 0 chunks/blocks and create chunks as needed
* allow the last chunk to grow beyond max size

* refactor: move adding new free block and new chunk into separate functions

* allocate chunks individually with a separate free-blocks list for each one

* needs a bit more memory/allocations/indirections, but code is simpler

* fix warnings (missing static) & debug checks
This commit is contained in:
Acly
2025-09-24 16:17:49 +02:00
committed by GitHub
parent 3a59971967
commit f2a789e334
4 changed files with 858 additions and 141 deletions

View File

@@ -342,6 +342,10 @@ struct ggml_cgraph {
// if you need the gradients, get them from the original graph
struct ggml_cgraph ggml_graph_view(struct ggml_cgraph * cgraph, int i0, int i1);
// ggml-alloc.c: true if the operation can reuse memory from its sources
GGML_API bool ggml_op_can_inplace(enum ggml_op op);
// Memory allocation
GGML_API void * ggml_aligned_malloc(size_t size);