ggml : split graph allocations according to backend max buffer size (#15815)

mirror of https://github.com/ggml-org/llama.cpp.git synced 2025-10-31 08:51:55 +00:00

* ggml : make gallocr respect the backend's max buffer size

* if the graph requires more memory than can fit into a single allocation, split it into multiple backend buffers
* vulkan: report the actual max allocation size in buffer type interface

* fix missing newline, apple-clang warning

* track size of individual chunks in ggml_dyn_tallocr and raise max chunks.
  revert to using suballocation_block_size as max chunk size for vulkan.

* track (chunk, offset) pairs instead of "global" offsets through gallocr.

* simpler, don't need loops to map between local/global offsets
* touches more code

* fix dyn_tallocr_max_size and initialization

* fix memory leak when buffers are reused due to same buffer type appearing multiple times

* make vbuffer allocation follow the same logic as backend_buffer did before

* continue to use leftover unallocated space of previous chunks after a new one has been created

* treat free blocks of each chunk as separate list
* they're still allocated together, but start/end of each chunk is tracked, and allocate/free iterate over sub-ranges
* exhaust freed blocks of all chunks before considering their last blocks with unallocated space
* start with 0 chunks/blocks and create chunks as needed
* allow the last chunk to grow beyond max size

* refactor: move adding new free block and new chunk into separate functions

* allocate chunks individually with a separate free-blocks list for each one

* needs a bit more memory/allocations/indirections, but code is simpler

* fix warnings (missing static) & debug checks
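
The chunking scheme described above can be illustrated with a minimal sketch. This is not the code from ggml-alloc.c: every name here (`toy_addr`, `toy_planner`, `toy_plan_alloc`, `TOY_MAX_CHUNKS`) is hypothetical, alignment is ignored, and freed blocks are not tracked. It only shows the addressing idea: each tensor gets a (chunk, offset) pair, leftover space in earlier chunks is tried first, chunks are created on demand, and the last chunk may grow past the limit once no more chunks can be added.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_CHUNKS 16  /* hypothetical cap on the number of chunks */

/* Hypothetical address: a tensor lives at `offset` inside chunk `chunk`. */
struct toy_addr {
    int    chunk;
    size_t offset;
};

struct toy_planner {
    size_t max_chunk_size;             /* backend limit for a single buffer */
    size_t chunk_size[TOY_MAX_CHUNKS]; /* bytes planned in each chunk so far */
    int    n_chunks;                   /* starts at 0, chunks created as needed */
};

static struct toy_addr toy_plan_alloc(struct toy_planner * p, size_t size) {
    struct toy_addr a;
    assert(size > 0);

    /* 1. reuse leftover space at the end of any existing chunk */
    for (int i = 0; i < p->n_chunks; i++) {
        if (p->chunk_size[i] + size <= p->max_chunk_size) {
            a.chunk  = i;
            a.offset = p->chunk_size[i];
            p->chunk_size[i] += size;
            return a;
        }
    }

    /* 2. otherwise create a new chunk, if a slot is left */
    if (p->n_chunks < TOY_MAX_CHUNKS) {
        a.chunk  = p->n_chunks++;
        a.offset = 0;
        p->chunk_size[a.chunk] = size;
        return a;
    }

    /* 3. no slots left: let the last chunk grow beyond max_chunk_size */
    a.chunk  = p->n_chunks - 1;
    a.offset = p->chunk_size[a.chunk];
    p->chunk_size[a.chunk] += size;
    return a;
}
```

With a 100-byte limit, requests of 60, 60, 30 and 50 bytes land at (0, 0), (1, 0), (0, 60) and (2, 0): the third request fills the gap left in chunk 0 instead of opening a new chunk. A backend would then allocate one buffer per chunk, sized from `chunk_size[i]`.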

Author: Acly
Date: 2025-09-24 16:17:49 +02:00
Committed by: GitHub
Parent: 3a59971967
Commit: f2a789e334

4 changed files with 858 additions and 141 deletions

									
ggml/src/ggml-impl.h

@@ -342,6 +342,10 @@ struct ggml_cgraph {
// if you need the gradients, get them from the original graph
struct ggml_cgraph ggml_graph_view(struct ggml_cgraph * cgraph, int i0, int i1);

// ggml-alloc.c: true if the operation can reuse memory from its sources
GGML_API bool ggml_op_can_inplace(enum ggml_op op);

// Memory allocation

GGML_API void * ggml_aligned_malloc(size_t size);
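
As a rough illustration of what "can reuse memory from its sources" means, here is a small sketch. The enum and function below are hypothetical stand-ins, not ggml's `enum ggml_op` or the real `ggml_op_can_inplace`. The assumption is the usual one: an element-wise op can write its result over its input because output element i depends only on input element i, while an op such as a matrix multiplication reads each input element many times and would corrupt its own input.

```c
#include <stdbool.h>

/* Hypothetical op set for illustration only; not ggml's enum ggml_op. */
enum toy_op {
    TOY_OP_ADD,      /* element-wise */
    TOY_OP_SCALE,    /* element-wise */
    TOY_OP_MUL_MAT   /* each output element reads a whole row and column */
};

/* True if the op may write its output into the buffer of one of its inputs. */
static bool toy_op_can_inplace(enum toy_op op) {
    switch (op) {
        case TOY_OP_ADD:
        case TOY_OP_SCALE:
            return true;
        default:
            return false;
    }
}
```

A graph allocator can use such a predicate to skip allocating a new tensor when the source buffer is no longer needed by any other node.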
