Commit Graph

5851 Commits

ibrahimkhadraoui
8c50893820 added some cb functions for debugging purposes 2025-07-07 14:10:45 +04:00
ibrahimkhadraoui
441d8d66bd override modify_tensors instead of get_tensors 2025-07-07 12:00:57 +04:00
ibrahimkhadraoui
53304c84db remove unused functions from gguf_writer.py 2025-07-07 11:18:14 +04:00
ibrahimkhadraoui
c4af0f3ca5 mamba_d_ssm added to d_inner find_hparam 2025-07-07 11:17:31 +04:00
ibrahimkhadraoui
c56ec07a9a read arch from gguf.MODEL_ARCH 2025-07-07 10:34:46 +04:00
ibrahimkhadraoui
280dd2dcb7 falcon-h1 specific vocab resolved 2025-07-07 10:25:57 +04:00
ibrahimkhadraoui
7a25441e13 fixed multipliers 2025-07-04 17:41:03 +04:00
ibrahimkhadraoui
9760c8bc9d resolve merge conflict 2025-07-04 16:28:48 +04:00
ibrahimkhadraoui
2aa48dd853 Merge branch 'add-fh1-rebased' of https://github.com/tiiuae/llama.cpp-public into add-fh1-rebased 2025-07-04 16:25:54 +04:00
ibrahimkhadraoui
3ee7983961 fix vocab size 2025-07-04 16:25:27 +04:00
younesbelkada
250b4f1074 mix instead of max 2025-07-04 15:53:47 +04:00
younesbelkada
1fd0574adc try 2025-07-04 15:50:43 +04:00
ibrahimkhadraoui
a6d0067dd7 Merge branch 'add-fh1-rebased' of https://github.com/tiiuae/llama.cpp-public into add-fh1-rebased 2025-07-04 15:37:44 +04:00
ibrahimkhadraoui
15138df48f small fix ffn_norm 2025-07-04 15:37:40 +04:00
younesbelkada
6c7d9e26e7 fix 2025-07-04 15:25:59 +04:00
ibrahimkhadraoui
d22b4ea425 Merge branch 'add-fh1-rebased' of https://github.com/tiiuae/llama.cpp-public into add-fh1-rebased 2025-07-04 15:10:11 +04:00
ibrahimkhadraoui
2fe057cc40 Revert "fix"
This reverts commit 243e4d1a50.
2025-07-04 15:04:13 +04:00
younesbelkada
22de62cf56 fix 2025-07-04 15:02:14 +04:00
younesbelkada
cce35498d5 pre-norm -> norm 2025-07-04 14:58:33 +04:00
younesbelkada
243e4d1a50 fix 2025-07-04 14:55:31 +04:00
younesbelkada
1415cd8782 another fix 2025-07-04 14:49:59 +04:00
younesbelkada
a39a8423f7 merge 2025-07-04 14:48:22 +04:00
younesbelkada
50eadc7b33 fixes 2025-07-04 14:47:31 +04:00
ibrahimkhadraoui
071f4b7fd8 changed precision for multipliers float 32->64 2025-07-04 14:37:02 +04:00
ibrahimkhadraoui
8bea92261e python fixes 2025-07-04 14:32:11 +04:00
younesbelkada
14c37ec047 more cleaning on python code 2025-07-03 18:09:30 +04:00
younesbelkada
fdd5cff4ba minor fix 2025-07-03 17:12:05 +04:00
younesbelkada
0c93ef6a9c more fixes 2025-07-03 15:26:33 +04:00
younesbelkada
03568c9358 fix 2025-07-03 15:10:18 +04:00
younesbelkada
71a6848e2d another fix 2025-07-03 15:08:23 +04:00
younesbelkada
f897efdaf6 push more fixes 2025-07-03 15:05:01 +04:00
younesbelkada
991de6cbe4 v1 2025-07-03 14:49:56 +04:00
Nicolò Scipione
7b63a71a6b Fix conditional enabling following arch checks for ggml-sycl (#14504)
Signed-off-by: nscipione <nicolo.scipione@codeplay.com>
b5819
2025-07-03 11:00:03 +02:00
Xuan-Son Nguyen
0c2ee38ab7 convert : correct gemma 3n conversion (#14450)
* convert : correct gemma 3n conversion

* rm redundant code
2025-07-03 10:03:06 +02:00
Georgi Gerganov
a70c8a0c4b kv-cache : use ggml_set_rows (#14285)
* kv-cache : use ggml_set_rows

ggml-ci

* graph : separate k and v indices

ggml-ci

* cont : remove redundant ifs

ggml-ci

* kv-cache : improve find_slot impl

* kv-cache : bounds-check when accessing slot_info indices

* kv-cache : add comments

ggml-ci

* ggml : add TODOs for adding GGML_OP_SET_ROWS support in the backends

ggml-ci
b5817
2025-07-03 10:53:35 +03:00
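
For context on the primitive this entry introduces into the kv-cache: `ggml_set_rows(ctx, a, b, c)` scatters the rows of a source tensor `b` into a destination tensor `a` at the row indices held in `c`, which is what lets the cache write new K/V data into arbitrary slots. Below is a minimal CPU-only sketch under that assumption; the shapes, values, and exact call sequence are illustrative, not the actual kv-cache code.

```cpp
#include <cstdint>
#include <cstdio>

#include "ggml.h"
#include "ggml-cpu.h"

int main() {
    ggml_init_params params = {
        /*.mem_size   =*/ 16 * 1024 * 1024,
        /*.mem_buffer =*/ nullptr,
        /*.no_alloc   =*/ false,
    };
    ggml_context * ctx = ggml_init(params);

    // Destination: an 8-slot "cache" of 4-wide rows; source: 2 new rows.
    ggml_tensor * cache = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4, 8);
    ggml_tensor * rows  = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4, 2);
    ggml_tensor * idx   = ggml_new_tensor_1d(ctx, GGML_TYPE_I64, 2);

    for (int i = 0; i < 4 * 2; ++i) {
        ((float *) rows->data)[i] = (float) i;
    }
    ((int64_t *) idx->data)[0] = 3; // write source row 0 into cache slot 3
    ((int64_t *) idx->data)[1] = 6; // write source row 1 into cache slot 6

    // Build and compute a one-op graph that performs the scatter.
    ggml_tensor * out = ggml_set_rows(ctx, cache, rows, idx);
    ggml_cgraph  * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, out);
    ggml_graph_compute_with_ctx(ctx, gf, /*n_threads =*/ 1);

    std::printf("slot 3, col 0: %f\n", ((float *) out->data)[3 * 4]);

    ggml_free(ctx);
    return 0;
}
```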
Georgi Gerganov
9067487c44 ggml : fix FA mask dim 2 and 3 (#14505)
* ggml : fix FA mask dim 2 and 3

ggml-ci

* backends : unsupport batched FA in CUDA and Vulkan

ggml-ci

* vulkan : disable FA for mask->ne[2] != 1
b5816
2025-07-03 10:46:57 +03:00
Georgi Gerganov
d4cdd9c1c3 ggml : remove kompute backend (#14501)
ggml-ci
b5815
2025-07-03 07:48:32 +03:00
Aman Gupta
55c2646b45 CUDA: add dynamic shared mem to softmax, refactor general usage (#14497) b5814 2025-07-03 07:45:11 +08:00
Sigbjørn Skjæret
e75ba4c043 gguf-py : add support for chat template jinja files (#14508)
* add support for chat template jinja files

* remove gemma3n hack
2025-07-02 21:02:35 +02:00
compilade
5d46babdc2 llama : initial Mamba-2 support (#9126)
* llama : initial Mamba-2 support

* ggml : SIMD ggml_ssm_scan for Mamba-2

* ggml : improve ggml_mul speed when masking recurrent states

* llama : support running Mamba-Codestral-7B-v0.1

* llama : fix Mamba-2 conv state saving

* ggml : make the ggml_mul fast broadcast path more consistently formatted

* llama : remove unused variable

* llama : add missing break

* convert_hf : prefer SentencePiece tokenizer for Mamba-2 when present

The tokenizer.json of Mamba-Codestral-7B-v0.1 otherwise requires
workarounds to work correctly.

* llama : avoid redundant state copy for Mamba 1 and 2

* metal : attempt to adapt SSM_SCAN for Mamba-2

* metal : fix SSM_SCAN pipeline scope

* metal : use log and exp instead of log1pf and expf in SSM_SCAN

* metal : remove unused arguments for SSM_SCAN

The max index is 31, so trimming the arguments is necessary.

* metal : add back n_seqs to SSM_SCAN args

Whoops, this is needed for the offset in the concatenated output.

* metal : fix SSM_SCAN state head offset

* metal : fix wrong number of tokens per sequence in SSM_SCAN

* ggml : remove unused fast broadcast path in GGML_MUL

This was initially added because states were masked with ggml_mul,
but this is no longer done and so this "optimisation" is no longer
necessary, or at least not worth the additional code complexity.

* ggml : avoid multiply by D in GGML_OP_SSM_SCAN

This makes the weight buft detection in src/llama.cpp simpler.

* convert : transpose Mamba-2 A, D and reshape SSM_NORM

This breaks existing conversions of Mamba-2 models
to avoid some reshapes.

Not sure if it's a good idea,
but it makes the graph slightly cleaner.

* llama : more appropriate SSM_SCAN and SSM_CONV buft support checks

* convert : fix flake8 lint

* metal : fix confusion between ; and ,

* metal : add missing args for nb references in ssm_scan_f32_group

* metal : single-user mamba2 inference works

* kv-cache : remove const_cast when setting inputs for s_copy

And also fix multi-user inference for recurrent models
by using cell_id instead of i as the kv cell index
when populating s_copy.

* convert : avoid AutoConfig for Mamba and Mamba2 hparams

* kv-cache : allow context shift for recurrent models

* graph : fix recurrent state copies when avoiding copies

Works, but using lambda functions might not be that clean.

* ggml : fix mamba2 ssm scan when compiled with SVE

* ggml-cpu : reorder SVE FMA for consistency with other SIMD arches

* cuda : implement ssm scan for Mamba2

There is still room for improvement, but it works!

* cuda : adapt Mamba1 ssm scan to shape changes from Mamba2

* mamba : fix mismatched new and delete size for llm_build_mamba

Subclasses of llm_graph_context cannot have extra fields,
because the destructor that gets called is not the one from the subclass.
This would otherwise cause problems when running Mamba-(1|2) inference
when compiled with -DGGML_SANITIZE_ADDRESS=ON (see the sketch after this entry).

* cuda : graceful fallback for Mamba-1 models with weird embd size
b5812
2025-07-02 13:10:24 -04:00
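
The new/delete size mismatch fixed in the entry above is a general C++ pitfall worth a concrete illustration. The sketch below uses hypothetical stand-in types, not the actual llama.cpp classes: deleting a derived object through a base-class pointer whose destructor is not virtual is undefined behavior, and AddressSanitizer reports it as a new-delete-type-mismatch because operator delete is invoked with the base class's size.

```cpp
#include <cstdio>

struct graph_context {
    int n_nodes = 0;
    ~graph_context() { std::puts("~graph_context"); } // non-virtual destructor
};

struct mamba_context : graph_context {
    float extra_state[64] = {};                        // extra fields enlarge the allocation
    ~mamba_context() { std::puts("~mamba_context"); }  // never runs below
};

int main() {
    graph_context * ctx = new mamba_context(); // allocates sizeof(mamba_context) bytes
    delete ctx; // UB: runs ~graph_context() and frees with the wrong size
    return 0;
}
```

Compiling with -fsanitize=address surfaces the mismatch at the delete; keeping subclasses free of extra fields (as the fix does) or making the base destructor virtual both avoid it.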
Georgi Gerganov
e17991c466 sync : ggml
ggml-ci
b5811
2025-07-02 20:08:45 +03:00
Daniel Bevenius
c46944aa25 ggml : add version function to get lib version (ggml/1286)
* ggml : add version function to get lib version

This commit adds a function `ggml_version()` to the ggml library that
returns the version of the library as a string.

The motivation for this is that it can be useful to be able to
programmatically check the version of the ggml library being used.

Usage:
```c
printf("GGML version: %s\n", ggml_version());
```
Output:
```console
GGML version: 0.0.2219
```

* ggml : add ggml_commit()

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-07-02 20:08:45 +03:00
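
Putting this entry's two additions together, a minimal usage sketch (assuming `ggml_commit()` returns a printable string like `ggml_version()` does, which the entry implies but does not show):

```cpp
#include <cstdio>

#include "ggml.h"

int main() {
    // Report which ggml build the program was linked against.
    std::printf("GGML version: %s\n", ggml_version());
    std::printf("GGML commit:  %s\n", ggml_commit());
    return 0;
}
```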
Rotem Dan
f3ed38d793 Set RPATH to "@loader_path" / "$ORIGIN" to ensure executables and dynamic libraries search for dependencies in their origin directory. (#14309) b5809 2025-07-02 18:37:16 +02:00
Aman Gupta
55a1c5a5fd CUDA: add softmax broadcast (#14475)
* CUDA: add softmax broadcast

* Pass by const ref

* Review: Use blockDims for indexing, remove designated initializers

* Add TODO for non-contiguous input/output
b5808
2025-07-02 15:48:33 +03:00
Johannes Gäßler
12a81af45f CUDA: broadcasting for FlashAttention mask (#14500) 2025-07-02 15:48:33 +03:00
Jeff Bolz
8875523eb3 vulkan: support softmax/FA batch and broadcast (#14449) 2025-07-02 15:48:33 +03:00
Georgi Gerganov
ec68e84c32 ggml : support bcast ggml_soft_max_ext, ggml_flash_attn_ext (#14435)
ggml-ci
2025-07-02 15:48:33 +03:00
zhouwg
307e79d33d opencl : fix possible buffer overflow in dump_tensor (#14490) b5804 2025-07-02 14:38:10 +02:00
Georgi Gerganov
d7f5f4e578 simple-chat : fix context-exceeded condition (#14494)
* simple-chat : fix context-exceeded condition

ggml-ci

* cont : fix n_ctx_used computation

ggml-ci
b5803
2025-07-02 14:12:07 +03:00
Eric Zhang
c8a4e470f6 opencl : skip empty nodes on cgraph compute (#14491) b5802 2025-07-02 13:00:04 +02:00