Francis Couture-Harpin
a60a24beed
Merge branch 'master' into compilade/refactor-kv-cache
2025-07-09 09:38:48 -04:00
Miaoqian Lin
26a48ad699
ggml : prevent integer overflow in gguf tensor size calculation ( #14595 )
b5854
2025-07-09 14:33:53 +02:00
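The overflow guard this commit describes follows the standard checked-multiplication pattern; a minimal sketch below (hypothetical helper, not the actual gguf code):

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Sketch: multiply tensor dimensions with an explicit overflow check
// instead of trusting the product to fit in 64 bits.
// Returns false and leaves *out untouched when the product would wrap.
static bool checked_mul(uint64_t a, uint64_t b, uint64_t * out) {
    if (a != 0 && b > std::numeric_limits<uint64_t>::max() / a) {
        return false; // a * b would overflow uint64_t
    }
    *out = a * b;
    return true;
}
```

A caller accumulates the tensor size by chaining `checked_mul` over each dimension and rejects the tensor as soon as any step reports overflow.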
Dowon
ffd59e7d18
model : add skt/A.X-4.0 model vocabulary ( #14589 )
b5853
2025-07-09 11:22:31 +03:00
Sigbjørn Skjæret
105554595f
llama : remove unintended whitespace ( #14592 )
b5852
2025-07-09 10:19:50 +02:00
ibrahim khadraoui
04655063c4
model : add support for Falcon-H1 family ( #14534 )
* v1
* push more fixes
* another fix
* fix
* more fixes
* minor fix
* more cleaning on python code
* python fixes
* changed precision for multipliers float 32->64
* fixes
* another fix
* fix
* pre-norm -> norm
* fix
* Revert "fix"
This reverts commit 243e4d1a50.
* fix
* small fix ffn_norm
* try
* mix instead of max
* fix vocab size
* conflict solve
* fixed multipliers
* falcon-h1 specific vocab resolved
* read arch from gguf.MODEL_ARCH
* mamba_d_ssm added to d_inner find_hparam
* remove unused functions from gguf_writer.py
* override modify_tensors instead of get_tensors
* fix conversion and d_inner
* added some cb functions for debugging purposes
* inp_out_ids moved outside of layers loop
* mup_vec create as float64
* fix rope_theta
* injected mup
* clean ups
* rm extra space
* rm unused MAMBA_CHUNK_SIZE
* rm unused key
* add bos False
* changed ROPE_TYPE
* cleaning debugging stuff
* cleaning debug quant
* fix comment
* some cleanups
* some cleanups
* Update src/llama-model-loader.cpp
* more cleanups
* moe cleanups
* d_ssm -> d_inner;
* cleaning unused hparams
* cleanup
* more cleanups
* more cleanups on python conversion;
* minor cleanups
* Apply suggestions from code review
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* remove todo
* added falcon-h1
* tensor not required
* clean
* remove unneeded attributes
* more cleanups and fixed conversion
* remove final_norm
* flake8 fixes
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* flake8 fixes
* Update src/llama-hparams.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-arch.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* added hashes
* Update src/llama-arch.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update src/llama-vocab.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* update the update file
* Revert "update the update file"
This reverts commit 082ab4ad2a.
* fix: address suggestions
* fix: update convert_hf_to_gguf.py
Update gguf-py/gguf/constants.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-model-loader.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* d_inner fixed
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* reshaping ssm_norm for 34B
* removing generate_mup
* remove duplicate metadata keys
* rm comment
* final comment
* fix unused args
* fix constants
* fix bad merge
* Update src/llama-model.cpp
Co-authored-by: compilade <git@compilade.net>
* falcon-h1: remove unused ssm_in_b and bad merge
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* falcon-h1: fix last comment
* Update convert_hf_to_gguf.py
Co-authored-by: compilade <git@compilade.net>
* falcon-h1: revert add_add_bos(False)
* falcon-h1: fix tied weights
* falcon-h1: remove whitespace
* falcon-h1: fix wrong size param
* falcon-h1: fix whitespace issues
---------
Co-authored-by: younesbelkada <younes.belkada@tii.ae>
Co-authored-by: Younes B <49240599+younesbelkada@users.noreply.github.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: compilade <git@compilade.net>
b5851
2025-07-09 10:03:49 +02:00
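One item above, "changed precision for multipliers float 32->64", reflects the usual single- vs double-precision trade-off; a small illustration (hypothetical, not Falcon-H1 code) of how float accumulates visibly more rounding error than double:

```cpp
#include <cassert>
#include <cmath>

// Illustration: repeatedly adding 0.1 drifts far more in float than in
// double, which is why precision-sensitive multipliers are kept in f64.
static float sum_f32(int n, float v) {
    float s = 0.0f;
    for (int i = 0; i < n; ++i) s += v; // each add rounds to a 24-bit mantissa
    return s;
}
static double sum_f64(int n, double v) {
    double s = 0.0;
    for (int i = 0; i < n; ++i) s += v; // 53-bit mantissa: far less drift
    return s;
}
```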
Xuan-Son Nguyen
20b7bf8a32
convert : fix smollm3 jinja template ( #14586 )
2025-07-09 09:26:13 +03:00
Francis Couture-Harpin
f7c7a926f0
model : use ggml_swiglu_split for Mamba
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-07-08 15:45:42 -04:00
Francis Couture-Harpin
2f39cd7bb7
model : remove unnecessary prefix for tensor loading constants
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-07-08 15:37:49 -04:00
Francis Couture-Harpin
db5ff0cc6b
jamba : remove redundant nullptr initializations
2025-07-08 15:15:49 -04:00
Francis Couture-Harpin
b0b280ea28
Merge branch 'master' into compilade/refactor-kv-cache
2025-07-08 15:09:02 -04:00
Jeff Bolz
6efcd65945
vulkan: optimize flash attention split_k_reduce ( #14554 )
* vulkan: allow FA split_k with smaller KV values
* vulkan: spread split_k_reduce work across more threads
k_num can get rather large. Use the whole workgroup to reduce the M/L values.
Launch a thread for each element in the HSV dimension of the output. Helps a
lot for large HSV (like deepseek).
b5849
2025-07-08 20:11:42 +02:00
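The reduction strategy described above (one thread per output element combining k_num partial results) can be sketched serially. This is an illustrative C++ model of the data layout, not the Vulkan shader, and it sums plain partials where the real FA reduce also rescales by the per-split max/log-sum (M/L) statistics:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// partials holds k_num chunks of n_elems values each; every output element
// combines its k_num partial results. In the shader, one thread handles one
// element of the HSV output dimension.
static std::vector<float> reduce_split_k(const std::vector<float> & partials,
                                         size_t n_elems, size_t k_num) {
    std::vector<float> out(n_elems, 0.0f);
    for (size_t e = 0; e < n_elems; ++e) {
        for (size_t k = 0; k < k_num; ++k) {
            out[e] += partials[k*n_elems + e];
        }
    }
    return out;
}
```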
stevenkuang
699f4392a3
model : fix hunyuan moe chat template ( #14584 )
Signed-off-by: stevenkuang <stevenkuang@tencent.com>
b5848
2025-07-08 18:29:29 +02:00
Xuan-Son Nguyen
08382869a2
model : add SmolLM3 ( #14581 )
* Init - first pass.
* Model -> ModelBase.
* fix errors in conversion.
* Update the graph.
* up.
* up.
* wip
* cgraph ok
* rm redundant code
---------
Co-authored-by: Vaibhavs10 <vaibhavs10@gmail.com>
b5847
2025-07-08 18:07:01 +02:00
compilade
bb4f7a9e4e
memory : fix broken batch splits for recurrent cache ( #14575 )
Splits producing more than one ubatch per batch for recurrent models
were broken with #14512.
This fixes it by moving the completeness check after the ubatch split loop.
b5846
2025-07-08 18:37:47 +03:00
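The fix described above, validating completeness after the split loop rather than inside it, is a common restructure; a hypothetical sketch, not the actual llama.cpp batch code:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct ubatch { size_t n_tokens; };

// Split first, then verify: a batch may legitimately produce several
// ubatches, so the completeness check belongs after the loop, where it
// only asserts that every token was consumed.
static std::vector<ubatch> split_batch(size_t n_tokens, size_t n_ubatch_max) {
    std::vector<ubatch> ubatches;
    size_t used = 0;
    while (used < n_tokens) {
        const size_t n = std::min(n_ubatch_max, n_tokens - used);
        ubatches.push_back({n});
        used += n;
    }
    assert(used == n_tokens); // completeness check, moved after the loop
    return ubatches;
}
```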
Jeff Bolz
b8eeb8741d
vulkan : fix rope with partial rotation and non-cont src ( #14582 )
b5845
2025-07-08 15:21:21 +02:00
Alawode Oluwandabira
17a1f0d2d4
server: Add ability to mount server at prefix ( #14544 )
* Add server_prefix
* Correct server path env
* Rename cli flag to --api-prefix
* Change all to api_prefix
b5844
2025-07-08 11:47:33 +03:00
Xuan-Son Nguyen
8f22dc0a53
model : add hunyuan moe ( #14425 )
* model : add hunyuan moe
* tokenizer ok
* fix tensor name
* cgraph init
* chat template
* wip
* almost working
* skip embed, fix bos
* cleanup
* yarn scaling
* cleanup
* correct rope type
* failed token fix
* ntk alpha freq_base
* tokenization working
* cleanup and pr changes
* vocab_size sanity check
* ntk alpha generic
* Update convert_hf_to_gguf.py
* Apply suggestions from code review
* fix regression
* fix style
---------
Co-authored-by: kooshi <1934337+kooshi@users.noreply.github.com>
b5843
2025-07-08 11:24:06 +03:00
Jeff Bolz
53903ae6fa
vulkan: increase timeout for CI ( #14574 )
2025-07-08 09:38:31 +02:00
Georgi Gerganov
4d0dcd4a06
cuda : fix rope with partial rotation and non-cont src ( #14580 )
* cuda : fix rope non-cont
ggml-ci
* cont : fix multi-rope + add test
ggml-ci
* sycl : try fix
ggml-ci
* cont : fix sycl + clean-up cuda
ggml-ci
b5841
2025-07-08 10:15:21 +03:00
Aman Gupta
75c91de6e9
CUDA: add bilinear interpolation for upscale ( #14563 )
b5840
2025-07-08 10:11:18 +08:00
R0CKSTAR
68155c66f0
musa: fix build warnings (unused variable) ( #14561 )
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
b5839
2025-07-08 07:58:30 +08:00
Sigbjørn Skjæret
e1a7059053
llama : fix incorrect minicpm3 v_states shape ( #14571 )
b5838
2025-07-07 23:35:35 +02:00
Sigbjørn Skjæret
12f55c302b
llama : remove ggml_cont where possible ( #14568 )
b5837
2025-07-07 21:35:08 +02:00
Francis Couture-Harpin
f71635824b
Merge branch 'master' into compilade/refactor-kv-cache
2025-07-07 14:57:56 -04:00
Aman Gupta
b9c3eefde1
CUDA: add bf16 and i32 to getrows ( #14529 )
b5836
2025-07-07 21:45:43 +08:00
Eve
6491d6e4f1
vulkan: increase LOAD_VEC_A to 8 (IQ1/IQ2) or 4 (IQ3) ( #14485 )
Commit taken from remyoudompheng's PR https://github.com/ggml-org/llama.cpp/pull/12260
Co-authored-by: Rémy Oudompheng <remyoudompheng@gmail.com>
b5835
2025-07-06 12:29:36 +02:00
Jeff Bolz
e592be1575
vulkan: fix rms_norm+mul fusion ( #14545 )
The fused operation was grabbing the epsilon value from the wrong place.
Add an env var to disable fusion.
Add some missing checks for supported shapes/types.
Handle fused rms_norm+mul in check_results.
b5834
2025-07-06 10:08:16 +02:00
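The "env var to disable fusion" part of this commit is a common kill-switch pattern; a minimal sketch, with an illustrative variable name rather than the one the Vulkan backend actually reads:

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Fusion stays on unless the (hypothetical) variable is set to something
// other than "0", giving users an escape hatch when a fused kernel misbehaves.
static bool fusion_enabled() {
    const char * v = std::getenv("EXAMPLE_DISABLE_FUSION");
    return v == nullptr || std::strcmp(v, "0") == 0;
}
```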
Jeff Bolz
a0374a67e2
vulkan: Handle updated FA dim2/3 definition ( #14518 )
* vulkan: Handle updated FA dim2/3 definition
Pack mask boolean and n_head_log2 into a single dword to keep the push
constant block under the 128B limit.
* handle null mask for gqa
* allow gqa with dim3>1
b5833
2025-07-05 09:26:04 +02:00
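Packing a boolean and n_head_log2 into one dword, as this commit does to stay under the 128-byte push-constant limit, looks roughly like this; the bit layout below is assumed for illustration, not taken from the shader:

```cpp
#include <cassert>
#include <cstdint>

// One 32-bit word carries both fields: the mask flag in bit 16 and
// n_head_log2 in the low 16 bits.
static uint32_t pack_mask_nhead(bool has_mask, uint32_t n_head_log2) {
    return (uint32_t(has_mask) << 16) | (n_head_log2 & 0xFFFFu);
}
static bool     unpack_mask (uint32_t w) { return ((w >> 16) & 1u) != 0; }
static uint32_t unpack_nhead(uint32_t w) { return w & 0xFFFFu; }
```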
Sigbjørn Skjæret
ddef99522d
server : fix assistant prefilling when content is an array ( #14360 )
b5832
2025-07-05 09:17:14 +02:00
Sigbjørn Skjæret
6681688146
opencl: add GELU_ERF ( #14476 )
b5831
2025-07-04 23:24:56 -07:00
Georgi Gerganov
bac8bed248
eval-callback : check for empty input ( #14539 )
b5830
2025-07-05 07:18:09 +03:00
R0CKSTAR
b81510a7b7
test-backend-ops: add support for specifying output format ( #14368 )
* test-backend-ops: add support for specifying output format
Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>
* Address review comments
Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>
* Add build_commit and build_number in test_result
Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>
* Address review comments
Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>
* refactor
Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>
* Get build commit from ggml_commit()
Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>
* Merge errors into test_operation_info && address review comments
Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>
* Address review comments
Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>
* Address review comments
Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>
* remove visitor nonsense
* remove visitor comment
Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>
* Address review comments
Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>
---------
Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>
b5829
2025-07-05 12:10:53 +08:00
Georgi Gerganov
ef797db357
metal : disable fast math in all quantize kernels ( #14528 )
ggml-ci
b5828
2025-07-04 19:19:09 +03:00
Georgi Gerganov
67d1ef23c6
batch : add optional for sequential equal split ( #14511 )
ggml-ci
b5827
2025-07-04 09:08:59 +03:00
Georgi Gerganov
7b50f7c025
graph : prepare for 4D mask ( #14515 )
ggml-ci
b5826
2025-07-04 09:05:36 +03:00
Georgi Gerganov
c79184d2d1
batch : add n_used count ( #14512 )
ggml-ci
b5825
2025-07-04 09:04:59 +03:00
luyhcsu
499a8f5a78
CANN: Replace aclrtMemsetSync with aclnnInplaceZero operator ( #14002 )
Co-authored-by: luyuhong <luyuhong@kylinos.cn>
b5824
2025-07-04 11:50:07 +08:00
Francis Couture-Harpin
07c252f038
model : add Jamba to Mamba-specific hparams printing
2025-07-03 17:13:18 -04:00
Francis Couture-Harpin
20f8e43e63
graph : add back hybrid memory graph input
But this time it contains the sub-cache graph inputs.
This *should* make it easier to handle updating the inputs
when caching the graph (eventually).
2025-07-03 17:07:46 -04:00
Sigbjørn Skjæret
28657a8229
ggml : implement GEGLU_ERF and GEGLU_QUICK ops ( #14445 )
b5823
2025-07-03 23:07:22 +02:00
Francis Couture-Harpin
4682e21c46
Merge branch 'master' into compilade/refactor-kv-cache
2025-07-03 16:04:55 -04:00
lhez
bee28421be
opencl : broadcast for soft_max ( #14510 )
b5822
2025-07-03 20:22:24 +02:00
Jeff Bolz
2b72bedec1
vulkan: support mixed/deepseekR1 FA head sizes ( #14509 )
* vulkan: better parameterize FA by head sizes
* vulkan: support mixed/deepseekR1 FA head sizes
b5821
2025-07-03 20:21:14 +02:00
Johannes Gäßler
c8c4495b8d
ggml: backward pass for split swiglu ( #14483 )
b5820
2025-07-03 17:05:18 +02:00
Nicolò Scipione
7b63a71a6b
Fix conditional enabling following arch checks for ggml-sycl ( #14504 )
Signed-off-by: nscipione <nicolo.scipione@codeplay.com>
b5819
2025-07-03 11:00:03 +02:00
Xuan-Son Nguyen
0c2ee38ab7
convert : correct gemma 3n conversion ( #14450 )
* convert : correct gemma 3n conversion
* rm redundant code
2025-07-03 10:03:06 +02:00
Georgi Gerganov
a70c8a0c4b
kv-cache : use ggml_set_rows ( #14285 )
* kv-cache : use ggml_set_rows
ggml-ci
* graph : separate k and v indices
ggml-ci
* cont : remove redundant ifs
ggml-ci
* kv-cache : improve find_slot impl
* kv-cache : bounds-check when accessing slot_info indices
* kv-cache : add comments
ggml-ci
* ggml : add TODOs for adding GGML_OP_SET_ROWS support in the backends
ggml-ci
b5817
2025-07-03 10:53:35 +03:00
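The "bounds-check when accessing slot_info indices" point above is defensive validation before indexing; a hypothetical sketch of the idea, not the kv-cache code itself:

```cpp
#include <cassert>
#include <vector>

// Reject any slot index outside [0, n_cells) before it is used to address
// cache cells; an out-of-range index indicates a bad slot assignment.
static bool slots_valid(const std::vector<int> & idxs, int n_cells) {
    for (int i : idxs) {
        if (i < 0 || i >= n_cells) {
            return false;
        }
    }
    return true;
}
```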
Georgi Gerganov
9067487c44
ggml : fix FA mask dim 2 and 3 ( #14505 )
* ggml : fix FA mask dim 2 and 3
ggml-ci
* backends : unsupport batched FA in CUDA and Vulkan
ggml-ci
* vulkan : disable FA for mask->ne[2] != 1
b5816
2025-07-03 10:46:57 +03:00
Georgi Gerganov
d4cdd9c1c3
ggml : remove kompute backend ( #14501 )
ggml-ci
b5815
2025-07-03 07:48:32 +03:00
Francis Couture-Harpin
908e6559d6
convert : fix jamba conv1d shape squeezing
2025-07-02 23:49:12 -04:00