Commit Graph

5992 Commits

Author SHA1 Message Date
Aaron Teo
63fbc45ed6 ggml-zdnn: switch to std vector instead of array
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-07-24 01:09:01 +08:00
Aaron Teo
b7f4b6fde3 ggml-zdnn: rework init_tensor to create new buffers
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-07-24 01:03:53 +08:00
Aaron Teo
ee0ed78d54 ggml-zdnn: add check for view tensors to prevent init_tensor
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-07-24 00:56:32 +08:00
Aaron Teo
13c64448bd ggml-zdnn: assign tensor->extra to buffer
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-07-24 00:48:32 +08:00
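
The ggml-zdnn entries above (view-tensor check, tensor->extra assignment, init_tensor rework) follow a pattern common to ggml backends: when a buffer initializes a tensor, backend-specific state is allocated and attached to tensor->extra, while view tensors are skipped because they alias their parent's storage. A minimal self-contained sketch of that pattern, using stand-in struct and function names rather than the actual zdnn code:

    #include <cstdint>
    #include <vector>

    // Simplified stand-ins for ggml types (illustrative only).
    struct ggml_tensor_stub {
        void * data  = nullptr;
        void * extra = nullptr;                 // backend-specific per-tensor state
        ggml_tensor_stub * view_src = nullptr;  // non-null for view tensors
    };

    // Hypothetical per-tensor state a zdnn-like backend might keep.
    struct zdnn_tensor_extra_stub {
        std::vector<uint8_t> transformed;       // e.g. a pre-transformed ztensor copy
    };

    // Hypothetical buffer context owning all extras so they can be freed together.
    struct zdnn_buffer_ctx_stub {
        std::vector<zdnn_tensor_extra_stub *> extras;
        ~zdnn_buffer_ctx_stub() {
            for (auto * e : extras) delete e;
        }
    };

    // Sketch of an init_tensor hook: skip views, allocate an extra, attach it.
    void init_tensor_sketch(zdnn_buffer_ctx_stub & ctx, ggml_tensor_stub * t) {
        if (t->view_src != nullptr) {
            return; // views alias their parent's storage: nothing to initialize
        }
        auto * extra = new zdnn_tensor_extra_stub();
        ctx.extras.push_back(extra);
        t->extra = extra; // later ops look up backend state via tensor->extra
    }
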
Aaron Teo
13c05872f2 ggml-zdnn: implement at least 1 op to test
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-07-24 00:44:05 +08:00
Aaron Teo
9e84742e72 ggml-zdnn: test ztensor finding in init_tensor
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-07-24 00:40:22 +08:00
Aaron Teo
af9f4f0039 ggml-zdnn: fix compiler warnings and bugfixes
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-07-24 00:25:41 +08:00
Aaron Teo
ae2f656d7e ggml-zdnn: bugfix new impl
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-07-24 00:18:53 +08:00
Aaron Teo
7c6395f826 ggml-zdnn: rewrite the backend implementation
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-07-24 00:14:45 +08:00
Aaron Teo
04ddb2ac95 ggml-zdnn: update op out_prod to use tensor->extra
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-07-23 19:51:37 +08:00
Aaron Teo
77a753297b ggml-zdnn: support op out_prod
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-07-23 19:28:51 +08:00
Aaron Teo
11d58d29de ggml-zdnn: add comments to prevent accidentally deleting lines
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-07-23 14:54:44 +08:00
Aaron Teo
529bdb9fbd ggml-zdnn: last working matmul version
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-07-22 00:29:47 +08:00
Aaron Teo
60b9874dea ggml-zdnn: update set_tensor logging to check only for matmul
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-07-21 21:11:39 +08:00
Aaron Teo
b9756b6dd4 ggml-zdnn: add more loggers
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-07-21 21:09:21 +08:00
Aaron Teo
1989fc9bf4 ggml-zdnn: add set_tensor
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-07-21 20:37:53 +08:00
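
set_tensor is the backend hook that copies host bytes into a device tensor. A hedged sketch of what such a hook typically does, with placeholder types instead of the real zdnn signatures:

    #include <cstddef>
    #include <cstdint>
    #include <cstring>
    #include <vector>

    // Stand-in for a device-side allocation (illustrative; a real backend would
    // call its runtime's copy routine instead of memcpy).
    struct device_buffer_stub {
        std::vector<uint8_t> storage;
    };

    // Sketch of a set_tensor-style hook: write `size` bytes of host `data` into
    // the buffer at byte `offset`, then refresh any transformed device copy.
    void set_tensor_sketch(device_buffer_stub & buf, size_t offset,
                           const void * data, size_t size) {
        if (offset + size > buf.storage.size()) {
            buf.storage.resize(offset + size);
        }
        std::memcpy(buf.storage.data() + offset, data, size);
        // A real backend would re-run its layout transform here so the device
        // representation (e.g. a ztensor) stays in sync with the raw bytes.
    }
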
Aaron Teo
36d76c30fb ggml-zdnn: run compute and store into tensor->extra
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-07-21 20:30:54 +08:00
Aaron Teo
02cfcfb270 ggml-zdnn: add output buffer check
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-07-21 20:19:20 +08:00
Aaron Teo
fd4914b060 ggml-zdnn: tensor->extra logging check
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

ggml-zdnn: add layout name mapping, ztensor information

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

ggml-zdnn: separate logging into its own line

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

ggml-zdnn: add shape comparison

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

ggml-zdnn: add ggml_tensor shape log

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

ggml-zdnn: fix incorrect shape logging

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-07-18 21:00:30 +08:00
Aaron Teo
e084821a3f ggml-zdnn: initial backend impl
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

ggml-zdnn: temp change z17 to arch15

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

ggml-zdnn: fix build bugs

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-07-18 20:04:20 +08:00
Georgi Gerganov
01612b7409 llama : reuse compute graphs (#14482)
* llama : reuse compute graphs

ggml-ci

* llama-bench : add graph reuse parameter

ggml-ci

* cont : remove the parameter and the sched resets

ggml-ci

* graph : rename update() to can_reuse()

ggml-ci

* params : remove is_same()

ggml-ci

* graph : set res->params in llm_graph_context constructor

ggml-ci

* graph : avoid set_max_nodes in llm_graph_result

ggml-ci

* kv-cache : reuse llama_context's graph result instance

ggml-ci

* context : reset the previous graph result upon memory updates

ggml-ci

* batch : llama_ubatch now carries its data instead of pointing to balloc

ggml-ci

* merge : fix build

ggml-ci

* graph : fix can_reuse() checks when flash-attention is disabled

* graph : move llm_graph_result impl in source file + debug env

ggml-ci
b5922
2025-07-17 19:08:33 +03:00
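
The graph-reuse change above rests on one idea: keep the previous graph result together with the parameters it was built for, and rebuild only when those parameters change (can_reuse()). The sketch below captures that shape with hypothetical names; the real implementation in llama.cpp checks considerably more state.

    #include <cstdint>
    #include <memory>

    // Hypothetical parameter set a graph result might be keyed on.
    struct graph_params_sketch {
        int64_t n_tokens  = 0;
        int64_t n_outputs = 0;
        bool    causal    = true;

        bool operator==(const graph_params_sketch & o) const {
            return n_tokens == o.n_tokens && n_outputs == o.n_outputs && causal == o.causal;
        }
    };

    struct graph_result_sketch {
        graph_params_sketch params;

        // Analogous to can_reuse(): the cached graph is valid only if the new
        // parameters match the ones it was built with.
        bool can_reuse(const graph_params_sketch & new_params) const {
            return params == new_params;
        }
    };

    struct context_sketch {
        std::unique_ptr<graph_result_sketch> prev;

        graph_result_sketch * get_graph(const graph_params_sketch & p) {
            if (prev && prev->can_reuse(p)) {
                return prev.get();          // reuse: skip graph construction
            }
            prev = std::make_unique<graph_result_sketch>();
            prev->params = p;               // rebuild and remember the params
            return prev.get();
        }
    };
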
Tarek Dakhran
086cf81e88 llama : fix parallel processing for lfm2 (#14705) b5921 2025-07-17 09:22:11 +02:00
Georgi Gerganov
d9b691081c kv-cache : opt mask set input (#14600)
ggml-ci
b5920
2025-07-17 09:49:15 +03:00
Georgi Gerganov
ad57d3edd2 batch : fix uninitialized has_cpl flag (#14733)
ggml-ci
b5919
2025-07-17 09:45:54 +03:00
Sigbjørn Skjæret
1ba45d4982 ci : disable failing vulkan crossbuilds (#14723) 2025-07-16 20:52:08 -03:00
Sigbjørn Skjæret
19e5943d9e convert : make hf token optional (#14717)
* make hf token optional

* fail if we can't get necessary tokenizer config
2025-07-16 23:17:43 +02:00
Diner Burger
496957e1cb llama : fix parameter order for hybrid memory initialization (#14725) b5916 2025-07-16 21:17:25 +02:00
Reese Levine
21c021745d ggml: Add initial WebGPU backend (#14521)
* Minimal setup of webgpu backend with dawn. Just prints out the adapter and segfaults

* Initialize webgpu device

* Making progress on setting up the backend

* Finish more boilerplate/utility functions

* Organize file and work on alloc buffer

* Add webgpu_context to prepare for actually running some shaders

* Work on memset and add shader loading

* Work on memset polyfill

* Implement set_tensor as webgpu WriteBuffer, remove host_buffer stubs since webgpu doesn't support it

* Implement get_tensor and buffer_clear

* Finish rest of setup

* Start work on compute graph

* Basic mat mul working

* Work on emscripten build

* Basic WebGPU backend instructions

* Use EMSCRIPTEN flag

* Work on passing ci, implement 4d tensor multiplication

* Pass thread safety test

* Implement permuting for mul_mat and cpy

* minor cleanups

* Address feedback

* Remove division by type size in cpy op

* Fix formatting and add github action workflows for vulkan and metal (m-series) webgpu backends

* Fix name

* Fix macos dawn prefix path
b5915
2025-07-16 18:18:51 +03:00
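
One bullet above maps ggml's set_tensor onto WebGPU's queue upload. Assuming the Dawn C++ bindings (webgpu_cpp.h), the core of such a hook reduces to a single WriteBuffer call; the surrounding names are placeholders, not the actual ggml-webgpu code.

    #include <webgpu/webgpu_cpp.h>  // Dawn C++ bindings (assumed available)
    #include <cstddef>
    #include <cstdint>

    // Sketch: a set_tensor-style upload implemented with Queue::WriteBuffer.
    // `device`, `buf` and `offset` would come from the backend's buffer context.
    void webgpu_set_tensor_sketch(wgpu::Device & device, wgpu::Buffer & buf,
                                  uint64_t offset, const void * data, size_t size) {
        // Note: WebGPU requires the write size to be a multiple of 4 bytes, so a
        // real implementation pads or stages the tail bytes separately.
        device.GetQueue().WriteBuffer(buf, offset, data, size);
    }
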
tempstudio
b0f0ecc3dc model : support output bias for qwen2 (#14711)
Co-authored-by: qwaqrm <qwaqrm@126.com>
b5914
2025-07-16 18:02:06 +03:00
Georgi Gerganov
225e7a1438 llama : add high-throughput mode (#14363)
* kv-cache : prepare K/V buffers for separation

ggml-ci

* batched-bench : fix oob write

ggml-ci

* llama : add "virtual sequences"

ggml-ci

* llama : use "stream" vs "virtual sequence"

ggml-ci

* graph : fix stream splitting when KV cache is not used

ggml-ci

* kv-cache : add multi-stream save/load support

ggml-ci

* llama : add "--attn-streams" flag

ggml-ci

* kv-cache : fix handling when find_slot fails

ggml-ci

* kv-cache : restore find_slot impl

ggml-ci

* kv-cache : add comments

* kv-cache : add bounds checks for sequence id

ggml-ci

* cont : add n_seq_max to batch allocr

ggml-ci

* kv-cache : perform stream copies lazily after llama_synchronize

ggml-ci

* kv-cache : avoid throwing exceptions across the C boundary

ggml-ci

* CUDA: 4D FlashAttention support (#14628)

* CUDA: 4D FlashAttention support

* CUDA: fix WMMA FA kernel

* llama : rename attn_streams -> kv_unified

ggml-ci

* common : rename kv_split -> kv_unified

ggml-ci

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
b5913
2025-07-16 16:35:42 +03:00
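
The high-throughput commit above splits the KV cache into per-sequence streams unless the unified layout is requested. The sequence-to-stream mapping can be stated in a few lines; this is a hedged restatement of the idea, not the actual kv-cache code.

    #include <cstdint>
    #include <cassert>

    // Sketch: with a unified KV cache all sequences share stream 0; otherwise
    // each sequence id gets its own stream, so batches from different sequences
    // can be processed in parallel without interleaving their KV data.
    uint32_t seq_to_stream_sketch(bool kv_unified, uint32_t seq_id, uint32_t n_stream) {
        if (kv_unified) {
            return 0;
        }
        assert(seq_id < n_stream);
        return seq_id;
    }
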
Aman Gupta
ab14019821 Support diffusion models: Add Dream 7B (#14644)
* Support diffusion models: Add Dream 7B

* Move diffusion to examples

* Move stuff to examples. Add patch to not use kv-cache

* Address review comments

* Make sampling fast

* llama: remove diffusion functions

* Add basic timings + cleanup

* More cleanup

* Review comments: better formatting, use LOG instead of std::cerr, re-use batch, use ubatch instead of max_length

* fixup!

* Review: move everything to diffusion-cli for now
b5912
2025-07-16 20:03:51 +08:00
Georgi Gerganov
64978340b0 ggml : add asserts (#14720)
* ggml : add asserts

ggml-ci

* cont : fix constant type

Co-authored-by: Diego Devesa <slarengh@gmail.com>

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
b5911
2025-07-16 14:43:32 +03:00
Georgi Gerganov
6ffd4e9c44 server : pre-calculate EOG logit biases (#14721)
ggml-ci
b5910
2025-07-16 14:04:12 +03:00
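
Pre-calculating EOG (end-of-generation) logit biases means scanning the vocabulary once for EOG tokens and caching (token, bias) pairs instead of re-deriving them per request. A small hedged sketch of that caching step, with a predicate standing in for the real vocab query:

    #include <cstdint>
    #include <functional>
    #include <vector>

    using llama_token_sketch = int32_t;

    struct logit_bias_sketch {
        llama_token_sketch token;
        float              bias;
    };

    // Sketch: build the EOG bias list once, up front, given a predicate that
    // tells whether a token is end-of-generation (EOS, EOT, ...).
    std::vector<logit_bias_sketch> precalc_eog_biases_sketch(
            int32_t n_vocab,
            const std::function<bool(llama_token_sketch)> & is_eog,
            float bias) {
        std::vector<logit_bias_sketch> out;
        for (llama_token_sketch tok = 0; tok < n_vocab; ++tok) {
            if (is_eog(tok)) {
                out.push_back({tok, bias}); // applied later on every sampling step
            }
        }
        return out;
    }
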
Shunta Saito
e4841d24d3 llama : fix parallel processing for plamo2 (#14716) b5909 2025-07-16 12:12:22 +02:00
Georgi Gerganov
538cc77f7f server : fix handling of the ignore_eos flag (#14710)
ggml-ci
b5908
2025-07-16 12:13:57 +03:00
Johannes Gäßler
5cae766541 scripts: synthetic prompt mode for server-bench.py (#14695) 2025-07-16 09:33:28 +02:00
Sigbjørn Skjæret
4b91d6f71f convert : only check for tokenizer folder if we need it (#14704) 2025-07-16 08:52:04 +02:00
Sigbjørn Skjæret
cf91f217f1 convert : add pre-computed hashes first to prevent order mishaps (#14701) 2025-07-16 08:51:12 +02:00
Min-Hua
79e0b68c17 llama: add LLAMA_API to deprecated llama_kv_self_seq_div (#14708)
Add LLAMA_API to fix the run-time error with llama-cpp-python in a Windows environment:
AttributeError: function 'llama_kv_self_seq_div' not found.
Did you mean: 'llama_kv_self_seq_add'?

Although llama_kv_self_seq_div() has been marked deprecated,
it still needs to be exported to keep llama-cpp-python working.

Observed software version:
OS: windows
compiler: MSVC
llama-cpp-python: tag: v0.3.12-cu124
llama.cpp: tag: b5833

Signed-off-by: Min-Hua Chen <minhuadotchen@gmail.com>
Co-authored-by: Min-Hua Chen <minhua.chen@neuchips.ai>
b5904
2025-07-16 07:00:42 +03:00
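
The fix above is needed because on Windows a deprecated function must still carry the export attribute, otherwise the symbol is dropped from the DLL and bindings such as llama-cpp-python cannot resolve it at run time. The snippet below sketches how an export macro and a deprecation attribute combine on one declaration; the macro definitions are simplified stand-ins for the ones in llama.h.

    // Simplified stand-ins for the real export/deprecation macros.
    #if defined(_WIN32) && defined(BUILD_SHARED)
    #    define LLAMA_API_SKETCH __declspec(dllexport)
    #else
    #    define LLAMA_API_SKETCH
    #endif

    #if defined(_MSC_VER)
    #    define DEPRECATED_SKETCH(decl, hint) __declspec(deprecated(hint)) decl
    #elif defined(__GNUC__)
    #    define DEPRECATED_SKETCH(decl, hint) decl __attribute__((deprecated(hint)))
    #else
    #    define DEPRECATED_SKETCH(decl, hint) decl
    #endif

    // Without the export macro the declaration still compiles, but the symbol is
    // not exported from the Windows DLL, so callers that load the library
    // dynamically cannot find it.
    DEPRECATED_SKETCH(LLAMA_API_SKETCH void llama_kv_self_seq_div_sketch(int seq, int d),
                      "deprecated: illustrative hint text");
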
Ed Addario
c81f4192f9 gguf-py : dump bpw per layer and model in markdown mode (#14703) 2025-07-16 00:04:42 +02:00
Gabriel Larson
4a4f426944 model : add Kimi-K2 support (#14654)
* Kimi-K2 conversion

* add Kimi_K2  pre type

* Kimi-K2

* Kimi-K2 unicode

* Kimi-K2

* LLAMA_MAX_EXPERTS 384

* fix vocab iteration

* regex space fix

* add kimi-k2 to pre_computed_hashes

* Updated with kimi-k2 get_vocab_base_pre hash

* fix whitespaces

* fix flake errors

* remove more unicode.cpp whitespaces

* change set_vocab() flow

* add moonshotai-Kimi-K2.jinja to /models/templates/

* update moonshotai-Kimi-K2.jinja

* add kimi-k2 chat template

* add kimi-k2

* update NotImplementedError

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* except Exception

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* LLM_CHAT_TEMPLATE_KIMI_K2 if(add_ass){}

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
b5902
2025-07-15 21:54:22 +02:00
Jeff Bolz
ba1ceb3456 vulkan: fix noncontig check for mat_mul_id splitting (#14683)
* vulkan: fix noncontig check for mat_mul_id splitting

Remove supports_op check for > 4096 (splitting fixes this)

* vulkan: fix batched matmul dequant for Q*_K
b5901
2025-07-15 21:51:09 +02:00
Jeff Bolz
10a0351a97 vulkan: add RTE variants for glu/add/sub/mul/div (#14653) b5900 2025-07-15 21:32:11 +02:00
Shunta Saito
68e37a61a7 model : add PLaMo-2 support (#14560)
* Add PLaMo-2 model using hybrid memory module

* Fix z shape

* Add cmath to include from llama-vocab.h

* Explicitly dequantize normalization weights before RoPE apply

* Revert unnecessary cast because the problem can be solved by excluding attn_k, attn_q when quantizing

* Use ATTN_K/Q_NORM for k,q weights to prevent quantization

* Remove SSM_BCDT that is not used from anywhere

* Do not duplicate embedding weights for output.weight

* Fix tokenizer encoding problem for multibyte strings

* Apply suggestion from @CISC

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Use LLM_FFN_SWIGLU instead of splitting ffn_gate and ffn_up

* Remove unnecessary part for Grouped Query Attention

* Fix how to load special token id to gguf

* Remove unused tensor mapping

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Remove llama_vocab_plamo2 class and replace it with llm_tokenizer_plamo2_session to follow the other tokenizer implementations

* Update src/llama-vocab.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Fix plamo2 tokenizer session to prevent multiple calls of build()

---------

Co-authored-by: Francis Couture-Harpin <git@compilade.net>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b5899
2025-07-15 18:11:42 +02:00
R0CKSTAR
cbc68be51d cuda: fix build warnings in set-rows.cu (unused variable) (#14687)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
b5898
2025-07-15 15:28:53 +08:00
Anton Mitkov
bdca38376f sycl: Hotfix for non dnnl codepath (#14677) b5897 2025-07-14 18:12:42 +01:00
shalinib-ibm
55c509daf5 ggml : refactor llamafile_sgemm PPC code (#14673)
Remove unnecessary templates from the class definition and packing functions
Reduce deeply nested conditionals and if-else switching in the mnpack function
Replace repetitive code with inline functions in the packing functions

2 ~ 7% improvement for Q8 models
15 ~ 50% improvement for Q4 models

Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>
b5896
2025-07-14 16:16:42 +03:00
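
The refactor above is a de-duplication: repeated packing loops become one inline helper parameterized on tile size, which also flattens the nested if/else dispatch. A generic hedged sketch of the shape of that change, not the actual PPC/MMA code:

    #include <cstddef>
    #include <vector>

    // Sketch: one inline helper that packs an rm x cn tile of a row-major matrix
    // into a contiguous scratch buffer, replacing several near-identical loops.
    inline void pack_tile_sketch(const float * src, size_t ld,
                                 size_t rm, size_t cn, float * dst) {
        for (size_t i = 0; i < rm; ++i) {
            for (size_t j = 0; j < cn; ++j) {
                dst[i * cn + j] = src[i * ld + j];
            }
        }
    }

    // Usage: the caller picks the tile size once and reuses the same helper,
    // instead of maintaining one hand-unrolled copy per tile shape.
    // Edge tiles (partial rm/cn) are omitted for brevity.
    void pack_panel_sketch(const float * src, size_t rows, size_t cols,
                           size_t rm, size_t cn, std::vector<float> & out) {
        out.resize(rows * cols);
        float * dst = out.data();
        for (size_t i = 0; i + rm <= rows; i += rm) {
            for (size_t j = 0; j + cn <= cols; j += cn) {
                pack_tile_sketch(src + i * cols + j, cols, rm, cn, dst);
                dst += rm * cn;  // tiles are laid out back to back
            }
        }
    }
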
Aman Gupta
9c9e4fc635 llama-context: add ability to get logits (#14672) b5895 2025-07-14 21:01:41 +08:00
Johannes Gäßler
494c5899cb scripts: benchmark for HTTP server throughput (#14668)
* scripts: benchmark for HTTP server throughput

* fix server connection reset
b5894
2025-07-14 13:14:30 +02:00
Akarshan Biswas
0f4c6ec0f1 SYCL: use 1D kernel for set_rows (#14618)
* SYCL: Use 1D kernel for set_rows

* Remove dangling comment

* Refactor and use ceil_div
b5893
2025-07-14 10:37:55 +01:00
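
Moving set_rows to a 1D kernel means flattening the element index and letting ceil_div size the launch so the rounded-up tail is bounds-checked away. A hedged, host-side illustration of the index math, not the SYCL kernel itself:

    #include <cstdint>

    // Integer ceiling division: how many blocks of size b cover a elements.
    constexpr int64_t ceil_div_sketch(int64_t a, int64_t b) {
        return (a + b - 1) / b;
    }

    // Sketch: a 1D global index is decomposed back into (row, column) so a single
    // flat launch can cover a 2D set_rows-style copy; tail items that only exist
    // because the launch size was rounded up are skipped.
    void set_rows_1d_sketch(const float * src, const int64_t * row_ids,
                            float * dst, int64_t n_rows, int64_t n_cols,
                            int64_t dst_stride) {
        const int64_t block   = 256;  // hypothetical work-group size
        const int64_t n_elems = n_rows * n_cols;
        const int64_t n_items = ceil_div_sketch(n_elems, block) * block;

        for (int64_t gid = 0; gid < n_items; ++gid) {  // stands in for the device loop
            if (gid >= n_elems) {
                continue;                              // tail item from rounding up
            }
            const int64_t r = gid / n_cols;
            const int64_t c = gid % n_cols;
            dst[row_ids[r] * dst_stride + c] = src[r * n_cols + c];
        }
    }
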