commit 53f925074d
Author: Georgi Gerganov
Date:   2025-05-30 16:25:45 +03:00

    sync : vendor (#13901)

    * sync : vendor
      ggml-ci
    * cont : fix httplib version
      ggml-ci
    * cont : fix lint
    * cont : fix lint
    * vendor : move to common folder /vendor
      ggml-ci
    * cont : fix lint
    * cont : move httplib to /vendor + use json_fwd.hpp
      ggml-ci
    * cont : fix server build
      ggml-ci
    * cont : add missing headers
      ggml-ci
    * cont : header clean-up
      ggml-ci

commit 07e4351ce6
Author: Xuan-Son Nguyen
Date:   2025-05-30 12:24:37 +02:00

    convert : allow partial update to the chkhsh pre-tokenizer list (#13847)

    * convert : allow partial update to the chkhsh pre-tokenizer list
    * code style
    * update tokenizer out
    * rm inp/out files for models not having gguf
    * fixed hash for glm
    * skip nomic-bert-moe test
    * Update convert_hf_to_gguf_update.py
    * fix minerva-7b hash
    * rm redundant import

commit 66c92061f5
Author: Georgi Gerganov
Date:   2025-05-29 12:17:16 +03:00

    tests : remove json.hpp from a test (#13880)

    ggml-ci

commit f9cd68398b
Author: Georgi Gerganov
Date:   2025-05-27 12:07:52 +03:00

    sampling : make sure samplers return at least 1 token (#13822)

    * sampling : min-p should always return at least one token
      ggml-ci
    * sampling : same for typical sampling
    * tests : sampling tests use min_keep == 0
      ggml-ci
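
    Note: a minimal sketch of the guarantee this commit enforces, against a
    plain sorted probability vector (illustrative names, not the llama.cpp
    sampler API): even with min_keep == 0, the min-p cutoff must leave at
    least one candidate.

        #include <algorithm>
        #include <cstddef>
        #include <functional>
        #include <vector>

        // Keep candidates with p >= min_p * p_max, but never drop below one token.
        std::vector<float> min_p_filter(std::vector<float> probs, float min_p) {
            if (probs.empty()) return probs;
            std::sort(probs.begin(), probs.end(), std::greater<float>());
            const float cutoff = min_p * probs.front();
            size_t keep = 1; // invariant: the top token always survives
            while (keep < probs.size() && probs[keep] >= cutoff) {
                ++keep;
            }
            probs.resize(keep);
            return probs;
        }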

commit 03f582ae8f
Author: Olivier Chafik
Date:   2025-05-26 16:03:57 +01:00

    server: fix streaming crashes (#13786)

    * add preludes to content on partial regex match
    * allow all parsers to parse non-tool-call content
    * tweak order of <|python_tag|> vs <function= parsing for functionary v3.1 format; still not ideal but hopefully less prone to crash

commit d74e94c1b3
Author: Olivier Chafik
Date:   2025-05-26 14:56:49 +01:00

    server: fix format of streamed tool call deltas (diff name, fix id location) (#13800)

    * fix deltas of tool_call.function.name
    * fix tool_call.id (was in tool_call.function.id!) + add function type
    * add tool_call.type
    * populate empty tool_call.function.arguments on first delta
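
    Note: a hedged sketch of the corrected delta shape using nlohmann::json;
    the field set is inferred from the bullets above (id on the tool call
    itself, a "type" field, empty "arguments" on the first delta), not copied
    from the server code.

        #include <nlohmann/json.hpp>
        #include <string>

        using json = nlohmann::json;

        // First streamed delta for a tool call: "id" sits on the tool call
        // object (not under "function"), "type" is present, and "arguments"
        // starts out as an empty string.
        json first_tool_call_delta(const std::string & id, const std::string & name) {
            return json {
                {"index", 0},
                {"id", id},
                {"type", "function"},
                {"function", json {
                    {"name", name},
                    {"arguments", ""},
                }},
            };
        }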

commit e121edc432
Author: Olivier Chafik
Date:   2025-05-26 00:30:51 +01:00

    server: add --reasoning-budget 0 to disable thinking (incl. qwen3 w/ enable_thinking:false) (#13771)

    Co-authored-by: ochafik <ochafik@google.com>
    Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

commit aa50ba462f
Author: Sigbjørn Skjæret
Date:   2025-05-25 16:22:29 +02:00

    tests : improve UGM tokenizer test coverage (#13773)

commit f5cd27b71d
Author: Olivier Chafik
Date:   2025-05-25 01:48:08 +01:00

    server: streaming of tool calls and thoughts when --jinja is on (#12379)

    * add common_json w/ support for truncated json healing
    * add common_chat_msg_diff
    * partial common_chat_parse
    * refactor parser w/ optionals
    * server: wire chat diffs in stream mode
    * fix trigger of thinking models (must happen after thoughts are closed)
    * fix functionary v3.2 raw python!
    * rename: common_chat_syntax (now contains format)
    * rm common_regex.at_start
    * don't return empty <think></think>
    * accommodate yet another deepseek r1 distill fantasy syntax (`<|tool▁calls|>`)
    * fix QwQ 32B tool call parsing after thoughts (hermes2)
    * better logs for grammar triggers
    * consume spaces after parse_json_tool_calls
    * fix required tool calls w/ thinking models that have pre-opened thinking tags
    * fix thinking model's initial trigger + test qwq's template
    * run most test_tool_call tests in stream + non-stream modes
    * make functionary v3.2 parsing more strict (differentiate first match from others)
    * send final diff from server, to close off raw python arguments
    * support partial content streaming in Generic mode
    * tool-call: allow content prelude before hermes2 tool calls (for Qwen2.5)
    * Update function-calling.md
    * Update tool_bench.py
    * chat-parser: remove input from exception (llm output may contain PII)
    ---------
    Co-authored-by: ochafik <ochafik@google.com>
    Co-authored-by: Olivier Chafik <ochafik@users.noreply.github.com>
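
    Note: the commit wires common_chat_msg_diff into stream mode; below is a
    minimal sketch of the underlying idea only (illustrative name, not the
    actual helper): re-parse the accumulated output on every token and emit
    just the suffix appended since the previous parse.

        #include <string>

        // If cur extends prev, return only the newly appended suffix;
        // otherwise fall back to resending the whole string.
        std::string content_delta(const std::string & prev, const std::string & cur) {
            if (cur.size() >= prev.size() && cur.compare(0, prev.size(), prev) == 0) {
                return cur.substr(prev.size());
            }
            return cur;
        }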

commit 759e37b0d8
Author: Sigbjørn Skjæret
Date:   2025-05-20 12:03:17 +02:00

    tests : avoid github urls due to throttling (#13654)

commit aa48e373f2
Author: Olivier Chafik
Date:   2025-05-15 02:39:51 +01:00

    server: inject date_string in llama 3.x template + fix date for firefunction v2 (#12802)

    * Inject date_string in llama 3.x + fix for functionary v2
      https://github.com/ggml-org/llama.cpp/issues/12729
    * move/fix detection of functionary v3.1 before llama 3.x, fix & test their non-tool mode
      Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
    * generate more tokens in test_completion_with_required_tool_tiny_fast to avoid truncation
    ---------
    Co-authored-by: ochafik <ochafik@google.com>
    Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
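
    Note: llama 3.x prompts carry a human-readable "Today Date" line; this is
    a small standard-library sketch of producing such a string (the "%d %b %Y"
    format matching the "26 Jul 2024" style is an assumption about the
    template, and the helper name is illustrative).

        #include <ctime>
        #include <string>

        // Render today's date as e.g. "26 Jul 2024" for injection into a chat
        // template that expects a date_string variable.
        std::string today_date_string() {
            std::time_t now = std::time(nullptr);
            char buf[32];
            std::strftime(buf, sizeof(buf), "%d %b %Y", std::localtime(&now));
            return std::string(buf);
        }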

commit 3198405e98
Author: Olivier Chafik
Date:   2025-05-14 19:50:57 +01:00

    common: add partial regex support (#12808)

    * move string_find_partial_stop & string_ends_with to common
    * add common_regex (supports partial matches)
      Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
    * Update common/regex-partial.cpp
      Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
    * Update common/regex-partial.cpp
      Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
    * Update common/regex-partial.h
      Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
    * partial regex: add missing iterator end checks
    * string utils: use string_views
    * direct throw to avoid ggml.h include
    * regex-partial: replace missed ggml_asserts
    ---------
    Co-authored-by: ochafik <ochafik@google.com>
    Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
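
    Note: a "partial match" here means the buffer ends while a longer match
    could still complete, so the server must hold those characters back
    instead of streaming them. A sketch of the simpler string case
    (string_find_partial_stop is named in the commit; this body is an
    illustrative reimplementation, not the actual one):

        #include <algorithm>
        #include <cstddef>
        #include <string>

        // Position where a suffix of text matches a prefix of stop, i.e.
        // where a stop string may have started without completing yet;
        // npos if none.
        size_t find_partial_stop(const std::string & text, const std::string & stop) {
            for (size_t len = std::min(text.size(), stop.size()); len > 0; --len) {
                if (text.compare(text.size() - len, len, stop, 0, len) == 0) {
                    return text.size() - len;
                }
            }
            return std::string::npos;
        }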

commit 10d2af0eaa
Author: Johannes Gäßler
Date:   2025-05-12 14:44:49 +02:00

    llama/ggml: add LLM training support (#10544)

    * llama/ggml: add LLM training support
      - more compact progress bar
      - llama_save_model_to_file
      - llama_opt_param_filter
      - ggml_graph_dup force_grads
      - refactor ggml_opt, fix test-opt
    * remove logits_all
    * refactor CUDA implementation for ACC
    * reset graph at beginning of opt period

commit 7f323a589f
Author: David Huang
Date:   2025-05-11 14:18:39 +02:00

    Add --no-op-offload to improve -ot pp perf in MoE models like llama4 400B (#13386)

commit ffc727203a
Author: DocShotgun
Date:   2025-05-06 22:36:24 +02:00

    sampling : make top_n_sigma no-op at <=0 or a single candidate (#13345)
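
    Note: a hedged sketch of the top-n-sigma filter this commit guards, over
    raw logits (illustrative code, not the llama.cpp sampler): keep candidates
    within n standard deviations of the maximum logit, and treat n <= 0 or a
    single candidate as a no-op.

        #include <algorithm>
        #include <cmath>
        #include <vector>

        std::vector<float> top_n_sigma(std::vector<float> logits, float n) {
            if (n <= 0.0f || logits.size() <= 1) {
                return logits; // no-op, per the commit
            }
            float max = logits[0], mean = 0.0f;
            for (float l : logits) { max = std::max(max, l); mean += l; }
            mean /= logits.size();
            float var = 0.0f;
            for (float l : logits) { var += (l - mean) * (l - mean); }
            const float sigma = std::sqrt(var / logits.size());
            std::vector<float> kept;
            for (float l : logits) {
                if (l >= max - n * sigma) kept.push_back(l);
            }
            return kept;
        }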

commit 27aa259532
Author: Xuan-Son Nguyen
Date:   2025-05-04 23:43:42 +02:00

    mtmd : add C public API (#13184)

    * init
    * wip
    * working version
    * add mtmd::bitmaps
    * add test target
    * rm redundant define
    * test: mtmd_input_chunks_free
    * rm outdated comment
    * fix merging issue
    * explicitly create mtmd::input_chunks
    * mtmd_input_chunk_copy
    * add clone()
    * add const to various places
    * add warning about breaking changes
    * helper: use mtmd_image_tokens_get_n_pos

commit 9f2da5871f
Author: Diego Devesa
Date:   2025-05-04 14:20:49 +02:00

    llama : build windows releases with dl backends (#13220)

commit 1d36b3670b
Author: Diego Devesa
Date:   2025-05-02 20:27:13 +02:00

    llama : move end-user examples to tools directory (#13249)

    * llama : move end-user examples to tools directory
    ---------
    Co-authored-by: Xuan Son Nguyen <son@huggingface.co>

commit b34443923c
Author: Georgi Gerganov
Date:   2025-05-02 20:54:30 +03:00

    sync : ggml (#13268)

    * vulkan : kernels for depthwise 2D convolution (CONV_2D_DW) (ggml/1204)
    * vulkan : add kernels for depthwise 2d convolution (OP_CONV_2D_DW)
    * review: remove src_x/y < 0 checks; add performance tests
    * sync : ggml
      ggml-ci
    * vulkan : fix lint (#0)
    ---------
    Co-authored-by: Acly <aclysia@gmail.com>

commit 2af6880178
Author: piDack
Date:   2025-05-02 11:06:09 +02:00

    llama-chat : reset glmedge chat template (#13253)

    * reset glmedge chat template
    * fix glmedge chat template

commit e0f572c846
Author: matteo
Date:   2025-05-01 21:16:38 +02:00

    llama-chat : update GLM4 chat template (#13238)

    * update GLM4 chat template
    * Update chat template
      Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
    ---------
    Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

commit b0ecbd434b
Author: Johannes Gäßler
Date:   2025-05-01 20:18:56 +02:00

    test: non-cont. b in test-backend-ops -o MUL_MAT (#13187)

commit e1e8e0991f
Author: Johannes Gäßler
Date:   2025-04-30 23:12:59 +02:00

    CUDA: batched+noncont MMQ, refactor bs>1 MoE code (#13199)

commit da84c04d8f
Author: Xuan-Son Nguyen
Date:   2025-04-30 10:44:07 +02:00

    docker : do not build tests (#13204)

    * docker : do not build tests
    * include "ggml-cpu.h"

commit 4e87962e34
Author: Xuan-Son Nguyen
Date:   2025-04-28 16:12:56 +02:00

    mtmd : fix glm-edge redundant token count (#13139)

    * mtmd : fix glm-edge redundant token count
    * fix chat template
    * temporary disable GLMEdge test chat tmpl

commit 2d451c8059
Author: Xuan-Son Nguyen
Date:   2025-04-26 22:58:12 +02:00

    common : add common_remote_get_content (#13123)

    * common : add common_remote_get_content
    * support max size and timeout
    * add tests

commit d5fe4e81bd
Author: frob
Date:   2025-04-26 10:10:20 +02:00

    grammar : handle maxItems == 0 in JSON schema (#13117)

    Co-authored-by: Richard Lyons <frob@cloudstaff.com>
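
    Note: for context, "maxItems": 0 in JSON Schema pins an array to exactly
    zero elements, so the schema-to-grammar converter must emit a rule that
    accepts only []. An illustrative schema (not taken from the test suite):

        // The only valid instance of this schema is the empty array [].
        const char * schema = R"({
            "type": "array",
            "items": { "type": "string" },
            "maxItems": 0
        })";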

commit edb18b6e8f
Author: Xuan-Son Nguyen
Date:   2025-04-25 14:31:42 +02:00

    clip : fix pixtral on some GPU backends (#13097)

    * clip : fix pixtral on some GPU backends
    * refactor inp_raw set
    * rm outdated comment
    * fix dynamic size
    * add TODO

commit 13b4548877
Author: Georgi Gerganov
Date:   2025-04-24 16:00:10 +03:00

    cmake : do not include ./src as public for libllama (#13062)

    * cmake : do not include ./src as public for libllama
      ggml-ci
    * cmake : rework tests
      ggml-ci
    * llguidance : remove unicode include
      ggml-ci
    * cmake : make c++17 private
      ggml-ci

commit 658987cfc9
Author: Johannes Gäßler
Date:   2025-04-22 21:27:40 +02:00

    CUDA: noncont MMVQ + batched bs1 MUL_MAT_ID (#13014)

    * CUDA: noncont MMVQ + batched bs1 MUL_MAT_ID
    * fix logic for RoPE support, CUDA graphs

commit 2f74c354c0
Author: Georgi Gerganov
Date:   2025-04-17 18:16:36 +03:00

    graph : make FA compatible with MLA + add initial Metal kernels (#12953)

    * graph : make mla compatible with FA
    * metal : add exp FA kernels for DeepSeek models
      ggml-ci
    * llama : minor naming updates
      ggml-ci
    * ggml : disable FA for DS head sizes
    * tests : add FA tests for MLA shapes
      ggml-ci

commit 015022bb53
Author: Jeff Bolz
Date:   2025-04-16 20:37:25 +02:00

    vulkan: enable coopmat2 FA gqa and split_k optimizations more often (#12931)

    The grouped query attention optimization doesn't require a power-of-two
    ratio; the only thing relying on it was the modulo operation written as a
    bitwise &. split_k need not depend on gqa_ratio: enable it any time there
    is only one workgroup in the X dimension. The shader gets the split index
    from the x coordinate, and multiple workgroups in the X dimension
    (pre-split) indicate a larger FA operation that wouldn't need splitting.
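
    Note: the power-of-two requirement mentioned above comes from a standard
    identity: x % n equals x & (n - 1) only when n is a power of two. A tiny
    self-contained check:

        #include <cassert>

        int main() {
            for (unsigned x = 0; x < 64; ++x) {
                assert((x % 8) == (x & (8 - 1))); // n = 8 is a power of two
                // (x % 6) == (x & (6 - 1)) fails for most x: 6 is not one
            }
            return 0;
        }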

commit b6930ebc42
Author: Olivier Chafik
Date:   2025-04-11 21:47:52 +02:00

    tool-call: fix non-tool-calling grammar crashes w/ Qwen / Hermes 2 templates (#12900)

    * `tool-call`: don't call common_chat_params_init_hermes_2_pro when there aren't tools (or when there's a schema)
    * test all chat formats w/o tools

commit 1d2b613445
Author: Georgi Gerganov
Date:   2025-04-11 00:17:47 +03:00

    tests : fix init order (#0)

    ggml-ci

commit fe92821ea9
Author: Diego Devesa
Date:   2025-04-11 00:17:47 +03:00

    ggml : add bilinear upscale support (ggml/1185)
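
    Note: bilinear upscaling blends the four nearest source texels by their
    fractional distances; a compact illustrative kernel (not the ggml
    implementation), assuming 0 <= x < w and 0 <= y < h:

        #include <algorithm>

        // Sample src (w x h, row-major) at fractional (x, y); the +1
        // neighbors are clamped at the right/bottom border.
        float bilerp(const float * src, int w, int h, float x, float y) {
            const int   x0 = (int) x,  y0 = (int) y;
            const int   x1 = std::min(x0 + 1, w - 1);
            const int   y1 = std::min(y0 + 1, h - 1);
            const float fx = x - x0,   fy = y - y0;
            const float top = src[y0*w + x0]*(1 - fx) + src[y0*w + x1]*fx;
            const float bot = src[y1*w + x0]*(1 - fx) + src[y1*w + x1]*fx;
            return top*(1 - fy) + bot*fy;
        }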

commit 381603a775
Author: Plamen Minev
Date:   2025-04-09 10:11:11 +02:00

    ci: detach common from the library (#12827)

    * fix: detach common from the library
    * fix: building chat test template

commit bd3f59f812
Author: Xuan-Son Nguyen
Date:   2025-04-07 13:35:19 +02:00

    cmake : enable curl by default (#12761)

    * cmake : enable curl by default
    * no curl if no examples
    * fix build
    * fix build-linux-cross
    * add windows-setup-curl
    * fix
    * shell
    * fix path
    * fix windows-latest-cmake*
    * run: include_directories
    * LLAMA_RUN_EXTRA_LIBS
    * sycl: no llama_curl
    * no test-arg-parser on windows
    * clarification
    * try riscv64 / arm64
    * windows: include libcurl inside release binary
    * add msg
    * fix mac / ios / android build
    * will this fix xcode?
    * try clearing the cache
    * add bunch of licenses
    * revert clear cache
    * fix xcode
    * fix xcode (2)
    * fix typo

commit 5f696e88e0
Author: R0CKSTAR
Date:   2025-04-03 13:51:35 +02:00

    sync : minja (inclusionAI/Ling) and update tests (#12699)

    Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

commit f01bd02376
Author: Jeff Bolz
Date:   2025-04-02 14:25:08 -05:00

    vulkan: Implement split_k for coopmat2 flash attention. (#12627)

    When using group query attention, we have one workgroup per KV batch and
    this can be very few workgroups (e.g. just 8 in some models). Enable
    split_k to spread the work across SMs. This helps a lot when the KV cache
    is large.

commit 267c1399f1
Author: Xuan-Son Nguyen
Date:   2025-04-01 23:44:05 +02:00

    common : refactor downloading system, handle mmproj with -hf option (#12694)

    * (wip) refactor downloading system [no ci]
    * fix all examples
    * fix mmproj with -hf
    * gemma3: update readme
    * only handle mmproj in llava example
    * fix multi-shard download
    * windows: fix problem with std::min and std::max
    * fix 2

commit 7242dd9675
Author: Sergei Vorobyov
Date:   2025-03-30 20:12:03 +02:00

    llama-chat : Add Yandex instruct model template support (#12621)

    * add yandex template
    * update yandex chat template
    * fix tests
    * adjust chat template
    * fix style
    * fix tool macro in template
    * add clarify comment
    ---------
    Co-authored-by: Sergei Vorobev <serv01@yandex-team.ru>
    Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

commit b4ae50810e
Author: Georgi Gerganov
Date:   2025-03-28 20:21:59 +02:00

    metal : improve FA + improve MoE (#12612)

    * ggml : FA with different K, V head sizes (CPU)
      ggml-ci
    * metal : add FA with HS=192
    * metal : extend FA to support different K and V head sizes
      ggml-ci
    * metal : add FA vector kernels for heads K 192 and V 128
      ggml-ci
    * ggml : restrict op on other backends to equal head sizes
      ggml-ci
    * metal : optimize FA-vec kernel
      ggml-ci
    * metal : FA remove mq registers
    * metal : improve MoE mul_mat_id condition
      ggml-ci
    * metal : fix comments + remove unnecessary addition
      ggml-ci
    * metal : avoid too much shared memory usage with mul_mat_id
      ggml-ci

commit 2447ad8a98
Author: Michał Moskal
Date:   2025-03-26 11:06:09 -07:00

    upgrade to llguidance 0.7.10 (#12576)

commit 9b169a4d4e
Author: Jeff Bolz
Date:   2025-03-24 07:56:17 +01:00

    vulkan: fix mul_mat_vec failure in backend tests (#12529)

    The OOB calculation could be wrong if the last iteration was during one of
    the unrolled loops. Adjust the unrolling counts to avoid this. Add a
    couple of new backend tests that hit this failure on NVIDIA GPUs.

commit ba932dfb50
Author: Georgi Gerganov
Date:   2025-03-22 16:23:26 +02:00

    ggml : fix quantized cpy op (#12310)

    * ggml : fix quantized cpy op
      ggml-ci
    * tests : add cpy tests for all types
      ggml-ci
    * tests : add BF16 copy tests
      ggml-ci
    * tests : fix loop for same-type copy
      ggml-ci
    * tests : add option to permute the dst tensor
      ggml-ci

commit eddfb43850
Author: Jeff Bolz
Date:   2025-03-22 09:40:11 +01:00

    vulkan: Optimize mul_mat_vec p021 and nc shaders (#12505)

    * tests: add mul_mat perf/functional tests for p021/nc vulkan shaders
    * vulkan: Optimize mul_mat_vec p021 and nc shaders.
      These shaders are used in attention calculations, and when the KV cache
      grows large they start to dominate the run time. For the nc shader
      (which is called with a large 'k' dimension), use unrolling and vector
      loads. For the p021 shader (which is called with a large 'm' and small
      'k' dimension), take advantage of grouped query attention to reuse
      loads from the A matrix for the whole group, and reduce the number of
      workgroups (too much overhead from tiny dispatches). Using subgroupAdd
      in the p021 shader also helps; use that conditionally.

commit 517b5ddbf0
Author: Gaurav Garg
Date:   2025-03-19 20:52:06 +01:00

    CUDA: Improve flash decoding kernel GPU occupancy for BS=1 case (#12183)

    - Find the number of active blocks per SM using the
      cudaOccupancyMaxActiveBlocksPerMultiprocessor API and use this value to
      determine the optimal parallel_blocks value.
    - Prefer vector flash attention kernels over the MMA kernel for BS=1.
    Fixes issue #12182
    ---------
    Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
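
    Note: a hedged CUDA host-side sketch of the occupancy query the commit
    describes; the stub kernel and block size are illustrative, only the two
    runtime API calls are real.

        #include <cuda_runtime.h>
        #include <cstdio>

        __global__ void flash_decode_stub(float * out) { out[blockIdx.x] = 0.0f; }

        int main() {
            int blocks_per_sm = 0;
            cudaOccupancyMaxActiveBlocksPerMultiprocessor(
                &blocks_per_sm, flash_decode_stub, /*blockSize=*/256, /*dynSmemBytes=*/0);
            int num_sms = 0;
            cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, /*device=*/0);
            printf("up to %d blocks can be resident at once\n", blocks_per_sm * num_sms);
            return 0;
        }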

commit 7dfad387e3
Author: Molly Sophia
Date:   2025-03-18 07:27:50 +08:00

    llama: Add support for RWKV v7 architecture (#12412)

    * ggml: Add op l2_norm
    * ggml: Add op rwkv_wkv7
    * llama: Add support for RWKV7 and ARWKV7 models
    * llama: fix inference with RWKV6Qwen2
    * llama: add more (a)rwkv7 variants in size
    * Apply code-format changes
    * fix MUSA build
    * llama: fix shape error with rwkv using llama-parallel
    ---------
    Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

commit bf69cfe62f
Author: Jeff Bolz
Date:   2025-03-12 06:59:19 +01:00

    vulkan: fix bug in coopmat1 mul_mat_id (#12316)

    * tests: run mul_mat_id with a larger N
    * vulkan: fix bug in coopmat1 mul_mat_id

commit e128a1bf5b
Author: Georgi Gerganov
Date:   2025-03-10 14:07:15 +02:00

    tests : fix test-quantize-fns to init the CPU backend (#12306)

    ggml-ci