7f4fbe5183 | Diego Devesa | 2025-06-09 20:03:09 +02:00
llama : allow building all tests on windows when not using shared libs (#13980)
* llama : allow building all tests on windows when not using shared libraries
* add static windows build to ci
* tests : enable debug logs for test-chat
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

0d3984424f | Ervin Áron Tasnádi | 2025-06-04 22:02:00 +02:00
ggml-vulkan: adds support for op CONV_TRANSPOSE_1D (#13813)
* ggml-vulkan: adds op CONV_TRANSPOSE_1D
* test-backend-ops: adds more sophisticated tests for CONV_TRANSPOSE_1D
* adds missing barrier to shader; reduces the number of additional tests to 108
* fixes typo in variable name
* removes extra whitespace
* adds int64->int32 casts to prevent possible warnings
* reduces problem size in tests to pass with llvmpipe
* moves supports_op condition from unintended position

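For readers unfamiliar with the op: a transposed 1D convolution scatters a scaled copy of the kernel for every input element, giving an output of length s*(n-1)+k for stride s, input length n, and kernel length k. A minimal single-channel sketch, assuming no padding or dilation (ggml's actual op handles multiple channels and its own memory layout):

```cpp
#include <cstddef>
#include <vector>

// Single-channel transposed 1D convolution with stride s, no padding/dilation.
// Assumes a non-empty input x.
std::vector<float> conv_transpose_1d(const std::vector<float> & x,
                                     const std::vector<float> & k, size_t s) {
    std::vector<float> y(s * (x.size() - 1) + k.size(), 0.0f);
    for (size_t i = 0; i < x.size(); i++) {
        for (size_t j = 0; j < k.size(); j++) {
            y[s*i + j] += x[i] * k[j]; // each input element scatters the kernel
        }
    }
    return y;
}
```
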
c9bbc77931 | Olivier Chafik | 2025-06-02 10:15:44 -07:00
server: update deepseek reasoning format (pass reasoning_content as diffs) (#13933)
* server: update deepseek reasoning format (now in reasoning_content diffs), add legacy option for compat
* update unit/test_tool_call.py::test_thoughts

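What "pass reasoning_content as diffs" means for a streaming client: each chunk carries only the newly parsed suffix of the field instead of repeating the whole value. A minimal sketch of the idea; the names are illustrative, not the server's actual types:

```cpp
#include <string>

// Given the previously streamed value of a field and its newly re-parsed
// value, emit only the new suffix (the diff). Assumes curr extends prev,
// which holds when the parser re-parses a monotonically growing output.
std::string field_diff(const std::string & prev, const std::string & curr) {
    return curr.size() > prev.size() ? curr.substr(prev.size()) : "";
}

// field_diff("The user asks", "The user asks about primes.")
//   -> " about primes."  (sent as a delta of reasoning_content)
```
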
7675c555a1 | Johannes Gäßler | 2025-06-01 18:08:05 +02:00
gguf: fix failure on version == 0 (#13956)

e15898d1c7 | Olivier Chafik | 2025-05-31 08:26:10 -07:00
server: allow unclosed thinking tags (#13931)

53f925074d | Georgi Gerganov | 2025-05-30 16:25:45 +03:00
sync : vendor (#13901)
* sync : vendor
* cont : fix httplib version
* cont : fix lint
* vendor : move to common folder /vendor
* cont : move httplib to /vendor + use json_fwd.hpp
* cont : fix server build
* cont : add missing headers
* cont : header clean-up

07e4351ce6 | Xuan-Son Nguyen | 2025-05-30 12:24:37 +02:00
convert : allow partial update to the chkhsh pre-tokenizer list (#13847)
* convert : allow partial update to the chkhsh pre-tokenizer list
* code style
* update tokenizer out
* rm inp/out files for models not having gguf
* fixed hash for glm
* skip nomic-bert-moe test
* update convert_hf_to_gguf_update.py
* fix minerva-7b hash
* rm redundant import

66c92061f5 | Georgi Gerganov | 2025-05-29 12:17:16 +03:00
tests : remove json.hpp from a test (#13880)

f9cd68398b | Georgi Gerganov | 2025-05-27 12:07:52 +03:00
sampling : make sure samplers return at least 1 token (#13822)
* sampling : min-p should always return at least one token
* sampling : same for typical sampling
* tests : sampling tests use min_keep == 0

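A minimal sketch of the invariant the commit enforces: even with min_keep == 0, a probability-threshold filter like min-p must keep at least the top candidate. Illustrative only, not llama.cpp's actual sampler code:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct candidate { int id; float p; }; // token id and its probability

void min_p_filter(std::vector<candidate> & cands, float p_min, size_t min_keep) {
    // sort by probability, highest first (assumes a non-empty candidate list)
    std::sort(cands.begin(), cands.end(),
              [](const candidate & a, const candidate & b) { return a.p > b.p; });
    const float threshold = p_min * cands.front().p;
    size_t k = 0;
    while (k < cands.size() && cands[k].p >= threshold) { k++; }
    // the fix: never return an empty set, regardless of min_keep
    k = std::max(std::max<size_t>(k, min_keep), std::size_t(1));
    cands.resize(std::min(k, cands.size()));
}
```
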
03f582ae8f | Olivier Chafik | 2025-05-26 16:03:57 +01:00
server: fix streaming crashes (#13786)
* add preludes to content on partial regex match
* allow all parsers to parse non-tool-call content
* tweak order of <|python_tag|> vs <function= parsing for functionary v3.1 format; still not ideal but hopefully less prone to crash

d74e94c1b3 | Olivier Chafik | 2025-05-26 14:56:49 +01:00
server: fix format of streamed tool call deltas (diff name, fix id location) (#13800)
* fix deltas of tool_call.function.name
* fix tool_call.id (was in tool_call.function.id!) + add function type
* add tool_call.type
* populate empty tool_call.function.arguments on first delta

e121edc432 | Olivier Chafik | 2025-05-26 00:30:51 +01:00
server: add --reasoning-budget 0 to disable thinking (incl. qwen3 w/ enable_thinking:false) (#13771)
Co-authored-by: ochafik <ochafik@google.com>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

aa50ba462f | Sigbjørn Skjæret | 2025-05-25 16:22:29 +02:00
tests : improve UGM tokenizer test coverage (#13773)

f5cd27b71d | Olivier Chafik | 2025-05-25 01:48:08 +01:00
server: streaming of tool calls and thoughts when --jinja is on (#12379)
* add common_json w/ support for truncated json healing
* add common_chat_msg_diff
* partial common_chat_parse
* refactor parser w/ optionals
* server: wire chat diffs in stream mode
* fix trigger of thinking models (must happen after thoughts are closed)
* fix functionary v3.2 raw python!
* rename: common_chat_syntax (now contains format)
* rm common_regex.at_start
* don't return empty <think></think>
* accommodate yet another deepseek r1 distill fantasy syntax (`<|tool▁calls|>`)
* fix QwQ 32B tool call parsing after thoughts (hermes2)
* better logs for grammar triggers
* consume spaces after parse_json_tool_calls
* fix required tool calls w/ thinking models that have pre-opened thinking tags
* fix thinking model's initial trigger + test qwq's template
* run most test_tool_call tests in stream + non-stream modes
* make functionary v3.2 parsing more strict (differentiate first match from others)
* send final diff from server, to close off raw python arguments
* support partial content streaming in Generic mode
* tool-call: allow content prelude before hermes2 tool calls (for Qwen2.5)
* update function-calling.md
* update tool_bench.py
* chat-parser: remove input from exception (llm output may contain PII)
Co-authored-by: ochafik <ochafik@google.com>
Co-authored-by: Olivier Chafik <ochafik@users.noreply.github.com>

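The "truncated json healing" in the first bullet is what makes partially streamed tool call arguments parseable. A minimal sketch of the idea, assuming only unclosed objects, arrays and strings need repairing; the real common_json has to be more careful (truncated escapes, literals, numbers):

```cpp
#include <cstddef>
#include <string>

// Close whatever is still open at the end of a partial JSON string so a
// strict parser can accept it.
std::string heal_truncated_json(std::string s) {
    std::string closers;
    bool in_string = false;
    for (size_t i = 0; i < s.size(); i++) {
        const char c = s[i];
        if (in_string) {
            if (c == '\\') { i++; }            // skip the escaped character
            else if (c == '"') { in_string = false; }
        } else if (c == '"') { in_string = true; }
        else if (c == '{') { closers.push_back('}'); }
        else if (c == '[') { closers.push_back(']'); }
        else if (c == '}' || c == ']') { if (!closers.empty()) closers.pop_back(); }
    }
    if (in_string) { s.push_back('"');  }      // close a dangling string
    while (!closers.empty()) { s.push_back(closers.back()); closers.pop_back(); }
    return s;
}

// heal_truncated_json("{\"name\": \"get_weather\", \"arguments\": {\"city\": \"Par")
//   -> {"name": "get_weather", "arguments": {"city": "Par"}}
```
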
759e37b0d8 | Sigbjørn Skjæret | 2025-05-20 12:03:17 +02:00
tests : avoid github urls due to throttling (#13654)

aa48e373f2 | Olivier Chafik | 2025-05-15 02:39:51 +01:00
server: inject date_string in llama 3.x template + fix date for firefunction v2 (#12802)
* inject date_string in llama 3.x + fix for functionary v2 (https://github.com/ggml-org/llama.cpp/issues/12729)
* move/fix detection of functionary v3.1 before llama 3.x, fix & test their non-tool mode
* generate more tokens in test_completion_with_required_tool_tiny_fast to avoid truncation
Co-authored-by: ochafik <ochafik@google.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

3198405e98 | Olivier Chafik | 2025-05-14 19:50:57 +01:00
common: add partial regex support (#12808)
* move string_find_partial_stop & string_ends_with to common
* add common_regex (supports partial matches)
* apply review suggestions to common/regex-partial.cpp and regex-partial.h
* partial regex: add missing iterator end checks
* string utils: use string_views
* direct throw to avoid ggml.h include
* regex-partial: replace missed ggml_asserts
Co-authored-by: ochafik <ochafik@google.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

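Why "partial matches" matter here: when output arrives token by token, the tail of the buffer may be an incomplete prefix of a stop marker or grammar trigger, and must be held back instead of streamed. A minimal sketch in the spirit of string_find_partial_stop (illustrative, not the actual signature):

```cpp
#include <algorithm>
#include <cstddef>
#include <string_view>

// Returns the position where a partial occurrence of `stop` begins at the
// end of `text`, or npos if there is none.
size_t find_partial_stop(std::string_view text, std::string_view stop) {
    for (size_t n = std::min(text.size(), stop.size()); n > 0; n--) {
        if (text.substr(text.size() - n) == stop.substr(0, n)) {
            return text.size() - n;
        }
    }
    return std::string_view::npos;
}

// find_partial_stop("hello <|tool", "<|tool_call|>") == 6:
// everything from position 6 on might still become the full marker.
```
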
10d2af0eaa | Johannes Gäßler | 2025-05-12 14:44:49 +02:00
llama/ggml: add LLM training support (#10544)
* llama/ggml: add LLM training support: more compact progress bar, llama_save_model_to_file, llama_opt_param_filter, ggml_graph_dup force_grads, refactor ggml_opt, fix test-opt
* remove logits_all
* refactor CUDA implementation for ACC
* reset graph at beginning of opt period

7f323a589f | David Huang | 2025-05-11 14:18:39 +02:00
Add --no-op-offload to improve -ot pp perf in MoE models like llama4 400B (#13386)

ffc727203a | DocShotgun | 2025-05-06 22:36:24 +02:00
sampling : make top_n_sigma no-op at <=0 or a single candidate (#13345)

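For context, top-n-sigma keeps only tokens whose logits lie within n standard deviations of the maximum; this commit makes the filter a no-op in the degenerate cases. A minimal sketch of that behavior (illustrative, not llama.cpp's implementation):

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

void top_n_sigma(std::vector<float> & logits, float n) {
    if (n <= 0.0f || logits.size() <= 1) {
        return; // no-op: nothing meaningful to filter
    }
    float max_l = logits[0], mean = 0.0f;
    for (float l : logits) { max_l = std::max(max_l, l); mean += l; }
    mean /= logits.size();
    float var = 0.0f;
    for (float l : logits) { var += (l - mean) * (l - mean); }
    const float sigma = std::sqrt(var / logits.size());
    // mask out tokens more than n standard deviations below the max logit
    for (float & l : logits) {
        if (l < max_l - n * sigma) { l = -INFINITY; }
    }
}
```
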
27aa259532 | Xuan-Son Nguyen | 2025-05-04 23:43:42 +02:00
mtmd : add C public API (#13184)
* init
* wip
* working version
* add mtmd::bitmaps
* add test target
* rm redundant define
* test: mtmd_input_chunks_free
* rm outdated comment
* fix merging issue
* explicitly create mtmd::input_chunks
* mtmd_input_chunk_copy
* add clone()
* add const to various places
* add warning about breaking changes
* helper: use mtmd_image_tokens_get_n_pos

9f2da5871f | Diego Devesa | 2025-05-04 14:20:49 +02:00
llama : build windows releases with dl backends (#13220)

1d36b3670b | Diego Devesa | 2025-05-02 20:27:13 +02:00
llama : move end-user examples to tools directory (#13249)
* llama : move end-user examples to tools directory
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>

b34443923c | Georgi Gerganov | 2025-05-02 20:54:30 +03:00
sync : ggml (#13268)
* vulkan : kernels for depthwise 2D convolution (CONV_2D_DW) (ggml/1204)
* vulkan : add kernels for depthwise 2d convolution (OP_CONV_2D_DW)
* review: remove src_x/y < 0 checks; add performance tests
* sync : ggml
* vulkan : fix lint (#0)
Co-authored-by: Acly <aclysia@gmail.com>

2af6880178 | piDack | 2025-05-02 11:06:09 +02:00
llama-chat : reset glmedge chat template (#13253)
* reset glmedge chat template
* fix glmedge chat template

e0f572c846 | matteo | 2025-05-01 21:16:38 +02:00
llama-chat : update GLM4 chat template (#13238)
* update GLM4 chat template
* update chat template
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

b0ecbd434b | Johannes Gäßler | 2025-05-01 20:18:56 +02:00
test: non-cont. b in test-backend-ops -o MUL_MAT (#13187)

e1e8e0991f | Johannes Gäßler | 2025-04-30 23:12:59 +02:00
CUDA: batched+noncont MMQ, refactor bs>1 MoE code (#13199)

da84c04d8f | Xuan-Son Nguyen | 2025-04-30 10:44:07 +02:00
docker : do not build tests (#13204)
* docker : do not build tests
* include "ggml-cpu.h"

4e87962e34 | Xuan-Son Nguyen | 2025-04-28 16:12:56 +02:00
mtmd : fix glm-edge redundant token count (#13139)
* mtmd : fix glm-edge redundant token count
* fix chat template
* temporarily disable GLMEdge chat template test

2d451c8059 | Xuan-Son Nguyen | 2025-04-26 22:58:12 +02:00
common : add common_remote_get_content (#13123)
* common : add common_remote_get_content
* support max size and timeout
* add tests

d5fe4e81bd | frob | 2025-04-26 10:10:20 +02:00
grammar : handle maxItems == 0 in JSON schema (#13117)
Co-authored-by: Richard Lyons <frob@cloudstaff.com>

edb18b6e8f | Xuan-Son Nguyen | 2025-04-25 14:31:42 +02:00
clip : fix pixtral on some GPU backends (#13097)
* clip : fix pixtral on some GPU backends
* refactor inp_raw set
* rm outdated comment
* fix dynamic size
* add TODO

13b4548877 | Georgi Gerganov | 2025-04-24 16:00:10 +03:00
cmake : do not include ./src as public for libllama (#13062)
* cmake : do not include ./src as public for libllama
* cmake : rework tests
* llguidance : remove unicode include
* cmake : make c++17 private

658987cfc9 | Johannes Gäßler | 2025-04-22 21:27:40 +02:00
CUDA: noncont MMVQ + batched bs1 MUL_MAT_ID (#13014)
* CUDA: noncont MMVQ + batched bs1 MUL_MAT_ID
* fix logic for RoPE support, CUDA graphs

2f74c354c0 | Georgi Gerganov | 2025-04-17 18:16:36 +03:00
graph : make FA compatible with MLA + add initial Metal kernels (#12953)
* graph : make mla compatible with FA
* metal : add exp FA kernels for DeepSeek models
* llama : minor naming updates
* ggml : disable FA for DS head sizes
* tests : add FA tests for MLA shapes

015022bb53 | Jeff Bolz | 2025-04-16 20:37:25 +02:00
vulkan: enable coopmat2 FA gqa and split_k optimizations more often (#12931)
The grouped query attention optimization doesn't require a power-of-two ratio; the only thing relying on it was the modulo operation written as a bitwise &. split_k need not depend on gqa_ratio: enable it any time there is only one workgroup in the X dimension. The shader gets the split index from the x coordinate, and multiple workgroups in the X dimension (pre-split) indicate a larger FA operation that wouldn't need splitting.

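The power-of-two assumption the message refers to is the classic identity i % r == i & (r - 1), which holds only when r is a power of two. A quick illustration:

```cpp
#include <cassert>
#include <cstdint>

uint32_t mod_pow2(uint32_t i, uint32_t r) { return i & (r - 1); } // valid only for power-of-two r
uint32_t mod_any (uint32_t i, uint32_t r) { return i % r; }       // valid for any r > 0

int main() {
    assert(mod_pow2(10, 4) == mod_any(10, 4)); // both 2: r = 4 is a power of two
    assert(mod_pow2(10, 6) != mod_any(10, 6)); // 0 vs 4: r = 6 breaks the bitwise trick
    return 0;
}
```
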
b6930ebc42 | Olivier Chafik | 2025-04-11 21:47:52 +02:00
tool-call: fix non-tool-calling grammar crashes w/ Qwen / Hermes 2 templates (#12900)
* tool-call: don't call common_chat_params_init_hermes_2_pro when there aren't tools (or when there's a schema)
* test all chat formats w/o tools

1d2b613445 | Georgi Gerganov | 2025-04-11 00:17:47 +03:00
tests : fix init order (#0)

fe92821ea9 | Diego Devesa | 2025-04-11 00:17:47 +03:00
ggml : add bilinear upscale support (ggml/1185)

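For reference, bilinear upscaling samples each output pixel from the four nearest input pixels, weighted by the fractional offsets. A minimal single-channel sketch, assuming in-range coordinates; ggml's indexing and scale handling may differ:

```cpp
#include <algorithm>
#include <cmath>

// Sample a W x H single-channel image at fractional coordinates (x, y),
// assuming 0 <= x < W and 0 <= y < H.
float bilinear(const float * img, int W, int H, float x, float y) {
    const int   x0 = (int) std::floor(x),     y0 = (int) std::floor(y);
    const int   x1 = std::min(x0 + 1, W - 1), y1 = std::min(y0 + 1, H - 1);
    const float fx = x - x0,                  fy = y - y0;
    const float top = img[y0*W + x0] * (1 - fx) + img[y0*W + x1] * fx;
    const float bot = img[y1*W + x0] * (1 - fx) + img[y1*W + x1] * fx;
    return top * (1 - fy) + bot * fy; // blend the four nearest input pixels
}
```
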
381603a775 | Plamen Minev | 2025-04-09 10:11:11 +02:00
ci: detach common from the library (#12827)
* fix: detach common from the library
* fix: building chat test template

bd3f59f812 | Xuan-Son Nguyen | 2025-04-07 13:35:19 +02:00
cmake : enable curl by default (#12761)
* cmake : enable curl by default
* no curl if no examples
* fix build
* fix build-linux-cross
* add windows-setup-curl
* fix
* shell
* fix path
* fix windows-latest-cmake*
* run: include_directories
* LLAMA_RUN_EXTRA_LIBS
* sycl: no llama_curl
* no test-arg-parser on windows
* clarification
* try riscv64 / arm64
* windows: include libcurl inside release binary
* add msg
* fix mac / ios / android build
* will this fix xcode?
* try clearing the cache
* add bunch of licenses
* revert clear cache
* fix xcode
* fix xcode (2)
* fix typo

5f696e88e0 | R0CKSTAR | 2025-04-03 13:51:35 +02:00
sync : minja (inclusionAI/Ling) and update tests (#12699)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

f01bd02376 | Jeff Bolz | 2025-04-02 14:25:08 -05:00
vulkan: Implement split_k for coopmat2 flash attention. (#12627)
When using group query attention, we have one workgroup per KV batch and this can be very few workgroups (e.g. just 8 in some models). Enable split_k to spread the work across SMs. This helps a lot when the KV cache is large.

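Splitting one attention row across workgroups means each split produces a partial softmax over its slice of the KV cache, and the partials must then be merged with the usual log-sum-exp rescaling. A scalar sketch of that reduction, as an assumption about the general technique rather than a description of this particular shader:

```cpp
#include <algorithm>
#include <cmath>

// Partial attention state for one output row over one slice of the KV cache:
// running max logit m, softmax denominator s, and unnormalized output o.
struct partial { float m, s, o; };

partial merge(partial a, partial b) {
    const float m  = std::max(a.m, b.m);
    const float ca = std::exp(a.m - m), cb = std::exp(b.m - m);
    return { m, a.s * ca + b.s * cb, a.o * ca + b.o * cb };
}

// final row output = merged.o / merged.s, identical to attending over the
// full KV range in a single pass
```
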
267c1399f1 | Xuan-Son Nguyen | 2025-04-01 23:44:05 +02:00
common : refactor downloading system, handle mmproj with -hf option (#12694)
* (wip) refactor downloading system [no ci]
* fix all examples
* fix mmproj with -hf
* gemma3: update readme
* only handle mmproj in llava example
* fix multi-shard download
* windows: fix problem with std::min and std::max
* fix 2

7242dd9675 | Sergei Vorobyov | 2025-03-30 20:12:03 +02:00
llama-chat : Add Yandex instruct model template support (#12621)
* add yandex template
* update yandex chat template
* fix tests
* adjust chat template
* fix style
* fix tool macro in template
* add clarifying comment
Co-authored-by: Sergei Vorobev <serv01@yandex-team.ru>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

b4ae50810e | Georgi Gerganov | 2025-03-28 20:21:59 +02:00
metal : improve FA + improve MoE (#12612)
* ggml : FA with different K, V head sizes (CPU)
* metal : add FA with HS=192
* metal : extend FA to support different K and V head sizes
* metal : add FA vector kernels for heads K 192 and V 128
* ggml : restrict op on other backends to equal head sizes
* metal : optimize FA-vec kernel
* metal : FA remove mq registers
* metal : improve MoE mul_mat_id condition
* metal : fix comments + remove unnecessary addition
* metal : avoid too much shared memory usage with mul_mat_id

2447ad8a98 | Michał Moskal | 2025-03-26 11:06:09 -07:00
upgrade to llguidance 0.7.10 (#12576)

9b169a4d4e | Jeff Bolz | 2025-03-24 07:56:17 +01:00
vulkan: fix mul_mat_vec failure in backend tests (#12529)
The OOB calculation could be wrong if the last iteration fell inside one of the unrolled loops. Adjust the unrolling counts to avoid this. Add a couple of new backend tests that hit this failure on NVIDIA GPUs.

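The failure mode generalizes beyond shaders: with manual unrolling, the loop guard has to cover every element the unrolled body touches, with a scalar tail for the remainder. A minimal CPU-side illustration:

```cpp
#include <cstddef>

float sum4(const float * x, size_t n) {
    float acc = 0.0f;
    size_t i  = 0;
    for (; i + 4 <= n; i += 4) { // the guard must cover all 4 unrolled loads
        acc += x[i] + x[i + 1] + x[i + 2] + x[i + 3];
    }
    for (; i < n; i++) {         // scalar tail handles the remainder safely
        acc += x[i];
    }
    return acc;
}
```
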
ba932dfb50 | Georgi Gerganov | 2025-03-22 16:23:26 +02:00
ggml : fix quantized cpy op (#12310)
* ggml : fix quantized cpy op
* tests : add cpy tests for all types
* tests : add BF16 copy tests
* tests : fix loop for same-type copy
* tests : add option to permute the dst tensor