Nick 
							
						 
					 
					
						
						
							
						
						9c55e5c5c2 
					 
					
						
						
							
							fix: check model pointer validity before use ( #13631 )  
						
						
						
						
							
 
						
					 
					
						2025-05-19 13:25:41 +03:00 
						 
				 
			
				
					
						
							
							
								Chenguang Li 
							
						 
					 
					
						
						
							
						
						33d7aed4a8 
					 
					
						
						
							
							CANN: Support MOE Model MUL_MAT_ID ( #13042 )  
						
						
						
						
						Signed-off-by: noemotiovon <757486878@qq.com>
						
						
							
 
						
					 
					
						2025-05-19 14:21:17 +08:00 
						 
				 
			
				
					
						
							
							
								Isaac McFadyen 
							
						 
					 
					
						
						
							
						
						6a2bc8bfb7 
					 
					
						
						
							
							server : added --no-prefill-assistant flag ( #13608 )  
						
						
						
						
						* added no-prefill-assistant flag
* reworded documentation comment
* updated server README.md 
						
						
							
 
						
					 
					
						2025-05-17 23:59:48 +02:00 
						 
				 
			
				
					
						
							
							
								Gilad S. 
							
						 
					 
					
						
						
							
						
						e3a7cf6c5b 
					 
					
						
						
							
							cmake: use the current build config for vulkan-shaders-gen ( #13595 )  
						
						
						
						
						* fix: use the current build config for `vulkan-shaders-gen`
* fix: only pass a valid build type to `--config` 
						
						
							
 
						
					 
					
						2025-05-17 15:26:43 -03:00 
						 
				 
			
				
					
						
							
							
								Georgi Gerganov 
							
						 
					 
					
						
						
							
						
						518329b2d4 
					 
					
						
						
							
							parallel : add option for non-shared and larger prompts ( #13598 )  
						
						
						
						
						* parallel : add option for non-shared and larger prompts
* parallel : update readme [no ci]
* cont : add note about base models [no ci]
* parallel : better var name
ggml-ci 
						
						
							
						
					 
					
						2025-05-17 12:58:55 +03:00 
						 
				 
			
				
					
						
							
							
								Jeff Bolz 
							
						 
					 
					
						
						
							
						
						2f5a4e1e09 
					 
					
						
						
							
							vulkan: move common FA code to flash_attn_base.comp ( #13556 )  
						
						
						
						
						* vulkan: move common FA code to flash_attn_base.comp
* vulkan: move common FA index/stride setup code to flash_attn_base.comp
* build fix 
						
						
							
 
						
					 
					
						2025-05-17 09:14:55 +02:00 
						 
				 
			
				
					
						
							
							
								Jeff Bolz 
							
						 
					 
					
						
						
							
						
						4f41ee11d6 
					 
					
						
						
							
							vulkan: use scalar FA rather than coopmat2 when N==1 ( #13554 )  
						
						
						
						
							
 
						
					 
					
						2025-05-17 08:35:47 +02:00 
						 
				 
			
				
					
						
							
							
								Z 
							
						 
					 
					
						
						
							
						
						3e0be1cace 
					 
					
						
						
							
							llguidance : official v0.7.20 release (no actual changes) [noci] ( #13594 )  
						
						
						
						
							
 
						
					 
					
						2025-05-16 22:56:28 +02:00 
						 
				 
			
				
					
						
							
							
								Xuan-Son Nguyen 
							
						 
					 
					
						
						
							
						
						6aa892ec2a 
					 
					
						
						
							
							server : do not return error out of context (with ctx shift disabled) ( #13577 )  
						
						
						
						
							
 
						
					 
					
						2025-05-16 21:50:00 +02:00 
						 
				 
			
				
					
						
							
							
								Xuan-Son Nguyen 
							
						 
					 
					
						
						
							
						
						aea9f8b4e7 
					 
					
						
						
							
							webui : improve accessibility for visually impaired people ( #13551 )  
						
						
						
						
						* webui : improve accessibility for visually impaired people
* add a11y for extra contents
* fix some labels being read twice
* add skip to main content 
						
						
							
						
					 
					
						2025-05-16 21:49:01 +02:00 
						 
				 
			
				
					
						
							
							
								Xuan-Son Nguyen 
							
						 
					 
					
						
						
							
						
						06c1e4abc1 
					 
					
						
						
							
							readme : add list of dependencies and their license ( #13591 )  
						
						
						
						
							
						
					 
					
						2025-05-16 20:04:18 +02:00 
						 
				 
			
				
					
						
							
							
								Diego Devesa 
							
						 
					 
					
						
						
							
						
						415e40a357 
					 
					
						
						
							
							releases : use arm version of curl for arm releases ( #13592 )  
						
						
						
						
							
 
						
					 
					
						2025-05-16 19:36:51 +02:00 
						 
				 
			
				
					
						
							
							
								Georgi Gerganov 
							
						 
					 
					
						
						
							
						
						654a67794f 
					 
					
						
						
							
							metal : add FA-vec kernel for head size 64 ( #13583 )  
						
						
						
						
						ggml-ci 
						
						
							
 
						
					 
					
						2025-05-16 20:32:58 +03:00 
						 
				 
			
				
					
						
							
							
								Diego Devesa 
							
						 
					 
					
						
						
							
						
						5364ae4ba5 
					 
					
						
						
							
							llama : print hint when loading a model when no backends are loaded ( #13589 )  
						
						
						
						
							
 
						
					 
					
						2025-05-16 16:38:07 +02:00 
						 
				 
			
				
					
						
							
							
								Sigbjørn Skjæret 
							
						 
					 
					
						
						
							
						
						7c07ac244d 
					 
					
						
						
							
							ci : add ppc64el to build-linux-cross ( #13575 )  
						
						
						
						
							
						
					 
					
						2025-05-16 14:54:23 +02:00 
						 
				 
			
				
					
						
							
							
								Łukasz Ślusarczyk 
							
						 
					 
					
						
						
							
						
						0a338ed013 
					 
					
						
						
							
							sycl : fixed compilation warnings ( #13582 )  
						
						
						
						
							
 
						
					 
					
						2025-05-16 18:15:29 +08:00 
						 
				 
			
				
					
						
							
							
								Olivier Chafik 
							
						 
					 
					
						
						
							
						
						bc098c3cf0 
					 
					
						
						
							
							minja: sync (qwen3) ( #13573 )  
						
						
						
						
						* minja: sync f06140fa52
- https://github.com/google/minja/pull/67  (@grf53)
- https://github.com/google/minja/pull/66  (@taha-yassine)
- https://github.com/google/minja/pull/63  (@grf53)
- https://github.com/google/minja/pull/58 
---------
Co-authored-by: ochafik <ochafik@google.com>
						
						
							
 
						
					 
					
						2025-05-15 23:29:10 +01:00 
						 
				 
			
				
					
						
							
							
								Diego Devesa 
							
						 
					 
					
						
						
							
						
						c6a2c9e741 
					 
					
						
						
							
							gguf : use ggml log system ( #13571 )  
						
						
						
						
						* gguf : use ggml log system
* llama : remove unnecessary new lines in exception messages 
						
						
							
 
						
					 
					
						2025-05-15 19:13:11 +02:00 
						 
				 
			
				
					
						
							
							
								Daniel Tang 
							
						 
					 
					
						
						
							
						
						07ad2b6db3 
					 
					
						
						
							
							gguf-py : fix disconnect-before-connect in editor-gui ( #13569 )  
						
						
						
						
						The bug caused a crash upon load with venvs created with
--system-site-packages to use
python3-pyside6.qtwidgets=python3-pyside6.qtwidgets=6.6.2-4
from Kubuntu 24.10. 
						
						
							
						
					 
					
						2025-05-15 18:47:10 +02:00 
						 
				 
			
				
					
						
							
							
								Xuan-Son Nguyen 
							
						 
					 
					
						
						
							
						
						c531edfa34 
					 
					
						
						
							
							convert : fix conversion for llama 4 ( #13567 )  
						
						
						
						
							
						
					 
					
						2025-05-15 17:40:07 +02:00 
						 
				 
			
				
					
						
							
							
								Atharva Dubey 
							
						 
					 
					
						
						
							
						
						02cdd2d8b0 
					 
					
						
						
							
							sycl: simplify bin_bcast_kernel ( #13383 )  
						
						
						
						
							
						
					 
					
						2025-05-15 17:39:52 +02:00 
						 
				 
			
				
					
						
							
							
								Svetlozar Georgiev 
							
						 
					 
					
						
						
							
						
						64bb51cf90 
					 
					
						
						
							
							sycl: reordered Q4_K MMVQ ( #13109 )  
						
						
						
						
							
						
					 
					
						2025-05-15 17:35:44 +02:00 
						 
				 
			
				
					
						
							
							
								Łukasz Ślusarczyk 
							
						 
					 
					
						
						
							
						
						9c404ed54c 
					 
					
						
						
							
							sycl: use oneDNN for matrices multiplication ( #12972 )  
						
						
						
						
							
 
						
					 
					
						2025-05-15 16:53:41 +02:00 
						 
				 
			
				
					
						
							
							
								Diego Devesa 
							
						 
					 
					
						
						
							
						
						6c8b91500e 
					 
					
						
						
							
							llama-bench : fix -ot with dl backends ( #13563 )  
						
						
						
						
							
 
						
					 
					
						2025-05-15 15:46:55 +02:00 
						 
				 
			
				
					
						
							
							
								Xuan-Son Nguyen 
							
						 
					 
					
						
						
							
						
						3cc1f1f1d2 
					 
					
						
						
							
							webui : handle PDF input (as text or image) + convert pasted long content to file ( #13562 )  
						
						
						
						
						* webui : handle PDF input (as text or image)
* handle the case where pdf image + server without mtmd
* fix bug missing pages 
						
						
							
						
					 
					
						2025-05-15 14:24:50 +02:00 
						 
				 
			
				
					
						
							
							
								Piotr Wilkin (ilintar) 
							
						 
					 
					
						
						
							
						
						c753d7bed0 
					 
					
						
						
							
							server : proper error handling for missing elements in messages array (OpenAI compatible backend) ( #13540 )  
						
						
						
						
							
 
						
					 
					
						2025-05-15 08:40:58 +02:00 
						 
				 
			
				
					
						
							
							
								Georgi Gerganov 
							
						 
					 
					
						
						
							
						
						b2838049cc 
					 
					
						
						
							
							bench : handle decode errors ( #13548 )  
						
						
						
						
						ggml-ci 
						
						
							
 
						
					 
					
						2025-05-15 05:57:02 +03:00 
						 
				 
			
				
					
						
							
							
								Olivier Chafik 
							
						 
					 
					
						
						
							
						
						aa48e373f2 
					 
					
						
						
							
							server: inject date_string in llama 3.x template + fix date for firefunction v2 (#12802 )  
						
						
						
						
						* Inject date_string in llama 3.x + fix for functionary v2
https://github.com/ggml-org/llama.cpp/issues/12729 
* move/fix detection of functionary v3.1 before llama 3.x, fix & test their non-tool mode
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* generate more tokens in test_completion_with_required_tool_tiny_fast to avoid truncation
---------
Co-authored-by: ochafik <ochafik@google.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
						
						
							
 
						
					 
					
						2025-05-15 02:39:51 +01:00 
						 
				 
			
				
					
						
							
							
								Georgi Gerganov 
							
						 
					 
					
						
						
							
						
						e3a9421b78 
					 
					
						
						
							
							kv-cache : fix out-of-bounds view during reserve graph ( #13547 )  
						
						
						
						
						* kv-cache : fix reserve graph out-of-bounds access
ggml-ci
* cont : add comment
* cont : fix comments [no ci]
* cont : more correct comment [no ci] 
						
						
							
						
					 
					
						2025-05-14 23:15:15 +03:00 
						 
				 
			
				
					
						
							
							
								Yibo Cai 
							
						 
					 
					
						
						
							
						
						5ab5d5fb25 
					 
					
						
						
							
							arm64: optimize q6_k_q8_k kernel with i8mm ( #13519 )  
						
						
						
						
						This PR improves q6_k_q8_k gemm kernel with arm64 i8mm instruction.
Tested on neoverse-n2 with llama3 8b q6_k quantization model.
- 40% ~ 54% S_PP uplift for all batch sizes
- 16% ~ 47% S_TG uplift for batch size 4 and above
Perplexity doesn't change with this PR.
```
// tested on neoverse-n2
$ llama-batched-bench \
      -m Meta-Llama-3-8B-Instruct-Q6_K.gguf \
      --no-mmap -fa \
      -c 8192 -b 4096 -ub 512 -npp 128 -ntg 128 \
      -npl 1,2,4,8,16,32 \
      -t 64
---------------------------------------------------------------------
|    PP |     TG |    B |       S_PP t/s      |       S_TG t/s      |
|       |        |      | original |  this pr | original |  this pr |
|-------|--------|------|----------|----------|----------|----------|
|   128 |    128 |    1 |    78.52 |   109.18 |    18.63 |    18.88 |
|   128 |    128 |    2 |    84.62 |   123.94 |    34.54 |    36.92 |
|   128 |    128 |    4 |    84.36 |   122.49 |    52.65 |    61.32 |
|   128 |    128 |    8 |    90.52 |   138.87 |    63.46 |    84.41 |
|   128 |    128 |   16 |    90.11 |   138.56 |    71.04 |   101.33 |
|   128 |    128 |   32 |    89.81 |   137.79 |    75.14 |   110.47 |
---------------------------------------------------------------------
``` 
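The uplift ranges quoted in this commit message can be re-derived from the table. The following is a minimal Python sketch of that arithmetic, not part of the PR; the helper names `uplift_percent` and `uplift_ranges` are hypothetical, and the numbers are copied from the table above. With these figures the S_PP uplift works out to roughly 39% to 54%, and the S_TG uplift for batch size 4 and above to roughly 16% to 47%.

```python
# Throughput figures (tokens/s) copied from the benchmark table above.
# Columns: batch size, S_PP original, S_PP this PR, S_TG original, S_TG this PR.
ROWS = [
    (1, 78.52, 109.18, 18.63, 18.88),
    (2, 84.62, 123.94, 34.54, 36.92),
    (4, 84.36, 122.49, 52.65, 61.32),
    (8, 90.52, 138.87, 63.46, 84.41),
    (16, 90.11, 138.56, 71.04, 101.33),
    (32, 89.81, 137.79, 75.14, 110.47),
]


def uplift_percent(before: float, after: float) -> float:
    """Relative speedup of `after` over `before`, in percent."""
    return (after / before - 1.0) * 100.0


def uplift_ranges(rows, min_tg_batch=4):
    """Return ((min, max) S_PP uplift, (min, max) S_TG uplift).

    The S_TG range only considers rows with batch size >= min_tg_batch,
    matching the "batch size 4 and above" wording of the commit message.
    """
    pp = [uplift_percent(r[1], r[2]) for r in rows]
    tg = [uplift_percent(r[3], r[4]) for r in rows if r[0] >= min_tg_batch]
    return (min(pp), max(pp)), (min(tg), max(tg))


if __name__ == "__main__":
    (pp_lo, pp_hi), (tg_lo, tg_hi) = uplift_ranges(ROWS)
    print(f"S_PP uplift: {pp_lo:.1f}% to {pp_hi:.1f}%")
    print(f"S_TG uplift (batch >= 4): {tg_lo:.1f}% to {tg_hi:.1f}%")
```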
						
						
							
 
						
					 
					
						2025-05-14 21:53:52 +02:00 
						 
				 
			
				
					
						
							
							
								Olivier Chafik 
							
						 
					 
					
						
						
							
						
						3198405e98 
					 
					
						
						
							
							common: add partial regex support (#12808 )  
						
						
						
						
						* move string_find_partial_stop & string_ends_with to common
* add common_regex (supports partial matches)
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update common/regex-partial.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update common/regex-partial.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update common/regex-partial.h
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* partial regex: add missing iterator end checks
* string utils: use string_views
* direct throw to avoid ggml.h include
* regex-partial: replace missed ggml_asserts
---------
Co-authored-by: ochafik <ochafik@google.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
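
To illustrate what a "partial" match means for the stop-string helper named in this commit, here is a minimal Python sketch. It is an assumption-laden illustration, not the C++ code from the PR: the function name `find_partial_stop` and its exact semantics (return the index where a prefix of the stop string begins at the very end of the text, or -1) are hypothetical. The idea is that streamed output ending in the start of a stop string should be held back until more text arrives.

```python
def find_partial_stop(text: str, stop: str) -> int:
    """Return the index in `text` where a prefix of `stop` starts and runs
    to the end of `text`, or -1 if the text does not end with any prefix.

    Hypothetical helper for illustration only; the longest prefix wins.
    """
    max_len = min(len(text), len(stop))
    # Try the longest candidate prefix first so the earliest start index is returned.
    for n in range(max_len, 0, -1):
        if text.endswith(stop[:n]):
            return len(text) - n
    return -1


if __name__ == "__main__":
    # The text ends with "</s", which could be the beginning of "</stop>".
    print(find_partial_stop("Hello </s", "</stop>"))
    # No suffix of the text is a prefix of the stop string.
    print(find_partial_stop("Hello", "</stop>"))
```

A caller streaming tokens could emit everything before the returned index and buffer the remainder until the next chunk either completes the stop string or rules it out.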
						
						
							
 
						
					 
					
						2025-05-14 19:50:57 +01:00 
						 
				 
			
				
					
						
							
							
								Sigbjørn Skjæret 
							
						 
					 
					
						
						
							
						
						f5170c1d7a 
					 
					
						
						
							
							editorconfig : fix trailing whitespace from  #13542  ( #13546 )  
						
						
						
						
							
						
					 
					
						2025-05-14 21:22:49 +03:00 
						 
				 
			
				
					
						
							
							
								Gilad S. 
							
						 
					 
					
						
						
							
						
						017f10b5fa 
					 
					
						
						
							
							fix: crash when calling llama_state_get_size on a context without a KV cache ( #13542 )  
						
						
						
						
							
 
						
					 
					
						2025-05-14 19:18:18 +03:00 
						 
				 
			
				
					
						
							
							
								Johannes Gäßler 
							
						 
					 
					
						
						
							
						
						4696d56749 
					 
					
						
						
							
							CUDA: fix crash on large batch size for quant. MoE ( #13537 )  
						
						
						
						
							
 
						
					 
					
						2025-05-14 16:41:02 +02:00 
						 
				 
			
				
					
						
							
							
								Diego Devesa 
							
						 
					 
					
						
						
							
						
						b7d2672082 
					 
					
						
						
							
							llama : fix quantize with dl backends ( #13539 )  
						
						
						
						
							
						
					 
					
						2025-05-14 16:12:36 +02:00 
						 
				 
			
				
					
						
							
							
								Johannes Gäßler 
							
						 
					 
					
						
						
							
						
						6da34fa276 
					 
					
						
						
							
							CUDA: faster Deepseek FA, add Turing support ( #13435 )  
						
						
						
						
							
 
						
					 
					
						2025-05-14 16:08:20 +02:00 
						 
				 
			
				
					
						
							
							
								Gabe Goodhart 
							
						 
					 
					
						
						
							
						
						5e7d95e22e 
					 
					
						
						
							
							fix: Move build_inp_pos to the top of the graph section for build_granite ( #13538 )  
						
						
						
						
						This matches how others do it, but will still avoid the extra
initialization when rope is disabled.
Branch: GraniteFour
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
						
						
							
 
						
					 
					
						2025-05-14 15:53:59 +03:00 
						 
				 
			
				
					
						
							
							
								Georgi Gerganov 
							
						 
					 
					
						
						
							
						
						053174436f 
					 
					
						
						
							
							server : passthrough the /models endpoint during loading ( #13535 )  
						
						
						
						
						* server : passthrough the /models endpoint during loading
* server : update readme + return json for "meta" field 
						
						
							
 
						
					 
					
						2025-05-14 15:42:10 +03:00 
						 
				 
			
				
					
						
							
							
								Xuan-Son Nguyen 
							
						 
					 
					
						
						
							
						
						360a9c98e1 
					 
					
						
						
							
							server : fix cache_tokens bug with no cache_prompt ( #13533 )  
						
						
						
						
							
 
						
					 
					
						2025-05-14 13:35:07 +02:00 
						 
				 
			
				
					
						
							
							
								bandoti 
							
						 
					 
					
						
						
							
						
						09d13d94fb 
					 
					
						
						
							
							cmake: simplify vulkan shader test logic ( #13263 )  
						
						
						
						
							
 
						
					 
					
						2025-05-14 07:53:57 -03:00 
						 
				 
			
				
					
						
							
							
								Jeff Bolz 
							
						 
					 
					
						
						
							
						
						24e86cae72 
					 
					
						
						
							
							vulkan: KHR_coopmat flash attention ( #13506 )  
						
						
						
						
						This shader uses coopmat1 to do the Q*K^T multiply. The P*V multiply is more
difficult for various reasons so I haven't done it. Performance for this
shader is around 2.5x better than for the scalar shader when doing prompt
processing. Some of the benefit may be from other optimizations like staging
through shared memory, or splitting by rows. 
						
						
							
 
						
					 
					
						2025-05-14 11:55:26 +02:00 
						 
				 
			
				
					
						
							
							
								Xuan-Son Nguyen 
							
						 
					 
					
						
						
							
						
						bb1681fbd5 
					 
					
						
						
							
							webui : use fflate for more deterministic gzip compress ( #13525 )  
						
						
						
						
						* webui : use pako for more deterministic gzip compress
* simpler code
* use fflate instead of pako 
						
						
							
						
					 
					
						2025-05-14 10:26:12 +02:00 
						 
				 
			
				
					
						
							
							
								Luca Stefani 
							
						 
					 
					
						
						
							
						
						d486dd3e8e 
					 
					
						
						
							
							webui: Allow pasting file from clipboard ( #13526 )  
						
						
						
						
						* server: Allow pasting file from clipboard
* server: Prevent default action on file paste
* update build
* format then build combined
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
						
						
							
						
					 
					
						2025-05-14 10:07:31 +02:00 
						 
				 
			
				
					
						
							
							
								ddpasa 
							
						 
					 
					
						
						
							
						
						21ca987fba 
					 
					
						
						
							
							docs: Update link to ggml-org in multimodal.md ( #13513 )  
						
						
						
						
						* Update multimodal.md
Minor change to include the huggingface link
* Update docs/multimodal.md
---------
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
						
						
							
						
					 
					
						2025-05-14 09:59:12 +02:00 
						 
				 
			
				
					
						
							
							
								Sigbjørn Skjæret 
							
						 
					 
					
						
						
							
						
						be1d4a13db 
					 
					
						
						
							
							scripts : fix compare-llama-bench.py show parameter ( #13514 )  
						
						
						
						
							
						
					 
					
						2025-05-14 08:41:01 +02:00 
						 
				 
			
				
					
						
							
							
								Jeff Bolz 
							
						 
					 
					
						
						
							
						
						ab3971f2a0 
					 
					
						
						
							
							vulkan: workaround FA compile failures on macos ( #13517 )  
						
						
						
						
							
 
						
					 
					
						2025-05-14 06:15:50 +02:00 
						 
				 
			
				
					
						
							
							
								Ed Addario 
							
						 
					 
					
						
						
							
						
						e5c834f718 
					 
					
						
						
							
							quantize : improve tensor-type pattern matching ( #13033 )  
						
						
						
						
							
 
						
					 
					
						2025-05-13 19:12:31 +02:00 
						 
				 
			
				
					
						
							
							
								Xuan-Son Nguyen 
							
						 
					 
					
						
						
							
						
						71bdbdb587 
					 
					
						
						
							
							clip : clip.h become private API ( ⚠️  breaking change) ( #13510 )  
						
						
						
						
							
 
						
					 
					
						2025-05-13 17:07:21 +02:00 
						 
				 
			
				
					
						
							
							
								Georgi Gerganov 
							
						 
					 
					
						
						
							
						
						f0995d28ce 
					 
					
						
						
							
							metal : use FA-vec kernel up to batch size 20 ( #13496 )  
						
						
						
						
						* batched-bench : fix pp batch contents
* metal : optimize multi-sequence FA vec kernel
ggml-ci
* metal : use FA-vec kernel up to batch size 20
ggml-ci 
						
						
							
 
						
					 
					
						2025-05-13 18:04:39 +03:00 
						 
				 
			
				
					
						
							
							
								Georgi Gerganov 
							
						 
					 
					
						
						
							
						
						c252e0c409 
					 
					
						
						
							
							metal : optimize multi-sequence FA vec kernel ( #13493 )  
						
						
						
						
						* batched-bench : fix pp batch contents
* metal : optimize multi-sequence FA vec kernel
ggml-ci 
						
						
							
 
						
					 
					
						2025-05-13 18:04:00 +03:00