fd1234cb46  llama : add gpt-oss (#15091)
Author: Georgi Gerganov
Date:   2025-08-05 22:10:36 +03:00

* oai moe
* compat with new checkpoint
* add attn sink impl
* add rope scaling yarn
* logits match with latest transformers code
* wip chat template
* rm trailing space
* use ggml_scale_bias
* rm redundant is_swa_all
* convert interleaved gate_up
* graph : fix activation function to match reference (#7)
* vocab : handle o200k_harmony special tokens
* ggml : add attention sinks support (#1)
* llama : add attn sinks
* ggml : add attn sinks
* cuda : add attn sinks
* vulkan : add support for sinks in softmax
remove unnecessary return
* ggml : add fused swiglu_oai op (#11)
* ggml : add fused swiglu_oai op
* Update ggml/src/ggml-cpu/ops.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* update CUDA impl
* cont : metal impl
* add vulkan impl
* test-backend-ops : more test cases, clean up
* llama : remove unfused impl
* remove extra lines
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: slaren <slarengh@gmail.com>
* repack mxfp4 upon conversion
* clean up a bit
* enable thinking
* add quick hack to render only some special tokens
* fix bf16 conversion
* remove vocab hack
* webui ok
* support chat parsing for gpt-oss
* fix webui
* direct mapping mxfp4, FINALLY
* force using mxfp4
* properly use lazy tensor
* ggml : add mxfp4
ggml : use e8m0 conversion instead of powf
Co-authored-by: Diego Devesa <slarengh@gmail.com>
change kvalues_mxfp4 table to match e2m1 (#6)
metal : remove quantization for now (not used)
cuda : fix disabled CUDA graphs due to ffn moe bias
vulkan : add support for mxfp4
cont : add cm2 dequant
* ggml : add ggml_add_id (#13)
* ggml : add ggml_add_id
* add cuda impl
* llama : add weight support check for add_id
* perf opt
* add vulkan impl
* rename cuda files
* add metal impl
* allow in-place ggml_add_id
* llama : keep biases on CPU with --cpu-moe
* llama : fix compile error
ggml-ci
* cuda : add fallback for __nv_cvt_e8m0_to_bf16raw
ggml-ci
* cleanup
ggml-ci
* sycl : fix supports_op for MXFP4
ggml-ci
* fix Unknown reasoning format
* ggml-cpu : fix AVX build
ggml-ci
* fix hip build
ggml-ci
* cuda : add mxfp4 dequantization support for cuBLAS
ggml-ci
* ggml-cpu : fix mxfp4 fallback definitions for some architectures
ggml-ci
* cuda : fix version required for __nv_cvt_e8m0_to_bf16raw
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: slaren <slarengh@gmail.com>
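
Note on the "use e8m0 conversion instead of powf" item above: each MXFP4 block stores its
shared scale as an e8m0 value, i.e. an 8-bit biased exponent whose value is 2^(e - 127).
A minimal sketch of the bit-manipulation idea (illustrative names, not the actual ggml helpers):

```c
#include <stdint.h>
#include <string.h>

// Decode an e8m0 scale (value = 2^(e - 127)) without calling powf, by writing
// the biased exponent directly into the exponent field of an IEEE-754 float.
// Valid for e in [1, 254]; the edge codes 0 and 255 need separate handling.
static inline float e8m0_to_fp32(uint8_t e) {
    uint32_t bits = (uint32_t) e << 23; // sign = 0, mantissa = 0, exponent = e
    float f;
    memcpy(&f, &bits, sizeof(f));
    return f;
}
```

When dequantizing a block, this shared scale multiplies the e2m1 (FP4) values looked up
from the kvalues_mxfp4 table mentioned above.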

bf9087f59a  metal : fuse add, mul + add tests (#14596)
Author: Georgi Gerganov
Date:   2025-07-18 20:37:26 +03:00

ggml-ci

f057808ffa  ggml: Don't assert fail when tensor data changes (#13222)
Author: Jesse Gross
Date:   2025-05-01 22:46:10 +02:00

The following scenario will cause an assertion failure in the graph allocator:
 - Build and allocate a graph containing a tensor with a non-NULL data pointer
 - Build and allocate a new graph where that data is NULL

Result:
ggml-alloc.c:819: GGML_ASSERT(talloc->buffer_id >= 0) failed

This happens during revalidation because we think that memory should have been
previously allocated based on the current graph, but in reality the previous
graph was different. In this situation, we should do a full reallocation pass.
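
For context, the scenario above can be sketched against the public ggml-alloc API
(ggml_gallocr_*). This is an illustrative reproduction under assumed sizes and graph
shapes, not the exact code path that hit the assert:

```c
#include "ggml.h"
#include "ggml-alloc.h"
#include "ggml-backend.h"

static struct ggml_cgraph * build_graph(struct ggml_context * ctx) {
    struct ggml_tensor * a = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 16);
    struct ggml_tensor * b = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 16);
    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, ggml_add(ctx, a, b));
    return gf;
}

int main(void) {
    ggml_gallocr_t galloc = ggml_gallocr_new(ggml_backend_cpu_buffer_type());

    // Graph 1: tensors are created with data allocated inside the context
    // (no_alloc = false), so their data pointers are non-NULL and the graph
    // allocator records them as externally allocated.
    struct ggml_init_params p1 = { /*.mem_size =*/ 16*1024*1024, /*.mem_buffer =*/ NULL, /*.no_alloc =*/ false };
    struct ggml_context * ctx1 = ggml_init(p1);
    ggml_gallocr_alloc_graph(galloc, build_graph(ctx1));

    // Graph 2: same topology, but built with no_alloc = true, so the data
    // pointers are now NULL and the allocator must actually place the tensors.
    // Before the fix this kind of reuse could trip the buffer_id assert during
    // revalidation; with the fix a full reallocation pass is done instead.
    struct ggml_init_params p2 = { /*.mem_size =*/ 16*1024*1024, /*.mem_buffer =*/ NULL, /*.no_alloc =*/ true };
    struct ggml_context * ctx2 = ggml_init(p2);
    ggml_gallocr_alloc_graph(galloc, build_graph(ctx2));

    ggml_gallocr_free(galloc);
    ggml_free(ctx2);
    ggml_free(ctx1);
    return 0;
}
```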

70680c48e5  ggml : upgrade init_tensor API to return a ggml_status (#11854)
Author: William Tambellini
Date:   2025-02-28 14:41:47 +01:00

* Upgrade init_tensor API to return a ggml_status
To prepare for an "abort-free" ggml (ggml not to abort on OOMs but to return an
OOM status), as agreed with Diego in the ggml repo, upgrade the init_tensor()
and view_init() APIs to return a ggml_status.
* misc fixes
---------
Co-authored-by: slaren <slarengh@gmail.com>
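
As a rough illustration of the "abort-free" direction, here is a hypothetical buffer
init_tensor hook that reports OOM through ggml_status instead of aborting (the function
and its per-tensor allocation are stand-ins, not the actual ggml backend code):

```c
#include <stdlib.h>
#include "ggml.h"
#include "ggml-backend.h"

// Hypothetical init_tensor implementation: allocation failure is returned as a
// status so the caller can recover, rather than aborting inside ggml.
static enum ggml_status example_buffer_init_tensor(ggml_backend_buffer_t buffer,
                                                   struct ggml_tensor * tensor) {
    (void) buffer;
    void * extra = malloc(64); // stand-in for per-tensor backend state
    if (extra == NULL) {
        return GGML_STATUS_ALLOC_FAILED; // previously this path would abort
    }
    tensor->extra = extra;
    return GGML_STATUS_SUCCESS;
}
```

Callers of the upgraded init_tensor()/view_init() entry points can then check the
returned ggml_status instead of relying on ggml to abort on OOM.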

1b598b3058  vulkan: use smaller combined allocations to avoid fragmentation (#11551)
Author: Jeff Bolz
Date:   2025-02-06 07:02:18 +01:00

9c8dcefe17  CUDA: backwards pass for misc. ops, add tests (#11257)
Author: Johannes Gäßler
Date:   2025-01-16 16:43:38 +01:00

* CUDA: backwards pass for misc. ops, add tests
* remove restrict from pointers

130d0c90bd  ggml : remove return from ggml_gallocr_allocate_node (ggml/1048)
Author: Daniel Bevenius
Date:   2024-12-17 18:35:49 +02:00

This commit removes the return statement from the ggml_gallocr_allocate_node
function. The motivation behind this change is to make the code more readable
and consistent.

8a43e940ab  ggml: new optimization interface (ggml/988)
Author: Johannes Gäßler
Date:   2024-11-17 08:30:29 +02:00

cd60b88bf7  ggml-alloc : remove buffer_id from leaf_alloc (ggml/987)
Author: Daniel Bevenius
Date:   2024-10-16 11:28:01 +03:00

This commit removes the buffer_id field from the leaf_alloc struct. The
motivation for this is that the field is only written to and never read/used,
as far as I can tell. Each tensor_alloc has a buffer_id field, and this is
what caused me to look into this more closely, to understand what the
buffer_id in leaf_alloc was used for.

96776405a1  ggml : move more prints to the ggml log system (#9839)
Author: Diego Devesa
Date:   2024-10-11 15:34:45 +02:00

* ggml : move more prints to the ggml log system
* show BLAS OpenMP warnings in all builds using debug print

d09770cae7  ggml-alloc : fix list of allocated tensors with GGML_ALLOCATOR_DEBUG (#9573)
Author: slaren
Date:   2024-09-21 14:24:23 +02:00

2b1f616b20  ggml : reduce hash table reset cost (#8698)
Author: slaren
Date:   2024-07-27 04:41:55 +02:00

* ggml : reduce hash table reset cost
* fix unreachable code warnings after GGML_ASSERT(false)
* GGML_ASSERT(false) -> GGML_ABORT("fatal error")
* GGML_ABORT use format string

a15ef8f8a0  CUDA: fix partial offloading for ne0 % 256 != 0 (#8572)
Author: Johannes Gäßler
Date:   2024-07-18 23:48:47 +02:00

f3f65429c4  llama : reorganize source code + improve CMake (#8006)
Author: Georgi Gerganov
Date:   2024-06-26 18:33:02 +03:00

* scripts : update sync [no ci]
* files : relocate [no ci]
* ci : disable kompute build [no ci]
* cmake : fixes [no ci]
* server : fix mingw build
ggml-ci
* cmake : minor [no ci]
* cmake : link math library [no ci]
* cmake : build normal ggml library (not object library) [no ci]
* cmake : fix kompute build
ggml-ci
* make,cmake : fix LLAMA_CUDA + replace GGML_CDEF_PRIVATE
ggml-ci
* move public backend headers to the public include directory (#8122)
* move public backend headers to the public include directory
* nix test
* spm : fix metal header
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* scripts : fix sync paths [no ci]
* scripts : sync ggml-blas.h [no ci]
---------
Co-authored-by: slaren <slarengh@gmail.com>