commit 1f0dabda8d
Author: Johannes Gäßler
Date:   2024-06-10 11:45:13 +02:00

    CUDA: use tensor cores for MMQ (#7676)

    * CUDA: int8 tensor cores for MMQ (legacy quants)
    * fix out-of-bounds writes
    * __builtin_assume -> GGML_CUDA_ASSUME
    * fix writeback returning too early
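As background for the entry above: MMQ is llama.cpp's matrix multiplication over quantized tensors, and this change moves its inner products onto the int8 tensor cores. The sketch below shows only the underlying primitive through CUDA's public WMMA API; the actual MMQ kernels use their own mma wrappers, so everything here is illustrative rather than the committed code.

#include <mma.h>
using namespace nvcuda;

// Minimal int8 tensor core tile multiply (Turing, sm_75, or newer).
// A and B are 16x16 tiles of signed 8-bit values; products accumulate
// into 32-bit integers, the precision arrangement MMQ-style kernels use.
__global__ void int8_wmma_tile(const signed char * A, const signed char * B, int * C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, signed char, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, signed char, wmma::col_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, int> c;

    wmma::fill_fragment(c, 0);
    wmma::load_matrix_sync(a, A, 16); // leading dimension of the 16x16 tile
    wmma::load_matrix_sync(b, B, 16);
    wmma::mma_sync(c, a, b, c);       // c += a * b on the tensor cores
    wmma::store_matrix_sync(C, c, 16, wmma::mem_row_major);
}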
commit 750f60c03e
Author: Johannes Gäßler
Date:   2024-06-01 15:47:04 +02:00

    CUDA: fix Pascal FA, deq. KV to FP16 for batch > 8 (#7681)
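The fix above makes FlashAttention dequantize a quantized KV cache to FP16 once the batch grows past 8, instead of reading it quantized in place. A minimal host-side sketch of that kind of dispatch, with every name below invented for illustration (this is not llama.cpp's API):

#include <cuda_fp16.h>

// Hypothetical helpers, declared only so the sketch is self-contained.
void   launch_fattn_vec(const float * q, const void * k, const void * v, float * dst, int batch);
half * dequantize_to_f16(const void * quantized);

void dispatch_flash_attn(const float * q, const void * k, const void * v,
                         float * dst, int batch, bool kv_is_quantized) {
    if (!kv_is_quantized || batch <= 8) {
        // Small batches: the vector kernel reads KV in its stored format.
        launch_fattn_vec(q, k, v, dst, batch);
    } else {
        // Larger batches: one dequantization pass to FP16 is amortized
        // across the whole batch, so the FP16 path wins.
        half * k_f16 = dequantize_to_f16(k);
        half * v_f16 = dequantize_to_f16(v);
        launch_fattn_vec(q, k_f16, v_f16, dst, batch);
    }
}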
commit 9b596417af
Author: Johannes Gäßler
Date:   2024-06-01 08:44:14 +02:00

    CUDA: quantized KV support for FA vec (#7527)

    * CUDA: quantized KV support for FA vec
    * try CI fix
    * fix commented-out kernel variants
    * add q8_0 q4_0 tests
    * fix nwarps > batch size
    * split fattn compile via extern templates
    * fix flake8
    * fix metal tests
    * fix cmake
    * make generate_cu_files.py executable
    * add autogenerated .cu files
    * fix AMD
    * error if type_v != FP16 and not flash_attn
    * remove obsolete code
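Of the formats tested above, q8_0 is ggml's simplest: blocks of 32 signed 8-bit values sharing one FP16 scale. A vector kernel that reads a quantized KV cache has to undo that per element, on the fly, in registers. The block layout below matches ggml's q8_0; the accessor itself is an illustrative sketch, not the committed kernel code.

#include <cuda_fp16.h>

#define QK8_0 32 // values per q8_0 block, as in ggml

typedef struct {
    half   d;         // per-block FP16 scale
    int8_t qs[QK8_0]; // quantized values
} block_q8_0;

// Recover element i of a q8_0 array: find its block, then scale the int8.
__device__ float dequantize_q8_0(const block_q8_0 * blocks, int i) {
    const block_q8_0 * b = blocks + i / QK8_0;
    return __half2float(b->d) * (float) b->qs[i % QK8_0];
}

The real kernel dequantizes whole blocks at a time rather than single elements, but the arithmetic per value is this scale-and-multiply.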
commit 133d99c599
Author: Johannes Gäßler
Date:   2024-05-18 12:36:25 +02:00

    CUDA: deduplicate FlashAttention code (#7352)
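Deduplicating FlashAttention code means collapsing near-identical kernel variants into a single template over the parameters that used to differ, such as head size and KV type. Paired with the "split fattn compile via extern templates" step from the entry further up, the pattern looks roughly like this; the names are illustrative, not the actual llama.cpp source:

#include <cuda_fp16.h>

// Shared header: one templated launcher replaces the per-variant copies.
template <int head_size, typename KV_t>
void launch_fattn_vec(const float * Q, const KV_t * K, const KV_t * V,
                      float * dst, int n_kv);

// Declarations only: each variant is explicitly instantiated in its own
// .cu file, so the instantiations compile in parallel and a template edit
// does not rebuild everything in one translation unit.
extern template void launch_fattn_vec< 64, half>(const float *, const half *, const half *, float *, int);
extern template void launch_fattn_vec<128, half>(const float *, const half *, const half *, float *, int);

// In e.g. fattn-vec-f16-64.cu (one instantiation per file):
// template void launch_fattn_vec<64, half>(const float *, const half *, const half *, float *, int);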
commit dc685be466
Author: Johannes Gäßler
Date:   2024-05-12 19:40:45 +02:00

    CUDA: add FP32 FlashAttention vector kernel (#7188)

    * CUDA: add FP32 FlashAttention vector kernel
    * fixup! CUDA: add FP32 FlashAttention vector kernel
    * fixup! fixup! CUDA: add FP32 FlashAttention vector kernel
    * fixup! fixup! fixup! CUDA: add FP32 FlashAttention vector kernel
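Whatever the storage type, a FlashAttention-style vector kernel is built around the online softmax: KV positions are consumed one at a time while a running maximum, denominator, and weighted sum of value rows are rescaled in place, so the full attention matrix never has to exist in memory. Below is a deliberately naive FP32 sketch of that loop, one thread per whole query; the real kernel parallelizes across the head dimension and KV positions, and the row-major layouts here are assumptions:

#include <math.h>

// Q: [n_q, D], K and V: [n_kv, D], dst: [n_q, D], all row-major (assumed).
__global__ void attn_f32_naive(const float * Q, const float * K, const float * V,
                               float * dst, int n_q, int n_kv, int D, float scale) {
    const int iq = blockIdx.x * blockDim.x + threadIdx.x;
    if (iq >= n_q) return;

    float M = -INFINITY;     // running maximum of the scores seen so far
    float S = 0.0f;          // running softmax denominator
    float acc[256] = {0.0f}; // running weighted sum of V rows (assumes D <= 256)

    for (int ik = 0; ik < n_kv; ++ik) {
        float s = 0.0f; // scaled dot product of query iq with key ik
        for (int d = 0; d < D; ++d) {
            s += Q[iq*D + d] * K[ik*D + d];
        }
        s *= scale;

        // Online softmax update: rescale previous state to the new maximum.
        const float M_new = fmaxf(M, s);
        const float ms = expf(M - M_new); // correction for what was accumulated
        const float vs = expf(s - M_new); // weight of the current V row
        for (int d = 0; d < D; ++d) {
            acc[d] = acc[d]*ms + vs*V[ik*D + d];
        }
        S = S*ms + vs;
        M = M_new;
    }

    for (int d = 0; d < D; ++d) {
        dst[iq*D + d] = acc[d] / S; // normalize once at the end
    }
}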