commit 08737ef720
Author: Georgi Gerganov
Date:   2023-05-13 17:40:58 +03:00

    cuda : fix convert function (#1412)

commit 905d87b70a
Author: Johannes Gäßler
Date:   2023-05-13 16:38:36 +03:00

    ggml : GPU-accelerated token generation (#1412)

    * CUDA kernel for q4_0 dequant. + mat. vec. mult.
    * Added q4_1 via template
    * Added missing __syncthreads();
    * --gpu_layers -> --gpu-layers
    * Shorter dequantize_mul_mat_vec line
    * q5_0 dequantize_mul_mat kernel
    * More readable dequantize_mul_mat_vec logic
    * dequantize_mul_mat_vec kernels for q5_1, q8_0, f16
    * llama : offload "output" tensor to GPU too + coding style fixes
    ---------
    Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

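The bullets above describe fused kernels that dequantize weights and compute the matrix-vector product in a single pass. A minimal sketch of that pattern for Q4_0 follows; the block layout and kernel body are simplifications assumed for illustration, not the actual llama.cpp kernel.

```cuda
#include <stdint.h>
#include <cuda_fp16.h>

#define QK4_0 32

// Simplified Q4_0 block: one fp16 scale plus 32 4-bit weights
// (illustrative layout, in the spirit of the upstream format).
typedef struct {
    half    d;              // scale
    uint8_t qs[QK4_0 / 2];  // two 4-bit weights per byte
} block_q4_0;

// One warp per output row: each thread dequantizes part of the row on the
// fly and accumulates a partial dot product with the vector y.
__global__ void dequantize_mul_mat_vec_q4_0(const block_q4_0 *x, const float *y,
                                            float *dst, int ncols) {
    const int row      = blockIdx.x;
    const int n_blocks = ncols / QK4_0;
    float sum = 0.0f;

    for (int i = threadIdx.x; i < n_blocks; i += blockDim.x) {
        const block_q4_0 *b = &x[row * n_blocks + i];
        const float d = __half2float(b->d);
        for (int j = 0; j < QK4_0 / 2; ++j) {
            const int q0 = (b->qs[j] & 0x0F) - 8;  // low nibble
            const int q1 = (b->qs[j] >>   4) - 8;  // high nibble
            sum += d * (q0 * y[i * QK4_0 + j] +
                        q1 * y[i * QK4_0 + j + QK4_0 / 2]);
        }
    }

    // Warp-level reduction of the partial sums (assumes blockDim.x == 32).
    for (int mask = 16; mask > 0; mask >>= 1) {
        sum += __shfl_xor_sync(0xFFFFFFFF, sum, mask);
    }
    if (threadIdx.x == 0) {
        dst[row] = sum;
    }
}
```

A launch would look like `dequantize_mul_mat_vec_q4_0<<<nrows, 32>>>(x, y, dst, ncols)`; per the bullets, the q4_1, q5_0, q5_1, q8_0 and f16 variants share this structure (upstream adds q4_1 via a template).
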
commit b9fd7eee57
Author: Georgi Gerganov
Date:   2023-05-12 00:23:08 +03:00

    ggml : remove bit shuffling (#1405)

    * ggml : remove Q4_0 bit shuffling (ARM NEON)
    * ggml : remove Q4_1 bit shuffling (ARM NEON + reference)
    * ggml : nibbles_from_floats() + bytes_from_nibbles() (ARM NEON)
    * ggml : remove Q4_2 bit shuffling (WIP, BROKEN)
    * ggml : remove Q5_0 bit shuffling (ARM NEON)
    * ggml : 2x faster scalar implementations
    * ggml : remove Q5_1 bit shuffling (ARM NEON + scalar)
    * ggml : simplify scalar dot
    * ggml : remove WASM SIMD bit shuffling + remove vzip for ARM 32-bit
    * ggml : fix Q4_1 quantization
    * ggml : update cuBLAS + normalize variable names
    * ggml : remove Q4_2 mode
    * ggml : minor formatting
    * ggml : fix Q5_0 quantization
    * scripts : add script for measuring the time per token
    * AVX implementations (#1370)
    * ggml : uniform 5th bit extraction
    * llama : produce error upon loading old model files
    * llama : fix model magic/version write
    * ggml : speed-up Q5_0 + Q5_1 at 4 threads
    * ggml : preserve old Q4 and Q5 formats
    * ggml : simplify Q8_1 - no need for low / high sums anymore
    * ggml : fix Q8_0 and Q8_1 rounding
    * Revert "AVX implementations (#1370)"
      This reverts commit 948d124837.
    * ggml : fix AVX2 implementation
    * sha : update hashes for 7B and 13B
    * readme : update timings + remove warning banner
    * llama : update v2 PR number to 1405
    * ggml : fix WASM comments
    * ggml : back to original bit order
    * readme : add note that Q4 and Q5 have been changed
    * llama : fix return for unknown version
    ---------
    Co-authored-by: Stephan Walter <stephan@walter.name>

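The point of this change is that nibbles now map to weights in plain order instead of an interleaved bit order, which simplifies SIMD extraction. A sketch of the non-shuffled packing; the helper names echo the bytes_from_nibbles() bullet above, but these bodies are assumptions, not the upstream helpers.

```cuda
#include <stdint.h>

// Plain (non-shuffled) nibble layout: weight j lives in the low nibble of
// byte j, weight j + 16 in the high nibble.
static void bytes_from_nibbles(uint8_t out[16], const uint8_t q[32]) {
    for (int j = 0; j < 16; ++j) {
        out[j] = (uint8_t) ((q[j] & 0x0F) | (q[j + 16] << 4));
    }
}

static void nibbles_from_bytes(uint8_t q[32], const uint8_t in[16]) {
    for (int j = 0; j < 16; ++j) {
        q[j]      = in[j] & 0x0F;  // low nibbles: first half of the block
        q[j + 16] = in[j] >> 4;    // high nibbles: second half
    }
}
```
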
commit 1f48b0abcf
Author: Johannes Gäßler
Date:   2023-05-08 02:42:01 +02:00

    Documented CUDA reproducibility, added warning (#1346)

commit 58b367c2d7
Author: slaren
Date:   2023-05-01 18:11:07 +02:00

    cuBLAS: refactor and optimize f16 mat mul performance (#1259)

    * cuBLAS: refactor, convert fp16 to fp32 on device
    * cuBLAS: use multiple streams, choose smartly between mul_mat_q and mul_mat_f16
    * fix build
    * cuBLAS: update block_q5_1

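The first bullet moves the fp16 -> fp32 conversion onto the GPU, so only the smaller fp16 data has to cross the PCIe bus before the fp32 GEMM. A sketch of such a conversion kernel (illustrative, not the upstream code):

```cuda
#include <cuda_fp16.h>

// Convert fp16 to fp32 on the device, after the (smaller) fp16 copy has
// already landed in GPU memory.
__global__ void convert_fp16_to_fp32(const half *src, float *dst, int n) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        dst[i] = __half2float(src[i]);
    }
}
```

Queued as `convert_fp16_to_fp32<<<(n + 255) / 256, 256, 0, stream>>>(src, dst, n)` on the same stream as the subsequent GEMM, the conversion is ordered before the multiplication; with several streams, uploads and conversions for independent matrices can overlap, which is where the multi-stream bullet comes in.
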
commit b925f1f1b0
Author: slaren
Date:   2023-05-01 13:32:22 +02:00

    cuBLAS: fall back to pageable memory if pinned alloc fails (#1233)

    * cuBLAS: fall back to pageable memory if pinned alloc fails
    * cuBLAS: do not use pinned memory if env variable GGML_CUDA_NO_PINNED is set

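A sketch of the fallback behavior both bullets describe; the GGML_CUDA_NO_PINNED variable comes from the commit itself, while the function name and warning message here are illustrative:

```cuda
#include <stdio.h>
#include <stdlib.h>

// Prefer pinned (page-locked) host memory, which enables fast async
// transfers, but degrade gracefully to pageable memory.
static void * host_malloc(size_t size) {
    if (getenv("GGML_CUDA_NO_PINNED") != NULL) {
        return malloc(size);  // user opted out of pinned memory
    }
    void *ptr = NULL;
    if (cudaMallocHost(&ptr, size) != cudaSuccess) {
        fprintf(stderr, "warning: pinned alloc failed, using pageable memory\n");
        cudaGetLastError();  // clear the sticky error state (defensive)
        return malloc(size);
    }
    return ptr;
}
```
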
commit 7fc50c051a
Author: slaren
Date:   2023-04-29 02:04:18 +02:00

    cuBLAS: use host pinned memory and dequantize while copying (#1207)

    * cuBLAS: dequantize simultaneously while copying memory
    * cuBLAS: use host pinned memory
    * cuBLAS: improve ggml_compute_forward_mul_mat_f16_f32 with pinned memory
    * cuBLAS: also pin kv cache
    * fix rebase

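Pinned buffers make cudaMemcpyAsync genuinely asynchronous, so the upload of one chunk can overlap with the dequantization of the previous one. A sketch of that pipelining with two streams; the kernel and names are stand-ins, not the upstream code:

```cuda
#include <stdint.h>

// Stand-in dequantize kernel: real code applies per-block scales.
__global__ void dequantize_chunk(const int8_t *src, float *dst) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    dst[i] = (float) src[i];
}

// Upload chunk c on one stream while chunk c-1 is still being dequantized
// on the other; in-stream ordering keeps each copy ahead of its own kernel.
// Assumes host_q is pinned and chunk_elems is a multiple of 256.
void upload_and_dequantize(const int8_t *host_q, int8_t *dev_q, float *dev_f32,
                           int chunk_elems, int n_chunks,
                           cudaStream_t streams[2]) {
    for (int c = 0; c < n_chunks; ++c) {
        cudaStream_t s = streams[c % 2];
        cudaMemcpyAsync(dev_q + c * chunk_elems, host_q + c * chunk_elems,
                        chunk_elems, cudaMemcpyHostToDevice, s);
        dequantize_chunk<<<chunk_elems / 256, 256, 0, s>>>(
            dev_q + c * chunk_elems, dev_f32 + c * chunk_elems);
    }
    cudaStreamSynchronize(streams[0]);
    cudaStreamSynchronize(streams[1]);
}
```
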
commit b1ee8f59b4
Author: Henri Vasserman
Date:   2023-04-29 01:31:56 +02:00

    cuBLAS: non-contiguous tensor support (#1215)

    * Cuda: non-contiguous tensor support
    * remove extra stuff
    * rename
    * fix error
    * more fixes, now OpenBLAS and CLBlast build too
    * now then?

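For a tensor whose rows are individually contiguous but separated by a larger byte stride, a single 2-D copy can gather them into a dense device buffer. A sketch of one way to support such non-contiguous sources (an assumption; the actual change may differ in detail):

```cuda
#include <stddef.h>

// Gather nrows rows of row_bytes each from a strided host tensor into a
// dense device buffer; nb1 is the byte stride between consecutive rows.
cudaError_t copy_rows_to_device(void *dst_dense, const void *src_strided,
                                size_t row_bytes, size_t nb1, size_t nrows,
                                cudaStream_t stream) {
    return cudaMemcpy2DAsync(dst_dense, row_bytes,      // dense destination
                             src_strided, nb1,          // strided source
                             row_bytes, nrows,          // width, height
                             cudaMemcpyHostToDevice, stream);
}
```
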
commit 36d19a603b
Author: Stephan Walter
Date:   2023-04-28 23:10:43 +00:00

    Remove Q4_3 which is no better than Q5 (#1218)

commit 574406dc7e
Author: Georgi Gerganov
Date:   2023-04-26 23:14:13 +03:00

    ggml : add Q5_0 and Q5_1 quantization (#1187)

    * ggml : add Q5_0 quantization (cuBLAS only)
    * ggml : fix Q5_0 qh -> uint32_t
    * ggml : fix q5_0 histogram stats
    * ggml : q5_0 scalar dot product
    * ggml : q5_0 ARM NEON dot
    * ggml : q5_0 more efficient ARM NEON using uint64_t masks
    * ggml : rename Q5_0 -> Q5_1
    * ggml : adding Q5_0 mode
    * quantize : add Q5_0 and Q5_1 to map
    * ggml : AVX2 optimizations for Q5_0, Q5_1 (#1195)
    ---------
    Co-authored-by: Stephan Walter <stephan@walter.name>

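Per the "qh -> uint32_t" bullet, Q5_0 packs the fifth bit of each weight into a 32-bit mask alongside the usual nibbles. A sketch of the block and its dequantization; the field layout and offsets are assumptions in the spirit of the commit, not the upstream definition:

```cuda
#include <stdint.h>
#include <cuda_fp16.h>

#define QK5_0 32

// Sketch of a Q5_0 block: fp16 scale, the 5th bit of each of the 32
// weights packed into qh, and the low 4 bits stored as nibbles.
typedef struct {
    half     d;              // scale
    uint32_t qh;             // 5th bit of each weight
    uint8_t  qs[QK5_0 / 2];  // low 4 bits, two weights per byte
} block_q5_0;

static void dequantize_block_q5_0(const block_q5_0 *b, float *y) {
    const float d = __half2float(b->d);
    for (int j = 0; j < QK5_0 / 2; ++j) {
        const uint8_t xh0 = ((b->qh >> j)               & 1) << 4;
        const uint8_t xh1 = ((b->qh >> (j + QK5_0 / 2)) & 1) << 4;
        const int q0 = ((b->qs[j] & 0x0F) | xh0) - 16;  // 5-bit range [-16, 15]
        const int q1 = ((b->qs[j] >>   4) | xh1) - 16;
        y[j]             = d * q0;
        y[j + QK5_0 / 2] = d * q1;
    }
}
```
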
commit 7a32fcb3b2
Author: Georgi Gerganov
Date:   2023-04-25 23:40:51 +03:00

    ggml : add Q8_0 quantization format (rename the old one to Q8_1) (ARM NEON) (#1179)

    * ggml : add Q8_0 quantization format (rename the old one to Q8_1)
    * tests : fix test-quantize-fns
    * ggml : finalize Q8_0 implementation
    * ggml : use q4_0_q8_0 and q4_2_q8_0
    * ggml : fix Q8_0 dot product bug (ARM)
    * ggml : Q8_0 unroll x2
    * ggml : fix bug - using wrong block type
    * ggml : extend quantize_fns_t with "vec_dot_type"
    * ggml : fix Q8_0 to use 255 values out of 256
    * ggml : fix assert using wrong QK4_2 instead of QK4_3

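A sketch of a Q8_0 block and its reference quantization: one scale plus 32 signed 8-bit weights. Reading the "255 values out of 256" bullet as clamping to [-127, 127] so the range is symmetric around zero is an interpretation, and the code below is illustrative, not the upstream implementation:

```cuda
#include <stdint.h>
#include <math.h>

#define QK8_0 32

// Sketch of a Q8_0 block: per-block scale plus 32 int8 weights.
typedef struct {
    float  d;          // scale
    int8_t qs[QK8_0];  // quantized weights
} block_q8_0;

static void quantize_block_q8_0(const float *x, block_q8_0 *b) {
    float amax = 0.0f;  // absolute maximum over the block
    for (int j = 0; j < QK8_0; ++j) {
        amax = fmaxf(amax, fabsf(x[j]));
    }
    const float d  = amax / 127.0f;  // maps [-amax, amax] onto [-127, 127]
    const float id = d != 0.0f ? 1.0f / d : 0.0f;
    b->d = d;
    for (int j = 0; j < QK8_0; ++j) {
        b->qs[j] = (int8_t) roundf(x[j] * id);  // round-to-nearest, per the
                                                // rounding-fix bullet
    }
}
```
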
commit 50cb666b8a
Author: slaren
Date:   2023-04-21 21:59:17 +02:00

    Improve cuBLAS performance by using a memory pool (#1094)

    * Improve cuBLAS performance by using a memory pool
    * Move cuda specific definitions to ggml-cuda.h/cu
    * Add CXX flags to nvcc
    * Change memory pool synchronization mechanism to a spin lock
      General code cleanup

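Caching device buffers avoids repeated cudaMalloc/cudaFree calls on the hot path. A minimal sketch of a device-buffer pool guarded by an atomic spin lock, in the spirit of the commit (the upstream pool differs in detail, e.g. it tracks the actual size of reused buffers):

```cuda
#include <atomic>
#include <cstddef>

#define POOL_SIZE 16

struct pool_entry {
    void  *ptr;
    size_t size;
};

static pool_entry       g_pool[POOL_SIZE] = {};
static std::atomic_flag g_lock = ATOMIC_FLAG_INIT;

static void * pool_malloc(size_t size) {
    while (g_lock.test_and_set(std::memory_order_acquire)) {}  // spin
    for (int i = 0; i < POOL_SIZE; ++i) {
        if (g_pool[i].ptr != NULL && g_pool[i].size >= size) {
            void *ptr = g_pool[i].ptr;  // reuse a cached buffer
            g_pool[i].ptr = NULL;
            g_lock.clear(std::memory_order_release);
            return ptr;
        }
    }
    g_lock.clear(std::memory_order_release);
    void *ptr = NULL;
    cudaMalloc(&ptr, size);             // slow path: fresh allocation
    return ptr;
}

static void pool_free(void *ptr, size_t size) {
    while (g_lock.test_and_set(std::memory_order_acquire)) {}
    for (int i = 0; i < POOL_SIZE; ++i) {
        if (g_pool[i].ptr == NULL) {
            g_pool[i].ptr  = ptr;       // return buffer to the pool
            g_pool[i].size = size;
            g_lock.clear(std::memory_order_release);
            return;
        }
    }
    g_lock.clear(std::memory_order_release);
    cudaFree(ptr);                      // pool full: release for real
}
```
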
commit 2005469ea1
Author: slaren
Date:   2023-04-20 20:49:53 +02:00

    Add Q4_3 support to cuBLAS (#1086)

commit 02d6988121
Author: slaren
Date:   2023-04-20 03:14:14 +02:00

    Improve cuBLAS performance by dequantizing on the GPU (#1065)