792d1a1b16  llama : minor
            Georgi Gerganov, 2023-10-30 11:34:47 +02:00

f39e6075cf  llama : add llm_build_kqv helper
            Georgi Gerganov, 2023-10-29 22:45:03 +02:00
            ggml-ci

c9121fdd0f  llama : remove obsolete comments in build graphs
            Georgi Gerganov, 2023-10-29 21:44:19 +02:00

a104abea48  llama : simplify falcon Q, K, V computation
            Georgi Gerganov, 2023-10-29 21:24:25 +02:00

31a12f3d03  llama : fix llm_build_k_shift to use n_head_kv instead of n_head
            Georgi Gerganov, 2023-10-29 21:17:46 +02:00

5990861938  llama : remove obsolete offload names
            Georgi Gerganov, 2023-10-29 21:11:20 +02:00

3e0462594b  llama : add llm_build_kv_store helper
            Georgi Gerganov, 2023-10-29 21:09:34 +02:00
            ggml-ci

909d64471b  llama : fix offloading after recent changes
            Georgi Gerganov, 2023-10-29 20:38:49 +02:00

38728a0be0  llama : add llm_build_k_shift helper
            Georgi Gerganov, 2023-10-29 19:23:07 +02:00
            ggml-ci

dbf836bb64  llama : add llm_build_ffn helper function (#3849)
            Georgi Gerganov, 2023-10-29 18:47:46 +02:00
            ggml-ci

7db9c96d8a  llama : add llm_build_norm helper function
            Georgi Gerganov, 2023-10-29 15:48:48 +02:00
            ggml-ci

210e6e5d02  llama : remove obsolete map for layer counting
            Georgi Gerganov, 2023-10-29 13:39:04 +02:00

79ad734417  llama : comment
            Georgi Gerganov, 2023-10-29 13:27:53 +02:00
            ggml-ci

761087932b  llama : add functional header
            Georgi Gerganov, 2023-10-29 13:26:32 +02:00

8925cf9ef8  llama : add layer index to all tensor names
            Georgi Gerganov, 2023-10-29 13:22:15 +02:00

1e9c5443c2  llama : refactor tensor offloading as callback
            Georgi Gerganov, 2023-10-29 13:05:10 +02:00

da936188d8  llama : move refact in correct place + optimize graph input
            Georgi Gerganov, 2023-10-29 11:48:58 +02:00

739b85c985  llama : try to fix build
            Georgi Gerganov, 2023-10-29 11:25:32 +02:00

25cfbf6776  llama : fix non-CUDA build
            Georgi Gerganov, 2023-10-29 11:12:03 +02:00

b4ad03b3a7  llama : try to optimize offloading code
            Georgi Gerganov, 2023-10-29 10:33:11 +02:00

79617902ea  llama : fix res_norm offloading
            Georgi Gerganov, 2023-10-29 09:20:35 +02:00

e14aa46151  llama : do tensor offload only with CUDA
            Georgi Gerganov, 2023-10-29 08:03:46 +02:00

0dc05b8433  llama : factor graph input into a function
            Georgi Gerganov, 2023-10-29 07:52:43 +02:00

4e98897ede  llama : support offloading result_norm + comments
            Georgi Gerganov, 2023-10-29 07:36:07 +02:00

51c4f9ee9f  llama : comments
            Georgi Gerganov, 2023-10-28 22:50:08 +03:00

3af8771389  llama : update offload log messages to print node index
            Georgi Gerganov, 2023-10-28 22:36:44 +03:00

83d2c43791  llama : offload rest of the models
            Georgi Gerganov, 2023-10-28 22:30:54 +03:00
            ggml-ci

38aca9e1ab  llama : factor out tensor offloading outside the build call (wip)
            Georgi Gerganov, 2023-10-28 21:22:31 +03:00
            ggml-ci

5946d98fc8  metal : disable kernel load log
            Georgi Gerganov, 2023-10-28 21:22:01 +03:00

8b2420d249  llama : factor out ggml-alloc from graph graph build functions
            Georgi Gerganov, 2023-10-28 19:54:28 +03:00
            ggml-ci

ff3bad83e2  flake : update flake.lock for newer transformers version + provide extra dev shell (#3797)
            Erik Scholz, 2023-10-28 16:41:07 +02:00
            * flake : update flake.lock for newer transformers version + provide extra dev
              shell with torch and transformers (for most convert-xxx.py scripts)

82a6646e02  metal : try cwd for ggml-metal.metal if bundle lookup fails (#3793)
            Aarni Koskela, 2023-10-28 15:43:01 +03:00
            * Try cwd for ggml-metal if bundle lookup fails
              When building with `-DBUILD_SHARED_LIBS=ON -DLLAMA_METAL=ON -DLLAMA_BUILD_SERVER=ON`,
              `server` would fail to load `ggml-metal.metal` because `[bundle pathForResource:...]`
              returns `nil`. In that case, fall back to `ggml-metal.metal` in the cwd instead of
              passing `null` as a path. Follows up on #1782
            * Update ggml-metal.m
            Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

ba231e8a6d  issues : change label from bug to bug-unconfirmed (#3748)
            Georgi Gerganov, 2023-10-28 15:35:26 +03:00

8a2f2fea29  convert : ignore tokens if their IDs are within [0, vocab_size) (#3831)
            Georgi Gerganov, 2023-10-28 06:25:15 -06:00

bd6d9e2059  llama : allow quantizing k-quants to fall back when tensor size incompatible (#3747)
            Kerfuffle, 2023-10-28 14:54:24 +03:00
            * Allow quantizing k-quants to fall back when tensor size incompatible
            * quantizing: Add warning when tensors were incompatible with k-quants
              Clean up k-quants state passing a bit

ee1a0ec9cb  llama : add option for greedy sampling with probs (#3813)
            Georgi Gerganov, 2023-10-28 14:23:11 +03:00
            * llama : add option for greedy sampling with probs
            * llama : add comment about llama_sample_token_greedy() missing probs
            * sampling : temp == 0.0 -> no probs, temp < 0.0 -> probs

177461104b  common : print that one line of the syntax help *also* to standard output (#3823)
            Henk Poley, 2023-10-28 13:16:33 +03:00

fdee152e4e  starcoder : add GPU offloading (#3827)
            Georgi Gerganov, 2023-10-28 12:06:08 +03:00
            * starcoder : do not GPU split 1D bias tensors
            * starcoder : offload layers to GPU
            ggml-ci

41aee4df82  speculative : ensure draft and target model vocab matches (#3812)
            Kerfuffle, 2023-10-28 00:40:07 +03:00
            * speculative: Ensure draft and target model vocab matches
            * Tolerate small differences when checking dft vs tgt vocab

6d459cbfbe  llama : correctly report GGUFv3 format (#3818)
            cebtenzzre, 2023-10-27 17:33:53 -04:00

c8d6a1f34a  simple : fix batch handling (#3803)
            Thibault Terrasson, 2023-10-27 08:37:41 -06:00

2f9ec7e271  cuda : improve text-generation and batched decoding performance (#3776)
            Georgi Gerganov, 2023-10-27 17:01:23 +03:00
            * cuda : prints wip
            * cuda : new cublas gemm branch for multi-batch quantized src0
            * cuda : add F32 sgemm branch
            * cuda : fine-tune >= VOLTA params + use MMQ only for small batches
            * cuda : remove duplicated cuBLAS GEMM code
            * cuda : add CUDA_USE_TENSOR_CORES and GGML_CUDA_FORCE_MMQ macros
            * build : add compile option to force use of MMQ kernels

34b2a5e1ee  server : do not release slot on image input (#3798)
            Georgi Gerganov, 2023-10-26 22:54:17 +03:00

6961c4bd0b  batched-bench : print params at start
            Georgi Gerganov, 2023-10-25 10:26:27 +03:00

cc44877486  log : disable pid in log filenames
            Georgi Gerganov, 2023-10-25 10:09:16 +03:00

ad93962657  server : add parameter -tb N, --threads-batch N (#3584) (#3768)
            cebtenzzre, 2023-10-24 23:10:43 +03:00
            Co-authored-by: Michael Coppola <m18coppola@gmail.com>
            Co-authored-by: Michael Coppola <info@michaeljcoppola.com>

1717521cdb  server : do not block system prompt update (#3767)
            Georgi Gerganov, 2023-10-24 23:08:20 +03:00
            * server : do not block system prompt update
            * server : update state machine logic to process system prompts
            * server : minor

b2f7e04bd3  sync : ggml (conv ops + cuda MSVC fixes) (#3765)
            Georgi Gerganov, 2023-10-24 21:51:20 +03:00
            ggml-ci

abd21fc99f  cmake : add missed dependencies (#3763)
            John Smith, 2023-10-24 20:48:45 +03:00

2b4ea35e56  cuda : add batched cuBLAS GEMM for faster attention (#3749)
            Georgi Gerganov, 2023-10-24 16:48:37 +03:00
            * cmake : add helper for faster CUDA builds
            * batched : add NGL arg
            * ggml : skip nops in compute_forward
            * cuda : minor indentation
            * cuda : batched cuBLAS GEMMs for src0 F16 and src1 F32 (attention ops)
            * Apply suggestions from code review
              These changes plus `#define cublasGemmBatchedEx hipblasGemmBatchedEx` are
              needed to compile with ROCm. Performance has not been tested, but it seems
              to work.
            * cuda : add ROCm / hipBLAS cublasGemmBatchedEx define
            * cuda : add cublasGemmStridedBatchedEx for non-broadcasted cases
            * cuda : reduce mallocs in cublasGemmBatchedEx branch
            * cuda : add TODO for calling cublas from kernel + using mem pool
            Co-authored-by: Kerfuffle <44031344+KerfuffleV2@users.noreply.github.com>