slaren · f5ef5cfb18 · 2023-09-30 18:12:57 +02:00
ggml-cuda : perform cublas mat mul of quantized types as f16 (#3412)
* ggml-cuda : perform cublas matrix multiplication of quantized types as fp16
* rename CC_TURING to CC_VOLTA
* disable fp16 mat mul completely with multi GPU

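For context on what "as f16" means here: the quantized weights are dequantized to half precision and the GEMM itself runs through cuBLAS with fp16 inputs, outputs, and compute type. A minimal sketch of such a call, assuming an existing cuBLAS handle and device buffers already converted to half; the function name, buffer names, and leading-dimension choices are illustrative, not the ones used in ggml-cuda.cu:

```cpp
#include <cublas_v2.h>
#include <cuda_fp16.h>

// Illustrative fp16 GEMM via cuBLAS: dst = op(src0) * src1 in cuBLAS's
// column-major convention, with fp16 storage and fp16 accumulation.
void gemm_f16(cublasHandle_t handle,
              const half * src0_f16, const half * src1_f16, half * dst_f16,
              int m, int n, int k) {
    const half alpha = __float2half(1.0f);
    const half beta  = __float2half(0.0f);
    cublasGemmEx(handle, CUBLAS_OP_T, CUBLAS_OP_N,
                 m, n, k,
                 &alpha,
                 src0_f16, CUDA_R_16F, k,   // A: k x m, used transposed
                 src1_f16, CUDA_R_16F, k,   // B: k x n
                 &beta,
                 dst_f16,  CUDA_R_16F, m,   // C: m x n
                 CUBLAS_COMPUTE_16F,
                 CUBLAS_GEMM_DEFAULT_TENSOR_OP);
}
```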
						 
				 
			
				
					
						
							
							
slaren · 16bc66d947 · 2023-09-28 22:42:38 +03:00
llama.cpp : split llama_context_params into model and context params (#3301)
* llama.cpp : split llama_context_params into model and context params
ggml-ci
* fix metal build
* fix freq_base/scale default to model value
* llama-bench : keep the same model between tests when possible
* move n_threads to llama_context_params, add n_threads_batch
* fix mpi build
* remove kv_size(), cuda scratch fixes
* remove low-vram option
* add n_threads_batch to system info, refactor to get_system_info()
* add documentation about --threads-batch to the READMEs
* llama-bench fix
* main : fix rope freq/scale warning
* llama.cpp : add llama_get_model
common : add llama_tokenize from model
* remove duplicated ctx/model functions
ggml-ci
* cuda : print total VRAM used

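After this split, model-level and context-level settings travel in separate structs. A minimal usage sketch, assuming the post-split C API of that period (llama_model_default_params / llama_context_default_params together with the existing load and context-creation calls); the field values are illustrative:

```cpp
#include "llama.h"

int main() {
    llama_backend_init(false);

    // Model-level settings (e.g. GPU offload) now live in llama_model_params.
    llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 35;
    llama_model * model = llama_load_model_from_file("model.gguf", mparams);

    // Context-level settings stay in llama_context_params;
    // n_threads_batch was added in this change.
    llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx           = 4096;
    cparams.n_threads       = 8;
    cparams.n_threads_batch = 8;
    llama_context * ctx = llama_new_context_with_model(model, cparams);

    // ... run inference ...

    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```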
						 
				 
			
				
					
						
							
							
Georgi Gerganov · ec893798b7 · 2023-09-28 19:04:36 +03:00
llama : custom attention mask + parallel decoding + no context swaps (#3228)
* tests : verify that RoPE is "additive"
* llama : replace ggml_diag_mask_inf with ggml_add (custom -inf mask)
* ggml : ggml_rope now takes a vector with positions instead of n_past
* metal : add rope_f16 kernel + optimize cpy kernels
* llama : unified KV cache + batch inference API
* llama : add new llama_decode() API that works with llama_batch
* llama : add cell_max heuristic for more efficient kv_cache
* llama : extend llama_kv_cache API
* llama : more robust cell_max heuristic + wip shift
* metal : disable concurrency optimization
* llama : add llama_kv_cache_shift_seq + no more context swaps
* llama : apply K-cache roping for Falcon and Baichuan
* speculative : fix KV cache management
* parallel : example for serving multiple users in parallel
* parallel : disable hot-plug to avoid cache fragmentation
* fixes : speculative KV cache + llama worst-case graph
* llama : extend batch API to select which logits to output
* llama : fix worst case graph build
* ggml-cuda : update rope implementation for parallel decoding (#3254)
* ggml-cuda : update rope implementation for parallel decoding
* better solution for p0 computation
* fix rope
* simpler rope implementation
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* make : add parallel to build + fix static functions in llama.cpp
* simple : fix token counting
* parallel : various improvements
* llama : fix cell_max logic + rename functions
* parallel : try smaller batches when the KV cache is fragmented
* parallel : fix sequence termination criteria
* llama : silence KV cache errors
* parallel : remove new line from prompt
* parallel : process system prompt once + configurable parameters + llama API
* parallel : remove question with short answers
* parallel : count cache misses
* parallel : print misses on each request
* parallel : minor
* llama : fix n_kv to never become 0
* parallel : rename hot-plug to continuous-batching
* llama : improve llama_batch API + simplify parallel example
* simple : add parallel decoding support
* simple : improve comments + free batch
* ggml-cuda : add rope f16, restore performance with parallel decoding (#3272)
* ggml-cuda : add rope f16, restore performance
* offload KQ_mask with all models
* fix rope shift
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* llama : disable MPI for now
ggml-ci
* train : make KQ_pos memory buffer permanent via dummy scale op
* ggml : revert change to ggml_cpy, add ggml_cont_Nd instead (#3275)
ggml-ci
* parallel : fix bug (extra BOS) + smaller token_prev array
* parallel : fix cases where the input prompts can overflow the batch
* parallel : add disabled experimental batch chunking in powers of two
* llama : llama.h formatting + comments
* simple : add README.md
* llama : fix kv cache heuristic when context is less than 32
* parallel : fix crash when `-n -1`
* llama : simplify returns if/else branches
* metal : use mm kernels for batch size > 2
* examples : utilize new llama_get_logits_ith()
* examples : add example for batched decoding
* examples : do not eval prompt 2 times (close #3348)
* server : clear the KV cache beyond n_past before llama_decode
* server : avoid context swaps by shifting the KV cache
Co-authored-by: slaren <slarengh@gmail.com>

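The "custom attention mask" above replaces ggml_diag_mask_inf with an additive mask: a tensor of 0 / -inf values added to the attention scores before the softmax, which is what lets tokens from different sequences share one unified KV cache. A CPU-side sketch of the idea (illustrative only, not the ggml graph code):

```cpp
#include <vector>
#include <limits>

// Build an additive attention mask for n_tokens new tokens attending over
// n_kv cache cells: 0.0f where attention is allowed, -inf where it is
// masked out (different sequence or future position).
std::vector<float> build_kq_mask(int n_tokens, int n_kv,
                                 const std::vector<int> & token_seq,  // seq id of each new token
                                 const std::vector<int> & token_pos,  // position of each new token
                                 const std::vector<int> & cell_seq,   // seq id of each KV cell
                                 const std::vector<int> & cell_pos) { // position of each KV cell
    const float neg_inf = -std::numeric_limits<float>::infinity();
    std::vector<float> mask(n_tokens * n_kv, neg_inf);
    for (int i = 0; i < n_tokens; ++i) {
        for (int j = 0; j < n_kv; ++j) {
            const bool same_seq = cell_seq[j] == token_seq[i];
            const bool causal   = cell_pos[j] <= token_pos[i];
            if (same_seq && causal) {
                mask[i*n_kv + j] = 0.0f; // added to KQ before the softmax
            }
        }
    }
    return mask;
}
```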
						 
				 
			
				
					
						
							
							
slaren · da0400344b · 2023-09-28 13:08:28 +03:00
ggml-cuda : perform cublas fp16 matrix multiplication as fp16 (#3370)
* ggml-cuda : perform cublas fp16 matrix multiplication as fp16
* try to fix rocm build
* restrict fp16 mat mul to volta and up

Johannes Gäßler · ee66942d7e · 2023-09-17 23:35:20 +02:00
CUDA: fix peer access logic (#3231)

Johannes Gäßler · 111163e246 · 2023-09-17 16:37:53 +02:00
CUDA: enable peer access between devices (#2470)

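Peer access lets one GPU read another GPU's memory directly instead of staging copies through the host. A minimal sketch of how this is typically enabled with the CUDA runtime API (illustrative; not the exact ggml-cuda code):

```cpp
#include <cuda_runtime.h>

// Enable direct peer-to-peer access between every pair of visible GPUs
// that supports it. The "already enabled" error is cleared and ignored.
void enable_peer_access_all() {
    int n_devices = 0;
    cudaGetDeviceCount(&n_devices);
    for (int i = 0; i < n_devices; ++i) {
        cudaSetDevice(i);
        for (int j = 0; j < n_devices; ++j) {
            if (i == j) continue;
            int can_access = 0;
            cudaDeviceCanAccessPeer(&can_access, i, j);
            if (can_access) {
                cudaError_t err = cudaDeviceEnablePeerAccess(j, 0);
                if (err == cudaErrorPeerAccessAlreadyEnabled) {
                    cudaGetLastError(); // clear the sticky error and continue
                }
            }
        }
    }
}
```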
						 
				 
			
				
					
						
							
							
Johannes Gäßler · 578d8c8f5c · 2023-09-17 14:16:22 +02:00
CUDA: fix scratch malloced on non-main device (#3220)

Vlad · 5dbc2b3213 · 2023-09-16 16:55:43 +02:00
Enable build with CUDA 11.0 (make) (#3132)
* CUDA 11.0 fixes
* Cleaner CUDA/host flags separation
Also renamed GGML_ASSUME into GGML_CUDA_ASSUME

Johannes Gäßler · 0a5eebb45d · 2023-09-13 11:20:24 +02:00
CUDA: mul_mat_q RDNA2 tunings (#2910)
* CUDA: mul_mat_q RDNA2 tunings
* Update ggml-cuda.cu
Co-authored-by: Henri Vasserman <henv@hot.ee>

Johannes Gäßler · 4f7cd6ba9c · 2023-09-13 00:15:33 +02:00
CUDA: fix LoRAs (#3130)

Johannes Gäßler · 89e89599fd · 2023-09-11 22:58:41 +02:00
CUDA: fix mul_mat_q not used for output tensor (#3127)

Johannes Gäßler · d54a4027a6 · 2023-09-11 19:55:51 +02:00
CUDA: lower GPU latency + fix Windows performance (#3110)

Johannes Gäßler · 8a4ca9af56 · 2023-09-11 13:00:24 +02:00
CUDA: add device number to error messages (#3112)

Georgi Gerganov · b3e9852e47 · 2023-09-08 17:58:07 +03:00
sync : ggml (CUDA GLM RoPE + POSIX) (#3082)
ggml-ci

Jiahao Li · 35195689cd · 2023-09-04 08:53:30 +02:00
2x faster (rms) norm cuda kernels (3.7% e2e improvement) (#2985)
* 2x faster (rms) norm cuda kernels
* Fix code style

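For reference, the quantity these kernels compute per row $x \in \mathbb{R}^n$ is the usual RMS normalization (in ggml the learned per-channel scale is applied by a separate multiply in the graph):

```latex
\mathrm{rms\_norm}(x)_i \;=\; \frac{x_i}{\sqrt{\dfrac{1}{n}\sum_{j=1}^{n} x_j^{2} \;+\; \varepsilon}}
```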
						 
				 
			
				
					
						
							
							
Engininja2 · f04d002844 · 2023-09-01 23:33:19 +02:00
cuda : vsubss4 for older versions of ROCm/clang (#2942)

Johannes Gäßler · 92b1bbd2ec · 2023-08-28 14:23:55 +03:00
CUDA: fix RoPE asserts, block sizes (#2833)

Georgi Gerganov · eaa13a48ff · 2023-08-27 16:40:48 +03:00
falcon : fix CUDA inference by making K and Q contiguous (#2830)
* falcon : fix CUDA inference by making K and Q contiguous
ggml-ci
* cuda : add assert to guard from non-cont ropes

Kawrakow · a6d1189fdd · 2023-08-27 15:19:59 +03:00
k_quants tuning for Falcon-7b (#2816)
* Make ggml-cuda.cu build with QK_K = 64
Using LLAMA_CUDA_FORCE_DMMV = ON and -nommq it runs and produces a meaningful result.
* k_quants tuning for Falcon-7b
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

Henri Vasserman · 6bbc598a63 · 2023-08-25 12:09:42 +03:00
ROCm Port (#1087)
* use hipblas based on cublas
* Update Makefile for the Cuda kernels
* Expand arch list and make it overrideable
* Fix multi GPU on multiple amd architectures with rocblas_initialize() (#5)
* add hipBLAS to README
* new build arg LLAMA_CUDA_MMQ_Y
* fix half2 decomposition
* Add intrinsics polyfills for AMD
* AMD assembly optimized __dp4a
* Allow overriding CC_TURING
* use "ROCm" instead of "CUDA"
* ignore all build dirs
* Add Dockerfiles
* fix llama-bench
* fix -nommq help for non CUDA/HIP
Co-authored-by: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Co-authored-by: ardfork <134447697+ardfork@users.noreply.github.com>
Co-authored-by: funnbot <22226942+funnbot@users.noreply.github.com>
Co-authored-by: Engininja2 <139037756+Engininja2@users.noreply.github.com>
Co-authored-by: Kerfuffle <44031344+KerfuffleV2@users.noreply.github.com>
Co-authored-by: jammm <2500920+jammm@users.noreply.github.com>
Co-authored-by: jdecourval <7315817+jdecourval@users.noreply.github.com>

Georgi Gerganov · 3f460a2b72 · 2023-08-25 11:55:59 +03:00
cuda : add RoPE kernel for mode == 2 (NeoX) (#2760)
* cuda : add RoPE kernel for mode == 2 (NeoX)
* falcon : do not offload the embeddings layer

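Mode 2 ("NeoX") RoPE differs from the default mode only in how dimensions are paired before the rotation: instead of rotating adjacent pairs (2i, 2i+1), it pairs dimension i with i + n_dims/2. A scalar reference sketch for one head at position pos, assuming the standard base-10000 frequencies (the CUDA kernel's indexing and layout differ):

```cpp
#include <cmath>

// Rotate one head of size n_dims in place, NeoX-style: dimension i is
// paired with i + n_dims/2 rather than with its adjacent neighbor.
void rope_neox_ref(float * x, int n_dims, int pos, float base = 10000.0f) {
    for (int i = 0; i < n_dims/2; ++i) {
        const float theta = pos * std::pow(base, -2.0f*i/n_dims);
        const float c = std::cos(theta), s = std::sin(theta);
        const float x0 = x[i];
        const float x1 = x[i + n_dims/2];   // NeoX pairing
        x[i]            = x0*c - x1*s;
        x[i + n_dims/2] = x0*s + x1*c;
    }
}
```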
						 
				 
			
				
					
						
							
							
Georgi Gerganov · cf658adc83 · 2023-08-23 23:08:04 +03:00
llm : add Falcon support (#2717)
* llama : refactor GGUF constants into static maps
* llama : check if model architecture is known
* llama : refactor llama_model_load_internal()
* gguf : add KV constant maps
* llm : read arch-specific KVs
* convert : add dummy scores + types
* falcon : load tensor data (CPU only)
* llama : fix loading progress bar
* llama : add arch member to llama_model
* falcon : CPU inference working
* falcon : support non-40B models
* falcon : minor
* llama : minor updates
ggml-ci
* convert-falcon-hf-to-gguf.py : fix special token mapping
* llama.cpp : llama default UNK token = id 0
* llama.cpp : fix bpe tokenizer
* llama.cpp : fix the fix of bpe tokenizer
* ggml : pass eps to ggml_norm
* metal : implement RoPE (mode = 2) + avoid ggml_repeat
* ggml : ggml_repeat always creates new tensor
* falcon : copy-paste self-attention from LLaMA
* metal : print extra compute pipeline info
* falcon : minor changes (still chasing the Metal problem)
* llama.cpp : fix linefeed token
* metal : fix GELU kernel numerical stability by using precise::tanh
* metal : temporary workaround for the concurrency optimization bug
* falcon : add CUDA offloading (#2739)
* llama : better model naming and size reporting
* llama : prep new tokenizer support
* llama : advanced BPE tokenizer based on ggllm.cpp implementation
* llama : remove obsolete comment
ggml-ci
* common : remove obsolete BPE API + disable test-tokenizer-1
* llama : revert BPE special-case in llama_byte_to_token()
* cuda : add TODOs for RoPE NeoX implementation
* llama : default special tokens based on vocab type
* perplexity : add log for start of tokenization
Co-authored-by: klosax <131523366+klosax@users.noreply.github.com>
Co-authored-by: slaren <slarengh@gmail.com>

Johannes Gäßler · c63bb1d16a · 2023-08-22 22:47:05 +02:00
CUDA: use mul_mat_q kernels by default (#2683)

Jiahao Li · 800c9635b4 · 2023-08-22 20:27:06 +02:00
Fix CUDA softmax by subtracting max value before exp (#2665)

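Subtracting the row maximum before exponentiating is the standard way to keep exp() from overflowing for large attention scores; the shared factor exp(-max) cancels in the normalization, so the result is unchanged. A scalar reference sketch of what a stabilized softmax computes per row (not the CUDA kernel itself):

```cpp
#include <cmath>
#include <algorithm>

// Numerically stable softmax over one row of n values:
// exp(x[i] - max) <= 1, so the exponentials cannot overflow.
void softmax_row(float * x, int n) {
    float max_val = -INFINITY;
    for (int i = 0; i < n; ++i) max_val = std::max(max_val, x[i]);

    float sum = 0.0f;
    for (int i = 0; i < n; ++i) {
        x[i] = std::exp(x[i] - max_val);
        sum += x[i];
    }
    for (int i = 0; i < n; ++i) x[i] /= sum;
}
```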
						 
				 
			
				
					
						
							
							
slaren · 1123f7fbdf · 2023-08-22 15:25:19 +02:00
ggml-cuda : use graph allocator (#2684)
use a different function for no_alloc to avoid breaking backwards compat, fixes lora
remove 512 n_batch limit
fixed 2048 batch size
cleanup
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

Georgi Gerganov · ef3f333d37 · 2023-08-22 14:22:08 +03:00
ggml : sync latest (SAM + SD operators, CUDA alibi) (#2709)
* ggml : sync latest (SAM + SD operators, CUDA alibi)
ggml-ci
* ggml : fix tabs

slaren · 097e121e2f · 2023-08-18 12:44:58 +02:00
llama : add benchmark example (#2626)
* llama : add benchmark example
* add to examples CMakeLists.txt
* fix msvc build
* add missing include
* add Bessel's correction to stdev calculation
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* improve markdown formatting
* add missing include
* print warning if NDEBUG is not defined
* remove n_prompt and n_gen from the matrix, use each value separately instead
* better checks for non-optimized builds
* llama.cpp : fix MEM_REQ_SCRATCH0 reusing the value of n_ctx of the first call
* fix json formatting
* add sql output
* add basic cpu and gpu info (linux/cuda only)
* markdown: also show values that differ from the default
* markdown: add build id
* cleanup
* improve formatting
* formatting
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

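One small item above, Bessel's correction, makes the reported standard deviation the unbiased sample estimate by dividing the sum of squared deviations by n - 1 instead of n. A short sketch of the computation (illustrative, not the llama-bench code itself):

```cpp
#include <cmath>
#include <vector>

// Sample standard deviation with Bessel's correction (divide by n - 1).
double stdev_sample(const std::vector<double> & v) {
    const size_t n = v.size();
    if (n < 2) return 0.0;

    double mean = 0.0;
    for (double x : v) mean += x;
    mean /= n;

    double sq_sum = 0.0;
    for (double x : v) sq_sum += (x - mean) * (x - mean);

    return std::sqrt(sq_sum / (n - 1));
}
```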
						 
				 
			
				
					
						
							
							
Johannes Gäßler · 1cd06fa25e · 2023-08-14 10:41:22 +02:00
CUDA: launch_bounds, small q4_K, q5_K mmq refactor (#2596)

Johannes Gäßler · f64d44a9b9 · 2023-08-13 00:24:45 +02:00
CUDA: Fixed OpenLLaMA 3b mmq, reduced compile time (#2590)

Johannes Gäßler · 25d43e0eb5 · 2023-08-09 09:42:34 +02:00
CUDA: tuned mul_mat_q kernels (#2546)

Johannes Gäßler · f514d1b306 · 2023-08-05 18:20:44 +02:00
CUDA: faster k-quant mul_mat_q kernels (#2525)

Cebtenzzre · 4329d1acb0 · 2023-08-04 17:35:22 +02:00
CUDA: use min compute capability of GPUs actually used (#2506)

Cebtenzzre · 02f9d96a86 · 2023-08-04 17:34:32 +02:00
CUDA: check if event is NULL before cudaStreamWaitEvent (#2505)
Fixes #2503

Johannes Gäßler · 468ea24fb4 · 2023-08-02 18:04:04 +02:00
CUDA: faster non k-quant mul_mat_q kernels (#2483)

Johannes Gäßler · 4f6b60c776 · 2023-08-02 16:48:10 +02:00
CUDA: Fix models with output size != 32000 (#2480)

Johannes Gäßler · 0728c5a8b9 · 2023-07-31 15:44:35 +02:00
CUDA: mmq CLI option, fixed mmq build issues (#2453)

Johannes Gäßler · 1215ed7d5c · 2023-07-31 14:32:30 +02:00
CUDA: Implemented row flattening for non-glm RoPE (#2468)

Johannes Gäßler · 2dbf518911 · 2023-07-31 13:18:51 +02:00
CUDA: fewer memory bank conflicts for mul_mat_q (#2458)

Johannes Gäßler · 11f3ca06b8 · 2023-07-29 23:04:44 +02:00
CUDA: Quantized matrix matrix multiplication (#2160)
* mmq implementation for non k-quants
* q6_K
* q2_K
* q3_k
* q4_K
* vdr
* q5_K
* faster q8_1 loading
* loop unrolling
* add __restrict__
* q2_K sc_high
* GGML_CUDA_MMQ_Y
* Updated Makefile
* Update Makefile
* DMMV_F16 -> F16
* Updated README, CMakeLists
* Fix CMakeLists.txt
* Fix CMakeLists.txt
* Fix multi GPU out-of-bounds

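The core idea of mul_mat_q is to multiply quantized blocks directly with integer arithmetic instead of dequantizing the weights to floats first. A scalar sketch of the per-block dot product for a simplified int8 block format (the layout is illustrative, not ggml's exact structs, and the q8_1 path additionally carries a per-block sum):

```cpp
#include <cstdint>

// One 32-element quantization block: int8 quants plus a float scale,
// so that value[i] is approximately d * qs[i].
struct block_q8 {
    float  d;
    int8_t qs[32];
};

// Dot product of two quantized blocks without dequantizing first:
// integer products are accumulated, then scaled once per block.
// On the GPU the inner loop maps to __dp4a (4 int8 MACs per instruction).
float vec_dot_q8(const block_q8 & a, const block_q8 & b) {
    int32_t sumi = 0;
    for (int i = 0; i < 32; ++i) {
        sumi += (int32_t) a.qs[i] * (int32_t) b.qs[i];
    }
    return a.d * b.d * (float) sumi;
}
```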
						 
				 
			
				
					
						
							
							
Johannes Gäßler · 9baf9ef304 · 2023-07-29 23:04:10 +02:00
CUDA: faster multi GPU synchronization (#2448)

Kawrakow · 129d844c87 · 2023-07-25 13:48:04 +03:00
Fix Q4_K and Q5_K for QK_K = 64 on CUDA (#2359)
* Fix Q4_K and Q5_K for QK_K = 64
* Very slightly better Q5_K bit fiddling
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

slaren · 41c674161f · 2023-07-24 17:57:12 +02:00
make rms_norm_eps a parameter (#2374)
* make rms_norm_eps a parameter
* add rms_norm_eps to command line
* fix baby llama, test-grad0
* use scientific notation for eps param in the help
ggml-ci

Georgi Gerganov · 5b2b2dc6ae · 2023-07-24 14:46:21 +03:00
ggml : sync (unary ops refactor, static-correctness) (#2370)
* ggml : sync (unary ops, tests)
ggml-ci
* tests : remove unnecessary funcs

Kawrakow · 2f9cf974a0 · 2023-07-24 00:19:47 +03:00
Some more Q4_K and Q5_K speedup on CUDA (#2346)
* Faster Q5_K on CUDA
* Small Q5_K improvement on older GPUs
* Sped up Q4_K on CUDA
GTX1660: 29.5 ms/t -> 25.6 ms/t
RTX4080: 8.40 ms/t -> 8.25 ms/t
* Sped up Q4_K on CUDA
GTX1660: 36.7 ms/t -> 35.6 ms/t
RTX4080:  9.8 ms/t ->  9.5 ms/t
* Address PR comments
* Add some comments to satisfy PR reviewer
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

slaren · 95a6c595e7 · 2023-07-23 14:36:02 +02:00
ggml: move op parameters from tensors to ggml_tensor::op_params (#2333)
* ggml: move op parameters from tensors to ggml_tensor::op_params
* alibi: use memcpy for float params
* remove `src[1] = NULL` in ops

Georgi Gerganov · e76d630df1 · 2023-07-23 15:09:47 +03:00
llama : grouped-query attention + LLaMAv2 70B support (#2276)
* CUDA: GQA implementation
* llama : support for GQA and LLaMAv2 70B
ggml-ci
* py : fix hparams parsing (if-else blocks)
ggml-ci
* py : oh boy ..
ggml-ci
* help : fix gqa value for 70B
ggml-ci
Co-authored-by: JohannesGaessler <johannesg@5d6.de>

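With grouped-query attention the model has fewer key/value heads than query heads (n_head_kv < n_head), and each group of n_head / n_head_kv consecutive query heads shares one KV head, which is what shrinks the 70B model's KV cache. The head mapping is simply:

```latex
\text{kv\_head}(h) \;=\; \left\lfloor \frac{h}{\,n_{\text{head}} / n_{\text{head\_kv}}\,} \right\rfloor,
\qquad h = 0, 1, \dots, n_{\text{head}} - 1 .
```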
						 
				 
			
				
					
						
							
							
Kawrakow · d2a43664f9 · 2023-07-23 08:49:20 +03:00
Speed up Q4_K (#2322)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

Johannes Gäßler · b9b7d94fc1 · 2023-07-22 21:27:34 +02:00
CUDA: Fixed 7b q3_K_S with mul_mat_vec_q (#2313)

Kawrakow · d924522a46 · 2023-07-21 17:27:51 +03:00
Custom RoPE + better memory management for CUDA (#2295)
* Custom RoPE + better memory management for CUDA
* Adjusted look ahead in ggml_cuda_pool_malloc to 5%
This is sufficient, it seems. We end up using about 200 MB less VRAM that way when running the 13B model with context 8192.
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

Georgi Gerganov · ae178ab46b · 2023-07-21 13:10:51 +03:00
llama : make tensor_split ptr instead of array (#2272)