Mirror of https://github.com/ggml-org/llama.cpp.git (synced 2025-11-04 09:32:00 +00:00)
			
		
		
		
- Launch an appropriate number of invocations (the next larger power of two). 32 invocations is common, and the barrier is much cheaper there.
- Specialize for "needs bounds checking" vs. not.
- Make the code less branchy and [[unroll]] the loops. In the final code, I see no branches inside the main loop (only predicated stores) when needs_bounds_check is false.
- Always sort ascending, then apply the ascending vs. descending option when doing the final stores to memory.
- Copy the values into shared memory, which makes them slightly cheaper to access.
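The ideas above can be illustrated with a small CPU-side sketch in Python. This is not the shader from the commit: it assumes the sort is a bitonic argsort, and the names `next_pow2` and `bitonic_argsort` are hypothetical. It pads the work size to the next power of two, copies the values into a local array (standing in for shared memory), always sorts ascending, and applies the ascending vs. descending option only in the final bounds-checked stores.

```python
import math


def next_pow2(n):
    """Smallest power of two >= n, for n >= 1."""
    return 1 << (n - 1).bit_length()


def bitonic_argsort(values, descending=False):
    """Illustrative bitonic argsort (assumed algorithm, not the actual shader)."""
    n = len(values)
    if n == 0:
        return []

    # One "invocation" per slot, rounded up to a power of two.
    size = next_pow2(n)

    # Stand-ins for shared memory: a copy of the values (padded with +inf so
    # the extra slots sort last) and the index array that gets permuted.
    val = list(values) + [math.inf] * (size - n)
    idx = list(range(size))

    k = 2
    while k <= size:
        j = k // 2
        while j > 0:
            # On a GPU all i run in parallel, with a barrier after each j step.
            for i in range(size):
                partner = i ^ j
                if partner > i:
                    ascending_block = (i & k) == 0
                    a, b = idx[i], idx[partner]
                    # Ties are broken by index, so the keys are all distinct.
                    out_of_order = (val[a], a) > (val[b], b)
                    if out_of_order == ascending_block:
                        idx[i], idx[partner] = b, a
            j //= 2
        k *= 2

    # Final stores: the data is always sorted ascending; the requested order
    # only changes where each result is written. The i < n check plays the
    # role of the bounds check that skips padded slots.
    out = [0] * n
    for i in range(size):
        if i < n:
            dst = n - 1 - i if descending else i
            out[dst] = idx[i]
    return out


if __name__ == "__main__":
    data = [3.0, 1.0, 2.0, 5.0, 4.0]
    print(bitonic_argsort(data))
    print(bitonic_argsort(data, descending=True))
```

Because the sorting network itself never looks at the requested order, only one variant of the main loop is needed; in this sketch the descending result is simply the ascending result written in reverse, which also reverses the relative order of equal values.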
		
			
				
	
	
	
		