* oai moe
* compat with new checkpoint
* add attn sink impl
* add rope scaling yarn
* logits match with latest transformers code
* wip chat template
* rm trailing space
* use ggml_scale_bias
* rm redundant is_swa_all
* convert interleaved gate_up
* graph : fix activation function to match reference (#7)
* vocab : handle o200k_harmony special tokens
* ggml : add attention sinks support (#1)
* llama : add attn sinks
* ggml : add attn sinks
* cuda : add attn sinks
* vulkan : add support for sinks in softmax
  remove unnecessary return
* ggml : add fused swiglu_oai op (#11)
* ggml : add fused swiglu_oai op
* Update ggml/src/ggml-cpu/ops.cpp
  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* update CUDA impl
* cont : metal impl
* add vulkan impl
* test-backend-ops : more test cases, clean up
* llama : remove unfused impl
* remove extra lines
  ---------
  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
  ---------
  Co-authored-by: slaren <slarengh@gmail.com>
* repack mxfp4 upon conversion
* clean up a bit
* enable thinking
* add quick hack to render only some special tokens
* fix bf16 conversion
* remove vocab hack
* webui ok
* support chat parsing for gpt-oss
* fix webui
* direct mapping mxfp4, FINALLY
* force using mxfp4
* properly use lazy tensor
* ggml : add mxfp4
  ggml : use e8m0 conversion instead of powf
  Co-authored-by: Diego Devesa <slarengh@gmail.com>
  change kvalues_mxfp4 table to match e2m1 (#6)
  metal : remove quantization for now (not used)
  cuda : fix disabled CUDA graphs due to ffn moe bias
  vulkan : add support for mxfp4
  cont : add cm2 dequant
* ggml : add ggml_add_id (#13)
* ggml : add ggml_add_id
* add cuda impl
* llama : add weight support check for add_id
* perf opt
* add vulkan impl
* rename cuda files
* add metal impl
* allow in-place ggml_add_id
* llama : keep biases on CPU with --cpu-moe
* llama : fix compile error
  ggml-ci
* cuda : add fallback for __nv_cvt_e8m0_to_bf16raw
  ggml-ci
* cleanup
  ggml-ci
* sycl : fix supports_op for MXFP4
  ggml-ci
* fix Unknown reasoning format
* ggml-cpu : fix AVX build
  ggml-ci
* fix hip build
  ggml-ci
* cuda : add mxfp4 dequantization support for cuBLAS
  ggml-ci
* ggml-cpu : fix mxfp4 fallback definitions for some architectures
  ggml-ci
* cuda : fix version required for __nv_cvt_e8m0_to_bf16raw
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: slaren <slarengh@gmail.com>
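The "use e8m0 conversion instead of powf" item above refers to decoding the MXFP4 block scale. As a rough illustration of the idea only, assuming the usual bias-127 E8M0 interpretation (this is not the repository's actual code, and the function name is made up):

#include <stdint.h>
#include <string.h>

// Illustrative sketch: an E8M0 scale byte e encodes 2^(e - 127). For e >= 1
// that is exactly a normal IEEE-754 float whose exponent field is e and whose
// mantissa is zero, so the value can be assembled bit-wise instead of calling
// powf(2.0f, (float) e - 127.0f). The NaN encoding e == 0xFF is ignored here.
static inline float e8m0_to_fp32(uint8_t e) {
    // e == 0 encodes 2^-127, which is subnormal in fp32.
    const uint32_t bits = (e == 0) ? 0x00400000u : ((uint32_t) e << 23);
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

Building the bit pattern directly is exact, since an E8M0 scale is by construction a power of two, and it avoids one powf call per quantized block.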
#version 450

#extension GL_EXT_control_flow_attributes : require

#include "types.comp"

// Push constants: ne0/ne1 are the row length and per-plane row count of the
// destination, s01/s02 are the element strides of src0 rows/planes, s11 is the
// element stride of src1 rows, and s21 is the element stride between rows of
// the id tensor.
layout (push_constant) uniform parameter
{
    uint ne0;
    uint ne1;
    uint s01;
    uint s02;
    uint s11;
    uint s21;
} p;

#define BLOCK_SIZE 512

layout(local_size_x = BLOCK_SIZE, local_size_y = 1, local_size_z = 1) in;

layout (binding = 0) readonly  buffer X {A_TYPE  data_a[];};
layout (binding = 1) readonly  buffer Y {B_TYPE  data_b[];};
layout (binding = 2) readonly  buffer Z {int32_t data_c[];};
layout (binding = 3) writeonly buffer D {D_TYPE  data_d[];};

void main() {
    // One workgroup per destination row: i1 indexes the row within a plane,
    // i2 indexes the plane.
    const uint i1 = gl_WorkGroupID.x;
    const uint i2 = gl_WorkGroupID.y;

    // Row of src1 selected by the id tensor for this (i1, i2).
    const uint i11 = data_c[i1 + i2 * p.s21];

    // The destination is contiguous, so its strides follow from ne0 and ne1.
    const uint s1 = p.ne0;
    const uint s2 = p.ne0 * p.ne1;

    // Element offsets of the destination row, the src0 row and the selected src1 row.
    const uint d0 = i1 * s1    + i2 * s2;
    const uint a0 = i1 * p.s01 + i2 * p.s02;
    const uint b0 = i11 * p.s11;

    // The invocations of the workgroup stride over the row in steps of BLOCK_SIZE.
    for (uint i0 = gl_LocalInvocationID.x; i0 < p.ne0; i0 += BLOCK_SIZE) {
        data_d[d0 + i0] = data_a[a0 + i0] + data_b[b0 + i0];
    }
}
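For readers coming from the CPU side, the following is a minimal reference sketch in C (not part of the repository; the name and signature are illustrative) of the computation this shader performs, assuming contiguous float buffers and element strides matching the push constants above:

#include <stdint.h>

// Illustrative CPU reference: for every row (i1, i2) of src0, add the src1 row
// selected by the int32 id tensor, i.e. dst[i2][i1][:] = a[i2][i1][:] + b[ids[i2][i1]][:].
// Strides are in elements, mirroring p.s01, p.s02, p.s11 and p.s21 in the shader.
static void add_id_reference(float *dst, const float *a, const float *b, const int32_t *ids,
                             uint32_t ne0, uint32_t ne1, uint32_t ne2,
                             uint32_t s01, uint32_t s02, uint32_t s11, uint32_t s21) {
    for (uint32_t i2 = 0; i2 < ne2; ++i2) {
        for (uint32_t i1 = 0; i1 < ne1; ++i1) {
            const uint32_t i11 = (uint32_t) ids[i1 + i2 * s21]; // src1 row chosen for this (i1, i2)
            const float *pa = a   + i1 * s01 + i2 * s02;
            const float *pb = b   + i11 * s11;
            float       *pd = dst + i1 * ne0 + i2 * ne0 * ne1;
            for (uint32_t i0 = 0; i0 < ne0; ++i0) {
                pd[i0] = pa[i0] + pb[i0];
            }
        }
    }
}

The shader maps the two outer loops onto the dispatch grid (one workgroup per (i1, i2) pair via gl_WorkGroupID.x/.y) and the inner loop onto the 512 invocations of each workgroup, which stride over the ne0 elements of the row.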