* implement set_rows with i32 index
* template fix
* test quantized path
warnings--
* Apply suggestions from code review
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* forgotten name change
* deduplicate cuda/sycl and test-fix
* indent++
* vulkan: support set_rows with i32 index type (#16162)
* disable i32 index for webgpu for now
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
* Vulkan: add conv_transpose_2d operation
* Vulkan: fix typo in conv_transpose_2d shader(s0mp, s0L, s1mp, s1L)
* Vulkan: fix incorrect indentation in conv_transpose_2d shader
* Vulkan: add checking the push constants size limit and reuse conv2d_mm.comp for conv_transpose_2d operation
* Vulkan: revert the order of the index calculation and bound check in conv_2d shader
* Vulkan: explicity check push constants limit in supports_op() for conv_transpose_2d operation.
* Vulkan: remove unnecessary lower bound checks for H/W_idx in the conv_2d shader.
* CUDA: Optimize PAD_REFLECT_1D
feat: add more test cases for PAD_REFLECT_1D
* use fast_div to improve performance
* Apply suggestion from JohannesGaessler
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Apply suggestion from JohannesGaessler
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* optimize
* use a concise expression to further speedup the cuda kernel
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Add paramater buffer pool, batching of submissions, refactor command building/submission
* Add header for linux builds
* Free staged parameter buffers at once
* Format with clang-format
* Fix thread-safe implementation
* Use device implicit synchronization
* Update workflow to use custom release
* Remove testing branch workflow
* some f32 tests passing
* Disable set_rows until it's implemented
* f32 add all tests passing
* Begin work on set_rows
* Work on set rows
* Add error buffers for reporting unsupported SET_ROWS indices
* Remove extra comments
* Add templated addition, clean up code
* Get addition and multiplication working
* Implement rms_norm
* Add get_rows implementation
* Add new get_rows files
* Refactor use of wg size entry
* Fix compilation
* Try manually unrolled q4_0 quant
* Revert "Try manually unrolled q4_0 quant"
This reverts commit 77f8b96515.
* Move to constant max wg size
* Check for tensor size in supports_op
* Vectorize f32 and change default workgroup size
* Move f32 get_rows from < 4 to % 4 != 0
* fix linter errors
* Add in-place tests
---------
Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>
* metal : improve naming
* metal : refactor device
ggml-ci
* cont : props
ggml-ci
* metal : apply ggml_mem_ranges_t
ggml-ci
* metal : remove GGML_METAL_USE_BF16
ggml-ci
* metal : refactor device buffer
ggml-ci
* cont : fix naming
* metal : sync before destroying the backend
ggml-ci
* metal : refactor context
ggml-ci
* metal : migrate ggml-metal.m to ggml-metal.cpp
ggml-ci
* metal : adjust ops API
ggml-ci
* metal : use C++ to store piplienes
ggml-ci
* metal : migrate ops to separate functions
ggml-ci
* metal : add ggml_metal_library_t
ggml-ci
* metal : improve naming
ggml-ci
* metal : cleanp
ggml-ci
* metal : add support for GGML_OP_LOG
ggml-ci
* metal : fix error handling
ggml-ci
* Add fastdiv and fastmodulo to k_bin_bcast kernel
* Address review comments
* `prod_` instead of `prod` suffix
* Add test case for `k_bin_bcast_unravel` in CUDA backend
* CUDA: Add mul_mat_id support the mmf
Add support for mul_mat_id for bs < 16
* Review: use warp_size, fix should_use_mmf condition
* Launch one block per expert, stride along n_expert_used
* templatize mul_mat_id
* Pad shmem to 16 bytes, add helper function mul_mat_f_switch_ids
* Reduce compile times by dividing mmf into f16, bf16 and f32 variants
* Divide mmf by ncols_dst
* Add missing files
* Fix MUSA/HIP builds
* requirements : update transformers/torch for Embedding Gemma
This commit updates the requirements to support converting
Embedding Gemma 300m models.
The motivation for this change is that during development I had a local
copy of the transformers package which is what I used for converting
the models. This was a mistake on my part and I should have also updated
my transformers version to the official release.
I had checked the requirements/requirements-convert_legacy_llama.txt
file and noted that the version was >=4.45.1,<5.0.0 and came to the
conculusion that no updated would be needed, this assumed that
Embedding Gemma would be in a transformers release at the time
Commit fb15d649ed ("llama : add support
for EmbeddingGemma 300m (#15798)) was merged. So anyone wanting to
convert themselves would be able to do so. However, Embedding Gemma is
a preview release and this commit updates the requirements to use this
preview release.
* resolve additional python dependencies
* fix pyright errors in tokenizer test and remove unused import
* Add DeepSeek V3.1 thinking mode support
- Added COMMON_CHAT_FORMAT_DEEPSEEK_V3_1 enum value
- Created common_chat_params_init_deepseek_v3_1() function (currently uses R1 implementation)
- Created common_chat_parse_deepseek_v3_1() function that handles V3.1 thinking format:
- Extracts reasoning content before '</think>' tag into reasoning_content
- Extracts regular content after '</think>' tag into content
- No opening '<think>' tag in V3.1 format
- Added detection logic for V3.1 templates based on pattern: 'message['prefix'] is defined and message['prefix'] and thinking'
- Added V3.1 case to parsing switch statement
This addresses the issue where V3.1 outputs reasoning content followed by '</think>' and then regular content without the opening '<think>' tag.
* Another attempt by V3.1 non-thinking
* Fix test, but it's not asserting anything.
* Ignore vim swap files in tests dir
* Update the test
* Try using try_find_literal instead of regex
* passing test
* Revert "Try using try_find_literal instead of regex"
This reverts commit c50d887ec2.
* Remove unnecessary change
* Remove comment
* Add code to handle non-thinking mode.
* Try to set message['prefix'] when thinking is enabled.
* This fixes reasoning, but breaks normal content. We need state in the
chat parser.
* DeepSeek V3.1 thinking is now the default. Disable with `--reasoning-budget 0`.
* Simplify (DeepSeek V3.1 reasoning)
* Fix sign inversion bug
* Add some tool calling code (not working).
* Tool calls working in non-reasoning mode.
* Attempt a unit test for tool call parsing.
* Passing test
* Add tests for both happy path and broken fenced DeepSeek V3.1 tool call variants.
* Passing DeepSeek V3.1 tool call tests, but model is not working.
* Revert assistance response prefill change. Not my monkeys.
* Add fenced_thinking unit test variant. Passes, but thinking tool calling
still isn't working for some reason.
* Tests pass in reasoning mode. Also e2e tool test passes.
* Make a copy of the parse_json_tool_calls function for deepseek-v3.1 so
as to not accidentally introduce regressions.
* Fix thinking_forced_open logic. tool calling broken. Need to add another
test case.
* That's what I get for cargo culting a newline.
* Add multi tool call test for deepseek v3.1 non-reasoning
* Move test, remove .gitignore change
* Place deepseek-v3.1 reasoning test directly into existing reasoning
function per CISC's request.
* Address whitespace CI failure.
* Merge two assert_equals per CISC's request.
* Add DeepSeek-V3.1 tests to tests/test-chat.cpp per CISC's request.
* Merge deepseek V3.1 and regular parse_json_tool_calls() function
behaviors by adding optional update_cursor argument.
* Update tests/test-chat-parser.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update tests/test-chat-parser.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update tests/test-chat-parser.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update tests/test-chat-parser.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update tests/test-chat-parser.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update tests/test-chat-parser.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update tests/test-chat-parser.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update tests/test-chat-parser.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update tests/test-chat-parser.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* DeepSeek V3.1 fix reasoning_format none
* Strip grammar down to strictly what we expect based on model card. Throw
out parts we cargo culted from R1 that don't make sense.
* Update tests/test-chat-parser.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* DeepSeek V3.1 - Add edge case where thinking is forced open, there is
tool calling in the reasoning content, but then the model just stops the
output without closing the </think> tag, so it's not a partial. In this
case, use the tool call in the reasoning content.
* DeepSeek V3.1 - simplify update_cursor
* Update common/chat.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update common/chat.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update common/chat.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Fix indent
---------
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* ggml: allow casting between f32 and i32
* fix cuda
* add vulkan
* fix CPU non-cont
* add non-cont test case
* add note
* extend test number range
* correct note
* add cont version for vulkan
This commit adds two new command-line options to the
test-backend-ops.cpp that allow users to list all available GGML
operations and to show test coverage of these operations.
The motivation for this is that it can be useful to quickly see which
operations are currently covered by tests and which are not. Also it
migth be useful when using the `support` mode.
* sampling : optimize sorting using bucket sort in more places
ggml-ci
* sampling : do not sort in dist sampler
ggml-ci
* sampling : avoid heap allocations for sort buffers
ggml-ci
* common : add option to sort sampling candidates by probability
ggml-ci
* sampling : revert the change for preserving sort buffers
* sampling : use std::copy instead of memcpy
* sampling : clarify purpose of partial sort helpers
ggml-ci
* cont : remove wrong comment [no ci]
* common : update comment
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* metal : mul_mm_id remove hdst
* metal : remove mul_mm_id hsrc1
* metal : mul_mm_id simplify + add test
* metal : opt mul_mm_id map0
* metal : optimize mul_mm_id id gathering
* metal : mul/div opt
* metal : optimize mul_mm_id_map0
ggml-ci
The scalar FA shader already handled multiples of 8. The coopmat1 FA
shader assumed 16x16x16 and the shared memory allocations need the HSK
dimensions padded to a multiple of 16. NVIDIA's coopmat2 implementation
requires multiples of 16 for N and K, and needs the matrix dimensions
padded and loads clamped.
Store the FA pipelines in a map, indexed by the pipeline state.
* vulkan: optimize rms_norm, and allow the work to spread across multiple SMs
There are really two parts to this change:
(1) Some optimizations similar to what we have in soft_max, to unroll with
different numbers of iterations.
(2) A fusion optimization where we detect add followed by rms_norm, and make
the add shader atomically accumulate the values^2 into memory. Then the
rms_norm shader can just load that sum. This allows the rms_norm to be
parallelized across multiple workgroups, it just becomes a simple per-element
multiply.
The fusion optimization is currently only applied when the rms_norm is on a
single vector. This previously always ran on a single SM. It could apply more
broadly, but when there are other dimensions the work can already spread across
SMs, and there would be some complexity to tracking multiple atomic sums.
* Change add+rms_norm optimization to write out an array of partial sums
rather than using atomic add, to make it deterministic. The rms_norm
shader fetches a subgroup's worth in parallel and uses subgroupAdd to
add them up.
* complete rebase against fused adds - multi_add shader can also compute partial sums
* fix validation errors
* disable add_rms_fusion for Intel due to possible driver bug
* resolve against #15489, sync after clearing partial sums
* vulkan : support ggml_mean
* vulkan : support sum, sum_rows and mean with non-contiguous tensors
* vulkan : fix subbuffer size not accounting for misalign offset
* tests : add backend-op tests for non-contiguous sum_rows
* cuda : require contiguous src for SUM_ROWS, MEAN support
* sycl : require contiguous src for SUM, SUM_ROWS, ARGSORT support
* require ggml_contiguous_rows in supports_op and expect nb00=1 in the shader
* vulkan: Reuse conversion results in prealloc_y
Cache the pipeline and tensor that were most recently used to fill prealloc_y,
and skip the conversion if the current pipeline/tensor match.
* don't use shared pointer for prealloc_y_last_pipeline_used
- Launch an appropriate number of invocations (next larger power of two).
32 invocations is common and the barrier is much cheaper there.
- Specialize for "needs bounds checking" vs not.
- Make the code less branchy and [[unroll]] the loops. In the final code,
I see no branches inside the main loop (only predicated stores) when
needs_bounds_check is false.
- Always sort ascending, then apply the ascending vs descending option when
doing the final stores to memory.
- Copy the values into shared memory, makes them slightly cheaper to access.
* vulkan: fuse adds
Fuse adds that have the same shape, which are common in MoE models.
It will currently fuse up to 6 adds, because we assume no more than
8 descriptors per dispatch. But this could be changed.
* check runtimeDescriptorArray feature
* disable multi_add for Intel due to likely driver bug
* model : add harmony parser for gpt-oss
* gpt-oss : fix grammar trigger from causing empty stack
* gpt-oss: tweak the grammar trigger again
* gpt-oss : add support for recipient in role header
* gpt-oss : fix ungrouped tool calls in grammar
* gpt-oss : loosen function name matching during parse
* gpt-oss : clean up workarounds
* gpt-oss : add template tests
* gpt-oss : simulate thinking and tool call tags
* gpt-oss : undo think tags when reasoning_format is none
* gpt-oss : set special tokens back to user defined
* gpt-oss : update openai-gpt-oss template
* server : filter out harmony thought messages
* gpt-oss : simplify parsing
* examples/finetune -opt SGD (stochastic gradient descent) memory opt
add unit tested GGML_OPT_OPTIMIZER_SGD to ggml - avoids allocating
m, v tensors.
support finetune.cpp arg -opt SGD (or sgd). (default adamw as before)
llama 3.2-1b-F32 result: observed 11gb gpu ram (41 sec/epoch)
when using SGD instead of 19gb (55 sec/epoch) using adamw.
(wikipedia 100 lines finetune)
(
using the same GPU memory, adamw can only do before OOM 512
batch/context, reaching:
train: [███████▉] data=0000140/0000140 loss=0.02575±0.00099 acc=99.52±0.03% t=00:00:47 ETA=00:00:00
val: [███████▉] data=0000008/0000008 loss=4.76565±0.28810 acc=41.46±0.77% t=00:00:00 ETA=00:00:00
SGD is superior, though it converges slower, with max before OOM 1728
batch/context (esp see the better validation perf):
train: [███████▉] data=0000039/0000039 loss=0.00371±0.00010 acc=99.96±0.01% t=00:00:41 ETA=00:00:00
val: [███████▉] data=0000003/0000003 loss=5.11406±0.76034 acc=48.01±0.69% t=00:00:01 ETA=00:00:00
)
note: when finetuning long enough (or w/ enough -lr),
validation accuracy *eventually* drops ('catastrophic forgetting')
-lr-half (halflife) option useful for SGD to avoid oscillation or
super slow underdamped learning (makes setting -lr more forgiving).
terminal -lr for now is set by lr-halvings i.e. if you want at most
1/8 the inital -lr you set -lr-halvings 3.
note: objective loss not directly comparable between adamw, sgd? -
check perplexity or accuracy or consider relative improvements
for convergence
new finetune args -wd 1e-9 to enable weight decay in sgd or adamw,
and max -epochs N (default 2 as before)
cache (1 - wd*alpha) in 'adamw' opt struct -
no noticeable perf benefit, disabled (still done
for new SGD though)
since opt. memory is pre-allocated, the ggml_opt_get_optimizer_params
would probably be able to change between SGD and AdamW with each epoch
but would need to use adamw for the first (unconfirmed - no cmdline arg
to set such a policy yet)
test-opt checks adamw as before and now sgd (except for a few disabled
tests for sgd only; probably just needs logging values and adding
alternate reference values); tolerance on the 'regression'
test is broader for sgd (so we don't need many more epochs)
* Vulkan: Implement GGML_OP_OPT_STEP_SGD
* tests: Fix OPT_STEP_SGD test-backend-ops
* SGD op param store weight-decay and not 1-alpha*wd
* minor + cosmetic changes
* fix vulkan sgd
* try CI fix
---------
Co-authored-by: 0cc4m <picard12@live.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Factor out `reduce_rows_f32` from common.cuh
This increases iteration cycle speed by not having to recompile
every kernel all the time
* Hide memory-latency by loop unrolling in reduce_rows_f32
* Further optimizations to `reduce_rows_f32`
1. Increase threadblock size to better hide latency of memory requests.
As a consequence of bigger threadblocks, do 2-step summation, using
shared memory to communicate results between invocations
2. Use sum_temp array to reduce waits on sum
3. Adjust num_unroll to reflext bigger threadblock
4. Improve default block_dims, increase support for more block_dims
* Add perf tests for `reduce_rows_f32` kernel
* Add heuristic to toggle 128/512 threads based on sm count
Break even point was the minimum of the following multiples.
| GPU Model | Nrow SM Count Multiple |
| ----------- | ----------- |
| RTX 4000 SFF ADA | 2.0x |
| RTX 6000 ADA | 2.5x |
| RTX PRO 6000 Blackwell Max-Q | 3.04x |
| RTX PRO 4500 Blackwell | 3.15x |
* Ensure perf gains also for small ncols and large nrows
Alternative to this, one could have also made the number of unrollings
template-able, but that would require compiling the kernel multiple
times, increasing binary size unnecessarily
* Modify perf and unit-tests
* Apply auto-formatting by clang
* Fix CI build failure
See https://github.com/ggml-org/llama.cpp/actions/runs/16798370266/job/47573716079?pr=15132#step:7:486
Building with VS generator worked though.
* Remove sm_count property from `ggml_backend_cuda_context`
Requested by @JohannesGaessler, and should fix remaining CI issues as a
side-effect
* Add CUB-based implementation for GGML_OP_MEAN
Currently this branch is only executed for nrows==1
* Add heuristics to execute CUB branch only when it brings perf
Heuristics were determined on the following HW:
* RTX 4000 SFF ADA
* RTX 6000 ADA
* RTX PRO 6000 Blackwell Max-Q
* RTX PRO 4500 Blackwell
* Add unit-test for CUB-based mean
Tests should run with CUDA Graphs enabled per default on NVGPUs
* Rename `USE_CUB` to `GGML_CUDA_USE_CUB`
Suggested by @JohannesGaessler
* Unindent Preprocessor directives
See
https://github.com/ggml-org/llama.cpp/pull/15132#discussion_r2269213506