leejet
21e933806f
cuda: get_rows: dfloat2 -> float2
2025-08-31 12:10:01 +08:00
leejet
8f5e7b0ce6
remove unused variables
2025-08-31 12:02:23 +08:00
leejet
b4c50bec23
fix test_im2col_3d
2025-08-31 12:01:23 +08:00
leejet
e66bf6e503
cpu: im2col_3d support non-contiguous src
...
Co-authored-by: Jeff Bolz <jbolz@nvidia.com >
2025-08-31 11:58:32 +08:00
leejet
3f901e316b
test-backend-ops.cpp: remove trailing whitespace
2025-08-31 00:55:34 +08:00
leejet
aafa79ae03
add test_im2col_3d to test-backend-ops
2025-08-31 00:51:05 +08:00
leejet
0d5eb51252
cuda: use simpler loop in get_rows
2025-08-31 00:21:24 +08:00
leejet
131ae2d585
adjust the code style
2025-08-31 00:04:27 +08:00
leejet
c9b9fabe08
fix cpu im2col_3d
2025-08-30 11:25:07 +08:00
leejet
f6278c832f
cuda: remove unnecessary MIN define
2025-08-30 04:14:19 +08:00
leejet
f6a874c04a
avoid build failure on macOS
2025-08-30 03:53:03 +08:00
leejet
d11a729898
avoid build failure
2025-08-30 03:48:47 +08:00
leejet
9d035c4c4a
correct GGML_OP_COUNT assertion
2025-08-30 03:36:59 +08:00
leejet
df05913bc4
avoid ggml_conv_3d conflict
2025-08-30 03:28:07 +08:00
leejet
d30e07dbb3
fix cuda get_rows
2025-08-30 03:13:57 +08:00
leejet
d8377a0a37
gguf: support loading tensors whose n_dims > GGML_MAX_DIMS
2025-08-30 03:11:09 +08:00
leejet
dd745ba31f
make im2col_3d faster
2025-08-30 03:11:09 +08:00
leejet
ae47caca70
fix cuda pad/scale/im2col3d
2025-08-30 03:11:08 +08:00
leejet
85c8e1e519
cuda: make im2col a little faster
2025-08-30 03:11:08 +08:00
leejet
f7a12f9e69
cuda/cpu: add im2col_3d support
2025-08-30 03:11:08 +08:00
leejet
93c7e775b8
add ggml_pad_ext for cpu & cuda backend
2025-08-30 02:56:56 +08:00
leejet
c92f9b4a68
add conv3d support
2025-08-30 02:56:56 +08:00
ExtReMLapin
792b44f2ed
server : add documentation for parallel_tool_calls param ( #15647 )
...
Co-authored-by: Pierre F <no@p.e>
2025-08-29 20:25:40 +03:00
Aman Gupta
81017865ee
CUDA: fix bug in rms_norm fusion ( #15660 )
...
* CUDA: fix bug in rms_norm fusion
* Fix bug for OP_REPEAT
* Fix index for add
b6318
2025-08-29 21:30:06 +08:00
Piotr Wilkin (ilintar)
60e5eee31f
chat : Seed OSS thinking + tool call support ( #15552 )
...
* Reasoning and tool-calling support for Seed OSS
* Fix grammar and partial parsing
* Whitespace
* New chat template
* Update common/chat.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com >
* Update common/chat.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com >
* Remove unused 'purge_healing_marker' helper
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com >
b6317
2025-08-29 14:53:41 +02:00
Aman Gupta
009b709d6e
CUDA: fuse adds, fuse add with rms norm ( #15631 )
...
* CUDA: fused add with rms_norm_mul
* Non-broadcast fuse works
* Add fused adds
* format
* Remove n_fuse from template params
* Address review comments
* Move template inside binbcast
b6316
2025-08-29 11:35:58 +08:00
Gabe Goodhart
e8d99dd0b6
nvidia nemotron nano v2 (nemotronh) ( #15507 )
...
* feat: Add NEMOTRONH to python arch enum
https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com >
* feat: Add NEMOTRONH to c++ arch enum
https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com >
* feat: Add NEMOTRONH to llama-arch layer map
https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com >
* feat: First pass at conversion for nemotronh
https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com >
* feat: Add a verbose log for each tensor loaded
This is really helpful for diagnosing mismatches between the expected and
received tensors
https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com >
* feat: First (broken) pass at nemotronh model architecture
It generates tokens, just not valid ones!
https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com >
* fix: Explicitly enable add_bos_token during conversion
The `tokenizer.json`/`tokenizer_config.json` in the model are a bit
contradictory. In the config, add_bos_token is set to False, but the
tokenizer model itself has a post_processor that adds the BOS token via
type: TemplateProcessing
https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com >
* fix: Use relu2 (LLM_FFN_RELU_SQR) for activation in FFN layers
https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com >
* fix: Only allocate attention cache for attention layers (not non-recurrent)
https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com >
* fix: Move residual add to after every block
https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com >
* fix: Use the correct norm tensor for the MLP blocks
https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com >
* Nemotron-H: MLP gate cleanup (pass NULL for unused gate)
This model does not use a gate in MLP blocks; pass NULLs for gate tensors to make intent clear and avoid unused-pointer noise.
* SSM: respect ssm_dt_rank for dt_dim when provided
Use the GGUF-provided time_step_rank (ssm_dt_rank) to set dt_dim when > 0; fall back to max(64, n_embd/16) otherwise (see the sketch after this entry).
* fix: plamo2 - revert dt_dim to default (remove ssm_dt_rank usage)
* Rename nemotronh to nemotron_h for consistency
- Update architecture name from NEMOTRONH to NEMOTRON_H in constants.py
- Change architecture string from 'nemotronh' to 'nemotron_h' in all files
- Update enum LLM_ARCH_NEMOTRONH to LLM_ARCH_NEMOTRON_H
- Update class name llm_build_nemotronh to llm_build_nemotron_h
- Consistent naming with underscore convention (nemotron_h vs nemotronh)
* feat: Support conversion for older NemotronH models
https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com >
---------
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com >
Co-authored-by: Maicon Domingues <dominguesm@outlook.com >
Co-authored-by: weatherman <fxdstudios@gmail.com >
b6315
2025-08-28 18:39:31 -06:00
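The dt_dim fallback mentioned in the SSM bullet above amounts to a one-line selection rule. A minimal sketch, assuming hypothetical field names (ssm_dt_rank, n_embd) as stand-ins; the actual llama.cpp hparams code differs:
```cpp
// Minimal sketch of the dt_dim selection rule, assuming hypothetical
// field names (ssm_dt_rank, n_embd); not the actual llama.cpp code.
#include <algorithm>
#include <cstdint>
#include <cstdio>

struct hparams_t {
    int32_t ssm_dt_rank; // time_step_rank from GGUF metadata, 0 if absent
    int32_t n_embd;      // model embedding width
};

static int32_t pick_dt_dim(const hparams_t & hp) {
    // prefer the GGUF-provided rank when set (> 0),
    // otherwise fall back to max(64, n_embd / 16)
    return hp.ssm_dt_rank > 0 ? hp.ssm_dt_rank
                              : std::max<int32_t>(64, hp.n_embd / 16);
}

int main() {
    const hparams_t hp = { 0, 4096 };
    std::printf("dt_dim = %d\n", pick_dt_dim(hp)); // prints 256
}
```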
Gabe Goodhart
a8bca68f72
fix: Compute the full sum in llama-eval-callback, not just the sum of printed values ( #15637 )
...
This makes it much easier to compare between llama.cpp and transformers! (See the sketch after this entry.)
https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com >
b6314
2025-08-28 15:27:36 -05:00
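The fix above changes the reported statistic from a partial sum over the printed slice to a sum over every element. A small illustrative sketch of the idea, using a plain float buffer as a stand-in for the tensor data (not the actual llama-eval-callback code):
```cpp
// Illustrative only: accumulate the sum over every element of a tensor's
// data, not just the slice that gets printed. Plain float buffer used as
// a stand-in; this is not the actual llama-eval-callback code.
#include <cstddef>
#include <cstdio>
#include <vector>

static double full_sum(const float * data, size_t n_elements) {
    double sum = 0.0;
    for (size_t i = 0; i < n_elements; ++i) {
        sum += data[i]; // accumulate over the whole tensor, printed or not
    }
    return sum;
}

int main() {
    std::vector<float> t(1024, 0.5f); // stand-in for a tensor's contents
    std::printf("sum = %.1f\n", full_sum(t.data(), t.size())); // sum = 512.0
}
```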
mnehete32
c97dc09391
CUDA: add conv2d ( #15635 )
...
* CUDA: add conv2d
* CUDA: conv2d - correct formatting and added const
b6313
2025-08-28 20:33:03 +02:00
Aaron Teo
6c442f42ff
ggml-cpu: fix invalid hsum build in debug s390x ( #15634 )
...
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com >
b6312
2025-08-28 22:39:27 +08:00
compilade
73804145ab
ggml : fix SSM_SCAN for n_groups > 1 ( #15625 )
b6311
2025-08-28 10:11:36 -04:00
Georgi Gerganov
c8d0d14e77
kv-cache : fix find_slot to not search for continuous slot ( #15638 )
...
ggml-ci
b6310
2025-08-28 17:09:05 +03:00
Sigbjørn Skjæret
84ab83cc0b
model : jina-embeddings-v3 support ( #13693 )
...
* initial jina-embeddings-v3 support
* initial jina-embeddings-v3 support
* initial jina-embeddings-v3 support
* fix vocab parsing with only tokenizer.json
* set mask token lstrip attribute
* additional unk_token_id fallback just in case [no ci]
* revert vocab_size() change [no ci]
* merge tensor loading into general bert
* rope
* add lora embedding and loading (non-functional)
* export separate lora ggufs instead
* add adapter metadata api
* use std::string
* convert_hf_to_lora compatibility
* fix assert
* apply suggestions from review
* apply suggestion from review
b6309
2025-08-28 15:49:50 +02:00
Aman Gupta
55042b3692
scripts: add sqlite3 check for compare-commits.sh ( #15633 )
2025-08-28 19:23:22 +08:00
Georgi Gerganov
8a4280ce43
kv-cache : remove LLAMA_SET_ROWS checks ( #15505 )
...
ggml-ci
b6307
2025-08-28 12:27:02 +03:00
Aleksei Nikiforov
64387f6e95
gguf-py: byteswapping improvements ( #12851 )
...
* gguf-py: implement byteswapping for Q4_0
This is needed to byteswap the Mistral model.
Also restore the original shapes after byteswapping tensors.
This is not needed at the moment, but do it in case
they are used in the future (a generic byte-swap sketch follows this entry).
* Rework byteswapping code in gguf-py
Move the low-level details out of the tensor-block byteswapping code
2025-08-28 16:56:41 +08:00
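Byteswapping a quantized block boils down to swapping the bytes of its multi-byte fields while leaving single-byte quant data alone. A generic sketch of the core 16-bit swap, shown for illustration only and not taken from the gguf-py implementation:
```cpp
// Generic 16-bit byte swap, for illustration only (not the gguf-py code):
// quantized blocks such as Q4_0 carry a half-precision scale whose bytes
// need swapping on big-endian targets, while the packed 4-bit quant bytes
// are single bytes and stay as-is.
#include <cstdint>
#include <cstdio>

static uint16_t bswap16(uint16_t v) {
    return static_cast<uint16_t>((v << 8) | (v >> 8));
}

int main() {
    std::printf("0x%04x\n", bswap16(0x3c00)); // prints 0x003c
}
```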
Joshua Cogliati
d35a1e8c41
cli : change log to warning to explain reason for stopping ( #15604 )
...
* Change to warn instead of debug, to explain reason for stopping.
* Update tools/main/main.cpp
Fix printing --2
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com >
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com >
b6305
2025-08-28 10:48:20 +03:00
Daniel Bevenius
46d9caa27a
model-conversion : add mmproj conversion target ( #15628 )
...
This commit adds a new target to the Makefile for converting models that
are multimodal. This target converts the original model and also creates
the mmproj GGUF model.
The motivation for this change is that for models that are multimodal,
for example those that contain a vision encoder, we will often want to
upload both the quantized model and the vision encoder model to
HuggingFace.
Example usage:
```console
$ make causal-convert-mm-model MODEL_PATH=~/work/ai/models/gemma-3-4b-it-qat-q4_0-unquantized/
...
The environment variable CONVERTED_MODEL can be set to this path using:
export CONVERTED_MODEL=/home/danbev/work/ai/llama.cpp/models/gemma-3-4b-it-qat-q4_0-unquantized.gguf
The mmproj model was created in /home/danbev/work/ai/llama.cpp/models/mmproj-gemma-3-4b-it-qat-q4_0-unquantized.gguf
```
The converted original model can then be quantized, and after that both
the quantized model and the mmproj file can be uploaded to
HuggingFace.
Refs: https://huggingface.co/ggml-org/gemma-3-4b-it-qat-GGUF/tree/main
2025-08-28 09:26:48 +02:00
matiaslin
5a0e3ef6f0
cuda: Add cublasLt_static linking when GGML_STATIC is enabled ( #15622 )
...
Prior to this change, we faced undefined cublasLt references when
attempting to compile 'llama-cli' with GGML_STATIC=ON on Linux.
We add linking with CUDA::cublasLt_static when the CUDA version is greater
than 10.1.
b6303
2025-08-28 02:32:36 +02:00
Johannes Gäßler
fbef0fad7a
server: higher timeout for tests ( #15621 )
2025-08-27 20:58:09 +02:00
Georgi Gerganov
da54f9f1a2
presets : add qwen3-30B-a3b FIM ( #15616 )
b6301
2025-08-27 15:48:07 +03:00
uvos
47373271f9
HIP: Enable support for ggml_backend_cuda_register_host_buffer ( #15615 )
b6300
2025-08-27 13:58:54 +02:00
Georgi Gerganov
1bded5a3b3
kv-cache : better estimate of n_kv for multi-sequence batches ( #15610 )
...
ggml-ci
b6299
2025-08-27 13:55:12 +03:00
Chenguang Li
1e7489745a
CANN: refactor mask handling and improve performance in FA ( #15561 )
...
* CANN(flash-attn): refactor mask handling and improve performance
1. Refactored the mask computation in Flash Attention and unified the logic so prefill and decode are no longer handled separately.
2. Optimized performance in non-alibi scenarios by eliminating one repeat operation.
3. Updated operator management to explicitly mark unsupported cases on 310P devices and when dim is not divisible by 16.
Signed-off-by: noemotiovon <757486878@qq.com >
* [CANN]: fix review
Signed-off-by: noemotiovon <757486878@qq.com >
* [CANN]: Optimize FA BNSD to BSND
Signed-off-by: noemotiovon <757486878@qq.com >
---------
Signed-off-by: noemotiovon <757486878@qq.com >
b6298
2025-08-27 17:21:41 +08:00
xctan
1cf123a343
ggml-cpu : add basic RVV support for vector f32 ops ( #15057 )
...
* ggml-cpu : add basic RVV support for vector f32 ops
* ggml-cpu : add RVV support for f32 softmax
b6297
2025-08-27 16:44:22 +08:00
Daniel Bevenius
fcca2182a1
common : add -m to bash completion for --model [no ci] ( #15591 )
...
This commit updates the bash completion script to include the -m
short option for the --model argument.
The motivation for this is that currently tab completion only works for the
full --model option, and it is nice to have it work for the short option
as well.
2025-08-27 10:28:53 +02:00
rmatif
86076f92de
OpenCL: add fused group_norm/norm, mul, add ( #15314 )
...
* add fused group_norm/norm, mul, add
* fix spacing
* revert rms_norm logic
* fix trailing whitespace
b6295
2025-08-26 23:36:05 -07:00
Diego Devesa
bcbddcd54f
tests : fix test-opt with GGML_BACKEND_DL ( #15599 )
b6294
2025-08-26 22:14:38 +02:00
Akarshan Biswas
8b69686136
SYCL: fix rms_norm_mul_add for tensor dim not a multiple of sg_size ( #15592 )
...
The original implementation unconditionally returned true for this operation, leading to a failure when the tensor's first dimension (ne[0]) was not a multiple of WARP_SIZE. This caused a GGML_ASSERT(ncols % WARP_SIZE == 0) failure in ggml-sycl/norm.cpp.
This change updates the ggml_backend_sycl_device_supports_op check to return true for GGML_OP_RMS_NORM only when the first dimension of the tensor is a multiple of WARP_SIZE, ensuring the operation can be performed without error (see the sketch after this entry).
b6293
2025-08-27 00:27:49 +05:30
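The check described above is essentially a divisibility test on ne[0]. A hedged sketch under simplified assumptions; WARP_SIZE and the tensor struct here are stand-ins, not the ggml-sycl definitions:
```cpp
// Hedged sketch of the support check: report RMS_NORM as supported only
// when ne[0] is a multiple of the warp/subgroup size. WARP_SIZE and the
// tensor struct are simplified stand-ins, not the ggml-sycl definitions.
#include <cstdint>
#include <cstdio>

constexpr int64_t WARP_SIZE = 32;

struct tensor_t {
    int64_t ne[4]; // dimensions; ne[0] is the innermost (row) size
};

static bool supports_rms_norm(const tensor_t & t) {
    return t.ne[0] % WARP_SIZE == 0; // avoids the ncols % WARP_SIZE assert
}

int main() {
    const tensor_t ok  = { {128, 8, 1, 1} };
    const tensor_t bad = { {100, 8, 1, 1} };
    std::printf("%d %d\n", supports_rms_norm(ok), supports_rms_norm(bad)); // 1 0
}
```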
fidoriel
8ce3ff1d91
mtmd : fix mtmd ios build ( #15579 )
b6292
2025-08-26 20:05:50 +02:00