Francis Couture-Harpin
a60a24beed
Merge branch 'master' into compilade/refactor-kv-cache
2025-07-09 09:38:48 -04:00
Miaoqian Lin
26a48ad699
ggml : prevent integer overflow in gguf tensor size calculation ( #14595 )
b5854
2025-07-09 14:33:53 +02:00
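The overflow guard this commit describes follows the standard checked-multiplication pattern; a minimal sketch below (hypothetical helper, not the actual gguf code):

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Sketch: multiply tensor dimensions with an explicit overflow check
// instead of trusting the product to fit in 64 bits.
// Returns false and leaves *out untouched when the product would wrap.
static bool checked_mul(uint64_t a, uint64_t b, uint64_t * out) {
    if (a != 0 && b > std::numeric_limits<uint64_t>::max() / a) {
        return false; // a * b would overflow uint64_t
    }
    *out = a * b;
    return true;
}
```

A caller accumulates the tensor size by chaining `checked_mul` over each dimension and rejects the tensor as soon as any step reports overflow.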
Dowon
ffd59e7d18
model : add skt/A.X-4.0 model vocabulary ( #14589 )
b5853
2025-07-09 11:22:31 +03:00
Sigbjørn Skjæret
105554595f
llama : remove unintended whitespace ( #14592 )
b5852
2025-07-09 10:19:50 +02:00
ibrahim khadraoui
04655063c4
model : add support for Falcon-H1 family ( #14534 )
* v1
* push more fixes
* another fix
* fix
* more fixes
* minor fix
* more cleaning on python code
* python fixes
* changed precision for multipliers float 32->64
* fixes
* another fix
* fix
* pre-norm -> norm
* fix
* Revert "fix"
This reverts commit 243e4d1a50.
* fix
* small fix ffn_norm
* try
* mix instead of max
* fix vocab size
* conflict solve
* fixed multipliers
* falcon-h1 specific vocab resolved
* read arch from gguf.MODEL_ARCH
* mamba_d_ssm added to d_inner find_hparam
* remove unused functions from gguf_writer.py
* override modify_tensors instead of get_tensors
* fix conversion and d_inner
* added some cb functions for debugging purposes
* inp_out_ids moved outside of layers loop
* mup_vec create as float64
* fix rope_theta
* injected mup
* clean ups
* rm extra space
* rm unused MAMBA_CHUNK_SIZE
* rm unused key
* add bos False
* changed ROPE_TYPE
* cleaning debugging stuff
* cleaning debug quant
* fix comment
* some cleanups
* some cleanups
* Update src/llama-model-loader.cpp
* more cleanups
* moe cleanups
* d_ssm -> d_inner;
* cleaning unused hparams
* cleanup
* more cleanups
* more cleanups on python conversion;
* minor cleanups
* Apply suggestions from code review
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* remove todo
* added falcon-h1
* tensor not required
* clean
* remove unneeded attributes
* more cleanups and fixed conversion
* remove final_norm
* flake8 fixes
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* flake8 fixes
* Update src/llama-hparams.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-arch.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* added hashes
* Update src/llama-arch.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update src/llama-vocab.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* update the update file
* Revert "update the update file"
This reverts commit 082ab4ad2a.
* fix: address suggestions
* fix: update convert_hf_to_gguf.py
Update gguf-py/gguf/constants.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-model-loader.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* d_inner fixed
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* reshaping ssm_norm for 34B
* removing generate_mup
* remove duplicate metadata keys
* rm comment
* final comment
* fix unused args
* fix constants
* fix bad merge
* Update src/llama-model.cpp
Co-authored-by: compilade <git@compilade.net>
* falcon-h1: remove unused ssm_in_b and bad merge
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* falcon-h1: fix last comment
* Update convert_hf_to_gguf.py
Co-authored-by: compilade <git@compilade.net>
* falcon-h1: revert add_add_bos(False)
* falcon-h1: fix tied weights
* falcon-h1: remove whitespace
* falcon-h1: fix wrong size param
* falcon-h1: fix whitespace issues
---------
Co-authored-by: younesbelkada <younes.belkada@tii.ae>
Co-authored-by: Younes B <49240599+younesbelkada@users.noreply.github.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: compilade <git@compilade.net>
b5851
2025-07-09 10:03:49 +02:00
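One item above, "changed precision for multipliers float 32->64", reflects the usual single- vs double-precision trade-off; a small illustration (hypothetical, not Falcon-H1 code) of how float accumulates visibly more rounding error than double:

```cpp
#include <cassert>
#include <cmath>

// Illustration: repeatedly adding 0.1 drifts far more in float than in
// double, which is why precision-sensitive multipliers are kept in f64.
static float sum_f32(int n, float v) {
    float s = 0.0f;
    for (int i = 0; i < n; ++i) s += v; // each add rounds to a 24-bit mantissa
    return s;
}
static double sum_f64(int n, double v) {
    double s = 0.0;
    for (int i = 0; i < n; ++i) s += v; // 53-bit mantissa: far less drift
    return s;
}
```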
Xuan-Son Nguyen
20b7bf8a32
convert : fix smollm3 jinja template ( #14586 )
2025-07-09 09:26:13 +03:00
Francis Couture-Harpin
f7c7a926f0
model : use ggml_swiglu_split for Mamba
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-07-08 15:45:42 -04:00
Francis Couture-Harpin
2f39cd7bb7
model : remove unnecessary prefix for tensor loading constants
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-07-08 15:37:49 -04:00
Francis Couture-Harpin
db5ff0cc6b
jamba : remove redundant nullptr initializations
2025-07-08 15:15:49 -04:00
Francis Couture-Harpin
b0b280ea28
Merge branch 'master' into compilade/refactor-kv-cache
2025-07-08 15:09:02 -04:00
Jeff Bolz
6efcd65945
vulkan: optimize flash attention split_k_reduce ( #14554 )
* vulkan: allow FA split_k with smaller KV values
* vulkan: spread split_k_reduce work across more threads
k_num can get rather large. Use the whole workgroup to reduce the M/L values.
Launch a thread for each element in the HSV dimension of the output. Helps a
lot for large HSV (like deepseek).
b5849
2025-07-08 20:11:42 +02:00
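The reduction strategy described above (one thread per output element combining k_num partial results) can be sketched serially. This is an illustrative C++ model of the data layout, not the Vulkan shader, and it sums plain partials where the real FA reduce also rescales by the per-split max/log-sum (M/L) statistics:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// partials holds k_num chunks of n_elems values each; every output element
// combines its k_num partial results. In the shader, one thread handles one
// element of the HSV output dimension.
static std::vector<float> reduce_split_k(const std::vector<float> & partials,
                                         size_t n_elems, size_t k_num) {
    std::vector<float> out(n_elems, 0.0f);
    for (size_t e = 0; e < n_elems; ++e) {
        for (size_t k = 0; k < k_num; ++k) {
            out[e] += partials[k*n_elems + e];
        }
    }
    return out;
}
```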
stevenkuang
699f4392a3
model : fix hunyuan moe chat template ( #14584 )
Signed-off-by: stevenkuang <stevenkuang@tencent.com>
b5848
2025-07-08 18:29:29 +02:00
Xuan-Son Nguyen
08382869a2
model : add SmolLM3 ( #14581 )
* Init - first pass.
* Model -> ModelBase.
* fix errors in conversion.
* Update the graph.
* up.
* up.
* wip
* cgraph ok
* rm redundant code
---------
Co-authored-by: Vaibhavs10 <vaibhavs10@gmail.com>
b5847
2025-07-08 18:07:01 +02:00
compilade
bb4f7a9e4e
memory : fix broken batch splits for recurrent cache ( #14575 )
Splits producing more than one ubatch per batch for recurrent models
were broken with #14512.
This fixes it by moving the completeness check after the ubatch split loop.
b5846
2025-07-08 18:37:47 +03:00
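The fix described above, validating completeness after the split loop rather than inside it, is a common restructure; a hypothetical sketch, not the actual llama.cpp batch code:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct ubatch { size_t n_tokens; };

// Split first, then verify: a batch may legitimately produce several
// ubatches, so the completeness check belongs after the loop, where it
// only asserts that every token was consumed.
static std::vector<ubatch> split_batch(size_t n_tokens, size_t n_ubatch_max) {
    std::vector<ubatch> ubatches;
    size_t used = 0;
    while (used < n_tokens) {
        const size_t n = std::min(n_ubatch_max, n_tokens - used);
        ubatches.push_back({n});
        used += n;
    }
    assert(used == n_tokens); // completeness check, moved after the loop
    return ubatches;
}
```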
Jeff Bolz
b8eeb8741d
vulkan : fix rope with partial rotation and non-cont src ( #14582 )
b5845
2025-07-08 15:21:21 +02:00
Alawode Oluwandabira
17a1f0d2d4
server: Add ability to mount server at prefix ( #14544 )
* Add server_prefix
* Correct server path env
* Rename cli flag to --api-prefix
* Change all to api_prefix
b5844
2025-07-08 11:47:33 +03:00
Xuan-Son Nguyen
8f22dc0a53
model : add hunyuan moe ( #14425 )
* model : add hunyuan moe
* tokenizer ok
* fix tensor name
* cgraph init
* chat template
* wip
* almost working
* skip embed, fix bos
* cleanup
* yarn scaling
* cleanup
* correct rope type
* failed token fix
* ntk alpha freq_base
* tokenization working
* cleanup and pr changes
* vocab_size sanity check
* ntk alpha generic
* Update convert_hf_to_gguf.py
* Apply suggestions from code review
* fix regression
* fix style
---------
Co-authored-by: kooshi <1934337+kooshi@users.noreply.github.com>
b5843
2025-07-08 11:24:06 +03:00
Jeff Bolz
53903ae6fa
vulkan: increase timeout for CI ( #14574 )
2025-07-08 09:38:31 +02:00
Georgi Gerganov
4d0dcd4a06
cuda : fix rope with partial rotation and non-cont src ( #14580 )
* cuda : fix rope non-cont
ggml-ci
* cont : fix multi-rope + add test
ggml-ci
* sycl : try fix
ggml-ci
* cont : fix sycl + clean-up cuda
ggml-ci
b5841
2025-07-08 10:15:21 +03:00
Aman Gupta
75c91de6e9
CUDA: add bilinear interpolation for upscale ( #14563 )
b5840
2025-07-08 10:11:18 +08:00
R0CKSTAR
68155c66f0
musa: fix build warnings (unused variable) ( #14561 )
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
b5839
2025-07-08 07:58:30 +08:00
Sigbjørn Skjæret
e1a7059053
llama : fix incorrect minicpm3 v_states shape ( #14571 )
b5838
2025-07-07 23:35:35 +02:00
Sigbjørn Skjæret
12f55c302b
llama : remove ggml_cont where possible ( #14568 )
b5837
2025-07-07 21:35:08 +02:00
Francis Couture-Harpin
f71635824b
Merge branch 'master' into compilade/refactor-kv-cache
2025-07-07 14:57:56 -04:00
Aman Gupta
b9c3eefde1
CUDA: add bf16 and i32 to getrows ( #14529 )
b5836
2025-07-07 21:45:43 +08:00
Eve
6491d6e4f1
vulkan: increase LOAD_VEC_A to 8 (IQ1/IQ2) or 4 (IQ3) ( #14485 )
Commit taken from remyoudompheng's PR https://github.com/ggml-org/llama.cpp/pull/12260
Co-authored-by: Rémy Oudompheng <remyoudompheng@gmail.com>
b5835
2025-07-06 12:29:36 +02:00
Jeff Bolz
e592be1575
vulkan: fix rms_norm+mul fusion ( #14545 )
The fused operation was grabbing the epsilon value from the wrong place.
Add an env var to disable fusion.
Add some missing checks for supported shapes/types.
Handle fused rms_norm+mul in check_results.
b5834
2025-07-06 10:08:16 +02:00
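The "env var to disable fusion" part of this commit is a common kill-switch pattern; a minimal sketch, with an illustrative variable name rather than the one the Vulkan backend actually reads:

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Fusion stays on unless the (hypothetical) variable is set to something
// other than "0", giving users an escape hatch when a fused kernel misbehaves.
static bool fusion_enabled() {
    const char * v = std::getenv("EXAMPLE_DISABLE_FUSION");
    return v == nullptr || std::strcmp(v, "0") == 0;
}
```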
Jeff Bolz
a0374a67e2
vulkan: Handle updated FA dim2/3 definition ( #14518 )
* vulkan: Handle updated FA dim2/3 definition
Pack mask boolean and n_head_log2 into a single dword to keep the push
constant block under the 128B limit.
* handle null mask for gqa
* allow gqa with dim3>1
b5833
2025-07-05 09:26:04 +02:00
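Packing a boolean and n_head_log2 into one dword, as this commit does to stay under the 128-byte push-constant limit, looks roughly like this; the bit layout below is assumed for illustration, not taken from the shader:

```cpp
#include <cassert>
#include <cstdint>

// One 32-bit word carries both fields: the mask flag in bit 16 and
// n_head_log2 in the low 16 bits.
static uint32_t pack_mask_nhead(bool has_mask, uint32_t n_head_log2) {
    return (uint32_t(has_mask) << 16) | (n_head_log2 & 0xFFFFu);
}
static bool     unpack_mask (uint32_t w) { return ((w >> 16) & 1u) != 0; }
static uint32_t unpack_nhead(uint32_t w) { return w & 0xFFFFu; }
```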
Sigbjørn Skjæret
ddef99522d
server : fix assistant prefilling when content is an array ( #14360 )
b5832
2025-07-05 09:17:14 +02:00
Sigbjørn Skjæret
6681688146
opencl: add GELU_ERF ( #14476 )
b5831
2025-07-04 23:24:56 -07:00
Georgi Gerganov
bac8bed248
eval-callback : check for empty input ( #14539 )
b5830
2025-07-05 07:18:09 +03:00
R0CKSTAR
b81510a7b7
test-backend-ops: add support for specifying output format ( #14368 )
* test-backend-ops: add support for specifying output format
Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>
* Address review comments
Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>
* Add build_commit and build_number in test_result
Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>
* Address review comments
Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>
* refactor
Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>
* Get build commit from ggml_commit()
Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>
* Merge errors into test_operation_info && address review comments
Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>
* Address review comments
Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>
* Address review comments
Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>
* remove visitor nonsense
* remove visitor comment
Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>
* Address review comments
Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>
---------
Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>
b5829
2025-07-05 12:10:53 +08:00
Georgi Gerganov
ef797db357
metal : disable fast math in all quantize kernels ( #14528 )
ggml-ci
b5828
2025-07-04 19:19:09 +03:00
Georgi Gerganov
67d1ef23c6
batch : add optional for sequential equal split ( #14511 )
ggml-ci
b5827
2025-07-04 09:08:59 +03:00
Georgi Gerganov
7b50f7c025
graph : prepare for 4D mask ( #14515 )
ggml-ci
b5826
2025-07-04 09:05:36 +03:00
Georgi Gerganov
c79184d2d1
batch : add n_used count ( #14512 )
ggml-ci
b5825
2025-07-04 09:04:59 +03:00
luyhcsu
499a8f5a78
CANN: Replace aclrtMemsetSync with aclnnInplaceZero operator ( #14002 )
Co-authored-by: luyuhong <luyuhong@kylinos.cn>
b5824
2025-07-04 11:50:07 +08:00
Francis Couture-Harpin
07c252f038
model : add Jamba to Mamba-specific hparams printing
2025-07-03 17:13:18 -04:00
Francis Couture-Harpin
20f8e43e63
graph : add back hybrid memory graph input
But this time it contains the sub-cache graph inputs.
This *should* make it easier to handle updating the inputs
when caching the graph (eventually).
2025-07-03 17:07:46 -04:00
Sigbjørn Skjæret
28657a8229
ggml : implement GEGLU_ERF and GEGLU_QUICK ops ( #14445 )
b5823
2025-07-03 23:07:22 +02:00
Francis Couture-Harpin
4682e21c46
Merge branch 'master' into compilade/refactor-kv-cache
2025-07-03 16:04:55 -04:00
lhez
bee28421be
opencl : broadcast for soft_max ( #14510 )
b5822
2025-07-03 20:22:24 +02:00
Jeff Bolz
2b72bedec1
vulkan: support mixed/deepseekR1 FA head sizes ( #14509 )
* vulkan: better parameterize FA by head sizes
* vulkan: support mixed/deepseekR1 FA head sizes
b5821
2025-07-03 20:21:14 +02:00
Johannes Gäßler
c8c4495b8d
ggml: backward pass for split swiglu ( #14483 )
b5820
2025-07-03 17:05:18 +02:00
Nicolò Scipione
7b63a71a6b
Fix conditional enabling following arch checks for ggml-sycl ( #14504 )
Signed-off-by: nscipione <nicolo.scipione@codeplay.com>
b5819
2025-07-03 11:00:03 +02:00
Xuan-Son Nguyen
0c2ee38ab7
convert : correct gemma 3n conversion ( #14450 )
* convert : correct gemma 3n conversion
* rm redundant code
2025-07-03 10:03:06 +02:00
Georgi Gerganov
a70c8a0c4b
kv-cache : use ggml_set_rows ( #14285 )
* kv-cache : use ggml_set_rows
ggml-ci
* graph : separate k and v indices
ggml-ci
* cont : remove redundant ifs
ggml-ci
* kv-cache : improve find_slot impl
* kv-cache : bounds-check when accessing slot_info indices
* kv-cache : add comments
ggml-ci
* ggml : add TODOs for adding GGML_OP_SET_ROWS support in the backends
ggml-ci
b5817
2025-07-03 10:53:35 +03:00
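The "bounds-check when accessing slot_info indices" point above is defensive validation before indexing; a hypothetical sketch of the idea, not the kv-cache code itself:

```cpp
#include <cassert>
#include <vector>

// Reject any slot index outside [0, n_cells) before it is used to address
// cache cells; an out-of-range index indicates a bad slot assignment.
static bool slots_valid(const std::vector<int> & idxs, int n_cells) {
    for (int i : idxs) {
        if (i < 0 || i >= n_cells) {
            return false;
        }
    }
    return true;
}
```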
Georgi Gerganov
9067487c44
ggml : fix FA mask dim 2 and 3 ( #14505 )
* ggml : fix FA mask dim 2 and 3
ggml-ci
* backends : unsupport batched FA in CUDA and Vulkan
ggml-ci
* vulkan : disable FA for mask->ne[2] != 1
b5816
2025-07-03 10:46:57 +03:00
Georgi Gerganov
d4cdd9c1c3
ggml : remove kompute backend ( #14501 )
ggml-ci
b5815
2025-07-03 07:48:32 +03:00
Francis Couture-Harpin
908e6559d6
convert : fix jamba conv1d shape squeezing
2025-07-02 23:49:12 -04:00