Commit Graph

6439 Commits

Author SHA1 Message Date
Aaron Teo
a1912c7fa9 devops: fix copying process
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-09 18:07:59 +08:00
Aaron Teo
03e642a9d1 devops: attempt at making it cache the build
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-09 18:05:43 +08:00
Aaron Teo
0084c88929 devops: attempt at fixing missing dir
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-09 17:52:43 +08:00
Aaron Teo
73679520ce devops: formalise llama.cpp loc
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
(cherry picked from commit 0a7664af84)
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-09 17:51:20 +08:00
Aaron Teo
bff187d717 Revert "devops: formalise llama.cpp loc"
This reverts commit 0a7664af84.

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-09 17:47:02 +08:00
Aaron Teo
0a7664af84 devops: formalise llama.cpp loc
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-09 17:40:27 +08:00
Aaron Teo
244d6cf56f devops: update debian target
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-09 17:29:00 +08:00
Aaron Teo
17a9985086 devops: fix missing shared libraries in base
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-09 17:24:23 +08:00
Aaron Teo
489e0ab54f devops: fix typos
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-09 17:19:30 +08:00
Aaron Teo
a0b22c8a29 devops: add cli target
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-09 17:14:33 +08:00
Aaron Teo
f6baab6be8 devops: finalise hardened server stage
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-09 16:59:53 +08:00
Aaron Teo
10714efb6d devops: move libggml-cpu and blas into bin
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-09 16:54:06 +08:00
Aaron Teo
ab79c0bb80 devops: remove shared-object move step
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-09 16:45:17 +08:00
Aaron Teo
944ef7f0bc devops: fix missing ggml shared object
The missing shared object caused a failure to load the model.

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-09 16:38:05 +08:00
Aaron Teo
b23e72e1d0 devops: attempt at fixing model loading failure
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-09 16:19:35 +08:00
Aaron Teo
451aceb9a0 devops: fix unknown model loading failures
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-09 16:16:49 +08:00
Aaron Teo
c3ab7855fd devops: fix permission issue
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-09 15:43:59 +08:00
Aaron Teo
7027c14d3c devops: fix missing stage ref
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-09 15:35:29 +08:00
Aaron Teo
74767bbc16 devops: add collector stage
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-09 15:34:47 +08:00
Aaron Teo
3a09c656a7 devops: fix shared libs
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-09 15:25:01 +08:00
Aaron Teo
28b41f73ed devops: use correct libs path
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-09 02:59:06 +08:00
Aaron Teo
2ff6694a0f devops: fix shared libs in distroless
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-08 18:31:58 +08:00
Aaron Teo
a070157511 devops: remove apt commands from distroless
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-08 18:16:32 +08:00
Aaron Teo
23d34f9a98 devops: remove apt clean steps since distroless lacks apt
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-08 17:57:48 +08:00
Aaron Teo
e172b00445 devops: add server build step
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-08 17:50:10 +08:00
Aaron Teo
e53e1c450c devops: copy more tools
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-08 15:36:41 +08:00
Aaron Teo
ce7bd1955d devops: rework s390x docker
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-08 15:19:41 +08:00
Aaron Teo
955c426620 devops: move s390x docker into cpu docker
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-08 14:56:07 +08:00
Aaron Teo
75846921d8 devops: add missing ninja
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-08 14:03:38 +08:00
Aaron Teo
bdcbcaeead devops: add s390x dockerfile
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-08 13:59:54 +08:00
Jeff Bolz
d413dca003 tests: large sizes for get_rows (#15687) b6409 2025-09-07 23:23:41 -05:00
Chenguang Li
85ca66a746 CANN: Stream sync between devices for acl_graph (#15809)
* CANN: Switch to stream synchronization

Switch to stream synchronization because events are not effective.

Co-authored-by: hipudding <huafengchun@gmail.com>

* CANN: add Comments

---------

Co-authored-by: hipudding <huafengchun@gmail.com>
b6408 2025-09-08 10:03:29 +08:00
Jeff Bolz
3976dfbe00 vulkan: support im2col_3d (#15795) b6407 2025-09-07 13:50:26 -05:00
Aaron Teo
d36e61c580 ggml-cpu: clean up s390x SIMD (#15855)
* ggml-cpu: clean up s390x simd

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
(cherry picked from commit 0da4b6aa07)
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: fix hsum data types

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

---------

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
b6406 2025-09-08 02:18:28 +08:00
Jeff Bolz
c97b5e5854 vulkan: Support pad_ext (#15794) b6405 2025-09-07 19:00:49 +02:00
Jeff Bolz
267e99867f vulkan: Use larger loads in scalar/coopmat1 matmul (#15729)
I think glslang will translate an access like x[i][1].z to
OpAccessChain ... x, i, 1, 2
OpLoad float16_t ...

rather than loading all of x[i] in a single OpLoad. Change the
code to explicitly load the vector/matrix.
b6404 2025-09-07 18:53:07 +02:00
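
The fix itself lives in the GLSL shaders; what follows is only a C++ analogy of the same load-widening idea, with all names invented for illustration. Copying the whole vector into a local makes one wide load explicit, instead of leaving the compiler to emit a narrow load per component.

    #include <cstdint>

    struct f16vec4 { uint16_t x, y, z, w; };   // stand-in for a 4-wide fp16 vector

    // Before: each rows[i].z access can lower to its own narrow load
    // (the OpAccessChain + OpLoad pattern described in the commit).
    uint32_t sum_z_narrow(const f16vec4 *rows, int n) {
        uint32_t acc = 0;
        for (int i = 0; i < n; ++i) acc += rows[i].z;  // scalar load per iteration
        return acc;
    }

    // After: load the whole vector once, then select components in registers.
    uint32_t sum_z_wide(const f16vec4 *rows, int n) {
        uint32_t acc = 0;
        for (int i = 0; i < n; ++i) {
            const f16vec4 v = rows[i];  // one explicit whole-vector load
            acc += v.z;
        }
        return acc;
    }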
Daniel Bevenius
3b15924d71 ggml WebGPU: remove userdata from request adapter callback (#15527)
* ggml WebGPU: remove userdata from request adapter callback

This commit removes the `userdata` parameter from the WebGPU request
adapter callback in `ggml-webgpu.cpp`. Instead, the lambda function
captures the `webgpu_context` directly.

The motivation for this change is to simplify the code and improve
readability.

* inline the callback lambda into the RequestAdapter call

This commit removes the callback lambda variable and inlines it directly
into the RequestAdapter call.
b6403 2025-09-07 11:19:45 +03:00
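
A minimal sketch of the before/after callback shape the commit describes, assuming simplified stand-in types (the actual Dawn/WebGPU signatures differ):

    #include <functional>

    struct webgpu_context { void *adapter = nullptr; };

    // Stand-in for an async API that invokes a completion callback.
    static void request_adapter(const std::function<void(void *)> &on_done) {
        on_done(nullptr);  // dummy immediate completion for the sketch
    }

    // Before: a C-style callback recovers the context from a void* userdata.
    static void adapter_cb(void *adapter, void *userdata) {
        static_cast<webgpu_context *>(userdata)->adapter = adapter;
    }

    // After: the lambda captures the context directly; no userdata, no cast.
    static void init(webgpu_context &ctx) {
        request_adapter([&ctx](void *adapter) { ctx.adapter = adapter; });
    }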
Johannes Gäßler
79bc429262 CUDA: faster tile FA (Pascal/AMD), headsize 256 (#15769) b6402 2025-09-07 00:26:28 +02:00
Charles Xu
c4df49a42d kleidiai: generalize compute_forward_kv_cache to compute_forward_fp16 (#15817) b6401 2025-09-06 22:08:43 +08:00
Xuan-Son Nguyen
3c3635d2f2 server : speed up tests (#15836)
* server : speed up tests

* clean up

* restore timeout_seconds in some places

* flake8

* explicit offline
2025-09-06 14:45:24 +02:00
Xuan-Son Nguyen
61bdfd5298 server : implement prompt processing progress report in stream mode (#15827)
* server : implement `return_progress`

* add timings.cache_n

* add progress.time_ms

* add test

* fix test for chat/completions

* readme: add docs on timings

* use ggml_time_us

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b6399 2025-09-06 13:35:04 +02:00
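
A hypothetical client-side request for this feature, built with nlohmann::json (the field names come from the bullets above; the exact response schema is an assumption):

    #include <nlohmann/json.hpp>
    #include <iostream>

    int main() {
        nlohmann::json req = {
            {"prompt", "Once upon a time"},
            {"stream", true},           // progress is reported per streamed chunk
            {"return_progress", true},  // opt in to prompt-processing progress
        };
        std::cout << req.dump(2) << std::endl;
        // While the prompt is being processed, streamed chunks would carry the
        // progress fields named above (e.g. progress.time_ms, timings.cache_n).
    }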
Johannes Gäßler
01806e7771 ggml-cpu: document use of "free" memory [no ci] (#15834) 2025-09-06 13:28:44 +02:00
Aaron Teo
186415d595 ggml-cpu: drop support for nnpa intrinsics (#15821) b6397 2025-09-06 11:27:28 +08:00
Gabe Goodhart
fd621880f3 aLoRA Support (#15327)
* feat: Add python-side constants and conversion for adapter.lora.invocation_string

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add c++ side constants for adapter.lora.invocation_string

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Parse invocation string for adapters from GGUF

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(python): Update conversion to alora_invocation_tokens

This is the preferred method in PEFT which is the source of ground truth

https://github.com/huggingface/peft/pull/2609/files#diff-13380145401d203d5935c5189dd09879f990b81aa63e8e3aaff8ce9110333f0e

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(cpp): Update to alora_invocation_tokens on c++ side

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add C APIs to get alora invocation token array from lora

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Initial implementation of alora cache logic in server

This does not yet do the part to identify the invocation tokens and only
apply the lora adapter afterwards, but it does seem to produce correct
results if the invocation tokens are the beginning of the uncached input.

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Identify alora invocation sequences

This currently limits to a single enabled alora per slot. Multiple aloras
with different invocation sequences would be possible, but it would require
a more complex integration of the adapter toggling and is not really a well
studied case for alora since it's unclear if one alora can reuse cache from
previous prefill computed with a different alora.

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Only reuse cache for tokens before the alora invocation start

This is a bit of an edge case, but theoretically a user could try the same
query with the alora disabled (just using the base model), then retry with
the alora. The cached tokens from the first pass should be invalid.

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Handle un-cached tokens that come before the alora activation

The solution is to only fill up to the token before the invocation start in
the batch if there are any tokens to be prefilled between those pulled from
cache and the invocation start. When this is detected, the alora is
temporarily disabled with a scale of 0.0, then immediately re-enabled after
it has been initialized for the internal graph. Since the batch does not
complete the prompt tokens, the remaining prompt tokens are handled in the
next task, pulling all of the non-alora tokens from cache and proceeding
with prefill for the alora tokens.

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Use || instead of 'or'

Too much python 🤦

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Fix off-by-one for limiting cached tokens to before alora start

This was the cause of the inconsistent results from the dummy test script
with and without the turn that runs the prompt without the adapter before
running it with the adapter.

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Support backwards-compatibility for "invocation_string" in adapter_config.json

While this has been replaced in the PEFT PR in favor of
alora_invocation_tokens, the existing adapters in the ibm-granite org on HF
use "invocation_string," so this will enable backwards compatibility and
enable testing now (before PEFT PR changes have percolated everywhere).

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Remove duplicate logging

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* feat: Report alora_invocation_string and alora_invocation_tokens from /lora-adapters

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
b6396 2025-09-05 17:32:39 -06:00
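
A much-simplified sketch of the slot logic the commit messages above describe; the names are invented and the real server code is considerably more involved:

    #include <algorithm>
    #include <cstddef>

    struct alora_state {
        size_t invocation_start;  // index of the first alora invocation token
        float  scale;             // adapter scale; 0.0f disables the adapter
    };

    // Only reuse cached tokens strictly before the invocation start, so a prior
    // base-model run of the same prompt cannot poison the alora pass.
    size_t usable_cache(size_t cached_tokens, const alora_state &a) {
        return std::min(cached_tokens, a.invocation_start);
    }

    // If uncached tokens precede the invocation, prefill them first with the
    // adapter disabled (scale 0.0), then re-enable it for the alora tokens.
    size_t next_batch_end(size_t n_past, size_t n_prompt, alora_state &a) {
        if (n_past < a.invocation_start) {
            a.scale = 0.0f;             // base-model prefill for the leading tokens
            return a.invocation_start;  // stop the batch just before the invocation
        }
        a.scale = 1.0f;                 // alora active from the invocation onward
        return n_prompt;
    }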
Sigbjørn Skjæret
4281c7b315 ci : exempt correct research label (#15825) 2025-09-06 01:21:15 +02:00
Gabe Goodhart
5fac79cbc7 Thinking model disabled assistant prefill (#15404)
* feat: Set enable_thinking IFF not disabled and supported

Branch: gabe-l-hart/thinking-model-disabled-agent-prefill

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Fix inverted logic condition for prefill error

Branch: gabe-l-hart/thinking-model-disabled-agent-prefill

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Always parse the enable_thinking kwarg to overwrite the default value

From what I can tell, this started as a Qwen3-specific keyword, but since
`chat.cpp` translates inputs.enable_thinking into the right thinking kwarg
for the given model, it is now more of a standardized kwarg, so it should
always override the default value when sent as part of the
chat_template_kwargs field in the API.

Branch: gabe-l-hart/thinking-model-disabled-agent-prefill

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Don't limit template expansion check to jinja

With the use_jinja check, non-jinja models would enable thinking and always
fail assistant prefill

Branch: gabe-l-hart/thinking-model-disabled-agent-prefill

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add the error text to json type errors in json_value

Branch: gabe-l-hart/thinking-model-disabled-agent-prefill

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Explicitly reject string values for "enable_thinking"

There are too many possible "truthy" / "falsy" strings and too many
ambiguous strings that don't have a clear truthy/falsy value, so the
simplest thing to do here is to reject the request. Ideally, this would be
a 422 (Unprocessable Entity), but right now it's coming back as a 500.

Branch: gabe-l-hart/thinking-model-disabled-agent-prefill

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Move logic for detecting template enable_thinking support to common

Branch: gabe-l-hart/thinking-model-disabled-agent-prefill

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Use raw pointer for common chat template function

Branch: gabe-l-hart/thinking-model-disabled-agent-prefill

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
b6394 2025-09-05 14:31:24 -06:00
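
A hedged sketch of the strict type check for "enable_thinking"; the helper name and error plumbing are invented, and llama.cpp's real json_value helper differs:

    #include <nlohmann/json.hpp>
    #include <stdexcept>
    #include <string>

    static bool bool_kwarg(const nlohmann::json &body, const std::string &key, bool def) {
        if (!body.contains(key)) return def;
        const auto &v = body.at(key);
        if (v.is_boolean()) return v.get<bool>();
        // Reject strings outright: "true"/"yes"/"on"/... are too ambiguous to guess.
        throw std::invalid_argument(key + " must be a JSON boolean, got: " + v.dump());
    }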
Eric Curtin
408ff524b4 Implement --log-colors with always/never/auto (#15792)
With auto by default

Signed-off-by: Eric Curtin <ericcurtin17@gmail.com>
b6393 2025-09-05 19:43:59 +01:00
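
The usual shape of such a tri-state flag, sketched with invented names (the actual option handling lives in llama.cpp's logging code): auto enables colors only when the output is a terminal, so piped or redirected logs stay clean.

    #include <cstdio>
    #include <unistd.h>  // isatty, fileno (POSIX)

    enum class log_colors { AUTO, ALWAYS, NEVER };

    static bool use_colors(log_colors mode) {
        switch (mode) {
            case log_colors::ALWAYS: return true;
            case log_colors::NEVER:  return false;
            default:                 return isatty(fileno(stderr)) != 0;  // AUTO
        }
    }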
Johannes Gäßler
5143fa895e CUDA: fastdiv, launch bounds for mmvq + q8_1 quant (#15802)
* CUDA: fastdiv, launch bounds for mmvq + q8_1 quant
b6392 2025-09-05 16:07:02 +02:00
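
For context, "fastdiv" conventionally means replacing integer division by a precomputed multiply-and-shift; the following is the textbook trick as a generic 32-bit sketch, not the kernel code from the PR. With m = ceil(2^64 / d), the high 64 bits of m * n equal n / d for every 32-bit n and any d >= 2.

    #include <cstdint>

    struct fastdiv32 {
        uint64_t m;  // magic multiplier: ceil(2^64 / d), requires d >= 2
        explicit fastdiv32(uint32_t d) : m(~uint64_t(0) / d + 1) {}
        uint32_t div(uint32_t n) const {
            // 64x64 -> 128-bit multiply (GCC/Clang extension), keep high half.
            return uint32_t(((unsigned __int128)m * n) >> 64);
        }
    };

    // GPUs benefit because hardware integer division is slow; the divisor
    // (e.g. a tensor stride) is fixed per kernel launch, so m can be computed
    // once on the host and passed to the kernel.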
Daniel Bevenius
3a550b5ca4 tests : add --list-ops and --show-coverage options (#15745)
This commit adds two new command-line options to test-backend-ops.cpp
that allow users to list all available GGML operations and to show test
coverage of these operations.

The motivation for this is that it can be useful to quickly see which
operations are currently covered by tests and which are not. It might
also be useful when using the `support` mode.
b6391 2025-09-05 13:49:21 +01:00
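
A toy illustration of what the two options do, with made-up op names (test-backend-ops.cpp implements this against the real GGML op table):

    #include <cstdio>
    #include <cstring>
    #include <set>
    #include <string>
    #include <vector>

    int main(int argc, char **argv) {
        std::vector<std::string> all_ops = {"ADD", "MUL_MAT", "IM2COL_3D", "PAD_EXT"};
        std::set<std::string>    tested  = {"ADD", "MUL_MAT"};

        if (argc > 1 && strcmp(argv[1], "--list-ops") == 0) {
            for (const auto &op : all_ops) printf("%s\n", op.c_str());
        } else if (argc > 1 && strcmp(argv[1], "--show-coverage") == 0) {
            for (const auto &op : all_ops)
                printf("%-10s %s\n", op.c_str(), tested.count(op) ? "covered" : "MISSING");
        }
        return 0;
    }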
Erik Scholz
a81283820a gguf: gguf_writer refactor (#15691)
* gguf: split gguf writer into base and buf impl
* gguf: templated gguf write out
* gguf: file based writer (avoid writing everything to memory first!)
* examples(llama2c): fix inconsistent log levels and compiler nits
b6390 2025-09-05 11:34:28 +02:00
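
A rough sketch of the writer split the bullets describe, with invented names (the real gguf writer API differs): a common byte-sink base, a buffer-backed impl, and a file-backed impl that streams to disk instead of materializing everything in memory.

    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    struct gguf_writer_base {
        virtual ~gguf_writer_base() = default;
        virtual void write(const void *data, size_t n) = 0;
        // Templated convenience layered on the virtual byte sink.
        template <typename T> void write_pod(const T &v) { write(&v, sizeof v); }
    };

    // Buffer impl: accumulates everything in memory first (the old behavior).
    struct gguf_writer_buf : gguf_writer_base {
        std::vector<uint8_t> buf;
        void write(const void *data, size_t n) override {
            auto *p = static_cast<const uint8_t *>(data);
            buf.insert(buf.end(), p, p + n);
        }
    };

    // File impl: streams straight to disk, so a large model never has to be
    // held in memory before being written out.
    struct gguf_writer_file : gguf_writer_base {
        FILE *f;
        explicit gguf_writer_file(FILE *file) : f(file) {}
        void write(const void *data, size_t n) override { fwrite(data, 1, n, f); }
    };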