* devops: move s390x and ppc64le ci build
we have access to ubuntu-24.04-s390x and ppc64le images now
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: disable ppc64le for now since they have compiler errors
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: stop warnings as errors
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: switch to non-macro flag
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: going the llama macro route
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: add big-endian gguf test models
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: disable ppc64le to test s390x, check test build
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: dup .gguf.inp files for big-endian tests
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: dup .gguf.out files for big-endian too
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: add python setup and endian byteswap
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: pooring thing does not have s390x python3
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: add missing rust compiler for s390x
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: try rust actions runner
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* Revert "devops: try rust actions runner"
This reverts commit 3f8db04356033d6c1d7eccc75ca396bc5298250c.
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: try a different path for rust
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: dump home directory and user info
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: install gguf-py only
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: missed relative path
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: remove big-endian files since local swapping is working
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: revert test-tokenizer-0 cmakelists
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* Fix unicode flags conversion from and to uint16_t
Bitfields are allocated in different order on s390x
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* Simplify byteswap command
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* Add byteswapping and git-lfs for test-tokenizers-ggml-vocabs
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* Fix endianness detection in vocab loader
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* Disable test-thread-safety on s390x
In this test a model is downloaded,
then immediately loaded to check if more downloads are needed,
and then used for test.
There is no clean way to separate all those steps
to add byteswapping between them, so just skip this test.
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* Fix q8_0 test in test-quantize-fns
vec_signed uses unexpected rounding mode.
Explicitly use different rounding function.
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: add big-endian stories260K
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: add s390x test-eval-callback
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: fix test does not exist
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: fix model not found llama-eval-callback
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* Fix q3_K dot product error in test-quantize-fns on s390x
Array q8bytes had only 4 elements allocated, but 8 elements accessed.
This lead to write out of bounds and later read of overwritten values out of bounds
and incorrect result.
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: re-enable ppc64le for testing
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: activate test-thread-safety for s390x
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: disable ppc64le tests
for some reason it keeps failing test-thread-safety tests and I do not
have a machine that is able to replicate the tests.
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: LLAMA_FATAL_WARNINGS=ON
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* Correct repository URL for s390x for test-thread-safety model
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* Fix fs_get_cache_directory
Ensure it works even if both XDG_CACHE_HOME and HOME are unset.
This might happen in containers.
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* Re-enable CI for ppc64le
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* Fortify ggml_rope_impl
Only memcpy data from sections argument if it's non-NULL.
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* Add TODO in struct unicode_cpt_flags to reimplement it in endian-independent way
* Update URL for big-endian model
* Update .github/workflows/build.yml
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update remaining mentions of BE models to ggml-org/models repo
---------
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Co-authored-by: Aleksei Nikiforov <aleksei.nikiforov@linux.ibm.com>
Co-authored-by: Aleksei Nikiforov <103434461+AlekseiNikiforovIBM@users.noreply.github.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* implement set_rows with i32 index
* template fix
* test quantized path
warnings--
* Apply suggestions from code review
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* forgotten name change
* deduplicate cuda/sycl and test-fix
* indent++
* vulkan: support set_rows with i32 index type (#16162)
* disable i32 index for webgpu for now
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
Generalize Linux check to `__linux__` to support non-glibc systems (like musl).
Also, return `false` on unknown/untested OS.
Without this commit, the code compiles (with warnings) but fails:
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (Intel(R) Xeon(R) Platinum 8488C)
build: 6487 (51c4cac6) with x86_64-linux-musl-gcc (GCC) 15.1.0 for x86_64-linux-musl (debug)
system info: n_threads = 8, n_threads_batch = 8, total_threads = 16
....
print_info: n_ctx_orig_yarn = 262144
print_info: rope_finetuned = unknown
print_info: model type = 4B
Illegal instruction (core dumped)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* ggml : remove adding extra dim timestep embedding
This commit updates the ggml_timestep_embedding function to no longer
add an extra dimension when the specified dimension is odd.
The motivation for this change is that this introduces an unnecessary
dimension when the dimension is odd, which caused an issue in the
kernels which were not expecting this extra dimension and it resulted in
uninitialized memory for the second to last dimension.
* ggml-cuda : fix padding in timestep embedding kernel
This commit removes the zeroing out of the last dimension now that we
are not adding the extra padding dimension.
* ggml-metal : fix padding in timestep embedding kernel
This commit fixes the zero padding for odd dimensions in
the timestep embedding kernel
* ggml-opencl : fix padding in timestep embedding kernel
This commit fixes the zero padding for odd dimensions in
the timestep embedding kernel.
* ggml-sycl : fix padding in timestep embedding kernel
This commit fixes the zero padding for odd dimensions in
the timestep embedding kernel.
* ggml-vulkan : fix padding in timestep embedding kernel
This commit fixes the zero padding for odd dimensions in
the timestep embedding kernel.
* ggml-cpu : fix padding in timestep embedding function
This commit removes the zeroing out of the last dimension now that we
are not adding the extra padding dimension.
This commit adds a check for GGML_MACHINE_SUPPORTS_i8mm when enabling
MATMUL_INT8 features, ensuring that i8mm intrinsics are only used when
the target hardware actually supports them.
The motivation for this is to fix ggml CI build failures where the
feature detection correctly identifies that i8mm is not supported,
adding the +noi8mm flag, but MATMUL_INT8 preprocessor definitions are
still enabled, causing the compiler to attempt to use vmmlaq_s32
intrinsics without i8mm support.
Refs: https://github.com/ggml-org/ggml/actions/runs/17525174120/job/49909199499
This commit fixes the zero padding for odd dimensions in
ggml_compute_forward_timestep_embedding_f32.
The motivation for this is that currently if an odd dimension is used,
the padding check incorrectly uses the dimension value for indexing.
For example, with dim=15:
Elements 0-6 are set to cosine values
Elements 7-13 are set to sine values
Element 14 is left uninitialized (contains garbage)
Element 15 is correctly set to zero
This fix changes embed_data[dim] to embed_data[2 * half] so that
element 14 (the first unused element) is properly set to zero as well
as the last element.
Resolves: https://github.com/ggml-org/ggml/issues/1324
* vulkan: sort graph to allow more parallel execution
Add a backend proc to allow the backend to modify the graph. The
vulkan implementation looks at which nodes depend on each other
and greedily reorders them to group together nodes that don't
depend on each other. It only reorders the nodes, doesn't change
the contents of any of them.
With #15489, this reduces the number of synchronizations needed.
* call optimize_graph per-split
* ggml: allow casting between f32 and i32
* fix cuda
* add vulkan
* fix CPU non-cont
* add non-cont test case
* add note
* extend test number range
* correct note
* add cont version for vulkan
* SVE support for exponential functions
Add const notation to variable pg
* Update ggml/src/ggml-cpu/vec.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Add const
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
This patch improves GEMM for FP32 Data Type on PowerPC
Implements GEMM on large blocks with configurable block size mc, nc, kc
(default: 256, 256, 256).
Packing Function optimized to access blocks as per memory layout.
GEMM Optimized to work on larger blocks.
Isolated Packing from GEMM Operations for better MMA utilization.
Verified functionality and correctness uing llama-cli and stand alone
test case (performs matmul and compares final mattrix C result with base).
Minor code refactoring changes:
Replace macro with inline function
Code Indent made consistent with 4 spaces
Performance Testing:
Observed 50% ~ 70% improvement in Prompt Processing Speed mesured using
llama-bench with Meta-Llama3-8B FP32 Model. Similar gains observed with
Mistral-7b-Instruct-v0.3 Model.
model Size Params Backend Threads Test Patch Base
llama 8B all F32 29.92 GiB 8.03 B CPU 20 pp512 98.58 60.3
llama 8B all F32 29.92 GiB 8.03 B CPU 20 pp1024 95.88 57.36
llama 8B all F32 29.92 GiB 8.03 B CPU 20 pp2048 85.46 53.26
llama 8B all F32 29.92 GiB 8.03 B CPU 20 pp4096 68.66 45.78
llama 8B all F32 29.92 GiB 8.03 B CPU 20 pp6144 57.35 40.44
25 ~ 30% improvement in llama-batched-bench with Metla-Llama3-8B in
Prompt Processing Speed for large prompts (256, 512, 1024, 2048, 4096)tokens with various batch
sizes ( 1, 2, 4, 8, 16)
Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>
* examples/finetune -opt SGD (stochastic gradient descent) memory opt
add unit tested GGML_OPT_OPTIMIZER_SGD to ggml - avoids allocating
m, v tensors.
support finetune.cpp arg -opt SGD (or sgd). (default adamw as before)
llama 3.2-1b-F32 result: observed 11gb gpu ram (41 sec/epoch)
when using SGD instead of 19gb (55 sec/epoch) using adamw.
(wikipedia 100 lines finetune)
(
using the same GPU memory, adamw can only do before OOM 512
batch/context, reaching:
train: [███████▉] data=0000140/0000140 loss=0.02575±0.00099 acc=99.52±0.03% t=00:00:47 ETA=00:00:00
val: [███████▉] data=0000008/0000008 loss=4.76565±0.28810 acc=41.46±0.77% t=00:00:00 ETA=00:00:00
SGD is superior, though it converges slower, with max before OOM 1728
batch/context (esp see the better validation perf):
train: [███████▉] data=0000039/0000039 loss=0.00371±0.00010 acc=99.96±0.01% t=00:00:41 ETA=00:00:00
val: [███████▉] data=0000003/0000003 loss=5.11406±0.76034 acc=48.01±0.69% t=00:00:01 ETA=00:00:00
)
note: when finetuning long enough (or w/ enough -lr),
validation accuracy *eventually* drops ('catastrophic forgetting')
-lr-half (halflife) option useful for SGD to avoid oscillation or
super slow underdamped learning (makes setting -lr more forgiving).
terminal -lr for now is set by lr-halvings i.e. if you want at most
1/8 the inital -lr you set -lr-halvings 3.
note: objective loss not directly comparable between adamw, sgd? -
check perplexity or accuracy or consider relative improvements
for convergence
new finetune args -wd 1e-9 to enable weight decay in sgd or adamw,
and max -epochs N (default 2 as before)
cache (1 - wd*alpha) in 'adamw' opt struct -
no noticeable perf benefit, disabled (still done
for new SGD though)
since opt. memory is pre-allocated, the ggml_opt_get_optimizer_params
would probably be able to change between SGD and AdamW with each epoch
but would need to use adamw for the first (unconfirmed - no cmdline arg
to set such a policy yet)
test-opt checks adamw as before and now sgd (except for a few disabled
tests for sgd only; probably just needs logging values and adding
alternate reference values); tolerance on the 'regression'
test is broader for sgd (so we don't need many more epochs)
* Vulkan: Implement GGML_OP_OPT_STEP_SGD
* tests: Fix OPT_STEP_SGD test-backend-ops
* SGD op param store weight-decay and not 1-alpha*wd
* minor + cosmetic changes
* fix vulkan sgd
* try CI fix
---------
Co-authored-by: 0cc4m <picard12@live.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Initial Q2_K Block Interleaving Implementation
* Addressed review comments and clean up of the code
* Post rebase fixes
* Initial CI/CD fixes
* Update declarations in arch-fallback.h
* Changes for GEMV Q2_K in arch-fallback.h
* Enable repacking only on AVX-512 machines
* Update comments in repack.cpp
* Address q2k comments
---------
Co-authored-by: Manogna-Sree <elisetti.manognasree@multicorewareinc.com>
* remove redundant code in riscv
* remove redundant code in arm
* remove redundant code in loongarch
* remove redundant code in ppc
* remove redundant code in s390
* remove redundant code in wasm
* remove redundant code in x86
* remove fallback headers
* fix x86 ggml_vec_dot_q8_0_q8_0
* CMake config: Create target only once
Fix error on repeated find_package(ggml).
For simplicity, check only for the top-level ggml::ggml.
* CMake config: Add CUDA link libs
* CMake config: Add OpenCL link libs
* CMake config: Use canonical find_dependency
Use set and append to control link lib variables.
Apply more $<LINK_ONLY...>.
* CMake config: Wire OpenMP dependency
This commit removes the inclusion of `<cstdlib>`.
The motivation for this change is that this source file does not seem to
use any functions from this header and the comment about `qsort` is a
little misleading/confusing.
Remove un-necessary templates from class definition and packing functions
Reduce deeply nested conditionals, if-else switching in mnapck function
Replace repetitive code with inline functions in Packing functions
2 ~ 7% improvement in Q8 Model
15 ~ 50% improvement in Q4 Model
Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>
* ggml : add ggml_scale_bias
* ggml_vec_mad1_f32
* add more simd
* add CUDA
* sycl
* vulkan
* cann (placeholder)
* opencl
* will this fix cpu?
* fix cuda
* suggestions from coderabbit
* fix cann compile error
* vDSP_vsmsa
* rm __ARM_FEATURE_SVE
* use memcpy for op params
* make code looks more consistent
* use scalar for __ARM_FEATURE_SVE
* add x param to ggml_vec_mad1_f32