Prior to this change, we faced undefined cublasLt references when
attempting to compile 'llama-cli' with GGML_STATIC=ON on Linux.
We add linking with CUDA::cublasLt_static when CUDA version is greater
than 10.1.
* CUDA: optimize get_int_from_table_16
* CUDA: use v_perm_b32 to replace byte_perm on AMD GPUs
* revise documentation
---------
Co-authored-by: xix <xiapc@outlook.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* vulkan : support ggml_mean
* vulkan : support sum, sum_rows and mean with non-contiguous tensors
* vulkan : fix subbuffer size not accounting for misalign offset
* tests : add backend-op tests for non-contiguous sum_rows
* cuda : require contiguous src for SUM_ROWS, MEAN support
* sycl : require contiguous src for SUM, SUM_ROWS, ARGSORT support
* require ggml_contiguous_rows in supports_op and expect nb00=1 in the shader
* Add Pad Reflect 1D CUDA support
* Update ggml/src/ggml-cuda/pad_reflect_1d.cu
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* musa: fix build warnings
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* fix warning: comparison of integers of different signs: 'const int' and 'unsigned int' [-Wsign-compare]
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
---------
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
add expicit conversion operator to support older versions of rocm
Switch over to hip_bf16 from legacy hip_bfloat16
Simplify RDNA3 define
Reduce swap over of new hipblas api to rocm 6.5 as this version is used for rocm 7.0 previews
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* examples/finetune -opt SGD (stochastic gradient descent) memory opt
add unit tested GGML_OPT_OPTIMIZER_SGD to ggml - avoids allocating
m, v tensors.
support finetune.cpp arg -opt SGD (or sgd). (default adamw as before)
llama 3.2-1b-F32 result: observed 11gb gpu ram (41 sec/epoch)
when using SGD instead of 19gb (55 sec/epoch) using adamw.
(wikipedia 100 lines finetune)
(
using the same GPU memory, adamw can only do before OOM 512
batch/context, reaching:
train: [███████▉] data=0000140/0000140 loss=0.02575±0.00099 acc=99.52±0.03% t=00:00:47 ETA=00:00:00
val: [███████▉] data=0000008/0000008 loss=4.76565±0.28810 acc=41.46±0.77% t=00:00:00 ETA=00:00:00
SGD is superior, though it converges slower, with max before OOM 1728
batch/context (esp see the better validation perf):
train: [███████▉] data=0000039/0000039 loss=0.00371±0.00010 acc=99.96±0.01% t=00:00:41 ETA=00:00:00
val: [███████▉] data=0000003/0000003 loss=5.11406±0.76034 acc=48.01±0.69% t=00:00:01 ETA=00:00:00
)
note: when finetuning long enough (or w/ enough -lr),
validation accuracy *eventually* drops ('catastrophic forgetting')
-lr-half (halflife) option useful for SGD to avoid oscillation or
super slow underdamped learning (makes setting -lr more forgiving).
terminal -lr for now is set by lr-halvings i.e. if you want at most
1/8 the inital -lr you set -lr-halvings 3.
note: objective loss not directly comparable between adamw, sgd? -
check perplexity or accuracy or consider relative improvements
for convergence
new finetune args -wd 1e-9 to enable weight decay in sgd or adamw,
and max -epochs N (default 2 as before)
cache (1 - wd*alpha) in 'adamw' opt struct -
no noticeable perf benefit, disabled (still done
for new SGD though)
since opt. memory is pre-allocated, the ggml_opt_get_optimizer_params
would probably be able to change between SGD and AdamW with each epoch
but would need to use adamw for the first (unconfirmed - no cmdline arg
to set such a policy yet)
test-opt checks adamw as before and now sgd (except for a few disabled
tests for sgd only; probably just needs logging values and adding
alternate reference values); tolerance on the 'regression'
test is broader for sgd (so we don't need many more epochs)
* Vulkan: Implement GGML_OP_OPT_STEP_SGD
* tests: Fix OPT_STEP_SGD test-backend-ops
* SGD op param store weight-decay and not 1-alpha*wd
* minor + cosmetic changes
* fix vulkan sgd
* try CI fix
---------
Co-authored-by: 0cc4m <picard12@live.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Factor out `reduce_rows_f32` from common.cuh
This increases iteration cycle speed by not having to recompile
every kernel all the time
* Hide memory-latency by loop unrolling in reduce_rows_f32
* Further optimizations to `reduce_rows_f32`
1. Increase threadblock size to better hide latency of memory requests.
As a consequence of bigger threadblocks, do 2-step summation, using
shared memory to communicate results between invocations
2. Use sum_temp array to reduce waits on sum
3. Adjust num_unroll to reflext bigger threadblock
4. Improve default block_dims, increase support for more block_dims
* Add perf tests for `reduce_rows_f32` kernel
* Add heuristic to toggle 128/512 threads based on sm count
Break even point was the minimum of the following multiples.
| GPU Model | Nrow SM Count Multiple |
| ----------- | ----------- |
| RTX 4000 SFF ADA | 2.0x |
| RTX 6000 ADA | 2.5x |
| RTX PRO 6000 Blackwell Max-Q | 3.04x |
| RTX PRO 4500 Blackwell | 3.15x |
* Ensure perf gains also for small ncols and large nrows
Alternative to this, one could have also made the number of unrollings
template-able, but that would require compiling the kernel multiple
times, increasing binary size unnecessarily
* Modify perf and unit-tests
* Apply auto-formatting by clang
* Fix CI build failure
See https://github.com/ggml-org/llama.cpp/actions/runs/16798370266/job/47573716079?pr=15132#step:7:486
Building with VS generator worked though.
* Remove sm_count property from `ggml_backend_cuda_context`
Requested by @JohannesGaessler, and should fix remaining CI issues as a
side-effect
* Add CUB-based implementation for GGML_OP_MEAN
Currently this branch is only executed for nrows==1
* Add heuristics to execute CUB branch only when it brings perf
Heuristics were determined on the following HW:
* RTX 4000 SFF ADA
* RTX 6000 ADA
* RTX PRO 6000 Blackwell Max-Q
* RTX PRO 4500 Blackwell
* Add unit-test for CUB-based mean
Tests should run with CUDA Graphs enabled per default on NVGPUs
* Rename `USE_CUB` to `GGML_CUDA_USE_CUB`
Suggested by @JohannesGaessler
* Unindent Preprocessor directives
See
https://github.com/ggml-org/llama.cpp/pull/15132#discussion_r2269213506
* cuda: refactored ssm_scan to use CUB
* fixed compilation error when when not using CUB
* assign L to constant and use size_t instead of int
* deduplicated functions
* change min blocks per mp to 1
* Use cub load and store warp transpose
* suppress clang warning
This is useful for testing for regressions on GCN with CDNA hardware.
With GGML_HIP_MMQ_MFMA=Off and GGML_CUDA_FORCE_MMQ=On we can conveniently test the GCN code path on CDNA. As CDNA is just GCN renamed with MFMA added and limited use ACC registers, this provides a good alternative for regression testing when GCN hardware is not available.
llvm with the amdgcn target dose not support unrolling loops with conditional break statements, when those statements can not be resolved at compile time. Similar to other places in GGML lets simply ignore this warning.
This commit adds support for MFMA instructions to MMQ. CDNA1/GFX908 CDNA2/GFX90a and CDNA3/GFX942 are supported by the MFMA-enabled code path added by this commit. The code path and stream-k is only enabled on CDNA3 for now as it fails to outperform blas in all cases on the other devices.
Blas is currently only consistently outperformed on CDNA3 due to issues in the amd-provided blas libraries.
This commit also improves the awareness of MMQ towards different warp sizes and as a side effect improves the performance of all quant formats besides q4_0 and q4_1, which regress slightly, on GCN gpus.