llama.cpp

mirror of https://github.com/ggml-org/llama.cpp.git synced 2025-10-27 08:21:30 +00:00

Author	SHA1	Message	Date
Georgi Gerganov	e58174cecb	llama : bump max seq limit from 64 to 256 (#15916 ) ggml-ci b6507	2025-09-18 12:47:56 +03:00
Georgi Gerganov	b213fce89b	metal : improve F32, F16 and BF16 mat-vec multiplication (#16057 ) * metal : improve F32, F16 and BF16 mat-vec multiplication ggml-ci * metal : make the NSG a function constant in mul_mv kernels ggml-ci b6506	2025-09-18 12:33:45 +03:00
Jhen-Jie Hong	e00f3fd8ff	metal : avoid call free for non-owned buffer (#16067 ) b6505	2025-09-18 10:06:48 +03:00
Georgi Gerganov	f2f28380ea	metal : handle nil cv during pipeline creation (#16065 ) ggml-ci b6504	2025-09-18 10:03:24 +03:00
Chenguang Li	62c3b645c5	CANN: Remove print (#16044 ) Signed-off-by: noemotiovon <757486878@qq.com> b6503	2025-09-18 09:26:33 +08:00
Reese Levine	d304f459d8	GGML WebGPU: Support for ADD, MUL, RMS_NORM, GET_ROWS operators (#16018 ) * Add paramater buffer pool, batching of submissions, refactor command building/submission * Add header for linux builds * Free staged parameter buffers at once * Format with clang-format * Fix thread-safe implementation * Use device implicit synchronization * Update workflow to use custom release * Remove testing branch workflow * some f32 tests passing * Disable set_rows until it's implemented * f32 add all tests passing * Begin work on set_rows * Work on set rows * Add error buffers for reporting unsupported SET_ROWS indices * Remove extra comments * Add templated addition, clean up code * Get addition and multiplication working * Implement rms_norm * Add get_rows implementation * Add new get_rows files * Refactor use of wg size entry * Fix compilation * Try manually unrolled q4_0 quant * Revert "Try manually unrolled q4_0 quant" This reverts commit `77f8b96515`. * Move to constant max wg size * Check for tensor size in supports_op * Vectorize f32 and change default workgroup size * Move f32 get_rows from < 4 to % 4 != 0 * fix linter errors * Add in-place tests --------- Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local> b6502	2025-09-17 13:09:40 -07:00
Georgi Gerganov	0320ac5264	metal : refactor + optimize v2 (#15995 ) * metal : improve naming * metal : refactor device ggml-ci * cont : props ggml-ci * metal : apply ggml_mem_ranges_t ggml-ci * metal : remove GGML_METAL_USE_BF16 ggml-ci * metal : refactor device buffer ggml-ci * cont : fix naming * metal : sync before destroying the backend ggml-ci * metal : refactor context ggml-ci * metal : migrate ggml-metal.m to ggml-metal.cpp ggml-ci * metal : adjust ops API ggml-ci * metal : use C++ to store piplienes ggml-ci * metal : migrate ops to separate functions ggml-ci * metal : add ggml_metal_library_t ggml-ci * metal : improve naming ggml-ci * metal : cleanp ggml-ci * metal : add support for GGML_OP_LOG ggml-ci * metal : fix error handling ggml-ci b6501	2025-09-17 20:38:12 +03:00
Aleksander Grygier	a7a98e0fff	SvelteKit-based WebUI (#14839 ) b6500	2025-09-17 19:29:13 +02:00
Xuan-Son Nguyen	8f8f2274ee	convert : add Llama4ForCausalLM (#16042 ) * convert : add Llama4ForCausalLM * handle swa * half working version * fix use_kq_norm * fix use_kq_norm b6499	2025-09-17 19:18:21 +02:00
Johannes Gäßler	c959b676be	CUDA: fix FA occupancy, optimize tile kernel (#15982 ) b6498	2025-09-17 15:32:42 +02:00
David Ribeiro Alves	cd08fc3ecc	common : Fix corrupted memory error on json grammar initialization (#16038 ) Initalizing RESERVED_NAME in is_reserved_name() is not thread safe and leads to corrupted memory when used from multiple threads as can be seen in the asan trace below. This fixes the initialization to make it thread-safe. #0 0x000100abd018 in std::__1::pair<std::__1::__hash_iterator<std::__1::__hash_node<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, void>>, bool> std::__1::__hash_table<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::hash<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::equal_to<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>>::__emplace_unique_key_args<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&>(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&) __hash_table:1565 #1 0x000100ab0320 in SchemaConverter::visit(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&) json-schema-to-grammar.cpp:802 #2 0x000100aafc48 in std::__1::__function::__func<build_grammar(std::__1::function<void (common_grammar_builder const&)> const&, common_grammar_options const&)::$_2, std::__1::allocator<build_grammar(std::__1::function<void (common_grammar_builder const&)> const&, common_grammar_options const&)::$_2>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> (std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&)>::operator()(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&) function.h:319 #3 0x000100a2c938 in std::__1::__function::__func<common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool)::$_0::operator()(common_grammar_builder const&) const::'lambda'(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&), std::__1::allocator<common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool)::$_0::operator()(common_grammar_builder const&) const::'lambda'(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&)>, void (nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&)>::operator()(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&) function.h:319 #4 0x000100a139f8 in foreach_function(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&, std::__1::function<void (nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&)> const&) chat.cpp:762 #5 0x000100a2a7f4 in std::__1::__function::__func<common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool)::$_0, std::__1::allocator<common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool)::$_0>, void (common_grammar_builder const&)>::operator()(common_grammar_builder const&) function.h:319 #6 0x000100aa98f4 in build_grammar(std::__1::function<void (common_grammar_builder const&)> const&, common_grammar_options const&) json-schema-to-grammar.cpp:982 #7 0x0001009c9314 in common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool) chat.cpp:1110 #8 0x0001009b8afc in common_chat_templates_apply_jinja(common_chat_templates const, common_chat_templates_inputs const&) chat.cpp:1992 #9 0x0001009b533c in common_chat_templates_apply(common_chat_templates const, common_chat_templates_inputs const&) chat.cpp:2074 #10 0x000100810120 in llamacpp_apply_chat_template+0x724 (predict_oai-98384e17fb94e863:arm64+0x100090120) ... ==45482==Register values: x[0] = 0x00006020004147f8 x[1] = 0x00006080000013c8 x[2] = 0x0000000000000000 x[3] = 0x0000604006289738 x[4] = 0x0000000000000002 x[5] = 0x0000000000000001 x[6] = 0x04034000004b4000 x[7] = 0x0000000000000001 x[8] = 0xbebebebebebebebe x[9] = 0x17d7d7d7d7d7d7d7 x[10] = 0x00000c04000828ff x[11] = 0x0000000000000001 x[12] = 0x000000002018d383 x[13] = 0x0000000000000000 x[14] = 0xfa0000000000fafa x[15] = 0x000010700001ffff x[16] = 0x000000019dc012c0 x[17] = 0x00000001021284f8 x[18] = 0x0000000000000000 x[19] = 0x00000001700acdc0 x[20] = 0x0000000000000002 x[21] = 0x000000002018d384 x[22] = 0x16dd16fd2e731151 x[23] = 0x0000007000020000 x[24] = 0x0000000100c69c08 x[25] = 0x0000000100c69c20 x[26] = 0x00006080000013c7 x[27] = 0x0000000100c69c00 x[28] = 0x00000001700acd60 fp = 0x00000001700aceb0 lr = 0x0000000100abce30 sp = 0x00000001700acd60 AddressSanitizer can not provide additional info. SUMMARY: AddressSanitizer: SEGV __hash_table:1565 in std::__1::pair<std::__1::__hash_iterator<std::__1::__hash_node<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, void>>, bool> std::__1::__hash_table<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::hash<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::equal_to<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>>::__emplace_unique_key_args<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&>(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&) Thread T5 created by T0 here: #0 0x0001020b99d4 in pthread_create+0x5c (libclang_rt.asan_osx_dynamic.dylib:arm64e+0x359d4) #1 0x000100873910 in std::sys::pal::unix:🧵:Thread:🆕:h77254fdd87a28e05+0x118 (predict_oai-98384e17fb94e863:arm64+0x1000f3910) #2 0x0001007c7a1c in test::run_test::haeb3c2bcd5ed6cf6+0x76c (predict_oai-98384e17fb94e863:arm64+0x100047a1c) #3 0x0001007aedb0 in test::console::run_tests_console::he9d142d704f3a986+0x149c (predict_oai-98384e17fb94e863:arm64+0x10002edb0) #4 0x0001007c5758 in test::test_main::hf86a5e20735245b9+0x118 (predict_oai-98384e17fb94e863:arm64+0x100045758) #5 0x0001007c5da0 in test::test_main_static::h61ee9c8fd30abca0+0x54 (predict_oai-98384e17fb94e863:arm64+0x100045da0) ... ==45482==ABORTING b6497	2025-09-17 11:08:02 +03:00
Eve	cb5bb6cc05	vulkan: automatically remove unsupported devices (#15976 ) * remove unsupported vulkan devices * make this happen during selection instead * pass by reference b6496	2025-09-17 09:35:37 +02:00
Daniel Bevenius	a91d035b90	ci : revert back to macos-13 for macOS-latest-cmake-x64 (#16040 ) This commit reverts the change of the runs-on parameter for the macOS-latest-cmake-x64 job back to macos-13 that was make in Commit `51abc96bdc` ("ci : update macos-latest* jobs to use macos-latest (#15938)"). The motivation for this is that using macos-latest will cause an ARM based runner to be used, and not an x64 based runner. Refs: https://github.com/ggml-org/llama.cpp/pull/15938#issuecomment-3300805127	2025-09-17 09:34:09 +02:00
Jie Fu (傅杰)	745cbcf2fe	llama-quant : fix the verification of attention layers for encoder-decoder models (#16023 ) Signed-off-by: Jie Fu <jiefu@tencent.com> b6494	2025-09-17 09:30:55 +02:00
Jie Fu (傅杰)	1cbd80f8cf	examples : support encoder-decoder models in the simple example (#16002 ) Signed-off-by: Jie Fu <jiefu@tencent.com> b6493	2025-09-17 10:29:00 +03:00
Shane A	85286f3548	model : add OLMo3 support (#16015 ) * Add HF to gguf conversion logic for Olmo3 * Add Olmo3 implementation * Update rope comment * Fix indentation Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Apply suggestion from @CISC Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> b6492	2025-09-17 09:01:58 +02:00
Chenguang Li	d5fabe3682	CANN: Optimize ggml_cann_set_device (#15935 ) * CANN: Fix ggml_cann_set_device to avoid redundant device switches - Added a check to skip aclrtSetDevice if the current device is already set. - Prevents unnecessary context switches while keeping thread/device consistency. * CANN: add device default id b6491	2025-09-17 14:33:08 +08:00
jacekpoplawski	8ff206097c	llama-bench: add --n-cpu-moe support (#15952 ) * llama-bench: add --n-cpu-moe support Support --n-cpu-moe in llama-bench the same way it is supported by llama-server. b6490	2025-09-16 16:17:08 +02:00
Daniel Bevenius	77475530b8	ci : use macos-latest for arm64 webgpu build (#16029 ) This commit updates the runs-on field for the macOS arm64 webgpu build job to use macos-latest instead of just latest. The motivation for this is that this job can wait for a runner to pick up the job for a very long time, sometimes over 7 hours. This is an attempt to see if this change can help reduce the wait time. Refs: https://github.com/ggml-org/llama.cpp/actions/runs/17754163447/job/50454257570?pr=16004	2025-09-16 15:27:52 +02:00
Daniel Bevenius	3913f8730e	ggml : fix padding in timestep embedding kernels (#15932 ) * ggml : remove adding extra dim timestep embedding This commit updates the ggml_timestep_embedding function to no longer add an extra dimension when the specified dimension is odd. The motivation for this change is that this introduces an unnecessary dimension when the dimension is odd, which caused an issue in the kernels which were not expecting this extra dimension and it resulted in uninitialized memory for the second to last dimension. * ggml-cuda : fix padding in timestep embedding kernel This commit removes the zeroing out of the last dimension now that we are not adding the extra padding dimension. * ggml-metal : fix padding in timestep embedding kernel This commit fixes the zero padding for odd dimensions in the timestep embedding kernel * ggml-opencl : fix padding in timestep embedding kernel This commit fixes the zero padding for odd dimensions in the timestep embedding kernel. * ggml-sycl : fix padding in timestep embedding kernel This commit fixes the zero padding for odd dimensions in the timestep embedding kernel. * ggml-vulkan : fix padding in timestep embedding kernel This commit fixes the zero padding for odd dimensions in the timestep embedding kernel. * ggml-cpu : fix padding in timestep embedding function This commit removes the zeroing out of the last dimension now that we are not adding the extra padding dimension. b6488	2025-09-16 15:25:57 +02:00
Daniel Bevenius	76888d202e	ci : upload xcframework artifact from ios-xcode-build job (#16010 ) This commit updates the github workflows build.yml file to include steps for uploading and downloading the xcframework artifact. The macos-latest-swift job now depends on the ios-xcode-build job and downloads the xcframework artifact produced by it. The motivation for this changes is that it takes a long time to build the xcframework and we are currently doing this twice in the workflow. With this change, we only build it once and reuse the artifact.	2025-09-16 13:41:38 +02:00
Bowen Han	f1fbffb5c0	fix: apply clang-format to CUDA macros (#16017 ) clang-format previously broke long CUDA macros (e.g. __launch_bounds__) into unreadable line breaks inside template declarations, such as: template<int D, int ncols, int nwarps, int VKQ_stride, typename KQ_acc_t, bool use_logit_softcap> __launch_bounds__(nwarps*ggml_cuda_get_physical_warp_size(), 1) This change adjusts formatting rules so that CUDA macros remain consistent and aligned with the surrounding template syntax.	2025-09-16 08:59:19 +02:00
Daniel Bevenius	51abc96bdc	ci : update macos-latest* jobs to use macos-latest (#15938 ) * ci : update macos-latest* jobs to use macos-latest This commit updates the jobs that are named macos-latest* to use the macos-latest label instead explicit versions. The motivation for this is that there is currently a mixuture of versions in this workflow and there are jobs that are failing because they require a newer version. Refs: https://github.com/ggml-org/llama.cpp/actions/runs/17644792595/job/50140010907#step:5:1759 * ci : add xcodebuild -downloadPlatform iOS command	2025-09-16 05:57:16 +02:00
Yuri Khrustalev	07808ebb07	cmake : Do not install tools on iOS targets (#15903 ) b6484	2025-09-16 09:54:44 +07:00
Aman Gupta	6d758839ff	Add LLaDA-7b-MoE diffusion model (#16003 ) b6483	2025-09-16 10:38:28 +08:00
Jake Karnes	3d4053f77f	CUDA: fix im2col_3d to respect non-contiguous inputs (views) (#15956 ) * fix im2col_3d to respect non-contiguous inputs (views) The CUDA 3D im2col kernel computed source addresses assuming compact layout (products of dims), ignoring nb[] strides. This patch switches im2col_3d source indexing to use true strides derived from src1->nb[] (in elements), mirroring the approach used in the 2D CUDA im2col path. Destination indexing is unchanged. * use ggml_element_size() for src strides Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de> b6482	2025-09-16 00:28:31 +02:00
Diego Devesa	dc381aa9a6	docker : enable rocWMMA in ROCm images, add gfx1151 (#15997 )	2025-09-15 23:38:52 +02:00
Diego Devesa	10d197409b	releases : switch to rocWMMA develop branch, add gfx1151 (#15992 ) * releases : switch to rocWMMA develop branch, add gfx1151 * remove unused variable ROCM_VERSION b6480	2025-09-15 23:38:42 +02:00
yael-works	b907255f4b	SYCL: Add COUNT_EQUAL operator support (#15991 ) * SYCL: Add COUNT_EQUAL operator support (rebased on master) * SYCL: remove duplicate op_count_equal definition * tests: remove test_count_equal_typed and use test_count_equal for all cases * tests: keep only I32 case for COUNT_EQUAL as suggested * tests: keep only I32 case for COUNT_EQUAL as requested b6479	2025-09-15 18:51:35 +02:00
Nikolay Popov	28c39da7c6	llama-run: Fix model download on Windows (#15988 ) * llama-run: Fix model download on Windows * fix SSL error (SSL peer certificate or SSH remote key was not OK) * fix program crash on std::filesystem::rename * llama-run: create a separate method to utilize RAII * llama-run: handle rename exception b6478	2025-09-15 11:08:30 +01:00
Aman Gupta	106220562a	CUDA: some micro-optimizations in mmf.cuh for mul_mat_id (#15926 ) b6477	2025-09-15 17:35:11 +08:00
ddh0	a68f31edd7	fix KLD percentile output (#15999 ) In `llama-perplexity`, when using `--kl-divergence`, the KL divergence statistics output mistakenly displays the 99th percentile twice. This change fixes that and correctly displays the 90th percentile as originally intended (presumably). b6476	2025-09-15 09:54:57 +02:00
Sigbjørn Skjæret	b8e09f08b9	model : add grok-2 support (#15539 ) * add grok-2 support * type fix * type fix * type fix * "fix" vocab for invalid sequences * fix expert tensor mapping and spaces in vocab * add chat template * fix norm tensor mapping * rename layer_out_norm to ffn_post_norm * ensure ffn_post_norm is mapped * fix experts merging * remove erroneous FFN_GATE entry * concatenate split tensors and add more metadata * process all expert layers and try cat instead of hstack * add support for community BPE vocab * fix expert feed forward length and ffn_down concat * commit this too * add ffn_up/gate/down, unsure if sequence is right * add ffn_gate/down/up to tensor names * correct residual moe (still not working) * mess-- * fix embedding scale being applied twice * add built in chat template * change beta fast for grok if default value * remove spm vocab in favor of community bpe vocab * change attention temp length metadata type to integer * update attention temp length metadata * remove comment * replace M_SQRT2 with std::sqrt(2) * add yarn metadata, move defaults to hparams b6475	2025-09-14 23:00:59 +02:00
Sigbjørn Skjæret	6c019cb04e	server : only attempt to enable thinking if using jinja (#15967 ) b6474	2025-09-14 21:17:04 +02:00
Georgi Gerganov	9dcd200d57	metal : remove memory pools (#15966 ) * metal : remove mem pool usage ggml-ci * metal : remove mem pool implementation ggml-ci * metal : take into account the actual allocated memory of the tensor ggml-ci * cont : use ggml_backend_buft_get_alloc_size ggml-ci * cont : improve, comments ggml-ci * cont : add functions for the extra tensor sizes * metal : add comments ggml-ci * metal : implement .get_alloc_size for the rest of the buffer types ggml-ci * metal : remove ggml_metal_heap ggml-ci b6473	2025-09-14 22:02:32 +03:00
Adam	0fa154e350	rocm.Dockerfile: added gfx1200,gfx1201 architectures to support AMD Radeon RX 9000 series (#15994 ) * rocm.Dockerfile: added gfx1200,gfx1201 architectures to support AMD Radeon RX 9000 series https://rocm.docs.amd.com/projects/install-on-linux/en/docs-6.4.1/reference/system-requirements.html#rdna-os states the Radeon RX 9000 series is supported support from Ubuntu 24.04.2, and the dockerfile is using 24.04 which is ROCm 6.4. This fixed the `ROCm error: invalid device function` I was getting when trying to use the rocm container.	2025-09-14 20:43:54 +02:00
Ruben Ortlam	261e6a20ff	Vulkan: Clean up mul_mm shader (#15987 ) * vulkan: move mul_mm dequantization steps into a separate file and functions * improve mul_mm vector load code * fix debug mode issues and warnings b6471	2025-09-14 16:56:28 +02:00
lcy	a0e13dcbe5	build: fix the build failures of Windows HIP release job (#15984 ) * build: fix the cache keys for Windows HIP release job Update the cache keys to include the HIP SDK version, preventing the use of outdated ROCm installation caches. * build: sync changes from release.yml to build.yml - Update HIP SDK version to 25.Q3 and ROCm version to 6.4.2 - Update the cache keys to reflect the new versions * build: remove Windows HIP release for gfx1151 since the current stable rocWMMA does not support gfx1151. b6470	2025-09-14 07:20:35 -07:00
Georgi Gerganov	a14bd35014	metal : fix kernel requirements (#15983 ) * metal : fix kernel requirements ggml-ci * cont : fix supports_op * cont : fix supports_op for ARGMAX b6469	2025-09-14 15:33:22 +03:00
Radoslav Gerganov	918b26f197	rpc : fix regression when --device is used (#15981 ) Fix regression introduced with commit `50f4281a6`	2025-09-14 12:28:18 +03:00
Diego Devesa	9ecb884346	releases : update ROCM, add gfx1200, gfx1201, gfx1151 (#15972 ) * releases : update ROCM, add gfx1200, gfx1201, gfx1151 * releases : set target to 13.3 for macos-x64 * add hipblaslt.dll to release * add hipblaslt/library to release	2025-09-14 02:21:59 -07:00
Radoslav Gerganov	d1c6f11f47	doc : update documentation for --tensor-split (#15980 ) * doc : update documentation for --tensor-split * Update tools/main/README.md Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Update tools/main/README.md Co-authored-by: Diego Devesa <slarengh@gmail.com> --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de> Co-authored-by: Diego Devesa <slarengh@gmail.com>	2025-09-14 12:10:07 +03:00
Aaron Teo	6380d6a3e7	ggml-zdnn: rm user mapped buffers (#15965 ) * ggml-zdnn: rm user mapped buffers Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: rm dead code Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: attempt to fix missing extra data buffer free Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> --------- Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>	2025-09-14 13:37:03 +08:00
Jeff Bolz	aa0c461efe	vulkan: fix failing dequant shaders (#15862 ) * vulkan: fix failing dequant shaders * add missing const	2025-09-13 17:29:43 +02:00
Jeff Bolz	b9c9c9f789	vulkan: initialize vulkan-hpp to allow using extension function pointers (#15705 ) Use this to query register count for shader compiles on NVIDIA. Currently this is only for performance debug, but it could eventually be used in some heuristics like split_k.	2025-09-13 17:23:30 +02:00
Diego Devesa	50f4281a6f	llama : allow using iGPUs with --device (#15951 ) * llama : allow using iGPUs with --device * mtmd : allow iGPU * rpc-server : allow iGPU	2025-09-13 16:49:49 +02:00
Georgi Gerganov	55758b00ca	metal : refactor kernel loading (#15964 ) * metal : refactor bin kernels loading ggml-ci * metal : refactor rms kernel loading ggml-ci * ci : try to add memory leaks check ggml-ci * ci : try to enable memory leak detection for Mac * cont : seems to be working	2025-09-13 16:24:22 +03:00
Georgi Gerganov	f161463a54	metal : allow ops to run concurrently (#15929 ) * metal : run graphs ops concurrently ggml-ci * cont : add flags for debugging and disabling concurrency ggml-ci * cont : refactor and handle fusing ggml-ci * cont : simplify - no need to use GPU address ggml-ci * cont : prepare mem ranges for reuse + add ggml-metal-common.cpp ggml-ci * cont : avoid redundant keywords in cpp [no ci] * metal : reorder graph for better concurrency ggml-ci * metal : fix race on mem pool buffers ggml-ci * cont : add env GGML_METAL_GRAPH_OPTIMIZE_DISABLE ggml-ci * cont : refactor, optimize, add comments ggml-ci * cont : refactor ggml-metal.m ggml-ci * minor : update logs [no ci]	2025-09-13 13:54:28 +03:00
Georgi Gerganov	84d7b2fca1	metal : fix memory leaks (#15962 ) ggml-ci	2025-09-13 12:45:04 +03:00
Aaron Teo	40be51152d	ggml-zdnn: fix #15414 , activate FP16 and BF16 acceleration and incorrect zTensor free (#15839 )	2025-09-13 02:39:52 +08:00

1 2 3 4 5 ...

6507 Commits