llama.cpp

mirror of https://github.com/ggml-org/llama.cpp.git synced 2025-11-03 09:22:01 +00:00

Author	SHA1	Message	Date
Frankie Robertson	cd2f37b304	Avoid using __fp16 on ARM with old nvcc (#10616 ) b4258	2024-12-04 01:41:37 +01:00
Benson Wong	da6aac91f1	Add docs for creating a static build (#10268 ) (#10630 ) * Add notes for a static build * Update docs/build.md --------- Co-authored-by: Diego Devesa <slarengh@gmail.com>	2024-12-04 01:40:36 +01:00
piDack	01e6d9bb71	clip : add sycl support (#10574 ) Co-authored-by: piDack <pcdack@hotmail.co> b4256	2024-12-04 01:26:37 +01:00
Jeff Bolz	cc98896db8	vulkan: optimize and reenable split_k (#10637 ) Use vector loads when possible in mul_mat_split_k_reduce. Use split_k when there aren't enough workgroups to fill the shaders. b4255	2024-12-03 20:29:54 +01:00
Xuan Son Nguyen	91c36c269b	server : (web ui) Various improvements, now use vite as bundler (#10599 ) * hide buttons in dropdown menu * use npm as deps manager and vite as bundler * fix build * fix build (2) * fix responsive on mobile * fix more problems on mobile * sync build * (test) add CI step for verifying build * fix ci * force rebuild .hpp files * cmake: clean up generated files pre build b4254	2024-12-03 19:38:44 +01:00
Georgi Gerganov	1cd3df46bd	scripts : remove amx sync ggml-ci b4253	2024-12-03 20:04:49 +02:00
Georgi Gerganov	c505471857	sync : ggml	2024-12-03 20:04:49 +02:00
mahorozte	e9e661bd59	CUDA: remove unnecessary warp reduce in FA (ggml/1032) * kqmax_new_j in every thread within warp is same after operate at line 199,this reduce can be omit * same problem in vec32 --------- Co-authored-by: ZhaoXiaoYu <zhao.xiaoyu@zte.com.cn>	2024-12-03 20:04:49 +02:00
PAB	efb6ae9630	feat: add `GGML_UNARY_OP_ARGMAX` Metal kernel (ggml/1019) * implemented argmax kernel * tpig -> tgpig * change to strides * contiguous assertions * kernel working and tested * argmax simd parallel implementation * added 2 new tests for argmax in test-backend-ops * cosmit * added 3 tests cases for perf eval * add test_argmax in make_test_cases_perf * Update test-backend-ops.cpp Co-authored-by: Diego Devesa <slarengh@gmail.com> --------- Co-authored-by: Diego Devesa <slarengh@gmail.com>	2024-12-03 20:04:49 +02:00
PAB	667d70d170	metal : add `GGML_OP_CONV_TRANSPOSE_1D` kernels (ggml/1026) * wip * wip implementation f32 * kernel conv transpose 1d f32 working * initial commit	2024-12-03 20:04:49 +02:00
Xuan Son Nguyen	3b4f2e33e2	llama : add missing LLAMA_API for llama_chat_builtin_templates (#10636 ) b4248	2024-12-03 12:54:30 +01:00
Nikolaos Pothitos	82bca2257b	readme : add option, update default value, fix formatting (#10271 ) * readme : document --no-display-prompt * readme : update default prompt context size * readme : remove unnecessary indentation Indenting a line with four spaces makes Markdown treat that section as plain text. * readme : indent commands under bullets * readme : indent commands in lettered list	2024-12-03 12:50:08 +02:00
Georgi Gerganov	0115df2f65	metal : small-batch mat-mul kernels (#10581 ) * metal : small-batch mat-mul kernels ggml-ci * metal : add rest of types ggml-ci * metal : final adjustments ggml-ci * metal : add comments ggml-ci b4246	2024-12-03 11:52:33 +02:00
Georgi Gerganov	515d4e5372	github : minify link [no ci] (revert) this doesn't work as expected	2024-12-03 11:21:43 +02:00
Georgi Gerganov	844e2e1fee	github : minify link [no ci]	2024-12-03 11:20:35 +02:00
Georgi Gerganov	70b98fadbc	server : fix default draft model parameters (#10586 ) * server : force F16 KV cache for the draft model ggml-ci * server : fix draft params ggml-ci * server : various params fixes ggml-ci b4243	2024-12-03 11:20:00 +02:00
Xuan Son Nguyen	642330ac7c	llama : add enum for built-in chat templates (#10623 ) * llama : add enum for supported chat templates * use "built-in" instead of "supported" * arg: print list of built-in templates * fix test * update server README b4242	2024-12-02 22:10:19 +01:00
Georgi Gerganov	8648c52101	make : deprecate (#10514 ) * make : deprecate ggml-ci * ci : disable Makefile builds ggml-ci * docs : remove make references [no ci] * ci : disable swift build ggml-ci * docs : remove obsolete make references, scripts, examples ggml-ci * basic fix for compare-commits.sh * update build.md * more build.md updates * more build.md updates * more build.md updates * Update Makefile Co-authored-by: Diego Devesa <slarengh@gmail.com> --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-12-02 21:22:53 +02:00
haopeng	64ed2091b2	server: Add "tokens per second" information in the backend (#10548 ) * add cmake rvv support * add timings * remove space * update readme * fix * fix code * remove empty line * add test --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co> b4240	2024-12-02 14:45:54 +01:00
Akarshan Biswas	991f8aabee	SYCL: Fix and switch to GGML_LOG system instead of fprintf (#10579 ) * Switched to GGML_LOG * Fix missing semicolon b4239	2024-12-02 15:04:11 +08:00
Georgi Gerganov	4cb003dd8d	contrib : refresh (#10593 ) * contrib : refresh * contrib : expand [no ci] * contrib : expand test-backend-ops instructions * contrib : add CODEOWNERS * prs : update template to not have checkbox [no ci]	2024-12-02 08:53:27 +02:00
Juk Armstrong	917786f43d	Add `mistral-v1`, `mistral-v3`, `mistral-v3-tekken` and `mistral-v7` chat template types (#10572 ) * Templates: `mistral-v1`, `mistral-v2`, `mistral-v3`, `mistral-v3-tekken` * Changed system message logic and added tests for all 4 * Invalid `system_message` instead of `content` fixed * Removed tab-indented lines * Added template code and test for `mistral-v7` * Added all tests. Fixed bug with `tmpl == "llama2"` test. * Replaced tabs with spaces. * Removed `'mistral-v2'` option as no (open) models ever used it * Removed all references to 'v2' template from comments * Update llama.cpp Fixed `trim_assistant_message` bug	2024-12-01 23:09:49 +01:00
Georgi Gerganov	5e1ed95583	grammars : add English-only grammar (#10612 )	2024-12-01 21:37:54 +02:00
Wang Qin	5c7a5aa0c3	ci: add error handling for Python venv creation in run.sh (#10608 )	2024-12-01 20:11:42 +02:00
Diego Devesa	3420909dff	ggml : automatic selection of best CPU backend (#10606 ) * ggml : automatic selection of best CPU backend * amx : minor opt * add GGML_AVX_VNNI to enable avx-vnni, fix checks b4234	2024-12-01 16:12:41 +01:00
alek3y	86dc11c5bc	server : bind to any port when specified (#10590 ) b4233	2024-12-01 13:33:12 +02:00
Georgi Gerganov	6acce39710	readme : update the usage section with examples (#10596 ) * readme : update the usage section with examples * readme : more examples	2024-12-01 11:25:17 +02:00
Wang Qin	43957ef203	build: update Makefile comments for C++ version change (#10598 ) b4231	2024-12-01 04:19:44 +01:00
Adrien Gallouët	0c39f44d70	ggml-cpu: replace AArch64 NEON assembly with intrinsics in ggml_gemv_q4_0_4x4_q8_0() (#10567 ) Signed-off-by: Adrien Gallouët <angt@huggingface.co> b4230	2024-11-30 09:13:18 -08:00
Georgi Gerganov	3e0ba0e604	readme : remove old badge	2024-11-30 10:09:21 +02:00
Georgi Gerganov	abadba05be	readme : refresh (#10587 ) * readme : refresh * readme : move section [no ci] * readme : clarify [no ci] * readme : fixes [no ci] * readme : more fixes [no ci] * readme : simplify [no ci] * readme : clarify GGUF	2024-11-30 09:47:07 +02:00
Eve	0533e7fb38	vulkan: Dynamic subgroup size support for Q6_K mat_vec (#10536 ) * subgroup 64 version with subgroup add. 15% faster scalable version tested for subgroup sizes 16-128 * check for subgroup multiple of 16 and greater than 16 * subgroup sizes are always a power of 2 (https://github.com/KhronosGroup/GLSL/issues/45) * force 16 sequential threads per block * make 16 subgroup size a constant b4227	2024-11-30 08:00:02 +01:00
Diego Devesa	7cc2d2c889	ggml : move AMX to the CPU backend (#10570 ) * ggml : move AMX to the CPU backend --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> b4226	2024-11-29 21:54:58 +01:00
Xuan Son Nguyen	b782e5c7d4	server : add more test cases (#10569 ) * server : add split model test * add test speculative * add invalid cases	2024-11-29 21:48:56 +01:00
Robert Collins	3a8e9af402	imatrix : support combine-only (#10492 ) * imatrix-combine-only idea * ensured that behavior consistent with log b4224	2024-11-29 19:21:37 +02:00
Diego Devesa	a3a3048e7a	cleanup UI link list (#10577 ) * cleanup UI link list * sort list alphabetically * add missing licenses	2024-11-29 17:45:08 +01:00
Georgi Gerganov	f0678c5ff4	ggml : fix I8MM Q4_1 scaling factor conversion (#10562 ) ggml-ci b4222	2024-11-29 16:25:39 +02:00
Shupei Fan	4b3242bbea	ggml-cpu: fix typo in gemv/gemm iq4_nl_4_4 (#10580 ) b4221	2024-11-29 14:49:02 +01:00
Alberto Cabrera Pérez	0f77aae560	sycl : offload of get_rows set to 0 (#10432 ) b4220	2024-11-29 20:38:45 +08:00
Alberto Cabrera Pérez	266b8519ee	sycl : Reroute permuted mul_mats through oneMKL (#10408 ) This PR fixes the failing MUL_MAT tests for the sycl backend. b4219	2024-11-29 09:49:43 +00:00
Chenguang Li	938f608742	CANN: RoPE operator optimization (#10563 ) * [cann] RoPE operator optimization * [CANN]Code Formatting --------- Co-authored-by: noemotiovon <noemotiovon@gmail.com> b4218	2024-11-29 14:46:55 +08:00
Jeff Bolz	f095a649ec	vulkan: get the first command buffer submitted sooner (#10499 ) This is an incremental improvement over #9118 to get work to the GPU a bit sooner. The first part is to start with a smaller number of nodes before the first submit, and ramp it up to the current 100 nodes/submit. The second part is to reduce the dryrun overhead for all the nodes that just need to request descriptor space. With these changes I get around 1-2% speedup on RTX 4070 combined with my old Haswell-era CPU. b4217	2024-11-29 07:18:02 +01:00
Ting Lou	678d7994f4	llava: return false instead of exit (#10546 ) b4216	2024-11-29 01:09:46 +01:00
Georgi Gerganov	dc22344088	ggml : remove redundant copyright notice + update authors b4215	2024-11-28 20:46:40 +02:00
Georgi Gerganov	4c0a95b107	llama : add missing model types b4214	2024-11-28 20:45:07 +02:00
Xuan Son Nguyen	6c59567689	server : (tests) don't use thread for capturing stdout/stderr, bump openai client library (#10568 ) * server : (tests) don't use thread for capturing stdout/stderr * test: bump openai to 1.55.2 * bump openai to 1.55.3	2024-11-28 19:17:49 +01:00
Johannes Gäßler	890719311b	common: fix warning message when no GPU found (#10564 ) b4212	2024-11-28 18:15:25 +01:00
Random Fly	7281cf13ad	docs: fix outdated usage of llama-simple (#10565 ) b4211	2024-11-28 16:03:11 +01:00
Diego Devesa	e90688edd0	ci : fix tag name in cuda and hip releases (#10566 ) b4210	2024-11-28 15:58:54 +01:00
Georgi Gerganov	76b27d29c2	ggml : fix row condition for i8mm kernels (#10561 ) ggml-ci b4209	2024-11-28 14:56:37 +02:00

4258 Commits