Commit Graph

1200 Commits

Georgi Gerganov
307e09cd85 Merge branch 'gguf' into gguf-write-single-pass 2023-08-17 21:51:15 +03:00
Georgi Gerganov
e426b3cfc8 gguf.py : fix vertical alignment 2023-08-17 21:50:01 +03:00
Georgi Gerganov
5484737d58 llama : fix tensor name grepping during quantization
ggml-ci
2023-08-17 21:40:51 +03:00
Georgi Gerganov
57eaadb853 llama : throw error if gguf fails to init from file
ggml-ci
2023-08-17 21:32:14 +03:00
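
The commit above turns a silent failure into a hard error. A minimal sketch of the check it describes, using the public gguf C API from ggml.h (the file name is illustrative, and llama.cpp itself throws a C++ exception rather than returning):

    #include "ggml.h"
    #include <stdio.h>

    int main(void) {
        struct gguf_init_params params = {
            /*.no_alloc =*/ false,
            /*.ctx      =*/ NULL,
        };

        struct gguf_context * ctx = gguf_init_from_file("model.gguf", params);
        if (ctx == NULL) {
            // fail loudly instead of continuing with a bad context
            fprintf(stderr, "error: failed to load model.gguf\n");
            return 1;
        }

        gguf_free(ctx);
        return 0;
    }
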
klosax
b3cc182990 llama.cpp : typo 2023-08-17 20:27:50 +02:00
Georgi Gerganov
acaa98234a convert.py : fix HF tensor permuting / unpacking
ggml-ci
2023-08-17 21:06:45 +03:00
klosax
78e1e57862 quantize-stats.cpp : .bin --> .gguf 2023-08-17 19:18:24 +02:00
klosax
fb11dd3f92 common.h : .bin --> .gguf 2023-08-17 19:16:35 +02:00
Georgi Gerganov
e72c8c2124 ggml : fix bug in gguf_set_kv
ggml-ci
2023-08-17 20:13:48 +03:00
M. Yusuf Sarıgöz
4dbce7d009 gguf : rm file_type key and method 2023-08-17 20:02:38 +03:00
M. Yusuf Sarıgöz
1d93d04ce2 gguf : refactor pth to gguf conversion script 2023-08-17 19:58:27 +03:00
Georgi Gerganov
899f9a5350 llama : fix lambda capture
ggml-ci
2023-08-17 19:49:45 +03:00
Georgi Gerganov
93f285bdf1 gptneox : move as a WIP example 2023-08-17 19:49:45 +03:00
M. Yusuf Sarıgöz
f71704177f gguf : rename h5 to hf (for HuggingFace) 2023-08-17 19:49:15 +03:00
Georgi Gerganov
81a2c2a6f4 llama : fix llama_model_loader memory leak 2023-08-17 19:49:02 +03:00
M. Yusuf Sarıgöz
9f02694c91 gguf : refactor gptneox conversion script 2023-08-17 19:45:06 +03:00
Georgi Gerganov
dd9e2fc988 ci : update ".bin" to ".gguf" extension
ggml-ci
2023-08-17 19:32:14 +03:00
Georgi Gerganov
c3b739374e editorconfig : ignore models folder
ggml-ci
2023-08-17 19:17:25 +03:00
M. Yusuf Sarıgöz
22c61c5b45 gguf : style fixes in simple conversion script 2023-08-17 19:05:43 +03:00
Georgi Gerganov
6d66ef96eb Merge branch 'master' into gguf 2023-08-17 19:04:59 +03:00
Georgi Gerganov
11bf4366c2 llama : sync with recent PRs on master 2023-08-17 19:03:15 +03:00
M. Yusuf Sarıgöz
2f8fc92d86 gguf : fix conflicts 2023-08-17 18:51:14 +03:00
Georgi Gerganov
8ace03ad3d convert.py : better always have n_head_kv and default it to n_head 2023-08-17 18:47:06 +03:00
klosax
d646c4efce convert.py : n_head_kv optional and .gguf file extension 2023-08-17 17:20:36 +02:00
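
The two convert.py commits above make n_head_kv optional and default it to n_head: a model without grouped-query attention has one KV head per query head. A hedged sketch of the equivalent read-side fallback through the gguf C API; the exact key name follows GGUF naming conventions and is an assumption here:

    #include "ggml.h"
    #include <stdint.h>

    static uint32_t read_n_head_kv(struct gguf_context * ctx, uint32_t n_head) {
        const int i = gguf_find_key(ctx, "llama.attention.head_count_kv");
        // key absent -> no GQA: every query head has its own KV head
        return i >= 0 ? gguf_get_val_u32(ctx, i) : n_head;
    }
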
Georgi Gerganov
dd016cc246 Revert "ci : disable CI temporary to not waste energy"
This reverts commit 7e82d25f40.
2023-08-17 17:23:16 +03:00
Georgi Gerganov
2ddd9681d6 convert.py : update to support GGUF output 2023-08-17 17:22:43 +03:00
Georgi Gerganov
e0429d38e4 convert-new.py : output gguf (#2635)
* convert-new.py : output gguf (WIP)

* convert-new.py : add gguf key-value pairs

* llama : add hparams.ctx_train + no longer print ftype

* convert-new.py : minor fixes

* convert-new.py : vocab-only option should work now

* llama : fix tokenizer to use llama_char_to_byte

* tests : add new ggml-vocab-llama.gguf

* convert-new.py : tensor name mapping

* convert-new.py : add map for skipping tensor serialization

* convert-new.py : convert script now works

* gguf.py : pick some of the refactoring from #2644

* convert-new.py : minor fixes
2023-08-17 17:19:52 +03:00
M. Yusuf Sarıgöz
5f97a48fc1 gguf : single pass for writing tensors + refactoring writer 2023-08-17 16:57:50 +03:00
M. Yusuf Sarıgöz
dce07c3121 gguf : single pass for writing tensors + refactoring writer 2023-08-17 16:48:49 +03:00
Kerfuffle
8dae7ce684 Add --cfg-negative-prompt-file option for examples (#2591)
master-8dae7ce
2023-08-17 07:29:44 -06:00
klosax
d6fd53afd6 llama.cpp : use ggml_elements() 2023-08-17 15:24:35 +02:00
klosax
5a0a2c5685 llama.cpp : print actual model size 2023-08-17 15:18:16 +02:00
M. Yusuf Sarıgöz
f31e9230ad gguf : single pass for writing tensors + refactoring writer 2023-08-17 15:19:30 +03:00
Georgi Gerganov
a73ccf1aa3 llama : replace (permute + reshape + view_1d) with (view_3d) (#2538)
ggml-ci
master-a73ccf1
2023-08-17 10:47:09 +03:00
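
For reference, a sketch of the simplification in a73ccf1: one strided 3D view over the K cache replaces the old view_1d + reshape + permute chain. Variable names (kv_k, n_embd_gqa, il, ...) are illustrative assumptions, not the literal llama.cpp code:

    #include "ggml.h"

    static struct ggml_tensor * k_view(
            struct ggml_context * ctx0, struct ggml_tensor * kv_k,
            int n_embd_head, int n_head_kv, int n_embd_gqa,
            int n_past, int N, int n_ctx, int il) {
        const size_t es = ggml_element_size(kv_k);
        return ggml_view_3d(ctx0, kv_k,
            n_embd_head, n_past + N, n_head_kv,  // ne0, ne1, ne2
            es*n_embd_gqa,                       // nb1: stride between token positions
            es*n_embd_head,                      // nb2: stride between heads
            es*n_embd_gqa*n_ctx*il);             // byte offset of layer il
    }
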
drbh
7cf54e1f74 tests : adds simple llama grammar tests (#2618)
* adds simple llama grammar tests

* fix lint and add Makefile

* 0-terminate code_points

* avoid dangling pointers in candidate cleanup

* cleanup grammar at end of test
master-7cf54e1
2023-08-17 10:41:01 +03:00
Shouzheng Liu
a872a2b28e ggml-alloc : fix discrepancy between measure & eval (#2639)
The GGML memory allocator consistently places a tensor within the
optimal-fit memory block, which is the smallest block capable of
accommodating the tensor's size. During the measurement phase, the final
block is generously sized, ensuring it never qualifies as the
optimal-fit block as long as there exists another block capable of
accommodating the tensor. Nevertheless, in the evaluation phase, the
last block is constrained in size and could potentially qualify as the
optimal-fit block. Consequently, there exists the possibility of a
tensor being allocated to a different region during evaluation, leading
to more memory fragmentation in our scratch buffer.

This recent commit guarantees uniform behavior of the allocator across
both the measurement and evaluation phases, eliminating discrepancies
between the two.
master-a872a2b
2023-08-17 10:35:53 +03:00
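
A sketch of the uniform rule described above: search every free block except the last for the smallest one that fits, and fall back to the last block only when nothing else can hold the tensor, so the choice is identical whether that last block is huge (measure phase) or tight (eval phase). The struct layout here is an illustrative assumption:

    #include <stddef.h>
    #include <stdint.h>

    struct free_block { size_t size; /* offset, ... */ };

    // returns the index of the chosen free block, or -1 if nothing fits
    static int find_best_fit(const struct free_block * blocks, int n, size_t size) {
        int    best      = -1;
        size_t best_size = SIZE_MAX;
        // best-fit search over every block except the last
        for (int i = 0; i < n - 1; i++) {
            if (blocks[i].size >= size && blocks[i].size < best_size) {
                best      = i;
                best_size = blocks[i].size;
            }
        }
        // the last block is only a fallback, in measure and eval phases alike
        if (best == -1 && n > 0 && blocks[n - 1].size >= size) {
            best = n - 1;
        }
        return best;
    }
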
M. Yusuf Sarıgöz
42f8fe1927 examples/gguf : no need to keep q option for quantization any more 2023-08-17 08:56:42 +03:00
Kolen Cheung
0919a0f73d cmake : install ggml-meta.metal if LLAMA_METAL (#2449) master-0919a0f 2023-08-16 23:09:49 +03:00
Jhen-Jie Hong
ed53db86c3 metal : print error of load pipeline state (#2564)
* metal : print error of load pipeline state

* metal : return null if load pipeline failed
2023-08-16 23:09:03 +03:00
Shouzheng Liu
fc8ef549e5 metal : enable ggml-alloc (#2627)
* metal: enable ggml-alloc

Make ggml-alloc work with concurrent dispatch.

* style-fix

Co-authored-by: slaren <slarengh@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
master-fc8ef54
2023-08-16 23:08:28 +03:00
Shouzheng Liu
bf83bff674 metal : matrix-matrix multiplication kernel (#2615)
* metal: matrix-matrix multiplication kernel

This commit removes MPS and uses custom matrix-matrix multiplication
kernels for all quantization types. This commit also adds grouped-query
attention to support llama2 70B.

* metal: fix performance degradation from gqa

Integers are slow on the GPU, and 64-bit divides are extremely slow.
In the context of GQA, we introduce a 64-bit divide that cannot be
optimized out by the compiler, which results in a decrease of ~8% in
inference performance. This commit fixes that issue by calculating a
part of the offset with a 32-bit divide. Naturally, this limits the
size of a single matrix to ~4GB. However, this limitation should
suffice for the near future.

* metal: fix bugs for GQA and perplexity test.

I mixed up ne02 and nb02 in the previous commit.
master-bf83bff
2023-08-16 23:07:04 +03:00
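
The gist of the GQA divide fix above, as a hedged C sketch (the real change lives in the Metal kernel; names here are illustrative): map a query head to its KV head with a 32-bit divide and widen only for the final byte offset. Keeping the hot divide in 32 bits is what caps a single matrix at ~4 GB.

    #include <stdint.h>

    // bytes into the KV tensor for the head that query head iq reads from
    static uint64_t kv_byte_offset(uint32_t iq, uint32_t n_head, uint32_t n_head_kv,
                                   uint64_t nb_head /* bytes per KV head */) {
        const uint32_t ratio = n_head / n_head_kv;  // query heads per KV head
        const uint32_t ikv   = iq / ratio;          // 32-bit divide: fast on the GPU
        return (uint64_t) ikv * nb_head;            // widen once; no 64-bit divide
    }
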
Georgi Gerganov
5ec18934ad convert-new.py : pick #2427 for HF 70B support 2023-08-16 20:16:15 +03:00
Georgi Gerganov
c8ee87f141 gguf.py : merge all files in gguf.py 2023-08-16 19:55:49 +03:00
Georgi Gerganov
88b5769487 gguf : deduplicate (#2629)
* gguf : better type names

* dedup : CPU + Metal is working

* ggml : fix warnings about unused results

* llama.cpp : fix line feed and compiler warning

* llama : fix strncpy warning + note token_to_str does not write null

* llama : restore the original load/save session implementation

Will migrate this to GGUF in the future

* convert-llama-h5-to-gguf.py : support alt ctx param name

* ggml : assert when using ggml_mul with non-F32 src1

* examples : dedup simple

---------

Co-authored-by: klosax <131523366+klosax@users.noreply.github.com>
2023-08-16 19:25:29 +03:00
Georgi Gerganov
758ff1bbb5 llama : refactor model loading code (#2620)
* llama : style formatting + remove helper methods

* llama : fix quantization using gguf tool

* llama : simplify gguf_file_saver

* llama : fix method names

* llama : simplify write_header()

* llama : no need to pass full file loader to the file saver

just gguf_ctx

* llama : gguf_file_saver write I32

* llama : refactor tensor names (#2622)

* gguf: update tensor names searched in quantization

* gguf : define tensor names as constants

* gguf : initial write API (not tested yet)

* gguf : write to file API (not tested)

* gguf : initial write API ready + example

* gguf : fix header write

* gguf : fixes + simplify example + add ggml_nbytes_pad()

* gguf : minor

* llama : replace gguf_file_saver with new gguf write API

* gguf : streaming support when writing files

* gguf : remove obsolete write methods

* gguf : remove obsolete gguf_get_arr_xxx API

* llama : simplify gguf_file_loader

* llama : move hparams and vocab from gguf_file_loader to llama_model_loader

* llama : merge gguf-util.h in llama.cpp

* llama : reorder definitions in .cpp to match .h

* llama : minor simplifications

* llama : refactor llama_model_loader (WIP)

wip : remove ggml_ctx from llama_model_loader

wip : merge gguf_file_loader in llama_model_loader

* llama : fix shape prints

* llama : fix Windows build + fix norm_rms_eps key

* llama : throw error on missing KV pairs in model metadata

* llama : improve printing + log metadata

* llama : switch print order of metadata

---------

Co-authored-by: M. Yusuf Sarıgöz <yusufsarigoz@gmail.com>
2023-08-16 14:34:03 +03:00
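
Several bullets in the commit above reference the new gguf write API ("initial write API ready + example", "write to file API", streaming support). A minimal sketch of how that C API is used, with illustrative key values and file name:

    #include "ggml.h"

    int main(void) {
        struct gguf_context * ctx = gguf_init_empty();

        // key-value metadata
        gguf_set_val_str(ctx, "general.architecture", "llama");
        gguf_set_val_u32(ctx, "llama.context_length", 4096);

        // tensors come from an existing ggml context (omitted here):
        // gguf_add_tensor(ctx, some_tensor);

        // single-pass write; `false` means tensor data is written too, not just metadata
        gguf_write_to_file(ctx, "model.gguf", false);

        gguf_free(ctx);
        return 0;
    }
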
klosax
ea5615a03a convert-llama-h5-to-gguf.py : clarify the reverse permute 2023-08-16 11:23:15 +02:00
klosax
4a1741aa2d gptneox-main.cpp : add tensor data layout 2023-08-15 19:56:19 +02:00
klosax
2ae0e985b3 convert-llama-7b-pth-to-gguf.py : add tensor data layout 2023-08-15 19:55:13 +02:00
klosax
66756c82af convert-llama-h5-to-gguf.py : add tensor data layout 2023-08-15 19:54:33 +02:00
klosax
b6056c3db8 gguf.py : add tensor data layout 2023-08-15 19:53:44 +02:00