Commit Graph

1200 Commits

Georgi Gerganov
307e09cd85 Merge branch 'gguf' into gguf-write-single-pass 2023-08-17 21:51:15 +03:00
Georgi Gerganov
e426b3cfc8 gguf.py : fix vertical alignment 2023-08-17 21:50:01 +03:00
Georgi Gerganov
5484737d58 llama : fix tensor name grepping during quantization
ggml-ci
2023-08-17 21:40:51 +03:00
Georgi Gerganov
57eaadb853 llama : throw error if gguf fails to init from file
ggml-ci
2023-08-17 21:32:14 +03:00
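
The commit above turns a silent failure into a hard error. A minimal sketch of the check it describes, using the public gguf C API from ggml.h (the file name is illustrative, and llama.cpp itself throws a C++ exception rather than returning):

    #include "ggml.h"
    #include <stdio.h>

    int main(void) {
        struct gguf_init_params params = {
            /*.no_alloc =*/ false,
            /*.ctx      =*/ NULL,
        };

        struct gguf_context * ctx = gguf_init_from_file("model.gguf", params);
        if (ctx == NULL) {
            // fail loudly instead of continuing with a bad context
            fprintf(stderr, "error: failed to load model.gguf\n");
            return 1;
        }

        gguf_free(ctx);
        return 0;
    }
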
klosax
b3cc182990 llama.cpp : typo 2023-08-17 20:27:50 +02:00
Georgi Gerganov
acaa98234a convert.py : fix HF tensor permuting / unpacking
ggml-ci
2023-08-17 21:06:45 +03:00
klosax
78e1e57862 quantize-stats.cpp : .bin --> .gguf 2023-08-17 19:18:24 +02:00
klosax
fb11dd3f92 common.h : .bin --> .gguf 2023-08-17 19:16:35 +02:00
Georgi Gerganov
e72c8c2124 ggml : fix bug in gguf_set_kv
ggml-ci
2023-08-17 20:13:48 +03:00
M. Yusuf Sarıgöz
4dbce7d009 gguf : rm file_type key and method 2023-08-17 20:02:38 +03:00
M. Yusuf Sarıgöz
1d93d04ce2 gguf : refactor pth to gguf conversion script 2023-08-17 19:58:27 +03:00
Georgi Gerganov
899f9a5350 llama : fix lambda capture
ggml-ci
2023-08-17 19:49:45 +03:00
Georgi Gerganov
93f285bdf1 gptneox : move as a WIP example 2023-08-17 19:49:45 +03:00
M. Yusuf Sarıgöz
f71704177f gguf : rename h5 to hf (for HuggingFace) 2023-08-17 19:49:15 +03:00
Georgi Gerganov
81a2c2a6f4 llama : fix llama_model_loader memory leak 2023-08-17 19:49:02 +03:00
M. Yusuf Sarıgöz
9f02694c91 gguf : refactor gptneox conversion script 2023-08-17 19:45:06 +03:00
Georgi Gerganov
dd9e2fc988 ci : update ".bin" to ".gguf" extension
ggml-ci
2023-08-17 19:32:14 +03:00
Georgi Gerganov
c3b739374e editorconfig : ignore models folder
ggml-ci
2023-08-17 19:17:25 +03:00
M. Yusuf Sarıgöz
22c61c5b45 gguf : style fixes in simple conversion script 2023-08-17 19:05:43 +03:00
Georgi Gerganov
6d66ef96eb Merge branch 'master' into gguf 2023-08-17 19:04:59 +03:00
Georgi Gerganov
11bf4366c2 llama : sync with recent PRs on master 2023-08-17 19:03:15 +03:00
M. Yusuf Sarıgöz
2f8fc92d86 gguf : fix conflicts 2023-08-17 18:51:14 +03:00
Georgi Gerganov
8ace03ad3d convert.py : better always have n_head_kv and default it to n_head 2023-08-17 18:47:06 +03:00
klosax
d646c4efce convert.py : n_head_kv optional and .gguf file extension 2023-08-17 17:20:36 +02:00
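
The two convert.py commits above make n_head_kv optional and default it to n_head: a model without grouped-query attention has one KV head per query head. A hedged sketch of the equivalent read-side fallback through the gguf C API; the exact key name follows GGUF naming conventions and is an assumption here:

    #include "ggml.h"
    #include <stdint.h>

    static uint32_t read_n_head_kv(struct gguf_context * ctx, uint32_t n_head) {
        const int i = gguf_find_key(ctx, "llama.attention.head_count_kv");
        // key absent -> no GQA: every query head has its own KV head
        return i >= 0 ? gguf_get_val_u32(ctx, i) : n_head;
    }
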
Georgi Gerganov
dd016cc246 Revert "ci : disable CI temporary to not waste energy"
This reverts commit 7e82d25f40.
2023-08-17 17:23:16 +03:00
Georgi Gerganov
2ddd9681d6 convert.py : update to support GGUF output 2023-08-17 17:22:43 +03:00
Georgi Gerganov
e0429d38e4 convert-new.py : output gguf (#2635)
* convert-new.py : output gguf (WIP)

* convert-new.py : add gguf key-value pairs

* llama : add hparams.ctx_train + no longer print ftype

* convert-new.py : minor fixes

* convert-new.py : vocab-only option should work now

* llama : fix tokenizer to use llama_char_to_byte

* tests : add new ggml-vocab-llama.gguf

* convert-new.py : tensor name mapping

* convert-new.py : add map for skipping tensor serialization

* convert-new.py : convert script now works

* gguf.py : pick some of the refactoring from #2644

* convert-new.py : minor fixes
2023-08-17 17:19:52 +03:00
M. Yusuf Sarıgöz
5f97a48fc1 gguf : single pass for writing tensors + refactoring writer 2023-08-17 16:57:50 +03:00
M. Yusuf Sarıgöz
dce07c3121 gguf : single pass for writing tensors + refactoring writer 2023-08-17 16:48:49 +03:00
Kerfuffle
8dae7ce684 Add --cfg-negative-prompt-file option for examples (#2591)
master-8dae7ce
2023-08-17 07:29:44 -06:00
klosax
d6fd53afd6 llama.cpp : use ggml_elements() 2023-08-17 15:24:35 +02:00
klosax
5a0a2c5685 llama.cpp : print actual model size 2023-08-17 15:18:16 +02:00
M. Yusuf Sarıgöz
f31e9230ad gguf : single pass for writing tensors + refactoring writer 2023-08-17 15:19:30 +03:00
Georgi Gerganov
a73ccf1aa3 llama : replace (permute + reshape + view_1d) with (view_3d) (#2538)
ggml-ci
master-a73ccf1
2023-08-17 10:47:09 +03:00
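
For reference, a sketch of the simplification in a73ccf1: one strided 3D view over the K cache replaces the old view_1d + reshape + permute chain. Variable names (kv_k, n_embd_gqa, il, ...) are illustrative assumptions, not the literal llama.cpp code:

    #include "ggml.h"

    static struct ggml_tensor * k_view(
            struct ggml_context * ctx0, struct ggml_tensor * kv_k,
            int n_embd_head, int n_head_kv, int n_embd_gqa,
            int n_past, int N, int n_ctx, int il) {
        const size_t es = ggml_element_size(kv_k);
        return ggml_view_3d(ctx0, kv_k,
            n_embd_head, n_past + N, n_head_kv,  // ne0, ne1, ne2
            es*n_embd_gqa,                       // nb1: stride between token positions
            es*n_embd_head,                      // nb2: stride between heads
            es*n_embd_gqa*n_ctx*il);             // byte offset of layer il
    }
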
drbh
7cf54e1f74 tests : adds simple llama grammar tests (#2618)
* adds simple llama grammar tests

* fix lint and add Makefile

* 0-terminate code_points

* avoid dangling pointers in candidate cleanup

* cleanup grammar at end of test
master-7cf54e1
2023-08-17 10:41:01 +03:00
Shouzheng Liu
a872a2b28e ggml-alloc : fix discrepancy between measure & eval (#2639)
The GGML memory allocator consistently places a tensor within the
optimal-fit memory block, which is the smallest block capable of
accommodating the tensor's size. During the measurement phase, the final
block is generously sized, ensuring it never qualifies as the
optimal-fit block as long as there exists another block capable of
accommodating the tensor. Nevertheless, in the evaluation phase, the
last block is constrained in size and could potentially qualify as the
optimal-fit block. Consequently, there exists the possibility of a
tensor being allocated to a different region during evaluation, leading
to more memory fragmentation in our scratch buffer.

This recent commit guarantees uniform behavior of the allocator across
both the measurement and evaluation phases, eliminating discrepancies
between the two.
master-a872a2b
2023-08-17 10:35:53 +03:00
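
A sketch of the uniform rule described above: search every free block except the last for the smallest one that fits, and fall back to the last block only when nothing else can hold the tensor, so the choice is identical whether that last block is huge (measure phase) or tight (eval phase). The struct layout here is an illustrative assumption:

    #include <stddef.h>
    #include <stdint.h>

    struct free_block { size_t size; /* offset, ... */ };

    // returns the index of the chosen free block, or -1 if nothing fits
    static int find_best_fit(const struct free_block * blocks, int n, size_t size) {
        int    best      = -1;
        size_t best_size = SIZE_MAX;
        // best-fit search over every block except the last
        for (int i = 0; i < n - 1; i++) {
            if (blocks[i].size >= size && blocks[i].size < best_size) {
                best      = i;
                best_size = blocks[i].size;
            }
        }
        // the last block is only a fallback, in measure and eval phases alike
        if (best == -1 && n > 0 && blocks[n - 1].size >= size) {
            best = n - 1;
        }
        return best;
    }
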
M. Yusuf Sarıgöz
42f8fe1927 examples/gguf : no need to keep q option for quantization any more 2023-08-17 08:56:42 +03:00
Kolen Cheung
0919a0f73d cmake : install ggml-meta.metal if LLAMA_METAL (#2449) master-0919a0f 2023-08-16 23:09:49 +03:00
Jhen-Jie Hong
ed53db86c3 metal : print error of load pipeline state (#2564)
* metal : print error of load pipeline state

* metal : return null if load pipeline failed
2023-08-16 23:09:03 +03:00
Shouzheng Liu
fc8ef549e5 metal : enable ggml-alloc (#2627)
* metal: enable ggml-alloc

Make ggml-alloc work with concurrent dispatch.

* style-fix

Co-authored-by: slaren <slarengh@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
master-fc8ef54
2023-08-16 23:08:28 +03:00
Shouzheng Liu
bf83bff674 metal : matrix-matrix multiplication kernel (#2615)
* metal: matrix-matrix multiplication kernel

This commit removes MPS and uses custom matrix-matrix multiplication
kernels for all quantization types. This commit also adds grouped-query
attention to support llama2 70B.

* metal: fix performance degradation from gqa

Integers are slow on the GPU, and 64-bit divides are extremely slow.
In the context of GQA, we introduce a 64-bit divide that cannot be
optimized out by the compiler, which results in a decrease of ~8% in
inference performance. This commit fixes that issue by calculating a
part of the offset with a 32-bit divide. Naturally, this limits the
size of a single matrix to ~4GB. However, this limitation should
suffice for the near future.

* metal: fix bugs for GQA and perplexity test.

I mixed up ne02 and nb02 in the previous commit.
master-bf83bff
2023-08-16 23:07:04 +03:00
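
The gist of the GQA divide fix above, as a hedged C sketch (the real change lives in the Metal kernel; names here are illustrative): map a query head to its KV head with a 32-bit divide and widen only for the final byte offset. Keeping the hot divide in 32 bits is what caps a single matrix at ~4 GB.

    #include <stdint.h>

    // bytes into the KV tensor for the head that query head iq reads from
    static uint64_t kv_byte_offset(uint32_t iq, uint32_t n_head, uint32_t n_head_kv,
                                   uint64_t nb_head /* bytes per KV head */) {
        const uint32_t ratio = n_head / n_head_kv;  // query heads per KV head
        const uint32_t ikv   = iq / ratio;          // 32-bit divide: fast on the GPU
        return (uint64_t) ikv * nb_head;            // widen once; no 64-bit divide
    }
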
Georgi Gerganov
5ec18934ad convert-new.py : pick #2427 for HF 70B support 2023-08-16 20:16:15 +03:00
Georgi Gerganov
c8ee87f141 gguf.py : merge all files in gguf.py 2023-08-16 19:55:49 +03:00
Georgi Gerganov
88b5769487 gguf : deduplicate (#2629)
* gguf : better type names

* dedup : CPU + Metal is working

* ggml : fix warnings about unused results

* llama.cpp : fix line feed and compiler warning

* llama : fix strncpy warning + note token_to_str does not write null

* llama : restore the original load/save session implementation

Will migrate this to GGUF in the future

* convert-llama-h5-to-gguf.py : support alt ctx param name

* ggml : assert when using ggml_mul with non-F32 src1

* examples : dedup simple

---------

Co-authored-by: klosax <131523366+klosax@users.noreply.github.com>
2023-08-16 19:25:29 +03:00
Georgi Gerganov
758ff1bbb5 llama : refactor model loading code (#2620)
* llama : style formatting + remove helper methods

* llama : fix quantization using gguf tool

* llama : simplify gguf_file_saver

* llama : fix method names

* llama : simplify write_header()

* llama : no need to pass full file loader to the file saver

just gguf_ctx

* llama : gguf_file_saver write I32

* llama : refactor tensor names (#2622)

* gguf: update tensor names searched in quantization

* gguf : define tensor names as constants

* gguf : initial write API (not tested yet)

* gguf : write to file API (not tested)

* gguf : initial write API ready + example

* gguf : fix header write

* gguf : fixes + simplify example + add ggml_nbytes_pad()

* gguf : minor

* llama : replace gguf_file_saver with new gguf write API

* gguf : streaming support when writing files

* gguf : remove obsolete write methods

* gguf : remove obsolete gguf_get_arr_xxx API

* llama : simplify gguf_file_loader

* llama : move hparams and vocab from gguf_file_loader to llama_model_loader

* llama : merge gguf-util.h in llama.cpp

* llama : reorder definitions in .cpp to match .h

* llama : minor simplifications

* llama : refactor llama_model_loader (WIP)

wip : remove ggml_ctx from llama_model_loader

wip : merge gguf_file_loader in llama_model_loader

* llama : fix shape prints

* llama : fix Windows build + fix norm_rms_eps key

* llama : throw error on missing KV pairs in model metadata

* llama : improve printing + log metadata

* llama : switch print order of metadata

---------

Co-authored-by: M. Yusuf Sarıgöz <yusufsarigoz@gmail.com>
2023-08-16 14:34:03 +03:00
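
Several bullets in the commit above reference the new gguf write API ("initial write API ready + example", "write to file API", streaming support). A minimal sketch of how that C API is used, with illustrative key values and file name:

    #include "ggml.h"

    int main(void) {
        struct gguf_context * ctx = gguf_init_empty();

        // key-value metadata
        gguf_set_val_str(ctx, "general.architecture", "llama");
        gguf_set_val_u32(ctx, "llama.context_length", 4096);

        // tensors come from an existing ggml context (omitted here):
        // gguf_add_tensor(ctx, some_tensor);

        // single-pass write; `false` means tensor data is written too, not just metadata
        gguf_write_to_file(ctx, "model.gguf", false);

        gguf_free(ctx);
        return 0;
    }
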
klosax
ea5615a03a convert-llama-h5-to-gguf.py : clarify the reverse permute 2023-08-16 11:23:15 +02:00
klosax
4a1741aa2d gptneox-main.cpp : add tensor data layout 2023-08-15 19:56:19 +02:00
klosax
2ae0e985b3 convert-llama-7b-pth-to-gguf.py : add tensor data layout 2023-08-15 19:55:13 +02:00
klosax
66756c82af convert-llama-h5-to-gguf.py : add tensor data layout 2023-08-15 19:54:33 +02:00
klosax
b6056c3db8 gguf.py : add tensor data layout 2023-08-15 19:53:44 +02:00