Commit Graph

319 Commits

Georgi Gerganov
f39e6075cf llama : add llm_build_kqv helper
ggml-ci
2023-10-29 22:45:03 +02:00
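
For context, a minimal sketch of the attention product that a helper like llm_build_kqv wraps, built from standard ggml ops. The function name, argument layout, and the omission of scaling and masking are simplifications for illustration, not the actual helper:

```cpp
#include "ggml.h"

// Simplified sketch (not the real llm_build_kqv): q, k, v are assumed to already be
// permuted so that ggml_mul_mat produces per-head attention scores; the real helper
// also scales by 1/sqrt(head_dim) and applies the KQ mask before the softmax.
static ggml_tensor * build_kqv_sketch(ggml_context * ctx,
        ggml_tensor * q, ggml_tensor * k, ggml_tensor * v) {
    ggml_tensor * kq  = ggml_mul_mat(ctx, k, q);  // raw attention scores
    kq                = ggml_soft_max(ctx, kq);   // attention weights over KV positions
    ggml_tensor * kqv = ggml_mul_mat(ctx, v, kq); // weighted sum of values
    return kqv;
}
```
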
Georgi Gerganov
c9121fdd0f llama : remove obsolete comments in build graphs 2023-10-29 21:44:19 +02:00
Georgi Gerganov
a104abea48 llama : simplify falcon Q, K, V computation 2023-10-29 21:24:25 +02:00
Georgi Gerganov
31a12f3d03 llama : fix llm_build_k_shift to use n_head_kv instead of n_head 2023-10-29 21:17:46 +02:00
Georgi Gerganov
5990861938 llama : remove obsolete offload names 2023-10-29 21:11:20 +02:00
Georgi Gerganov
3e0462594b llama : add llm_build_kv_store helper
ggml-ci
2023-10-29 21:09:34 +02:00
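
A hedged sketch of what a KV-store helper of this kind does: copy the K (and, analogously, V) computed for the current tokens into the right slice of the cache tensor. Names such as n_embd_gqa and kv_head are illustrative, not the exact signature:

```cpp
#include "ggml.h"

// Illustrative only: write the current tokens' K into the cache starting at kv_head.
static void kv_store_k_sketch(ggml_context * ctx, ggml_cgraph * graph,
        ggml_tensor * k_cache, ggml_tensor * k_cur,
        int64_t n_embd_gqa, int64_t n_tokens, int64_t kv_head) {
    // 1D view over the region of the cache that belongs to the new tokens
    ggml_tensor * k_view = ggml_view_1d(ctx, k_cache, n_tokens*n_embd_gqa,
            kv_head*n_embd_gqa*ggml_element_size(k_cache));
    // schedule the copy as part of the graph
    ggml_build_forward_expand(graph, ggml_cpy(ctx, k_cur, k_view));
}
```
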
Georgi Gerganov
909d64471b llama : fix offloading after recent changes 2023-10-29 20:38:49 +02:00
Georgi Gerganov
38728a0be0 llama : add llm_build_k_shift helper
ggml-ci
2023-10-29 19:23:07 +02:00
Georgi Gerganov
dbf836bb64 llama : add llm_build_ffn helper function (#3849)
ggml-ci
2023-10-29 18:47:46 +02:00
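
As an illustration of the kind of per-model code such a helper factors out, here is a hedged sketch of the gated (SwiGLU-style) feed-forward block most llama.cpp models share. The function name and parameter list are assumptions; only the ggml ops are real:

```cpp
#include "ggml.h"

// Illustrative gated FFN: down( silu(gate(x)) * up(x) )
static ggml_tensor * build_ffn_sketch(ggml_context * ctx, ggml_tensor * cur,
        ggml_tensor * w_up, ggml_tensor * w_gate, ggml_tensor * w_down) {
    ggml_tensor * up   = ggml_mul_mat(ctx, w_up,   cur); // [n_ff, n_tokens]
    ggml_tensor * gate = ggml_mul_mat(ctx, w_gate, cur); // [n_ff, n_tokens]
    gate = ggml_silu(ctx, gate);                          // SiLU activation on the gate
    cur  = ggml_mul(ctx, gate, up);                       // element-wise gating
    return ggml_mul_mat(ctx, w_down, cur);                // back to [n_embd, n_tokens]
}
```
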
Georgi Gerganov
7db9c96d8a llama : add llm_build_norm helper function
ggml-ci
2023-10-29 15:48:48 +02:00
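
A hedged sketch of the pattern a norm-building helper wraps, shown for the RMSNorm case used by LLaMA-style models; the name and signature here are assumptions:

```cpp
#include "ggml.h"

// Illustrative RMSNorm: normalize, then apply the per-channel weight.
static ggml_tensor * build_rms_norm_sketch(ggml_context * ctx,
        ggml_tensor * cur, ggml_tensor * norm_w, float eps) {
    cur = ggml_rms_norm(ctx, cur, eps); // RMS normalization
    return ggml_mul(ctx, cur, norm_w);  // element-wise scale by the learned weight
}
```
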
Georgi Gerganov
210e6e5d02 llama : remove obsolete map for layer counting 2023-10-29 13:39:04 +02:00
Georgi Gerganov
79ad734417 llama : comment
ggml-ci
2023-10-29 13:27:53 +02:00
Georgi Gerganov
761087932b llama : add functional header 2023-10-29 13:26:32 +02:00
Georgi Gerganov
8925cf9ef8 llama : add layer index to all tensor names 2023-10-29 13:22:15 +02:00
Georgi Gerganov
1e9c5443c2 llama : refactor tensor offloading as callback 2023-10-29 13:05:10 +02:00
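
The idea behind the callback refactor, as a hedged sketch: graph-build code only reports each tensor it creates (with a name and layer index) through a callback, and the caller decides in one place how to name it and whether to offload it. The typedef and helper below are illustrative, not the actual code:

```cpp
#include <functional>
#include "ggml.h"

// Illustrative callback type: (tensor, name, layer index)
using build_callback = std::function<void(ggml_tensor * cur, const char * name, int il)>;

static ggml_tensor * build_step_sketch(ggml_context * ctx, ggml_tensor * cur,
        const build_callback & cb, int il) {
    cur = ggml_relu(ctx, cur);  // stand-in for a real graph op
    cb(cur, "ffn_relu", il);    // caller names the tensor and can mark it for offload
    return cur;
}
```
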
Georgi Gerganov
da936188d8 llama : move refact in correct place + optimize graph input 2023-10-29 11:48:58 +02:00
Georgi Gerganov
739b85c985 llama : try to fix build 2023-10-29 11:25:32 +02:00
Georgi Gerganov
25cfbf6776 llama : fix non-CUDA build 2023-10-29 11:12:03 +02:00
Georgi Gerganov
b4ad03b3a7 llama : try to optimize offloading code 2023-10-29 10:33:11 +02:00
Georgi Gerganov
79617902ea llama : fix res_norm offloading 2023-10-29 09:20:35 +02:00
Georgi Gerganov
e14aa46151 llama : do tensor offload only with CUDA 2023-10-29 08:03:46 +02:00
Georgi Gerganov
0dc05b8433 llama : factor graph input into a function 2023-10-29 07:52:43 +02:00
Georgi Gerganov
4e98897ede llama : support offloading result_norm + comments 2023-10-29 07:36:07 +02:00
Georgi Gerganov
51c4f9ee9f llama : comments 2023-10-28 22:50:08 +03:00
Georgi Gerganov
3af8771389 llama : update offload log messages to print node index 2023-10-28 22:36:44 +03:00
Georgi Gerganov
83d2c43791 llama : offload rest of the models
ggml-ci
2023-10-28 22:30:54 +03:00
Georgi Gerganov
38aca9e1ab llama : factor out tensor offloading outside the build call (wip)
ggml-ci
2023-10-28 21:22:31 +03:00
Georgi Gerganov
8b2420d249 llama : factor out ggml-alloc from graph build functions
ggml-ci
2023-10-28 19:54:28 +03:00
Kerfuffle
bd6d9e2059 llama : allow quantizing k-quants to fall back when tensor size incompatible (#3747)
* Allow quantizing k-quants to fall back when tensor size incompatible

* quantizing: Add warning when tensors were incompatible with k-quants

Clean up k-quants state passing a bit
2023-10-28 14:54:24 +03:00
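
The underlying constraint: k-quants pack weights into super-blocks of QK_K = 256 values, so rows whose size is not a multiple of 256 cannot use them. A hedged sketch of the fallback decision (the function and the specific fallback type are illustrative, not the exact code from #3747):

```cpp
#include <cstdio>
#include "ggml.h"

// Illustrative only: fall back to a non-k-quant type when the row size is incompatible.
static ggml_type pick_quant_type_sketch(ggml_type wanted, int64_t n_per_row) {
    const bool is_k_quant =
        wanted == GGML_TYPE_Q2_K || wanted == GGML_TYPE_Q3_K || wanted == GGML_TYPE_Q4_K ||
        wanted == GGML_TYPE_Q5_K || wanted == GGML_TYPE_Q6_K;
    if (is_k_quant && n_per_row % 256 != 0) {
        fprintf(stderr, "warning: row size %lld not divisible by 256, falling back to Q8_0\n",
                (long long) n_per_row);
        return GGML_TYPE_Q8_0; // assumed fallback; the real choice depends on the target type
    }
    return wanted;
}
```
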
Georgi Gerganov
fdee152e4e starcoder : add GPU offloading (#3827)
* starcoder : do not GPU split 1D bias tensors

* starcoder : offload layers to GPU

ggml-ci
2023-10-28 12:06:08 +03:00
cebtenzzre
6d459cbfbe llama : correctly report GGUFv3 format (#3818) 2023-10-27 17:33:53 -04:00
Georgi Gerganov
2f9ec7e271 cuda : improve text-generation and batched decoding performance (#3776)
* cuda : prints wip

* cuda : new cublas gemm branch for multi-batch quantized src0

* cuda : add F32 sgemm branch

* cuda : fine-tune >= VOLTA params + use MMQ only for small batches

* cuda : remove duplicated cuBLAS GEMM code

* cuda : add CUDA_USE_TENSOR_CORES and GGML_CUDA_FORCE_MMQ macros

* build : add compile option to force use of MMQ kernels
2023-10-27 17:01:23 +03:00
Marcus Dunn
5be6c803fa llama : remove token functions with context args in favor of model (#3720)
* added `llama_model_token_*` variants to all the `llama_token_*` functions.

* added `LLAMA_API`

* formatting

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* removed old `llama_token` functions

* changed 3 more functions to take in model

- `llama_token_get_text`
- `llama_token_get_score`
- `llama_token_get_type`

* added back docs

* fixed main.cpp

* changed token functions to use new model variants

* changed token functions to use new model variants

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-10-23 22:40:03 +03:00
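
A hedged usage sketch of the change in #3720: the llama_token_* accessors now take the llama_model rather than a llama_context. The wrapper function below is illustrative; the llama.h calls are the ones named in the commit plus the standard bos/eos accessors:

```cpp
#include <cstdio>
#include "llama.h"

// Illustrative: query token metadata directly from the model.
static void print_token_info(const llama_model * model, llama_token id) {
    const char * text  = llama_token_get_text (model, id);
    const float  score = llama_token_get_score(model, id);
    printf("token %d: '%s' (score %.3f), bos=%d, eos=%d\n",
           id, text, score, llama_token_bos(model), llama_token_eos(model));
}
```
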
goerch
9e70cc0322 Add test for MPT tokenization (#3728)
* Add test for MPT tokenization

* Revert code motion

* Remove unnecessary restriction in test case

* Clarify logic in conversion
2023-10-22 21:21:42 +02:00
Kerfuffle
a5e7dbd614 llama : validate special token ids are in range when loading GGUF model (#3635)
* Add validation for special token ids to llama.cpp

Small optimization for llama_byte_to_token SPM mode

* Fix BPE newline check, only I could break something so simple

* Killll meeeeee

* Account for GGUF_GET_KEY only setting the value when the key exists

* Minor code cleanups.

* Fix convert.py error msg when added tokens are out of range

* Make gguf SpecialVocab vocab size-aware

Update conversion scripts accordingly

* Avoid a string copy

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-10-22 21:14:56 +03:00
Georgi Gerganov
d1031cf49c sampling : refactor init to use llama_sampling_params (#3696)
* sampling : refactor init to use llama_sampling_params

* llama : combine repetition, frequency and presence penalties in 1 call

* examples : remove embd-input and gptneox-wip

* sampling : rename penalty params + reduce size of "prev" vector

* sampling : add llama_sampling_print helper

* sampling : hide prev behind API and apply #3661

ggml-ci
2023-10-20 21:07:23 +03:00
Herman Semenov
f439e506e8 ggml : fix rope + llama minor optimizations (#3560)
* Minor fixes and fixed memleak

* Using const auto references in range-based loop C++17
2023-10-20 13:02:12 +03:00
Georgi Gerganov
0e89203b51 speculative : add tree-based sampling example (#3624)
* sampling : one sequence per sampling context

ggml-ci

* speculative : add tree-based sampling support

ggml-ci

* speculative : reuse the n_parallel CLI param

* speculative : refactor sampling

* examples : fix build after sampling refactoring

ggml-ci

* batched : fix n_seq_id

* sampling : fix malloc

ggml-ci

* swift : fix build

ggml-ci

* swift : try to fix build

ggml-ci

* prompts : add assistant.txt

* common : add llama_batch_add() and llama_batch_clear() helpers

* speculative : minor refactor

ggml-ci

* minor : comments + rename

ggml-ci

* speculative : fix off-by-one for n_drafted

* speculative : fix the n_drafted fix + p constants
2023-10-18 16:21:57 +03:00
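
A hedged sketch of the llama_batch_add() / llama_batch_clear() helpers mentioned above (they live in common, not in the core llama.h API); the surrounding function and variable names are assumptions:

```cpp
#include <vector>
#include "common.h"

// Illustrative: fill a batch with a prompt on sequence 0, requesting logits
// only for the last token (batch is assumed to come from llama_batch_init()).
static void fill_prompt_batch(llama_batch & batch, const std::vector<llama_token> & prompt) {
    llama_batch_clear(batch);
    for (size_t i = 0; i < prompt.size(); ++i) {
        // token id, position, sequence ids, compute logits?
        llama_batch_add(batch, prompt[i], (llama_pos) i, { 0 }, false);
    }
    if (batch.n_tokens > 0) {
        batch.logits[batch.n_tokens - 1] = true; // only the last token needs logits
    }
}
```
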
slaren
cb33f43a2a fix embeddings when using CUDA (#3657) 2023-10-17 22:24:50 +02:00
Georgi Gerganov
e1675d133c llama : avoid fprintf in favor of LLAMA_LOG (#3538) 2023-10-17 22:34:26 +03:00
staviq
1a159553f9 tokenizer : special token handling (#3538)
* Rewrite special token handling from #1931

* shorten param name, add st verification by type

* use offsets instead of copy by substr

* formatting, remove copying iterator on delete

* llama : normalize code-style

* swift fix

* print pfx/sfx if verb, main: split pfx input sfx

* dont add space when using special tokens

* minor : comment + spacing

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-10-17 18:11:01 +03:00
cebtenzzre
11bff29045 MPT : support GQA for replit-code-v1.5 (#3627) 2023-10-15 09:32:06 +03:00
Daniel Bevenius
2a4bcbacea llama : remove n_threads from llama_decode_internal (#3614)
This commit removes `n_threads` from the `llama_decode_internal`
functions doc comment as it does not exist anymore.

It looks like this parameter was removed in
Commit 16bc66d947 ("llama.cpp : split
llama_context_params into model and context params").

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2023-10-13 13:33:16 +03:00
goerch
233fc1c69f Minor improvements in GPT2 tokenizer (#3567)
* Fixing minor bugs in bpe_gpt2_preprocess

* Don't add bos token in test
2023-10-10 18:59:52 +02:00
Xingchen Song(宋星辰)
02d2875def llm : add bloom models (#3553)
* feat: Support bloom models

* fix(bloom): fix model size

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-10-10 17:48:21 +03:00
Jan Ploski
f5f9121de1 llm : add MPT support (#3417)
* CUDA: added support for ggml_clamp (see also: https://github.com/ggerganov/ggml/issues/545)

* mpt : added an implementation based (mostly) on falcon integration, modified with deltas from ggml/examples/mpt

* mpt : protect against "clip_qkv": null in mpt-7b

* mpt : quick fix to avoid "Strange model" warning when quantizing MPT models

* mpt : addendum to changeset:84e30e8 - leave parameter clamp_kqv out from metadata rather than use 0.0 to indicate "no clamping" (more compliant with the current GGUF spec?)

* mpt : standardized all tensor names to follow GGUF spec

* mpt : addendum to changeset:1be89c40 - use "req" parameter of GGUF_GET_KEY macro instead of duplicate code

* mpt : fixed comment s/gptneox/mpt/

* mpt : remove tabs, trailing whitespace

* mpt : removed ne01 + n_past == ne00 assertion from alibi (cuda/f32) and rope_shift from build_mpt

* mpt : updated convert-mpt-hf-to-gguf.py to reflect changes made to convert-gptneox-hf-to-gguf.py in pr:3252

* comment out n_past instead of marking it unused

* mpt : removed hardcoded +178 from convert script in favor of utilizing hparams["vocab_size"]

* mpt : remove unused tokenizer_json in convert script

* ggml : remove obsolete n_past assert in ggml_alibi

* llama : print clamp_kqv and max_alibi_bias hparams

---------

Co-authored-by: Cebtenzzre <cebtenzzre@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-10-10 10:50:23 +03:00
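
For the clip_qkv handling mentioned in the MPT commit, a hedged sketch of the idea: when the model defines a clamp value, the fused QKV projection is clamped before it is split into Q, K and V. Only ggml_clamp is taken from ggml; the wrapper is illustrative:

```cpp
#include "ggml.h"

// Illustrative: clamp the fused QKV activations to [-clamp_kqv, clamp_kqv] when set.
static ggml_tensor * clamp_qkv_sketch(ggml_context * ctx, ggml_tensor * qkv, float clamp_kqv) {
    if (clamp_kqv > 0.0f) {
        qkv = ggml_clamp(ctx, qkv, -clamp_kqv, clamp_kqv);
    }
    return qkv;
}
```
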
Georgi Gerganov
fcca0a7004 refact : fix convert script + zero out KV cache to avoid nans (#3523)
* refact : fix convert script + zero out KV cache to avoid nans

* ggml : silu(-inf) should never happen

* metal : assert various kernel requirements
2023-10-09 14:32:17 +03:00
Georgi Gerganov
db3abcc114 sync : ggml (ggml-backend) (#3548)
* sync : ggml (ggml-backend)

ggml-ci

* zig : add ggml-backend to the build
2023-10-08 20:19:14 +03:00
Kerfuffle
63d3b06a43 llama : fix missing break in Persimmon arch case statements (#3535) 2023-10-08 08:22:17 +03:00
cebtenzzre
f1782c68de quantize : fail fast on write errors (#3521) 2023-10-07 11:41:52 +03:00