Commit Graph

1471 Commits

Georgi Gerganov
792d1a1b16 llama : minor 2023-10-30 11:34:47 +02:00
Georgi Gerganov
f39e6075cf llama : add llm_build_kqv helper
ggml-ci
2023-10-29 22:45:03 +02:00
Georgi Gerganov
c9121fdd0f llama : remove obsolete comments in build graphs 2023-10-29 21:44:19 +02:00
Georgi Gerganov
a104abea48 llama : simplify falcon Q, K, V computation 2023-10-29 21:24:25 +02:00
Georgi Gerganov
31a12f3d03 llama : fix llm_build_k_shift to use n_head_kv instead of n_head 2023-10-29 21:17:46 +02:00
Georgi Gerganov
5990861938 llama : remove obsolete offload names 2023-10-29 21:11:20 +02:00
Georgi Gerganov
3e0462594b llama : add llm_build_kv_store helper
ggml-ci
2023-10-29 21:09:34 +02:00
Georgi Gerganov
909d64471b llama : fix offloading after recent changes 2023-10-29 20:38:49 +02:00
Georgi Gerganov
38728a0be0 llama : add llm_build_k_shift helper
ggml-ci
2023-10-29 19:23:07 +02:00
Georgi Gerganov
dbf836bb64 llama : add llm_build_ffn helper function (#3849)
ggml-ci
2023-10-29 18:47:46 +02:00
Georgi Gerganov
7db9c96d8a llama : add llm_build_norm helper function
ggml-ci
2023-10-29 15:48:48 +02:00
Georgi Gerganov
210e6e5d02 llama : remove obsolete map for layer counting 2023-10-29 13:39:04 +02:00
Georgi Gerganov
79ad734417 llama : comment
ggml-ci
2023-10-29 13:27:53 +02:00
Georgi Gerganov
761087932b llama : add functional header 2023-10-29 13:26:32 +02:00
Georgi Gerganov
8925cf9ef8 llama : add layer index to all tensor names 2023-10-29 13:22:15 +02:00
Georgi Gerganov
1e9c5443c2 llama : refactor tensor offloading as callback 2023-10-29 13:05:10 +02:00
Georgi Gerganov
da936188d8 llama : move refact in correct place + optimize graph input 2023-10-29 11:48:58 +02:00
Georgi Gerganov
739b85c985 llama : try to fix build 2023-10-29 11:25:32 +02:00
Georgi Gerganov
25cfbf6776 llama : fix non-CUDA build 2023-10-29 11:12:03 +02:00
Georgi Gerganov
b4ad03b3a7 llama : try to optimize offloading code 2023-10-29 10:33:11 +02:00
Georgi Gerganov
79617902ea llama : fix res_norm offloading 2023-10-29 09:20:35 +02:00
Georgi Gerganov
e14aa46151 llama : do tensor offload only with CUDA 2023-10-29 08:03:46 +02:00
Georgi Gerganov
0dc05b8433 llama : factor graph input into a function 2023-10-29 07:52:43 +02:00
Georgi Gerganov
4e98897ede llama : support offloading result_norm + comments 2023-10-29 07:36:07 +02:00
Georgi Gerganov
51c4f9ee9f llama : comments 2023-10-28 22:50:08 +03:00
Georgi Gerganov
3af8771389 llama : update offload log messages to print node index 2023-10-28 22:36:44 +03:00
Georgi Gerganov
83d2c43791 llama : offload rest of the models
ggml-ci
2023-10-28 22:30:54 +03:00
Georgi Gerganov
38aca9e1ab llama : factor out tensor offloading outside the build call (wip)
ggml-ci
2023-10-28 21:22:31 +03:00
Georgi Gerganov
5946d98fc8 metal : disable kernel load log 2023-10-28 21:22:01 +03:00
Georgi Gerganov
8b2420d249 llama : factor out ggml-alloc from graph build functions
ggml-ci
2023-10-28 19:54:28 +03:00
Erik Scholz
ff3bad83e2 flake : update flake.lock for newer transformers version + provide extra dev shell (#3797)
* flake : update flake.lock for newer transformers version + provide extra dev shell with torch and transformers (for most convert-xxx.py scripts)
2023-10-28 16:41:07 +02:00
Aarni Koskela
82a6646e02 metal : try cwd for ggml-metal.metal if bundle lookup fails (#3793)
* Try cwd for ggml-metal if bundle lookup fails

When building with `-DBUILD_SHARED_LIBS=ON -DLLAMA_METAL=ON -DLLAMA_BUILD_SERVER=ON`,
`server` would fail to load `ggml-metal.metal` because `[bundle pathForResource:...]`
returns `nil`.  In that case, fall back to `ggml-metal.metal` in the cwd instead of
passing `null` as a path.

Follows up on #1782

* Update ggml-metal.m

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b1440
2023-10-28 15:43:01 +03:00
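
The fix described above is a simple path fallback: when the bundle lookup for `ggml-metal.metal` returns `nil` (as happens for the shared-library `server` build), the loader tries the file in the current working directory instead of passing a null path on to Metal. A rough sketch of that logic, written in C++ rather than the Objective-C of `ggml-metal.m`; `lookup_in_bundle` is a hypothetical stand-in for `[bundle pathForResource:...]`:

```c++
#include <string>

// Hypothetical stand-in for [bundle pathForResource:@"ggml-metal" ofType:@"metal"];
// modeled here as returning an empty string when the resource is not bundled.
static std::string lookup_in_bundle(const char * /*name*/, const char * /*ext*/) {
    return {};
}

static std::string metal_library_path() {
    std::string path = lookup_in_bundle("ggml-metal", "metal");
    if (path.empty()) {
        // bundle lookup failed: fall back to ggml-metal.metal in the current
        // working directory instead of handing Metal a null path
        path = "ggml-metal.metal";
    }
    return path;
}
```
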
Georgi Gerganov
ba231e8a6d issues : change label from bug to bug-unconfirmed (#3748) 2023-10-28 15:35:26 +03:00
Georgi Gerganov
8a2f2fea29 convert : ignore tokens if their IDs are within [0, vocab_size) (#3831) 2023-10-28 06:25:15 -06:00
Kerfuffle
bd6d9e2059 llama : allow quantizing k-quants to fall back when tensor size incompatible (#3747)
* Allow quantizing k-quants to fall back when tensor size incompatible

* quantizing: Add warning when tensors were incompatible with k-quants

Clean up k-quants state passing a bit
b1437
2023-10-28 14:54:24 +03:00
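
The fallback above exists because k-quants pack weights into super-blocks of QK_K = 256 values (in the default ggml build), so a tensor whose row length is not a multiple of 256 cannot use a k-quant format at all. A minimal sketch of that size check, with illustrative names (`QuantType`, `pick_quant_type`) and Q8_0 chosen as the fallback purely for the example, not necessarily what the commit picks:

```c++
#include <cstdint>
#include <cstdio>

// Illustrative stand-ins for the ggml types involved.
enum class QuantType { Q4_K, Q8_0 };
constexpr int64_t QK_K = 256; // k-quant super-block size

// Keep the requested k-quant only if each row is a whole number of
// super-blocks; otherwise fall back and warn, as the commit above describes.
static QuantType pick_quant_type(QuantType requested, int64_t n_per_row) {
    if (requested == QuantType::Q4_K && n_per_row % QK_K != 0) {
        std::fprintf(stderr,
            "warning: row size %lld is not a multiple of %lld, falling back to Q8_0\n",
            (long long) n_per_row, (long long) QK_K);
        return QuantType::Q8_0;
    }
    return requested;
}

int main() {
    pick_quant_type(QuantType::Q4_K, 1000); // 1000 % 256 != 0 -> fallback + warning
}
```
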
Georgi Gerganov
ee1a0ec9cb llama : add option for greedy sampling with probs (#3813)
* llama : add option for greedy sampling with probs

* llama : add comment about llama_sample_token_greedy() missing probs

* sampling : temp == 0.0 -> no probs, temp < 0.0 -> probs
b1436
2023-10-28 14:23:11 +03:00
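
The last bullet spells out the convention this change settles on: a temperature of exactly 0.0 means plain greedy sampling with no probabilities, while a negative temperature means the probabilities are computed first and the most likely token is still picked. A sketch of that branching against the llama.cpp sampling API of the time (`llama_sample_softmax`, `llama_sample_token_greedy`); it illustrates the convention, not the committed code:

```c++
#include "llama.h"

// temp == 0.0f -> greedy pick, probabilities left unset
// temp <  0.0f -> compute softmax probabilities, then greedy pick
// (llama_sample_token_greedy() by itself does not fill in probs, as the
//  commit comment above notes.)
static llama_token sample_greedy_maybe_probs(llama_context * ctx,
                                             llama_token_data_array * candidates,
                                             float temp) {
    if (temp < 0.0f) {
        llama_sample_softmax(ctx, candidates);
    }
    return llama_sample_token_greedy(ctx, candidates);
}
```
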
Henk Poley
177461104b common : print that one line of the syntax help *also* to standard output (#3823) b1435 2023-10-28 13:16:33 +03:00
Georgi Gerganov
fdee152e4e starcoder : add GPU offloading (#3827)
* starcoder : do not GPU split 1D bias tensors

* starcoder : offload layers to GPU

ggml-ci
b1434
2023-10-28 12:06:08 +03:00
Kerfuffle
41aee4df82 speculative : ensure draft and target model vocab matches (#3812)
* speculative: Ensure draft and target model vocab matches

* Tolerate small differences when checking dft vs tgt vocab
b1433
2023-10-28 00:40:07 +03:00
cebtenzzre
6d459cbfbe llama : correctly report GGUFv3 format (#3818) b1432 2023-10-27 17:33:53 -04:00
Thibault Terrasson
c8d6a1f34a simple : fix batch handling (#3803) b1431 2023-10-27 08:37:41 -06:00
Georgi Gerganov
2f9ec7e271 cuda : improve text-generation and batched decoding performance (#3776)
* cuda : prints wip

* cuda : new cublas gemm branch for multi-batch quantized src0

* cuda : add F32 sgemm branch

* cuda : fine-tune >= VOLTA params + use MMQ only for small batches

* cuda : remove duplicated cuBLAS GEMM code

* cuda : add CUDA_USE_TENSOR_CORES and GGML_CUDA_FORCE_MMQ macros

* build : add compile option to force use of MMQ kernels
b1430
2023-10-27 17:01:23 +03:00
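
Two of the macros above form a compile-time switch: defining GGML_CUDA_FORCE_MMQ keeps the custom quantized matmul (MMQ) kernels everywhere, otherwise the tensor-core cuBLAS path is enabled and MMQ is reserved for small batches, as the fine-tuning bullet says. A hedged sketch of how such a guard can look; the threshold and exact structure in ggml-cuda.cu are assumptions here:

```c++
// If MMQ is not forced at build time, allow the tensor-core (cuBLAS) path.
#if !defined(GGML_CUDA_FORCE_MMQ)
#define CUDA_USE_TENSOR_CORES
#endif

// Pick MMQ only for small batches when tensor cores are available;
// the batch-size threshold below is illustrative, not the tuned value.
static bool use_mmq_for_batch(int n_batch) {
#if defined(CUDA_USE_TENSOR_CORES)
    return n_batch < 32;
#else
    (void) n_batch;
    return true; // MMQ forced at compile time
#endif
}
```
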
Georgi Gerganov
34b2a5e1ee server : do not release slot on image input (#3798) b1429 2023-10-26 22:54:17 +03:00
Georgi Gerganov
6961c4bd0b batched-bench : print params at start b1428 2023-10-25 10:26:27 +03:00
Georgi Gerganov
cc44877486 log : disable pid in log filenames b1427 2023-10-25 10:09:16 +03:00
cebtenzzre
ad93962657 server : add parameter -tb N, --threads-batch N (#3584) (#3768)
Co-authored-by: Michael Coppola <m18coppola@gmail.com>
Co-authored-by: Michael Coppola <info@michaeljcoppola.com>
b1426
2023-10-24 23:10:43 +03:00
Georgi Gerganov
1717521cdb server : do not block system prompt update (#3767)
* server : do not block system prompt update

* server : update state machine logic to process system prompts

* server : minor
b1425
2023-10-24 23:08:20 +03:00
Georgi Gerganov
b2f7e04bd3 sync : ggml (conv ops + cuda MSVC fixes) (#3765)
ggml-ci
b1424
2023-10-24 21:51:20 +03:00
John Smith
abd21fc99f cmake : add missed dependencies (#3763) b1423 2023-10-24 20:48:45 +03:00
Georgi Gerganov
2b4ea35e56 cuda : add batched cuBLAS GEMM for faster attention (#3749)
* cmake : add helper for faster CUDA builds

* batched : add NGL arg

* ggml : skip nops in compute_forward

* cuda : minor indentation

* cuda : batched cuBLAS GEMMs for src0 F16 and src1 F32 (attention ops)

* Apply suggestions from code review

These changes plus:

```c++
#define cublasGemmBatchedEx hipblasGemmBatchedEx
```

are needed to compile with ROCm. I haven't done performance testing, but it seems to work.

I couldn't figure out how to propose a change for lines outside what the pull changed; also, this is the first time I've tried to create a multi-part review, so please forgive me if I mess something up.

* cuda : add ROCm / hipBLAS cublasGemmBatchedEx define

* cuda : add cublasGemmStridedBatchedEx for non-broadcasted cases

* cuda : reduce mallocs in cublasGemmBatchedEx branch

* cuda : add TODO for calling cublas from kernel + using mem pool

---------

Co-authored-by: Kerfuffle <44031344+KerfuffleV2@users.noreply.github.com>
b1422
2023-10-24 16:48:37 +03:00
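
The ROCm note in the review thread above amounts to aliasing the batched cuBLAS entry points onto their hipBLAS counterparts, which the follow-up bullets add for both the batched and the strided-batched case. A minimal sketch of such a shim, assuming the GGML_USE_HIPBLAS guard that llama.cpp's CUDA code uses to select the ROCm path:

```c++
// When building the CUDA backend for ROCm via hipBLAS, map the batched GEMM
// calls used by the new attention path onto their hip equivalents.
#if defined(GGML_USE_HIPBLAS)
#define cublasGemmBatchedEx        hipblasGemmBatchedEx
#define cublasGemmStridedBatchedEx hipblasGemmStridedBatchedEx
#endif
```
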