llama.cpp/flake.nix at c63bb1d16a70c03440671b76954bb767513cead8

mirror of https://github.com/ggml-org/llama.cpp.git synced 2025-11-01 09:01:57 +00:00

Shouzheng Liu bf83bff674 metal : matrix-matrix multiplication kernel (#2615)

* metal: matrix-matrix multiplication kernel

This commit removes MPS and uses custom matrix-matrix multiplication
kernels for all quantization types. This commit also adds grouped-query
attention to support llama2 70B.
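
As a rough illustration of grouped-query attention (not the code from this commit), the sketch below maps a query head to the key/value head it shares. The helper name `kv_head_for` and its parameters are hypothetical, and it assumes the number of query heads is an exact multiple of the number of key/value heads.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not part of llama.cpp: returns the index of the
// key/value head that a given query head reads from under GQA.
inline uint32_t kv_head_for(uint32_t q_head, uint32_t n_head, uint32_t n_head_kv) {
    // Assumption: query heads divide evenly into KV groups.
    assert(n_head_kv > 0 && n_head % n_head_kv == 0);
    assert(q_head < n_head);
    const uint32_t gqa = n_head / n_head_kv; // query heads per KV head
    return q_head / gqa;                     // consecutive query heads share one KV head
}
```

With 64 query heads and 8 key/value heads (the configuration usually reported for Llama 2 70B), query heads 0 through 7 all map to key/value head 0, heads 8 through 15 to head 1, and so on.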

* metal: fix performance degradation from gqa

Integer operations are slow on the GPU, and 64-bit divides are extremely
slow. In the context of GQA, we introduce a 64-bit divide that cannot be
optimized out by the compiler, which results in a decrease of ~8% in
inference performance. This commit fixes that issue by calculating a
part of the offset with a 32-bit divide. Naturally, this limits the
size of a single matrix to ~4GB. However, this limit should be
sufficient for the near future.
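
The idea can be sketched on the CPU as follows. This is an illustrative approximation, not the Metal kernel from this commit: the function names are hypothetical, `im` stands for a batch (head) index, `gqa` for the number of query heads per key/value head, and `nb02` for a byte stride between matrices. The 32-bit variant is only valid while the index being divided fits in 32 bits, which is the kind of size limit the commit message describes.

```cpp
#include <cassert>
#include <cstdint>

// Reference version: the divide is carried out in 64 bits.
inline uint64_t src_offset_div64(uint64_t im, uint64_t gqa, uint64_t nb02) {
    assert(gqa > 0);
    return (im / gqa) * nb02;
}

// Sketch of the cheaper variant: divide in 32 bits, then widen the
// quotient before multiplying by the 64-bit byte stride.
// Assumption: im and gqa both fit in 32 bits.
inline uint64_t src_offset_div32(uint32_t im, uint32_t gqa, uint64_t nb02) {
    assert(gqa > 0);
    const uint32_t group = im / gqa;             // 32-bit divide
    return static_cast<uint64_t>(group) * nb02;  // 64-bit multiply, no 64-bit divide
}
```

Both functions return the same offset whenever the inputs fit in 32 bits; only the width of the divide differs.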

* metal: fix bugs for GQA and perplexity test.

I mixed up ne02 and nb02 in the previous commit.
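
For readers unfamiliar with the naming, ggml tensors carry element counts per dimension (`ne`) and strides in bytes per dimension (`nb`), so swapping the two yields wrong offsets. The sketch below is a simplified stand-in, not the ggml struct itself: `Shape` and `contiguous_f32` are hypothetical names, and it assumes a contiguous, unquantized `float` tensor.

```cpp
#include <cstddef>
#include <cstdint>

// Simplified stand-in for a 4-D tensor descriptor (hypothetical type).
struct Shape {
    int64_t ne[4]; // number of elements in each dimension
    size_t  nb[4]; // stride in bytes for each dimension
};

// Builds strides for a contiguous float tensor:
// nb[0] is the element size, nb[i] = nb[i-1] * ne[i-1].
inline Shape contiguous_f32(int64_t ne0, int64_t ne1, int64_t ne2, int64_t ne3) {
    Shape s{{ne0, ne1, ne2, ne3}, {0, 0, 0, 0}};
    s.nb[0] = sizeof(float);
    for (int i = 1; i < 4; ++i) {
        s.nb[i] = s.nb[i - 1] * static_cast<size_t>(s.ne[i - 1]);
    }
    return s;
}
```

For a shape of 4 x 3 x 2 x 1 with 4-byte floats, `ne[2]` is 2 (two matrices) while `nb[2]` is 48 (bytes between consecutive matrices), so using one in place of the other changes the computed address.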

2023-08-16 23:07:04 +03:00

3.1 KiB
