llama.cpp/ggml-common.h at 5cdb371731caa2c41fcca42d4d2d43f94f6883b4

mirror of https://github.com/ggml-org/llama.cpp.git synced 2025-11-07 09:57:00 +00:00

Kawrakow 44ca159faf 1.5 bit: we can do even better (#5999)

* iq1_s: we can do even better

Spent one of the 4 scale bits on the sign of a 0.125 shift.
I.e., quants are now -1 + delta, delta, 1 + delta, where delta
is +/- 0.125.
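
A minimal sketch of the shifted levels described above, not the actual
ggml code. The 4-bit field layout (bit 3 = sign of delta, bits 0..2 =
scale index), the odd-integer scale mapping, and the names `IQ1S_SHIFT`
and `dequant_one` are assumptions made for illustration only.

```c
#include <stdint.h>

/* Shift magnitude taken from the commit message. */
#define IQ1S_SHIFT 0.125f

/*
 * Dequantize one ternary quant q in {-1, 0, 1}.
 * ASSUMED layout of scale_bits (hypothetical, for illustration):
 *   bit 3     : sign of delta (set -> -0.125, clear -> +0.125)
 *   bits 0..2 : scale index, mapped to the odd integer 2*idx + 1
 * d is the per-block floating-point scale.
 */
static float dequant_one(int q, uint8_t scale_bits, float d) {
    float delta = (scale_bits & 0x8) ? -IQ1S_SHIFT : IQ1S_SHIFT;
    float scale = d * (float)(2 * (scale_bits & 0x7) + 1);
    /* Levels become -1 + delta, delta, 1 + delta, times the scale. */
    return scale * ((float)q + delta);
}
```

With d = 1 and a zero scale index, a clear sign bit gives the levels
-0.875, 0.125, 1.125 and a set sign bit gives -1.125, -0.125, 0.875.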

CUDA works, same performance as before.
PPL(LLaMA-v2-7B) is now 11.85!

* iq1_s: make scalar and AVX2 work with the new version

* iq1_s: make Neon work with new version.

~10% drop in performance, so will need some more work.

* iq1_s: make Metal work with new version

* iq1_s: very slightly faster dequantize on Metal

* iq1_s: fix dequantize on the CPU

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

2024-03-11 17:53:15 +02:00

118 KiB
