16bc66d947
* llama.cpp : split llama_context_params into model and context params ggml-ci
* fix metal build
* fix freq_base/scale default to model value
* llama-bench : keep the same model between tests when possible
* move n_threads to llama_context_params, add n_threads_batch
* fix mpi build
* remove kv_size(), cuda scratch fixes
* remove low-vram option
* add n_threads_batch to system info, refactor to get_system_info()
* add documentation about --threads-batch to the READMEs
* llama-bench fix
* main : fix rope freq/scale warning
* llama.cpp : add llama_get_model; common : add llama_tokenize from model
* remove duplicated ctx/model functions ggml-ci
* cuda : print total VRAM used
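For context, a minimal sketch of how the split API is typically used after this change: model-loading options go into `llama_model_params`, per-context options (context size, thread counts) into `llama_context_params`. The model path, field values, and error handling below are illustrative, not taken from the commit.

```c
// Sketch (not from the commit itself): load a model and create a context with
// the split parameter structs. Function/field names follow llama.h around this
// change and may differ in later versions.
#include "llama.h"
#include <stdio.h>

int main(void) {
    llama_backend_init(false /* numa */);

    // Model-level parameters: everything needed to load the weights once.
    struct llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 35;                       // layers to offload to the GPU

    struct llama_model * model = llama_load_model_from_file("model.gguf", mparams); // hypothetical path
    if (model == NULL) {
        fprintf(stderr, "failed to load model\n");
        return 1;
    }

    // Context-level parameters: per-session settings; n_threads moved here and
    // n_threads_batch is new in this change.
    struct llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx           = 4096;
    cparams.n_threads       = 8;    // threads for single-token generation
    cparams.n_threads_batch = 8;    // threads for batch/prompt processing

    struct llama_context * ctx = llama_new_context_with_model(model, cparams);
    if (ctx == NULL) {
        fprintf(stderr, "failed to create context\n");
        llama_free_model(model);
        return 1;
    }

    // ... run inference ...

    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```

The split reflects the intent described in the commit message: model params cover what is fixed once the weights are loaded, while context params (including the new thread settings) can differ per context created from the same model, which is what lets llama-bench keep the same model between tests.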
# perplexity

TODO

## Llama 2 70B Scorechart
| Quantization | Model size (GiB) | Perplexity | Delta to fp16 | 
|---|---|---|---|
| Q4_0 | 36.20 | 3.5550 | 3.61% | 
| Q4_1 | 40.20 | 3.5125 | 2.37% | 
| Q5_0 | 44.20 | 3.4744 | 1.26% | 
| Q2_K | 27.27 | 3.7339 | 8.82% | 
| Q3_K_S | 27.86 | 3.7019 | 7.89% | 
| Q3_K_M | 30.83 | 3.5932 | 4.72% | 
| Q3_K_L | 33.67 | 3.5617 | 3.80% | 
| Q4_K_S | 36.39 | 3.4852 | 1.57% | 
| Q4_K_M | 38.54 | 3.4725 | 1.20% | 
| Q5_K_S | 44.20 | 3.4483 | 0.50% | 
| Q5_K_M | 45.41 | 3.4451 | 0.40% | 
| Q6_K | 52.70 | 3.4367 | 0.16% | 
| fp16 | 128.5 | 3.4313 | - |
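
The "Delta to fp16" column looks like the relative increase in perplexity over the fp16 baseline. A minimal sketch of that calculation, assuming this interpretation and using the Q4_0 row as the example:

```c
#include <stdio.h>

int main(void) {
    // Relative perplexity increase over the fp16 baseline, e.g. for Q4_0.
    const double ppl_fp16 = 3.4313;
    const double ppl_q4_0 = 3.5550;

    const double delta_pct = (ppl_q4_0 - ppl_fp16) / ppl_fp16 * 100.0;
    printf("delta to fp16: %.2f%%\n", delta_pct);   // prints ~3.61%, matching the table
    return 0;
}
```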