convert : various script cleanups/fixes + merges and special token handling (#2842)

mirror of https://github.com/ggml-org/llama.cpp.git synced 2025-10-27 08:21:30 +00:00

* convert: Fix permute calls and method/func definitions

* Cleanups for gguf-py

* Minor types cleanups.

* Initial implementation of handling merges and special tokens

* convert: Handle special tokens and merges in vocab only mode

convert: Vocab only mode no longer requires loading model tensors

* gguf: Refactor tensor name mapping

* convert: Fix type hint for special_token_types in SpecialVocab

* Use common special vocab handling in various conversion scripts

* First pass at implementing suggested changes

* Second pass

* gguf: SpecialVocab: Fix issue with special token content not in a dict

gguf: SpecialVocab: Allow skipping handling of merges

* convert-falcon-hf-to-gguf: Support --vocab-only option, bail out if no tokenizer.json

* convert-gptneox-hf-to-gguf and convert: Only handle merges for BPE tokenizer

* gguf: SpecialVocab: Actually set load_merges in object

* Uniform args parsing and vocab only mode for convert examples

* convert.py: Set gpt2 as tokenizer model when using BPE

* Squish last type warning in gguf.py - yay!
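
The special-token and merges handling described in the list above can be sketched roughly as follows. This is a minimal illustration, not the `SpecialVocab` class from gguf-py: the function name `load_special_vocab`, its parameters, and its return shape are hypothetical. It assumes a Hugging Face style model directory with a `tokenizer.json` (holding `model.merges` and `added_tokens`) and a `tokenizer_config.json` (holding entries such as `bos_token`). It covers two cases from the message: merges can be skipped, and a special token entry may be a plain string rather than a dict.

```python
import json
from pathlib import Path
from typing import Dict, List, Tuple


# Hypothetical helper for illustration only; not the gguf-py API.
def load_special_vocab(
    model_dir: Path,
    load_merges: bool = True,
    token_types: Tuple[str, ...] = ("bos", "eos", "unk", "sep", "pad"),
) -> Tuple[List[str], Dict[str, int]]:
    """Return (merges, special token ids) found in model_dir, if any."""
    merges: List[str] = []
    special_ids: Dict[str, int] = {}

    tokenizer_file = model_dir / "tokenizer.json"
    if not tokenizer_file.is_file():
        # Nothing to read; the caller decides whether this is fatal.
        return merges, special_ids
    tokenizer = json.loads(tokenizer_file.read_text(encoding="utf-8"))

    if load_merges:
        # Only accept merges stored as a list of strings (assumed layout).
        raw = tokenizer.get("model", {}).get("merges")
        if isinstance(raw, list) and all(isinstance(m, str) for m in raw):
            merges = raw

    # Map token text to token id using the added_tokens table.
    added = {
        entry["content"]: entry["id"]
        for entry in tokenizer.get("added_tokens", [])
        if isinstance(entry, dict) and "content" in entry and "id" in entry
    }

    config_file = model_dir / "tokenizer_config.json"
    if not config_file.is_file():
        return merges, special_ids
    config = json.loads(config_file.read_text(encoding="utf-8"))

    for typ in token_types:
        entry = config.get(f"{typ}_token")
        # The entry may be a plain string or a dict with a "content" key.
        content = entry.get("content") if isinstance(entry, dict) else entry
        if isinstance(content, str) and content in added:
            special_ids[typ] = added[content]

    return merges, special_ids
```

In this sketch, a conversion script for a non-BPE tokenizer would pass `load_merges=False`. Because it reads only the tokenizer files, it is compatible with a vocab-only mode that does not load model tensors.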


Kerfuffle, 2023-08-30 02:25:50 -06:00, committed by GitHub

parent ad9ddcff6e

commit dc07dc492e

10 changed files with 728 additions and 748 deletions

									

gguf-py/pyproject.toml
									
@@ -5,6 +5,7 @@ description = "Write ML models in GGUF for GGML"
 authors = ["GGML <ggml@ggml.ai>"]
 packages = [
     {include = "gguf"},
+    {include = "gguf/py.typed"},
 ]
 readme = "README.md"
 homepage = "https://ggml.ai"
