convert : support loading vocab from fast tokenizer config (#3633)

mirror of https://github.com/ggml-org/llama.cpp.git synced 2025-10-27 08:21:30 +00:00

* Add HFVocab into convert.py

* Update convert.py

* Update convert.py

* add bytes_to_unicode function

* change add_meta_vocab function

* remove debug code

* remove byte_encoder

* Add newline between classes

* Check tokenizer.json when tokenizer.model does not exist.

* Move transformers dependency to local code

* Add error context with 'raise from'

* Add fast tokenizer option to BpeVocab

* Update convert.py

* Add VocabLoader and remove *Vocab class

* Add transformers dependency

* remove added tokens and check newline token to decide spm or bpe

* Update convert.py

* Add special token type

* Update convert.py

* Update convert.py

* Update convert.py

* Fix typo in convert.py

* Fix when params.n_vocab < tokenizer vocab size

* update vocab class

* change function name

* Remove unused variables/functions, add types to class variables and methods, delete blank lines

* fix flake8 warnings

* code style cleanup

* make mypy happy

* change exception

---------

Co-authored-by: Jared Van Bortel <jared@nomic.ai>
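
The `bytes_to_unicode` function named in the list above is not shown on this page. As a rough sketch, assuming it follows the byte-to-unicode table used by GPT-2 style byte-level BPE tokenizers (an assumption, not something this commit message confirms), such a helper might look like:

```python
def bytes_to_unicode() -> dict[int, str]:
    """Map every byte value (0-255) to a distinct printable unicode character.

    Sketch of the GPT-2 style table; assumed, not copied from convert.py.
    """
    # Bytes that already render as visible characters keep their own code point.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for b in range(256):
        if b not in printable:
            # Whitespace and control bytes are moved above U+00FF
            # so that every byte has a visible stand-in.
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}
```

Under this table a space byte maps to `Ġ` (U+0120), which is why byte-level BPE vocabularies tend to show that character at the start of words.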

wonjun Jang

2023-12-14 17:09:34 +09:00

committed by

GitHub

parent 0353a18401

commit 873637afc7

2 changed files with 168 additions and 156 deletions

requirements.txt

@@ -1,3 +1,4 @@
 numpy==1.24.4
 sentencepiece==0.1.98
+transformers>=4.34.0
 gguf>=0.1.0
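
The commit message mentions moving the transformers dependency into local code and adding error context with `raise from`. A minimal sketch of that pattern, assuming a hypothetical helper `import_optional` (it is not a function from convert.py):

```python
import importlib
from types import ModuleType


def import_optional(module_name: str, hint: str) -> ModuleType:
    """Import a module at call time and explain how to fix a failure.

    `import_optional` and `hint` are hypothetical names used for illustration.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as e:
        # 'raise ... from e' keeps the original error as __cause__,
        # so the traceback shows both the hint and the root failure.
        raise ImportError(f"cannot import {module_name!r}: {hint}") from e
```

A caller that needs the fast tokenizer could then write something like `transformers = import_optional("transformers", "pip install -r requirements.txt")`, so users who never take that path do not need the package installed.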
