llama.cpp/tests/test-tokenizer-random.py at 2ab977282b02ccd6783fbbaec393c96886cf33b1

mirror of https://github.com/ggml-org/llama.cpp.git synced 2025-11-01 09:01:57 +00:00

jaime-m-p 02c1ecad07 Tokenizer WPM fixes (#7500)

* Update random test: add_bos_token.
* Update random test: add WPM models for testing.
* Build vocab.special_tokens_cache using vocab token types.
* Fix and improve WPM preprocessing.
  - Fix unicode edge case combinations.
  - Split by whitespace in the same pass.
* Discard all tokens when no match is found.
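
For readers unfamiliar with WPM (WordPiece) tokenization, the two behaviours named in the commit message, splitting on whitespace during the same preprocessing pass and discarding all partial tokens when a word cannot be matched, can be illustrated with a minimal Python sketch. This is not the llama.cpp implementation: the function names, the BERT-style `##` continuation prefix, the lowercasing step, and the `[UNK]` fallback are all assumptions chosen for illustration.

```python
import unicodedata


def preprocess(text):
    """Split text into words in one pass (illustrative, not llama.cpp's code).

    Assumptions: input is lowercased, whitespace separates words, and every
    Unicode punctuation character becomes a word of its own.
    """
    words, current = [], []
    for ch in text.lower():
        if ch.isspace():
            # Whitespace ends the current word and is itself dropped.
            if current:
                words.append("".join(current))
                current = []
        elif unicodedata.category(ch).startswith("P"):
            # Punctuation ends the current word and is emitted separately.
            if current:
                words.append("".join(current))
                current = []
            words.append(ch)
        else:
            current.append(ch)
    if current:
        words.append("".join(current))
    return words


def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # assumed continuation marker
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No piece matches here: discard every token collected for this
            # word and fall back to a single unknown token.
            return [unk]
        tokens.append(match)
        start = end
    return tokens


def tokenize(text, vocab):
    """Preprocess the text, then apply WordPiece to each word."""
    return [tok for word in preprocess(text) for tok in wordpiece(word, vocab)]
```

With a toy vocabulary such as `{"un", "##aff", "##able"}`, calling `wordpiece("unaffable", vocab)` yields three pieces, while a word whose tail has no matching piece collapses to the single unknown token instead of keeping the pieces already matched.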

2024-05-28 21:46:34 +02:00

12 KiB