	Detokenizer fixes (#8039)
* Add llama_detokenize():
  - Update header file locations
  - UNKNOWN and CONTROL are 'special pieces'
  - Remove space after UNKNOWN and CONTROL
  - Refactor llama_token_to_piece()
  - Add flag: clean_up_tokenization_spaces
  - Symmetric params for llama_tokenize() and llama_detokenize() (see the round-trip sketch after this list)
* Update and fix tokenizer tests:
  - Using llama_detokenize()
  - Treat an unexpected vocab type as a test failure instead of an error
    - Useful when automating tests:
    - If the vocab type is not known in advance
    - Differentiates from other loading errors
  - Skip Unicode surrogates and undefined codepoints
  - Gracefully exit threads
    - Using exit() throws random exceptions
  - Clean up old known-problematic codepoints
  - Minor: fix a confusing hexadecimal codepoint
* Update bruteforce random tests
  - Add detokenizer checks
  - New generator: ascii_lr_strip
  - New generator: apostrophe
  - Add more vocab files
  - Detokenize special tokens
  - Replace errors with '\uFFFD' when detokenizing to 'utf-8'
  - More edge cases
  - Better checking of detokenization results
* Fix add_space_prefix; set it to false by default
* Better leading space removal
* Do not remove space when decoding special tokens
* Bugfix: custom regexes split undefined Unicode codepoints
* Clean spaces in the 'viking' detokenizer
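
The round-trip sketch referenced above: a minimal, hedged illustration of the symmetric llama_tokenize()/llama_detokenize() pair, using the signatures and return-value conventions shown in the header diff below. It assumes an already-loaded llama_model pointer; the buffer sizes and the resize-and-retry pattern are illustrative, not prescribed.

```cpp
// Minimal sketch of a tokenize/detokenize round trip (assumes a loaded model).
#include <cstdint>
#include <string>
#include <vector>
#include "llama.h"

static std::string round_trip(const llama_model * model, const std::string & text) {
    // Tokenize; per the header docs, a negative return is the required token count.
    std::vector<llama_token> tokens(text.size() + 16);
    int32_t n_tokens = llama_tokenize(model, text.c_str(), (int32_t) text.size(),
                                      tokens.data(), (int32_t) tokens.size(),
                                      /*add_special=*/true, /*parse_special=*/true);
    if (n_tokens < 0) {
        tokens.resize(-n_tokens);
        n_tokens = llama_tokenize(model, text.c_str(), (int32_t) text.size(),
                                  tokens.data(), (int32_t) tokens.size(), true, true);
    }
    tokens.resize(n_tokens);

    // Detokenize with mirrored parameters: remove_special undoes add_special,
    // unparse_special mirrors parse_special. Negative return is the required size.
    std::string out(text.size() * 2 + 16, '\0');
    int32_t n_chars = llama_detokenize(model, tokens.data(), (int32_t) tokens.size(),
                                       &out[0], (int32_t) out.size(),
                                       /*remove_special=*/true, /*unparse_special=*/true);
    if (n_chars < 0) {
        out.resize(-n_chars);
        n_chars = llama_detokenize(model, tokens.data(), (int32_t) tokens.size(),
                                   &out[0], (int32_t) out.size(), true, true);
    }
    out.resize(n_chars);
    return out;
}
```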
			
			
@@ -904,6 +904,7 @@ extern "C" {
     /// @param tokens The tokens pointer must be large enough to hold the resulting tokens.
     /// @return Returns the number of tokens on success, no more than n_tokens_max
     /// @return Returns a negative number on failure - the number of tokens that would have been returned
+    /// @param add_special Allow to add BOS and EOS tokens if model is configured to do so.
     /// @param parse_special Allow tokenizing special and/or control tokens which otherwise are not exposed and treated
     ///                      as plaintext. Does not insert a leading space.
     LLAMA_API int32_t llama_tokenize(
@@ -918,15 +919,31 @@
     // Token Id -> Piece.
     // Uses the vocabulary in the provided context.
     // Does not write null terminator to the buffer.
-    // User code is responsible to remove the leading whitespace of the first non-BOS token when decoding multiple tokens.
+    // User can skip up to 'lstrip' leading spaces before copying (useful when encoding/decoding multiple tokens with 'add_space_prefix')
     // @param special If true, special tokens are rendered in the output.
     LLAMA_API int32_t llama_token_to_piece(
               const struct llama_model * model,
                            llama_token   token,
                                   char * buf,
                                int32_t   length,
+                               int32_t   lstrip,
                                   bool   special);
 
+    /// @details Convert the provided tokens into text (inverse of llama_tokenize()).
+    /// @param text The char pointer must be large enough to hold the resulting text.
+    /// @return Returns the number of chars/bytes on success, no more than text_len_max.
+    /// @return Returns a negative number on failure - the number of chars/bytes that would have been returned.
+    /// @param remove_special Allow to remove BOS and EOS tokens if model is configured to do so.
+    /// @param unparse_special If true, special tokens are rendered in the output.
+    LLAMA_API int32_t llama_detokenize(
+        const struct llama_model * model,
+               const llama_token * tokens,
+                         int32_t   n_tokens,
+                            char * text,
+                         int32_t   text_len_max,
+                            bool   remove_special,
+                            bool   unparse_special);
+
     /// Apply chat template. Inspired by hf apply_chat_template() on python.
     /// Both "model" and "custom_template" are optional, but at least one is required. "custom_template" has higher precedence than "model"
     /// NOTE: This function does not use a jinja parser. It only support a pre-defined list of template. See more: https://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template
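
To illustrate the new lstrip parameter of llama_token_to_piece(), here is a hedged sketch of piece-by-piece decoding: it skips up to one leading space on the first piece (the one injected by add_space_prefix) and keeps spaces everywhere else. The helper name and buffer size are illustrative, and the retry-on-negative behavior follows the convention documented for the other calls.

```cpp
// Sketch: stream tokens to text one piece at a time, using 'lstrip' to drop
// the leading space that 'add_space_prefix' injects before the first piece.
#include <cstdint>
#include <string>
#include <vector>
#include "llama.h"

static std::string decode_pieces(const llama_model * model,
                                 const std::vector<llama_token> & tokens) {
    std::string text;
    for (size_t i = 0; i < tokens.size(); ++i) {
        char buf[256];
        // Skip at most one leading space, and only on the first piece;
        // later pieces keep their spaces so words stay separated.
        const int32_t lstrip = (i == 0) ? 1 : 0;
        const int32_t n = llama_token_to_piece(model, tokens[i], buf,
                                               (int32_t) sizeof(buf), lstrip,
                                               /*special=*/true);
        if (n < 0) {
            break;  // piece did not fit; a real caller would retry with a larger buffer
        }
        text.append(buf, (size_t) n);
    }
    return text;
}
```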
jaime-m-p