* minor : code style
* server : fix prompt similarity calculation
* server : initial host-memory prompt caching
* cont
* server : refactor
* cont
* cont : make the server task of the slot const
* cont : minor [no ci]
* server : cache prompts and checkpoints only for completion tasks
* server : improve prompt caching logic
* cont : fix check for number of cached prompts [no ci]
* server : improve caching logic, add -cram CLI arg
* server : print prompt mismatch info
* cont : better naming [no ci]
* server : improve prompt cache loading logic
* server : add option to debug the slot contents (#16482)
* server : add option to debug the slot contents
* Update tools/server/server.cpp
---------
Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>
* server : add option to disable prompt cache
---------
Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>
- Use server_tokens in more places in server and util.cpp
- Convert most functions that used llama_tokens to server_tokens
- Modify input tokenizer to handle JSON objects as subprompts
- Break out MTMD prompt parsing into utility function
- Support JSON objects with multimodal_data arrays for MTMD prompts along with other existing types
- Add capability to model endpoint to indicate if client can send multimodal data
- Add tests.