# Multimodal Support in llama.cpp
This directory provides multimodal capabilities for llama.cpp. Initially intended as a showcase for running LLaVA models, its scope has expanded significantly over time to include various other vision-capable models. As a result, LLaVA is no longer the only multimodal architecture supported.
> [!IMPORTANT]
>
> Multimodal support can be viewed as a sub-project within `llama.cpp`. It is under very heavy development, and breaking changes are expected.
The naming and structure related to multimodal support have evolved, which might cause some confusion. Here's a brief timeline to clarify:
- #3436: Initial support for LLaVA 1.5 was added, introducing `llava.cpp` and `clip.cpp`. The `llava-cli` binary was created for model interaction.
- #4954: Support for MobileVLM was added, becoming the second vision model supported. This built upon the existing `llava.cpp`, `clip.cpp`, and `llava-cli` infrastructure.
- Expansion & Fragmentation: Many new models were subsequently added (e.g., #7599, #10361, #12344, and others). However, `llava-cli` lacked support for the increasingly complex chat templates required by these models. This led to the creation of model-specific binaries like `qwen2vl-cli`, `minicpmv-cli`, and `gemma3-cli`. While functional, this proliferation of command-line tools became confusing for users.
- #12849: `libmtmd` was introduced as a replacement for `llava.cpp`. Its goals include providing a single, unified command-line interface, improving the user/developer experience (UX/DX), and supporting both audio and image inputs.
- #13012: `mtmd-cli` was added, consolidating the various model-specific CLIs into a single tool powered by `libmtmd`.
## Pre-quantized models
See the list of pre-quantized models here.
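As a quick illustration (a hedged sketch, not an exhaustive reference), a pre-quantized multimodal model can typically be run directly from its Hugging Face repo; the repo name below is just one example, and the `-hf` option is assumed to fetch the corresponding `mmproj` file automatically when one is published alongside the model:

```sh
# Example only: the repo name is a placeholder for any supported pre-quantized multimodal model.
# -hf downloads the GGUF (and its mmproj, when available) from Hugging Face.
llama-mtmd-cli -hf ggml-org/gemma-3-4b-it-GGUF \
    --image path/to/image.png \
    -p "Describe this image."
```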
## How it works and what is `mmproj`?
Multimodal support in llama.cpp works by encoding images into embeddings using a separate model component, and then feeding these embeddings into the language model.
This approach keeps the multimodal components distinct from the core `libllama` library. Separating these allows for faster, independent development cycles. While many modern vision models are based on Vision Transformers (ViTs), their specific pre-processing and projection steps can vary significantly. Integrating this diverse complexity directly into `libllama` is currently challenging.
Consequently, running a multimodal model typically requires two GGUF files:
- The standard language model file.
- A corresponding multimodal projector (`mmproj`) file, which handles the image encoding and projection.
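For example, a typical invocation passes both files explicitly (a minimal sketch; the file names are placeholders and assume a local `llama-mtmd-cli` build):

```sh
# Both GGUF files are needed: the language model (-m) and the projector (--mmproj).
llama-mtmd-cli \
    -m model-Q4_K_M.gguf \
    --mmproj mmproj-model-f16.gguf \
    --image path/to/image.png \
    -p "What is in this image?"
```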
## What is `libmtmd`?
As outlined in the history, `libmtmd` is the modern library designed to replace the original `llava.cpp` implementation for handling multimodal inputs.

Built upon `clip.cpp` (similar to `llava.cpp`), `libmtmd` offers several advantages:
- Unified Interface: Aims to consolidate interaction for various multimodal models.
- Improved UX/DX: Features a more intuitive API, inspired by the `Processor` class in the Hugging Face `transformers` library.
- Flexibility: Designed to support multiple input types (text, audio, images) while respecting the wide variety of chat templates used by different models.
## How to obtain `mmproj`
Multimodal projector (`mmproj`) files are specific to each model architecture.
For the following models, you can use `convert_hf_to_gguf.py` with the `--mmproj` flag to get the `mmproj` file (see the example command after this list):
- Gemma 3 ; See the guide here
  - Note: 1B variant does not have vision support
- SmolVLM (from HuggingFaceTB)
- SmolVLM2 (from HuggingFaceTB)
- Pixtral 12B - only works with `transformers`-compatible checkpoint
- Qwen 2 VL and Qwen 2.5 VL (from Qwen)
- Mistral Small 3.1 24B
- InternVL 2.5 and InternVL 3 from OpenGVLab (note: we don't support conversion of `InternVL3-*-hf` model, only the non-HF version is supported; `InternLM2Model` text model is not supported)
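A conversion sketch for the architectures above (the checkpoint directory and output file names are placeholders; the flags other than `--mmproj` are the converter's usual output options and may need adjusting per model):

```sh
# Convert the text model to GGUF from a local Hugging Face checkpoint directory.
python convert_hf_to_gguf.py ./path/to/hf-model --outfile model-f16.gguf --outtype f16

# Run the converter again with --mmproj to produce the projector file.
python convert_hf_to_gguf.py ./path/to/hf-model --outfile mmproj-f16.gguf --outtype f16 --mmproj
```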
For older models, please refer to the relevant guide for instructions on how to obtain or create them:
NOTE: conversion scripts are located under `tools/mtmd/legacy-models`