# Granite Vision

Download the model and point your `GRANITE_MODEL` environment variable to the path.

```bash
$ git clone https://huggingface.co/ibm-granite/granite-vision-3.2-2b
$ export GRANITE_MODEL=./granite-vision-3.2-2b
```


### 1. Running llava surgery v2
First, we need to run the llava surgery script as shown below:

`python llava_surgery_v2.py -C -m $GRANITE_MODEL`

You should see two new files (`llava.clip` and `llava.projector`) written into your model's directory, as shown below.

```bash
$ ls $GRANITE_MODEL | grep -i llava
llava.clip
llava.projector
```

We should see that the projector and visual encoder get split out into the llava files. Quick check to make sure they aren't empty:
```python
import os
import torch

MODEL_PATH = os.getenv("GRANITE_MODEL")
if not MODEL_PATH:
    raise ValueError("env var GRANITE_MODEL is unset!")

# Each file written by the surgery script is a plain dict of tensor name -> tensor.
encoder_tensors = torch.load(os.path.join(MODEL_PATH, "llava.clip"))
projector_tensors = torch.load(os.path.join(MODEL_PATH, "llava.projector"))

assert len(encoder_tensors) > 0
assert len(projector_tensors) > 0
```

If you actually inspect the `.keys()` of the loaded tensors, you should see a lot of `vision_model` tensors in the `encoder_tensors`, and 5 tensors (`'multi_modal_projector.linear_1.bias'`, `'multi_modal_projector.linear_1.weight'`, `'multi_modal_projector.linear_2.bias'`, `'multi_modal_projector.linear_2.weight'`, `'image_newline'`) in the multimodal `projector_tensors`.
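
For example, a quick (optional) way to inspect those keys:
```python
import os
import torch

MODEL_PATH = os.getenv("GRANITE_MODEL")
if not MODEL_PATH:
    raise ValueError("env var GRANITE_MODEL is unset!")

# List the projector entries; you should see the five tensors named above.
projector_tensors = torch.load(os.path.join(MODEL_PATH, "llava.projector"))
for name, tensor in sorted(projector_tensors.items()):
    print(name, tuple(tensor.shape))
```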

### 2. Creating the Visual Component GGUF
Next, create a new directory to hold the visual components, and copy the llava.clip/projector files, as shown below.

```bash
$ ENCODER_PATH=$PWD/visual_encoder
$ mkdir $ENCODER_PATH

$ cp $GRANITE_MODEL/llava.clip $ENCODER_PATH/pytorch_model.bin
$ cp $GRANITE_MODEL/llava.projector $ENCODER_PATH/
```

Now, we need to write a config for the visual encoder. In order to convert the model, be sure to use the correct `image_grid_pinpoints`, as these may vary based on the model. You can find the `image_grid_pinpoints` in `$GRANITE_MODEL/config.json`.

```json
{
    "_name_or_path": "siglip-model",
    "architectures": [
      "SiglipVisionModel"
    ],
    "image_grid_pinpoints": [
        [384,384],
        [384,768],
        [384,1152],
        [384,1536],
        [384,1920],
        [384,2304],
        [384,2688],
        [384,3072],
        [384,3456],
        [384,3840],
        [768,384],
        [768,768],
        [768,1152],
        [768,1536],
        [768,1920],
        [1152,384],
        [1152,768],
        [1152,1152],
        [1536,384],
        [1536,768],
        [1920,384],
        [1920,768],
        [2304,384],
        [2688,384],
        [3072,384],
        [3456,384],
        [3840,384]
    ],
    "mm_patch_merge_type": "spatial_unpad",
    "hidden_size": 1152,
    "image_size": 384,
    "intermediate_size": 4304,
    "model_type": "siglip_vision_model",
    "num_attention_heads": 16,
    "num_hidden_layers": 27,
    "patch_size": 14,
    "layer_norm_eps": 1e-6,
    "hidden_act": "gelu_pytorch_tanh",
    "projection_dim": 0,
    "vision_feature_layer": [-24, -20, -12, -1]
}
```
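
If you want to double-check the `image_grid_pinpoints` against your checkpoint before converting, here is a minimal sketch for reading them out of the original model config; it assumes they live at the top level of `$GRANITE_MODEL/config.json`, as noted above:
```python
import json
import os

MODEL_PATH = os.getenv("GRANITE_MODEL")
if not MODEL_PATH:
    raise ValueError("env var GRANITE_MODEL is unset!")

with open(os.path.join(MODEL_PATH, "config.json")) as f:
    cfg = json.load(f)

# Print the grid pinpoints; if the key is missing, inspect cfg manually, since the
# layout can differ between transformers versions.
print(cfg.get("image_grid_pinpoints"))
```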

At this point you should have something like this:
```bash
$ ls $ENCODER_PATH
config.json             llava.projector         pytorch_model.bin
```

Now convert the components to GGUF. Note that we also override the image mean/std dev to `[.5,.5,.5]` since we use the SigLIP visual encoder; in the transformers model, you can find these numbers in the `preprocessor_config.json`.
```bash
$ python convert_image_encoder_to_gguf.py \
    -m $ENCODER_PATH \
    --llava-projector $ENCODER_PATH/llava.projector \
    --output-dir $ENCODER_PATH \
    --clip-model-is-vision \
    --clip-model-is-siglip \
    --image-mean 0.5 0.5 0.5 \
    --image-std 0.5 0.5 0.5
```

This will create the first GGUF file at `$ENCODER_PATH/mmproj-model-f16.gguf`; we will refer to the absolute path of this file as `$VISUAL_GGUF_PATH`.
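
For example:
```bash
$ export VISUAL_GGUF_PATH=$ENCODER_PATH/mmproj-model-f16.gguf
```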


### 3. Creating the LLM GGUF
The granite vision model contains a granite LLM as its language model. For now, the easiest way to get a GGUF for the LLM is to load the composite model in `transformers` and export the LLM so that it can be converted directly with the normal conversion path.

First, set `LLM_EXPORT_PATH` to the path that the `transformers` LLM should be exported to.
```bash
$ export LLM_EXPORT_PATH=$PWD/granite_vision_llm
```

Then export the tokenizer and the language model with the following script:
```python
import os
import transformers

MODEL_PATH = os.getenv("GRANITE_MODEL")
if not MODEL_PATH:
    raise ValueError("env var GRANITE_MODEL is unset!")

LLM_EXPORT_PATH = os.getenv("LLM_EXPORT_PATH")
if not LLM_EXPORT_PATH:
    raise ValueError("env var LLM_EXPORT_PATH is unset!")

tokenizer = transformers.AutoTokenizer.from_pretrained(MODEL_PATH)

# NOTE: granite vision support was added to transformers very recently (4.49);
# if you get size mismatches, your version is too old.
# If you are running with an older version, set `ignore_mismatched_sizes=True`
# as shown below; it won't be loaded correctly, but the LLM part of the model that
# we are exporting will be loaded correctly.
model = transformers.AutoModelForImageTextToText.from_pretrained(MODEL_PATH, ignore_mismatched_sizes=True)

tokenizer.save_pretrained(LLM_EXPORT_PATH)
model.language_model.save_pretrained(LLM_EXPORT_PATH)
```
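
As a quick sanity check, the export directory should now contain a standalone `transformers` model; the exact file names may vary with your `transformers` version:
```bash
$ ls $LLM_EXPORT_PATH
# expect config.json, tokenizer files, and one or more *.safetensors weight shards
```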

Now you can convert the exported LLM to GGUF with the normal converter in the root of the llama.cpp project.
```bash
$ LLM_GGUF_PATH=$LLM_EXPORT_PATH/granite_llm.gguf
...
$ python convert_hf_to_gguf.py --outfile $LLM_GGUF_PATH $LLM_EXPORT_PATH
```


### 4. Quantization
If you want to quantize the LLM, you can do so with `llama-quantize` as you would any other LLM. For example:
```bash
$ ./build/bin/llama-quantize $LLM_EXPORT_PATH/granite_llm.gguf $LLM_EXPORT_PATH/granite_llm_q4_k_m.gguf Q4_K_M
$ LLM_GGUF_PATH=$LLM_EXPORT_PATH/granite_llm_q4_k_m.gguf
```

Note that currently you cannot quantize the visual encoder because granite vision models use SigLIP as the visual encoder, which has tensor dimensions that are not divisible by 32.


### 5. Running the Model in llama.cpp
Build llama.cpp normally; you should have a target binary named `llama-mtmd-cli`, which you can pass the two GGUF files to. As an example, we use the llama.cpp banner image.

```bash
$ ./build/bin/llama-mtmd-cli -m $LLM_GGUF_PATH \
    --mmproj $VISUAL_GGUF_PATH \
    -c 16384 \
    --temp 0
```
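
To query an image non-interactively, you can also pass an image and a prompt on the command line; this is a sketch assuming your build of `llama-mtmd-cli` supports the `--image` and `-p` flags (check `--help`), with a placeholder path for the banner image:
```bash
# The image path below is a placeholder; point it at the llama.cpp banner (or any image) on your machine.
$ ./build/bin/llama-mtmd-cli -m $LLM_GGUF_PATH \
    --mmproj $VISUAL_GGUF_PATH \
    --image ./path/to/llama-banner.png \
    -p "What is shown in this image?" \
    -c 16384 \
    --temp 0
```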
