BF16 requires special handling in this script
because it is 2-byte data, but the view is 1 byte by default.
Switch to the correct view before attempting byteswapping.
With this change, correctly byteswapping models like
Meta-Llama-3-8B-Instruct-bf16-GGUF
should be possible.
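A minimal sketch of the idea, assuming the tensor bytes are exposed as a
NumPy array (the buffer contents and variable names here are illustrative,
not the script's actual ones):

```python
import numpy as np

# Raw GGUF tensor data is exposed as a 1-byte (uint8) view by default.
raw = np.frombuffer(b"\x3f\x80\x00\x40", dtype=np.uint8).copy()

# BF16 elements are 2 bytes each, so reinterpret as uint16 first and then
# byteswap; swapping a 1-byte view would be a no-op.
bf16 = raw.view(np.uint16)
bf16.byteswap(inplace=True)
```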
* model: EmbeddingGemma sentence-transformers dense linear projections support
* model: add support for EmbeddingGemma SentenceTransformers dense linear projections
Adding support for the Dense modules used in EmbeddingGemma models.
EmbeddingGemma is a SentenceTransformers model with additional modules beyond the base Transformer backbone.
See: https://developers.googleblog.com/en/gemma-explained-embeddinggemma-architecture-and-recipe/
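A rough sketch of what the extra SentenceTransformers modules do at
inference time, using NumPy with made-up shapes (the real projection weights
come from the model's Dense modules; all names and sizes here are
illustrative):

```python
import numpy as np

hidden, d1, d2 = 768, 3072, 768                 # illustrative dimensions
token_embeddings = np.random.rand(10, hidden)   # backbone output for 10 tokens
w_dense1 = np.random.rand(hidden, d1)           # first Dense projection
w_dense2 = np.random.rand(d1, d2)               # second Dense projection

pooled = token_embeddings.mean(axis=0)          # pooling over tokens
embedding = pooled @ w_dense1 @ w_dense2        # dense linear projections
```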
* model: add support for EmbeddingGemma SentenceTransformers dense linear projections
- converting a model with dense layers is optional
- introduced dense config params
* Update convert_hf_to_gguf.py
Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com>
* fixed formatting issues
* Update src/llama-graph.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* - removed pooling_type_opt, always allow overriding pooling_type
- added asserts checking dense feature dims
* fix python lint
* fix ubuntu gcc build warning
* - fixed thread-safety test
- moved asserts to load_hparams
* - tidied up code
- simplified the graph context to expect both dense weights
* minor : add TODO
---------
Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* feat: Add granite-docling conversion using trillion pretokenizer
Branch: gabe-l-hart/GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add granite-docling vocab pre enum
Branch: gabe-l-hart/GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Use granite-docling pre
Branch: gabe-l-hart/GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add clip_is_idefics3
Branch: gabe-l-hart/GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Allow multi-token boundary sequences for image templating
Branch: gabe-l-hart/GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add tiling support for idefics3 in clip.cpp
This should likely be moved into llava_uhd::get_slice_instructions, but for
now this avoids disrupting the logic there.
Branch: gabe-l-hart/GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Partial support for full templating for idefics3 in mtmd
There are still errors encoding some of the image chunks, but the token
sequence now matches transformers _almost_ perfectly, except for the double
newline before the global image, which shows up as two consecutive newline
tokens instead of a single double-newline token. I think this is happening
because the blocks are tokenized separately and then concatenated (see the
sketch below).
Branch: gabe-l-hart/GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
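A toy illustration of why tokenizing blocks separately can split what a
single pass would merge; `tokenize` below is a stand-in greedy tokenizer
with a hypothetical vocabulary, not the real one:

```python
def tokenize(text: str) -> list[str]:
    # Hypothetical longest-match tokenizer with a tiny vocabulary.
    vocab = ["\n\n", "\n", "<global>", "<row>"]
    tokens, i = [], 0
    while i < len(text):
        piece = next(p for p in vocab if text.startswith(p, i))
        tokens.append(piece)
        i += len(piece)
    return tokens

# Tokenizing the whole template at once merges the two newlines...
print(tokenize("<row>\n\n<global>"))                 # ['<row>', '\n\n', '<global>']
# ...but tokenizing the blocks separately and concatenating cannot.
print(tokenize("<row>\n") + tokenize("\n<global>"))  # ['<row>', '\n', '\n', '<global>']
```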
* feat: Fully working image preprocessing for idefics3 w/ resize and slicing
Branch: gabe-l-hart/GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Parse the preprocessor config's longest side and add it to the mmproj hparams
Branch: GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Use the longest side instead of size * scale_factor
For Granite Docling, these come out to the same value, but that was just a
coincidence.
Branch: GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
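A sketch of the sizing logic this describes, assuming the preprocessor
config exposes a `longest_edge` value directly (the numbers are
illustrative):

```python
def target_size(width: int, height: int, longest_edge: int) -> tuple[int, int]:
    # Scale so that the longer image side matches longest_edge, instead of
    # deriving the limit from size * scale_factor.
    scale = longest_edge / max(width, height)
    return round(width * scale), round(height * scale)

print(target_size(1024, 768, 512))  # (512, 384)
```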
* fix: Allow batch encoding and remove clip_is_idefics3
Branch: GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: Remove unnecessary conditionals for empty token vectors
Branch: GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: Use image_manipulation util
Branch: GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* add test model
---------
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
* First attempt
* No permute during convert (fixes qk tensors), proper norm application.
* RoPE = NeoX
* Coherence!
* Migrate xielu params from tensors to hyperparameters
* Simple CUDA kernel
* Revert stupid LLM refactorings
* Chat template support
* configchecker / flake8 errors
* Reorder unary.cu
* I do conclude that LLMs are, in fact, stupid.
* Fix after merge
* Final newline
* Make xIELU an UNARY_OP
* Final newline
* Correctly account for parameter shift
* Argh.
* Update ggml/src/ggml-cpu/unary-ops.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Refactor: remove unused methods, inline and factorize softplus, add const modifiers
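For reference, a numerically stable softplus in plain Python; the factored-out
helper presumably looks similar in spirit, but this is not the actual ggml
code:

```python
import math

def softplus(x: float) -> float:
    # softplus(x) = log(1 + exp(x)), rewritten to avoid overflow for large x
    # and precision loss for very negative x.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

print(softplus(0.0))    # ~0.6931 (= ln 2)
print(softplus(100.0))  # ~100.0, exp(100) is never materialized
```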
* Revert CUDA changes, implement xIELU as a separate OP
* Pesky newline
* Add float2half / half2float for F16 inputs/outputs
* CUDA variants, attempt 2
* Actually, attempt 3
* Update ggml/src/ggml-cuda/unary.cu
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Missing convert header
* Proper formula and reference for xIELU in the comments.
* Modify unary-ops.cpp to add the functor-based logic alongside the template system to retain optimizations
* Apply suggestions from code review
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Add tensor mappings for Apertus to global list instead
* Fix lazy on scalars
* Update ggml/src/ggml-cuda/unary.cu
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Add comment about the constraints on positive/negative alpha
* Change `softplus` to `ggml_softplus`
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* add grok-2 support
* type fix
* type fix
* type fix
* "fix" vocab for invalid sequences
* fix expert tensor mapping and spaces in vocab
* add chat template
* fix norm tensor mapping
* rename layer_out_norm to ffn_post_norm
* ensure ffn_post_norm is mapped
* fix experts merging
* remove erroneous FFN_GATE entry
* concatenate split tensors and add more metadata
* process all expert layers and try cat instead of hstack
* add support for community BPE vocab
* fix expert feed forward length and ffn_down concat
* commit this too
* add ffn_up/gate/down, unsure if sequence is right
* add ffn_gate/down/up to tensor names
* correct residual moe (still not working)
* mess--
* fix embedding scale being applied twice
* add built in chat template
* change beta fast for grok if default value
* remove spm vocab in favor of community bpe vocab
* change attention temp length metadata type to integer
* update attention temp length metadata
* remove comment
* replace M_SQRT2 with std::sqrt(2)
* add yarn metadata, move defaults to hparams
* feat: Add python-side constants and conversion for adapter.lora.invocation_string
Branch: gabe-l-hart/alora-support
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add c++ side constants for adapter.lora.invocation_string
Branch: gabe-l-hart/alora-support
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Parse invocation string for adapters from GGUF
Branch: gabe-l-hart/alora-support
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix(python): Update conversion to alora_invocation_tokens
This is the preferred method in PEFT which is the source of ground truth
https://github.com/huggingface/peft/pull/2609/files#diff-13380145401d203d5935c5189dd09879f990b81aa63e8e3aaff8ce9110333f0e
Branch: gabe-l-hart/alora-support
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix(cpp): Update to alora_invocation_tokens on c++ side
Branch: gabe-l-hart/alora-support
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add C APIs to get alora invocation token array from lora
Branch: gabe-l-hart/alora-support
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Initial implementation of alora cache logic in server
This does not yet identify the invocation tokens and apply the lora adapter
only afterwards, but it does seem to produce correct results if the
invocation tokens are at the beginning of the uncached input.
Branch: gabe-l-hart/alora-support
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Identify alora invocation sequences
This currently limits to a single enabled alora per slot. Multiple aloras
with different invocation sequences would be possible, but it would require
a more complex integration of the adapter toggling and is not really a well
studied case for alora since it's unclear if one alora can reuse cache from
previous prefill computed with a different alora.
Branch: gabe-l-hart/alora-support
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Only reuse cache for tokens before the alora invocation start
This is a bit of an edge case, but theoretically a user could try the same
query with the alora disabled (just using the base model), then retry with
the alora. The cached tokens from the first pass should be invalid.
Branch: gabe-l-hart/alora-support
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Handle un-cached tokens that come before the alora activation
The solution is to only fill up to the token before the invocation start in
the batch if there are any tokens to be prefilled between those pulled from
cache and the invocation start. When this is detected, the alora is
temporarily disabled with a scale of 0.0, then immediately re-enabled after
it has been initialized for the internal graph. Since the batch does not
complete the prompt tokens, the remaining prompt tokens are handled in the
next task, pulling all of the non-alora tokens from cache and proceeding
with prefill for the alora tokens.
Branch: gabe-l-hart/alora-support
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
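A simplified sketch of the matching logic described above: find where the
adapter's invocation token sequence starts in the prompt, then cap the
reusable cache prefix just before it. Token values and helper names are made
up for illustration:

```python
def find_subsequence(haystack: list[int], needle: list[int]) -> int:
    # Return the start index of the first occurrence of needle, or -1.
    for i in range(len(haystack) - len(needle) + 1):
        if haystack[i:i + len(needle)] == needle:
            return i
    return -1

prompt_tokens = [1, 15, 27, 42, 99, 100, 101, 7, 8]
invocation_tokens = [99, 100, 101]   # alora_invocation_tokens from the adapter

start = find_subsequence(prompt_tokens, invocation_tokens)
cached = 6                           # tokens already available in the KV cache
if start >= 0:
    # Only reuse cache for tokens strictly before the invocation start, so
    # the adapter is active for the invocation tokens themselves.
    cached = min(cached, start)
print(start, cached)                 # 4 4
```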
* fix: Use || instead of 'or'
Too much python 🤦
Branch: gabe-l-hart/alora-support
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Fix off-by-one for limiting cached tokens to before alora start
This was the cause of the inconsistent results from the dummy test script
with and without the turn that runs the prompt without the adapter before
running it with the adapter.
Branch: gabe-l-hart/alora-support
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Support backwards-compatibility for "invocation_string" in adapter_config.json
While this has been replaced in the PEFT PR in favor of
alora_invocation_tokens, the existing adapters in the ibm-granite org on HF
use "invocation_string", so this enables backwards compatibility and allows
testing now (before the PEFT PR changes have percolated everywhere).
Branch: gabe-l-hart/alora-support
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
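A hedged sketch of the converter-side fallback this describes: prefer
alora_invocation_tokens when present, otherwise tokenize the legacy
invocation_string. The config dict and tokenizer callable are illustrative,
not the actual convert script:

```python
def get_invocation_tokens(adapter_config: dict, tokenize) -> list[int]:
    # Preferred, newer field written by recent PEFT versions.
    if "alora_invocation_tokens" in adapter_config:
        return adapter_config["alora_invocation_tokens"]
    # Legacy field used by existing ibm-granite adapters on HF.
    if "invocation_string" in adapter_config:
        return tokenize(adapter_config["invocation_string"])
    return []

print(get_invocation_tokens({"invocation_string": "<|activate|>"},
                            tokenize=lambda s: [42]))  # [42]
```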
* fix: Remove duplicate logging
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* feat: Report alora_invocation_string and alora_invocation_tokens from /lora-adapters
Branch: gabe-l-hart/alora-support
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
---------
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
This commit adds support for EmbeddingGemma 300m. This model supports
sliding window attention (SWA), and a new swa_type is introduced to
support symmetric SWA masking.
This commit also factors the masking logic out into the function
llama_is_masked_swa in llama-impl.h, so that the logic can be shared
by both llm_graph_input_attn_no_cache::set_input and
llama_kv_cache::set_input_kq_mask.
With this commit the EmbeddingGemma 300m model can be converted
to GGUF and used with llama.cpp.
Once the model has been uploaded to HuggingFace it can be used like
this:
```console
./build/bin/llama-cli -hf ggml-org/embeddinggemma-300m-GGUF:Q8_0
```
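A rough illustration of what symmetric SWA masking means compared to the
usual causal variant; this is a standalone sketch with simplified window
semantics, not the actual llama.cpp masking code:

```python
def symmetric_swa_masked(i: int, j: int, n_swa: int) -> bool:
    # Symmetric SWA (non-causal, as used for embeddings): position i may
    # attend to position j only if they are within the window either way.
    return abs(i - j) >= n_swa

def causal_swa_masked(i: int, j: int, n_swa: int) -> bool:
    # Standard causal SWA: no future tokens, and only a window into the past.
    return j > i or i - j >= n_swa

print(symmetric_swa_masked(2, 4, 3))  # False: within the window on both sides
print(causal_swa_masked(2, 4, 3))     # True: future tokens are always masked
```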
* gguf-py: implement byteswapping for Q4_0
This is needed to byteswap Mistral models.
Also restore original shapes after byteswapping tensors.
It is not needed at the moment, but do it in case
they are needed in the future.
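A sketch of the idea for a quantized block type, assuming the usual Q4_0
layout of a 2-byte f16 scale followed by 16 bytes of packed 4-bit values per
block of 32 weights (field names are illustrative):

```python
import numpy as np

# Q4_0 block: a float16 scale followed by 32 weights packed into 16 bytes.
q4_0 = np.dtype([("d", "<f2"), ("qs", "u1", 16)])

blocks = np.zeros(4, dtype=q4_0)   # stand-in for a real tensor's blocks
# Only the 2-byte scale is byteswapped; the packed nibbles are single bytes
# and must be left as-is.
blocks["d"] = blocks["d"].byteswap()
```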
* Rework byteswapping code in gguf-py
Move out details from byteswapping tensor blocks code
* convert : fix tensor naming conflict for llama 4 vision
* convert ok
* support kimi vision model
* clean up
* fix style
* fix calc number of output tokens
* refactor resize_position_embeddings
* add test case
* rename build fn
* correct a small bug
* wip lfm2 vision model
* Fix conv weight
* Implement dynamic resolution
* Fix cuda
* support LFM2-VL-450M
* happy CI
* Remove extra `ggml_conv` and put others into the right place
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Add support for Llada-8b: diffusion model
* Add README
* Fix README and convert_hf_to_gguf
* convert_hf_to_gguf.py: address review comments
* Make everything in a single example
* Remove model-specific sampling
* Remove unused argmax
* Remove braced initializers, improve README.md a bit
* Add diffusion specific gguf params in set_vocab, remove setting rope_theta and rms_norm_eps
* Remove adding the mask token
* Move add_add_bos_token to set_vocab
* use add_bool in gguf_writer.py
* support smallthinker
* support 20b softmax, 4b no sliding window
* new build_moe_ffn_from_probs, and can run 4b
* fix 4b rope bug
* fix python type check
* remove is_moe check
* remove set_dense_start_swa_pattern function and modify set_swa_pattern function
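A hedged sketch of what a sliding-window layer pattern boils down to: mark
which layers use SWA and which fall back to full attention, without a
separate "dense start" variant. The helper below is illustrative, not the
actual set_swa_pattern implementation:

```python
def swa_pattern(n_layers: int, n_pattern: int) -> list[bool]:
    # True means the layer uses sliding-window attention; every n_pattern-th
    # layer uses full attention (n_pattern <= 1 disables SWA entirely).
    if n_pattern <= 1:
        return [False] * n_layers
    return [(il % n_pattern) != (n_pattern - 1) for il in range(n_layers)]

print(swa_pattern(8, 4))  # [True, True, True, False, True, True, True, False]
```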
* trim trailing whitespace
* remove get_vocab_base of SmallThinkerModel in convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* better whitespace
Apply suggestions from code review
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* use GGML_ASSERT for expert count validation
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Improve null pointer check for probs
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* use template parameter for SWA attention logic
* better whitespace
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* move the creation of inp_out_ids before the layer loop
* remove redundant check for probs
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* imatrix : allow processing multiple chunks per batch
* perplexity : simplify filling the batch
* imatrix : fix segfault when using a single chunk per batch
* imatrix : use GGUF to store imatrix data
* imatrix : fix conversion problems
* imatrix : use FMA and sort tensor names
* py : add requirements for legacy imatrix convert script
* perplexity : revert changes
* py : include imatrix converter requirements in toplevel requirements
* imatrix : avoid using designated initializers in C++
* imatrix : remove unused n_entries
* imatrix : allow loading mis-ordered tensors
Sums and counts tensors no longer need to be consecutive.
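For context, the imatrix data boils down to per-tensor sums of squared
activations plus counts, and the mean is recomputed when loading; matching
the two by tensor name is what makes the on-disk order irrelevant. A loose
sketch, with suffix naming that is only illustrative:

```python
import numpy as np

def load_imatrix(tensors: dict[str, np.ndarray]) -> dict[str, np.ndarray]:
    # Pair *.in_sum2 with *.counts by name, regardless of file order.
    means = {}
    for name, sums in tensors.items():
        if not name.endswith(".in_sum2"):
            continue
        base = name.removesuffix(".in_sum2")
        means[base] = sums / tensors[base + ".counts"]
    return means

data = {
    "blk.0.ffn_down.weight.counts":  np.array([2.0]),
    "blk.0.ffn_down.weight.in_sum2": np.array([8.0]),
}
print(load_imatrix(data))  # {'blk.0.ffn_down.weight': array([4.])}
```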
* imatrix : more sanity checks when loading multiple imatrix files
* imatrix : use ggml_format_name instead of std::string concatenation
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
* quantize : use unused imatrix chunk_size with LLAMA_TRACE
* common : use GGUF for imatrix output by default
* imatrix : two-way conversion between old format and GGUF
* convert : remove imatrix to gguf python script
* imatrix : use the function name in more error messages
* imatrix : don't use FMA explicitly
This should make comparisons between the formats easier
because this matches the behavior of the previous version.
* imatrix : avoid returning from void function save_imatrix
* imatrix : support 3d tensors with MUL_MAT
* quantize : fix dataset name loading from gguf imatrix
* common : move string_remove_suffix from quantize and imatrix
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* imatrix : add warning when legacy format is written
* imatrix : warn when writing partial data, to help guess dataset coverage
Also make the legacy format store partial data
by using neutral values for missing data.
This matches what is done at read-time for the new format,
and so should get the same quality in case the old format is still used.
* imatrix : avoid loading model to convert or combine imatrix
* imatrix : avoid using imatrix.dat in README
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Support diffusion models: Add Dream 7B
* Move diffusion to examples
* Move stuff to examples. Add patch to not use kv-cache
* Address review comments
* Make sampling fast
* llama: remove diffusion functions
* Add basic timings + cleanup
* More cleanup
* Review comments: better formatting, use LOG instead of std::cerr, re-use batch, use ubatch instead of max_length
* fixup!
* Review: move everything to diffusion-cli for now
* Add PLaMo-2 model using hybrid memory module
* Fix z shape
* Add cmath to include from llama-vocab.h
* Explicitly dequantize normalization weights before RoPE apply
* Revert unnecessary cast because the problem can be solved by excluding attn_k, attn_q when quantizing
* Use ATTN_K/Q_NORM for k,q weights to prevent quantization
* Remove SSM_BCDT that is not used from anywhere
* Do not duplicate embedding weights for output.weight
* Fix tokenizer encoding problem for multibyte strings
* Apply suggestion from @CISC
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Use LLM_FFN_SWIGLU instead of splitting ffn_gate and ffn_up
* Remove unnecessary part for Grouped Query Attention
* Fix how to load special token id to gguf
* Remove unused tensor mapping
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Remove llama_vocab_plamo2 class and replace it with llm_tokenizer_plamo2_session to follow the other tokenizer implementations
* Update src/llama-vocab.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Fix plamo2 tokenizer session to prevent multiple calls of build()
---------
Co-authored-by: Francis Couture-Harpin <git@compilade.net>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* wip: llama : separate recurrent states from the KV cache
This will be necessary to support Jamba
(and other recurrent models mixed with Attention).
Doesn't compile yet, and finding a slot isn't yet done correctly for recurrent states.
* llama : use std::find for seq_nodes in llama_rs_cache
* llama : state checkpoints for recurrent models
* llama : correctly handle more edge cases for the rs cache
* llama : rename many llama_kv_cache_* functions
* llama : remove useless return value for some llama_cache_* functions
* llama : rethink recurrent state cell counts
* llama : begin work on support for variable GQA
This will also be useful for Jamba if we consider the Mamba layers
to have 0 KV heads.
* llama : gracefully fail when not finding hybrid slot
* llama : support Jamba
* llama : fix BERT inference without KV cache
* convert-hf : check for unprocessed Jamba experts
* convert-hf : support Mini-Jamba conversion
* llama : fix Jamba quantization sanity checks
* llama : sequence-length-aware batch splitting
* llama : use equal-sequence-length sub-batches for recurrent models
* ggml : simplify SSM-related operators
* llama : make recurrent state slot allocation contiguous
* llama : adapt internal uses of batches to llama_ubatch
* llama : fix batch split output count for embeddings
* llama : minimize swaps when reordering logits
This reduces overhead when running hellaswag
on thousands of sequences with very small 100k params Mamba models.
* llama : fix edge case finding batch seq_id of split recurrent cell
This otherwise was a problem when running the HellaSwag benchmark
with small batch sizes, making it crash.
* llama : avoid copies for simple batch splits
* llama : use im2col and mul_mat to perform convolution for Mamba
This removes the need for ggml_ssm_conv!!!
But performance seems slightly worse on my system,
especially for prompt processing.
Maybe ggml_mul_mat isn't optimized for small row sizes?
More performance testing is necessary until GGML_OP_SSM_CONV is removed.
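A small NumPy sketch of the equivalence being exploited here: a causal
depthwise 1D convolution can be written as an im2col-style gather followed
by a per-channel multiply-and-sum, which is what expressing it through
mul_mat amounts to. Shapes and names are illustrative:

```python
import numpy as np

d_conv, d_inner, n_tokens = 4, 3, 6
x = np.random.rand(d_inner, n_tokens)   # per-channel activations
w = np.random.rand(d_inner, d_conv)     # one kernel per channel (depthwise)

# im2col: for each output position, gather the d_conv most recent inputs
# (zero-padded on the left so the convolution stays causal).
padded = np.pad(x, ((0, 0), (d_conv - 1, 0)))
cols = np.stack([padded[:, t:t + d_conv] for t in range(n_tokens)], axis=1)
# cols has shape (d_inner, n_tokens, d_conv); the conv is now a dot product
# of each window with its channel's kernel.
y = np.einsum("ctk,ck->ct", cols, w)

# Reference: direct causal depthwise convolution.
ref = np.stack([np.convolve(x[c], w[c][::-1])[:n_tokens] for c in range(d_inner)])
assert np.allclose(y, ref)
```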
* ggml : make ggml_ssm_scan not modify its source tensors
* llama : fix shared recurrent tail cell count for small ubatch sizes
Otherwise it was impossible to run the 'parallel' example with '-ub 1'
with a Mamba or Jamba model.
* llama : fix .base() compilation error on Windows
* llama : allow doing the equivalent of SSM_CONV with SUM_ROWS and MUL
* ggml : allow GGML_OP_CONCAT to work on non-contiguous tensors
The implementation already supported it,
and this makes Mamba's conv step slightly faster.
* llama : rename llama_cache to llama_past
This can be changed back later if the name change is wrong.
I was renaming the functions anyway to generalize kv-cache-related
functions to hybrid and recurrent model architectures.
I think llama_past is a better name than llama_cache for a combined
kv cache and recurrent state cache, because the states it contains
pretty much always come before the newly-added ones for any particular
sequence. Also 'llama_past_clear' sounds more obvious in what it does
than 'llama_kv_cache_clear'. The future is what the models generate.
(For embeddings, the kv cache isn't really used anyway)
Still, I'm open to better suggestions.
* examples : replace llama_kv_cache_seq_* with llama_past_seq_*
* mamba : fix non-contiguous usage of ggml_silu
* llama : initial Mamba-2 support
* ggml : SIMD ggml_ssm_scan for Mamba-2
* ggml : improve ggml_mul speed when masking recurrent states
* llama : support running Mamba-Codestral-7B-v0.1
* llama : fix Mamba-2 conv state saving
* ggml : make the ggml_mul fast broadcast path more consistently formatted
* llama : remove unused variable
* llama : add missing break
* convert_hf : prefer SentencePiece tokenizer for Mamba-2 when present
The tokenizer.json of Mamba-Codestral-7B-v0.1 otherwise requires
workarounds to work correctly.
* llama : session saving and reloading for hybrid models
* convert_hf : fix Jamba conversion
* llama : fix mixed signedness comparison
* llama : use unused n_embd_k_gqa in k_shift
This also slightly reduces the diff from the master branch
* llama : begin renaming llama_past back to llama_kv_cache
* llama : avoid redundant state copy for Mamba 1 and 2
* metal : attempt to adapt SSM_SCAN for Mamba-2
* metal : fix SSM_SCAN pipeline scope
* metal : use log and exp instead of log1pf and expf in SSM_SCAN
* metal : remove unused arguments for SSM_SCAN
The max index is 31, so trimming the arguments is necessary.
* metal : add back n_seqs to SSM_SCAN args
Whoops, this is needed for the offset in the concatenated output.
* metal : fix SSM_SCAN state head offset
* metal : fix wrong number of tokens per sequence in SSM_SCAN
* ggml : remove unused fast broadcast path in GGML_MUL
This was initially added because states were masked with ggml_mul,
but this is no longer done and so this "optimisation" is no longer
necessary, or at least not worth the additional code complexity.
* ggml : avoid multiply by D in GGML_OP_SSM_SCAN
This makes the weight buft detection in src/llama.cpp simpler.
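For reference, in the usual selective-scan formulation D is just a
per-channel skip connection (y = C·h + D⊙x), so it can be added outside the
scan op. A toy, heavily simplified sketch rather than the ggml kernel:

```python
import numpy as np

d_state, d_inner, n_tokens = 4, 3, 5
A = -np.random.rand(d_inner, d_state)    # decay parameters
B = np.random.rand(n_tokens, d_state)
C = np.random.rand(n_tokens, d_state)
D = np.random.rand(d_inner)              # skip-connection weights
x = np.random.rand(d_inner, n_tokens)
dt = np.random.rand(d_inner, n_tokens)   # per-token step sizes

h = np.zeros((d_inner, d_state))
y = np.zeros_like(x)
for t in range(n_tokens):
    h = np.exp(dt[:, t, None] * A) * h + dt[:, t, None] * B[t] * x[:, t, None]
    y[:, t] = h @ C[t]

y += D[:, None] * x   # D applied outside the scan, as described above
```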
* convert : transpose Mamba-2 A, D and reshape SSM_NORM
This breaks existing conversions of Mamba-2 models
to avoid some reshapes.
Not sure if it's a good idea,
but it makes the graph slightly cleaner.
* llama : more appropriate SSM_SCAN and SSM_CONV buft support checks
* convert : fix flake8 lint
* llama : remove implicit recurrent state rollbacks
* llama : partially apply clang-format style
* metal : fix confusion between ; and ,
* metal : add missing args for nb references in ssm_scan_f32_group
* metal : single-user mamba2 inference works
* kv-cache : remove const_cast when setting inputs for s_copy
And also fix multi-user inference for recurrent models
by using cell_id instead of i as the kv cell index
when populating s_copy.
* convert : avoid AutoConfig for Mamba and Mamba2 hparams
* kv-cache : allow context shift for recurrent models
* graph : fix recurrent state copies when avoiding copies
Works, but using lambda functions might not be that clean.
* ggml : fix mamba2 ssm scan when compiled with SVE
* ggml-cpu : reorder SVE FMA for consistency with other SIMD arches
* cuda : implement ssm scan for Mamba2
There is still room for improvement, but it works!
* cuda : adapt Mamba1 ssm scan to shape changes from Mamba2
* feat: Add conversion for Bamba models
This is borrowed and adapted from the original implementation
https://github.com/ggml-org/llama.cpp/pull/10810
Branch: GraniteFour
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add Granite 4 conversion
This is a manual copy from my draft branch
https://github.com/gabe-l-hart/llama.cpp/blob/GraniteFourDraft/convert_hf_to_gguf.py#L5076
Branch: GraniteFour
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Plumb bamba through llama-arch
Branch: GraniteFour
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add bamba to llama_arch_is_hybrid_recurrent
Branch: GraniteFour
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add optional mamba ssm_in bias tensor
Branch: GraniteFour
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add template specialization for get_arr to load a vector<uint32_t> for layer index arr in hparams
Branch: GraniteFour
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Use an explicit bool to determine mamba vs mamba2
This allows other architectures like bamba and granitemoehybrid to use
mamba2 without a growing architecture `if` statement inside the mamba
implementation.
Branch: GraniteFour
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Isolate mamba(2) and granite attention layer building in static methods
This will allow these layer-builder methods to be used from other build
structs without complex inheritance.
Branch: GraniteFour
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Use per-layer sizes in granite build_attention_layer
Also no need to pass in kv cache since it's already in the inp_attn
Branch: GraniteFour
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: First (broken) pass at end-to-end Bamba implementation
It generates (garbage) tokens! Still lots of debugging to do.
Branch: GraniteFour
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Only do Granite multipliers if set
Branch: GraniteFour
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: Pull granite ffn portion into a static function and reuse in hybrid
Branch: GraniteFour
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat(py): Allow gguf duplicate keys if they match by value and type
This is helpful for hybrid models that want to do gguf param setting by
calling multiple parent classes without needing to make those parent
classes try/except on every attempt to set a gguf value.
Branch: GraniteFour
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor(py): Simplify granitemoehybrid conversion to use parents better
Branch: GraniteFour
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add GRANITE_MOE_HYBRID through llama-arch
Branch: GraniteFour
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Support GRANITE_MOE_HYBRID in llama-model
This re-uses the Bamba code paths heavily and simply adds the missing parts
for loading MoE and the shared expert.
Branch: GraniteFour
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* style: Fix flake8 errors
Branch: GraniteFour
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Fix recurrent cache get after rebase
Branch: GraniteFour
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Fix hybrid granite implementation for signature changes in build_mamba*_layer
Branch: GraniteFour
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: Refactor relationship between non-hybrid classes and hybrid impl to use mixins
The challenge here is to give both the non-hybrid classes (llm_build_mamba
and llm_build_granite) AND the hybrid class (llm_build_hybrid_mamba) access
to the same intermediate "base class" functionality (build_mamba*_layer,
build_granite_attention_layer) without running into trouble with diamond
inheritance of llm_graph_context. Due to the non-trivial initialization
that happens in llm_graph_context, diamond inheritance results in multiple
initializations of the common base which cause problems around the unique
ptrs. I wanted to get away from `self->` everywhere, but this is still a
bit cleaner than making those methods static I think.
Branch: GraniteFour
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: Implement the full copy-paste version to duplicate the layer builders
This follows the pattern where the type of input is pinned to the type of
memory and that is used to dispatch to the correct version of `build_rs` /
`build_attn`. There's a lot of code duplication that can hopefully be
pulled into common functions in the graph later.
Branch: GraniteFour
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: Rename llm_build_hybrid_mamba -> llm_build_granite_hybrid
I've gone back and forth a lot about how/whether to try to implement reuse of the
"child model" layer types for hybrid models. At the end of the day, I think
hybrid models are their own beast and even if their layers are inspired by
other models, they should maintain control of their own layer building (in
other words, the copy-paste method). Given that, the name should reflect
that this is not a generic hybrid model builder, but rather a granite-
specific hybrid model builder that can do MoE (granite 4) or dense (bamba).
As part of this, I also cleaned up dangling comments from previous attempts
at using static methods for reusability.
Branch: GraniteFour
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* mamba : fix mismatched new and delete size for llm_build_mamba
Subclasses of llm_graph_context cannot have extra fields,
because the called destructor is not the one from the subclass.
This otherwise would cause problems when running Mamba-(1|2) inference
when compiled -DGGML_SANITIZE_ADDRESS=ON
* memory : correctly handle failure in apply()
ggml-ci
* style: Remove TODO for adding first hybrid models to the switch
Branch: GraniteFour
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Fix bad merge in tensor_mapping.py w/ SSM_NORM
Branch: GraniteFour
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Fix bad merge resolution with variable renames/moves in llm_build_mamba
Branch: GraniteFour
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* docs: Fix comment about duplicate key check
Branch: GraniteFour
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Conform to standard way of initializing inp_out_ids
Branch: GraniteFour
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* convert : fix jamba conv1d shape squeezing
* fix: Fix input initialization in granite_hybrid after removal of hybrid inputs
Branch: GraniteFourWithJamba
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Use llm_graph_context_mamba in llm_build_granite_hybrid
Branch: GraniteFourWithJamba
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: Refactor mamba2/granite/jamba/granite_hybrid relationships as mixins
The key is for the mixin classes (llm_graph_context_mamba,
llm_graph_context_granite) to use virtual inheritance from
llm_graph_context. This allows the common members to exist only once in the
class hierarchy. The downside is that llm_graph_context will be
re-initialized once for each parent (i.e. 2x for a single mixin, 3x for two
mixins, etc...).
Branch: GraniteFourWithJamba
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* graph : add back hybrid memory graph input
But this time it contains the sub-cache graph inputs.
This *should* make it easier to handle updating the inputs
when caching the graph (eventually).
* model : add Jamba to Mamba-specific hparams printing
* fix: Fix input setup after upstream merge
Branch: GraniteFour
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* jamba : remove redundant nullptr initializations
* model : remove unnecessary prefix for tensor loading constants
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* model : use ggml_swiglu_split for Mamba
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* feat: Add support for dense FFN in GraniteMoeHybrid
This was already partially supported via reusing the granite ffn builder,
and there may be models that leverage this architecture going forward. The
naming is a bit odd, but in the transformers version, it reuses the same
model class and simply has zero regular experts and a single shared expert
(which is the same as a single dense FFN).
Branch: GraniteFour
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add support for dense FFN tensor names on c++ side
Branch: GraniteFour
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Use child inputs for Falcon H1 after merge resolution
Branch: GraniteFour
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Remove unnecessary prefix on tensor constants
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* model : make falcon-h1 use shared mamba2 layer builder
* memory : avoid referring to KV in recurrent cache logs
* fix: Revert order changes for Falcon H1 to stay consistent with upstream
Branch: GraniteFour
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* gguf-py : avoid adding duplicate tensor mappings for Jamba
Some of the tensor names are common with Llama4
* refactor: Collapse Bamba and GraniteMoeHybrid into GraniteHybrid
The only key difference is the use of rope which is now set via
rope_finetuned in the hparams
Branch: GraniteFour
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: Remove use of diamond inheritance
Per PR discussion, it's simpler to keep this with basic inheritance and not
introduce the complexity of virtual inheritance and multiple inheritance
https://github.com/ggml-org/llama.cpp/pull/13550#issuecomment-3053787556
Branch: GraniteFour
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Log mamba params for Granite Hybrid
Branch: GraniteFour
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Remove unused ssm_in_b
Branch: GraniteFour
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: Remove ATTENTION_LAYER_INDICES hparam in favor of n_head_kv
This matches how recurrent vs attention heads are identified for Jamba
Branch: GraniteFour
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Remove unused template expansion for get_arr
Branch: GraniteFour
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Review cleanup in convert_hf_to_gguf
The gist is to be explicit about which base class is being used with the
multiple inheritance setup
Branch: GraniteFour
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
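A toy example of the kind of explicitness meant here: with multiple
inheritance in the converter classes, calling a specific parent's method by
name avoids relying on whatever the MRO happens to resolve. The class names
below are made up:

```python
class TextModelBase:
    def set_gguf_parameters(self):
        print("text params")

class MambaModelBase:
    def set_gguf_parameters(self):
        print("mamba params")

class HybridModel(TextModelBase, MambaModelBase):
    def set_gguf_parameters(self):
        # Be explicit about which base class is used, rather than super(),
        # so the intent survives changes to the inheritance order.
        TextModelBase.set_gguf_parameters(self)
        MambaModelBase.set_gguf_parameters(self)

HybridModel().set_gguf_parameters()   # text params, then mamba params
```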
* fix: Undo hidden warnings about duplicate identical keys in add_key_value
After further discussion, this encourages sloppy overwriting in the model
converters
Branch: GraniteFour
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: If not using ROPE, context is "infinite"
Branch: GraniteFour
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* doc: Add a comment outlining expected duplicate key warnings
Branch: GraniteFour
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Remove unnecessary duplicate keys in converter
Co-authored-by: Francis Couture-Harpin <git@compilade.net>
(thanks for the sharp eyes and patience!)
Branch: GraniteFour
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
---------
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Francis Couture-Harpin <git@compilade.net>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* wip: llama : separate recurrent states from the KV cache
This will be necessary to support Jamba
(and other recurrent models mixed with Attention).
Doesn't compile yet, and finding a slot isn't yet done correctly for recurrent states.
* llama : use std::find for seq_nodes in llama_rs_cache
* llama : state checkpoints for recurrent models
* llama : correctly handle more edge cases for the rs cache
* llama : rename many llama_kv_cache_* functions
* llama : remove useless return value for some llama_cache_* functions
* llama : rethink recurrent state cell counts
* llama : begin work on support for variable GQA
This will also be useful for Jamba if we consider the Mamba layers
to have 0 KV heads.
* llama : gracefully fail when not finding hybrid slot
* llama : support Jamba
* llama : fix BERT inference without KV cache
* convert-hf : check for unprocessed Jamba experts
* convert-hf : support Mini-Jamba conversion
* llama : fix Jamba quantization sanity checks
* llama : sequence-length-aware batch splitting
* llama : use equal-sequence-length sub-batches for recurrent models
* ggml : simplify SSM-related operators
* llama : make recurrent state slot allocation contiguous
* llama : adapt internal uses of batches to llama_ubatch
* llama : fix batch split output count for embeddings
* llama : minimize swaps when reordering logits
This reduces overhead when running hellaswag
on thousands of sequences with very small 100k params Mamba models.
* llama : fix edge case finding batch seq_id of split recurrent cell
This otherwise was a problem when running the HellaSwag benchmark
with small batch sizes, making it crash.
* llama : avoid copies for simple batch splits
* ggml : make ggml_ssm_scan not modify its source tensors
* llama : fix shared recurrent tail cell count for small ubatch sizes
Otherwise it was impossible to run the 'parallel' example with '-ub 1'
with a Mamba or Jamba model.
* llama : fix .base() compilation error on Windows
* llama : allow doing the equivalent of SSM_CONV with SUM_ROWS and MUL
* ggml : allow GGML_OP_CONCAT to work on non-contiguous tensors
The implementation already supported it,
and this makes Mamba's conv step slightly faster.
* mamba : fix non-contiguous usage of ggml_silu
* llama : session saving and reloading for hybrid models
* convert_hf : fix Jamba conversion
* llama : fix mixed signedness comparison
* llama : use unused n_embd_k_gqa in k_shift
This also slightly reduces the diff from the master branch
* llama : begin renaming llama_past back to llama_kv_cache
* llama : remove implicit recurrent state rollbacks
* llama : partially apply clang-format style
* convert : fix jamba conv1d shape squeezing
* graph : add back hybrid memory graph input
But this time it contains the sub-cache graph inputs.
This *should* make it easier to handle updating the inputs
when caching the graph (eventually).
* model : add Jamba to Mamba-specific hparams printing
* jamba : remove redundant nullptr initializations
* model : remove unnecessary prefix for tensor loading constants
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* model : use ggml_swiglu_split for Mamba
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* model : make falcon-h1 use shared mamba2 layer builder
* memory : avoid referring to KV in recurrent cache logs
* gguf-py : avoid adding duplicate tensor mappings for Jamba
Some of the tensor names are common with Llama4
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* llama : initial Mamba-2 support
* ggml : SIMD ggml_ssm_scan for Mamba-2
* ggml : improve ggml_mul speed when masking recurrent states
* llama : support running Mamba-Codestral-7B-v0.1
* llama : fix Mamba-2 conv state saving
* ggml : make the ggml_mul fast broadcast path more consistently formatted
* llama : remove unused variable
* llama : add missing break
* convert_hf : prefer SentencePiece tokenizer for Mamba-2 when present
The tokenizer.json of Mamba-Codestral-7B-v0.1 otherwise requires
workarounds to work correctly.
* llama : avoid redundant state copy for Mamba 1 and 2
* metal : attempt to adapt SSM_SCAN for Mamba-2
* metal : fix SSM_SCAN pipeline scope
* metal : use log and exp instead of log1pf and expf in SSM_SCAN
* metal : remove unused arguments for SSM_SCAN
The max index is 31, so trimming the arguments is necessary.
* metal : add back n_seqs to SSM_SCAN args
Whoops, this is needed for the offset in the concatenated output.
* metal : fix SSM_SCAN state head offset
* metal : fix wrong number of tokens per sequence in SSM_SCAN
* ggml : remove unused fast broadcast path in GGML_MUL
This was initially added because states were masked with ggml_mul,
but this is no longer done and so this "optimisation" is no longer
necessary, or at least not worth the additional code complexity.
* ggml : avoid multiply by D in GGML_OP_SSM_SCAN
This makes the weight buft detection in src/llama.cpp simpler.
* convert : transpose Mamba-2 A, D and reshape SSM_NORM
This breaks existing conversions of Mamba-2 models
to avoid some reshapes.
Not sure if it's a good idea,
but it makes the graph slightly cleaner.
* llama : more appropriate SSM_SCAN and SSM_CONV buft support checks
* convert : fix flake8 lint
* metal : fix confusion between ; and ,
* metal : add missing args for nb references in ssm_scan_f32_group
* metal : single-user mamba2 inference works
* kv-cache : remove const_cast when setting inputs for s_copy
And also fix multi-user inference for recurrent models
by using cell_id instead of i as the kv cell index
when populating s_copy.
* convert : avoid AutoConfig for Mamba and Mamba2 hparams
* kv-cache : allow context shift for recurrent models
* graph : fix recurrent state copies when avoiding copies
Works, but using lambda functions might not be that clean.
* ggml : fix mamba2 ssm scan when compiled with SVE
* ggml-cpu : reorder SVE FMA for consistency with other SIMD arches
* cuda : implement ssm scan for Mamba2
There is still room for improvement, but it works!
* cuda : adapt Mamba1 ssm scan to shape changes from Mamba2
* mamba : fix mismatched new and delete size for llm_build_mamba
Subclasses of llm_graph_context cannot have extra fields,
because the called destructor is not the one from the subclass.
This otherwise would cause problems when running Mamba-(1|2) inference
when compiled -DGGML_SANITIZE_ADDRESS=ON
* cuda : graceful fallback for Mamba-1 models with weird embd size