9 Commits

Author SHA1 Message Date
Yakine Tahtah
a00e47e422 mtmd: add granite-speech support (ibm-granite/granite-4.0-1b-speech) (#22101)
* mtmd: add granite-speech support (ibm-granite/granite-4.0-1b-speech)

Conformer encoder with Shaw relative position encoding,
QFormer projector, log-mel spectrogram with frame stacking.

Encoder uses GLU gating, folded batch norm, and SSM depthwise
conv. QFormer compresses encoder output via windowed
cross-attention (window=15, queries=3) into the LLM embedding
space.

Audio preprocessing: reflect-padded STFT, 80-bin mel filterbank,
dynamic range compression, 2x frame stacking (80->160 mel).
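
The 2x frame stacking step (80->160 mel) can be read as concatenating each pair of adjacent 80-bin frames into one 160-value frame. A minimal sketch in Python, not the code from this commit: `stack_frames` is a hypothetical name, and dropping a trailing odd frame is an assumption (the real implementation may pad instead).

```python
def stack_frames(frames, factor=2):
    """Concatenate every `factor` consecutive frames into one wider frame.

    frames: list of equal-length lists (one list of mel bins per frame).
    Assumption: a trailing partial group is dropped, not padded.
    """
    n_groups = len(frames) // factor
    stacked = []
    for g in range(n_groups):
        merged = []
        for k in range(factor):
            merged.extend(frames[g * factor + k])
        stacked.append(merged)
    return stacked


# Five frames of 80 mel bins -> two stacked frames of 160 values each.
mel = [[float(t)] * 80 for t in range(5)]
out = stack_frames(mel)
```

Stacking halves the frame rate seen by the encoder while keeping every mel value of the frames it consumes.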

GGUF converter handles batch norm folding at export time,
fused K/V split, and Conv1d weight reshaping.
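
Batch norm folding at export time usually means absorbing the normalization's per-channel scale and shift into the preceding layer's weight and bias, so inference needs no separate batch norm op. A minimal sketch of that arithmetic, assuming a plain linear layer and a default `eps` of 1e-5; the function names are hypothetical and the converter's actual tensor layout is not shown in this commit.

```python
import math


def fold_batch_norm(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold gamma * (W x + b - mean) / sqrt(var + eps) + beta into W', b'.

    weight: list of rows, one row per output channel.
    bias, gamma, beta, mean, var: one value per output channel.
    eps: assumed default; use the model's own value in practice.
    """
    folded_w, folded_b = [], []
    for row, b, g, bt, mu, v in zip(weight, bias, gamma, beta, mean, var):
        scale = g / math.sqrt(v + eps)
        folded_w.append([w * scale for w in row])
        folded_b.append((b - mu) * scale + bt)
    return folded_w, folded_b


def linear(weight, bias, x):
    """Apply y = W x + b for a single input vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]
```

After folding, `linear(folded_w, folded_b, x)` gives the same result as applying the original layer followed by batch norm in inference mode.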

Tested against HF transformers reference: token-for-token match
on 30s/60s audio clips with greedy decoding.

* mtmd: rename gs_ prefixed tensors to generic/architecture names

* mtmd: use tensor_mapping.py for all granite_speech tensors

* convert: fold GraniteSpeechTextModel into GraniteModel

* mtmd: replace n_layer hack with explicit has_standard_layers flag

* mtmd: replace hardcoded magic numbers with GGUF hparams for granite speech

* mtmd: align KEY_A_ define spacing

* convert: register GraniteModel for GraniteSpeechForConditionalGeneration

* convert: fix ty type-check for GraniteSpeechMmprojModel registration

* mtmd: align TN_ define spacing

* mtmd: use generic layer loop for granite speech tensor loading

* mtmd: merge qformer_proj_layer into clip_layer

* mtmd: granite_speech remove redundant ggml_build_forward_expand on inputs

* mtmd: granite_speech add comment explaining why build_attn is not used

* mtmd: granite_speech hard-code eps in cpp, remove from GGUF metadata

* gguf: add spacing between granite_speech tensor mapping blocks

* mtmd: make generic audio layer_norm_eps read optional

* mtmd: granite_speech keep encoder eps in GGUF, only hard-code projector eps

* mtmd: align defines and struct fields in clip-impl.h and clip-model.h

* mtmd: fix alignment and ordering issues across granite speech files

* convert: granite_speech use filter_tensors instead of modify_tensors for skipping
2026-05-06 14:40:59 +02:00
Stephen Cox
547765a93e mtmd: add Gemma 4 audio conformer encoder support (#21421)
* mtmd: add Gemma 4 audio conformer encoder support

Add audio processing for Gemma 4 E2B/E4B via a USM-style Conformer.

Architecture:
- 12-layer Conformer: FFN → Self-Attention → Causal Conv1D → FFN → Norm
- Subsampling Conv Projection: 2x Conv2D(stride=2) with LayerNorm
- Full self-attention with sinusoidal RPE and sliding window mask (24)
- Logit softcapping at 50.0, ClippableLinear clamping
- Output: 1024 → 1536 → RMSNorm → multimodal embedder
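
For readers unfamiliar with logit softcapping: a common formulation squashes each logit through a scaled tanh so its magnitude never exceeds the cap. The sketch below assumes that formulation with the cap of 50.0 listed above; it is illustrative Python, not the ggml graph code from this commit.

```python
import math


def softcap(logits, cap=50.0):
    """Softly bound each logit to (-cap, cap) via cap * tanh(x / cap)."""
    return [cap * math.tanh(x / cap) for x in logits]
```

Small logits pass through almost unchanged, while very large ones saturate near the cap, which keeps the subsequent softmax numerically well behaved.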

Mel preprocessing (dedicated mtmd_audio_preprocessor_gemma4a):
- HTK mel scale, 128 bins, magnitude STFT, mel_floor=1e-3
- Standard periodic Hann window (320 samples), zero-padded to FFT size
- Semicausal left-padding (frame_length/2 samples)
- Frame count matched to PyTorch (unfold formula)
- No pre-emphasis, no Whisper-style normalization
- Mel cosine similarity vs PyTorch: 0.9998
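
Two of the preprocessing ingredients above have standard closed forms: the HTK mel scale and the periodic Hann window. A minimal sketch of both in Python follows; the function names are hypothetical, and FFT size and padding placement are deliberately left out because the commit message does not state them.

```python
import math


def hz_to_mel_htk(freq_hz):
    """HTK mel scale: mel = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz_htk(mel):
    """Inverse of hz_to_mel_htk."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def periodic_hann(length):
    """Periodic Hann window: w[n] = 0.5 - 0.5 * cos(2*pi*n / N), n = 0..N-1.

    The denominator is N (periodic form), not N - 1 (symmetric form).
    """
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / length)
            for n in range(length)]


window = periodic_hann(320)  # 320 samples, as listed above
```

The periodic form differs from the symmetric one only in the denominator, but mixing them up is a frequent source of small mel mismatches against a reference implementation.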

Key fixes:
- Tensor loading dedup: prevent get_tensor() from creating duplicate
  entries in ctx_data. Fixed with std::set guard.
- ClippableLinear clamp_info loading moved after per-layer tensors.
- Sliding window mask (24 positions) matching PyTorch context_size.
- Skip Whisper normalization for Gemma4 mel output.
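
The tensor loading dedup fix follows the usual "seen set" pattern: remember which names were already registered and skip repeats. The sketch below shows the idea in Python rather than the C++ of the actual fix; `TensorRegistry` and its fields are hypothetical and do not mirror the loader's real data structures.

```python
class TensorRegistry:
    """Track requested tensor names without creating duplicate entries."""

    def __init__(self):
        self.entries = []    # ordered names scheduled for loading
        self._seen = set()   # guard: names already registered

    def get_tensor(self, name):
        # Only the first request for a name adds an entry.
        if name not in self._seen:
            self._seen.add(name)
            self.entries.append(name)
        return name
```

Requesting the same name twice leaves a single entry, which is the property the `std::set` guard is described as restoring.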

Tested on E2B and E4B with CPU and Vulkan backends.
Transcribes: "Glad to see things are going well and business is starting
to pick up" (matching ground truth).

Ref: #21325
2026-04-12 14:15:26 +02:00
Xuan-Son Nguyen
871f1a2d2f mtmd: add more sanity checks (#21047)
2026-03-27 11:00:52 +01:00
Marcel Petrick
92f7da00b4 chore : correct typos [no ci] (#20041)
* fix(docs): correct typos found during code review

Non-functional changes only:
- Fixed minor spelling mistakes in comments
- Corrected typos in user-facing strings
- No variables, logic, or functional code was modified.

Signed-off-by: Marcel Petrick <mail@marcelpetrick.it>

* Update docs/backend/CANN.md

Co-authored-by: Aaron Teo <taronaeo@gmail.com>

* Revert "Auxiliary commit to revert individual files from 846d1c301281178efbc6ce6060ad34c1ebe45af8"

This reverts commit 02fcf0c7db661d5ff3eff96b2b2db9fdb7213256.

* Update tests/test-backend-ops.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update tests/test-backend-ops.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Signed-off-by: Marcel Petrick <mail@marcelpetrick.it>
Co-authored-by: Aaron Teo <taronaeo@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-03-05 08:50:21 +01:00
Tarek Dakhran
ccbc84a537 mtmd: mtmd_audio_streaming_istft (#18645)
Change is decoupled from https://github.com/ggml-org/llama.cpp/pull/18641.

[LFM2.5-Audio-1.5B](https://huggingface.co/LiquidAI/LFM2.5-Audio-1.5B)
needs streaming istft for generating output audio.

* add streaming ISTFT class (`mtmd_audio_streaming_istft`) with overlap-add for audio reconstruction
* replace the global audio cache with a per-instance cache; the model requires
  two independent caches, one for preprocessing (audio input) and one for ISTFT
  (audio output).
* unified templated FFT/IFFT implementation supporting both forward and inverse transforms
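
The overlap-add part of a streaming ISTFT can be sketched independently of the FFT: each inverse-transformed frame is summed into a running buffer, and one hop of finished samples is emitted per frame. The Python below is a simplified illustration, not the `mtmd_audio_streaming_istft` class itself. It assumes frames arrive already inverse-transformed and windowed, and it leaves window normalization out; `StreamingOverlapAdd` is a hypothetical name.

```python
class StreamingOverlapAdd:
    """Emit `hop` finished samples per pushed frame via overlap-add.

    Assumptions: each pushed frame is already inverse-transformed and
    windowed; window-sum normalization is handled elsewhere.
    """

    def __init__(self, frame_len, hop):
        if not 0 < hop <= frame_len:
            raise ValueError("hop must be in (0, frame_len]")
        self.frame_len = frame_len
        self.hop = hop
        self.buf = [0.0] * frame_len  # running sum of overlapping frames

    def push(self, frame):
        if len(frame) != self.frame_len:
            raise ValueError("unexpected frame length")
        # Add the new frame on top of the tails of earlier frames.
        for i, v in enumerate(frame):
            self.buf[i] += v
        # The first `hop` samples get no further contributions.
        ready = self.buf[:self.hop]
        # Shift the buffer left by one hop and zero the new tail.
        self.buf = self.buf[self.hop:] + [0.0] * self.hop
        return ready
```

Keeping the buffer as instance state is also what makes two independent streams possible, in the same spirit as the per-instance cache described above.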
2026-01-06 21:00:29 +01:00
Xuan-Son Nguyen
8ea958d4d9 model : add ASR support for LFM2-Audio-1.5B (conformer) (#18106)
* ASR with LFM2-Audio-1.5B

* Set rope_theta

* Fix comment

* Remove rope_theta setting

* Address PR feedback

* rename functions to conformer

* remove some redundant ggml_cont

* fix missing tensor

* add prefix "a." for conv tensors

* remove redundant reshape

* clean up

* add test model

---------

Co-authored-by: Tarek Dakhran <tarek@liquid.ai>
2025-12-19 00:18:01 +01:00
Xuan-Son Nguyen
96a181a933 mtmd: refactor audio preprocessing (#17978)
* mtmd: refactor audio preprocessing

* refactor

Co-authored-by: Tarek <tdakhran@users.noreply.github.com>

* wip

* wip (2)

* improve constructor

* fix use_natural_log

* fix padding for short input

* clean up

* remove need_chunking

---------

Co-authored-by: Tarek <tdakhran@users.noreply.github.com>
2025-12-15 14:16:52 +01:00
Xuan-Son Nguyen
10961339b2 mtmd : move helpers to dedicated library (⚠️ breaking change) (#13866)
* mtmd : move helpers to dedicated library

* fix server build

* rm leftover cmakelist code
2025-05-28 22:35:22 +02:00
Xuan-Son Nguyen
797990c4bc mtmd : add ultravox audio input (#13623)
* convert ok, load ok

* warmup ok

* test

* still does not work?

* fix padding

* temporarily give up

* fix merge conflict

* build_ultravox()

* rm test

* fix merge conflict

* add necessary mtmd APIs

* first working version (only 4s of audio)

* will this monster compile?

* fix compile

* please compile

* fPIC

* fix windows

* various fixes

* clean up audio_helpers

* fix conversion

* add some debug stuff

* long audio input ok

* adapt the api

* add --audio arg

* final touch UX

* add miniaudio to readme

* fix typo

* refactor kv metadata

* mtmd_default_marker()
2025-05-22 20:42:48 +02:00