llama.cpp

mirror of https://github.com/ggml-org/llama.cpp.git synced 2026-05-11 11:34:10 +00:00

Author	SHA1	Message	Date
Gilad S.	5207d120ea	model : don't crash on unsupported architecture (#22742 ) * model: don't crash on unsupported architecture * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> b9048	2026-05-06 18:51:21 +02:00
fl0rianr	a0101225bc	common: do not fit to unknown device memory (#22614 ) * common: do not fit to unknown device memory Signed-off-by: Florian Reinle <f.reinle@otec.de> * common: preserve host fallback for non-GPU fit devices Signed-off-by: Florian Reinle <f.reinle@otec.de> * common: keep unknown GPU fit memory at zero Signed-off-by: Florian Reinle <f.reinle@otec.de> --------- Signed-off-by: Florian Reinle <f.reinle@otec.de> b9047	2026-05-06 17:03:45 +02:00
Georgi Gerganov	a290ce6266	gguf-py : bump version to 0.19.0 (#22664 ) * gguf-py : bump version to 0.19.0 * bump poetry --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> gguf-v0.19.0	2026-05-06 14:46:14 +02:00
Yakine Tahtah	a00e47e422	mtmd: add granite-speech support (ibm-granite/granite-4.0-1b-speech) (#22101 ) * mtmd: add granite-speech support (ibm-granite/granite-4.0-1b-speech) Conformer encoder with Shaw relative position encoding, QFormer projector, log-mel spectrogram with frame stacking. Encoder uses GLU gating, folded batch norm, and SSM depthwise conv. QFormer compresses encoder output via windowed cross-attention (window=15, queries=3) into the LLM embedding space. Audio preprocessing: reflect-padded STFT, 80-bin mel filterbank, dynamic range compression, 2x frame stacking (80->160 mel). GGUF converter handles batch norm folding at export time, fused K/V split, and Conv1d weight reshaping. Tested against HF transformers reference: token-for-token match on 30s/60s audio clips with greedy decoding. * mtmd: rename gs_ prefixed tensors to generic/architecture names * mtmd: use tensor_mapping.py for all granite_speech tensors * convert: fold GraniteSpeechTextModel into GraniteModel * mtmd: replace n_layer hack with explicit has_standard_layers flag * mtmd: replace hardcoded magic numbers with GGUF hparams for granite speech * mtmd: align KEY_A_ define spacing * convert: register GraniteModel for GraniteSpeechForConditionalGeneration * convert: fix ty type-check for GraniteSpeechMmprojModel registration * mtmd: align TN_ define spacing * mtmd: use generic layer loop for granite speech tensor loading * mtmd: merge qformer_proj_layer into clip_layer * mtmd: granite_speech remove redundant ggml_build_forward_expand on inputs * mtmd: granite_speech add comment explaining why build_attn is not used * mtmd: granite_speech hard-code eps in cpp, remove from GGUF metadata * gguf: add spacing between granite_speech tensor mapping blocks * mtmd: make generic audio layer_norm_eps read optional * mtmd: granite_speech keep encoder eps in GGUF, only hard-code projector eps * mtmd: align defines and struct fields in clip-impl.h and clip-model.h * mtmd: fix alignment and ordering issues across granite speech files * convert: granite_speech use filter_tensors instead of modify_tensors for skipping b9045	2026-05-06 14:40:59 +02:00
David Huggins-Daines	750141969c	feat: migrate to PEP 621 and add uv support (#21907 ) * feat: migrate to PEP 621 and add uv support * fix: remove upper bound on protobuf * remove poetry.lock and uv.lock * fix/add torch dependency version and markers * fix dev-dependency deprecation warning * gguf-py : update python version requirement to 3.10 --------- Co-authored-by: David Huggins-Daines <dhd@dhd.ecolingui.ca> Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com>	2026-05-06 14:04:10 +02:00
Daniel Bevenius	a736e6c0ac	convert : ignore non-language tensors for Gemma4Model (#22753 ) * convert : ignore non-language tensors for Gemma4Model This commit adds a check to make sure only text language tensors are handled in filter_tensors. The motivation is that currently when trying to convert a Gemma4 model the following error occurs: ```console (venv) $ ./convert-gemma.sh INFO:hf-to-gguf:Loading model: gemma-4-E2B-it INFO:hf-to-gguf:Model architecture: Gemma4ForConditionalGeneration INFO:hf-to-gguf:gguf: indexing model part 'model.safetensors' INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only INFO:hf-to-gguf:Exporting model... INFO:hf-to-gguf:rope_freqs.weight, torch.float32 --> F32, shape = {256} Traceback (most recent call last): File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 13752, in <module> main() File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 13746, in main model_instance.write() File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 945, in write self.prepare_tensors() File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 805, in prepare_tensors for new_name, data_torch in (self.modify_tensors(data_torch, name, bid)): File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 7925, in modify_tensors yield from super().modify_tensors(data_torch, name, bid) File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 7290, in modify_tensors yield from super().modify_tensors(data_torch, name, bid) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 579, in modify_tensors new_name = self.map_tensor_name(name) ^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 572, in map_tensor_name raise ValueError(f"Can not map tensor {name!r}") ValueError: Can not map tensor 'model.embed_vision.embedding_projection.weight' ``` * add forgotten embed_vision and embed_audio * improve --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-05-06 13:50:44 +02:00
Aleksander Grygier	e3e3f8e46a	webui: Remove Google Favicons & Improve MCP Information logic & UI (#22719 ) * refactor: Remove Google favicon utility * fix: MCP Server favicon * refactor: Cleanup * refactor: MCP Server Information * fix: Fix MCP Settings UI * refactor: Cleanup	2026-05-06 11:12:27 +02:00
zzzzwc	f08f20a0e3	ggml-cpu: fuse RMS_NORM + MUL on CPU backend (#22423 ) b9041	2026-05-06 15:41:14 +08:00
viggy	07eaf919ed	add tabindex and aria-hidden (#22699 )	2026-05-06 09:21:58 +02:00
Sigbjørn Skjæret	74d6248f71	convert : add filter_tensors method to pre-filter tensors (#22597 ) * add filter_tensors classmethod * remove language_model * fix parts validation	2026-05-06 08:06:05 +02:00
fl0rianr	2ca1161bd7	ggml : use `CL_DEVICE_GLOBAL_MEM_SIZE` as memory estimate for OpenCL --fit (#22688 ) * ggml : report estimated OpenCL memory for --fit Signed-off-by: Florian Reinle <f.reinle@otec.de> * ggml : estimated OpenCL memory backend integrated Signed-off-by: Florian Reinle <f.reinle@otec.de> --------- Signed-off-by: Florian Reinle <f.reinle@otec.de> b9038	2026-05-05 22:12:48 -07:00
Trivikram Reddy	bbeb89d76c	Hexagon: Process M-tail rows on HMX instead of HVX (#22724 ) * hex-mm: process m-tail rows on HMX instead of HVX * hmx-mm: unroll and optimize padded activation loop --------- Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com> b9037	2026-05-05 09:43:03 -07:00
lhez	ff806a110d	opencl: refactor Adreno q4_0 (#22335 ) * opencl: refactor adreno q4_0 gemm/gemv dispatch * opencl: refactor q4_0 gemm/gemv loading, use consistent names * opencl: use consistent name for adreno q8_0 gemm/gemv * opencl: use consistent names for adreno q4_0 gemm/gemv * opencl: simplify adreno q4_0 set_tensor * opencl: refactor q4_0 get_tensor	2026-05-05 09:38:57 -07:00
Radoslav Gerganov	d5003b6e4d	rpc : use graph uid instead of graph cache (#22701 ) Store the last graph uid and compare against it to determine if the same graph is being computed.	2026-05-05 13:47:13 +03:00
Adrien Gallouët	2635ac76e8	common : fix missing-noreturn warnings when compiling with clang 21 (#22702 ) common/arg.cpp:3719:9: error: function 'operator()' could be declared with attribute 'noreturn' [-Werror,-Wmissing-noreturn] 3719 \| [](common_params & /params/, int /value/) { \| ^ common/arg.cpp:3726:9: error: function 'operator()' could be declared with attribute 'noreturn' [-Werror,-Wmissing-noreturn] 3726 \| [](common_params & /params/, int /value/) { \| ^ common/arg.cpp:3733:9: error: function 'operator()' could be declared with attribute 'noreturn' [-Werror,-Wmissing-noreturn] 3733 \| [](common_params & /params/, int /value/) { \| ^ common/arg.cpp:3740:9: error: function 'operator()' could be declared with attribute 'noreturn' [-Werror,-Wmissing-noreturn] 3740 \| [](common_params & /params/, int /value/) { \| ^ common/arg.cpp:3747:9: error: function 'operator()' could be declared with attribute 'noreturn' [-Werror,-Wmissing-noreturn] 3747 \| [](common_params & /params/, int /value/) { \| ^ Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-05-05 13:16:25 +03:00
Georgi Gerganov	70a8309114	sync : ggml b9033	2026-05-05 13:15:59 +03:00
Georgi Gerganov	c91faf997f	ggml : bump version to 0.11.0 (ggml/1478)	2026-05-05 13:15:59 +03:00
Adrien Gallouët	bf76ac77be	common : only load backends when required (#22290 ) * common : only load backends when required Signed-off-by: Adrien Gallouët <angt@huggingface.co> * llama : call ggml_backend_load_all() directly from llama_backend_init() Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Add ggml_backend_load_all() where llama_backend_init() is not used Signed-off-by: Adrien Gallouët <angt@huggingface.co> --------- Signed-off-by: Adrien Gallouët <angt@huggingface.co> b9031	2026-05-05 09:23:50 +02:00
Alessandro de Oliveira Faria (A.K.A.CABELO)	a09a00e502	vendor : update cpp-httplib to 0.43.3 (#22686 ) b9030	2026-05-05 09:04:57 +02:00
Georgi Gerganov	2bacb1eb77	server : validate --tools CLI argument against known tool names (#22538 ) Previously, unknown tool names passed via --tools were silently ignored. Now the server validates each tool name at startup and exits with an error if an unrecognized tool is specified, listing the available tools. Assisted-by: llama.cpp:local pi b9029	2026-05-05 06:35:27 +03:00
Georgi Gerganov	d6e7b033a4	llama : add option to save memory in device buffers (#22679 ) * llama : add option to save memory in device buffers * tests : extend llama-save-load-state b9028	2026-05-05 06:35:07 +03:00
Sigbjørn Skjæret	fa595462ca	graph : handle non-contiguous Q/K/V in mul_mat_aux (#22630 ) * qkv may not always be contiguous * cont : make the cont conditional --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-05-05 06:34:44 +03:00
Ismail	a817a22bc6	ggml : implement fast walsh-hadamard transform for kv rotation (#21352 ) (#22631 ) b9026	2026-05-05 10:05:05 +08:00
Charles Xu	eff06702b2	kleidiai : update to v1.24.0 and use release archive (#22549 ) b9025	2026-05-04 22:13:31 +03:00
leonardHONG	e77056f9b2	CUDA: use fastdiv for batch index split in get_rows (#22650 )	2026-05-04 16:24:05 +02:00
Xuan-Son Nguyen	935a340292	server: implement /models?reload=1 (#21848 ) b9023	2026-05-04 16:23:26 +02:00
Shakhnazar Sailaukan	d8794eecd5	examples: refactor diffusion generation (#22590 ) * examples: refactor diffusion generation * renamed enum values b9022	2026-05-04 20:19:30 +08:00
JusteLeo	36a694c965	webui : fix circular dependency between chat.service.ts and models.svelte.ts (#22625 )	2026-05-04 13:38:10 +02:00
Piotr Wilkin (ilintar)	a4701c98f7	common/autoparser: fixes for newline handling / forced tool calls (#22654 ) * chat/autoparser: the fixes * Move optspace() to chat-peg-parser, comment out server tests invalidated due to content now allowed with forced tool calls. * Trim whitespace on apply instead b9020	2026-05-04 13:18:11 +02:00
Xuan-Son Nguyen	994118a183	model: move `load_hparams` and `load_tensors` to per-model definition (#22004 ) * git-friendly migration * add build_graph * nits * exclude old code from build * wip * add llm_arch_model_i * prepare downstream functions * nits * nits * wip * wip * add back create_tensor_qkv * fix files missing include * enforce one llm_build per arch * cmake: use glob * missing model params * nits * wip * wip (2) * wip (3) * test-llama-archs is happy * improve switch case * move more stuff into llm_arch_model_i * fix downstream code * nits * nits (2) * fix order * llama_model_base * LLAMA_LOAD_LOCALS * small fix * fix build errors * auto * rm migration script and ifdef b9019	2026-05-04 12:36:59 +02:00
Evan Huus	c84e6d6db5	server: Add a simple get_datetime server tool (#22649 ) b9018	2026-05-04 12:19:41 +02:00
Nick Towle	fa8feaed34	webui: restore missing settings (#22666 )	2026-05-04 09:04:07 +02:00
Georgi Gerganov	846262d787	docs : update speculative decoding parameters after refactor (#22397 ) (#22539 ) * docs : update speculative decoding parameters after refactor (#22397) Update docs/speculative.md to reflect the new parameter naming scheme introduced in PR #22397: - Replace --draft-max/--draft-min with --spec-draft-n-max/--spec-draft-n-min - Replace --spec-ngram-size-n/m with per-implementation variants - Add documentation for all new --spec-ngram-- parameters - Update all example commands Assisted-by: llama.cpp:local pi pi : add rule to use gh CLI for GitHub resources Assisted-by: llama.cpp:local pi * docs : run llama-gen-docs * arg : fix typo b9016	2026-05-04 08:52:07 +03:00
Atomic-Germ	6dcd824fce	vulkan: delete dead GGML_VK_MAX_NODES def (#22621 ) b9015	2026-05-04 07:49:29 +02:00
Chen Yuan	d4b0c22f9e	ggml-webgpu: add layer norm ops (#22406 ) * shader(norm): add layer norm ops * shader(norm): stablize floating point computation with Kahan summation and handle mixed types * shader(norm): remove the non-contiguous strides * shader(norm): use the original implementation rather than the kahan summation b9014	2026-05-03 20:52:53 -07:00
Aldehir Rojas	e48034dfc9	common : determine generation prompt using longest common prefix (#22657 )	2026-05-04 00:18:23 +02:00
Julien Denize	048a490f76	convert : Mistral format yarn apply_scale support (#22612 ) * [BUGFIX] Mistral format apply_scale support. * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * fix misunderstood boolean parameters --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> b9012	2026-05-03 21:51:21 +02:00
JM Robles	db44417b02	convert : apply Q/K RoPE permutation in NVFP4 repack path (#22611 ) Llama-architecture q_proj/k_proj weights need an axis-0 row permutation to match GGML's RoPE convention. The BF16 path applies this in LlamaModel.modify_tensors via LlamaModel.permute, but the NVFP4 path bypasses modify_tensors and writes weights directly through ModelBase._repack_nvfp4. Without the permutation, attention heads end up scrambled at inference and the model produces gibberish. This change overrides _repack_nvfp4 on LlamaModel and applies the same permutation to both the nibble-packed weight and the per-block scale before delegating to ModelBase._repack_nvfp4 via super(). Reuses the existing LlamaModel.permute static helper and respects the existing undo_permute flag, so subclasses (Mistral, Granite, Llama4, etc.) inherit the fix automatically. Verified on TinyLlama-1.1B reproducer: perplexity drops from 4419 (gibberish) to 43.9, matching the BF16-dequantized baseline (44.0). Also verified end-to-end on ALIA-40b-instruct-2601 (BSC, Llama architecture) with multilingual generation in Spanish/Catalan/Basque/ Galician all coherent with the fix applied. Co-authored-by: Chema <chema@montevive.ai>	2026-05-03 18:22:00 +03:00
lucy	d05fe1d7da	fix: CUDA device PCI bus ID de-dupe OOMing (ignoring other 3 gpus entirely) (#22533 ) * fix: CUDA device PCI bus ID detection for multi-GPU de-dupe * HIP, MUSA macros --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de> b9010	2026-05-02 22:19:25 +02:00
Georgi Gerganov	0754b7b6fe	server : avoid checkpoint data host copies (#22558 ) * server : avoid checkpoint data host copies * llama : refactor llama_io_read_i b9009	2026-05-02 18:03:25 +03:00
JusteLeo	09294365a9	ggml-virtgpu: fix circular dependency in headers (#22557 ) b9008	2026-05-02 21:28:50 +08:00
Csaba Kecskemeti	63d93d1733	convert : disable uint types (#18908 )	2026-05-02 09:05:59 +03:00
Shawn Gu	c5a3bc39b1	opencl: Adreno optimization for MoE - MxFP4 (#22301 ) * MoE Mxfp4 CLC kernel added, router reorder on GPU * Pass test-backend-ops for MoE mxfp4 Adreno CLC * remove putenv in llama-model.cpp * fix indent style and whitespace * opencl: remove unnecessary headers * opencl: do not save cl_program objects * opencl: remove unnecessary assert * fix precision issue --------- Co-authored-by: Li He <lih@qti.qualcomm.com> b9006	2026-05-01 23:02:24 -07:00
Johannes Gäßler	9dbb372610	Github: update issue templates (#22594 )	2026-05-02 07:56:13 +02:00
Georgi Gerganov	228e836344	sync : ggml b9004	2026-05-02 08:55:29 +03:00
Georgi Gerganov	ed23489f42	ggml : bump version to 0.10.2 (ggml/1474)	2026-05-02 08:55:29 +03:00
Georgi Gerganov	457e2288c9	sync : ggml b9002	2026-05-02 07:22:35 +03:00
Georgi Gerganov	e8ec7ab058	ggml : try fix win32 build (whisper/0)	2026-05-02 07:22:35 +03:00
Yiwei Shao	1a03cf47f6	hexagon: hmx flash attention (#22347 ) * hmx: extract shared interleave headers and unify matmul batched * hmx: add HMX-accelerated flash attention for prefill * hmx: replace asm wrappers with Q6_ intrinsics in hmx-utils.h Switches three single-instruction helpers from inline asm to the matching Q6_ intrinsics, matching the style established by aizip `f8737609a` and used by the upstream PR #21554 hmx-matmul-ops.c rewrite: hmx_set_output_scales asm "bias=mxmem2" -> Q6_bias_mxmem2_A hmx_load_tile_pair_fp16 asm packet -> Q6_activation_hf_mxmem_RR + Q6_weight_hf_mxmem_RR hmx_consume_accumulator_fp16 asm "mxmem=acc" -> Q6_mxmem_AR_after_hf hmx_load_tiles_fp16 stays on inline asm: it uses ":deep" activation streaming, and the mixed Q6_activation_hf_mxmem_RR_deep + non-deep Q6_weight_hf_mxmem_RR pair fails the HMX backend constraint check ("activate weight pair (1) exceeds limit (1)"). The asm bundle keeps both halves in one VLIW packet and avoids the diagnostic. Functionally equivalent — same instructions emitted; the Q6_ intrinsics just give the compiler more visibility for scheduling. * hmx: drop the duplicate interleave_fp16_weight_chunk_to_tiles * hmx: apply upstream optimization to hmx-flash-attn-ops.c apply restrict, __builtin_assume, and pointer accumulation to the three HMX workers (qk_dot, o_update, o_norm) and the matching inline HMX loops in op_hmx_flash_attn_ext. * hmx: unify interleave helper * hmx: multi-thread Q load / O store and enable prefill FA dispatch Extract inline Q-load and O-store loops into worker_pool-parallel helpers (fa_phase_q_load, fa_phase_o_store) so HVX threads split the F32↔F16 conversion work across row ranges. Also relax the softmax threading gate from n_row_vec_cnt >= n_threads to >= 2, which was unnecessarily forcing single-thread fallback when n_rows_g < 512. On the dispatch side, remove the ne[2] != 1 guard that blocked multi-head (prefill) FA from reaching the HTP backend — GQA is already handled internally by both the HMX and HVX flash-attention paths. * hmx: relax matmul pipeline gate to cover k > n shapes (e.g. FFN_down) * hmx: optimize FA softmax mask phase (no-ALiBi fast path + GQA dedup) * hmx: Add an asm memory clobber at the phase boundary to prevent reorder bug * [experimental]: fp16 softmax (EXP2_HF) to accelerate fa Bake log2(e) into qk_scale and use hvx_exp2_hf directly for P and m_diff (base-2 consistent, matches htp-ops-lib). ~22 ALU ops for 64 lanes vs ~44 for the F32 round-trip path. * hmx flash-attn: refine cost model coefficients based on profiling data * hmx flash-attn: replace asm clobber with targeted volatile reads on vtcm_d_tiles * hmx flash-attn: fix prefill correctness (dst indexing, softmax reduce, V stride) * hmx flash-attn: fix p_tiles dual-tile OOB race; enable MT + pipeline * hmx flash-attn: preserve additive mask bias in no-ALiBi fast path The no-ALiBi fast path (max_bias==0) was skipping mask add entirely on the assumption that mask values are only {0, -inf}. This is wrong when the mask carries additive positional bias — those terms were silently dropped. Keep the slope-mul skip (slope≡1.0) but add mask back so the bias survives; vmux still clamps below -16 to -inf. Also add HMX FA coverage to test-backend-ops: prefill shapes (nb=64, nb=32) × {mask on/off} × {ALiBi on/off} × {softcap on/off}, F16 KV, hs ∈ {64, 128}. * hmx: fix softcap+EXP2_HF interaction, tighten matmul pipeline gate, add FA tests - flash-attn: when EXP2_HF is on AND logit_softcap is active, fold log2(e) into the post-tanh multiplier (v_cap) instead of pre-baking it into qk_scale. Pre-baking shifted the tanh knee from x≈c to x≈c/log2(e) and produced numerically wrong softcapped outputs whenever both knobs were enabled. - flash-attn softmax (fa_softmax_thread): replace the union+memcpy scalar extract pattern with HVX vmux-based per-row accumulators on rowmax/rowsum. Add hvx_vec_get_f16 helper in hvx-base.h. Functional parity, less scalar code, clearer hf/qf16 lane-format contract. - matmul (hmx_mat_mul_permuted_qk_0_d16a32): pick pipeline vs sequential layout based on whether the chunker actually yields >=2 n-chunks, instead of the static (m>=128 && n>=256) gate. Avoids paying for output double-buffer + worker dispatch when there is no HMX/HVX overlap to gain (e.g. shapes that collapse to one n-chunk). - tests: add HMX flash-attention coverage over the {mask, ALiBi (max_bias), logit_softcap} cross-product for the prefill path — head_dim 64/128, GQA 4×4, kv=512/nb=64 plus a kv=113/nb=32 non-aligned case. * [Help Wanted]: refactor D matrix computation into separate function for clarity and maintainability * format code * hexagon: looks like -O3 is causing issues with the large code base, switch to -O2 and -flto instead * hexagon: use hex_ prefix for swap_ptr * hexagon: move vtcm_seq_alloc into vtcm-utils.h More vtcm allocator updates are coming so it makes sense to start the separate hdr for it. * hmx-utils: add hmx_prefix for layout converters * hmx-mm: move main hmx_mm functions to the end, remove unused fwd decls, etc * hmx-mm: remove unused qweight_fetch_task_state_t and minor alignment fixes * hmx-fa: minor alignment fixes * hmx-fa: move hmx_flash_atten into hmx-ops.h * hmx-fa: remove redundant workpool pointer in the hmx_fa_ctx, plus minor alignment updates * hmx-fa: minor alignment and simplifications * hexagon: move FA_EXP_F16 option to hostside CMake file * hmx-fa: use hvx_vec_splat_f16 instead of fp16_to_bits * hmx-fa: add hvx_splat_u16/u8 and use that in the fa instead custom hvx_fill * hmx-fa: some more alignment updates in the core fa function * hmx-fa: keep slopes in vtcm in fp16 Saves malloc/free and removes the need for float -> fp16 downcast on every use. * hexagon: consistent noinline usage (after static) * hex-hmx: consistent use FARF_HIGH to enable debug output * hmx-utils: no need for always_inline attr * hex-hmx: consistent noinline usage (static noinline ...) * hex-hmx: simplify init_col_scales * hexagon: fix editorconfig errors * hmx-mm: minor alignment fixes --------- Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com> b9000	2026-05-01 20:29:13 -07:00
ddh0	b97ebdc98f	llama-quant : fix `--tensor-type` when default `qtype` is overriden (#22572 ) fix #22544 (my fault!) Credit to @Anai-Guo, ref #22559 - since that one was closed due to the new contributor policy I am taking the liberty of re-submitting that PR here. b8999	2026-05-01 19:55:55 +02:00

1 2 3 4 5 ...

9048 Commits