llama.cpp

mirror of https://github.com/ggml-org/llama.cpp.git synced 2026-05-01 14:44:05 +00:00

Author	SHA1	Message	Date
Masashi Yoshimura	6da7168312	ggml-webgpu: Add fused RMS_NORM + MUL (#21983 ) * fused rms_norm_mul + mul * Add GGML_WEBGPU_DISABLE_FUSION for being able to disable kernel fusion. * Decouple num_fused_ops from webgpu_context; misc cleanup * Fix eps handling and remove disable_fusion. * Fix not to use c++20 initializers. b8891	2026-04-22 10:51:40 -07:00
Piotr Wilkin (ilintar)	8bccdbbff9	chat: fix parallel_tool_calls default setting based on model capabilities, add tests for parallel tool calls and structured outputs (#22217 ) * chat: fix parallel_tool_calls default setting based on model capabilities, add tests for parallel tool calls and structured outputs * Fix ty errors. * Fix flake8 err b8890	2026-04-22 18:10:56 +02:00
Georgi Gerganov	bcb5eeb645	speculative-simple : add checkpoint support (#22227 ) * speculative-simple : add checkpoint support * cont : fix build b8889	2026-04-22 15:44:45 +03:00
Akarshan Biswas	225088ea76	sycl: Improve mul_mat_id memory efficiency and add BF16 fast path (#22119 ) * sycl: size mul_mat_id staging buffers by routed rows Previously src1_contiguous/dst_contiguous in ggml_sycl_mul_mat_id were sized to ggml_nelements(src1/dst), which over-allocates when ne12 > 1 and can fail with UR_RESULT_ERROR_OUT_OF_HOST_MEMORY on Level Zero for MoE models (notably with --cpu-moe). Size them by the actual number of routed rows (ids->ne[1] * n_ids) instead. * sycl: add bf16 mul_mat fast path via DNNL When src0 is BF16 (commonly the case for lm_head / output.weight), the existing f16 path is skipped because bf16 isn't covered, and the f32 fallback dequantizes the entire src0 slab to f32 in a single pool alloc (row_diff*ne00 floats). For large-vocab models this can reach several GB and fail with UR_RESULT_ERROR_OUT_OF_HOST_MEMORY on Level Zero. Add a bf16xbf16 -> f32 DNNL matmul fast path that uses the bf16 storage in place and only materializes a small src1 bf16 conversion buffer. bf16 matmul accumulates in f32, so it's correct even when the op requests GGML_PREC_F32 (as lm_head does). - gemm.hpp: map bfloat16 to dnnl::memory::data_type::bf16. - convert.{hpp,cpp}: expose ggml_get_to_bf16_sycl for f32/f16/bf16 -> bf16. - ggml-sycl.cpp: take the bf16 path early in ggml_sycl_op_mul_mat_sycl when DNNL and GGML_SYCL_HAS_BF16 are both available. b8888	2026-04-22 20:32:56 +08:00
Xuan-Son Nguyen	82d3f4d3b2	mtmd: also support LLAMA_ROPE_TYPE_NONE (#22242 ) b8887	2026-04-22 12:16:29 +02:00
Xuan-Son Nguyen	17f6245168	server: ignore reasoning content from transcription api (#21905 ) b8886	2026-04-22 12:10:50 +02:00
manayang	7bfe60fdf9	mtmd, llama : Update HunyuanVL vision-language model support (#22037 ) * mtmd, llama : add HunyuanVL vision-language model support - add LLM_ARCH_HUNYUAN_VL with M-RoPE (XD-RoPE) support - add PROJECTOR_TYPE_HUNYUANVL with PatchMerger vision encoder - add HunyuanVL-specific M-RoPE position encoding for image tokens - add GGUF conversion for HunyuanVL vision and text models - add smoke test in tools/mtmd/tests.sh * fix: fix HunyuanVL XD-RoPE h/w section order * fix: Remove redundant code * convert : fix HunyuanOCR / HunyuanVL conversion - Tested locally: both HunyuanOCR and HunyuanVL-4B convert to GGUF - successfully and produce correct inference output on Metal (F16 / Q8_0). * clip : fix -Werror=misleading-indentation in bilinear resize * fix CI: convert_hf_to_gguf type check error - convert_hf_to_gguf.py: give HunyuanVLTextModel.__init__ an explicit `dir_model: Path` parameter so ty can infer the type for load_hparams instead of reporting `Unknown \| None`. --------- Co-authored-by: wendadawen <wendadawen@tencent.com> b8885	2026-04-22 11:58:43 +02:00
Ethan Turner	750579ff14	common: Refactoring sampler parameters (#20429 ) (#22233 ) This change refactors the reasoning_budget_message parameter from the common params into the sampling parameters specifically. It also removes the reasoning_budget common parameter and standardizes on the existing reasoning_budget_tokens parameter in the sampling configuration. Issue: https://github.com/ggml-org/llama.cpp/issues/20429 Original PR: https://github.com/ggml-org/llama.cpp/pull/20297 b8884	2026-04-22 10:40:19 +02:00
Piotr Wilkin (ilintar)	134d6e54d4	common/chat, server: refactor, move all conversion functions to common, add tests (#20690 ) * Refactor conversion functions b8883	2026-04-22 10:28:45 +02:00
Chen Yuan	ca7f7b7b94	ggml-webgpu(shader): support conv2d kernels. (#21964 ) * ggml(webgpu): fix the busy-polls in Emscripten in the waitAny after #20618, and remove the busy webgpu log * Merge with upstream * Fix GET_ROWS packed integer NaN when using f16 as memory buffer in shader quants * Update Unary wgsl EXP and EXPM1 for f16 stability * Fix GET_ROWS IQ4_XS struct for NaN f16 canonicalization * Fix numerical precision for unary sqrt when working with f16 * Fix NaN canonicalization for packed integers using f16 * Update err threshold for binary div ops when using f16 * backend: Keep one Dawn/WebGPU instance alive for the lifetime of the static backend * clean: uncomment existing code logs * clean: clean the unnecessary debug info * Refactor and generalize dequant helpers * Remove deprecated quant structs * Refactor shader defines to reduce repetition * Remove error override for F16 type * fix: fix the accidental removal of the proper initialization of ctx * clean: clean legacy and format code * fix: did not modify tests ops * shader(conv2d): add conv2d shader kernels and pass f32 and f16 tests * shader(conv2d): fix the out of bounds memory access in the weight indexing * shader(conv2d): clean unused variables and optimize the computation * merge: use the new entries function * clean: address the formatting issues * clean: address the warning issues * clear: clean the shader editorconfig-checker issues * clear: clean the shader editorconfig-checker with utf-8 --------- Co-authored-by: Jeremy J. Hartmann <jeremy@mtion.tv> b8882	2026-04-21 20:18:57 -07:00
Aparna M P	0dedb9ef7a	hexagon: add support for FILL op (#22198 ) Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com> b8881	2026-04-21 16:24:20 -07:00
Masashi Yoshimura	2799d933b5	ggml-webgpu: reset CPU/GPU profiling time when freeing context (#22050 ) * Reset the CPU/GPU profiling time when freeing context. * move GPU profiling time from global context to webgpu_context. b8880	2026-04-21 16:05:21 -07:00
Xuan-Son Nguyen	04fe84b69d	server: allow cancel loading model (#21814 )	2026-04-22 00:26:09 +02:00
Shreya Jain	5a4cd6741f	Hexagon: DIAG op (#22195 ) * hexagon: Add DIAG op * hexagon: add HVX support and DMA double buffering * hexagon: fix fatal error * hexagon: remove as many pragma(s) as possible b8878	2026-04-21 14:16:04 -07:00
Mengsheng Wu	2248799a58	hexagon: fix missing v79 entry in libggml-htp.inf (#22194 )	2026-04-21 13:53:44 -07:00
Paul Dubs	72d693e4fb	spec : reset i_last when low acceptance streak occurs (#22168 ) By resetting i_last to zero, we will include the current context when rebuilding the speculative map. b8876	2026-04-21 21:29:07 +03:00
Kwa Jie Hao	98d2d2884e	mtmd: Add support for Reka Edge 2603 (#21616 ) * feat: (vocab) fix stray text appended in llama_decode_text Remove accidental concatenation of the full `text` string when formatting UNK_BYTE hex escapes. Only the closing "]" should be appended. * feat(mtmd): add Yasa2 vision encoder support Add a Yasa2 (ConvNeXtV2-based) vision encoder for reka-edge: - Register PROJECTOR_TYPE_YASA2 and tensor name definitions - Add yasa2_block/yasa2_stage model structs - Implement graph builder with ConvNeXt stages, GRN, adaptive pooling - Wire into clip.cpp switch statements and mtmd.cpp init_vision - Use mtmd_image_preprocessor_fixed_size for image preprocessing * feat(chat): add reka-edge template handler (tools, thinking) - Add chat-reka.cpp/h implementing PEG-based parser for reka-edge format - Add Reka-Edge.jinja chat template - Detect reka-edge template in try_specialized_template() - Add LLAMA_EXAMPLE_MTMD to chat-template-file arg * feat: add reka vlm to gguf conversion script Converts Reka Yasa2 hf checkpoints to GGUF format: - Text decoder: Llama-arch with tiktoken/BPE vocab - Mmproj (--mmproj): ConvNeXt vision backbone + language_projection - Generates 2D sincos positional embeddings for vision encoder * test: add Reka Edge chat template and parser tests - test-chat-template: oracle tests comparing Jinja engine output vs common_chat_templates_apply for text, tools, thinking, images, video - test-chat: PEG parser tests for Reka Edge format, round-trip tests for image/video content parts, common path integration tests * scripts: add Reka Edge mixed quantization helper Q4_0 base quantization with Q8_0 override for the last 8 transformer blocks (layers 24-31) via --tensor-type regex. 
* fix: adapt chat-reka and tests to upstream API - Use autoparser::generation_params (not templates_params) - Add p.prefix(generation_prompt) to PEG parser - Simplify reasoning parser to match LFM2 pattern - Remove image/video oracle tests (unsupported by oaicompat parser; no other multimodal models test this path) * fix: avoid duplicate tensor loading in yasa2 vision encoder TN_YASA_PATCH_W and TN_PATCH_EMBD both resolve to "v.patch_embd.weight", causing the same tensor to be loaded twice into ctx_data and overflowing the memory pool. Reuse the tensors already loaded by the common section. * chore: update image pre-processing settings The reka-edge model depends on the following settings in an older fork of llama.cpp: 1. Fixed square resize 2. BICUBIC 3. add_padding=false In current llama.cpp, this means setting: - image_resize_algo = RESIZE_ALGO_BICUBIC - image_resize_pad = false * chore: remove reka gguf conversion script * chore: remove reka quantization script * chore: remove unnecessary changes from PR scope This commit removes a couple of unnecessary changes for the PR scope: 1. BPE decoder bug fix - this affects reka edge because there's a bug in our tokenization that doesn't represent <think> tokens as special tokens. However this isn't meant to be a thinking model so when run with --reasoning off the edge case does not affect us 2. --chat-template-file support from llama-mtmd-cli - the focus is on llama-server and the reka edge gguf contains the necessary metadata to detect the chat template 3. reka edge oracle test cases - no other model has similar test cases, so I removed it for standardization * chore: remove unnecessary ggml_cast This commit removes unnecessary ggml_cast after updating the reka vlm -> gguf conversion script on hugging face. * chore: remove redundant code * chore: remove unnecessary ggml_cont calls This commit removes all ggml_cont calls except the four that precede ggml_reshape_3d/ggml_reshape_4d. 
Those are necessary because ggml_reshape recomputes strides assuming contiguous layout and asserts ggml_is_contiguous. Other operations (ggml_mean, ggml_add, ggml_mul etc.) use stride-based indexing and handle non-contiguous inputs correctly and so we are ok to remove ggml_cont for those. * chore: remove unnecessary ggml_repeat calls This commit removes unnecessary ggml_repeat calls because the underlying ops already broadcast automatically. Every ggml_repeat in yasa2.cpp was expanding a smaller tensor to match a larger one's shape before passing both to an elementwise op (ggml_add, ggml_sub, ggml_mul, or ggml_div). This is unnecessary because all four of these ops already support broadcasting internally. * chore: restore ggml_cont needed for cpu operations * refactor: locate reka chat template handler in chat.cpp * chore: remove unnecessary warmup tokens * chore: add code comments on image_resize_pad * chore: remove custom reka parsing code * chore: revert common/chat.cpp * Uncomment debug logging for PEG input parsing --------- Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com> b8875	2026-04-21 20:02:49 +02:00
Georgi Gerganov	84652b80cf	arg : add --spec-default (#22223 ) b8874	2026-04-21 19:52:02 +03:00
Zijun Yu	52f1096f21	openvino: driver setup, CI split, thread safety, and NPU optimizations (#21944 ) * Thread safety per request only * Fix ROPE yarn case * Fix sticky stateful config * Use i4/i8 directly for symmetric quant * Use weightless caching * Add WeightlessCacheAttribute to reduce NPU memory usage * Gelu tanh support (#125) * Imrope support (#126) * fix(openvino): explicit ov::Tensor frees in ggml_backend_openvino_free * add GPU,NPU support in OV Dockerfile * add build-openvino.yml ci * Fix sticky stateful config * add concurrency to ov-gpu ci runs. Move OV CI to build-openvino.yml * fix thread-safety of shared runtime context * rope type abstraction for frontend translations * fix editorconfig --------- Co-authored-by: Mustafa Cavus <mustafa.cavus@intel.com> Co-authored-by: Dan Hoffman <dhoff749@gmail.com> Co-authored-by: Ravi Panchumarthy <ravi.panchumarthy@intel.com> b8873	2026-04-21 18:58:34 +03:00
Alessandro de Oliveira Faria (A.K.A.CABELO)	606fa42f5d	vendor : update cpp-httplib to 0.43.1 (#22143 ) * vendor : update cpp-httplib to 0.43.0 * vendor : update cpp-httplib to 0.43.0 b8872	2026-04-21 22:45:48 +08:00
Georgi Gerganov	7fc1c4ef78	metal : workaround macOS GPU interactivity watchdog (#22216 ) b8871	2026-04-21 17:24:55 +03:00
Jeff Bolz	82209efb7e	vulkan: Support F16 OP_FILL (#22177 ) b8870	2026-04-21 11:01:56 +02:00
Xuan-Son Nguyen	9998d88bc8	mtmd: correct mtmd_decode_use_mrope() (#22188 ) b8869	2026-04-21 10:53:37 +02:00
Georgi Gerganov	cd03ec7642	llama-ext : fix exports (#22202 ) b8868	2026-04-21 11:04:46 +03:00
Georgi Gerganov	4889afba5f	sync : ggml	2026-04-21 11:04:21 +03:00
Georgi Gerganov	041fe83d74	ggml : bump version to 0.10.0 (ggml/1463)	2026-04-21 11:04:21 +03:00
Georgi Gerganov	cfe9838d26	fit-params : refactor + add option to output estimated memory per device (#22171 ) * fit-params : add option to output estimated memory per device * cont : minor * cont : refactor * cont : move fit params implementation to libcommon * cont : header * cont : headers * cont : codeowners	2026-04-21 09:54:36 +03:00
xris99	ff6b1062af	server : fix hardcoded proxy connection timeout in router mode (#18760 ) (#22003 ) Fixes: https://github.com/ggml-org/llama.cpp/issues/18760 Co-authored-by: Christian <christian@example.com> b8864	2026-04-21 06:41:14 +02:00
leonardHONG	97895129e5	ggml-cuda: flush legacy pool on OOM and retry (#22155 ) * ggml-cuda: flush legacy pool on OOM and retry Signed-off-by: 梁厚宏 <2695316095@qq.com> * Address review comments: add explicit sync, update destructor, clean up MUSA macros Signed-off-by: 梁厚宏 <2695316095@qq.com> --------- Signed-off-by: 梁厚宏 <2695316095@qq.com> b8863	2026-04-20 23:30:38 +02:00
Xuan-Son Nguyen	86f8daacfe	mtmd: correct get_n_pos / get_decoder_pos (#22175 ) b8862	2026-04-20 23:29:19 +02:00
Georgi Gerganov	cf8b0dbda9	server : remove /api endpoints (#22165 ) * server : remove /api endpoints * cont : remove /api/tags b8861	2026-04-20 20:41:19 +03:00
Gaurav Garg	fd6ae4ca1c	Tensor-parallel: Fix delayed AllReduce on Gemma-4 MoE (#22129 ) * Fix delayed AllReduce on Gemma-4 MoE Skip forward past nodes that don't consume the current one, and allow a chain of MULs. * Check for all sources before skipping nodes * Address review comments b8860	2026-04-20 18:25:39 +02:00
Johannes Gäßler	fb19f94c71	TP: fix 0-sized tensor slices, AllReduce fallback (#21808 ) * TP: fix 0-sized tensor slices, AllReduce fallback * fix layer structure <-> GPU count aliasing * add missing std::fill * fix CUDA device set, max ggml ctx size b8859	2026-04-20 18:09:39 +02:00
pl752	7f251fdbce	ggml-cpu: Optimized x86 and generic cpu q1_0 dot (follow up) (#21636 ) * Implemented optimized q1_0 dot for x86 and generic * Removed redundant helper definition * Removed two redundant instructions from AVX q1_0 dot * Fixed inconsistency with fp16 conversion for generic q1_0 dot and deduplicated generic fallback * Style cleanup around AVX q1_0 dot * Replaced explicitly unrolled blocks with inner for loop for q1_0 * Replaced scalar ARM q1_0 impl with new generic one b8858	2026-04-20 19:02:54 +03:00
neha-ha	a6cc43c286	ggml-webgpu: updated matrix-vector multiplication (#21738 ) * merged properly, but slow q3_k and q5_k with u32 indexing * Start on new mat-vec * New format float paths working * Working q4_0 * Work on remaining legacy q-types * port k-quants to new matvec * remove old shader * Remove old constants, format * remove accidental file --------- Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local> Co-authored-by: Reese Levine <reeselevine1@gmail.com> b8857	2026-04-20 07:37:17 -07:00
Xuan-Son Nguyen	a678916623	mtmd: refactor mtmd_decode_use_mrope (#22161 )	2026-04-20 14:45:11 +02:00
SamareshSingh	81df3f7cfa	fix: GLM-DSA crash in llama-tokenize when using vocab_only (#22102 ) * llama: fix crash in print_info for GLM-DSA when vocab_only is set * addressed code review comments * cont : simplify --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> b8855	2026-04-20 10:32:46 +03:00
Georgi Gerganov	de71b5f81c	server : refactor "use checkpoint" logic (#22114 ) b8854	2026-04-20 08:42:37 +03:00
Katostrofik	788fcbc5dd	[SYCL] Fix reorder MMVQ assert on unaligned vocab sizes (#22035 ) * [SYCL] Fix reorder MMVQ assert on unaligned vocab sizes The reorder mul_mat_vec_q dispatchers for Q4_0, Q8_0, Q4_K, and Q6_K asserted that block_num_y was a multiple of 16 subgroups. Models with a vocab size not divisible by 16 (for example HY-MT at 120818) aborted on model load when the output projection tripped the assert. I replaced the assert with padding: block_num_y now rounds up to a whole number of subgroup-sized workgroups. The kernel already has the row bounds check (`if (row >= nrows) return;`) so the extra padded threads early-exit cleanly. Row values are uniform across a subgroup so the collective reduce stays safe. For aligned vocab sizes the padded block_num_y equals the old value, so the kernel launch is identical and there is no regression. Thanks to @arthw for flagging the relationship to #21527. Fixes #22020. AI assisted coding, tested on Intel B70 hardware. * sycl: use WARP_SIZE for num_subgroups in reorder MMVQ launches Replaces the hardcoded 16 with WARP_SIZE in the four reorder_mul_mat_vec launch helpers (Q4_0, Q8_0, Q4_K, Q6_K). Compile-time no-op on the Intel target where WARP_SIZE is 16, but makes the relationship to subgroup size explicit. Per review by @NeoZhangJianyu on #22035. Assisted by Claude. b8853	2026-04-20 08:39:45 +03:00
Yes You Can Have Your Own	9d49acb2a7	server: rename --clear-idle to --cache-idle-slots (#21741 ) b8852	2026-04-20 08:30:24 +03:00
Alessandro de Oliveira Faria (A.K.A.CABELO)	e365e658f0	vendor : update cpp-httplib to 0.42.0 (#21781 ) b8851	2026-04-20 06:41:43 +08:00
Johannes Gäßler	4eac5b4509	CUDA: refactor mma data loading for AMD (#22051 ) * CUDA: refactor mma data loading for AMD * fix CDNA MMQ occupancy * fix CDNA3 mma * fix RDNA3 compile b8850	2026-04-19 18:26:59 +02:00
Aldehir Rojas	d5b780a676	common/autoparser : allow space after tool call (#22073 ) b8849	2026-04-19 13:28:35 +02:00
uvos	471540ae8a	HIP: Remove unnecessary NCCL_CHECK (#21914 ) b8848	2026-04-19 12:59:44 +02:00
Xuan-Son Nguyen	19124078be	mtmd: add pos_0 to mtmd_image_tokens_get_decoder_pos (breaking change) (#22082 ) * mtmd: add pos_0 to mtmd_image_tokens_get_decoder_pos * fix build b8847	2026-04-19 11:57:21 +02:00
Gaurav Garg	bcdcc1044f	ggml : reduce CPU overhead in meta backend (#22041 ) * cache subgraph splits when cgraph is unchanged Skip per-call subgraph construction in ggml_backend_meta_graph_compute when the same ggml_cgraph is used consecutively. Assign uid to every sub-graph so that CUDA's fast uid check path hits too. * Address review comments * Keep the scope as is * Rename last_uid and last_n_subgraphs field. Remove last_max_tmp_size field. Refactor code. * Address review comments * Update ggml/src/ggml-backend-meta.cpp Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Update ggml/src/ggml-backend-meta.cpp Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de> b8846	2026-04-19 12:48:35 +03:00
Sigbjørn Skjæret	037bfe38d0	ci : install spirv-headers for vulkan-cross (#22109 )	2026-04-19 10:32:08 +03:00
Dowon	8685e7b075	convert : support sentence-transformer 5.4 config files (#22087 ) * convert : support sentence-transformer 5.4 config files * fix: embeddinggemma * fix: mapping Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * fix: pooling_mode Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-04-19 10:25:39 +03:00
texasich	09b4efa95f	cmake: remove CMP0194 policy to restore MSVC builds (#21934 ) #21630 added the CMP0194 NEW policy to silence a CMake warning, but on Windows runners it caused CMake to prefer the MinGW toolchain for ASM and broke MSVC builds. Reverting only that policy block restores the previous working behavior. The CMake 4.1+ warning comes back, but that is cosmetic and does not break any platform. Reported-by: oobabooga Refs: #21630 Co-authored-by: texasich <texasich@users.noreply.github.com> b8843	2026-04-19 10:25:05 +03:00
Sascha Rogmann	455d8e4be8	server : speculative checkpointing (#19493 ) * server : speculative decoding using checkpoints * server : fix draft check with checkpoints * server : rename spec vars * server : log levels * server : refactored spec logic to speculative.cpp * server : renamed spec checkpoints option * server : fix spec checkpoints, logging * speculative : checkpoints with draft model, logging * server : n_tokens_cur and create_checkpoint in draft * server : fix server_speculative_callback (slot.id) * spec : fix ngram-map/begin idx_last_check * spec : init ckpt (begin() wasn't called) * chore: update webui build output * server : restore sampler in spec checkpoint and clear mem * cont : avoid --spec-use-checkpoints argument * cont : remove server_prompt_checkpoint_with_size * spec : rename (leave_draft_state) * cont : clean-up * cont : do not ignore partial drafts even if they are short * cont : spec callback owned by session * cont : simplify * cont : avoid empty speculative session * cont : simplify * cont : simplify * cont : enable mtmd speculative decoding * cont : keep the spec sampler alive * cont : simplify * cont : fix nullptr deref + draft checkpoints * cont : remove common_speculative_accept_response * cont : remove callback * cont : simplify * cont : minor * cont : simplify * cont : fix accepted number --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> b8842	2026-04-19 10:24:06 +03:00
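
The padding fix described in commit 788fcbc5dd above ([SYCL] Fix reorder MMVQ assert on unaligned vocab sizes) can be illustrated with a minimal host-side sketch. It is not the actual SYCL code: `round_up` and `rows_processed` are hypothetical names introduced here, and the loop only simulates a padded launch in which the extra threads exit early through the row bounds check quoted in the commit message.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not from the llama.cpp source): round n up to the
// next multiple of m, as the commit describes for block_num_y.
constexpr int64_t round_up(int64_t n, int64_t m) {
    return (n + m - 1) / m * m;
}

// Hypothetical simulation of a padded launch: one iteration per launched row
// index, with the same early exit as the kernel's `if (row >= nrows) return;`.
// Returns how many rows actually do work.
int64_t rows_processed(int64_t nrows, int64_t num_subgroups) {
    assert(num_subgroups > 0);
    const int64_t padded = round_up(nrows, num_subgroups);
    int64_t done = 0;
    for (int64_t row = 0; row < padded; ++row) {
        if (row >= nrows) {
            continue; // padded thread: nothing to compute
        }
        ++done;
    }
    return done;
}
```

With the vocab size mentioned in the commit (120818) and 16 subgroups, the padded row count is 120832, while an already aligned size such as 32 is left unchanged, which matches the commit's statement that aligned launches are identical to the old behavior.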

8891 Commits