Commit Graph

  • 439c3b5021 cont : init child samplers + modify child logic Georgi Gerganov 2026-01-09 10:52:10 +02:00
  • 59dda88aae Merge branch 'master' into HEAD Georgi Gerganov 2026-01-09 09:35:12 +02:00
  • f5f8812f7c server : use different seeds for child completions (#18700) b7682 Georgi Gerganov 2026-01-09 09:33:50 +02:00
  • c0d99e65d2 add eagle3 support for Qwen3 series models ruixiangw 2026-01-08 23:49:06 +00:00
  • 8ece3836b4 common: support remote preset (#18520) b7681 Xuan-Son Nguyen 2026-01-08 22:35:40 +01:00
  • 046d5fd44e llama: use host memory if device reports 0 memory (#18587) b7680 Aaron Teo 2026-01-09 05:34:56 +08:00
  • 480160d472 ggml-webgpu: Fix GGML_MEM_ALIGN to 8 for emscripten. (#18628) b7679 Masashi Yoshimura 2026-01-09 01:36:42 +09:00
  • 15bff84bf5 ggml webgpu: initial flashattention implementation (#18610) b7678 Reese Levine 2026-01-08 08:23:39 -08:00
  • 0fca4308f7 Initial plan copilot/sub-pr-18695 copilot-swe-agent[bot] 2026-01-08 15:16:59 +00:00
  • 2524c26164 vulkan: fix push constant size for quantize_q8_1 (#18687) b7677 Jeff Bolz 2026-01-08 08:40:58 -06:00
  • cb14b06995 vulkan: optimize ssm_scan (#18630) b7676 Jeff Bolz 2026-01-08 08:16:54 -06:00
  • 5eb799a6c0 scripts : pr2wt.sh reset to remote head Georgi Gerganov 2026-01-08 16:04:19 +02:00
  • 55abc39355 vendor : update cpp-httplib to 0.30.0 (#18660) b7675 Adrien Gallouët 2026-01-08 13:53:54 +01:00
  • f2f6c88067 scripts : support chaining commands in pr2wt.sh (#18671) Georgi Gerganov 2026-01-08 13:40:23 +02:00
  • 945bf10627 metal : add MoE kernel specialization for ne20=5 (#18667) b7673 도로로도로또 2026-01-08 19:37:45 +09:00
  • 64848deb18 llama-fit-params: free memory target per device (#18679) b7672 Johannes Gäßler 2026-01-08 10:07:58 +01:00
  • 9a5724dee2 ggml: add env var GGML_OP_OFFLOAD_MIN_BATCH (#18535) Doctor Shotgun 2026-01-08 01:03:21 -08:00
  • 9c142e3a2a model-conversion : add warn about transformers mismatch (#18691) Daniel Bevenius 2026-01-08 09:29:53 +01:00
  • df7fb92170 model-conversion : remove -st targets for converted model (#18689) Daniel Bevenius 2026-01-08 09:29:15 +01:00
  • 2038101bd9 llama : add use_direct_io flag for model loading (#18166) b7668 Julius Tischbein 2026-01-08 07:35:30 +01:00
  • 568371a726 opencl: add FILL op support (#18682) b7667 shaofeiqi 2026-01-07 22:04:50 -08:00
  • 5b8844ae53 scripts : fix repos cloned with .git extension (#18669) b7666 Sigbjørn Skjæret 2026-01-07 22:35:34 +01:00
  • 7e16fef085 convert : more variants of rope_theta config entries (#18668) Sigbjørn Skjæret 2026-01-07 22:34:51 +01:00
  • f5245b5e4e cuda : fix build on cuda 12.8 (#18672) b7664 Oliver Walsh 2026-01-07 21:32:44 +00:00
  • ae9f8df778 fix(docker): add missing libglvnd libraries to Vulkan image (#18664) R 2026-01-07 16:57:42 +01:00
  • 56d2fed2b3 tools : remove llama-run (#18661) b7662 Adrien Gallouët 2026-01-07 16:18:26 +01:00
  • 56426673cb scripts : add pr2wt.sh (#18644) Georgi Gerganov 2026-01-07 15:16:20 +02:00
  • d7c27d4964 fix infinite loop on empty batch Xuan Son Nguyen 2026-01-07 14:08:05 +01:00
  • bb77764c2d convert : clarify sentence-transformers-dense-modules help [no ci] (#18662) Daniel Bevenius 2026-01-07 13:18:53 +01:00
  • a9d7bcb7fc server: fix n_cmpl not skipping processing Xuan Son Nguyen 2026-01-07 13:13:53 +01:00
  • 9dfa8ee950 ci : run cann build unconditionally [no ci] (#18659) Sigbjørn Skjæret 2026-01-07 13:07:08 +01:00
  • ca4a8370bc vulkan: reject ops when a tensor is too large to allocate (#18646) b7658 Jeff Bolz 2026-01-07 05:03:32 -06:00
  • 03023296cf vulkan: Warptile tuning for Intel Xe2/Xe3 (#18178) b7657 virajwad 2026-01-07 02:59:47 -08:00
  • 8c77a04cc7 vulkan: more mul mat optimizations (#18533) b7656 Eve 2026-01-07 10:13:17 +00:00
  • ffba4f29e6 examples : add debug utility/example (#18464) b7655 Daniel Bevenius 2026-01-07 10:42:19 +01:00
  • 3333951d86 CANN: Fix rename for get_env (#18652) b7654 hipudding 2026-01-07 16:11:31 +08:00
  • 193ee38a1b CANN: Rename get_env to get_env_as_lowercase (#18624) b7653 Raul Torres 2026-01-07 02:01:25 +00:00
  • 95ea9e0861 Hexagon add support for f16/f32 flash attention, scale, set-rows and improve f16/32 matmul (#18611) b7652 Max Krasnyansky 2026-01-06 17:38:29 -08:00
  • ccbc84a537 mtmd: mtmd_audio_streaming_istft (#18645) b7651 Tarek Dakhran 2026-01-06 21:00:29 +01:00
  • 68b4d516c3 llama-params-fit: fix last devices with low VRAM (#18494) b7650 Johannes Gäßler 2026-01-06 20:02:30 +01:00
  • 24af22fc36 ggml : optimize cuda ssm_scan using warp-level reduction (#18505) b7649 Aadeshveer Singh 2026-01-06 23:54:34 +05:30
  • 07fbe19f1f arg: use CSV escape style for multiple-value args (#18643) b7648 Xuan-Son Nguyen 2026-01-06 17:51:08 +01:00
  • ea13cba850 vulkan: support buffer_from_host_ptr (#18467) b7647 Jeff Bolz 2026-01-06 10:37:07 -06:00
  • 090b137e56 ggml-cuda: refactor cuda graph usage (#18637) b7646 Aman Gupta 2026-01-06 23:48:45 +08:00
  • 968929528c mmq.cu: tune mmq/rocblas switching for RDNA (#18537) b7645 Beinsezii 2026-01-06 07:26:07 -08:00
  • 3d26a09dc7 server : add thinking content blocks to Anthropic Messages API (#18551) b7644 R 2026-01-06 16:17:13 +01:00
  • 091d98e2c5 rpc : use std::unique_ptr for the message_queue pr/18626 Georgi Gerganov 2026-01-06 15:32:01 +02:00
  • 54ccf2476b ci : require editor config gg/ci-req-editor-config Georgi Gerganov 2026-01-06 13:04:35 +02:00
  • 4a95b44864 alloc : skip unassigned leafs gg/alloc-skip-unassigned-leafs Georgi Gerganov 2026-01-06 11:24:56 +02:00
  • bd2a93d475 gguf-py : add requests to dependencies (#18629) Christian Kastner 2026-01-06 08:56:38 +01:00
  • e75ee11024 ggml : fix avx512bf16 build (#18623) b7642 Adrien Gallouët 2026-01-06 07:54:10 +01:00
  • da9b8d3300 CANN: Make valid_values variable static const (#18627) b7641 Raul Torres 2026-01-06 03:53:28 +00:00
  • e443fbcfa5 ggml webgpu: add CEIL operation support (#18605) b7640 nwyin 2026-01-05 13:38:57 -06:00
  • 73d284a250 model : add LFM2-ColBert-350M (#18607) b7639 Tarek Dakhran 2026-01-05 19:52:56 +01:00
  • df17a4c94f CUDA: fix FA FP16 accumulator overflow for Granite (#18614) b7638 Johannes Gäßler 2026-01-05 19:51:13 +01:00
  • 1871f0ba56 add YoutuVLForConditionalGeneration architectures (#18620) tt 2026-01-06 01:15:14 +08:00
  • f47edb8c19 ggml-cuda: check for srcs outside the cgraph (#18583) b7636 Aman Gupta 2026-01-05 22:46:36 +08:00
  • df27d80ae3 rpc : implement event and async backend APIs Radoslav Gerganov 2025-12-17 13:18:15 +02:00
  • da143b9940 server : fix router child env in containerized environments (#18562) b7635 Vladislav Sayapin 2026-01-05 16:12:05 +03:00
  • f1768d8f03 vulkan: fix topk_moe_sigmoid_norm_bias failures in GLM-4.6 (#18582) b7634 Jeff Bolz 2026-01-05 04:51:39 -06:00
  • 2da64a2f8a models : fix backend assignment for Granite/Nemotron graphs (#18599) b7633 Georgi Gerganov 2026-01-05 12:34:23 +02:00
  • b37124d2d2 vulkan: handle quantize_q8_1 overflowing the max workgroup count (#18515) b7632 Jeff Bolz 2026-01-05 04:30:14 -06:00
  • eadc4184ca llama : refactor rope_freq_base/scale_swa conversion and init (#18553) b7631 Sigbjørn Skjæret 2026-01-05 09:14:04 +01:00
  • 67e3f6f601 CANN: add operator fusion support for ADD + RMS_NORM (#17512) b7630 Chenguang Li 2026-01-05 15:38:18 +08:00
  • 92ac1e016b doc: clarify that steps also apply to linux for opencl (#18002) Francisco Herrera 2026-01-04 23:39:25 -05:00
  • 8e3a761189 ci : init git lfs in every build for RISC-V (#18590) b7628 Ali Tariq 2026-01-05 06:18:33 +05:00
  • d3dce4e0a5 sampling : add support for backend sampling (#17004) Daniel Bevenius 2026-01-04 21:22:16 +01:00
  • 4974bf53cf model : mtmd : make input norm optional in LFM2-VL (#18594) b7626 Tarek Dakhran 2026-01-04 18:50:02 +01:00
  • 908a9e5a1e CUDA: disable cuda graph when using n-cpu-moe (#18593) b7625 Aman Gupta 2026-01-05 01:37:48 +08:00
  • 5126c41c1c ggml-cuda: remove unused params in ggml_cuda_graph (#18579) b7624 Aman Gupta 2026-01-05 01:37:09 +08:00
  • cef1d23c5a common/grammar : replace problematic backtracking regex [\s\S]* (#18342) b7623 Aldehir Rojas 2026-01-03 16:02:43 -06:00
  • c69c7ebc90 graph : fix graph reuse logic when n_pos_per_embd > 1 (#18566) b7622 Georgi Gerganov 2026-01-03 23:59:06 +02:00
  • e57f52334b ggml-cuda: fixes for concurrent streams (#18496) b7621 Aman Gupta 2026-01-03 23:15:01 +08:00
  • a554a1ecc7 context : fix reserve token padding to n_seqs (#18536) b7620 Georgi Gerganov 2026-01-03 15:45:34 +02:00
  • 0f2e42ca1d CUDA: only allocate FA tmp buffer if needed (#18564) b7619 Johannes Gäßler 2026-01-03 13:55:53 +01:00
  • 9dba9f5352 (Bugfix, ggml-cuda) Pool alloc count fix + small size computation type adjustment (#18559) b7618 pl752 2026-01-03 15:13:40 +05:00
  • bcfc8c3cec ggml-hexagon: optimize activation function (#18393) b7617 Shouyu 2026-01-03 00:24:24 -05:00
  • 18ddaea2ae vulkan: Optimize GGML_OP_CUMSUM (#18417) b7616 Jeff Bolz 2026-01-02 15:32:30 -06:00
  • 706e3f93a6 vulkan: Implement mmvq for iq1_s/iq1_m (#18450) b7615 Jeff Bolz 2026-01-02 13:19:04 -06:00
  • 5755e52d15 model : Maincoder-1B support (#18534) b7614 Prabod 2026-01-03 06:11:59 +11:00
  • f38de16341 metal : adjust extra size for FA buffer to avoid reallocations (#18545) b7613 Georgi Gerganov 2026-01-02 19:02:18 +02:00
  • af1e8e1a6c graph : reduce topology branching (#18548) b7612 Georgi Gerganov 2026-01-02 19:01:56 +02:00
  • d84a6a98be vocab : reduce debug logs about non-EOG control tokens (#18541) b7611 Georgi Gerganov 2026-01-02 16:17:33 +02:00
  • bf3f12df4c graph : constant topology for tokens/embeddings inputs gg/graph-avoid-branches-2 Georgi Gerganov 2026-01-02 15:46:45 +02:00
  • 4ed59dc2c7 graph : reduce topology branching Georgi Gerganov 2026-01-02 15:35:26 +02:00
  • c6f0e832da rpc : use unordered_map::reserve and emplace (#18513) b7610 Chris Rohlf 2026-01-02 05:09:36 -05:00
  • e86f3c2221 cuda : fix copy of large tensors (ggml_nbytes <= INT_MAX assertion) (#18433) b7609 MeeMin 2026-01-02 04:54:20 +05:30
  • 169ee68ffb model : remove modern-bert iswa template (#18529) b7608 Sigbjørn Skjæret 2026-01-02 00:06:42 +01:00
  • ced765be44 model: support youtu-vl model (#18479) b7607 tt 2026-01-02 02:25:54 +08:00
  • 3ccccc83f7 Add conversion support for IQuestCoderForCausalLM (#18524) Piotr Wilkin (ilintar) 2026-01-01 18:45:55 +01:00
  • d0a6a31470 model : add support for JinaBertModel with non-gated ffn (#18475) b7605 o7si 2026-01-02 01:38:51 +08:00
  • 2b2afade9f convert : fix encoding of WPM vocab for BERT models (#18500) o7si 2026-01-02 01:27:07 +08:00
  • f4f5019254 model: add Solar Open model (#18511) b7603 HelloKS 2026-01-02 02:01:43 +09:00
  • d5574c919c webui: fix code copy stripping XML/HTML tags (#18518) Anri Lombard 2026-01-01 14:44:11 +02:00
  • 26831bded9 ggml-cuda: remove unneccesary prints on ggml_cuda_init (#18502) b7601 Aman Gupta 2026-01-01 19:18:43 +08:00
  • be47fb9285 vulkan: extend topk_moe to handle sigmoid w/exp_probs_b for nemotron (#18295) b7600 Jeff Bolz 2026-01-01 01:58:27 -06:00
  • 9e10bd2eaf llama: handle short reads in direct I/O path (#18504) b7599 triplenom 2025-12-31 21:24:43 -05:00
  • 4cd162a123 chat: make tool description and parameters optional per OpenAI spec (#18478) b7598 Anri Lombard 2026-01-01 01:21:37 +02:00
  • 13814eb370 sync : ggml Georgi Gerganov 2025-12-31 18:27:54 +02:00
  • 54f67b9b66 ggml : bump version to 0.9.5 (ggml/1410) Georgi Gerganov 2025-12-31 18:24:07 +02:00
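A listing in this format can be reproduced with `git log` pretty-format placeholders (`%h` short hash, `%s` subject, `%an` author name, `%ad` author date). The sketch below builds a throwaway repository purely so the example is self-contained; the scratch path and demo identity are assumptions, not part of the log above — in practice you would run only the final `git log` inside an existing checkout such as llama.cpp:

```shell
# Sketch: emit one "  • <hash> <subject> <author> <date>" line per commit,
# matching the layout of the listing above.
# A scratch repo is created here only to make the example runnable.
tmp=$(mktemp -d)
cd "$tmp"
git init -q
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "initial commit"
git log --date=format:'%Y-%m-%d %H:%M:%S %z' \
    --pretty='format:  • %h %s %an %ad'
```

The `•` bullet and the two-space indent are literal text in the format string; git substitutes only the `%` placeholders. Build tags like `b7682` and branch refs like `pr/18626` come from `git log --decorate` ref decorations and are not reproduced by this minimal sketch.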