Commit Graph

  • e443fbcfa5 ggml webgpu: add CEIL operation support (#18605) b7640 nwyin 2026-01-05 13:38:57 -06:00
  • 73d284a250 model : add LFM2-ColBert-350M (#18607) b7639 Tarek Dakhran 2026-01-05 19:52:56 +01:00
  • df17a4c94f CUDA: fix FA FP16 accumulator overflow for Granite (#18614) b7638 Johannes Gäßler 2026-01-05 19:51:13 +01:00
  • 1871f0ba56 add YoutuVLForConditionalGeneration architectures (#18620) tt 2026-01-06 01:15:14 +08:00
  • f47edb8c19 ggml-cuda: check for srcs outside the cgraph (#18583) b7636 Aman Gupta 2026-01-05 22:46:36 +08:00
  • df27d80ae3 rpc : implement event and async backend APIs Radoslav Gerganov 2025-12-17 13:18:15 +02:00
  • da143b9940 server : fix router child env in containerized environments (#18562) b7635 Vladislav Sayapin 2026-01-05 16:12:05 +03:00
  • f1768d8f03 vulkan: fix topk_moe_sigmoid_norm_bias failures in GLM-4.6 (#18582) b7634 Jeff Bolz 2026-01-05 04:51:39 -06:00
  • 2da64a2f8a models : fix backend assignment for Granite/Nemotron graphs (#18599) b7633 Georgi Gerganov 2026-01-05 12:34:23 +02:00
  • b37124d2d2 vulkan: handle quantize_q8_1 overflowing the max workgroup count (#18515) b7632 Jeff Bolz 2026-01-05 04:30:14 -06:00
  • eadc4184ca llama : refactor rope_freq_base/scale_swa conversion and init (#18553) b7631 Sigbjørn Skjæret 2026-01-05 09:14:04 +01:00
  • 67e3f6f601 CANN: add operator fusion support for ADD + RMS_NORM (#17512) b7630 Chenguang Li 2026-01-05 15:38:18 +08:00
  • 92ac1e016b doc: clarify that steps also apply to linux for opencl (#18002) Francisco Herrera 2026-01-04 23:39:25 -05:00
  • 8e3a761189 ci : init git lfs in every build for RISC-V (#18590) b7628 Ali Tariq 2026-01-05 06:18:33 +05:00
  • d3dce4e0a5 sampling : add support for backend sampling (#17004) Daniel Bevenius 2026-01-04 21:22:16 +01:00
  • 4974bf53cf model : mtmd : make input norm optional in LFM2-VL (#18594) b7626 Tarek Dakhran 2026-01-04 18:50:02 +01:00
  • 908a9e5a1e CUDA: disable cuda graph when using n-cpu-moe (#18593) b7625 Aman Gupta 2026-01-05 01:37:48 +08:00
  • 5126c41c1c ggml-cuda: remove unused params in ggml_cuda_graph (#18579) b7624 Aman Gupta 2026-01-05 01:37:09 +08:00
  • cef1d23c5a common/grammar : replace problematic backtracking regex [\s\S]* (#18342) b7623 Aldehir Rojas 2026-01-03 16:02:43 -06:00
  • c69c7ebc90 graph : fix graph reuse logic when n_pos_per_embd > 1 (#18566) b7622 Georgi Gerganov 2026-01-03 23:59:06 +02:00
  • e57f52334b ggml-cuda: fixes for concurrent streams (#18496) b7621 Aman Gupta 2026-01-03 23:15:01 +08:00
  • a554a1ecc7 context : fix reserve token padding to n_seqs (#18536) b7620 Georgi Gerganov 2026-01-03 15:45:34 +02:00
  • 0f2e42ca1d CUDA: only allocate FA tmp buffer if needed (#18564) b7619 Johannes Gäßler 2026-01-03 13:55:53 +01:00
  • 9dba9f5352 (Bugfix, ggml-cuda) Pool alloc count fix + small size computation type adjustment (#18559) b7618 pl752 2026-01-03 15:13:40 +05:00
  • bcfc8c3cec ggml-hexagon: optimize activation function (#18393) b7617 Shouyu 2026-01-03 00:24:24 -05:00
  • 18ddaea2ae vulkan: Optimize GGML_OP_CUMSUM (#18417) b7616 Jeff Bolz 2026-01-02 15:32:30 -06:00
  • 706e3f93a6 vulkan: Implement mmvq for iq1_s/iq1_m (#18450) b7615 Jeff Bolz 2026-01-02 13:19:04 -06:00
  • 5755e52d15 model : Maincoder-1B support (#18534) b7614 Prabod 2026-01-03 06:11:59 +11:00
  • f38de16341 metal : adjust extra size for FA buffer to avoid reallocations (#18545) b7613 Georgi Gerganov 2026-01-02 19:02:18 +02:00
  • af1e8e1a6c graph : reduce topology branching (#18548) b7612 Georgi Gerganov 2026-01-02 19:01:56 +02:00
  • d84a6a98be vocab : reduce debug logs about non-EOG control tokens (#18541) b7611 Georgi Gerganov 2026-01-02 16:17:33 +02:00
  • bf3f12df4c graph : constant topology for tokens/embeddings inputs gg/graph-avoid-branches-2 Georgi Gerganov 2026-01-02 15:46:45 +02:00
  • 4ed59dc2c7 graph : reduce topology branching Georgi Gerganov 2026-01-02 15:35:26 +02:00
  • c6f0e832da rpc : use unordered_map::reserve and emplace (#18513) b7610 Chris Rohlf 2026-01-02 05:09:36 -05:00
  • e86f3c2221 cuda : fix copy of large tensors (ggml_nbytes <= INT_MAX assertion) (#18433) b7609 MeeMin 2026-01-02 04:54:20 +05:30
  • 169ee68ffb model : remove modern-bert iswa template (#18529) b7608 Sigbjørn Skjæret 2026-01-02 00:06:42 +01:00
  • ced765be44 model: support youtu-vl model (#18479) b7607 tt 2026-01-02 02:25:54 +08:00
  • 3ccccc83f7 Add conversion support for IQuestCoderForCausalLM (#18524) Piotr Wilkin (ilintar) 2026-01-01 18:45:55 +01:00
  • d0a6a31470 model : add support for JinaBertModel with non-gated ffn (#18475) b7605 o7si 2026-01-02 01:38:51 +08:00
  • 2b2afade9f convert : fix encoding of WPM vocab for BERT models (#18500) o7si 2026-01-02 01:27:07 +08:00
  • f4f5019254 model: add Solar Open model (#18511) b7603 HelloKS 2026-01-02 02:01:43 +09:00
  • d5574c919c webui: fix code copy stripping XML/HTML tags (#18518) Anri Lombard 2026-01-01 14:44:11 +02:00
  • 26831bded9 ggml-cuda: remove unnecessary prints on ggml_cuda_init (#18502) b7601 Aman Gupta 2026-01-01 19:18:43 +08:00
  • be47fb9285 vulkan: extend topk_moe to handle sigmoid w/exp_probs_b for nemotron (#18295) b7600 Jeff Bolz 2026-01-01 01:58:27 -06:00
  • 9e10bd2eaf llama: handle short reads in direct I/O path (#18504) b7599 triplenom 2025-12-31 21:24:43 -05:00
  • 4cd162a123 chat: make tool description and parameters optional per OpenAI spec (#18478) b7598 Anri Lombard 2026-01-01 01:21:37 +02:00
  • 13814eb370 sync : ggml Georgi Gerganov 2025-12-31 18:27:54 +02:00
  • 54f67b9b66 ggml : bump version to 0.9.5 (ggml/1410) Georgi Gerganov 2025-12-31 18:24:07 +02:00
  • 33ded988ba quantize: prevent input/output file collision (#18451) b7595 Anri Lombard 2025-12-31 17:29:03 +02:00
  • 0db8109849 convert : lint fix (#18507) Sigbjørn Skjæret 2025-12-31 14:28:21 +01:00
  • 9b8329de7a mtmd : Adding support for Nvidia Music Flamingo Model (#18470) b7593 Henry147147 2025-12-31 06:13:23 -05:00
  • cb5e0f8734 deprecate llama_adapter_lora_free Xuan Son Nguyen 2025-12-31 12:07:07 +01:00
  • 9a6369bb60 metal : add count_equal op (#18314) b7592 gatbontonpc 2025-12-31 00:39:48 -08:00
  • ecc343de63 CUDA: fix KQ max calculation (#18487) b7591 Johannes Gäßler 2025-12-31 09:37:00 +01:00
  • 01ade96e71 metal : remove BF16 x F16 kernels (#18456) b7590 Georgi Gerganov 2025-12-31 09:53:48 +02:00
  • 7bcaf815c2 sycl: add newline at the end of CMakeLists.txt (#18503) b7589 Aman Gupta 2025-12-31 14:23:44 +08:00
  • c8a3798041 Work around broken IntelSYCLConfig.cmake in Intel oneAPI 2025.x (#18345) b7588 Rahul Sathe 2025-12-31 06:38:44 +05:30
  • 4849661d98 docker : add CUDA 13.1 image build (#18441) Sigbjørn Skjæret 2025-12-30 22:28:53 +01:00
  • 6e0c8cbc40 docs : document that JSON Schema is not available to model when using response_format (#18492) Bart Louwers 2025-12-30 22:13:49 +01:00
  • 0f89d2ecf1 common : default content to an empty string (#18485) b7585 Aldehir Rojas 2025-12-30 12:00:57 -06:00
  • ac1d0eb7bf llama : fix typo in comment in llama-kv-cache.h [no ci] (#18489) Daniel Bevenius 2025-12-30 17:20:14 +01:00
  • f5e8bfddc3 lora: make sure model keep track of associated adapters Xuan Son Nguyen 2025-12-30 15:57:21 +01:00
  • cd78e57c3a lora: count lora nodes in graph_max_nodes (#18469) b7583 Xuan-Son Nguyen 2025-12-30 15:53:12 +01:00
  • c32fa21db8 sampling: reuse token data buffer in llama_sampler_sample (#18365) b7582 Jay Zenith 2025-12-30 06:27:49 -08:00
  • 6ecba0d0d0 fix 5 danbev/gpu-sampling-rev-0 Georgi Gerganov 2025-12-30 14:53:52 +02:00
  • 94bfa7803e fix 4 Georgi Gerganov 2025-12-30 14:15:04 +02:00
  • f14f4e421b server: fix files built redundantly (#18474) b7581 Jeff Bolz 2025-12-30 06:11:13 -06:00
  • 3e0a3e865b fix 3 Georgi Gerganov 2025-12-30 14:06:42 +02:00
  • 2d6c00a9b8 kleidiai: add and integrate SVE 256-bit vector-length kernel (#18458) b7580 Charles Xu 2025-12-30 13:04:53 +01:00
  • bd48a0ac10 fix2 Georgi Gerganov 2025-12-30 14:02:58 +02:00
  • ab6f1122a4 fix Georgi Gerganov 2025-12-30 14:02:09 +02:00
  • faad7d4743 test Georgi Gerganov 2025-12-30 14:00:36 +02:00
  • 23e8bb4077 arg : add shorthand for --backend-sampling Georgi Gerganov 2025-12-30 13:56:22 +02:00
  • d77d7c5c06 CUDA: add log line when mxfp4 acceleration is used (#18483) b7579 Aman Gupta 2025-12-30 17:40:46 +08:00
  • a864fb1c14 model-conversion : use CONVERTED_MODEL for compare-embeddings (#18461) Daniel Bevenius 2025-12-30 10:13:12 +01:00
  • ebfe545cf9 Merge remote-tracking branch 'upstream/master' into backend-sampling Daniel Bevenius 2025-12-30 07:59:02 +01:00
  • 51a48720b8 webui: fix prompt progress ETA calculation (#18468) b7577 Xuan-Son Nguyen 2025-12-29 21:42:11 +01:00
  • 42c40819ca handle case done === 0 xsn/webui_fix_eta Xuan Son Nguyen 2025-12-29 21:07:10 +01:00
  • ae7412056a webui: fix prompt progress ETA calculation Xuan Son Nguyen 2025-12-29 20:58:27 +01:00
  • c9a3b40d65 Webui/prompt processing progress (#18300) Pascal 2025-12-29 19:32:21 +01:00
  • 0bd1212a43 CUDA: fix replacement of bad archs in CMake (#18457) Johannes Gäßler 2025-12-29 17:58:20 +01:00
  • 5b1248c9af server : Cmdline arg -to changes http read timeout from current 600sec default (#18279) b7574 wbtek 2025-12-30 01:12:48 +09:00
  • 3595ae5963 contributing: tighten AI usage policy (#18388) Xuan-Son Nguyen 2025-12-29 16:01:32 +01:00
  • c1366056f6 android: routine maintenance - Dec 2025 (#18338) b7572 Naco Siren 2025-12-29 05:51:13 -08:00
  • 2a85f720b8 server : handle closed connection for tasks (#18459) b7571 Georgi Gerganov 2025-12-29 15:34:41 +02:00
  • 7cbec34a63 model-conversion : add device option to embd run orig model (#18386) Daniel Bevenius 2025-12-29 13:37:02 +01:00
  • eaa639af65 update xsn/contrib_tighter_ai_policy Xuan Son Nguyen 2025-12-29 12:41:48 +01:00
  • 59f072cf9f Apply suggestions from code review Xuan-Son Nguyen 2025-12-29 12:40:54 +01:00
  • 0c8986403b retrieval : use at most n_seq_max chunks (#18400) b7569 Héctor Estrada Moreno 2025-12-29 05:21:13 -06:00
  • 28b97ea300 trailing space Xuan Son Nguyen 2025-12-29 11:07:15 +01:00
  • 3631acefb8 improve Xuan Son Nguyen 2025-12-29 10:12:37 +01:00
  • daa242dfc8 common: fix return value check for setpriority (#18412) b7568 o7si 2025-12-29 17:07:49 +08:00
  • e70e640db3 CUDA: Blackwell features for non-native builds (#18436) b7567 Johannes Gäßler 2025-12-29 09:35:42 +01:00
  • 5fa66c6e67 cuda: fix race condition in cumsum (#18448) b7566 Aman Gupta 2025-12-29 14:07:17 +08:00
  • 382808c14b ci : re-enable rocm build on amd64 (#18439) b7565 Tim Neumann 2025-12-29 00:29:23 +01:00
  • 6d4b1ad6ae Update CONTRIBUTING.md Xuan-Son Nguyen 2025-12-28 21:34:23 +01:00
  • 4ffc47cb20 HIP: Use mmq on MFMA devices for MUL_MAT_ID in cases where a lot of splits would be generated (#18202) b7564 uvos 2025-12-28 20:12:55 +01:00
  • 4f2c127b5b revise Xuan Son Nguyen 2025-12-28 19:28:38 +01:00
  • 9c675c7140 model : Plamo3 support (#17304) b7563 momonga 2025-12-29 01:28:31 +09:00
  • 07a0c4ba92 Revert "ggml-cuda: use CMAKE_CUDA_ARCHITECTURES if set when GGML_NATIVE=ON (#18413)" (#18426) b7562 Aman Gupta 2025-12-28 20:53:36 +08:00