Commit Graph

  • 1f8d36665d minor : cleanup + fix build Georgi Gerganov 2026-01-26 14:05:17 +02:00
  • a3300937e5 common : better names Georgi Gerganov 2026-01-26 13:59:08 +02:00
  • f895bca71a minor : cleanup Georgi Gerganov 2026-01-26 13:56:28 +02:00
  • 56f3ebf38e model : add correct type for GLM 4.7 Flash (#19106) b7837 Georgi Gerganov 2026-01-26 11:24:30 +02:00
  • fd4d803c60 common: print performance in spec decoding Sascha Rogmann 2026-01-26 00:20:05 +01:00
  • 288ab50597 doc: (draftless) speculative decoding Sascha Rogmann 2026-01-25 23:58:55 +01:00
  • 8ea068e5f8 spec: remove --spec-config Sascha Rogmann 2026-01-25 23:56:29 +01:00
  • 0c21677e43 CUDA: faster FA for GQA > 1 but not power of 2 (#19092) b7836 Johannes Gäßler 2026-01-25 21:19:47 +01:00
  • 9ac881767c cont : naming Georgi Gerganov 2026-01-25 21:15:15 +02:00
  • 0440bfd160 metal : fix recommendedMaxWorkingSetSize availability on legacy iOS/macOS (#19088) b7835 ccbinn 2026-01-26 02:07:19 +08:00
  • 0bf5636938 convert : yield Gemma3N custom_map tensors directly (#19091) Sigbjørn Skjæret 2026-01-25 18:03:34 +01:00
  • 924517dd38 spec : refactor Georgi Gerganov 2026-01-25 17:15:46 +02:00
  • af382c384a common: cleanup (use common_speculative_state_draft) Sascha Rogmann 2026-01-25 16:41:44 +01:00
  • bcb43163ae ggml-cpu: Use tiled FA for prompt-processing (#19012) b7833 Aman Gupta 2026-01-25 23:25:58 +08:00
  • d9c6ce46f7 kv-cache : support V-less cache (#19067) b7832 Georgi Gerganov 2026-01-25 15:48:56 +02:00
  • 70d860824a convert : fix Gemma3N, GraniteMoe and Ernie4.5Moe (#19084) Sigbjørn Skjæret 2026-01-25 13:05:05 +01:00
  • 080b161995 completion : fix prompt cache for recurrent models (#19045) b7830 Georgi Gerganov 2026-01-25 09:12:50 +02:00
  • 1243f93a2d readme: update RWKV7 model links (#19061) Molly Sophia 2026-01-25 15:11:19 +08:00
  • 24bc238303 llama: fix integer type consistency in split helpers (#18894) b7828 Jakkala Mahesh 2026-01-25 12:40:52 +05:30
  • 16639ba217 common : use two decimal places for float arg help messages (#19048) b7827 Daniel Bevenius 2026-01-25 07:31:42 +01:00
  • 9981c30130 convert : fix conversion for inheriting models that were bypassing modify_tensors (#19064) b7826 Bartowski 2026-01-24 20:36:47 -05:00
  • cb3a40277a common: moved self-spec impl to ngram-map Sascha Rogmann 2026-01-25 01:16:06 +01:00
  • e9fd8dcab4 llama-fit-params: keep explicit --ctx-size 0 (#19070) b7825 Johannes Gäßler 2026-01-24 22:13:08 +01:00
  • 4e5b83b226 GGUF: check that tensor size is representable (#19072) b7824 Johannes Gäßler 2026-01-24 21:57:51 +01:00
  • bb02f74c61 chat: fix language input for translategemma (#19052) b7823 Xuan-Son Nguyen 2026-01-24 17:58:45 +01:00
  • a1584ac80f server: cleanup (remove slot.batch_spec, rename) Sascha Rogmann 2026-01-23 23:31:32 +01:00
  • 1e29af4ea5 common: add option --spec-draftless Sascha Rogmann 2026-01-22 23:17:56 +01:00
  • eb43748b05 common: add vector of speculative states Sascha Rogmann 2026-01-21 22:46:28 +01:00
  • b38eb5907c common: add enum common_speculative_type Sascha Rogmann 2026-01-18 18:45:10 +01:00
  • 456268fa7f common: ngram map, config self-speculative decoding Sascha Rogmann 2026-01-14 23:44:23 +01:00
  • 907d094f9e server: can_speculate() requires a task instance Sascha Rogmann 2026-01-03 10:16:22 +01:00
  • f1f6584ce6 common: use %zu format specifier for size_t in logging Sascha Rogmann 2026-01-03 09:54:22 +01:00
  • 917f4bb14b server: replace can_speculate() with slot.can_speculate() Sascha Rogmann 2026-01-02 22:42:59 +01:00
  • 38f7c28795 server: can_speculate() tests self-spec Sascha Rogmann 2026-01-02 00:10:46 +01:00
  • e3e809cc01 can_speculate() includes self-speculation Sascha Rogmann 2026-01-02 00:17:53 +01:00
  • 1faeb628db server: moved self-call into speculative.cpp Sascha Rogmann 2025-12-31 00:55:39 +01:00
  • 1fb2658b0d server: introduce self-speculative decoding Sascha Rogmann 2025-12-29 20:46:32 +01:00
  • 8f91ca54ec CUDA: re-use MLA K data for V in MMA FA (#19057) b7822 Johannes Gäßler 2026-01-24 10:09:36 +01:00
  • 81ab64f3c8 ggml-cuda: enable cuda-graphs for n-cpu-moe (#18934) b7821 Aman Gupta 2026-01-24 14:25:20 +08:00
  • 8af1f5f430 ggml-hexagon: flash-attn opt (#19025) b7820 nullname 2026-01-24 14:02:07 +08:00
  • 557515be1e graph : utilize ggml_build_forward_select() to avoid reallocations (#18898) b7819 Georgi Gerganov 2026-01-23 18:22:34 +02:00
  • cb6caca191 [SYCL] use malloc to support both iGPU and dGPU in same time (#18992) b7818 Neo Zhang 2026-01-23 20:54:10 +08:00
  • b5b8fa1c8b chat : fix translategemma crash on common_chat_format_example (#19019) Xuan-Son Nguyen 2026-01-23 12:03:42 +01:00
  • a14b960bc7 model-conversion : use BUILD_DIR variable in all scripts (#19015) Daniel Bevenius 2026-01-23 09:01:36 +01:00
  • 091a46cb8d ggml-cpu: aarm64: q5_K repack gemm and gemv (and generic) implementations (i8mm) (#18860) b7815 Alberto Cabrera Pérez 2026-01-23 07:55:08 +00:00
  • a3e812811d cli : load parser definition (#19031) b7814 Aldehir Rojas 2026-01-22 20:31:22 -06:00
  • 51fa458a92 server : support preserving reasoning_content in assistant message (#18994) b7813 Xuan-Son Nguyen 2026-01-22 21:30:06 +01:00
  • a5eaa1d6a3 mla : make the V tensor a view of K (#18986) b7812 Georgi Gerganov 2026-01-22 22:09:01 +02:00
  • e2baf02162 CUDA: fix alignment check for FA (#19023) b7811 Johannes Gäßler 2026-01-22 20:39:25 +01:00
  • e34d6d03b2 convert_hf_to_gguf.py: refactor modify_tensors to call super (#18866) Aman Gupta 2026-01-23 02:58:07 +08:00
  • 9c96465f99 opencl: enable the general fp mm for non-cont input and as a fallback for specialized kqv kernel for adreno (#18970) b7809 lhez 2026-01-22 10:29:25 -08:00
  • 4e595b250a server: do not log certain endpoints (avoid log spam) (#19028) b7808 Xuan-Son Nguyen 2026-01-22 19:24:37 +01:00
  • 0e4ebeb057 quant : manual overrides of tensor types take precedence (#18952) b7807 Georgi Gerganov 2026-01-22 16:17:06 +02:00
  • 8b30840703 release: update github api (#19022) b7806 Aaron Teo 2026-01-22 21:38:02 +08:00
  • 9eb5bfec1a mtmd : update docs to use llama_model_n_embd_inp (#18999) b7805 Xuan-Son Nguyen 2026-01-22 14:36:32 +01:00
  • c6926d1d95 server: Reorder methods in server-task.cpp (#19016) b7804 손희준 2026-01-22 22:36:04 +09:00
  • b70d251076 CUDA: add gqa_ratio 4 for GLM 4.7 flash (#18953) Aman Gupta 2026-01-22 18:51:53 +08:00
  • 5516b9c16a opencl: add TRI op support (#18979) b7802 shaofeiqi 2026-01-21 22:05:54 -08:00
  • 94242a62c0 ggml-zdnn : mark zDNN buffers as non-host (#18967) b7801 Aleksei Nikiforov 2026-01-22 01:16:21 +01:00
  • 6b99a223e3 ci : update GitHub Actions versions [no ci] (#18935) Pádraic Slattery 2026-01-22 00:57:18 +01:00
  • 77078e80e5 convert : add Devstral-2 (Ministral3ForCausalLM) arch (#18972) Mariusz Woloszyn 2026-01-22 00:55:55 +01:00
  • c301172f66 jinja: support none|string (#18995) b7798 Piotr Wilkin (ilintar) 2026-01-21 19:24:37 +01:00
  • 3802d3c78f fix: Use tabular-nums for chat message statistics (#18915) Hendrik Erz 2026-01-21 18:46:01 +01:00
  • 9da3dcd753 llama : clarify nemotron-h.cpp comment about RoPE [no ci] (#18997) Daniel Bevenius 2026-01-21 18:31:34 +01:00
  • bd544c94a3 vulkan: Remove transfer_ctx, do everything in compute_ctx. (#18945) b7795 Jeff Bolz 2026-01-21 11:01:40 -06:00
  • 14be5a39b1 common : improve error message when HTTPS is missing but required (#18987) b7794 Adrien Gallouët 2026-01-21 17:58:38 +01:00
  • fbbf3ad190 server: /v1/responses (partial) (#18486) b7793 손희준 2026-01-22 01:47:23 +09:00
  • 33f890e579 vulkan: support flash attention GQA/split_k with small batches (#18938) b7792 Jeff Bolz 2026-01-21 10:43:43 -06:00
  • 067b8d7af3 Revert "vulkan: force full subgroups for flash attention to fix intel subgroup crash (#17356)" (#18831) b7791 Masato Nakasaka 2026-01-22 01:13:43 +09:00
  • 50b7f076a5 vulkan: Use mul_mat_vec_id for small values of n (#18918) b7790 Jeff Bolz 2026-01-21 09:22:02 -06:00
  • ad8d85bd94 memory : add llama_memory_hybrid_iswa (#18601) b7789 Tarek Dakhran 2026-01-21 13:30:23 +01:00
  • 12a4a47e6a Fix GLM 4.7 Lite MoE gating func (#18980) b7788 Piotr Wilkin (ilintar) 2026-01-21 12:35:20 +01:00
  • 37c35f0e1c gguf: display strerrno when cant load a model (#18884) b7787 Matthieu Coudron 2026-01-21 07:52:46 +01:00
  • 5bd341c9a1 CUDA: Fix builds for older CCCL versions by ifdefing strided_iterator (#18964) b7786 Oliver Simons 2026-01-21 02:34:29 +01:00
  • 1c7cf94b22 common, server : use the same User-Agent by default (#18957) b7785 Adrien Gallouët 2026-01-20 18:28:43 +01:00
  • 2c1f199653 cli : fix reasoning responses in CLI (#18961) b7784 Xuan-Son Nguyen 2026-01-20 18:23:25 +01:00
  • d1e3556481 CUDA: Replace init_offsets kernel with iterators in cub-based argsort (#18930) b7783 Oliver Simons 2026-01-20 13:11:01 +01:00
  • 08f3f4a8a3 ggml : cleanup path_str() (#18928) b7782 Adrien Gallouët 2026-01-20 11:42:49 +01:00
  • 271191906c metal : enable FA for MLA heads (#18950) b7781 Georgi Gerganov 2026-01-20 12:21:28 +02:00
  • 8b407e3978 quant : manual overrides of tensor types take precedence gg/quant-manual-overrides Georgi Gerganov 2026-01-20 11:16:46 +02:00
  • 7dee9ff59a convert : use n_groups instead of hardcoded values in reshape (#18929) Daniel Bevenius 2026-01-20 06:55:24 +01:00
  • 6df686bee6 server : refactor oai_parser_opt, move it to server_chat_params (#18937) b7779 Xuan-Son Nguyen 2026-01-19 23:28:01 +01:00
  • 1706a6d7c6 convert : support Glm4MoeLite (#18936) ddh0 2026-01-19 16:09:20 -06:00
  • 959ecf7f23 jinja : fix undefined keys and attributes and int/float as bool (#18924) b7777 Sigbjørn Skjæret 2026-01-19 20:29:43 +01:00
  • 4037093c66 ci : run test-jinja -py on high perf [no ci] (#18916) Sigbjørn Skjæret 2026-01-19 20:29:15 +01:00
  • 18361c579c server: fix memory reservations in populate_token_probs (#18787) b7775 Lennart Austenfeld 2026-01-19 19:13:31 +01:00
  • 365a3e8c31 ggml : add ggml_build_forward_select (#18550) b7774 Georgi Gerganov 2026-01-19 20:03:19 +02:00
  • 3d55846a5c model-conversion : add BUILD_DIR variable to run-converted-model scripts (#18927) Daniel Bevenius 2026-01-19 13:12:38 +01:00
  • 287a33017b llama : Extend fallback, fix fileno for dio file, exclude case that mmap uses dio file (#18887) b7772 Julius Tischbein 2026-01-18 17:35:57 +01:00
  • 293a1565dc docs: add linux to index (#18907) Francisco Herrera 2026-01-18 05:03:35 -05:00
  • 3bfbbcc5fc winget : update komac version gg/winget-update Georgi Gerganov 2026-01-18 10:29:03 +02:00
  • fe44d35574 tests : add test-jinja -py option for cross-checking (#18906) b7770 Xuan-Son Nguyen 2026-01-18 08:14:27 +01:00
  • bbcdac0189 jinja : fix object item order (and properly implement dictsort) (#18904) b7769 Sigbjørn Skjæret 2026-01-18 03:40:06 +01:00
  • d03c45c9c5 jinja : attribute support for join, map and sort (#18883) b7768 Sigbjørn Skjæret 2026-01-18 02:53:01 +01:00
  • 10c98cbdf6 jinja : add missing tojson filter for bool (#18900) b7767 Sigbjørn Skjæret 2026-01-18 01:05:09 +01:00
  • 420960ab92 jinja : fix lexing of float literals with sign (#18901) b7766 Sigbjørn Skjæret 2026-01-18 00:57:51 +01:00
  • f55b033ae6 jinja: correct member access rule (#18905) b7765 Xuan-Son Nguyen 2026-01-18 00:48:55 +01:00
  • d1b4757ded opencl: fix q6_K mv for m=1 (#18893) lhez 2026-01-17 13:50:32 -08:00
  • 57c0beaed0 ci : add label for jinja changes (#18903) Sigbjørn Skjæret 2026-01-17 21:52:02 +01:00
  • 2fbde785bc kv-cache : optimize KQ mask construction (#18842) b7762 Georgi Gerganov 2026-01-17 15:42:42 +02:00