Commit Graph

  • a817a22bc6 ggml : implement fast walsh-hadamard transform for kv rotation (#21352) (#22631) b9026 Ismail 2026-05-05 04:05:05 +02:00
  • eff06702b2 kleidiai : update to v1.24.0 and use release archive (#22549) b9025 Charles Xu 2026-05-04 21:13:31 +02:00
  • 069be0ae22 Merge branch 'master' into pr/18039 Georgi Gerganov 2026-05-04 21:42:27 +03:00
  • e77056f9b2 CUDA: use fastdiv for batch index split in get_rows (#22650) leonardHONG 2026-05-04 22:24:05 +08:00
  • 935a340292 server: implement /models?reload=1 (#21848) b9023 Xuan-Son Nguyen 2026-05-04 16:23:26 +02:00
  • d8794eecd5 examples: refactor diffusion generation (#22590) b9022 Shakhnazar Sailaukan 2026-05-04 16:19:30 +04:00
  • 36a694c965 webui : fix circular dependency between chat.service.ts and models.svelte.ts (#22625) JusteLeo 2026-05-04 13:38:10 +02:00
  • a4701c98f7 common/autoparser: fixes for newline handling / forced tool calls (#22654) b9020 Piotr Wilkin (ilintar) 2026-05-04 13:18:11 +02:00
  • 994118a183 model: move load_hparams and load_tensors to per-model definition (#22004) b9019 Xuan-Son Nguyen 2026-05-04 12:36:59 +02:00
  • c84e6d6db5 server: Add a simple get_datetime server tool (#22649) b9018 Evan Huus 2026-05-04 06:19:41 -04:00
  • 82af405161 arg : silence warnings about removed params gg/arg-silence-warnings Georgi Gerganov 2026-05-04 10:07:57 +03:00
  • fa8feaed34 webui: restore missing settings (#22666) Nick Towle 2026-05-04 00:04:07 -07:00
  • 846262d787 docs : update speculative decoding parameters after refactor (#22397) (#22539) b9016 Georgi Gerganov 2026-05-04 08:52:07 +03:00
  • 6dcd824fce vulkan: delete dead GGML_VK_MAX_NODES def (#22621) b9015 Atomic-Germ 2026-05-03 22:49:29 -07:00
  • d4b0c22f9e ggml-webgpu: add layer norm ops (#22406) b9014 Chen Yuan 2026-05-03 23:52:53 -04:00
  • e48034dfc9 common : determine generation prompt using longest common prefix (#22657) Aldehir Rojas 2026-05-03 17:18:23 -05:00
  • 048a490f76 convert : Mistral format yarn apply_scale support (#22612) b9012 Julien Denize 2026-05-03 21:51:21 +02:00
  • db44417b02 convert : apply Q/K RoPE permutation in NVFP4 repack path (#22611) JM Robles 2026-05-03 17:22:00 +02:00
  • d05fe1d7da fix: CUDA device PCI bus ID de-dupe OOMing (ignoring other 3 gpus entirely) (#22533) b9010 lucy 2026-05-02 16:19:25 -04:00
  • 459b02f6c0 Merge branch 'master' into pr/18039 Georgi Gerganov 2026-05-02 18:08:25 +03:00
  • 0754b7b6fe server : avoid checkpoint data host copies (#22558) b9009 Georgi Gerganov 2026-05-02 18:03:25 +03:00
  • 09294365a9 ggml-virtgpu: fix circular dependency in headers (#22557) b9008 JusteLeo 2026-05-02 15:28:50 +02:00
  • 63d93d1733 convert : disable uint types (#18908) Csaba Kecskemeti 2026-05-01 23:05:59 -07:00
  • c5a3bc39b1 opencl: Adreno optimization for MoE - MxFP4 (#22301) b9006 Shawn Gu 2026-05-01 23:02:24 -07:00
  • 9dbb372610 Github: update issue templates (#22594) Johannes Gäßler 2026-05-02 07:56:13 +02:00
  • 228e836344 sync : ggml b9004 Georgi Gerganov 2026-05-02 08:46:31 +03:00
  • ed23489f42 ggml : bump version to 0.10.2 (ggml/1474) Georgi Gerganov 2026-05-02 08:45:46 +03:00
  • 81eabb4781 sync : ggml sync-ggml-26-05-02 Georgi Gerganov 2026-05-02 08:46:31 +03:00
  • 7b38b8660f ggml : bump version to 0.10.2 (ggml/1474) Georgi Gerganov 2026-05-02 08:45:46 +03:00
  • 457e2288c9 sync : ggml b9002 Georgi Gerganov 2026-05-01 21:29:15 +03:00
  • e8ec7ab058 ggml : try fix win32 build (whisper/0) Georgi Gerganov 2026-05-01 18:53:30 +03:00
  • 1a03cf47f6 hexagon: hmx flash attention (#22347) b9000 Yiwei Shao 2026-05-01 20:29:13 -07:00
  • b97ebdc98f llama-quant : fix --tensor-type when default qtype is overriden (#22572) b8999 ddh0 2026-05-01 12:55:55 -05:00
  • 2098fd6169 hexagon: enable non-contiguous row tensor support for unary ops (#22574) b8998 Aparna M P 2026-05-01 22:39:23 +05:30
  • ab6120cde5 webui: Spring Cleaning Refactor v1 (#22505) Aleksander Grygier 2026-05-01 18:36:29 +02:00
  • c3c1505392 ggml-webgpu: Fix vectorized handling in mul-mat and mul-mat-id (#22578) b8996 Masashi Yoshimura 2026-05-01 23:55:01 +09:00
  • da1f16886f load directly from downloaded state 0cc4m/server-memory-limit Ruben Ortlam 2026-04-21 13:22:50 +02:00
  • 884901f04d handle models that need to be downloaded before estimation Ruben Ortlam 2026-04-20 14:48:55 +02:00
  • 01dd39342d cont : clean-up Georgi Gerganov 2026-04-16 14:32:47 +03:00
  • 972813c253 also strip models memory margin from child processes Ruben Ortlam 2026-04-13 10:14:53 +02:00
  • b440ee05b8 improve variable naming, fix style Ruben Ortlam 2026-04-07 13:35:02 +02:00
  • f24011f2cf improve memory_per_device map naming Ruben Ortlam 2026-04-07 13:28:49 +02:00
  • f4a384b46c fix model count exceeded check Ruben Ortlam 2026-04-02 11:39:36 +02:00
  • f750bae2d3 move llama_context_device_memory function to llama-ext.h Ruben Ortlam 2026-04-02 11:39:07 +02:00
  • 527c91ac87 add server memory debug logging Ruben Ortlam 2026-04-02 10:07:04 +02:00
  • 3c815b369e use memory margin instead of total size limit, apply to each device separately Ruben Ortlam 2026-04-02 09:24:53 +02:00
  • 18163c4143 only set model memory_mb if not previously calculated Ruben Ortlam 2026-03-31 17:37:16 +02:00
  • af28cd24dc use no_alloc to get memory requirements for model load Ruben Ortlam 2026-03-31 16:18:03 +02:00
  • e6468c1715 estimate with to-be-loaded model size included Ruben Ortlam 2026-03-29 12:18:51 +02:00
  • 0a019ed812 server: add --models-memory-max parameter to allow dynamically unloading models when they exceed a memory size threshold Ruben Ortlam 2026-03-29 10:00:49 +02:00
  • 05e141a6b3 vulkan: Support asymmetric FA in coopmat2 path (#21753) b8995 Jeff Bolz 2026-05-01 15:28:32 +02:00
  • 033e652e92 output device group info 0cc4m/vulkan-device-cpy-benchmark Ruben Ortlam 2026-05-01 14:53:12 +02:00
  • d40697c46a add device group test Ruben Ortlam 2026-05-01 07:49:49 +02:00
  • aab68217b7 ggml-webgpu: add the upscale shader (#22419) b8994 Chen Yuan 2026-05-01 01:22:18 -04:00
  • a95a11e5b8 ggml-webgpu: Improve performance of mat-vec and mat-mat for MUL_MAT_ID (#22464) Masashi Yoshimura 2026-05-01 06:19:10 +09:00
  • 5cbfb18075 Update llama-mmap to use ftello/fseeko (#22497) b8992 Reese Levine 2026-04-30 14:17:52 -07:00
  • 1b2bd8699c fix windows build xsn/vertexai Xuan Son Nguyen 2026-04-30 21:52:31 +02:00
  • beb42fffa4 common : check for null getpwuid in hf-cache (#22550) b8991 Adrien Gallouët 2026-04-30 21:32:41 +02:00
  • 9233271823 fix test case Xuan Son Nguyen 2026-04-30 20:07:55 +02:00
  • 9d5887035f testing gg/spec-ckpt-test Georgi Gerganov 2026-04-30 19:18:57 +03:00
  • a7c1110e87 server : avoid checkpoint data host copies Georgi Gerganov 2026-04-30 16:24:49 +03:00
  • 660b1b4bdc vulkan: add get/set tensor 2d functions (#22514) b8990 Ruben Ortlam 2026-04-30 17:37:13 +02:00
  • c20c44514a spec: fix argument typo (#22552) b8989 Ben Guidarelli 2026-04-30 10:32:32 -04:00
  • 6118c043b1 ci : bump ty to 0.0.33 (#22535) Sigbjørn Skjæret 2026-04-30 15:15:54 +02:00
  • 5f0ab726f7 vendor : update cpp-httplib to 0.43.2 (#22548) b8987 Adrien Gallouët 2026-04-30 15:04:39 +02:00
  • 331e4d21f5 if AIP_MODE is unset, do nothing Xuan Son Nguyen 2026-04-30 13:32:41 +02:00
  • 348e6088f3 various fixes Xuan Son Nguyen 2026-04-30 13:24:58 +02:00
  • e82aaf2587 CUDA: fix tile FA kernel on Pascal (#22541) b8986 Johannes Gäßler 2026-04-30 13:04:50 +02:00
  • 5e11eafc3e support other AIP_* env var Xuan Son Nguyen 2026-04-30 12:37:28 +02:00
  • 5dd6c9e58e a bit safer Xuan Son Nguyen 2026-04-30 12:15:47 +02:00
  • d34f9713e5 Merge branch 'master' into xsn/vertexai Xuan Son Nguyen 2026-04-30 11:55:25 +02:00
  • bfc135fee2 server: support Vertex AI compatible API Xuan Son Nguyen 2026-04-30 11:55:13 +02:00
  • cb8a3a93ec Merge branch 'master' into pr/18039 Georgi Gerganov 2026-04-30 10:08:10 +03:00
  • 6eddb1c6e3 pi : add rule to use gh CLI for GitHub resources gg/pi-gh-tool-rule Georgi Gerganov 2026-04-30 09:49:54 +03:00
  • c6dbd31146 docs : update speculative decoding parameters after refactor (#22397) Georgi Gerganov 2026-04-30 09:44:48 +03:00
  • 27aef3dd91 scripts : add wc2wt.sh - create worktree from current HEAD (#22513) Georgi Gerganov 2026-04-30 09:20:26 +03:00
  • 45155597aa add fast matmul iquants (#22504) b8984 Rithik Sharma 2026-04-29 22:58:32 -07:00
  • 80afa33aad spec : fix draft model checkpoints (#22521) b8983 Georgi Gerganov 2026-04-30 08:32:18 +03:00
  • b42c7fa5b8 spec : fix vocab compat checks in spec example (#22426) b8982 Peter Sideris 2026-04-30 08:18:25 +03:00
  • d77599234e common : do not pass prompt tokens to reasoning budget sampler (#22488) b8981 Aldehir Rojas 2026-04-29 14:10:58 -05:00
  • 41a63be28e hexagon: make vmem and buffer-size configurable (#22487) b8980 Max Krasnyansky 2026-04-29 11:51:21 -07:00
  • 098705a29e CUDA: fuse SSM_CONV + ADD(bias) + SILU (#22478) b8979 Anav Prasad 2026-04-29 11:39:56 -07:00
  • 683c5acb90 spec : disacard last drafted token with low prob (#22506) b8978 Georgi Gerganov 2026-04-29 17:00:00 +03:00
  • b1d5f5b449 sync : ggml b8977 Georgi Gerganov 2026-04-29 16:43:08 +03:00
  • 4b221b7f1e ggml : bump version to 0.10.1 (ggml/1469) Georgi Gerganov 2026-04-29 16:41:45 +03:00
  • 56cc6e1e4e clean up tests, add dma_buf test Ruben Ortlam 2026-04-29 15:20:42 +02:00
  • c1680de104 benchmark Ruben Ortlam 2026-04-08 18:26:50 +02:00
  • c6a04cb5c3 ggml-metal: fix 2D async copy to use row-by-row transfers gg/metal-implement-async-2d Georgi Gerganov 2026-04-29 14:57:48 +03:00
  • f9e19a1f6e pi: add rule to not force push branches unless asked Georgi Gerganov 2026-04-29 14:37:13 +03:00
  • c3a54d6253 ggml-metal: implement async 2D tensor copy functions Georgi Gerganov 2026-04-29 14:22:06 +03:00
  • 59237bfbbc webui: fix slow mic stop and WAV encode (#22480) Pascal 2026-04-29 12:58:35 +02:00
  • 1cbc846eba ggml-cpu : disable tiled matmul on AIX to fix page boundary segfault (#22293) b8974 shalinib-ibm 2026-04-29 16:02:40 +05:30
  • 3142f1dbb9 ggml-cuda: refactor fusion code (#22468) b8973 Aman Gupta 2026-04-29 16:19:33 +08:00
  • b5c4227dc6 ggml-cpu: cmake: append xsmtvdotii march for SpacemiT IME (#22317) b8972 qiurui144 2026-04-29 15:59:21 +08:00
  • d6a5094004 ggml-webgpu: Fix bug in FlashAttention support check (#22492) b8971 Reese Levine 2026-04-29 00:59:00 -07:00
  • 7b95ea5d11 common: Intentionally leak logger instance to fix hanging on Windows (#22273) b8970 Masato Nakasaka 2026-04-29 16:58:43 +09:00
  • bdc9c743a5 ggml : add sve tuned code for gemm_q8_0_4x8_q8_0() kernel (#21916) b8969 hrushitfujitsu 2026-04-29 13:27:37 +05:30
  • 739393beeb TP: fix delayed AllReduce + zero-sized slices (#22489) b8968 Johannes Gäßler 2026-04-29 08:55:07 +02:00
  • fc2b0053ff ggml-cuda: Repost of 21896: Blackwell native NVFP4 support (#22196) b8967 Michael Wand 2026-04-28 15:47:42 -07:00
  • 7b8443ac78 ggml-cuda: add flash-attn support for DKQ=320/DV=256 with ncols2=32 (… (#22286) b8966 lnigam 2026-04-29 01:07:35 +05:30