Commit Graph

  • 1e7a4175d4 add filter_tensors classmethod cisc/convert-filter-tensors-refactor Sigbjørn Skjæret 2026-05-02 00:52:42 +02:00
  • cdabf39be9 sync : ggml sync-ggml-26-05-01 Georgi Gerganov 2026-05-01 21:29:15 +03:00
  • 2279ac82e4 ggml : try fix win32 build (whisper/0) Georgi Gerganov 2026-05-01 18:53:30 +03:00
  • b97ebdc98f llama-quant : fix --tensor-type when default qtype is overriden (#22572) master b8999 ddh0 2026-05-01 12:55:55 -05:00
  • 2098fd6169 hexagon: enable non-contiguous row tensor support for unary ops (#22574) b8998 Aparna M P 2026-05-01 22:39:23 +05:30
  • ab6120cde5 webui: Spring Cleaning Refactor v1 (#22505) Aleksander Grygier 2026-05-01 18:36:29 +02:00
  • c3c1505392 ggml-webgpu: Fix vectorized handling in mul-mat and mul-mat-id (#22578) b8996 Masashi Yoshimura 2026-05-01 23:55:01 +09:00
  • da1f16886f load directly from downloaded state 0cc4m/server-memory-limit Ruben Ortlam 2026-04-21 13:22:50 +02:00
  • 884901f04d handle models that need to be downloaded before estimation Ruben Ortlam 2026-04-20 14:48:55 +02:00
  • 01dd39342d cont : clean-up Georgi Gerganov 2026-04-16 14:32:47 +03:00
  • 972813c253 also strip models memory margin from child processes Ruben Ortlam 2026-04-13 10:14:53 +02:00
  • b440ee05b8 improve variable naming, fix style Ruben Ortlam 2026-04-07 13:35:02 +02:00
  • f24011f2cf improve memory_per_device map naming Ruben Ortlam 2026-04-07 13:28:49 +02:00
  • f4a384b46c fix model count exceeded check Ruben Ortlam 2026-04-02 11:39:36 +02:00
  • f750bae2d3 move llama_context_device_memory function to llama-ext.h Ruben Ortlam 2026-04-02 11:39:07 +02:00
  • 527c91ac87 add server memory debug logging Ruben Ortlam 2026-04-02 10:07:04 +02:00
  • 3c815b369e use memory margin instead of total size limit, apply to each device separately Ruben Ortlam 2026-04-02 09:24:53 +02:00
  • 18163c4143 only set model memory_mb if not previously calculated Ruben Ortlam 2026-03-31 17:37:16 +02:00
  • af28cd24dc use no_alloc to get memory requirements for model load Ruben Ortlam 2026-03-31 16:18:03 +02:00
  • e6468c1715 estimate with to-be-loaded model size included Ruben Ortlam 2026-03-29 12:18:51 +02:00
  • 0a019ed812 server: add --models-memory-max parameter to allow dynamically unloading models when they exceed a memory size threshold Ruben Ortlam 2026-03-29 10:00:49 +02:00
  • 88d8c574ac Converge implementation with export-graph-ops cross-profiler Piotr Wilkin 2026-04-07 22:01:00 +02:00
  • 00271bcfdc Add missing op parameters to the profiler; add support for test-backend-ops to run performance tests with exactly the tensor shapes from the run Piotr Wilkin 2026-04-03 17:41:57 +02:00
  • 6b4af1f344 docs, pass copy details Piotr Wilkin 2026-03-29 23:35:38 +02:00
  • 05ced7c850 fix mul_mat_id stats, add throughput stat, add envvar trigger, add concurrent mode fix Piotr Wilkin 2026-03-29 22:52:33 +02:00
  • 69f649addd fix builds, integrate vulkan profiler, fix copy events, fix export Piotr Wilkin 2026-03-29 16:52:50 +02:00
  • cb6f855323 Fix more missing backend stuff (and Python errors) Piotr Wilkin 2026-03-29 01:57:02 +01:00
  • feeca707aa add second dimension to reported tensors, fix Mac build, add missing initializer to all backends Piotr Wilkin 2026-03-29 01:49:52 +01:00
  • 3492073424 feat: cool profiler thingy Piotr Wilkin 2026-03-29 01:14:09 +01:00
  • 05e141a6b3 vulkan: Support asymmetric FA in coopmat2 path (#21753) b8995 Jeff Bolz 2026-05-01 15:28:32 +02:00
  • 033e652e92 output device group info 0cc4m/vulkan-device-cpy-benchmark Ruben Ortlam 2026-05-01 14:53:12 +02:00
  • d40697c46a add device group test Ruben Ortlam 2026-05-01 07:49:49 +02:00
  • aab68217b7 ggml-webgpu: add the upscale shader (#22419) b8994 Chen Yuan 2026-05-01 01:22:18 -04:00
  • a95a11e5b8 ggml-webgpu: Improve performance of mat-vec and mat-mat for MUL_MAT_ID (#22464) Masashi Yoshimura 2026-05-01 06:19:10 +09:00
  • 5cbfb18075 Update llama-mmap to use ftello/fseeko (#22497) b8992 Reese Levine 2026-04-30 14:17:52 -07:00
  • 1b2bd8699c fix windows build xsn/vertexai Xuan Son Nguyen 2026-04-30 21:52:31 +02:00
  • beb42fffa4 common : check for null getpwuid in hf-cache (#22550) b8991 Adrien Gallouët 2026-04-30 21:32:41 +02:00
  • 9233271823 fix test case Xuan Son Nguyen 2026-04-30 20:07:55 +02:00
  • 9d5887035f testing gg/spec-ckpt-test Georgi Gerganov 2026-04-30 19:18:57 +03:00
  • a7c1110e87 server : avoid checkpoint data host copies gg/spec-opt-checkpoints Georgi Gerganov 2026-04-30 16:24:49 +03:00
  • 660b1b4bdc vulkan: add get/set tensor 2d functions (#22514) b8990 Ruben Ortlam 2026-04-30 17:37:13 +02:00
  • c20c44514a spec: fix argument typo (#22552) b8989 Ben Guidarelli 2026-04-30 10:32:32 -04:00
  • 6118c043b1 ci : bump ty to 0.0.33 (#22535) Sigbjørn Skjæret 2026-04-30 15:15:54 +02:00
  • 5f0ab726f7 vendor : update cpp-httplib to 0.43.2 (#22548) b8987 Adrien Gallouët 2026-04-30 15:04:39 +02:00
  • 331e4d21f5 if AIP_MODE is unset, do nothing Xuan Son Nguyen 2026-04-30 13:32:41 +02:00
  • 348e6088f3 various fixes Xuan Son Nguyen 2026-04-30 13:24:58 +02:00
  • e82aaf2587 CUDA: fix tile FA kernel on Pascal (#22541) b8986 Johannes Gäßler 2026-04-30 13:04:50 +02:00
  • 5e11eafc3e support other AIP_* env var Xuan Son Nguyen 2026-04-30 12:37:28 +02:00
  • 5dd6c9e58e a bit safer Xuan Son Nguyen 2026-04-30 12:15:47 +02:00
  • d34f9713e5 Merge branch 'master' into xsn/vertexai Xuan Son Nguyen 2026-04-30 11:55:25 +02:00
  • bfc135fee2 server: support Vertex AI compatible API Xuan Son Nguyen 2026-04-30 11:55:13 +02:00
  • 211e58178a wip pr/18039-gg Georgi Gerganov 2026-04-25 18:27:15 +03:00
  • cb8a3a93ec Merge branch 'master' into pr/18039 Georgi Gerganov 2026-04-30 10:08:10 +03:00
  • c64e772d35 pi : add rule to use gh CLI for GitHub resources gg/spec-update-docs Georgi Gerganov 2026-04-30 09:50:39 +03:00
  • 6eddb1c6e3 pi : add rule to use gh CLI for GitHub resources gg/pi-gh-tool-rule Georgi Gerganov 2026-04-30 09:49:54 +03:00
  • c6dbd31146 docs : update speculative decoding parameters after refactor (#22397) Georgi Gerganov 2026-04-30 09:44:48 +03:00
  • a7fb22fc50 server : validate --tools CLI argument against known tool names gg/server-sanitize-tools-cli-arg Georgi Gerganov 2026-04-30 09:40:58 +03:00
  • 27aef3dd91 scripts : add wc2wt.sh - create worktree from current HEAD (#22513) Georgi Gerganov 2026-04-30 09:20:26 +03:00
  • 45155597aa add fast matmul iquants (#22504) b8984 Rithik Sharma 2026-04-29 22:58:32 -07:00
  • 80afa33aad spec : fix draft model checkpoints (#22521) b8983 Georgi Gerganov 2026-04-30 08:32:18 +03:00
  • b42c7fa5b8 spec : fix vocab compat checks in spec example (#22426) b8982 Peter Sideris 2026-04-30 08:18:25 +03:00
  • d77599234e common : do not pass prompt tokens to reasoning budget sampler (#22488) b8981 Aldehir Rojas 2026-04-29 14:10:58 -05:00
  • 41a63be28e hexagon: make vmem and buffer-size configurable (#22487) b8980 Max Krasnyansky 2026-04-29 11:51:21 -07:00
  • 098705a29e CUDA: fuse SSM_CONV + ADD(bias) + SILU (#22478) b8979 Anav Prasad 2026-04-29 11:39:56 -07:00
  • 683c5acb90 spec : disacard last drafted token with low prob (#22506) b8978 Georgi Gerganov 2026-04-29 17:00:00 +03:00
  • b1d5f5b449 sync : ggml b8977 Georgi Gerganov 2026-04-29 16:43:08 +03:00
  • 4b221b7f1e ggml : bump version to 0.10.1 (ggml/1469) Georgi Gerganov 2026-04-29 16:41:45 +03:00
  • 56cc6e1e4e clean up tests, add dma_buf test Ruben Ortlam 2026-04-29 15:20:42 +02:00
  • c1680de104 benchmark Ruben Ortlam 2026-04-08 18:26:50 +02:00
  • c6a04cb5c3 ggml-metal: fix 2D async copy to use row-by-row transfers gg/metal-implement-async-2d Georgi Gerganov 2026-04-29 14:57:48 +03:00
  • f9e19a1f6e pi: add rule to not force push branches unless asked Georgi Gerganov 2026-04-29 14:37:13 +03:00
  • c3a54d6253 ggml-metal: implement async 2D tensor copy functions Georgi Gerganov 2026-04-29 14:22:06 +03:00
  • 59237bfbbc webui: fix slow mic stop and WAV encode (#22480) Pascal 2026-04-29 12:58:35 +02:00
  • 1cbc846eba ggml-cpu : disable tiled matmul on AIX to fix page boundary segfault (#22293) b8974 shalinib-ibm 2026-04-29 16:02:40 +05:30
  • 3142f1dbb9 ggml-cuda: refactor fusion code (#22468) b8973 Aman Gupta 2026-04-29 16:19:33 +08:00
  • b5c4227dc6 ggml-cpu: cmake: append xsmtvdotii march for SpacemiT IME (#22317) b8972 qiurui144 2026-04-29 15:59:21 +08:00
  • d6a5094004 ggml-webgpu: Fix bug in FlashAttention support check (#22492) b8971 Reese Levine 2026-04-29 00:59:00 -07:00
  • 7b95ea5d11 common: Intentionally leak logger instance to fix hanging on Windows (#22273) b8970 Masato Nakasaka 2026-04-29 16:58:43 +09:00
  • bdc9c743a5 ggml : add sve tuned code for gemm_q8_0_4x8_q8_0() kernel (#21916) b8969 hrushitfujitsu 2026-04-29 13:27:37 +05:30
  • 739393beeb TP: fix delayed AllReduce + zero-sized slices (#22489) b8968 Johannes Gäßler 2026-04-29 08:55:07 +02:00
  • fc2b0053ff ggml-cuda: Repost of 21896: Blackwell native NVFP4 support (#22196) b8967 Michael Wand 2026-04-28 15:47:42 -07:00
  • 7b8443ac78 ggml-cuda: add flash-attn support for DKQ=320/DV=256 with ncols2=32 (… (#22286) b8966 lnigam 2026-04-29 01:07:35 +05:30
  • 5d56effdee convert : add support for Nemotron Nano 3 Omni (#22481) Daniel Bevenius 2026-04-28 19:17:57 +02:00
  • 52e5f0a5c1 common : re-arm reasoning budget after DONE on new <think> (#22323) b8964 Jillis ter Hove 2026-04-28 19:15:36 +02:00
  • f9f33654a6 vulkan: Coalesce Q4_K/Q5_K scale loads (#21751) b8963 Matt Corallo 2026-04-28 15:31:04 +00:00
  • 98bb57916a ggml-webgpu: fix buffer aliasing for ssm_scan and refactor aliasing logic (#22456) b8962 Reese Levine 2026-04-28 07:27:17 -07:00
  • f42e29fdf1 webui: Server tools (#21237) Aleksander Grygier 2026-04-28 14:35:49 +03:00
  • 19821178be vulkan: add barrier after writetimestamp (#21865) b8960 Jeff Bolz 2026-04-28 12:28:12 +02:00
  • 698d19b93c ggml: improve SPIR-V headers detection with __has_include (#21918) Emil Askerov 2026-04-28 13:19:06 +03:00
  • 50494a2800 ggml : skip already registered backends and devices (#22296) b8958 Adrien Gallouët 2026-04-28 09:02:32 +02:00
  • d530d6e7a2 ggml : revert to -lm linking instead of find_library (#22355) b8957 Adrien Gallouët 2026-04-28 08:56:02 +02:00
  • c3e08f4700 CANN: add new ops, optimize existing ops (#21204) b8956 hipudding 2026-04-28 14:27:22 +08:00
  • 14e733e36f spec : refactor params (#22397) b8955 Georgi Gerganov 2026-04-28 09:07:33 +03:00
  • 516e8d7a8a server: use pos_next instead of n_tokens for m-rope (#22439) b8954 Aman Gupta 2026-04-28 13:41:00 +08:00
  • 434b2a1ff6 ggml-webgpu: add Q1_0 support (#22374) b8953 Rithik Sharma 2026-04-27 15:50:59 -07:00
  • 983ca8992e server: (router) Forward form-data to model server (Fixes #22044) (#22118) b8952 tha80 2026-04-27 23:55:00 +02:00
  • 665abc6097 add fast mat-vec kernels for i-quants (#22344) b8951 Rithik Sharma 2026-04-27 08:25:45 -07:00
  • 4414c04b9a Additional test for common/gemma4 : handle parsing edge cases (#22420) b8950 Igor Rudenko 2026-04-27 17:36:59 +03:00
  • ceaf47c4b1 fix: rpc-server cache may not work in Windows environments (#22394) b8949 unraido 2026-04-27 23:25:09 +09:00
  • 42401c72b8 Fix type casting for unaccounted memory calculation (#22424) b8948 rankaiyx 2026-04-27 20:31:13 +08:00