Commit Graph

7468 Commits

Author SHA1 Message Date
Georgi Gerganov
c02654eb7d graph : make the compute graph constant with respect to active samplers 2025-12-10 16:19:18 +02:00
Georgi Gerganov
0ecee8be37 server : reconnect the backend_sampling setting in the WebUI 2025-12-10 15:42:20 +02:00
Georgi Gerganov
81cb5783c8 Merge branch 'master' into HEAD 2025-12-10 13:41:32 +02:00
Xuan-Son Nguyen
9e79b0116e convert: allow using quantized Mistral weight (#17889)
* convert: allow using quantized Mistral weight

* data_torch.ndim

* update dequant fn

Co-authored-by: compilade <compilade@users.noreply.github.com>

---------

Co-authored-by: compilade <compilade@users.noreply.github.com>
2025-12-10 10:26:22 +01:00
Neo Zhang Jianyu
2e9eab80c2 fix softmax for iGPU (#17838) b7343 2025-12-10 16:59:57 +08:00
Aldehir Rojas
2fbe3b7bb7 common : add parser for ministral/mistral large 3/devstral 2 (#17713) b7342 2025-12-09 17:31:04 -06:00
Sigbjørn Skjæret
63391852b0 docs : update cpu and cuda ops (#17890)
* update cuda ops

* update CPU as well
2025-12-09 23:31:29 +01:00
Gabe Goodhart
086a63e3a5 metal: SSM kernel improvements (#17876)
* feat: Add a batched version of ssm_conv

This was done using Claude Code. It found a number of optimizations around
how the threads were organized, resulting in a huge performance boost!

Branch: Mamba2SSD

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Optimized SSM_SCAN kernel for metal

This used Claude Code and resulted in a modest performance improvement
while maintaining correctness.

Branch: Mamba2SSD

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* test: Add test-backend-ops perf tests for SSM_CONV

Branch: SSMKernelImprovements

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* test: Real representitive tests for SSM_CONV

Branch: SSMKernelImprovements

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Use function constant for ssm_conv batch size

Branch: SSMKernelImprovements

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* test: backend op tests for ssm_scan from granite4 1b-h

Branch: SSMKernelImprovements

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* style: remove commented out templates

Branch: SSMKernelImprovements

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: float4 version of ssm_conv_batched

Branch: SSMKernelImprovements

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Add missing ggml_metal_cv_free

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b7340
2025-12-09 21:30:02 +02:00
Piotr Wilkin (ilintar)
b63509262a Add DIAG for CUDA (#17873)
* Add DIAG for CUDA

* Refactor parameters
b7339
2025-12-09 20:28:57 +01:00
Johannes Gäßler
48f47565a7 docs: clarify that CPU support should be first (#17886) 2025-12-09 20:10:36 +01:00
Oliver Simons
6dc6614bf0 Disable cooperative groups for musa
Didn't find any doc online, so I don't even know if they support this
2025-12-09 19:09:52 +01:00
Oliver Simons
a25fda5290 Fix launch logic when supports_cooperative_launch=false 2025-12-09 19:03:47 +01:00
Oliver Simons
3f0594ad0b Try fixing HIP build errors by adding corresponding #defines
Will likely have to disable for MUSA as I didn't find any docs online
2025-12-09 18:51:28 +01:00
Gabe Goodhart
02e409a5be ggml : Provide macos-specific backtrace printing to avoid terminal death (#17869)
* fix: Provide macos-specific backtrace printing to avoid terminal death

Branch: MacOSSafeBacktrace

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Add GGML_BACKTRACE_LLDB env var to enable using lldb for backtrace

Branch: MacOSSafeBacktrace

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
b7337
2025-12-09 18:29:07 +02:00
Georgi Gerganov
34b407b41c sampling : use host buffer type for inputs 2025-12-09 17:53:17 +02:00
Georgi Gerganov
92ff767918 llama : require backend samplers to be of type llama_sampler_chain 2025-12-09 15:38:37 +02:00
Georgi Gerganov
6b82eb7883 metal : print node names for debugging (#17882) b7336 2025-12-09 15:25:49 +02:00
Oliver Simons
07003f1ffb Fix compiler warnings by casting const away 2025-12-09 13:05:43 +01:00
Oliver Simons
886c3668b5 Add TODOs to and adjust heuristics of row-wise soft_max in CUDA
Heuristics were selected based on the following numbers:

```
-- Before
Backend 1/2: CUDA0
  Device description: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition
  Device memory: 97250 MB (96691 MB free)

  SOFT_MAX(type=f32,ne=[4096,4096,5,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0):                2236 runs -   450.34 us/run -   655360 kB/run - 1401.20 GB/s
  SOFT_MAX(type=f32,ne=[12888,256,5,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0):               17748 runs -    56.80 us/run -   128880 kB/run - 2168.19 GB/s
  SOFT_MAX(type=f32,ne=[77,4096,5,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0):                 57204 runs -    18.35 us/run -    12320 kB/run -  640.57 GB/s
  SOFT_MAX(type=f32,ne=[1024,1024,10,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0):               9840 runs -   102.46 us/run -    81920 kB/run -  763.45 GB/s
  SOFT_MAX(type=f32,ne=[77,1024,10,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0):                98064 runs -    10.25 us/run -     6160 kB/run -  573.43 GB/s
  SOFT_MAX(type=f32,ne=[256,256,20,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0):                98310 runs -    10.25 us/run -    10240 kB/run -  953.20 GB/s
  SOFT_MAX(type=f32,ne=[64,64,20,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0):                 172011 runs -     5.99 us/run -      640 kB/run -  101.84 GB/s
  SOFT_MAX(type=f32,ne=[77,64,20,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0):                 172011 runs -     5.97 us/run -      770 kB/run -  123.02 GB/s
  SOFT_MAX(type=f32,ne=[8192,1,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0):                 172011 runs -     6.00 us/run -       64 kB/run -   10.16 GB/s
  SOFT_MAX(type=f32,ne=[8192,4,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0):                 163820 runs -     6.12 us/run -      256 kB/run -   39.91 GB/s
  SOFT_MAX(type=f32,ne=[8192,16,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0):                147438 runs -     6.88 us/run -     1024 kB/run -  141.92 GB/s
  SOFT_MAX(type=f32,ne=[16384,1,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0):                122865 runs -     8.20 us/run -      128 kB/run -   14.89 GB/s
  SOFT_MAX(type=f32,ne=[16384,4,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0):                114674 runs -     8.87 us/run -      512 kB/run -   55.06 GB/s
  SOFT_MAX(type=f32,ne=[16384,16,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0):                98292 runs -    10.24 us/run -     2048 kB/run -  190.82 GB/s
  SOFT_MAX(type=f32,ne=[32768,1,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0):                 49146 runs -    21.37 us/run -      256 kB/run -   11.43 GB/s
  SOFT_MAX(type=f32,ne=[32768,4,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0):                 49146 runs -    22.54 us/run -     1024 kB/run -   43.33 GB/s
  SOFT_MAX(type=f32,ne=[32768,16,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0):                49146 runs -    23.92 us/run -     4096 kB/run -  163.32 GB/s
  SOFT_MAX(type=f32,ne=[65536,1,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0):                 32764 runs -    38.94 us/run -      512 kB/run -   12.54 GB/s
  SOFT_MAX(type=f32,ne=[65536,4,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0):                 24573 runs -    41.94 us/run -     2048 kB/run -   46.57 GB/s
  SOFT_MAX(type=f32,ne=[65536,16,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0):                24582 runs -    43.09 us/run -     8192 kB/run -  181.32 GB/s
  SOFT_MAX(type=f32,ne=[131072,1,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0):                16382 runs -    74.56 us/run -     1024 kB/run -   13.10 GB/s
  SOFT_MAX(type=f32,ne=[131072,4,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0):                16382 runs -    79.85 us/run -     4096 kB/run -   48.92 GB/s
  SOFT_MAX(type=f32,ne=[131072,16,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0):               12294 runs -    82.41 us/run -    16384 kB/run -  189.64 GB/s
  SOFT_MAX(type=f32,ne=[262144,1,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0):                 8191 runs -   145.16 us/run -     2048 kB/run -   13.46 GB/s
  SOFT_MAX(type=f32,ne=[262144,4,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0):                 8194 runs -   155.46 us/run -     8192 kB/run -   50.26 GB/s
  SOFT_MAX(type=f32,ne=[262144,16,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0):                7175 runs -   160.70 us/run -    32768 kB/run -  194.56 GB/s
  SOFT_MAX(type=f32,ne=[524288,1,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0):                 8191 runs -   285.81 us/run -     4096 kB/run -   13.67 GB/s
  SOFT_MAX(type=f32,ne=[524288,4,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0):                 4098 runs -   306.91 us/run -    16384 kB/run -   50.92 GB/s
  SOFT_MAX(type=f32,ne=[524288,16,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0):                3591 runs -   317.06 us/run -    65536 kB/run -  197.32 GB/s

-- After
Backend 1/2: CUDA0
  Device description: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition
  Device memory: 97250 MB (96691 MB free)

  SOFT_MAX(type=f32,ne=[4096,4096,5,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0):                2236 runs -   450.67 us/run -   655360 kB/run - 1400.15 GB/s
  SOFT_MAX(type=f32,ne=[12888,256,5,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0):               17748 runs -    56.97 us/run -   128880 kB/run - 2161.50 GB/s
  SOFT_MAX(type=f32,ne=[77,4096,5,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0):                 57204 runs -    18.35 us/run -    12320 kB/run -  640.36 GB/s
  SOFT_MAX(type=f32,ne=[1024,1024,10,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0):               9840 runs -   102.46 us/run -    81920 kB/run -  763.42 GB/s
  SOFT_MAX(type=f32,ne=[77,1024,10,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0):                98064 runs -    10.25 us/run -     6160 kB/run -  573.43 GB/s
  SOFT_MAX(type=f32,ne=[256,256,20,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0):                98310 runs -    10.25 us/run -    10240 kB/run -  953.21 GB/s
  SOFT_MAX(type=f32,ne=[64,64,20,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0):                 147438 runs -     7.00 us/run -      640 kB/run -   87.26 GB/s
  SOFT_MAX(type=f32,ne=[77,64,20,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0):                 147438 runs -     6.99 us/run -      770 kB/run -  105.05 GB/s
  SOFT_MAX(type=f32,ne=[8192,1,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0):                 172011 runs -     6.02 us/run -       64 kB/run -   10.13 GB/s
  SOFT_MAX(type=f32,ne=[8192,4,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0):                 163820 runs -     6.12 us/run -      256 kB/run -   39.87 GB/s
  SOFT_MAX(type=f32,ne=[8192,16,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0):                147438 runs -     6.91 us/run -     1024 kB/run -  141.40 GB/s
  SOFT_MAX(type=f32,ne=[16384,1,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0):                122865 runs -     8.20 us/run -      128 kB/run -   14.89 GB/s
  SOFT_MAX(type=f32,ne=[16384,4,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0):                114674 runs -     8.79 us/run -      512 kB/run -   55.54 GB/s
  SOFT_MAX(type=f32,ne=[16384,16,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0):                98292 runs -    10.24 us/run -     2048 kB/run -  190.82 GB/s
  SOFT_MAX(type=f32,ne=[32768,1,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0):                131056 runs -     8.11 us/run -      256 kB/run -   30.12 GB/s
  SOFT_MAX(type=f32,ne=[32768,4,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0):                 49146 runs -    22.54 us/run -     1024 kB/run -   43.33 GB/s
  SOFT_MAX(type=f32,ne=[32768,16,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0):                49146 runs -    23.32 us/run -     4096 kB/run -  167.50 GB/s
  SOFT_MAX(type=f32,ne=[65536,1,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0):                122865 runs -     8.19 us/run -      512 kB/run -   59.63 GB/s
  SOFT_MAX(type=f32,ne=[65536,4,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0):                 40955 runs -    24.59 us/run -     2048 kB/run -   79.43 GB/s
  SOFT_MAX(type=f32,ne=[65536,16,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0):                24582 runs -    43.21 us/run -     8192 kB/run -  180.84 GB/s
  SOFT_MAX(type=f32,ne=[131072,1,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0):               122865 runs -     8.19 us/run -     1024 kB/run -  119.25 GB/s
  SOFT_MAX(type=f32,ne=[131072,4,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0):                40955 runs -    24.59 us/run -     4096 kB/run -  158.87 GB/s
  SOFT_MAX(type=f32,ne=[131072,16,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0):               12294 runs -    82.37 us/run -    16384 kB/run -  189.74 GB/s
  SOFT_MAX(type=f32,ne=[262144,1,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0):               122865 runs -     8.20 us/run -     2048 kB/run -  238.28 GB/s
  SOFT_MAX(type=f32,ne=[262144,4,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0):                36873 runs -    28.66 us/run -     8192 kB/run -  272.61 GB/s
  SOFT_MAX(type=f32,ne=[262144,16,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0):                9225 runs -   108.51 us/run -    32768 kB/run -  288.13 GB/s
  SOFT_MAX(type=f32,ne=[524288,1,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0):                98292 runs -    10.24 us/run -     4096 kB/run -  381.65 GB/s
  SOFT_MAX(type=f32,ne=[524288,4,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0):                32784 runs -    31.74 us/run -    16384 kB/run -  492.43 GB/s
  SOFT_MAX(type=f32,ne=[524288,16,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0):                8721 runs -   121.20 us/run -    65536 kB/run -  516.19 GB/s
```
2025-12-09 12:58:56 +01:00
Oliver Simons
a84dfd3e10 CUDA: Add Cooperative-Groups-based parallelization of ncols in softmax
Old implementation parallelizes rows across SMs, which does not fit the
needs of backend-sampling (where we have ncols >> nrows and thus want to
parallelize ncols across SMs)
2025-12-09 12:58:56 +01:00
Sigbjørn Skjæret
86a3f0fad8 ggml : allow fill node alloc inplace (#17870) b7335 2025-12-09 12:23:47 +01:00
Rhys-T
63908b631a cmake: fix Mach-O current version number (#17877)
PR #17091 set the VERSION of various libraries to 0.0.abcd, where abcd
is the LLAMA_BUILD_NUMBER. That build number is too large to fit in the
Mach-O 'current version' field's 'micro' part, which only goes up to
255. This just sets the Mach-O current version to 0 to get it building
properly again.

Fixes #17258.
b7334
2025-12-09 13:17:41 +02:00
Sigbjørn Skjæret
42b12b5608 model : nit, DeepSeek V1 MoE is 16B and GigaChat is 20B (#12652)
* nit, DeepSeek V1 MoE is 16B

* base type on n_ff_exp instead
b7333
2025-12-09 12:15:06 +01:00
Xuan-Son Nguyen
4e842d5120 console: allow using arrow left/right, home/end keys and history mode (#17836)
* console: allow using arrow left/right to edit the line (with UTF-8 support)

* console: fix arrow keys on Windows using private-use Unicode

* console: add Home/End key support for Windows and Linux

* console: add basic Up/Down history navigation

* fix build

* console: allow using arrow left/right to edit the line (with UTF-8 support)

* console: fix arrow keys on Windows using private-use Unicode

* console: add Home/End key support for Windows and Linux

* console: add basic Up/Down history navigation

* console: remove unreachable wc == 0 check after VK switch

* console: add Ctrl+Left/Right word navigation

- Add KEY_CTRL_ARROW_LEFT and KEY_CTRL_ARROW_RIGHT codes
- Windows: detect CTRL modifier via dwControlKeyState
- Linux: parse ANSI sequences with modifier (1;5D/C)
- Implement move_word_left/right with space-skipping logic
- Refactor escape sequence parsing to accumulate params

* console: add Delete key support

- Windows: VK_DELETE detection
- Linux: ESC[3~ sequence parsing
- Forward character deletion with UTF-8 support

* console: implement bash-style history editing

- Edit any history line during UP/DOWN navigation, edits persist
- Pressing Enter appends edited version as new history entry
- Original line stay untouched in their positions

* clean up

* better history impl

* fix decode_utf8

---------

Co-authored-by: Pascal <admin@serveurperso.com>
b7332
2025-12-09 11:53:59 +01:00
Georgi Gerganov
7ab6f51b97 Revert "ggml : remove redundant src in ggml_cast"
This reverts commit 62d1b0082d.
2025-12-09 12:52:59 +02:00
Chenguang Li
ca709e427b CANN: add support for partial RoPE and Vision mode (#17543)
* cann: add support for partial RoPE and Vision mode

Add support for two important RoPE variants: partial rotation (rope_dims < ne0)
and Vision mode rotation.

1. Support for partial RoPE (rope_dims < ne0):
   - Split tensor into head (first rope_dims dimensions) and tail portions
   - Apply rotation only to head portion using RotaryPositionEmbedding operator
   - Copy unrotated tail portion directly from source to destination
   - Handle both contiguous and non-contiguous tensor layouts

2. Support for Vision mode (GGML_ROPE_TYPE_VISION):
   - Set rope_dims = ne0 for Vision mode to rotate entire tensor
   - Vision mode pairs dimension i with dimension i+n_dims (where n_dims = ne0/2)
   - No tail handling needed since entire tensor is rotated

Implementation details:
   - Use has_tail flag to determine execution path: head/tail splitting when
     rope_dims < ne0, or full tensor rotation when rope_dims == ne0
   - Support both F32 and F16 data types with intermediate F32 conversion
   - Copy non-contiguous tensors to contiguous buffers before calling
     RotaryPositionEmbedding operator for compatibility
   - Improve cache invalidation logic to include rope_dims and indep_sects
     parameters

These enhancements enable CANN backend to handle various RoPE configurations
used in modern vision-language models and models with partial rotation.

* cann: fix review comment
b7331
2025-12-09 17:53:23 +08:00
Georgi Gerganov
9f6681c3a4 ggml-alloc : fix reuse-parent logic for misaligned sizes 2025-12-09 11:13:44 +02:00
Georgi Gerganov
62d1b0082d ggml : remove redundant src in ggml_cast 2025-12-09 10:58:06 +02:00
Georgi Gerganov
d62b5804e1 metal : print node names for debugging 2025-12-09 10:55:54 +02:00
Georgi Gerganov
560ac16f7d server : handle unsupported cases 2025-12-09 10:55:11 +02:00
Johannes Gäßler
0cdce38a97 CUDA: fix FP16 overflow in tile FA kernel (#17875) b7330 2025-12-09 09:34:02 +01:00
Aldehir Rojas
e39502e74b llama : add token matching support to llama-grammar (#17816)
* llama : add token support to llama-grammar

* fix inverse token comment

* refactor trigger_patterns to replay tokens instead of the entire string

* add token documentation

* fix test-llama-grammar

* improve test cases for tokens
b7329
2025-12-09 00:32:57 -06:00
philip-essential
1d2a1ab73d model : support Rnj-1 (#17811)
* add support for rnj1

* refactor gemma3 to support rnj-1

* address review comments
b7328
2025-12-09 04:49:03 +01:00
Sigbjørn Skjæret
c8554b66e0 graph : use fill instead of scale_bias in grouped expert selection (#17867)
* use fill instead of scale_bias in grouped expert selection

* do not explicitly use _inplace
b7327
2025-12-08 21:29:59 +01:00
Georgi Gerganov
f3beb22b17 sampling : handle n_probs case 2025-12-08 21:30:10 +02:00
Daniel Bevenius
2fa51c19b0 model-conversion : add token ids to prompt token output [no ci] (#17863)
This commit adds the token ids to the printed prompt outputs.

The motivation for this is that is can be useful to see the actual token
ids alongside the token strings for debugging.
2025-12-08 17:13:08 +01:00
Xuan-Son Nguyen
951520ddb0 server: delegate result_state creation to server_task (#17835)
* server: delegate result_state creation to server_task

* remove unued states

* add more docs
b7325
2025-12-08 17:04:38 +01:00
Georgi Gerganov
6d38db5dfe Merge branch 'master' into HEAD 2025-12-08 17:55:24 +02:00
Neo Zhang
68522c678d ci : support bfloat16 SYCL release package (#17855)
* support bfloat16 release package

* add fallback file
b7324
2025-12-08 15:09:39 +01:00
Xuan-Son Nguyen
f896d2c34f server: improve speed of speculative decoding (#17808)
* server: improve speed of speculative decoding

* fix small draft case

* add link to the PR

* server : fix generation time measurement

* server : fix draft acceptance logs (add SRV_CNT, SLT_CNT macros)

* server : add comment

* add PR to docs

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-12-08 14:35:28 +01:00
Piotr Wilkin (ilintar)
e4e9c4329c Make graph_max_nodes vary by ubatch size (#17794)
* Make graph_max_nodes vary by ubatch size for models where chunking might explode the graph

* Update src/llama-context.h

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Add missing const

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-12-08 14:32:41 +01:00
hksdpc255
636fc17a37 Fix Kimi-K2 tool-call parsing issues (#17376)
* Fix kimi-k2 parsing

* fix template & add more tests for kimi-k2

* Another fix for Kimi-K2 chat template.

* enable allow_toolcall_in_think for Kimi-K2

* Refine key-value separator and value end format

* Enable tool call in think for kimi-k2

* allow_toolcall_in_think is now tested with Kimi-K2

* Remove outdated TODO comment in XML tool call parser

Removed TODO comment about untested tool call feature.

* Rename function from "utf8_truncate_safe" to "utf8_truncate_safe_len"
2025-12-08 14:32:04 +01:00
Jay Zenith
51e0c2d917 cuda : add FILL op support (#17851)
* cuda : add FILL op support

* cuda : add missing FILL op files
2025-12-08 21:10:12 +08:00
Xuan-Son Nguyen
37a4f63244 server : add development documentation (#17760)
* first draft

* rewrite

* update & remove duplicated sections
2025-12-08 13:54:58 +01:00
Georgi Gerganov
2bc96931d2 server : make cache_reuse configurable per request (#17858) b7318 2025-12-08 12:43:12 +02:00
wsbagnsv1
5814b4dce1 cuda: optimize SOLVE_TRI using registers and FMAF (#17703)
* ggml-cuda: optimize solve_tri_f32_fast and fix stride handling

- Switch from using shared memory for the RHS/solution matrix to a register-based approach (x_low, x_high), reducing shared memory pressure and bank conflicts.
- Implement explicit `fmaf` instructions for the reduction loop.
- Update kernel arguments to pass strides in bytes rather than elements to align with standard ggml tensor arithmetic (casting to `char *` before addition).
- Remove unused `MAX_K_FAST` definition.

* Small cleanup

* Remove comments in solve_tri.cu

* Update ggml/src/ggml-cuda/solve_tri.cu

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Update ggml/src/ggml-cuda/solve_tri.cu

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Update ggml/src/ggml-cuda/solve_tri.cu

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Use const for variables in solve_tri.cu

* Replace fmaf with more readable code

* remove last fmaf

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
b7317
2025-12-08 10:41:08 +01:00
ixgbe
79d61896d3 ggml-cpu: add ggml_thread_cpu_relax with Zihintpause support (#17784)
* ggml-cpu: add ggml_thread_cpu_relax with Zihintpause support

Signed-off-by: Wang Yang <yangwang@iscas.ac.cn>

* cmake: enable RISC-V zihintpause extension for Spacemit builds

* readme : add ZIHINTPAUSE support for RISC-V

---------

Signed-off-by: Wang Yang <yangwang@iscas.ac.cn>
b7316
2025-12-08 10:41:34 +02:00
Xuan-Son Nguyen
4d3726278b model: add llama 4 scaling for mistral-large (deepseek arch) (#17744) b7315 2025-12-07 22:29:54 +01:00
lovedheart
08f9d3cc1d Vulkan: improve mul_mat_vec_iq1_m (#16907)
* Optimize Vulkan shader for matrix-vector multiplication

* Revert changes on compute_outputs and main

Refactor compute_outputs to handle remaining rows correctly.

* Fix trailing whitespace
b7314
2025-12-07 18:40:42 +01:00
Georgi Gerganov
72e3681073 sampling : fix top-p 2025-12-07 17:11:50 +02:00