llama : fix signed comparison warning on FreeBSD (#17497)

This ensures correct RLIM_INFINITY handling and compatibility on all platforms (32/64-bit).

warning: comparison of integers of different signs: 'rlim_t' (aka 'long') and 'size_t' (aka 'unsigned long') [-Wsign-compare]
  488 |         if (suggest && (lock_limit.rlim_max > lock_limit.rlim_cur + size)) {
      |                         ~~~~~~~~~~~~~~~~~~~ ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

convert: add error message for mistral3 quantized weight (#17686)
2026-05-19 23:44:06 +00:00 · 2025-12-02 12:05:38 +01:00 · 2025-12-02 11:48:31 +01:00 · 2025-12-02 11:38:57 +01:00 · 2025-12-02 11:18:39 +01:00 · 2025-12-02 10:25:11 +01:00
18 changed files with 418 additions and 103 deletions
--- a/.github/workflows/release.yml
+++ b/.github/workflows/release.yml
@@ -66,14 +66,21 @@ jobs:
        id: pack_artifacts
        run: |
          cp LICENSE ./build/bin/
-          zip -r llama-${{ steps.tag.outputs.name }}-bin-macos-arm64.zip ./build/bin/*
+          zip -y -r llama-${{ steps.tag.outputs.name }}-bin-macos-arm64.zip ./build/bin/*
+          tar -czvf llama-${{ steps.tag.outputs.name }}-bin-macos-arm64.tar.gz -C ./build/bin .

-      - name: Upload artifacts
+      - name: Upload artifacts (zip)
        uses: actions/upload-artifact@v4
        with:
          path: llama-${{ steps.tag.outputs.name }}-bin-macos-arm64.zip
          name: llama-bin-macos-arm64.zip

+      - name: Upload artifacts (tar)
+        uses: actions/upload-artifact@v4
+        with:
+          path: llama-${{ steps.tag.outputs.name }}-bin-macos-arm64.tar.gz
+          name: llama-bin-macos-arm64.tar.gz
+
  macOS-x64:
    runs-on: macos-15-intel

@@ -120,14 +127,21 @@ jobs:
        id: pack_artifacts
        run: |
          cp LICENSE ./build/bin/
-          zip -r llama-${{ steps.tag.outputs.name }}-bin-macos-x64.zip ./build/bin/*
+          zip -y -r llama-${{ steps.tag.outputs.name }}-bin-macos-x64.zip ./build/bin/*
+          tar -czvf llama-${{ steps.tag.outputs.name }}-bin-macos-x64.tar.gz -C ./build/bin .

-      - name: Upload artifacts
+      - name: Upload artifacts (zip)
        uses: actions/upload-artifact@v4
        with:
          path: llama-${{ steps.tag.outputs.name }}-bin-macos-x64.zip
          name: llama-bin-macos-x64.zip

+      - name: Upload artifacts (tar)
+        uses: actions/upload-artifact@v4
+        with:
+          path: llama-${{ steps.tag.outputs.name }}-bin-macos-x64.tar.gz
+          name: llama-bin-macos-x64.tar.gz
+
  ubuntu-22-cpu:
    strategy:
      matrix:
@@ -182,14 +196,21 @@ jobs:
        id: pack_artifacts
        run: |
          cp LICENSE ./build/bin/
-          zip -r llama-${{ steps.tag.outputs.name }}-bin-ubuntu-${{ matrix.build }}.zip ./build/bin/*
+          zip -y -r llama-${{ steps.tag.outputs.name }}-bin-ubuntu-${{ matrix.build }}.zip ./build/bin/*
+          tar -czvf llama-${{ steps.tag.outputs.name }}-bin-ubuntu-${{ matrix.build }}.tar.gz -C ./build/bin .

-      - name: Upload artifacts
+      - name: Upload artifacts (zip)
        uses: actions/upload-artifact@v4
        with:
          path: llama-${{ steps.tag.outputs.name }}-bin-ubuntu-${{ matrix.build }}.zip
          name: llama-bin-ubuntu-${{ matrix.build }}.zip

+      - name: Upload artifacts (tar)
+        uses: actions/upload-artifact@v4
+        with:
+          path: llama-${{ steps.tag.outputs.name }}-bin-ubuntu-${{ matrix.build }}.tar.gz
+          name: llama-bin-ubuntu-${{ matrix.build }}.tar.gz
+
  ubuntu-22-vulkan:
    runs-on: ubuntu-22.04

@@ -235,14 +256,21 @@ jobs:
        id: pack_artifacts
        run: |
          cp LICENSE ./build/bin/
-          zip -r llama-${{ steps.tag.outputs.name }}-bin-ubuntu-vulkan-x64.zip ./build/bin/*
+          zip -y -r llama-${{ steps.tag.outputs.name }}-bin-ubuntu-vulkan-x64.zip ./build/bin/*
+          tar -czvf llama-${{ steps.tag.outputs.name }}-bin-ubuntu-vulkan-x64.tar.gz -C ./build/bin .

-      - name: Upload artifacts
+      - name: Upload artifacts (zip)
        uses: actions/upload-artifact@v4
        with:
          path: llama-${{ steps.tag.outputs.name }}-bin-ubuntu-vulkan-x64.zip
          name: llama-bin-ubuntu-vulkan-x64.zip

+      - name: Upload artifacts (tar)
+        uses: actions/upload-artifact@v4
+        with:
+          path: llama-${{ steps.tag.outputs.name }}-bin-ubuntu-vulkan-x64.tar.gz
+          name: llama-bin-ubuntu-vulkan-x64.tar.gz
+
  windows-cpu:
    runs-on: windows-2025

@@ -298,7 +326,7 @@ jobs:
        run: |
          Copy-Item $env:CURL_PATH\bin\libcurl-${{ matrix.arch }}.dll .\build\bin\Release\
          Copy-Item "C:\Program Files\Microsoft Visual Studio\2022\Enterprise\VC\Redist\MSVC\14.44.35112\debug_nonredist\${{ matrix.arch }}\Microsoft.VC143.OpenMP.LLVM\libomp140.${{ matrix.arch == 'x64' && 'x86_64' || 'aarch64' }}.dll" .\build\bin\Release\
-          7z a llama-bin-win-cpu-${{ matrix.arch }}.zip .\build\bin\Release\*
+          7z a -snl llama-bin-win-cpu-${{ matrix.arch }}.zip .\build\bin\Release\*

      - name: Upload artifacts
        uses: actions/upload-artifact@v4
@@ -380,7 +408,7 @@ jobs:
      - name: Pack artifacts
        id: pack_artifacts
        run: |
-          7z a llama-bin-win-${{ matrix.backend }}-${{ matrix.arch }}.zip .\build\bin\Release\${{ matrix.target }}.dll
+          7z a -snl llama-bin-win-${{ matrix.backend }}-${{ matrix.arch }}.zip .\build\bin\Release\${{ matrix.target }}.dll

      - name: Upload artifacts
        uses: actions/upload-artifact@v4
@@ -434,7 +462,7 @@ jobs:
      - name: Pack artifacts
        id: pack_artifacts
        run: |
-          7z a llama-bin-win-cuda-${{ matrix.cuda }}-x64.zip .\build\bin\Release\ggml-cuda.dll
+          7z a -snl llama-bin-win-cuda-${{ matrix.cuda }}-x64.zip .\build\bin\Release\ggml-cuda.dll

      - name: Upload artifacts
        uses: actions/upload-artifact@v4
@@ -526,7 +554,7 @@ jobs:
          cp "${{ env.ONEAPI_ROOT }}/umf/latest/bin/umf.dll" ./build/bin

          echo "cp oneAPI running time dll files to ./build/bin done"
-          7z a llama-bin-win-sycl-x64.zip ./build/bin/*
+          7z a -snl llama-bin-win-sycl-x64.zip ./build/bin/*

      - name: Upload the release package
        uses: actions/upload-artifact@v4
@@ -632,7 +660,7 @@ jobs:
      - name: Pack artifacts
        id: pack_artifacts
        run: |
-          7z a llama-bin-win-hip-${{ matrix.name }}-x64.zip .\build\bin\*
+          7z a -snl llama-bin-win-hip-${{ matrix.name }}-x64.zip .\build\bin\*

      - name: Upload artifacts
        uses: actions/upload-artifact@v4
@@ -685,13 +713,20 @@ jobs:
      - name: Pack artifacts
        id: pack_artifacts
        run: |
-          zip --symlinks -r llama-${{ steps.tag.outputs.name }}-xcframework.zip build-apple/llama.xcframework
+          zip -y -r llama-${{ steps.tag.outputs.name }}-xcframework.zip build-apple/llama.xcframework
+          tar -czvf llama-${{ steps.tag.outputs.name }}-xcframework.tar.gz -C build-apple llama.xcframework

-      - name: Upload artifacts
+      - name: Upload artifacts (zip)
        uses: actions/upload-artifact@v4
        with:
          path: llama-${{ steps.tag.outputs.name }}-xcframework.zip
-          name: llama-${{ steps.tag.outputs.name }}-xcframework
+          name: llama-${{ steps.tag.outputs.name }}-xcframework.zip
+
+      - name: Upload artifacts (tar)
+        uses: actions/upload-artifact@v4
+        with:
+          path: llama-${{ steps.tag.outputs.name }}-xcframework.tar.gz
+          name: llama-${{ steps.tag.outputs.name }}-xcframework.tar.gz

  openEuler-cann:
    strategy:
@@ -730,14 +765,21 @@ jobs:
      - name: Pack artifacts
        run: |
          cp LICENSE ./build/bin/
-          zip -r llama-${{ steps.tag.outputs.name }}-bin-${{ matrix.chip_type }}-openEuler-${{ matrix.arch }}.zip ./build/bin/*
+          zip -y -r llama-${{ steps.tag.outputs.name }}-bin-${{ matrix.chip_type }}-openEuler-${{ matrix.arch }}.zip ./build/bin/*
+          tar -czvf llama-${{ steps.tag.outputs.name }}-bin-${{ matrix.chip_type }}-openEuler-${{ matrix.arch }}.tar.gz -C ./build/bin .

-      - name: Upload artifacts
+      - name: Upload artifacts (zip)
        uses: actions/upload-artifact@v4
        with:
          path: llama-${{ steps.tag.outputs.name }}-bin-${{ matrix.chip_type }}-openEuler-${{ matrix.arch }}.zip
          name: llama-bin-${{ matrix.chip_type }}-openEuler-${{ matrix.arch }}.zip

+      - name: Upload artifacts (tar)
+        uses: actions/upload-artifact@v4
+        with:
+          path: llama-${{ steps.tag.outputs.name }}-bin-${{ matrix.chip_type }}-openEuler-${{ matrix.arch }}.tar.gz
+          name: llama-bin-${{ matrix.chip_type }}-openEuler-${{ matrix.arch }}.tar.gz
+
  release:
    if: ${{ ( github.event_name == 'push' && github.ref == 'refs/heads/master' ) || github.event.inputs.create_release == 'true' }}

@@ -814,6 +856,7 @@ jobs:

          echo "Moving other artifacts..."
          mv -v artifact/*.zip release
+          mv -v artifact/*.tar.gz release

      - name: Create release
        id: create_release
@@ -822,6 +865,39 @@ jobs:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        with:
          tag_name: ${{ steps.tag.outputs.name }}
+          body: |
+            > [!WARNING]
+            > **Release Format Update**: Linux releases will soon use .tar.gz archives instead of .zip. Please make the necessary changes to your deployment scripts.
+
+            **macOS/iOS:**
+            - [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/${{ steps.tag.outputs.name }}/llama-${{ steps.tag.outputs.name }}-bin-macos-arm64.tar.gz)
+            - [macOS Intel (x64)](https://github.com/ggml-org/llama.cpp/releases/download/${{ steps.tag.outputs.name }}/llama-${{ steps.tag.outputs.name }}-bin-macos-x64.tar.gz)
+            - [iOS XCFramework](https://github.com/ggml-org/llama.cpp/releases/download/${{ steps.tag.outputs.name }}/llama-${{ steps.tag.outputs.name }}-xcframework.tar.gz)
+
+            **Linux:**
+            - [Ubuntu x64 (CPU)](https://github.com/ggml-org/llama.cpp/releases/download/${{ steps.tag.outputs.name }}/llama-${{ steps.tag.outputs.name }}-bin-ubuntu-x64.tar.gz)
+            - [Ubuntu x64 (Vulkan)](https://github.com/ggml-org/llama.cpp/releases/download/${{ steps.tag.outputs.name }}/llama-${{ steps.tag.outputs.name }}-bin-ubuntu-vulkan-x64.tar.gz)
+            - [Ubuntu s390x (CPU)](https://github.com/ggml-org/llama.cpp/releases/download/${{ steps.tag.outputs.name }}/llama-${{ steps.tag.outputs.name }}-bin-ubuntu-s390x.tar.gz)
+
+            **Windows:**
+            - [Windows x64 (CPU)](https://github.com/ggml-org/llama.cpp/releases/download/${{ steps.tag.outputs.name }}/llama-${{ steps.tag.outputs.name }}-bin-win-cpu-x64.zip)
+            - [Windows arm64 (CPU)](https://github.com/ggml-org/llama.cpp/releases/download/${{ steps.tag.outputs.name }}/llama-${{ steps.tag.outputs.name }}-bin-win-cpu-arm64.zip)
+            - [Windows x64 (CUDA)](https://github.com/ggml-org/llama.cpp/releases/download/${{ steps.tag.outputs.name }}/llama-${{ steps.tag.outputs.name }}-bin-win-cuda-12.4-x64.zip)
+            - [Windows x64 (Vulkan)](https://github.com/ggml-org/llama.cpp/releases/download/${{ steps.tag.outputs.name }}/llama-${{ steps.tag.outputs.name }}-bin-win-vulkan-x64.zip)
+            - [Windows x64 (SYCL)](https://github.com/ggml-org/llama.cpp/releases/download/${{ steps.tag.outputs.name }}/llama-${{ steps.tag.outputs.name }}-bin-win-sycl-x64.zip)
+            - [Windows x64 (HIP)](https://github.com/ggml-org/llama.cpp/releases/download/${{ steps.tag.outputs.name }}/llama-${{ steps.tag.outputs.name }}-bin-win-hip-radeon-x64.zip)
+
+            **openEuler:**
+            - [openEuler x86 (310p)](https://github.com/ggml-org/llama.cpp/releases/download/${{ steps.tag.outputs.name }}/llama-${{ steps.tag.outputs.name }}-bin-310p-openEuler-x86.tar.gz)
+            - [openEuler x86 (910b)](https://github.com/ggml-org/llama.cpp/releases/download/${{ steps.tag.outputs.name }}/llama-${{ steps.tag.outputs.name }}-bin-910b-openEuler-x86.tar.gz)
+            - [openEuler aarch64 (310p)](https://github.com/ggml-org/llama.cpp/releases/download/${{ steps.tag.outputs.name }}/llama-${{ steps.tag.outputs.name }}-bin-310p-openEuler-aarch64.tar.gz)
+            - [openEuler aarch64 (910b)](https://github.com/ggml-org/llama.cpp/releases/download/${{ steps.tag.outputs.name }}/llama-${{ steps.tag.outputs.name }}-bin-910b-openEuler-aarch64.tar.gz)
+
+            <details>
+
+            ${{ github.event.head_commit.message }}
+
+            </details>

      - name: Upload release
        id: upload_release
@@ -833,7 +909,7 @@ jobs:
            const fs = require('fs');
            const release_id = '${{ steps.create_release.outputs.id }}';
            for (let file of await fs.readdirSync('./release')) {
-              if (path.extname(file) === '.zip') {
+              if (path.extname(file) === '.zip' || file.endsWith('.tar.gz')) {
                console.log('uploadReleaseAsset', file);
                await github.repos.uploadReleaseAsset({
                  owner: context.repo.owner,
--- a/.github/workflows/winget.yml
+++ b/.github/workflows/winget.yml
@@ -9,6 +9,7 @@ jobs:
  update:
    name: Update Winget Package
    runs-on: ubuntu-latest
+    if: ${{ github.repository_owner == 'ggml-org' }}

    steps:
      - name: Install cargo binstall
--- a/convert_hf_to_gguf.py
+++ b/convert_hf_to_gguf.py
@@ -2842,6 +2842,10 @@ class Mistral3Model(LlamaModel):
            self.gguf_writer.add_attn_temperature_scale(rope_params["llama_4_scaling_beta"])

    def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None):
+        # TODO: probably not worth supporting quantized weight, as official BF16 is also available
+        if name.endswith("weight_scale_inv"):
+            raise ValueError("This is a quantized weight, please use BF16 weight instead")
+
        name = name.replace("language_model.", "")
        if "multi_modal_projector" in name or "vision_tower" in name:
            return []
--- a/ggml/src/ggml-cpu/arch/arm/cpu-feats.cpp
+++ b/ggml/src/ggml-cpu/arch/arm/cpu-feats.cpp
@@ -8,6 +8,10 @@
 #include <sys/sysctl.h>
 #endif

+#if !defined(HWCAP2_SVE2)
+#define HWCAP2_SVE2 (1 << 1)
+#endif
+
 #if !defined(HWCAP2_I8MM)
 #define HWCAP2_I8MM (1 << 13)
 #endif
--- a/ggml/src/ggml-cuda/common.cuh
+++ b/ggml/src/ggml-cuda/common.cuh
@@ -989,6 +989,10 @@ struct ggml_cuda_concurrent_event {
    int                                          n_streams = 0;
    std::unordered_map<const ggml_tensor *, int> stream_mapping;

+    // Original order of nodes in this concurrent region (before interleaving)
+    // Used to restore grouping for fusion within streams
+    std::vector<const ggml_tensor *> original_order;
+
    const ggml_tensor * join_node;

    ggml_cuda_concurrent_event() = default;
@@ -1011,6 +1015,7 @@ struct ggml_cuda_concurrent_event {
    , fork_event(other.fork_event)
    , n_streams(other.n_streams)
    , stream_mapping(std::move(other.stream_mapping))
+    , original_order(std::move(other.original_order))
    , join_node(other.join_node) {
        other.fork_event = nullptr;
    }
@@ -1121,11 +1126,9 @@ struct ggml_cuda_concurrent_event {
 };

 struct ggml_cuda_stream_context {
-    std::vector<const ggml_tensor *>                                    original_nodes;
    std::unordered_map<const ggml_tensor *, ggml_cuda_concurrent_event> concurrent_events;

    void reset() {
-        original_nodes.clear();
        concurrent_events.clear();
    }
 };
--- a/ggml/src/ggml-cuda/ggml-cuda.cu
+++ b/ggml/src/ggml-cuda/ggml-cuda.cu
@@ -3238,9 +3238,56 @@ static void evaluate_and_capture_cuda_graph(ggml_backend_cuda_context * cuda_ctx
                }
            }
            if (should_launch_concurrent_events) {
-                //Restore the original graph to enable fusion within the streams
-                cgraph->nodes   = const_cast<ggml_tensor **>(stream_ctx.original_nodes.data());
-                cgraph->n_nodes = (int) stream_ctx.original_nodes.size();
+                // Restore original node order within each concurrent region to enable fusion within streams
+
+                std::unordered_map<const ggml_tensor *, int> node_to_idx;
+                node_to_idx.reserve(cgraph->n_nodes);
+                for (int i = 0; i < cgraph->n_nodes; ++i) {
+                    node_to_idx[cgraph->nodes[i]] = i;
+                }
+
+                for (auto & [fork_node, event] : stream_ctx.concurrent_events) {
+                    // Find positions of all nodes from this event in the current graph
+                    std::vector<int> positions;
+                    positions.reserve(event.original_order.size());
+
+                    bool all_found = true;
+                    for (const ggml_tensor * orig_node : event.original_order) {
+                        auto it = node_to_idx.find(orig_node);
+                        if (it != node_to_idx.end()) {
+                            positions.push_back(it->second);
+                        } else {
+                            all_found = false;
+                            break;
+                        }
+                    }
+
+                    if (!all_found || positions.size() != event.original_order.size()) {
+                        continue;
+                    }
+
+                    // Sort positions to get contiguous range
+                    std::vector<int> sorted_positions = positions;
+                    std::sort(sorted_positions.begin(), sorted_positions.end());
+
+                    bool is_contiguous = true;
+                    for (size_t i = 1; i < sorted_positions.size(); ++i) {
+                        if (sorted_positions[i] != sorted_positions[i-1] + 1) {
+                            is_contiguous = false;
+                            break;
+                        }
+                    }
+
+                    if (!is_contiguous) {
+                        continue;
+                    }
+
+                    // Restore original order at the sorted positions
+                    int start_pos = sorted_positions[0];
+                    for (size_t i = 0; i < event.original_order.size(); ++i) {
+                        cgraph->nodes[start_pos + i] = const_cast<ggml_tensor *>(event.original_order[i]);
+                    }
+                }
            }

            for (int i = 0; i < cgraph->n_nodes; i++) {
@@ -3805,14 +3852,6 @@ static void ggml_backend_cuda_graph_optimize(ggml_backend_t backend, ggml_cgraph
    // store {fork_idx, join_idx}
    std::vector<std::pair<int, int>> concurrent_node_ranges;

-    // save the original nodes
-    std::vector<const ggml_tensor *> original_nodes;
-    original_nodes.reserve(cgraph->n_nodes);
-    for (int i = 0; i < cgraph->n_nodes; ++i) {
-        original_nodes.push_back(cgraph->nodes[i]);
-    }
-    cuda_ctx->stream_context().original_nodes = std::move(original_nodes);
-
    for (const auto & [root_node, count] : fan_out) {
        if (count >= min_fan_out && count <= max_fan_out) {
            const int root_node_idx = node_indices[root_node];
@@ -3917,6 +3956,13 @@ static void ggml_backend_cuda_graph_optimize(ggml_backend_t backend, ggml_cgraph
                    continue;
                }

+                // Save the original order of nodes in this region before interleaving
+                // This is used later to restore grouping for fusion within streams
+                concurrent_event.original_order.reserve(total_branch_nodes);
+                for (int i = fork_node_idx + 1; i < join_node_idx; ++i) {
+                    concurrent_event.original_order.push_back(cgraph->nodes[i]);
+                }
+
                std::unordered_map<const ggml_tensor *, ggml_cuda_concurrent_event> & concurrent_events = cuda_ctx->stream_context().concurrent_events;
                GGML_ASSERT(concurrent_events.find(root_node) == concurrent_events.end());
                concurrent_events.emplace(root_node, std::move(concurrent_event));
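
The restore step in `evaluate_and_capture_cuda_graph` above can be read as a small standalone routine: locate every node of a region, check that the positions form one contiguous range, and only then write the saved order back. The following is a minimal sketch of that idea, not the backend code itself. It assumes node ids are unique, uses plain `int` ids in place of `ggml_tensor *`, and the function name `restore_region_order` is hypothetical. The early return for an empty region is an addition of this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Writes `original_order` back into `nodes` only if every element is present
// and together they occupy one contiguous range of positions.
// Returns true when the range was rewritten, false when the region was skipped.
// `int` ids stand in for the `ggml_tensor *` pointers used in the diff (assumption).
inline bool restore_region_order(std::vector<int> & nodes, const std::vector<int> & original_order) {
    if (original_order.empty()) {
        return false; // nothing to restore (guard added in this sketch)
    }

    // Map each node to its current position in the (interleaved) graph.
    std::unordered_map<int, std::size_t> node_to_idx;
    node_to_idx.reserve(nodes.size());
    for (std::size_t i = 0; i < nodes.size(); ++i) {
        node_to_idx[nodes[i]] = i;
    }

    // Collect the current position of every node in the region.
    std::vector<std::size_t> positions;
    positions.reserve(original_order.size());
    for (int node : original_order) {
        auto it = node_to_idx.find(node);
        if (it == node_to_idx.end()) {
            return false; // a node is missing, skip this region
        }
        positions.push_back(it->second);
    }

    // The region may only be rewritten if its positions are contiguous.
    std::sort(positions.begin(), positions.end());
    for (std::size_t i = 1; i < positions.size(); ++i) {
        if (positions[i] != positions[i - 1] + 1) {
            return false;
        }
    }

    // Restore the saved order starting at the first position of the range.
    assert(positions[0] + original_order.size() <= nodes.size());
    std::copy(original_order.begin(), original_order.end(),
              nodes.begin() + static_cast<std::ptrdiff_t>(positions[0]));
    return true;
}
```

Skipping a region instead of failing keeps the graph valid when another pass has moved or removed nodes in the meantime; the only cost is that fusion within that region may not be restored.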
--- a/src/llama-mmap.cpp
+++ b/src/llama-mmap.cpp
@@ -485,7 +485,7 @@ struct llama_mlock::impl {
        if (suggest && getrlimit(RLIMIT_MEMLOCK, &lock_limit)) {
            suggest = false;
        }
-        if (suggest && (lock_limit.rlim_max > lock_limit.rlim_cur + size)) {
+        if (suggest && ((uint64_t)lock_limit.rlim_max > (uint64_t)lock_limit.rlim_cur + size)) {
            suggest = false;
        }
 #endif
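
The cast in the hunk above makes both sides of the comparison unsigned before they are compared. A minimal sketch of the same pattern follows. It assumes a platform where `rlim_t` is a signed 64-bit type, as the warning text reports for FreeBSD; `signed_rlim_t`, `kSignedRlimInfinity`, and `hard_limit_exceeds` are hypothetical names used for illustration only, and the sketch covers non-negative limits only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>

// Hypothetical stand-in for a signed rlim_t whose "infinity" is the type's maximum.
using signed_rlim_t = std::int64_t;
constexpr signed_rlim_t kSignedRlimInfinity = std::numeric_limits<std::int64_t>::max();

// Returns true when the hard limit is larger than the soft limit plus `size` bytes.
// Both limits are converted to uint64_t first, so the comparison has unsigned
// operands on both sides and no implicit signed/unsigned conversion is needed.
inline bool hard_limit_exceeds(signed_rlim_t rlim_max, signed_rlim_t rlim_cur, std::size_t size) {
    assert(rlim_max >= 0 && rlim_cur >= 0); // the sketch only covers non-negative limits
    return (std::uint64_t) rlim_max > (std::uint64_t) rlim_cur + size;
}
```

With a signed 64-bit maximum as the soft limit, the widened sum stays below the `uint64_t` maximum for any realistic `size`, so the comparison yields `false` rather than wrapping around.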
--- a/tools/server/public/index.html.gz
+++ b/tools/server/public/index.html.gz
--- a/tools/server/server-common.cpp
+++ b/tools/server/server-common.cpp
@@ -1263,7 +1263,11 @@ json convert_anthropic_to_oai(const json & body) {
    return oai_body;
 }

-json format_embeddings_response_oaicompat(const json & request, const json & embeddings, bool use_base64) {
+json format_embeddings_response_oaicompat(
+        const json & request,
+        const std::string & model_name,
+        const json & embeddings,
+        bool use_base64) {
    json data = json::array();
    int32_t n_tokens = 0;
    int i = 0;
@@ -1293,7 +1297,7 @@ json format_embeddings_response_oaicompat(const json & request, const json & emb
    }

    json res = json {
-        {"model", json_value(request, "model", std::string(DEFAULT_OAICOMPAT_MODEL))},
+        {"model", json_value(request, "model", model_name)},
        {"object", "list"},
        {"usage", json {
            {"prompt_tokens", n_tokens},
@@ -1307,6 +1311,7 @@ json format_embeddings_response_oaicompat(const json & request, const json & emb

 json format_response_rerank(
        const json & request,
+        const std::string & model_name,
        const json & ranks,
        bool is_tei_format,
        std::vector<std::string> & texts,
@@ -1338,7 +1343,7 @@ json format_response_rerank(
    if (is_tei_format) return results;

    json res = json{
-        {"model", json_value(request, "model", std::string(DEFAULT_OAICOMPAT_MODEL))},
+        {"model", json_value(request, "model", model_name)},
        {"object", "list"},
        {"usage", json{
            {"prompt_tokens", n_tokens},
--- a/tools/server/server-common.h
+++ b/tools/server/server-common.h
@@ -13,8 +13,6 @@
 #include <vector>
 #include <cinttypes>

-#define DEFAULT_OAICOMPAT_MODEL "gpt-3.5-turbo"
-
 const static std::string build_info("b" + std::to_string(LLAMA_BUILD_NUMBER) + "-" + LLAMA_COMMIT);

 using json = nlohmann::ordered_json;
@@ -298,11 +296,16 @@ json oaicompat_chat_params_parse(
 json convert_anthropic_to_oai(const json & body);

 // TODO: move it to server-task.cpp
-json format_embeddings_response_oaicompat(const json & request, const json & embeddings, bool use_base64 = false);
+json format_embeddings_response_oaicompat(
+    const json & request,
+    const std::string & model_name,
+    const json & embeddings,
+    bool use_base64 = false);

 // TODO: move it to server-task.cpp
 json format_response_rerank(
        const json & request,
+        const std::string & model_name,
        const json & ranks,
        bool is_tei_format,
        std::vector<std::string> & texts,
--- a/tools/server/server-context.cpp
+++ b/tools/server/server-context.cpp
@@ -17,6 +17,7 @@
 #include <cinttypes>
 #include <memory>
 #include <unordered_set>
+#include <filesystem>

 // fix problem with std::min and std::max
 #if defined(_WIN32)
@@ -518,6 +519,8 @@ struct server_context_impl {
    // Necessary similarity of prompt for slot selection
    float slot_prompt_similarity = 0.0f;

+    std::string model_name; // name of the loaded model, to be used by API
+
    common_chat_templates_ptr chat_templates;
    oaicompat_parser_options  oai_parser_opt;

@@ -758,6 +761,18 @@ struct server_context_impl {
        }
        SRV_WRN("%s", "for more info see https://github.com/ggml-org/llama.cpp/pull/16391\n");

+        if (!params_base.model_alias.empty()) {
+            // user explicitly specified model name
+            model_name = params_base.model_alias;
+        } else if (!params_base.model.name.empty()) {
+            // use model name in registry format (for models in cache)
+            model_name = params_base.model.name;
+        } else {
+            // fallback: derive model name from file name
+            auto model_path = std::filesystem::path(params_base.model.path);
+            model_name = model_path.filename().string();
+        }
+
        // thinking is enabled if:
        // 1. It's not explicitly disabled (reasoning_budget == 0)
        // 2. The chat template supports it
@@ -2611,7 +2626,7 @@ static std::unique_ptr<server_res_generator> handle_completions_impl(
            // OAI-compat
            task.params.res_type          = res_type;
            task.params.oaicompat_cmpl_id = completion_id;
-            // oaicompat_model is already populated by params_from_json_cmpl
+            task.params.oaicompat_model   = ctx_server.model_name;

            tasks.push_back(std::move(task));
        }
@@ -2939,7 +2954,7 @@ void server_routes::init_routes() {
        json data = {
            { "default_generation_settings", default_generation_settings_for_props },
            { "total_slots",                 ctx_server.params_base.n_parallel },
-            { "model_alias",                 ctx_server.params_base.model_alias },
+            { "model_alias",                 ctx_server.model_name },
            { "model_path",                  ctx_server.params_base.model.path },
            { "modalities",                  json {
                {"vision", ctx_server.oai_parser_opt.allow_image},
@@ -3181,8 +3196,8 @@ void server_routes::init_routes() {
        json models = {
            {"models", {
                {
-                    {"name", params.model_alias.empty() ? params.model.path : params.model_alias},
-                    {"model", params.model_alias.empty() ? params.model.path : params.model_alias},
+                    {"name", ctx_server.model_name},
+                    {"model", ctx_server.model_name},
                    {"modified_at", ""},
                    {"size", ""},
                    {"digest", ""}, // dummy value, llama.cpp does not support managing model file's hash
@@ -3204,7 +3219,7 @@ void server_routes::init_routes() {
            {"object", "list"},
            {"data", {
                {
-                    {"id",       params.model_alias.empty() ? params.model.path : params.model_alias},
+                    {"id",       ctx_server.model_name},
                    {"object",   "model"},
                    {"created",  std::time(0)},
                    {"owned_by", "llamacpp"},
@@ -3351,6 +3366,7 @@ void server_routes::init_routes() {
        // write JSON response
        json root = format_response_rerank(
            body,
+            ctx_server.model_name,
            responses,
            is_tei_format,
            documents,
@@ -3613,7 +3629,7 @@ std::unique_ptr<server_res_generator> server_routes::handle_embeddings_impl(cons

    // write JSON response
    json root = res_type == TASK_RESPONSE_TYPE_OAI_EMBD
-        ? format_embeddings_response_oaicompat(body, responses, use_base64)
+        ? format_embeddings_response_oaicompat(body, ctx_server.model_name, responses, use_base64)
        : json(responses);
    res->ok(root);
    return res;
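
The `model_name` selection added to `server_context_impl` in this file follows a three-step priority: explicit alias, then registry-style name, then the file name of the model path. A minimal standalone sketch of that fallback chain is shown below. The function name `resolve_model_name` and its parameters are hypothetical; in the diff the values come from `params_base.model_alias`, `params_base.model.name`, and `params_base.model.path`.

```cpp
#include <cassert>
#include <filesystem>
#include <string>

// Picks the model name reported by the API, in the same priority order as the diff:
//   1. the alias given explicitly by the user
//   2. the registry-style name (for models in the cache)
//   3. the file name component of the model path
inline std::string resolve_model_name(const std::string & alias,
                                      const std::string & registry_name,
                                      const std::string & model_path) {
    if (!alias.empty()) {
        return alias;
    }
    if (!registry_name.empty()) {
        return registry_name;
    }
    // fallback: derive the name from the file name, dropping the directory part
    return std::filesystem::path(model_path).filename().string();
}
```

Resolving the name once and reusing it means `/props`, the model listing endpoints, and the embeddings and rerank responses all report the same value.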
--- a/tools/server/server-models.cpp
+++ b/tools/server/server-models.cpp
@@ -24,8 +24,55 @@
 #include <unistd.h>
 #endif

+#if defined(__APPLE__) && defined(__MACH__)
+// macOS: use _NSGetExecutablePath to get the executable path
+#include <mach-o/dyld.h>
+#include <limits.h>
+#endif
+
 #define CMD_EXIT "exit"

+static std::filesystem::path get_server_exec_path() {
+#if defined(_WIN32)
+    wchar_t buf[32768] = { 0 };  // Large buffer to handle long paths
+    DWORD len = GetModuleFileNameW(nullptr, buf, _countof(buf));
+    if (len == 0 || len >= _countof(buf)) {
+        throw std::runtime_error("GetModuleFileNameW failed or path too long");
+    }
+    return std::filesystem::path(buf);
+#elif defined(__APPLE__) && defined(__MACH__)
+    char small_path[PATH_MAX];
+    uint32_t size = sizeof(small_path);
+
+    if (_NSGetExecutablePath(small_path, &size) == 0) {
+        // resolve any symlinks to get absolute path
+        try {
+            return std::filesystem::canonical(std::filesystem::path(small_path));
+        } catch (...) {
+            return std::filesystem::path(small_path);
+        }
+    } else {
+        // buffer was too small, allocate required size and call again
+        std::vector<char> buf(size);
+        if (_NSGetExecutablePath(buf.data(), &size) == 0) {
+            try {
+                return std::filesystem::canonical(std::filesystem::path(buf.data()));
+            } catch (...) {
+                return std::filesystem::path(buf.data());
+            }
+        }
+        throw std::runtime_error("_NSGetExecutablePath failed after buffer resize");
+    }
+#else
+    char path[FILENAME_MAX];
+    ssize_t count = readlink("/proc/self/exe", path, FILENAME_MAX);
+    if (count <= 0) {
+        throw std::runtime_error("failed to resolve /proc/self/exe");
+    }
+    return std::filesystem::path(std::string(path, count));
+#endif
+}
+
 struct local_model {
    std::string name;
    std::string path;
@@ -99,6 +146,14 @@ server_models::server_models(
    for (char ** env = envp; *env != nullptr; env++) {
        base_env.push_back(std::string(*env));
    }
+    GGML_ASSERT(!base_args.empty());
+    // set binary path
+    try {
+        base_args[0] = get_server_exec_path().string();
+    } catch (const std::exception & e) {
+        LOG_WRN("failed to get server executable path: %s\n", e.what());
+        LOG_WRN("using original argv[0] as fallback: %s\n", base_args[0].c_str());
+    }
    // TODO: allow refreshing cached model list
    // add cached models
    auto cached_models = common_list_cached_models();
@@ -587,26 +642,26 @@ static void res_ok(std::unique_ptr<server_http_res> & res, const json & response
    res->data = safe_json_to_str(response_data);
 }

-static void res_error(std::unique_ptr<server_http_res> & res, const json & error_data) {
+static void res_err(std::unique_ptr<server_http_res> & res, const json & error_data) {
    res->status = json_value(error_data, "code", 500);
    res->data = safe_json_to_str({{ "error", error_data }});
 }

 static bool router_validate_model(const std::string & name, server_models & models, bool models_autoload, std::unique_ptr<server_http_res> & res) {
    if (name.empty()) {
-        res_error(res, format_error_response("model name is missing from the request", ERROR_TYPE_INVALID_REQUEST));
+        res_err(res, format_error_response("model name is missing from the request", ERROR_TYPE_INVALID_REQUEST));
        return false;
    }
    auto meta = models.get_meta(name);
    if (!meta.has_value()) {
-        res_error(res, format_error_response("model not found", ERROR_TYPE_INVALID_REQUEST));
+        res_err(res, format_error_response("model not found", ERROR_TYPE_INVALID_REQUEST));
        return false;
    }
    if (models_autoload) {
        models.ensure_model_loaded(name);
    } else {
        if (meta->status != SERVER_MODEL_STATUS_LOADED) {
-            res_error(res, format_error_response("model is not loaded", ERROR_TYPE_INVALID_REQUEST));
+            res_err(res, format_error_response("model is not loaded", ERROR_TYPE_INVALID_REQUEST));
            return false;
        }
    }
@@ -706,11 +761,11 @@ void server_models_routes::init_routes() {
        std::string name = json_value(body, "model", std::string());
        auto model = models.get_meta(name);
        if (!model.has_value()) {
-            res_error(res, format_error_response("model is not found", ERROR_TYPE_NOT_FOUND));
+            res_err(res, format_error_response("model is not found", ERROR_TYPE_NOT_FOUND));
            return res;
        }
        if (model->status == SERVER_MODEL_STATUS_LOADED) {
-            res_error(res, format_error_response("model is already loaded", ERROR_TYPE_INVALID_REQUEST));
+            res_err(res, format_error_response("model is already loaded", ERROR_TYPE_INVALID_REQUEST));
            return res;
        }
        models.load(name, false);
@@ -768,11 +823,11 @@ void server_models_routes::init_routes() {
        std::string name = json_value(body, "model", std::string());
        auto model = models.get_meta(name);
        if (!model.has_value()) {
-            res_error(res, format_error_response("model is not found", ERROR_TYPE_INVALID_REQUEST));
+            res_err(res, format_error_response("model is not found", ERROR_TYPE_INVALID_REQUEST));
            return res;
        }
        if (model->status != SERVER_MODEL_STATUS_LOADED) {
-            res_error(res, format_error_response("model is not loaded", ERROR_TYPE_INVALID_REQUEST));
+            res_err(res, format_error_response("model is not loaded", ERROR_TYPE_INVALID_REQUEST));
            return res;
        }
        models.unload(name);
--- a/tools/server/server-task.cpp
+++ b/tools/server/server-task.cpp
@@ -450,9 +450,6 @@ task_params server_task::params_from_json_cmpl(
        }
    }

-    std::string model_name = params_base.model_alias.empty() ? DEFAULT_OAICOMPAT_MODEL : params_base.model_alias;
-    params.oaicompat_model = json_value(data, "model", model_name);
-
    return params;
 }

--- a/tools/server/tests/unit/test_chat_completion.py
+++ b/tools/server/tests/unit/test_chat_completion.py
@@ -41,7 +41,8 @@ def test_chat_completion(model, system_prompt, user_prompt, max_tokens, re_conte
    assert res.status_code == 200
    assert "cmpl" in res.body["id"] # make sure the completion id has the expected format
    assert res.body["system_fingerprint"].startswith("b")
-    assert res.body["model"] == model if model is not None else server.model_alias
+    # we no longer reflect back the model name, see https://github.com/ggml-org/llama.cpp/pull/17668
+    # assert res.body["model"] == model if model is not None else server.model_alias
    assert res.body["usage"]["prompt_tokens"] == n_prompt
    assert res.body["usage"]["completion_tokens"] == n_predicted
    choice = res.body["choices"][0]
@@ -59,7 +60,7 @@ def test_chat_completion(model, system_prompt, user_prompt, max_tokens, re_conte
 )
 def test_chat_completion_stream(system_prompt, user_prompt, max_tokens, re_content, n_prompt, n_predicted, finish_reason):
    global server
-    server.model_alias = None # try using DEFAULT_OAICOMPAT_MODEL
+    server.model_alias = "llama-test-model"
    server.start()
    res = server.make_stream_request("POST", "/chat/completions", data={
        "max_tokens": max_tokens,
@@ -81,7 +82,7 @@ def test_chat_completion_stream(system_prompt, user_prompt, max_tokens, re_conte
            else:
                assert "role" not in choice["delta"]
            assert data["system_fingerprint"].startswith("b")
-            assert "gpt-3.5" in data["model"] # DEFAULT_OAICOMPAT_MODEL, maybe changed in the future
+            assert data["model"] == "llama-test-model"
            if last_cmpl_id is None:
                last_cmpl_id = data["id"]
            assert last_cmpl_id == data["id"] # make sure the completion id is the same for all events in the stream
--- a/tools/server/webui/src/lib/components/app/chat/ChatScreen/ChatScreen.svelte
+++ b/tools/server/webui/src/lib/components/app/chat/ChatScreen/ChatScreen.svelte
@@ -575,6 +575,7 @@

 <DialogChatError
 	message={activeErrorDialog?.message ?? ''}
+	contextInfo={activeErrorDialog?.contextInfo}
 	onOpenChange={handleErrorDialogOpenChange}
 	open={Boolean(activeErrorDialog)}
 	type={activeErrorDialog?.type ?? 'server'}
--- a/tools/server/webui/src/lib/components/app/dialogs/DialogChatError.svelte
+++ b/tools/server/webui/src/lib/components/app/dialogs/DialogChatError.svelte
@@ -6,10 +6,11 @@
 		open: boolean;
 		type: 'timeout' | 'server';
 		message: string;
+		contextInfo?: { n_prompt_tokens: number; n_ctx: number };
 		onOpenChange?: (open: boolean) => void;
 	}

-	let { open = $bindable(), type, message, onOpenChange }: Props = $props();
+	let { open = $bindable(), type, message, contextInfo, onOpenChange }: Props = $props();

 	const isTimeout = $derived(type === 'timeout');
 	const title = $derived(isTimeout ? 'TCP Timeout' : 'Server Error');
@@ -51,6 +52,15 @@

 		<div class={`rounded-lg border px-4 py-3 text-sm ${badgeClass}`}>
 			<p class="font-medium">{message}</p>
+			{#if contextInfo}
+				<div class="mt-2 space-y-1 text-xs opacity-80">
+					<p>
+						<span class="font-medium">Prompt tokens:</span>
+						{contextInfo.n_prompt_tokens.toLocaleString()}
+					</p>
+					<p><span class="font-medium">Context size:</span> {contextInfo.n_ctx.toLocaleString()}</p>
+				</div>
+			{/if}
 		</div>

 		<AlertDialog.Footer>
--- a/tools/server/webui/src/lib/services/chat.ts
+++ b/tools/server/webui/src/lib/services/chat.ts
@@ -764,18 +764,33 @@ export class ChatService {
 	 * @param response - HTTP response object
 	 * @returns Promise<Error> - Parsed error with context info if available
 	 */
-	private static async parseErrorResponse(response: Response): Promise<Error> {
+	private static async parseErrorResponse(
+		response: Response
+	): Promise<Error & { contextInfo?: { n_prompt_tokens: number; n_ctx: number } }> {
 		try {
 			const errorText = await response.text();
 			const errorData: ApiErrorResponse = JSON.parse(errorText);

 			const message = errorData.error?.message || 'Unknown server error';
-			const error = new Error(message);
+			const error = new Error(message) as Error & {
+				contextInfo?: { n_prompt_tokens: number; n_ctx: number };
+			};
 			error.name = response.status === 400 ? 'ServerError' : 'HttpError';

+			if (errorData.error && 'n_prompt_tokens' in errorData.error && 'n_ctx' in errorData.error) {
+				error.contextInfo = {
+					n_prompt_tokens: errorData.error.n_prompt_tokens,
+					n_ctx: errorData.error.n_ctx
+				};
+			}
+
 			return error;
 		} catch {
-			const fallback = new Error(`Server error (${response.status}): ${response.statusText}`);
+			const fallback = new Error(
+				`Server error (${response.status}): ${response.statusText}`
+			) as Error & {
+				contextInfo?: { n_prompt_tokens: number; n_ctx: number };
+			};
 			fallback.name = 'HttpError';
 			return fallback;
 		}
--- a/tools/server/webui/src/lib/stores/chat.svelte.ts
+++ b/tools/server/webui/src/lib/stores/chat.svelte.ts
@@ -58,7 +58,11 @@ class ChatStore {

 	activeProcessingState = $state<ApiProcessingState | null>(null);
 	currentResponse = $state('');
-	errorDialogState = $state<{ type: 'timeout' | 'server'; message: string } | null>(null);
+	errorDialogState = $state<{
+		type: 'timeout' | 'server';
+		message: string;
+		contextInfo?: { n_prompt_tokens: number; n_ctx: number };
+	} | null>(null);
 	isLoading = $state(false);
 	chatLoadingStates = new SvelteMap<string, boolean>();
 	chatStreamingStates = new SvelteMap<string, { response: string; messageId: string }>();
@@ -335,8 +339,12 @@ class ChatStore {
 		return error instanceof Error && (error.name === 'AbortError' || error instanceof DOMException);
 	}

-	private showErrorDialog(type: 'timeout' | 'server', message: string): void {
-		this.errorDialogState = { type, message };
+	private showErrorDialog(
+		type: 'timeout' | 'server',
+		message: string,
+		contextInfo?: { n_prompt_tokens: number; n_ctx: number }
+	): void {
+		this.errorDialogState = { type, message, contextInfo };
 	}

 	dismissErrorDialog(): void {
@@ -347,6 +355,23 @@ class ChatStore {
 	// Message Operations
 	// ─────────────────────────────────────────────────────────────────────────────

+	/**
+	 * Finds a message by ID and optionally validates its role.
+	 * Returns message and index, or null if not found or role doesn't match.
+	 */
+	private getMessageByIdWithRole(
+		messageId: string,
+		expectedRole?: ChatRole
+	): { message: DatabaseMessage; index: number } | null {
+		const index = conversationsStore.findMessageIndex(messageId);
+		if (index === -1) return null;
+
+		const message = conversationsStore.activeMessages[index];
+		if (expectedRole && message.role !== expectedRole) return null;
+
+		return { message, index };
+	}
+
 	async addMessage(
 		role: ChatRole,
 		content: string,
@@ -508,7 +533,6 @@ class ChatStore {
 				) => {
 					this.stopStreaming();

-					// Build update data - only include model if not already persisted
 					const updateData: Record<string, unknown> = {
 						content: finalContent || streamedContent,
 						thinking: reasoningContent || streamedReasoningContent,
@@ -520,7 +544,6 @@ class ChatStore {
 					}
 					await DatabaseService.updateMessage(assistantMessage.id, updateData);

-					// Update UI state - always include model and timings if available
 					const idx = conversationsStore.findMessageIndex(assistantMessage.id);
 					const uiUpdate: Partial<DatabaseMessage> = {
 						content: updateData.content as string,
@@ -543,22 +566,38 @@ class ChatStore {
 				},
 				onError: (error: Error) => {
 					this.stopStreaming();
+
 					if (this.isAbortError(error)) {
 						this.setChatLoading(assistantMessage.convId, false);
 						this.clearChatStreaming(assistantMessage.convId);
 						this.clearProcessingState(assistantMessage.convId);
+
 						return;
 					}
+
 					console.error('Streaming error:', error);
+
 					this.setChatLoading(assistantMessage.convId, false);
 					this.clearChatStreaming(assistantMessage.convId);
 					this.clearProcessingState(assistantMessage.convId);
+
 					const idx = conversationsStore.findMessageIndex(assistantMessage.id);
+
 					if (idx !== -1) {
 						const failedMessage = conversationsStore.removeMessageAtIndex(idx);
 						if (failedMessage) DatabaseService.deleteMessage(failedMessage.id).catch(console.error);
 					}
-					this.showErrorDialog(error.name === 'TimeoutError' ? 'timeout' : 'server', error.message);
+
+					const contextInfo = (
+						error as Error & { contextInfo?: { n_prompt_tokens: number; n_ctx: number } }
+					).contextInfo;
+
+					this.showErrorDialog(
+						error.name === 'TimeoutError' ? 'timeout' : 'server',
+						error.message,
+						contextInfo
+					);
+
 					if (onError) onError(error);
 				}
 			},
@@ -591,7 +630,9 @@ class ChatStore {
 				await conversationsStore.updateConversationName(currentConv.id, content.trim());

 			const assistantMessage = await this.createAssistantMessage(userMessage.id);
+
 			if (!assistantMessage) throw new Error('Failed to create assistant message');
+
 			conversationsStore.addMessageToActive(assistantMessage);
 			await this.streamChatCompletion(
 				conversationsStore.activeMessages.slice(0, -1),
@@ -607,15 +648,26 @@ class ChatStore {
 			if (!this.errorDialogState) {
 				const dialogType =
 					error instanceof Error && error.name === 'TimeoutError' ? 'timeout' : 'server';
-				this.showErrorDialog(dialogType, error instanceof Error ? error.message : 'Unknown error');
+				const contextInfo = (
+					error as Error & { contextInfo?: { n_prompt_tokens: number; n_ctx: number } }
+				).contextInfo;
+
+				this.showErrorDialog(
+					dialogType,
+					error instanceof Error ? error.message : 'Unknown error',
+					contextInfo
+				);
 			}
 		}
 	}

 	async stopGeneration(): Promise<void> {
 		const activeConv = conversationsStore.activeConversation;
+
 		if (!activeConv) return;
+
 		await this.savePartialResponseIfNeeded(activeConv.id);
+
 		this.stopStreaming();
 		this.abortRequest(activeConv.id);
 		this.setChatLoading(activeConv.id, false);
@@ -655,17 +707,22 @@ class ChatStore {

 	private async savePartialResponseIfNeeded(convId?: string): Promise<void> {
 		const conversationId = convId || conversationsStore.activeConversation?.id;
+
 		if (!conversationId) return;
+
 		const streamingState = this.chatStreamingStates.get(conversationId);
+
 		if (!streamingState || !streamingState.response.trim()) return;

 		const messages =
 			conversationId === conversationsStore.activeConversation?.id
 				? conversationsStore.activeMessages
 				: await conversationsStore.getConversationMessages(conversationId);
+
 		if (!messages.length) return;

 		const lastMessage = messages[messages.length - 1];
+
 		if (lastMessage?.role === 'assistant') {
 			try {
 				const updateData: { content: string; thinking?: string; timings?: ChatMessageTimings } = {
@@ -684,9 +741,13 @@ class ChatStore {
 								: undefined
 					};
 				}
+
 				await DatabaseService.updateMessage(lastMessage.id, updateData);
+
 				lastMessage.content = this.currentResponse;
+
 				if (updateData.thinking) lastMessage.thinking = updateData.thinking;
+
 				if (updateData.timings) lastMessage.timings = updateData.timings;
 			} catch (error) {
 				lastMessage.content = this.currentResponse;
@@ -700,14 +761,12 @@ class ChatStore {
 		if (!activeConv) return;
 		if (this.isLoading) this.stopGeneration();

+		const result = this.getMessageByIdWithRole(messageId, 'user');
+		if (!result) return;
+		const { message: messageToUpdate, index: messageIndex } = result;
+		const originalContent = messageToUpdate.content;
+
 		try {
-			const messageIndex = conversationsStore.findMessageIndex(messageId);
-			if (messageIndex === -1) return;
-
-			const messageToUpdate = conversationsStore.activeMessages[messageIndex];
-			const originalContent = messageToUpdate.content;
-			if (messageToUpdate.role !== 'user') return;
-
 			const allMessages = await conversationsStore.getConversationMessages(activeConv.id);
 			const rootMessage = allMessages.find((m) => m.type === 'root' && m.parent === null);
 			const isFirstUserMessage = rootMessage && messageToUpdate.parent === rootMessage.id;
@@ -724,7 +783,9 @@ class ChatStore {
 			}

 			const messagesToRemove = conversationsStore.activeMessages.slice(messageIndex + 1);
+
 			for (const message of messagesToRemove) await DatabaseService.deleteMessage(message.id);
+
 			conversationsStore.sliceActiveMessages(messageIndex + 1);
 			conversationsStore.updateConversationTimestamp();

@@ -732,8 +793,11 @@ class ChatStore {
 			this.clearChatStreaming(activeConv.id);

 			const assistantMessage = await this.createAssistantMessage();
+
 			if (!assistantMessage) throw new Error('Failed to create assistant message');
+
 			conversationsStore.addMessageToActive(assistantMessage);
+
 			await conversationsStore.updateCurrentNode(assistantMessage.id);
 			await this.streamChatCompletion(
 				conversationsStore.activeMessages.slice(0, -1),
@@ -758,12 +822,11 @@ class ChatStore {
 		const activeConv = conversationsStore.activeConversation;
 		if (!activeConv || this.isLoading) return;

-		try {
-			const messageIndex = conversationsStore.findMessageIndex(messageId);
-			if (messageIndex === -1) return;
-			const messageToRegenerate = conversationsStore.activeMessages[messageIndex];
-			if (messageToRegenerate.role !== 'assistant') return;
+		const result = this.getMessageByIdWithRole(messageId, 'assistant');
+		if (!result) return;
+		const { index: messageIndex } = result;

+		try {
 			const messagesToRemove = conversationsStore.activeMessages.slice(messageIndex);
 			for (const message of messagesToRemove) await DatabaseService.deleteMessage(message.id);
 			conversationsStore.sliceActiveMessages(messageIndex);
@@ -832,6 +895,7 @@ class ChatStore {
 				const siblings = allMessages.filter(
 					(m) => m.parent === messageToDelete.parent && m.id !== messageId
 				);
+
 				if (siblings.length > 0) {
 					const latestSibling = siblings.reduce((latest, sibling) =>
 						sibling.timestamp > latest.timestamp ? sibling : latest
@@ -845,6 +909,7 @@ class ChatStore {
 			}
 			await DatabaseService.deleteMessageCascading(activeConv.id, messageId);
 			await conversationsStore.refreshActiveMessages();
+
 			conversationsStore.updateConversationTimestamp();
 		} catch (error) {
 			console.error('Failed to delete message:', error);
@@ -862,12 +927,12 @@ class ChatStore {
 	): Promise<void> {
 		const activeConv = conversationsStore.activeConversation;
 		if (!activeConv || this.isLoading) return;
-		try {
-			const idx = conversationsStore.findMessageIndex(messageId);
-			if (idx === -1) return;
-			const msg = conversationsStore.activeMessages[idx];
-			if (msg.role !== 'assistant') return;

+		const result = this.getMessageByIdWithRole(messageId, 'assistant');
+		if (!result) return;
+		const { message: msg, index: idx } = result;
+
+		try {
 			if (shouldBranch) {
 				const newMessage = await DatabaseService.createMessageBranch(
 					{
@@ -902,12 +967,12 @@ class ChatStore {
 	async editUserMessagePreserveResponses(messageId: string, newContent: string): Promise<void> {
 		const activeConv = conversationsStore.activeConversation;
 		if (!activeConv) return;
-		try {
-			const idx = conversationsStore.findMessageIndex(messageId);
-			if (idx === -1) return;
-			const msg = conversationsStore.activeMessages[idx];
-			if (msg.role !== 'user') return;

+		const result = this.getMessageByIdWithRole(messageId, 'user');
+		if (!result) return;
+		const { message: msg, index: idx } = result;
+
+		try {
 			await DatabaseService.updateMessage(messageId, {
 				content: newContent,
 				timestamp: Date.now()
@@ -916,6 +981,7 @@ class ChatStore {

 			const allMessages = await conversationsStore.getConversationMessages(activeConv.id);
 			const rootMessage = allMessages.find((m) => m.type === 'root' && m.parent === null);
+
 			if (rootMessage && msg.parent === rootMessage.id && newContent.trim()) {
 				await conversationsStore.updateConversationTitleWithConfirmation(
 					activeConv.id,
@@ -932,15 +998,16 @@ class ChatStore {
 	async editMessageWithBranching(messageId: string, newContent: string): Promise<void> {
 		const activeConv = conversationsStore.activeConversation;
 		if (!activeConv || this.isLoading) return;
-		try {
-			const idx = conversationsStore.findMessageIndex(messageId);
-			if (idx === -1) return;
-			const msg = conversationsStore.activeMessages[idx];
-			if (msg.role !== 'user') return;

+		const result = this.getMessageByIdWithRole(messageId, 'user');
+		if (!result) return;
+		const { message: msg } = result;
+
+		try {
 			const allMessages = await conversationsStore.getConversationMessages(activeConv.id);
 			const rootMessage = allMessages.find((m) => m.type === 'root' && m.parent === null);
 			const isFirstUserMessage = rootMessage && msg.parent === rootMessage.id;
+
 			const parentId = msg.parent || rootMessage?.id;
 			if (!parentId) return;

@@ -1034,7 +1101,9 @@ class ChatStore {

 	private async generateResponseForMessage(userMessageId: string): Promise<void> {
 		const activeConv = conversationsStore.activeConversation;
+
 		if (!activeConv) return;
+
 		this.errorDialogState = null;
 		this.setChatLoading(activeConv.id, true);
 		this.clearChatStreaming(activeConv.id);
@@ -1071,26 +1140,30 @@ class ChatStore {
 	async continueAssistantMessage(messageId: string): Promise<void> {
 		const activeConv = conversationsStore.activeConversation;
 		if (!activeConv || this.isLoading) return;
-		try {
-			const idx = conversationsStore.findMessageIndex(messageId);
-			if (idx === -1) return;
-			const msg = conversationsStore.activeMessages[idx];
-			if (msg.role !== 'assistant') return;
-			if (this.isChatLoading(activeConv.id)) return;

+		const result = this.getMessageByIdWithRole(messageId, 'assistant');
+		if (!result) return;
+		const { message: msg, index: idx } = result;
+
+		if (this.isChatLoading(activeConv.id)) return;
+
+		try {
 			this.errorDialogState = null;
 			this.setChatLoading(activeConv.id, true);
 			this.clearChatStreaming(activeConv.id);

 			const allMessages = await conversationsStore.getConversationMessages(activeConv.id);
 			const dbMessage = allMessages.find((m) => m.id === messageId);
+
 			if (!dbMessage) {
 				this.setChatLoading(activeConv.id, false);
+
 				return;
 			}

 			const originalContent = dbMessage.content;
 			const originalThinking = dbMessage.thinking || '';
+
 			const conversationContext = conversationsStore.activeMessages.slice(0, idx);
 			const contextWithContinue = [
 				...conversationContext,
@@ -1107,6 +1180,7 @@ class ChatStore {
 				contextWithContinue,
 				{
 					...this.getApiOptions(),
+
 					onChunk: (chunk: string) => {
 						hasReceivedContent = true;
 						appendedContent += chunk;
@@ -1114,6 +1188,7 @@ class ChatStore {
 						this.setChatStreaming(msg.convId, fullContent, msg.id);
 						conversationsStore.updateMessageAtIndex(idx, { content: fullContent });
 					},
+
 					onReasoningChunk: (reasoningChunk: string) => {
 						hasReceivedContent = true;
 						appendedThinking += reasoningChunk;
@@ -1121,6 +1196,7 @@ class ChatStore {
 							thinking: originalThinking + appendedThinking
 						});
 					},
+
 					onTimings: (timings: ChatMessageTimings, promptProgress?: ChatMessagePromptProgress) => {
 						const tokensPerSecond =
 							timings?.predicted_ms && timings?.predicted_n
@@ -1137,6 +1213,7 @@ class ChatStore {
 							msg.convId
 						);
 					},
+
 					onComplete: async (
 						finalContent?: string,
 						reasoningContent?: string,
@@ -1161,6 +1238,7 @@ class ChatStore {
 						this.clearChatStreaming(msg.convId);
 						this.clearProcessingState(msg.convId);
 					},
+
 					onError: async (error: Error) => {
 						if (this.isAbortError(error)) {
 							if (hasReceivedContent && appendedContent) {
Author	SHA1	Message	Date
Adrien Gallouët	f3a9674ae8	llama : fix signed comparison warning on FreeBSD (#17497) This ensures correct RLIM_INFINITY handling and compatibility on all platforms (32/64-bit). warning: comparison of integers of different signs: 'rlim_t' (aka 'long') and 'size_t' (aka 'unsigned long') [-Wsign-compare] 488 \| if (suggest && (lock_limit.rlim_max > lock_limit.rlim_cur + size)) { \| ~~~~~~~~~~~~~~~~~~~ ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~ Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-12-02 12:05:38 +01:00
Xuan-Son Nguyen	2c453c6c77	convert: add error message for mistral3 quantized weight (#17686)	2025-12-02 11:48:31 +01:00
Xuan-Son Nguyen	5d6bd842ea	server: remove default "gpt-3.5-turbo" model name (#17668) * server: remove default "gpt-3.5-turbo" model name * do not reflect back model name from request * fix test	2025-12-02 11:38:57 +01:00
senhtry	fd3abe849e	server: fixing naming conflict res_error in server-models.cpp (#17679)	2025-12-02 11:18:39 +01:00
Xuan-Son Nguyen	682e6658bb	server: explicitly set exec path when create new instance (#17669) * Revert "rm unused fn" This reverts commit `f2dbe9c087`. * server: explicitly set exec path when create new instance * put back TODO * only call get_server_exec_path() once * add fallback logic	2025-12-02 10:25:11 +01:00
Adrien Gallouët	4574f2949e	ci : skip winget update when not in ggml-org (#17465) Prevent forks from generating daily failure notifications. Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-12-02 10:15:01 +01:00
Adrien Gallouët	ab6726eeff	ggml : add fallback definition for HWCAP2_SVE2 (#17683) This align with other HWCAP2 feature flags See #17528 Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-12-02 10:41:26 +02:00
Aleksander Grygier	cee92af553	Add context info to server error (#17663) * fix: Add context info to server error * chore: update webui build output	2025-12-02 09:20:57 +01:00
Aman Gupta	ed32089927	ggml-cuda: reorder only relevant nodes (#17639)	2025-12-02 12:36:31 +08:00
Aaron Teo	7b6d745364	release: fix duplicate libs, store symbolic links (#17299)	2025-12-02 11:52:05 +08:00
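
Note on "Add context info to server error" (#17663): the `chat.ts` hunk above attaches an optional `contextInfo` to the parsed error when the server's error payload carries both `n_prompt_tokens` and `n_ctx`. The standalone sketch below mirrors that logic so it can be exercised outside the webui. It is an illustration only, not the code from the PR: `errorFromBody`, `ApiErrorBody`, and `ErrorWithContext` are hypothetical names, the function takes the status and body text directly instead of a `Response`, and the `typeof` checks are an added assumption (the diff only tests for key presence with `in`).

```typescript
// Hypothetical, minimal model of the error payload; only fields used in the diff appear here.
interface ContextInfo {
	n_prompt_tokens: number;
	n_ctx: number;
}

interface ApiErrorBody {
	error?: { message?: string; n_prompt_tokens?: unknown; n_ctx?: unknown };
}

type ErrorWithContext = Error & { contextInfo?: ContextInfo };

// Builds an Error from an HTTP status and raw body text (hypothetical helper).
function errorFromBody(status: number, statusText: string, bodyText: string): ErrorWithContext {
	try {
		const data = JSON.parse(bodyText) as ApiErrorBody;
		const error: ErrorWithContext = new Error(data.error?.message || 'Unknown server error');
		error.name = status === 400 ? 'ServerError' : 'HttpError';

		// Attach context info only when both counters are present and numeric.
		const e = data.error;
		if (e && typeof e.n_prompt_tokens === 'number' && typeof e.n_ctx === 'number') {
			error.contextInfo = { n_prompt_tokens: e.n_prompt_tokens, n_ctx: e.n_ctx };
		}

		return error;
	} catch {
		// Body was not valid JSON (or not an object): fall back to the status line.
		const fallback: ErrorWithContext = new Error(`Server error (${status}): ${statusText}`);
		fallback.name = 'HttpError';
		return fallback;
	}
}
```

A caller can then read `contextInfo` without a cast, for example `errorFromBody(400, 'Bad Request', body).contextInfo?.n_ctx`, which is the value the dialog in `DialogChatError.svelte` renders as "Context size".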