llama : orion rope type is neox (#13261 )

llama : plamo rope type is neox (#13260 )
llama-chat : reset glmedge chat template (#13253 )
2026-05-03 23:54:19 +00:00 · 2025-05-02 12:44:24 +02:00 · 2025-05-02 12:40:56 +02:00 · 2025-05-02 11:06:09 +02:00 · 2025-05-02 10:20:27 +02:00 · 2025-05-02 09:48:31 +03:00
11 changed files with 114 additions and 75 deletions
--- a/.github/workflows/build-linux-cross.yml
+++ b/.github/workflows/build-linux-cross.yml
@@ -4,18 +4,25 @@ on:
  workflow_call:

 jobs:
-  ubuntu-latest-riscv64-cpu-cross:
-    runs-on: ubuntu-latest
+  ubuntu-24-riscv64-cpu-cross:
+    runs-on: ubuntu-24.04

    steps:
      - uses: actions/checkout@v4
      - name: Setup Riscv
        run: |
          sudo dpkg --add-architecture riscv64
-          sudo sed -i 's|http://azure.archive.ubuntu.com/ubuntu|http://ports.ubuntu.com/ubuntu-ports|g' \
-                 /etc/apt/sources.list /etc/apt/apt-mirrors.txt
-          sudo apt-get clean
-          sudo apt-get update
+
+          # Add arch-specific repositories for non-amd64 architectures
+          cat << EOF | sudo tee /etc/apt/sources.list.d/riscv64-ports.list
+          deb [arch=riscv64] http://ports.ubuntu.com/ubuntu-ports/ noble main universe
+          deb [arch=riscv64] http://ports.ubuntu.com/ubuntu-ports/ noble-updates main universe
+          deb [arch=riscv64] http://ports.ubuntu.com/ubuntu-ports/ noble-security main universe
+          deb [arch=riscv64] http://ports.ubuntu.com/ubuntu-ports/ noble-backports main universe
+          EOF
+
+          sudo apt-get update || true    ;# Prevent failure due to missing URLs.
+
          sudo apt-get install -y --no-install-recommends \
                  build-essential \
                  gcc-14-riscv64-linux-gnu \
@@ -40,21 +47,25 @@ jobs:

          cmake --build build --config Release -j $(nproc)

-  ubuntu-latest-riscv64-vulkan-cross:
-    runs-on: ubuntu-latest
+  ubuntu-24-riscv64-vulkan-cross:
+    runs-on: ubuntu-24.04

    steps:
      - uses: actions/checkout@v4
-        with:
-          fetch-depth: 0
-
      - name: Setup Riscv
        run: |
          sudo dpkg --add-architecture riscv64
-          sudo sed -i 's|http://azure.archive.ubuntu.com/ubuntu|http://ports.ubuntu.com/ubuntu-ports|g' \
-                 /etc/apt/sources.list /etc/apt/apt-mirrors.txt
-          sudo apt-get clean
-          sudo apt-get update
+
+          # Add arch-specific repositories for non-amd64 architectures
+          cat << EOF | sudo tee /etc/apt/sources.list.d/riscv64-ports.list
+          deb [arch=riscv64] http://ports.ubuntu.com/ubuntu-ports/ noble main universe
+          deb [arch=riscv64] http://ports.ubuntu.com/ubuntu-ports/ noble-updates main universe
+          deb [arch=riscv64] http://ports.ubuntu.com/ubuntu-ports/ noble-security main universe
+          deb [arch=riscv64] http://ports.ubuntu.com/ubuntu-ports/ noble-backports main universe
+          EOF
+
+          sudo apt-get update || true    ;# Prevent failure due to missing URLs.
+
          sudo apt-get install -y --no-install-recommends \
                  build-essential \
                  glslc \
@@ -82,21 +93,25 @@ jobs:

          cmake --build build --config Release -j $(nproc)

-  ubuntu-latest-arm64-vulkan-cross:
-    runs-on: ubuntu-latest
+  ubuntu-24-arm64-vulkan-cross:
+    runs-on: ubuntu-24.04

    steps:
      - uses: actions/checkout@v4
-        with:
-          fetch-depth: 0
-
      - name: Setup Arm64
        run: |
          sudo dpkg --add-architecture arm64
-          sudo sed -i 's|http://azure.archive.ubuntu.com/ubuntu|http://ports.ubuntu.com/ubuntu-ports|g' \
-                 /etc/apt/sources.list /etc/apt/apt-mirrors.txt
-          sudo apt-get clean
-          sudo apt-get update
+
+          # Add arch-specific repositories for non-amd64 architectures
+          cat << EOF | sudo tee /etc/apt/sources.list.d/arm64-ports.list
+          deb [arch=arm64] http://ports.ubuntu.com/ubuntu-ports/ noble main universe
+          deb [arch=arm64] http://ports.ubuntu.com/ubuntu-ports/ noble-updates main universe
+          deb [arch=arm64] http://ports.ubuntu.com/ubuntu-ports/ noble-security main universe
+          deb [arch=arm64] http://ports.ubuntu.com/ubuntu-ports/ noble-backports main universe
+          EOF
+
+          sudo apt-get update || true    ;# Prevent failure due to missing URLs.
+
          sudo apt-get install -y --no-install-recommends \
                  build-essential \
                  glslc \
--- a/.github/workflows/build.yml
+++ b/.github/workflows/build.yml
@@ -601,9 +601,8 @@ jobs:
            -DGGML_SYCL_F16=ON
          cmake --build build --config Release -j $(nproc)

-# Disabled for now due to sporadic issue syncing.
-#  build-linux-cross:
-#    uses: ./.github/workflows/build-linux-cross.yml
+  build-linux-cross:
+    uses: ./.github/workflows/build-linux-cross.yml

  macOS-latest-cmake-ios:
    runs-on: macos-latest
--- a/common/arg.cpp
+++ b/common/arg.cpp
@@ -2783,7 +2783,10 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
    ).set_examples({LLAMA_EXAMPLE_SERVER}).set_env("LLAMA_ARG_THREADS_HTTP"));
    add_opt(common_arg(
        {"--cache-reuse"}, "N",
-        string_format("min chunk size to attempt reusing from the cache via KV shifting (default: %d)", params.n_cache_reuse),
+        string_format(
+            "min chunk size to attempt reusing from the cache via KV shifting (default: %d)\n"
+            "[(card)](https://ggml.ai/f0.png)", params.n_cache_reuse
+        ),
        [](common_params & params, int value) {
            params.n_cache_reuse = value;
        }
--- a/convert_hf_to_gguf.py
+++ b/convert_hf_to_gguf.py
@@ -419,7 +419,9 @@ class ModelBase:
    @staticmethod
    def load_hparams(dir_model: Path):
        try:
-            return AutoConfig.from_pretrained(dir_model).to_dict()
+            # for security reason, we don't allow loading remote code by default
+            # if a model need remote code, we will fallback to config.json
+            return AutoConfig.from_pretrained(dir_model, trust_remote_code=False).to_dict()
        except Exception as e:
            logger.warning(f"Failed to load model config from {dir_model}: {e}")
            logger.warning("Trying to load config.json instead")
--- a/examples/llava/mtmd-cli.cpp
+++ b/examples/llava/mtmd-cli.cpp
@@ -72,6 +72,8 @@ struct mtmd_cli_context {
    llama_batch         batch;
    int                 n_batch;

+    std::vector<mtmd_bitmap> bitmaps;
+
    // note: we know that gemma3 template is "linear", meaning each turn is completely separated to another
    // so here we don't need to keep track of chat history
    common_chat_templates_ptr tmpls;
@@ -135,13 +137,22 @@ struct mtmd_cli_context {
            antiprompt_tokens.begin()
        );
    }
+
+    bool load_image(const std::string & fname) {
+        mtmd_bitmap bitmap;
+        if (mtmd_helper_bitmap_init_from_file(fname.c_str(), bitmap)) {
+            return false;
+        }
+        bitmaps.push_back(std::move(bitmap));
+        return true;
+    }
 };

 static int generate_response(mtmd_cli_context & ctx, common_sampler * smpl, int n_predict) {
    llama_tokens generated_tokens;
    for (int i = 0; i < n_predict; i++) {
        if (i > n_predict || !g_is_generating || g_is_interrupted) {
-            printf("\n");
+            LOG("\n");
            break;
        }

@@ -150,15 +161,15 @@ static int generate_response(mtmd_cli_context & ctx, common_sampler * smpl, int
        common_sampler_accept(smpl, token_id, true);

        if (llama_vocab_is_eog(ctx.vocab, token_id) || ctx.check_antiprompt(generated_tokens)) {
-            printf("\n");
+            LOG("\n");
            break; // end of generation
        }

-        printf("%s", common_token_to_piece(ctx.lctx, token_id).c_str());
+        LOG("%s", common_token_to_piece(ctx.lctx, token_id).c_str());
        fflush(stdout);

        if (g_is_interrupted) {
-            printf("\n");
+            LOG("\n");
            break;
        }

@@ -173,9 +184,7 @@ static int generate_response(mtmd_cli_context & ctx, common_sampler * smpl, int
    return 0;
 }

-static int eval_message(mtmd_cli_context & ctx, common_chat_msg & msg, std::vector<std::string> & images_fname, bool add_bos = false) {
-    std::vector<mtmd_bitmap> bitmaps;
-
+static int eval_message(mtmd_cli_context & ctx, common_chat_msg & msg, bool add_bos = false) {
    common_chat_templates_inputs tmpl_inputs;
    tmpl_inputs.messages = {msg};
    tmpl_inputs.add_generation_prompt = true;
@@ -183,15 +192,6 @@ static int eval_message(mtmd_cli_context & ctx, common_chat_msg & msg, std::vect
    auto formatted_chat = common_chat_templates_apply(ctx.tmpls.get(), tmpl_inputs);
    LOG_DBG("formatted_chat.prompt: %s\n", formatted_chat.prompt.c_str());

-    for (auto & fname : images_fname) {
-        mtmd_bitmap bitmap;
-        if (mtmd_helper_bitmap_init_from_file(fname.c_str(), bitmap)) {
-            LOG_ERR("Unable to load image %s\n", fname.c_str());
-            return 2; // image not found
-        }
-        bitmaps.push_back(std::move(bitmap));
-    }
-
    mtmd_input_text text;
    text.text          = formatted_chat.prompt;
    text.add_special   = add_bos;
@@ -200,12 +200,14 @@ static int eval_message(mtmd_cli_context & ctx, common_chat_msg & msg, std::vect

    if (g_is_interrupted) return 0;

-    int32_t res = mtmd_tokenize(ctx.ctx_vision.get(), chunks, text, bitmaps);
+    int32_t res = mtmd_tokenize(ctx.ctx_vision.get(), chunks, text, ctx.bitmaps);
    if (res != 0) {
        LOG_ERR("Unable to tokenize prompt, res = %d\n", res);
        return 1;
    }

+    ctx.bitmaps.clear();
+
    if (mtmd_helper_eval(ctx.ctx_vision.get(), ctx.lctx, chunks, ctx.n_past, 0, ctx.n_batch)) {
        LOG_ERR("Unable to eval prompt\n");
        return 1;
@@ -213,6 +215,8 @@ static int eval_message(mtmd_cli_context & ctx, common_chat_msg & msg, std::vect

    ctx.n_past += mtmd_helper_get_n_pos(chunks);

+    LOG("\n");
+
    return 0;
 }

@@ -235,7 +239,7 @@ int main(int argc, char ** argv) {
    }

    mtmd_cli_context ctx(params);
-    printf("%s: %s\n", __func__, params.model.path.c_str());
+    LOG("%s: loading model: %s\n", __func__, params.model.path.c_str());

    bool is_single_turn = !params.prompt.empty() && !params.image.empty();

@@ -268,7 +272,12 @@ int main(int argc, char ** argv) {
        common_chat_msg msg;
        msg.role = "user";
        msg.content = params.prompt;
-        if (eval_message(ctx, msg, params.image, true)) {
+        for (const auto & image : params.image) {
+            if (!ctx.load_image(image)) {
+                return 1; // error is already printed by libmtmd
+            }
+        }
+        if (eval_message(ctx, msg, true)) {
            return 1;
        }
        if (!g_is_interrupted && generate_response(ctx, smpl, n_predict)) {
@@ -283,7 +292,6 @@ int main(int argc, char ** argv) {
        LOG("\n");

        bool is_first_msg = true;
-        std::vector<std::string> images_fname;
        std::string content;

        while (!g_is_interrupted) {
@@ -308,10 +316,17 @@ int main(int argc, char ** argv) {
                continue;
            }
            g_is_generating = true;
-            if (line.find("/image") == 0) {
+            if (line == "/image" || line.find("/image ") == 0) {
+                if (line.size() < 8) {
+                    LOG_ERR("ERR: Missing image filename\n");
+                    continue;
+                }
                std::string image = line.substr(7);
-                images_fname.push_back(string_strip(image));
-                content += "<__image__>";
+                if (ctx.load_image(image)) {
+                    LOG("Image %s loaded\n", image.c_str());
+                    content += "<__image__>";
+                }
+                // else, error is already printed by libmtmd
                continue;
            } else {
                content += line;
@@ -319,21 +334,14 @@ int main(int argc, char ** argv) {
            common_chat_msg msg;
            msg.role = "user";
            msg.content = content;
-            int ret = eval_message(ctx, msg, images_fname, is_first_msg);
-            if (g_is_interrupted) break;
-            if (ret == 2) {
-                // non-fatal error
-                images_fname.clear();
-                content.clear();
-                continue;
-            }
+            int ret = eval_message(ctx, msg, is_first_msg);
            if (ret) {
                return 1;
            }
+            if (g_is_interrupted) break;
            if (generate_response(ctx, smpl, n_predict)) {
                return 1;
            }
-            images_fname.clear();
            content.clear();
            is_first_msg = false;
        }
--- a/examples/llava/mtmd.cpp
+++ b/examples/llava/mtmd.cpp
@@ -590,7 +590,7 @@ int32_t mtmd_helper_eval(mtmd_context * ctx,
            }

        } else if (chunk.type == MTMD_INPUT_CHUNK_TYPE_IMAGE) {
-            GGML_ASSERT(!is_last && "logits for last image chunk is not yet support");
+            GGML_ASSERT(!is_last && "logits for last image chunk is not yet supported");
            GGML_ASSERT(chunk.tokens_image != nullptr);
            int64_t t0 = ggml_time_ms();
            if (ctx->print_timings) {
--- a/examples/server/README.md
+++ b/examples/server/README.md
@@ -154,7 +154,7 @@ The project is under active development, and we are [looking for feedback and co
 | `--ssl-cert-file FNAME` | path to file a PEM-encoded SSL certificate<br/>(env: LLAMA_ARG_SSL_CERT_FILE) |
 | `-to, --timeout N` | server read/write timeout in seconds (default: 600)<br/>(env: LLAMA_ARG_TIMEOUT) |
 | `--threads-http N` | number of threads used to process HTTP requests (default: -1)<br/>(env: LLAMA_ARG_THREADS_HTTP) |
-| `--cache-reuse N` | min chunk size to attempt reusing from the cache via KV shifting (default: 0)<br/>(env: LLAMA_ARG_CACHE_REUSE) |
+| `--cache-reuse N` | min chunk size to attempt reusing from the cache via KV shifting (default: 0)<br/>[(card)](https://ggml.ai/f0.png)<br/>(env: LLAMA_ARG_CACHE_REUSE) |
 | `--metrics` | enable prometheus compatible metrics endpoint (default: disabled)<br/>(env: LLAMA_ARG_ENDPOINT_METRICS) |
 | `--slots` | enable slots monitoring endpoint (default: disabled)<br/>(env: LLAMA_ARG_ENDPOINT_SLOTS) |
 | `--props` | enable changing global properties via POST /props (default: disabled)<br/>(env: LLAMA_ARG_ENDPOINT_PROPS) |
--- a/ggml/src/ggml-rpc/ggml-rpc.cpp
+++ b/ggml/src/ggml-rpc/ggml-rpc.cpp
@@ -518,6 +518,11 @@ static rpc_tensor serialize_tensor(const ggml_tensor * tensor) {
    result.view_src = reinterpret_cast<uint64_t>(tensor->view_src);
    result.view_offs = tensor->view_offs;
    result.data = reinterpret_cast<uint64_t>(tensor->data);
+
+    // Avoid sending uninitialized data over the wire
+    memset(result.name, 0, sizeof(result.name));
+    memset(result.padding, 0, sizeof(result.padding));
+
    snprintf(result.name, GGML_MAX_NAME, "%s", tensor->name);
    return result;
 }
--- a/src/llama-chat.cpp
+++ b/src/llama-chat.cpp
@@ -447,7 +447,7 @@ int32_t llm_chat_apply_template(
        if (add_ass) {
            ss << "<|assistant|>";
        }
-    } else if (tmpl == LLM_CHAT_TEMPLATE_CHATGLM_4 || tmpl == LLM_CHAT_TEMPLATE_GLMEDGE) {
+    } else if (tmpl == LLM_CHAT_TEMPLATE_CHATGLM_4) {
        ss << "[gMASK]" << "<sop>";
        for (auto message : chat) {
            std::string role(message->role);
@@ -456,6 +456,14 @@ int32_t llm_chat_apply_template(
        if (add_ass) {
            ss << "<|assistant|>\n";
        }
+    } else if (tmpl == LLM_CHAT_TEMPLATE_GLMEDGE) {
+        for (auto message : chat) {
+            std::string role(message->role);
+            ss << "<|" << role << "|>" << "\n" << message->content;
+        }
+        if (add_ass) {
+            ss << "<|assistant|>";
+        }
    } else if (tmpl == LLM_CHAT_TEMPLATE_MINICPM) {
        // MiniCPM-3B-OpenHermes-2.5-v2-GGUF
        for (auto message : chat) {
--- a/src/llama-model.cpp
+++ b/src/llama-model.cpp
@@ -13226,8 +13226,6 @@ llama_rope_type llama_model_rope_type(const llama_model * model) {
        case LLM_ARCH_DECI:
        case LLM_ARCH_BAICHUAN:
        case LLM_ARCH_STARCODER:
-        case LLM_ARCH_PLAMO:
-        case LLM_ARCH_ORION:
        case LLM_ARCH_INTERNLM2:
        case LLM_ARCH_MINICPM:
        case LLM_ARCH_XVERSE:
@@ -13265,6 +13263,7 @@ llama_rope_type llama_model_rope_type(const llama_model * model) {
        case LLM_ARCH_PHI2:
        case LLM_ARCH_PHI3:
        case LLM_ARCH_PHIMOE:
+        case LLM_ARCH_PLAMO:
        case LLM_ARCH_GEMMA:
        case LLM_ARCH_GEMMA2:
        case LLM_ARCH_GEMMA3:
@@ -13272,6 +13271,7 @@ llama_rope_type llama_model_rope_type(const llama_model * model) {
        case LLM_ARCH_OPENELM:
        case LLM_ARCH_GPTNEOX:
        case LLM_ARCH_CODESHELL:
+        case LLM_ARCH_ORION:
        case LLM_ARCH_NEMOTRON:
        case LLM_ARCH_EXAONE:
        case LLM_ARCH_MINICPM3:
--- a/tests/test-chat-template.cpp
+++ b/tests/test-chat-template.cpp
@@ -187,15 +187,14 @@ int main(void) {
            /* .bos_token= */ "",
            /* .eos_token= */ "",
        },
-        // TODO @ngxson : GLMEdge produces poor result without `[gMASK]<sop>`, so we're temporarily using GLM4 template for it. We should fix this in the future.
-        // {
-        //     /* .name= */ "GLMEdge",
-        //     /* .template_str= */ "{% for item in messages %}{% if item['role'] == 'system' %}<|system|>\n{{ item['content'] }}{% elif item['role'] == 'user' %}<|user|>\n{{ item['content'] }}{% elif item['role'] == 'assistant' %}<|assistant|>\n{{ item['content'] }}{% endif %}{% endfor %}<|assistant|>",
-        //     /* .expected_output= */ "<|system|>\nYou are a helpful assistant<|user|>\nHello<|assistant|>\nHi there<|user|>\nWho are you<|assistant|>\n   I am an assistant   <|user|>\nAnother question<|assistant|>",
-        //     /* .expected_output_jinja= */ "<|system|>\nYou are a helpful assistant<|user|>\nHello<|assistant|>\nHi there<|user|>\nWho are you<|assistant|>\n   I am an assistant   <|user|>\nAnother question<|assistant|>",
-        //     /* .bos_token= */ "",
-        //     /* .eos_token= */ "",
-        // },
+        {
+            /* .name= */ "GLMEdge",
+            /* .template_str= */ "{% for item in messages %}{% if item['role'] == 'system' %}<|system|>\n{{ item['content'] }}{% elif item['role'] == 'user' %}<|user|>\n{{ item['content'] }}{% elif item['role'] == 'assistant' %}<|assistant|>\n{{ item['content'] }}{% endif %}{% endfor %}<|assistant|>",
+            /* .expected_output= */ "<|system|>\nYou are a helpful assistant<|user|>\nHello<|assistant|>\nHi there<|user|>\nWho are you<|assistant|>\n   I am an assistant   <|user|>\nAnother question<|assistant|>",
+            /* .expected_output_jinja= */ "<|system|>\nYou are a helpful assistant<|user|>\nHello<|assistant|>\nHi there<|user|>\nWho are you<|assistant|>\n   I am an assistant   <|user|>\nAnother question<|assistant|>",
+            /* .bos_token= */ "",
+            /* .eos_token= */ "",
+        },
        {
            /* .name= */ "MiniCPM-3B-OpenHermes-2.5-v2-GGUF",
            /* .template_str= */ U8C("{% for message in messages %}{% if message['role'] == 'user' %}{{'<用户>' + message['content'].strip() + '<AI>'}}{% else %}{{message['content'].strip()}}{% endif %}{% endfor %}"),
Author	SHA1	Message	Date
Sigbjørn Skjæret	cb06a3c363	llama : orion rope type is neox (#13261 )	2025-05-02 12:44:24 +02:00
Sigbjørn Skjæret	626083faf7	llama : plamo rope type is neox (#13260 )	2025-05-02 12:40:56 +02:00
piDack	2af6880178	llama-chat : reset glmedge chat template (#13253 ) * reset glmedge chat template * fix glmedge chat template	2025-05-02 11:06:09 +02:00
Shakil Ahmed	e84773ab60	mtmd-cli : fix out_of_range when input image path is empty (#13244 ) * fix out_of_range error to keep the chat loop running * Update examples/llava/mtmd-cli.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * mtmd-cli : load image right away * add a new line for readability * rm printf * Update examples/llava/mtmd-cli.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update examples/llava/mtmd-cli.cpp --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> Co-authored-by: Xuan Son Nguyen <son@huggingface.co> Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>	2025-05-02 10:20:27 +02:00
Georgi Gerganov	fab647e884	server : add cache reuse card link to help (#13230 ) * server : add cache reuse card link to help * args : use short url	2025-05-02 09:48:31 +03:00
Xuan-Son Nguyen	dcf886007d	convert : explicitly disable trust_remote_code for AutoConfig (#13246 )	2025-05-02 08:45:10 +02:00
bandoti	d24d592808	ci: fix cross-compile sync issues (#12804 )	2025-05-01 19:06:39 -03:00
Justin Santa Barbara	8efbdadc61	rpc : avoid uninitialized memory in serialize_tensor (#13210 ) Zero out the name and padding buffers.	2025-05-01 23:32:11 +02:00