Fix OLMo HF to GGUF conversion (#6910 )

server : update readme with undocumented options (#7013 )
readme : update hot topics
2026-05-02 15:14:06 +00:00 · 2024-05-07 21:39:43 +02:00 · 2024-05-07 21:44:29 +03:00 · 2024-05-07 21:43:13 +03:00 · 2024-05-07 20:51:31 +03:00 · 2024-05-07 18:20:33 +03:00
14 changed files with 41 additions and 17 deletions
--- a/README.md
+++ b/README.md
@@ -20,7 +20,8 @@ Inference of Meta's [LLaMA](https://arxiv.org/abs/2302.13971) model (and others)

 ### Hot topics

- **BPE pre-tokenization support has been added: https://github.com/ggerganov/llama.cpp/pull/6920**
+- **Initial Flash-Attention support: https://github.com/ggerganov/llama.cpp/pull/5021**
+- BPE pre-tokenization support has been added: https://github.com/ggerganov/llama.cpp/pull/6920
 - MoE memory layout has been updated - reconvert models for `mmap` support and regenerate `imatrix` https://github.com/ggerganov/llama.cpp/pull/6387
 - Model sharding instructions using `gguf-split` https://github.com/ggerganov/llama.cpp/discussions/6404
 - Fix major bug in Metal batched inference https://github.com/ggerganov/llama.cpp/pull/6225
--- a/ci/run.sh
+++ b/ci/run.sh
@@ -160,9 +160,8 @@ function gg_run_test_scripts_debug {

    set -e

-    # TODO: too slow, run on dedicated node
-   #(cd ./examples/gguf-split && time bash tests.sh "$SRC/build-ci-debug/bin" "$MNT/models") 2>&1 | tee -a $OUT/${ci}-scripts.log
-   #(cd ./examples/quantize   && time bash tests.sh "$SRC/build-ci-debug/bin" "$MNT/models") 2>&1 | tee -a $OUT/${ci}-scripts.log
+    (cd ./examples/gguf-split && time bash tests.sh "$SRC/build-ci-debug/bin" "$MNT/models") 2>&1 | tee -a $OUT/${ci}-scripts.log
+    (cd ./examples/quantize   && time bash tests.sh "$SRC/build-ci-debug/bin" "$MNT/models") 2>&1 | tee -a $OUT/${ci}-scripts.log

    set +e
 }
@@ -695,8 +694,10 @@ test $ret -eq 0 && gg_run ctest_release
 if [ -z ${GG_BUILD_LOW_PERF} ]; then
    test $ret -eq 0 && gg_run embd_bge_small

-    test $ret -eq 0 && gg_run test_scripts_debug
-    test $ret -eq 0 && gg_run test_scripts_release
+    if [ -z ${GG_BUILD_CLOUD} ] || [ ${GG_BUILD_EXTRA_TESTS_0} ]; then
+        test $ret -eq 0 && gg_run test_scripts_debug
+        test $ret -eq 0 && gg_run test_scripts_release
+    fi

    if [ -z ${GG_BUILD_VRAM_GB} ] || [ ${GG_BUILD_VRAM_GB} -ge 8 ]; then
        if [ -z ${GG_BUILD_CUDA} ]; then
--- a/convert-hf-to-gguf-update.py
+++ b/convert-hf-to-gguf-update.py
@@ -67,6 +67,7 @@ models = [
    {"name": "gpt-2",          "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/openai-community/gpt2", },
    {"name": "refact",         "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/smallcloudai/Refact-1_6-base", },
    {"name": "command-r",      "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/CohereForAI/c4ai-command-r-v01", },
+    {"name": "olmo",           "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/allenai/OLMo-1.7-7B-hf", },
 ]

 # make directory "models/tokenizers" if it doesn't exist
--- a/convert-hf-to-gguf.py
+++ b/convert-hf-to-gguf.py
@@ -314,6 +314,9 @@ class Model(ABC):
        if chkhsh == "9c2227e4dd922002fb81bde4fc02b0483ca4f12911410dee2255e4987644e3f8":
            # ref: https://huggingface.co/CohereForAI/c4ai-command-r-v01
            res = "command-r"
+        if chkhsh == "b6dc8df998e1cfbdc4eac8243701a65afe638679230920b50d6f17d81c098166":
+            # ref: https://huggingface.co/allenai/OLMo-1.7-7B-hf
+            res = "olmo"

        if res is None:
            logger.warning("\n")
@@ -2831,8 +2834,9 @@ class OlmoModel(Model):
    def set_gguf_parameters(self):
        super().set_gguf_parameters()
        self.gguf_writer.add_layer_norm_eps(1e-5)
-        if "clip_qkv" in self.hparams is not None:
-            self.gguf_writer.add_clamp_kqv(self.hparams["clip_qkv"])
+        clip_qkv = self.hparams.get("clip_qkv")
+        if clip_qkv is not None:
+            self.gguf_writer.add_clamp_kqv(clip_qkv)

    # Same as super class, but permuting q_proj, k_proj
    # Copied from: LlamaModel
--- a/docs/BLIS.md
+++ b/docs/BLIS.md
@@ -23,7 +23,7 @@ Install BLIS:
 sudo make install
 ```

-We recommend using openmp since it's easier to modify the cores been used.
+We recommend using openmp since it's easier to modify the cores being used.

 ### llama.cpp compilation

--- a/docs/HOWTO-add-model.md
+++ b/docs/HOWTO-add-model.md
@@ -96,9 +96,9 @@ NOTE: The dimensions in `ggml` are typically in the reverse order of the `pytorc

 This is the funniest part, you have to provide the inference graph implementation of the new model architecture in `llama_build_graph`.

-Have a look to existing implementation like `build_llama`, `build_dbrx` or `build_bert`.
+Have a look at existing implementation like `build_llama`, `build_dbrx` or `build_bert`.

-When implementing a new graph, please note that the underlying `ggml` backends might not support them all, support of missing backend operations can be added in another PR.
+When implementing a new graph, please note that the underlying `ggml` backends might not support them all, support for missing backend operations can be added in another PR.

 Note: to debug the inference graph: you can use [eval-callback](../examples/eval-callback).

--- a/examples/llava/README.md
+++ b/examples/llava/README.md
@@ -56,7 +56,7 @@ python ./examples/llava/convert-image-encoder-to-gguf.py -m ../clip-vit-large-pa
 python ./convert.py ../llava-v1.5-7b --skip-unknown
 ```

-Now both the LLaMA part and the image encoder is in the `llava-v1.5-7b` directory.
+Now both the LLaMA part and the image encoder are in the `llava-v1.5-7b` directory.

 ## LLaVA 1.6 gguf conversion
 1) First clone a LLaVA 1.6 model:
--- a/examples/main/README.md
+++ b/examples/main/README.md
@@ -143,7 +143,7 @@ The `--ctx-size` option allows you to set the size of the prompt context used by

 ### Extended Context Size

-Some fine-tuned models have extended the context length by scaling RoPE. For example, if the original pre-trained model have a context length (max sequence length) of 4096 (4k) and the fine-tuned model have 32k. That is a scaling factor of 8, and should work by setting the above `--ctx-size` to 32768 (32k) and `--rope-scale` to 8.
+Some fine-tuned models have extended the context length by scaling RoPE. For example, if the original pre-trained model has a context length (max sequence length) of 4096 (4k) and the fine-tuned model has 32k. That is a scaling factor of 8, and should work by setting the above `--ctx-size` to 32768 (32k) and `--rope-scale` to 8.

 -   `--rope-scale N`: Where N is the linear scaling factor used by the fine-tuned model.

@@ -286,7 +286,7 @@ These options help improve the performance and memory usage of the LLaMA models.

 -   `--numa distribute`: Pin an equal proportion of the threads to the cores on each NUMA node. This will spread the load amongst all cores on the system, utilitizing all memory channels at the expense of potentially requiring memory to travel over the slow links between nodes.
 -   `--numa isolate`: Pin all threads to the NUMA node that the program starts on. This limits the number of cores and amount of memory that can be used, but guarantees all memory access remains local to the NUMA node.
-   `--numa numactl`: Pin threads to the CPUMAP that is passed to the program by starting it with the numactl utility. This is the most flexible mode, and allow arbitraty core usage patterns, for example a map that uses all the cores on one NUMA nodes, and just enough cores on a second node to saturate the inter-node memory bus.
+-   `--numa numactl`: Pin threads to the CPUMAP that is passed to the program by starting it with the numactl utility. This is the most flexible mode, and allow arbitrary core usage patterns, for example a map that uses all the cores on one NUMA nodes, and just enough cores on a second node to saturate the inter-node memory bus.

 These flags attempt optimizations that help on some systems with non-uniform memory access. This currently consists of one of the above strategies, and disabling prefetch and readahead for mmap. The latter causes mapped pages to be faulted in on first access instead of all at once, and in combination with pinning threads to NUMA nodes, more of the pages end up on the NUMA node where they are used. Note that if the model is already in the system page cache, for example because of a previous run without this option, this will have little effect unless you drop the page cache first. This can be done by rebooting the system or on Linux by writing '3' to '/proc/sys/vm/drop_caches' as root.

--- a/examples/main/main.cpp
+++ b/examples/main/main.cpp
@@ -796,7 +796,7 @@ int main(int argc, char ** argv) {

            // deal with end of generation tokens in interactive mode
            if (llama_token_is_eog(model, llama_sampling_last(ctx_sampling))) {
-                LOG("found EOS token\n");
+                LOG("found an EOG token\n");

                if (params.interactive) {
                    if (!params.antiprompt.empty()) {
--- a/examples/server/README.md
+++ b/examples/server/README.md
@@ -62,6 +62,18 @@ page cache before using this. See https://github.com/ggerganov/llama.cpp/issues/
 - `--chat-template JINJA_TEMPLATE`: Set custom jinja chat template. This parameter accepts a string, not a file name.  Default: template taken from model's metadata. We only support [some pre-defined templates](https://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template)
 - `--log-disable`: Output logs to stdout only, not to `llama.log`. Default: enabled
 - `--log-format FORMAT`: Define the log output to FORMAT: json or text Default: `json`
+- `--rope-scaling` : RoPE scaling method. Defaults to linear unless otherwise specified by the model. Options are `none`, `linear`, `yarn`
+- `--rope-freq-base N` : RoPE frequency base (default: loaded from model)
+- `--rope-freq-scale N`: RoPE frequency scaling factor, expands context by a factor of 1/N (e.g. 0.25)
+- `--yarn-ext-factor N` : YaRN: extrapolation mix factor (Default: 1.0, 0.0 = full interpolation)
+- `--yarn-attn-factor N` : YaRN: scale sqrt(t) or attention magnitude (default: 1.0)
+- `--yarn-beta-slow N`: YaRN: High correction dim or alpha (default: 1.0)
+- `--yarn-beta-fast N`: YaRN: low correction dim or beta (default: 32.0)
+- `--pooling` : Pooling type for embeddings, use model default if unspecified. Options are `none`, `mean`, `cls`
+- `-dt N`, `--defrag-thold N`: KV cache defragmentation threshold (default: -1.0, < 0 = disabled)
+- `-fa`, `--flash-attn` : enable flash attention (default: disabled).
+- `-ctk TYPE`, `--cache-type-k TYPE` : KV cache data type for K (default: `f16`, options `f32`, `f16`, `q8_0`, `q4_0`, `q4_1`, `iq4_nl`, `q5_0`, or `q5_1`)
+- `-ctv TYPE`, `--cache-type-v TYPE` : KV cache type for V (default `f16`, see `-ctk` for options)

 **If compiled with `LLAMA_SERVER_SSL=ON`**
 - `--ssl-key-file FNAME`: path to file a PEM-encoded SSL private key
--- a/examples/sycl/README.md
+++ b/examples/sycl/README.md
@@ -1,6 +1,6 @@
 # llama.cpp/example/sycl

-This example program provide the tools for llama.cpp for SYCL on Intel GPU.
+This example program provides the tools for llama.cpp for SYCL on Intel GPU.

 ## Tool

--- a/grammars/README.md
+++ b/grammars/README.md
@@ -51,7 +51,7 @@ single-line ::= [^\n]+ "\n"`

 ## Sequences and Alternatives

-The order of symbols in a sequence matter. For example, in `"1. " move " " move "\n"`, the `"1. "` must come before the first `move`, etc.
+The order of symbols in a sequence matters. For example, in `"1. " move " " move "\n"`, the `"1. "` must come before the first `move`, etc.

 Alternatives, denoted by `|`, give different sequences that are acceptable. For example, in `move ::= pawn | nonpawn | castle`, `move` can be a `pawn` move, a `nonpawn` move, or a `castle`.

--- a/llama.cpp
+++ b/llama.cpp
@@ -4389,6 +4389,9 @@ static void llm_load_vocab(
            } else if (
                tokenizer_pre == "command-r") {
                vocab.type_pre = LLAMA_VOCAB_PRE_TYPE_COMMAND_R;
+            } else if (
+                tokenizer_pre == "olmo") {
+                vocab.type_pre = LLAMA_VOCAB_PRE_TYPE_OLMO;
            } else {
                throw std::runtime_error(format("unknown pre-tokenizer type: '%s'", tokenizer_pre.c_str()));
            }
@@ -12248,6 +12251,7 @@ struct llm_tokenizer_bpe {
                        });
                        break;
                    case LLAMA_VOCAB_PRE_TYPE_GPT2:
+                    case LLAMA_VOCAB_PRE_TYPE_OLMO:
                        word_collection = unicode_regex_split(text, {
                            "'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)",
                        });
--- a/llama.h
+++ b/llama.h
@@ -81,6 +81,7 @@ extern "C" {
        LLAMA_VOCAB_PRE_TYPE_GPT2           = 7,
        LLAMA_VOCAB_PRE_TYPE_REFACT         = 8,
        LLAMA_VOCAB_PRE_TYPE_COMMAND_R      = 9,
+        LLAMA_VOCAB_PRE_TYPE_OLMO           = 10,
    };

    // note: these values should be synchronized with ggml_rope
Author	SHA1	Message	Date
nopperl	b6aa670203	Fix OLMo HF to GGUF conversion (#6910 )	2024-05-07 21:39:43 +02:00
Kyle Mistele	260b7c6529	server : update readme with undocumented options (#7013 )	2024-05-07 21:44:29 +03:00
Georgi Gerganov	53d6c52e22	readme : update hot topics	2024-05-07 21:43:13 +03:00
RhinoDevel	3af34c1d1b	main : update log text (EOS to EOG) (#7104 ) * Update log text (EOS to EOG) The log text "found EOS" is no longer always correct, here, because there is now an is-EOG check that also returns true for EOT. * Improve log msg. further by using "an" instead of "some". As suggested, to avoid misunderstanding (no multiple EOG tokens found, just one).	2024-05-07 20:51:31 +03:00
omahs	04976db7a8	docs: fix typos (#7124 ) * fix typo * fix typos * fix typo * fix typos * fix typo * fix typos	2024-05-07 18:20:33 +03:00
Georgi Gerganov	947d3ad27d	ci : add GG_BUILD_EXTRA_TESTS_0 env (#7098 ) * ci : add GG_BUILD_EXTRA_TESTS_0 env ggml-ci * Update run.sh ggml-ci	2024-05-07 11:08:49 +03:00