llama.cpp/models at 0f45f1a35cddd6e8fec2aa216cc11b2345205edd

sdgoij/llama.cpp

mirror of https://github.com/ggml-org/llama.cpp.git synced 2026-05-14 21:14:10 +00:00


Kabir Potdar 42532afff4 unicode,test: add Qwen3.5 non-backtracking tokenizer handler and regr… (#22110 )

* unicode,test: add Qwen3.5 non-backtracking tokenizer handler and regression tests

- Add unicode_regex_split_custom_qwen35() to [src/unicode.cpp](src/unicode.cpp), a non-backtracking handler for Qwen3.5's [\p{L}\p{M}]+ regex (letters + combining marks).
- Register the handler in the custom tokenizer dispatch table to prevent stack overflows on long inputs (fixes #21919).
- Add [models/ggml-vocab-qwen35.gguf](models/ggml-vocab-qwen35.gguf) (test vocab), [models/ggml-vocab-qwen35.gguf.inp](models/ggml-vocab-qwen35.gguf.inp) (test cases), and [models/ggml-vocab-qwen35.gguf.out](models/ggml-vocab-qwen35.gguf.out) (expected output) for regression testing.
- Update [tests/CMakeLists.txt](tests/CMakeLists.txt) to include the new test entry.

This mirrors the Qwen2 fix (commit 0d049d6) but adapts it to Qwen3.5's regex. It keeps Unicode tokenization robust and prevents std::regex stack overflows on long inputs.
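
To illustrate why a hand-written scanner avoids the stack overflow, here is a minimal sketch of a non-backtracking split for the `[\p{L}\p{M}]+` part of the pattern only. It is not the code from `unicode_regex_split_custom_qwen35()`. The names `is_letter_or_mark`, `split_letter_mark_runs`, and `check_example` are hypothetical, and the property check is a deliberately tiny stand-in (ASCII letters, Latin-1 and Latin Extended-A letters, and the U+0300..U+036F combining marks) rather than a full Unicode table.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-in for a \p{L} / \p{M} lookup.
// A real implementation would consult full Unicode category tables.
static bool is_letter_or_mark(uint32_t cp) {
    if ((cp >= 'A' && cp <= 'Z') || (cp >= 'a' && cp <= 'z')) return true;
    // Latin-1 Supplement and Latin Extended-A letters, excluding the
    // multiplication (U+00D7) and division (U+00F7) signs.
    if (cp >= 0x00C0 && cp <= 0x017F && cp != 0x00D7 && cp != 0x00F7) return true;
    // Combining Diacritical Marks block.
    if (cp >= 0x0300 && cp <= 0x036F) return true;
    return false;
}

// Returns chunk lengths in code points. Each maximal run of letters/marks
// becomes one chunk; every other code point becomes a chunk of its own.
// The scan is a single forward loop: no recursion, no backtracking, so the
// stack depth stays constant regardless of input length.
static std::vector<size_t> split_letter_mark_runs(const std::vector<uint32_t> & cpts) {
    std::vector<size_t> lens;
    size_t i = 0;
    while (i < cpts.size()) {
        const size_t start = i;
        if (is_letter_or_mark(cpts[i])) {
            while (i < cpts.size() && is_letter_or_mark(cpts[i])) {
                ++i;
            }
        } else {
            ++i;
        }
        lens.push_back(i - start);
    }
    return lens;
}

// Small self-check: "ab e\u0301!" splits into "ab", " ", "e\u0301", "!".
static void check_example() {
    const std::vector<uint32_t> cpts = { 'a', 'b', ' ', 'e', 0x0301, '!' };
    const std::vector<size_t> lens = split_letter_mark_runs(cpts);
    assert(lens.size() == 4);
    assert(lens[0] == 2 && lens[1] == 1 && lens[2] == 2 && lens[3] == 1);
}
```

Because the loop visits each code point once, a single very long run of letters costs linear time and constant stack space. A backtracking regex engine may instead recurse once per matched character, which is the failure mode the commit message describes.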

Closes #21919.

* fix: enhance regex handling for Qwen3.5 tokenizer to include accent marks

* cont : remove trailing whitespace

---------

Co-authored-by: Kabir <kabir@example.com>
Co-authored-by: Alde Rojas <hello@alde.dev>

2026-05-14 11:03:40 +02:00

..

autoparser: support case of JSON_NATIVE with per-call markers (test case: Reka-Edge) (#21892 )

2026-04-15 10:51:50 +02:00

.editorconfig

gguf : new file format with flexible meta data (beta) (#2398 )

2023-08-21 23:07:43 +03:00

ggml-vocab-aquila.gguf

Work on the BPE tokenizer (#3252 )

2023-10-03 09:16:26 +02:00

ggml-vocab-baichuan.gguf

Add more tokenizer tests (#3742 )

2023-10-24 09:17:17 +02:00

ggml-vocab-bert-bge.gguf

llama : fix BPE pre-tokenization (#6920 )

2024-04-29 16:58:41 +03:00

ggml-vocab-bert-bge.gguf.inp

convert : allow partial update to the chkhsh pre-tokenizer list (#13847 )

2025-05-30 12:24:37 +02:00

ggml-vocab-bert-bge.gguf.out

convert : allow partial update to the chkhsh pre-tokenizer list (#13847 )

2025-05-30 12:24:37 +02:00

ggml-vocab-command-r.gguf

command-r : add BPE pre-tokenization (#7063 )

2024-05-05 08:19:30 +03:00

ggml-vocab-command-r.gguf.inp

convert : allow partial update to the chkhsh pre-tokenizer list (#13847 )

2025-05-30 12:24:37 +02:00

ggml-vocab-command-r.gguf.out

convert : allow partial update to the chkhsh pre-tokenizer list (#13847 )

2025-05-30 12:24:37 +02:00

ggml-vocab-deepseek-coder.gguf

llama : fix BPE pre-tokenization (#6920 )

2024-04-29 16:58:41 +03:00

ggml-vocab-deepseek-coder.gguf.inp

convert : allow partial update to the chkhsh pre-tokenizer list (#13847 )

2025-05-30 12:24:37 +02:00

ggml-vocab-deepseek-coder.gguf.out

convert : allow partial update to the chkhsh pre-tokenizer list (#13847 )

2025-05-30 12:24:37 +02:00

ggml-vocab-deepseek-llm.gguf

llama : fix BPE pre-tokenization (#6920 )

2024-04-29 16:58:41 +03:00

ggml-vocab-deepseek-llm.gguf.inp

convert : allow partial update to the chkhsh pre-tokenizer list (#13847 )

2025-05-30 12:24:37 +02:00

ggml-vocab-deepseek-llm.gguf.out

convert : allow partial update to the chkhsh pre-tokenizer list (#13847 )

2025-05-30 12:24:37 +02:00

ggml-vocab-falcon.gguf

llama : fix BPE pre-tokenization (#6920 )

2024-04-29 16:58:41 +03:00

ggml-vocab-falcon.gguf.inp

convert : allow partial update to the chkhsh pre-tokenizer list (#13847 )

2025-05-30 12:24:37 +02:00

ggml-vocab-falcon.gguf.out

convert : allow partial update to the chkhsh pre-tokenizer list (#13847 )

2025-05-30 12:24:37 +02:00

ggml-vocab-gemma-4.gguf

vocab: add gemma4 tokenizer tests, fix edge case (#21534 )

2026-04-09 11:41:14 +02:00

ggml-vocab-gemma-4.gguf.inp

vocab: add gemma4 tokenizer tests, fix edge case (#21534 )

2026-04-09 11:41:14 +02:00

ggml-vocab-gemma-4.gguf.out

vocab: add gemma4 tokenizer tests, fix edge case (#21534 )

2026-04-09 11:41:14 +02:00

ggml-vocab-gpt-2.gguf

llama : fix BPE pre-tokenization (#6920 )

2024-04-29 16:58:41 +03:00

ggml-vocab-gpt-2.gguf.inp

convert : allow partial update to the chkhsh pre-tokenizer list (#13847 )

2025-05-30 12:24:37 +02:00

ggml-vocab-gpt-2.gguf.out

convert : allow partial update to the chkhsh pre-tokenizer list (#13847 )

2025-05-30 12:24:37 +02:00

ggml-vocab-gpt-neox.gguf

Add more tokenizer tests (#3742 )

2023-10-24 09:17:17 +02:00

ggml-vocab-llama-bpe.gguf

llama : fix BPE pre-tokenization (#6920 )

2024-04-29 16:58:41 +03:00

ggml-vocab-llama-bpe.gguf.inp

convert : allow partial update to the chkhsh pre-tokenizer list (#13847 )

2025-05-30 12:24:37 +02:00

ggml-vocab-llama-bpe.gguf.out

convert : allow partial update to the chkhsh pre-tokenizer list (#13847 )

2025-05-30 12:24:37 +02:00

ggml-vocab-llama-spm.gguf

llama : fix BPE pre-tokenization (#6920 )

2024-04-29 16:58:41 +03:00

ggml-vocab-llama-spm.gguf.inp

convert : allow partial update to the chkhsh pre-tokenizer list (#13847 )

2025-05-30 12:24:37 +02:00

ggml-vocab-llama-spm.gguf.out

convert : allow partial update to the chkhsh pre-tokenizer list (#13847 )

2025-05-30 12:24:37 +02:00

ggml-vocab-mpt.gguf

llama : fix BPE pre-tokenization (#6920 )

2024-04-29 16:58:41 +03:00

ggml-vocab-mpt.gguf.inp

convert : allow partial update to the chkhsh pre-tokenizer list (#13847 )

2025-05-30 12:24:37 +02:00

ggml-vocab-mpt.gguf.out

convert : allow partial update to the chkhsh pre-tokenizer list (#13847 )

2025-05-30 12:24:37 +02:00

ggml-vocab-nomic-bert-moe.gguf

tests : improve UGM tokenizer test coverage (#13773 )

2025-05-25 16:22:29 +02:00

ggml-vocab-phi-3.gguf

Per token attributes (#7685 )

2024-06-04 09:17:17 +02:00

ggml-vocab-phi-3.gguf.inp

convert : allow partial update to the chkhsh pre-tokenizer list (#13847 )

2025-05-30 12:24:37 +02:00

ggml-vocab-phi-3.gguf.out

convert : allow partial update to the chkhsh pre-tokenizer list (#13847 )

2025-05-30 12:24:37 +02:00

ggml-vocab-qwen2.gguf

llama : add BPE pre-tokenization for Qwen2 (#7114 )

2024-05-08 15:06:43 +03:00

ggml-vocab-qwen2.gguf.inp

convert : allow partial update to the chkhsh pre-tokenizer list (#13847 )

2025-05-30 12:24:37 +02:00

ggml-vocab-qwen2.gguf.out

convert : allow partial update to the chkhsh pre-tokenizer list (#13847 )

2025-05-30 12:24:37 +02:00

ggml-vocab-qwen35.gguf

unicode,test: add Qwen3.5 non-backtracking tokenizer handler and regr… (#22110 )

2026-05-14 11:03:40 +02:00

ggml-vocab-qwen35.gguf.inp

unicode,test: add Qwen3.5 non-backtracking tokenizer handler and regr… (#22110 )

2026-05-14 11:03:40 +02:00

ggml-vocab-qwen35.gguf.out

unicode,test: add Qwen3.5 non-backtracking tokenizer handler and regr… (#22110 )

2026-05-14 11:03:40 +02:00

ggml-vocab-refact.gguf

tests : add test-tokenizer-0.sh + fix some tokenizers (#7036 )

2024-05-04 08:32:32 +03:00

ggml-vocab-refact.gguf.inp

convert : allow partial update to the chkhsh pre-tokenizer list (#13847 )

2025-05-30 12:24:37 +02:00

ggml-vocab-refact.gguf.out

convert : allow partial update to the chkhsh pre-tokenizer list (#13847 )

2025-05-30 12:24:37 +02:00

ggml-vocab-starcoder.gguf

llama : fix BPE pre-tokenization (#6920 )

2024-04-29 16:58:41 +03:00

ggml-vocab-starcoder.gguf.inp

convert : allow partial update to the chkhsh pre-tokenizer list (#13847 )

2025-05-30 12:24:37 +02:00

ggml-vocab-starcoder.gguf.out

convert : allow partial update to the chkhsh pre-tokenizer list (#13847 )

2025-05-30 12:24:37 +02:00