llama.cpp/tests
Francis Couture-Harpin f9d42c598b convert_hf : identify more added control tokens for SPM tokenizers
This makes Gemma and Gemma-2 tokenize pretty much EVERYTHING correctly,
including HTML tags and consecutive spaces,
but it unfortunately requires model re-conversion.

The HF tokenizer for Gemma seems to behave oddly:
it prefers the 16-space token over longer space tokens,
while the SentencePiece tokenizer does not.
(The implementation in llama.cpp behaves the same as SentencePiece.)

* llama : fix wrong pre-tokenization of byte tokens
2024-07-07 23:28:38 -04:00