Compare commits

...

18 Commits

Author SHA1 Message Date
Francis Couture-Harpin
aaf7bc89e4 Merge branch 'master' into compilade/gguf-py-fix-old-numpy 2024-07-09 00:10:06 -04:00
Georgi Gerganov
7fdb6f73e3 flake.lock: Update (#8342)
Flake lock file updates:

• Updated input 'flake-parts':
    'github:hercules-ci/flake-parts/2a55567fcf15b1b1c7ed712a2c6fadaec7412ea8?narHash=sha256-iKzJcpdXih14qYVcZ9QC9XuZYnPc6T8YImb6dX166kw%3D' (2024-06-01)
  → 'github:hercules-ci/flake-parts/9227223f6d922fee3c7b190b2cc238a99527bbb7?narHash=sha256-pQMhCCHyQGRzdfAkdJ4cIWiw%2BJNuWsTX7f0ZYSyz0VY%3D' (2024-07-03)
• Updated input 'flake-parts/nixpkgs-lib':
    'eb9ceca17d.tar.gz?narHash=sha256-lIbdfCsf8LMFloheeE6N31%2BBMIeixqyQWbSr2vk79EQ%3D' (2024-06-01)
  → '5daf051448.tar.gz?narHash=sha256-Fm2rDDs86sHy0/1jxTOKB1118Q0O3Uc7EC0iXvXKpbI%3D' (2024-07-01)
• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/b2852eb9365c6de48ffb0dc2c9562591f652242a?narHash=sha256-C8e9S7RzshSdHB7L%2Bv9I51af1gDM5unhJ2xO1ywxNH8%3D' (2024-06-27)
  → 'github:NixOS/nixpkgs/9f4128e00b0ae8ec65918efeba59db998750ead6?narHash=sha256-rwz8NJZV%2B387rnWpTYcXaRNvzUSnnF9aHONoJIYmiUQ%3D' (2024-07-03)

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2024-07-08 15:36:38 -07:00
Alberto Cabrera Pérez
a130eccef4 labeler : updated sycl to match docs and code refactor (#8373) 2024-07-08 22:35:17 +02:00
b4b4o
c4dd11d1d3 readme : fix web link error [no ci] (#8347) 2024-07-08 17:19:24 +03:00
Alberto Cabrera Pérez
2ec846d558 sycl : fix powf call in device code (#8368) 2024-07-08 14:22:41 +01:00
Georgi Gerganov
3f2d538b81 scripts : fix sync for sycl 2024-07-08 13:51:31 +03:00
Georgi Gerganov
2ee44c9a18 sync : ggml
ggml-ci
2024-07-08 12:23:00 +03:00
Georgi Gerganov
6847d54c4f tests : fix whitespace (#0) 2024-07-08 12:23:00 +03:00
John Balis
fde13b3bb9 feat: cuda implementation for ggml_conv_transpose_1d (ggml/854)
* conv transpose 1d passing test for 1d input and kernel

* working for different input and output channel counts, added test for variable stride

* initial draft appears to work with stride other than 1

* working with all old and new conv1d  tests

* added a test for large tensors

* removed use cuda hardcoding

* restored test-conv-transpose.c

* removed unused arugments, and fixed bug where test failure would cause subsequent tests to fail

* fixed accumulator bug

* added test to test-backend-ops

* fixed mistake

* addressed review

* fixed includes

* removed blank lines

* style and warning fixes

* return failure when test fails

* fix supports_op

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-07-08 12:23:00 +03:00
Kevin Wang
470939d483 common : preallocate sampling token data vector (#8363)
`emplace_back` repeatedly-called is slower than preallocating the vector to the vocab size and directly inserting the data. Some rudimentary profiling with `chrono` improves the performance of this block of code from ~500us/op to ~40us/op.

Overall, this slightly improves the sampling performance which has a more substantial impact for the `examples/lookahead` implementation -- I am able to see a ~10% performance boost in lookahead inference.
2024-07-08 10:26:53 +03:00
Georgi Gerganov
6f0dbf6ab0 infill : assert prefix/suffix tokens + remove old space logic (#8351) 2024-07-08 09:34:35 +03:00
Kevin Wang
ffd00797d8 common : avoid unnecessary logits fetch (#8358) 2024-07-08 09:31:55 +03:00
toyer
04ce3a8b19 readme : add supported glm models (#8360) 2024-07-08 08:57:19 +03:00
compilade
3fd62a6b1c py : type-check all Python scripts with Pyright (#8341)
* py : type-check all Python scripts with Pyright

* server-tests : use trailing slash in openai base_url

* server-tests : add more type annotations

* server-tests : strip "chat" from base_url in oai_chat_completions

* server-tests : model metadata is a dict

* ci : disable pip cache in type-check workflow

The cache is not shared between branches, and it's 250MB in size,
so it would become quite a big part of the 10GB cache limit of the repo.

* py : fix new type errors from master branch

* tests : fix test-tokenizer-random.py

Apparently, gcc applies optimisations even when pre-processing,
which confuses pycparser.

* ci : only show warnings and errors in python type-check

The "information" level otherwise has entries
from 'examples/pydantic_models_to_grammar.py',
which could be confusing for someone trying to figure out what failed,
considering that these messages can safely be ignored
even though they look like errors.
2024-07-07 15:04:39 -04:00
Denis Spasyuk
a8db2a9ce6 Update llama-cli documentation (#8315)
* Update README.md

* Update README.md

* Update README.md

fixed llama-cli/main, templates on some cmds added chat template sections and fixed typos in some areas

* Update README.md

* Update README.md

* Update README.md
2024-07-07 17:08:28 +02:00
Alex Tuddenham
4090ea5501 ci : add checks for cmake,make and ctest in ci/run.sh (#8200)
* Added checks for cmake,make and ctest

* Removed erroneous whitespace
2024-07-07 17:59:14 +03:00
Andy Tai
f1948f1e10 readme : update bindings list (#8222)
* adding guile_llama_cpp  to binding list

* fix formatting

* fix formatting
2024-07-07 16:21:37 +03:00
Francis Couture-Harpin
8334b5becb gguf-py : do not use internal numpy types 2024-05-22 14:29:50 -04:00
49 changed files with 551 additions and 264 deletions

View File

@@ -89,6 +89,22 @@ let
ps.tiktoken
ps.torchWithoutCuda
ps.transformers
# server bench
ps.matplotlib
# server tests
ps.openai
ps.behave
ps.prometheus-client
# for examples/pydantic-models-to-grammar-examples.py
ps.docstring-parser
ps.pydantic
# for scripts/compare-llama-bench.py
ps.gitpython
ps.tabulate
]
);

4
.github/labeler.yml vendored
View File

@@ -16,7 +16,9 @@ SYCL:
- any-glob-to-any-file:
- ggml/include/ggml-sycl.h
- ggml/src/ggml-sycl.cpp
- README-sycl.md
- ggml/src/ggml-sycl/**
- docs/backend/SYCL.md
- examples/sycl/**
Nvidia GPU:
- changed-files:
- any-glob-to-any-file:

38
.github/workflows/python-type-check.yml vendored Normal file
View File

@@ -0,0 +1,38 @@
name: Python Type-Check
on:
push:
paths:
- '.github/workflows/python-type-check.yml'
- '**.py'
- '**/requirements*.txt'
pull_request:
paths:
- '.github/workflows/python-type-check.yml'
- '**.py'
- '**/requirements*.txt'
concurrency:
group: ${{ github.workflow }}-${{ github.head_ref && github.ref || github.run_id }}
cancel-in-progress: true
jobs:
python-type-check:
runs-on: ubuntu-latest
name: pyright type-check
steps:
- name: Check out source repository
uses: actions/checkout@v4
- name: Set up Python environment
uses: actions/setup-python@v5
with:
python-version: "3.11"
- name: Install Python dependencies
# TODO: use a venv
run: pip install -r requirements/requirements-all.txt
- name: Type-check with Pyright
uses: jakebailey/pyright-action@v2
with:
version: 1.1.370
level: warning
warnings: true

View File

@@ -96,8 +96,9 @@ Typically finetunes of the base models below are supported as well.
- [x] [GritLM-7B](https://huggingface.co/GritLM/GritLM-7B) + [GritLM-8x7B](https://huggingface.co/GritLM/GritLM-8x7B)
- [x] [OLMo](https://allenai.org/olmo)
- [x] [GPT-NeoX](https://github.com/EleutherAI/gpt-neox) + [Pythia](https://github.com/EleutherAI/pythia)
- [x] [ChatGLM3-6b](https://huggingface.co/THUDM/chatglm3-6b) + [ChatGLM4-9b](https://huggingface.co/THUDM/glm-4-9b)
(instructions for supporting more models: [HOWTO-add-model.md](./docs/HOWTO-add-model.md))
(instructions for supporting more models: [HOWTO-add-model.md](./docs/development/HOWTO-add-model.md))
**Multimodal models:**
@@ -131,6 +132,7 @@ Typically finetunes of the base models below are supported as well.
- Zig: [deins/llama.cpp.zig](https://github.com/Deins/llama.cpp.zig)
- Flutter/Dart: [netdur/llama_cpp_dart](https://github.com/netdur/llama_cpp_dart)
- PHP (API bindings and features built on top of llama.cpp): [distantmagic/resonance](https://github.com/distantmagic/resonance) [(more info)](https://github.com/ggerganov/llama.cpp/pull/6326)
- Guile Scheme: [guile_llama_cpp](https://savannah.nongnu.org/projects/guile-llama-cpp)
**UI:**

View File

@@ -103,6 +103,9 @@ function gg_run_ctest_debug {
set -e
# Check cmake, make and ctest are installed
gg_check_build_requirements
(time cmake -DCMAKE_BUILD_TYPE=Debug ${CMAKE_EXTRA} .. ) 2>&1 | tee -a $OUT/${ci}-cmake.log
(time make -j ) 2>&1 | tee -a $OUT/${ci}-make.log
@@ -131,6 +134,9 @@ function gg_run_ctest_release {
set -e
# Check cmake, make and ctest are installed
gg_check_build_requirements
(time cmake -DCMAKE_BUILD_TYPE=Release ${CMAKE_EXTRA} .. ) 2>&1 | tee -a $OUT/${ci}-cmake.log
(time make -j ) 2>&1 | tee -a $OUT/${ci}-make.log
@@ -701,6 +707,20 @@ function gg_run_embd_bge_small {
set +e
}
function gg_check_build_requirements {
if ! command -v cmake &> /dev/null; then
gg_printf 'cmake not found, please install'
fi
if ! command -v make &> /dev/null; then
gg_printf 'make not found, please install'
fi
if ! command -v ctest &> /dev/null; then
gg_printf 'ctest not found, please install'
fi
}
function gg_sum_embd_bge_small {
gg_printf '### %s\n\n' "${ci}"

View File

@@ -630,7 +630,7 @@ inline std::string LOG_TOKENS_TOSTR_PRETTY(const C & ctx, const T & tokens)
buf << "[ ";
bool first = true;
for (const auto &token : tokens)
for (const auto & token : tokens)
{
if (!first) {
buf << ", ";

View File

@@ -282,8 +282,6 @@ static llama_token llama_sampling_sample_impl(
GGML_ASSERT(!original_logits.empty());
}
llama_token id = 0;
// Get a pointer to the logits
float * logits = llama_get_logits_ith(ctx_main, idx);
if (temp < 0.0) {
// greedy sampling, with probs
@@ -324,6 +322,9 @@ static llama_token llama_sampling_sample_impl(
}
if (ctx_sampling->grammar != NULL && !is_resampling) {
// Get a pointer to the logits
float * logits = llama_get_logits_ith(ctx_main, idx);
// Create an array with a single token data element for the sampled id
llama_token_data single_token_data = {id, logits[id], 0.0f};
llama_token_data_array single_token_data_array = { &single_token_data, 1, false };
@@ -377,7 +378,7 @@ static llama_token_data_array llama_sampling_prepare_impl(
if (ctx_sampling->grammar != NULL && !apply_grammar) {
GGML_ASSERT(original_logits != NULL);
// Only make a copy of the original logits if we are not applying grammar checks, not sure if I actually have to do this.
*original_logits = {logits, logits + llama_n_vocab(llama_get_model(ctx_main))};
*original_logits = {logits, logits + n_vocab};
}
// apply params.logit_bias map
@@ -390,10 +391,10 @@ static llama_token_data_array llama_sampling_prepare_impl(
llama_sample_apply_guidance(ctx_main, logits, logits_guidance, params.cfg_scale);
}
cur.clear();
cur.resize(n_vocab);
for (llama_token token_id = 0; token_id < n_vocab; token_id++) {
cur.emplace_back(llama_token_data{token_id, logits[token_id], 0.0f});
cur[token_id] = llama_token_data{token_id, logits[token_id], 0.0f};
}
llama_token_data_array cur_p = { cur.data(), cur.size(), false };

View File

@@ -265,7 +265,7 @@ class Model:
break
for new_name, data in ((n, d.squeeze().numpy()) for n, d in self.modify_tensors(data_torch, name, bid)):
data: np.ndarray = data # type hint
data: np.ndarray # type hint
n_dims = len(data.shape)
data_dtype = data.dtype
data_qtype: gguf.GGMLQuantizationType | None = None
@@ -599,10 +599,6 @@ class Model:
tokenizer_path = self.dir_model / 'tokenizer.model'
tokens: list[bytes] = []
scores: list[float] = []
toktypes: list[int] = []
if not tokenizer_path.is_file():
raise FileNotFoundError(f"File not found: {tokenizer_path}")
@@ -2120,7 +2116,7 @@ class InternLM2Model(Model):
logger.error(f'Error: Missing {tokenizer_path}')
sys.exit(1)
sentencepiece_model = model.ModelProto()
sentencepiece_model = model.ModelProto() # pyright: ignore[reportAttributeAccessIssue]
sentencepiece_model.ParseFromString(open(tokenizer_path, "rb").read())
add_prefix = sentencepiece_model.normalizer_spec.add_dummy_prefix
@@ -2972,16 +2968,16 @@ class T5Model(Model):
if not tokenizer_path.is_file():
raise FileNotFoundError(f"File not found: {tokenizer_path}")
sentencepiece_model = model.ModelProto()
sentencepiece_model = model.ModelProto() # pyright: ignore[reportAttributeAccessIssue]
sentencepiece_model.ParseFromString(open(tokenizer_path, "rb").read())
# some models like Pile-T5 family use BPE tokenizer instead of Unigram
if sentencepiece_model.trainer_spec.model_type == 2: # BPE
if sentencepiece_model.trainer_spec.model_type == 2: # BPE
# assure the tokenizer model file name is correct
assert tokenizer_path.name == 'tokenizer.model'
return self._set_vocab_sentencepiece()
else:
assert sentencepiece_model.trainer_spec.model_type == 1 # UNIGRAM
assert sentencepiece_model.trainer_spec.model_type == 1 # UNIGRAM
add_prefix = sentencepiece_model.normalizer_spec.add_dummy_prefix
remove_whitespaces = sentencepiece_model.normalizer_spec.remove_extra_whitespaces
@@ -3152,7 +3148,7 @@ class JaisModel(Model):
# but Jais's PyTorch model simply precalculates the slope values and places them
# in relative_pes.slopes
n_head_closest_log2 = 2 ** math.floor(math.log2(self.hparams["n_head"]))
first_val = float(data_torch._data[0])
first_val = float(data_torch[0].item())
self.max_alibi_bias = -round(math.log2(first_val) * n_head_closest_log2)
return tensors
@@ -3186,7 +3182,7 @@ class ChatGLMModel(Model):
def set_vocab_chatglm3(self):
dir_model = self.dir_model
hparams = self.hparams
tokens: list[bytearray] = []
tokens: list[bytes] = []
toktypes: list[int] = []
scores: list[float] = []
@@ -3335,7 +3331,7 @@ class ChatGLMModel(Model):
special_vocab.add_to_gguf(self.gguf_writer)
def set_gguf_parameters(self):
self.gguf_writer.add_name(self.hparams.get("_name_or_path").split("/")[1]) # THUDM/glm4-9b-chat or THUDM/chatglm3-6b
self.gguf_writer.add_name(self.hparams["_name_or_path"].split("/")[1]) # THUDM/glm4-9b-chat or THUDM/chatglm3-6b
n_embed = self.hparams.get("hidden_size", self.hparams.get("n_embed"))
n_head = self.hparams.get("n_head", self.hparams.get("num_attention_heads"))
n_head_kv = self.hparams.get("multi_query_group_num", n_head)

View File

@@ -354,7 +354,8 @@ class GGMLToGGUF:
def handle_metadata(cfg, hp):
import convert
import examples.convert_legacy_llama as convert
assert cfg.model_metadata_dir.is_dir(), 'Metadata dir is not a directory'
hf_config_path = cfg.model_metadata_dir / "config.json"
orig_config_path = cfg.model_metadata_dir / "params.json"

View File

@@ -353,7 +353,7 @@ class Metadata:
version: Optional[str] = None
url: Optional[str] = None
description: Optional[str] = None
licence: Optional[str] = None
license: Optional[str] = None
source_url: Optional[str] = None
source_hf_repo: Optional[str] = None
@@ -492,12 +492,13 @@ class LazyTensor:
LazyModel: TypeAlias = 'dict[str, LazyTensor]'
ModelFormat: TypeAlias = Literal['ggml', 'torch', 'safetensors', 'none']
@dataclass
class ModelPlus:
model: LazyModel
paths: list[Path] # Where this was read from.
format: Literal['ggml', 'torch', 'safetensors', 'none']
format: ModelFormat
vocab: BaseVocab | None # For GGML models (which have vocab built in), the vocab.
@@ -536,7 +537,7 @@ def merge_sharded(models: list[LazyModel]) -> LazyModel:
def merge_multifile_models(models_plus: list[ModelPlus]) -> ModelPlus:
formats = set(mp.format for mp in models_plus)
formats: set[ModelFormat] = set(mp.format for mp in models_plus)
assert len(formats) == 1, "different formats?"
format = formats.pop()
paths = [path for mp in models_plus for path in mp.paths]
@@ -555,7 +556,7 @@ def merge_multifile_models(models_plus: list[ModelPlus]) -> ModelPlus:
else:
model = merge_sharded([mp.model for mp in models_plus])
return ModelPlus(model, paths, format, vocab) # pytype: disable=wrong-arg-types
return ModelPlus(model, paths, format, vocab)
def permute_lazy(lazy_tensor: LazyTensor, n_head: int, n_head_kv: int) -> LazyTensor:
@@ -805,7 +806,7 @@ class OutputFile:
def __init__(self, fname_out: Path, endianess:gguf.GGUFEndian = gguf.GGUFEndian.LITTLE):
self.gguf = gguf.GGUFWriter(fname_out, gguf.MODEL_ARCH_NAMES[ARCH], endianess=endianess)
def add_meta_model(self, params: Params, metadata: Metadata) -> None:
def add_meta_model(self, params: Params, metadata: Metadata | None) -> None:
# Metadata About The Model And Its Provenence
name = "LLaMA"
if metadata is not None and metadata.name is not None:
@@ -827,8 +828,8 @@ class OutputFile:
self.gguf.add_url(metadata.url)
if metadata.description is not None:
self.gguf.add_description(metadata.description)
if metadata.licence is not None:
self.gguf.add_licence(metadata.licence)
if metadata.license is not None:
self.gguf.add_licence(metadata.license)
if metadata.source_url is not None:
self.gguf.add_source_url(metadata.source_url)
if metadata.source_hf_repo is not None:
@@ -943,7 +944,7 @@ class OutputFile:
@staticmethod
def write_vocab_only(
fname_out: Path, params: Params, vocab: Vocab, svocab: gguf.SpecialVocab,
endianess: gguf.GGUFEndian = gguf.GGUFEndian.LITTLE, pad_vocab: bool = False, metadata: Metadata = None,
endianess: gguf.GGUFEndian = gguf.GGUFEndian.LITTLE, pad_vocab: bool = False, metadata: Metadata | None = None,
) -> None:
check_vocab_size(params, vocab, pad_vocab=pad_vocab)
@@ -977,7 +978,7 @@ class OutputFile:
fname_out: Path, ftype: GGMLFileType, params: Params, model: LazyModel, vocab: BaseVocab, svocab: gguf.SpecialVocab,
concurrency: int = DEFAULT_CONCURRENCY, endianess: gguf.GGUFEndian = gguf.GGUFEndian.LITTLE,
pad_vocab: bool = False,
metadata: Metadata = None,
metadata: Metadata | None = None,
) -> None:
check_vocab_size(params, vocab, pad_vocab=pad_vocab)
@@ -1396,6 +1397,8 @@ def main(args_in: list[str] | None = None) -> None:
if model_plus.vocab is not None and args.vocab_dir is None and not args.no_vocab:
vocab = model_plus.vocab
assert params is not None
logger.info(f"Vocab info: {vocab}")
logger.info(f"Special vocab info: {special_vocab}")
model = model_plus.model

View File

@@ -74,7 +74,7 @@ class Tensor:
if len(self.ne) == 0:
self.nbytes = 0
else:
self.nbytes = int(np.product(self.ne)) * 4
self.nbytes = int(np.prod(self.ne)) * 4
else:
raise ValueError(f"Unhandled data type '{self.dtype}'")

View File

@@ -204,21 +204,17 @@ int main(int argc, char ** argv) {
GGML_ASSERT(llama_add_eos_token(model) != 1);
LOG("add_bos: %d\n", add_bos);
bool suff_rm_leading_spc = params.escape;
if (suff_rm_leading_spc && params.input_suffix.find_first_of(' ') == 0 && params.input_suffix.size() > 1) {
params.input_suffix.erase(0, 1);
suff_rm_leading_spc = false;
}
std::vector<llama_token> embd_inp;
std::vector<llama_token> embd_end;
std::vector<llama_token> inp_pfx = ::llama_tokenize(ctx, params.input_prefix, false);
std::vector<llama_token> inp_sfx = ::llama_tokenize(ctx, params.input_suffix, false);
const int space_token = 29871;
if (suff_rm_leading_spc && inp_sfx[0] == space_token) {
inp_sfx.erase(inp_sfx.begin());
}
GGML_ASSERT(llama_token_prefix(model) >= 0);
GGML_ASSERT(llama_token_suffix(model) >= 0);
inp_pfx.insert(inp_pfx.begin(), llama_token_prefix(model));
inp_sfx.insert(inp_sfx.begin(), llama_token_suffix(model));
embd_inp = params.spm_infill ? inp_sfx : inp_pfx;
embd_end = params.spm_infill ? inp_pfx : inp_sfx;
if (add_bos) {
@@ -516,19 +512,14 @@ int main(int argc, char ** argv) {
string_process_escapes(params.input_prefix);
string_process_escapes(params.input_suffix);
}
suff_rm_leading_spc = params.escape;
if (suff_rm_leading_spc && params.input_suffix.find_first_of(' ') == 0 && params.input_suffix.size() > 1) {
params.input_suffix.erase(0, 1);
suff_rm_leading_spc = false;
}
// tokenize new prefix and suffix
std::vector<llama_token> inp_pfx = ::llama_tokenize(ctx, params.input_prefix, false);
std::vector<llama_token> inp_sfx = ::llama_tokenize(ctx, params.input_suffix, false);
if (suff_rm_leading_spc && inp_sfx[0] == space_token) {
inp_sfx.erase(inp_sfx.begin());
}
inp_pfx.insert(inp_pfx.begin(), llama_token_prefix(model));
inp_sfx.insert(inp_sfx.begin(), llama_token_suffix(model));
embd_inp = params.spm_infill ? inp_sfx : inp_pfx;
embd_end = params.spm_infill ? inp_pfx : inp_sfx;
if (add_bos) {

View File

@@ -3,7 +3,7 @@
#! pip install pydantic
#! python json_schema_pydantic_example.py
from pydantic import BaseModel, Extra, TypeAdapter
from pydantic import BaseModel, Field, TypeAdapter
from annotated_types import MinLen
from typing import Annotated, List, Optional
import json, requests
@@ -17,6 +17,9 @@ if True:
The response_model param takes a type (+ supports Pydantic) and behaves just as w/ Instructor (see below)
'''
response_format = None
type_adapter = None
if response_model:
type_adapter = TypeAdapter(response_model)
schema = type_adapter.json_schema()

View File

@@ -1,4 +1,6 @@
#!/usr/bin/env python3
from __future__ import annotations
import argparse
import itertools
import json
@@ -188,7 +190,7 @@ def _generate_min_max_int(min_value: Optional[int], max_value: Optional[int], ou
raise RuntimeError("At least one of min_value or max_value must be set")
class BuiltinRule:
def __init__(self, content: str, deps: list = None):
def __init__(self, content: str, deps: list | None = None):
self.content = content
self.deps = deps or []
@@ -248,7 +250,7 @@ class SchemaConverter:
def _format_literal(self, literal):
escaped = GRAMMAR_LITERAL_ESCAPE_RE.sub(
lambda m: GRAMMAR_LITERAL_ESCAPES.get(m.group(0)), literal
lambda m: GRAMMAR_LITERAL_ESCAPES.get(m.group(0)) or m.group(0), literal
)
return f'"{escaped}"'
@@ -403,11 +405,11 @@ class SchemaConverter:
i = 0
length = len(pattern)
def to_rule(s: Tuple[str, bool]) -> str:
def to_rule(s: tuple[str, bool]) -> str:
(txt, is_literal) = s
return "\"" + txt + "\"" if is_literal else txt
def transform() -> Tuple[str, bool]:
def transform() -> tuple[str, bool]:
'''
Parse a unit at index i (advancing it), and return its string representation + whether it's a literal.
'''
@@ -420,7 +422,7 @@ class SchemaConverter:
# We only need a flat structure here to apply repetition operators to the last item, and
# to merge literals at the and (we're parsing grouped ( sequences ) recursively and don't treat '|' specially
# (GBNF's syntax is luckily very close to regular expressions!)
seq: list[Tuple[str, bool]] = []
seq: list[tuple[str, bool]] = []
def get_dot():
if self._dotall:

View File

@@ -185,6 +185,8 @@ else:
fout.add_description("two-tower CLIP model")
if has_text_encoder:
assert t_hparams is not None
assert tokens is not None
# text_model hparams
fout.add_uint32(k(KEY_CONTEXT_LENGTH, TEXT), t_hparams["max_position_embeddings"])
fout.add_uint32(k(KEY_EMBEDDING_LENGTH, TEXT), t_hparams["hidden_size"])
@@ -259,8 +261,8 @@ if has_vision_encoder:
if processor is not None:
image_mean = processor.image_processor.image_mean if args.image_mean is None or args.image_mean == default_image_mean else args.image_mean
image_std = processor.image_processor.image_std if args.image_std is None or args.image_std == default_image_std else args.image_std
image_mean = processor.image_processor.image_mean if args.image_mean is None or args.image_mean == default_image_mean else args.image_mean # pyright: ignore[reportAttributeAccessIssue]
image_std = processor.image_processor.image_std if args.image_std is None or args.image_std == default_image_std else args.image_std # pyright: ignore[reportAttributeAccessIssue]
else:
image_mean = args.image_mean if args.image_mean is not None else default_image_mean
image_std = args.image_std if args.image_std is not None else default_image_std
@@ -272,7 +274,7 @@ fout.add_bool("clip.use_gelu", use_gelu)
if has_llava_projector:
model.vision_model.encoder.layers.pop(-1)
model.vision_model.encoder.layers.pop(-1) # pyright: ignore[reportAttributeAccessIssue]
projector = torch.load(args.llava_projector)
for name, data in projector.items():
name = get_tensor_name(name)
@@ -286,7 +288,7 @@ if has_llava_projector:
print("Projector tensors added\n")
state_dict = model.state_dict()
state_dict = model.state_dict() # pyright: ignore[reportAttributeAccessIssue]
for name, data in state_dict.items():
if should_skip_tensor(name, has_text_encoder, has_vision_encoder, has_llava_projector):
# we don't need this

View File

@@ -2,7 +2,9 @@ import argparse
import glob
import os
import torch
from safetensors.torch import load as safe_load, save as safe_save, safe_open, save_file
from safetensors import safe_open
from safetensors.torch import save_file
from typing import Any, ContextManager, cast
# Function to determine if file is a SafeTensor file
def is_safetensor_file(file_path):
@@ -13,7 +15,7 @@ def is_safetensor_file(file_path):
def load_model(file_path):
if is_safetensor_file(file_path):
tensors = {}
with safe_open(file_path, framework="pt", device="cpu") as f:
with cast(ContextManager[Any], safe_open(file_path, framework="pt", device="cpu")) as f:
for key in f.keys():
tensors[key] = f.get_tensor(key).clone()
# output shape
@@ -134,7 +136,7 @@ if len(mm_tensors) == 0:
if last_checkpoint is not None:
for k, v in last_checkpoint.items():
print(k)
print(f"Found {len(mm_tensors)} tensors to extract out of {len(last_checkpoint)} tensors.")
print(f"Found {len(mm_tensors)} tensors to extract out of {len(last_checkpoint) if last_checkpoint is not None else 0} tensors.")
print("No tensors found. Is this a LLaVA model?")
exit()
@@ -143,8 +145,10 @@ print(f"Found additional {len(first_mm_tensors)} tensors to extract.")
# projector = {name: checkpoint.[name].float() for name in mm_tensors}
projector = {}
for name in mm_tensors:
assert last_checkpoint is not None
projector[name] = last_checkpoint[name].float()
for name in first_mm_tensors:
assert first_checkpoint is not None
projector[name] = first_checkpoint[name].float()
if len(projector) > 0:

View File

@@ -1,6 +1,6 @@
# llama.cpp/examples/main
This example program allows you to use various LLaMA language models in an easy and efficient way. It is specifically designed to work with the [llama.cpp](https://github.com/ggerganov/llama.cpp) project, which provides a plain C/C++ implementation with optional 4-bit quantization support for faster, lower memory inference, and is optimized for desktop CPUs. This program can be used to perform various inference tasks with LLaMA models, including generating text based on user-provided prompts and chat-like interactions with reverse prompts.
This example program allows you to use various LLaMA language models easily and efficiently. It is specifically designed to work with the [llama.cpp](https://github.com/ggerganov/llama.cpp) project, which provides a plain C/C++ implementation with optional 4-bit quantization support for faster, lower memory inference, and is optimized for desktop CPUs. This program can be used to perform various inference tasks with LLaMA models, including generating text based on user-provided prompts and chat-like interactions with reverse prompts.
## Table of Contents
@@ -17,60 +17,59 @@ This example program allows you to use various LLaMA language models in an easy
To get started right away, run the following command, making sure to use the correct path for the model you have:
#### Unix-based systems (Linux, macOS, etc.):
First, we will need to download a model. In these examples, we will use the Gemma model from the ggml-org repo on Hugging Face.
[https://huggingface.co/ggml-org/gemma-1.1-7b-it-Q4_K_M-GGUF/resolve/main/gemma-1.1-7b-it.Q4_K_M.gguf?download=true](https://huggingface.co/ggml-org/gemma-1.1-7b-it-Q4_K_M-GGUF/resolve/main/gemma-1.1-7b-it.Q4_K_M.gguf?download=true)
Once downloaded, place your model in the models folder in llama.cpp.
### Unix-based systems (Linux, macOS, etc.):
##### Input prompt (One-and-done)
```bash
./llama-cli -m models/7B/ggml-model.bin --prompt "Once upon a time"
./llama-cli -m models/gemma-1.1-7b-it.Q4_K_M.gguf --prompt "Once upon a time"
```
#### Windows:
```powershell
llama-cli.exe -m models\7B\ggml-model.bin --prompt "Once upon a time"
```
For an interactive experience, try this command:
#### Unix-based systems (Linux, macOS, etc.):
##### Conversation mode (Allow for continuous interaction with the model)
```bash
./llama-cli -m models/7B/ggml-model.bin -n -1 --color -r "User:" --in-prefix " " -i -p \
'User: Hi
AI: Hello. I am an AI chatbot. Would you like to talk?
User: Sure!
AI: What would you like to talk about?
User:'
./llama-cli -m models/gemma-1.1-7b-it.Q4_K_M.gguf -cnv --chat-template gemma
```
#### Windows:
```powershell
llama-cli.exe -m models\7B\ggml-model.bin -n -1 --color -r "User:" --in-prefix " " -i -e -p "User: Hi\nAI: Hello. I am an AI chatbot. Would you like to talk?\nUser: Sure!\nAI: What would you like to talk about?\nUser:"
```
The following command generates "infinite" text from a starting prompt (you can use `Ctrl-C` to stop it):
#### Unix-based systems (Linux, macOS, etc.):
##### Infinite text from a starting prompt (you can use `Ctrl-C` to stop it):
```bash
./llama-cli -m models/7B/ggml-model.bin --ignore-eos -n -1
./llama-cli -m models\gemma-1.1-7b-it.Q4_K_M.gguf --ignore-eos -n -1
```
#### Windows:
### Windows:
##### Input prompt (One-and-done)
```powershell
./llama-cli.exe -m models\gemma-1.1-7b-it.Q4_K_M.gguf --prompt "Once upon a time"
```
##### Conversation mode (Allow for continuous interaction with the model)
```powershell
llama-cli.exe -m models\7B\ggml-model.bin --ignore-eos -n -1
./llama-cli.exe -m models\gemma-1.1-7b-it.Q4_K_M.gguf -cnv --chat-template gemma
```
#### Infinite text from a starting prompt (you can use `Ctrl-C` to stop it):
```powershell
llama-cli.exe -m models\gemma-1.1-7b-it.Q4_K_M.gguf --ignore-eos -n -1
```
## Common Options
In this section, we cover the most commonly used options for running the `llama-cli` program with the LLaMA models:
- `-m FNAME, --model FNAME`: Specify the path to the LLaMA model file (e.g., `models/7B/ggml-model.gguf`; inferred from `--model-url` if set).
- `-mu MODEL_URL --model-url MODEL_URL`: Specify a remote http url to download the file (e.g https://huggingface.co/ggml-org/models/resolve/main/phi-2/ggml-model-q4_0.gguf).
- `-m FNAME, --model FNAME`: Specify the path to the LLaMA model file (e.g., `models/gemma-1.1-7b-it.Q4_K_M.gguf`; inferred from `--model-url` if set).
- `-mu MODEL_URL --model-url MODEL_URL`: Specify a remote http url to download the file (e.g [https://huggingface.co/ggml-org/gemma-1.1-7b-it-Q4_K_M-GGUF/resolve/main/gemma-1.1-7b-it.Q4_K_M.gguf?download=true](https://huggingface.co/ggml-org/gemma-1.1-7b-it-Q4_K_M-GGUF/resolve/main/gemma-1.1-7b-it.Q4_K_M.gguf?download=true)).
- `-i, --interactive`: Run the program in interactive mode, allowing you to provide input directly and receive real-time responses.
- `-n N, --n-predict N`: Set the number of tokens to predict when generating text. Adjusting this value can influence the length of the generated text.
- `-c N, --ctx-size N`: Set the size of the prompt context. The default is 512, but LLaMA models were built with a context of 2048, which will provide better results for longer input/inference.
- `-mli, --multiline-input`: Allows you to write or paste multiple lines without ending each in '\'
- `-t N, --threads N`: Set the number of threads to use during generation. For optimal performance, it is recommended to set this value to the number of physical CPU cores your system has.
- - `-ngl N, --n-gpu-layers N`: When compiled with GPU support, this option allows offloading some layers to the GPU for computation. Generally results in increased performance.
## Input Prompts
@@ -90,6 +89,7 @@ In interactive mode, users can participate in text generation by injecting their
- `-i, --interactive`: Run the program in interactive mode, allowing users to engage in real-time conversations or provide specific instructions to the model.
- `--interactive-first`: Run the program in interactive mode and immediately wait for user input before starting the text generation.
- `-cnv, --conversation`: Run the program in conversation mode (does not print special tokens and suffix/prefix, use default chat template) (default: false)
- `--color`: Enable colorized output to differentiate visually distinguishing between prompts, user input, and generated text.
By understanding and utilizing these interaction options, you can create engaging and dynamic experiences with the LLaMA models, tailoring the text generation process to your specific needs.
@@ -117,6 +117,13 @@ The `--in-suffix` flag is used to add a suffix after your input. This is useful
```sh
./llama-cli -r "User:" --in-prefix " " --in-suffix "Assistant:"
```
When --in-prefix or --in-suffix options are enabled the chat template ( --chat-template ) is disabled
### Chat templates
`--chat-template JINJA_TEMPLATE`: This option sets a custom jinja chat template. It accepts a string, not a file name. Default: template taken from model's metadata. Llama.cpp only supports [some pre-defined templates](https://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template). These include llama2, llama3, gemma, monarch, chatml, orion, vicuna, vicuna-orca, deepseek, command-r, zephyr. When --in-prefix or --in-suffix options are enabled the chat template ( --chat-template ) is disabled.
Example usage: `--chat-template gemma`
## Context Management
@@ -124,9 +131,7 @@ During text generation, LLaMA models have a limited context size, which means th
### Context Size
The `--ctx-size` option allows you to set the size of the prompt context used by the LLaMA models during text generation. A larger context size helps the model to better comprehend and generate responses for longer input or conversations.
- `-c N, --ctx-size N`: Set the size of the prompt context (default: 512). The LLaMA models were built with a context of 2048, which will yield the best results on longer input/inference. However, increasing the context size beyond 2048 may lead to unpredictable results.
- `-c N, --ctx-size N`: Set the size of the prompt context (default: 0, 0 = loaded from model). The LLaMA models were built with a context of 2048-8192, which will yield the best results on longer input/inference.
### Extended Context Size
@@ -148,15 +153,15 @@ The following options allow you to control the text generation process and fine-
### Number of Tokens to Predict
- `-n N, --n-predict N`: Set the number of tokens to predict when generating text (default: 128, -1 = infinity, -2 = until context filled)
- `-n N, --predict N`: Set the number of tokens to predict when generating text (default: -1, -1 = infinity, -2 = until context filled)
The `--n-predict` option controls the number of tokens the model generates in response to the input prompt. By adjusting this value, you can influence the length of the generated text. A higher value will result in longer text, while a lower value will produce shorter text.
The `--predict` option controls the number of tokens the model generates in response to the input prompt. By adjusting this value, you can influence the length of the generated text. A higher value will result in longer text, while a lower value will produce shorter text.
A value of -1 will enable infinite text generation, even though we have a finite context window. When the context window is full, some of the earlier tokens (half of the tokens after `--n-keep`) will be discarded. The context must then be re-evaluated before generation can resume. On large models and/or large context windows, this will result in significant pause in output.
A value of -1 will enable infinite text generation, even though we have a finite context window. When the context window is full, some of the earlier tokens (half of the tokens after `--keep`) will be discarded. The context must then be re-evaluated before generation can resume. On large models and/or large context windows, this will result in a significant pause in output.
If the pause is undesirable, a value of -2 will stop generation immediately when the context is filled.
It is important to note that the generated text may be shorter than the specified number of tokens if an End-of-Sequence (EOS) token or a reverse prompt is encountered. In interactive mode text generation will pause and control will be returned to the user. In non-interactive mode, the program will end. In both cases, the text generation may stop before reaching the specified `n-predict` value. If you want the model to keep going without ever producing End-of-Sequence on its own, you can use the `--ignore-eos` parameter.
It is important to note that the generated text may be shorter than the specified number of tokens if an End-of-Sequence (EOS) token or a reverse prompt is encountered. In interactive mode, text generation will pause and control will be returned to the user. In non-interactive mode, the program will end. In both cases, the text generation may stop before reaching the specified `--predict` value. If you want the model to keep going without ever producing End-of-Sequence on its own, you can use the `--ignore-eos` parameter.
### Temperature
@@ -164,15 +169,15 @@ It is important to note that the generated text may be shorter than the specifie
Temperature is a hyperparameter that controls the randomness of the generated text. It affects the probability distribution of the model's output tokens. A higher temperature (e.g., 1.5) makes the output more random and creative, while a lower temperature (e.g., 0.5) makes the output more focused, deterministic, and conservative. The default value is 0.8, which provides a balance between randomness and determinism. At the extreme, a temperature of 0 will always pick the most likely next token, leading to identical outputs in each run.
Example usage: `--temp 0.5`
Example usage: `--temp 0`
### Repeat Penalty
- `--repeat-penalty N`: Control the repetition of token sequences in the generated text (default: 1.1).
- `--repeat-penalty N`: Control the repetition of token sequences in the generated text default: 1.0, 1.0 = disabled).
- `--repeat-last-n N`: Last n tokens to consider for penalizing repetition (default: 64, 0 = disabled, -1 = ctx-size).
- `--no-penalize-nl`: Disable penalization for newline tokens when applying the repeat penalty.
The `repeat-penalty` option helps prevent the model from generating repetitive or monotonous text. A higher value (e.g., 1.5) will penalize repetitions more strongly, while a lower value (e.g., 0.9) will be more lenient. The default value is 1.1.
The `repeat-penalty` option helps prevent the model from generating repetitive or monotonous text. A higher value (e.g., 1.5) will penalize repetitions more strongly, while a lower value (e.g., 0.9) will be more lenient. The default value is 1.
The `repeat-last-n` option controls the number of tokens in the history to consider for penalizing repetition. A larger value will look further back in the generated text to prevent repetitions, while a smaller value will only consider recent tokens. A value of 0 disables the penalty, and a value of -1 sets the number of tokens considered equal to the context size (`ctx-size`).
@@ -196,19 +201,19 @@ Top-p sampling, also known as nucleus sampling, is another text generation metho
Example usage: `--top-p 0.95`
### Min P Sampling
### Min-P Sampling
- `--min-p N`: Sets a minimum base probability threshold for token selection (default: 0.05).
- `--min-p N`: Sets a minimum base probability threshold for token selection (default: 0.1).
The Min-P sampling method was designed as an alternative to Top-P, and aims to ensure a balance of quality and variety. The parameter *p* represents the minimum probability for a token to be considered, relative to the probability of the most likely token. For example, with *p*=0.05 and the most likely token having a probability of 0.9, logits with a value less than 0.045 are filtered out.
Example usage: `--min-p 0.05`
### Tail Free Sampling (TFS)
### Tail-Free Sampling (TFS)
- `--tfs N`: Enable tail free sampling with parameter z (default: 1.0, 1.0 = disabled).
Tail free sampling (TFS) is a text generation technique that aims to reduce the impact of less likely tokens, which may be less relevant, less coherent, or nonsensical, on the output. Similar to Top-P it tries to determine the bulk of the most likely tokens dynamically. But TFS filters out logits based on the second derivative of their probabilities. Adding tokens is stopped after the sum of the second derivatives reaches the parameter z. In short: TFS looks how quickly the probabilities of the tokens decrease and cuts off the tail of unlikely tokens using the parameter z. Typical values for z are in the range of 0.9 to 0.95. A value of 1.0 would include all tokens, and thus disables the effect of TFS.
Tail-free sampling (TFS) is a text generation technique that aims to reduce the impact of less likely tokens, which may be less relevant, less coherent, or nonsensical, on the output. Similar to Top-P it tries to determine the bulk of the most likely tokens dynamically. But TFS filters out logits based on the second derivative of their probabilities. Adding tokens is stopped after the sum of the second derivatives reaches the parameter z. In short: TFS looks at how quickly the probabilities of the tokens decrease and cuts off the tail of unlikely tokens using the parameter z. Typical values for z are in the range of 0.9 to 0.95. A value of 1.0 would include all tokens and thus disables the effect of TFS.
Example usage: `--tfs 0.95`
@@ -307,10 +312,8 @@ These options provide extra functionality and customization when running the LLa
- `-h, --help`: Display a help message showing all available options and their default values. This is particularly useful for checking the latest options and default values, as they can change frequently, and the information in this document may become outdated.
- `--verbose-prompt`: Print the prompt before generating text.
- `-ngl N, --n-gpu-layers N`: When compiled with GPU support, this option allows offloading some layers to the GPU for computation. Generally results in increased performance.
- `-mg i, --main-gpu i`: When using multiple GPUs this option controls which GPU is used for small tensors for which the overhead of splitting the computation across all GPUs is not worthwhile. The GPU in question will use slightly more VRAM to store a scratch buffer for temporary results. By default GPU 0 is used.
- `-ts SPLIT, --tensor-split SPLIT`: When using multiple GPUs this option controls how large tensors should be split across all GPUs. `SPLIT` is a comma-separated list of non-negative values that assigns the proportion of data that each GPU should get in order. For example, "3,2" will assign 60% of the data to GPU 0 and 40% to GPU 1. By default the data is split in proportion to VRAM but this may not be optimal for performance.
- `--lora FNAME`: Apply a LoRA (Low-Rank Adaptation) adapter to the model (implies --no-mmap). This allows you to adapt the pretrained model to specific tasks or domains.
- `--lora-base FNAME`: Optional model to use as a base for the layers modified by the LoRA adapter. This flag is used in conjunction with the `--lora` flag, and specifies the base model for the adaptation.
- `-hfr URL --hf-repo URL`: The url to the Hugging Face model repository. Used in conjunction with `--hf-file` or `-hff`. The model is downloaded and stored in the file provided by `-m` or `--model`. If `-m` is not provided, the model is auto-stored in the path specified by the `LLAMA_CACHE` environment variable or in an OS-specific local cache.

View File

@@ -6,10 +6,10 @@ import re
from copy import copy
from enum import Enum
from inspect import getdoc, isclass
from typing import TYPE_CHECKING, Any, Callable, List, Optional, Union, get_args, get_origin, get_type_hints
from typing import TYPE_CHECKING, Any, Callable, List, Optional, Union, get_args, get_origin
from docstring_parser import parse
from pydantic import BaseModel, Field, create_model
from pydantic import BaseModel, create_model
if TYPE_CHECKING:
from types import GenericAlias
@@ -17,6 +17,9 @@ else:
# python 3.8 compat
from typing import _GenericAlias as GenericAlias
# TODO: fix this
# pyright: reportAttributeAccessIssue=information
class PydanticDataType(Enum):
"""
@@ -234,8 +237,9 @@ def generate_gbnf_float_rules(max_digit=None, min_digit=None, max_precision=None
# Define the integer part rule
integer_part_rule = (
"integer-part" + (f"-max{max_digit}" if max_digit is not None else "") + (
f"-min{min_digit}" if min_digit is not None else "")
"integer-part"
+ (f"-max{max_digit}" if max_digit is not None else "")
+ (f"-min{min_digit}" if min_digit is not None else "")
)
# Define the fractional part rule based on precision constraints
@@ -458,7 +462,7 @@ def generate_gbnf_grammar(model: type[BaseModel], processed_models: set[type[Bas
if not issubclass(model, BaseModel):
# For non-Pydantic classes, generate model_fields from __annotations__ or __init__
if hasattr(model, "__annotations__") and model.__annotations__:
model_fields = {name: (typ, ...) for name, typ in model.__annotations__.items()}
model_fields = {name: (typ, ...) for name, typ in model.__annotations__.items()} # pyright: ignore[reportGeneralTypeIssues]
else:
init_signature = inspect.signature(model.__init__)
parameters = init_signature.parameters
@@ -680,7 +684,7 @@ def generate_markdown_documentation(
str: Generated text documentation.
"""
documentation = ""
pyd_models = [(model, True) for model in pydantic_models]
pyd_models: list[tuple[type[BaseModel], bool]] = [(model, True) for model in pydantic_models]
for model, add_prefix in pyd_models:
if add_prefix:
documentation += f"{model_prefix}: {model.__name__}\n"
@@ -700,7 +704,7 @@ def generate_markdown_documentation(
# Indenting the fields section
documentation += f" {fields_prefix}:\n"
else:
documentation += f" Fields:\n"
documentation += f" Fields:\n" # noqa: F541
if isclass(model) and issubclass(model, BaseModel):
for name, field_type in model.__annotations__.items():
# if name == "markdown_code_block":
@@ -778,7 +782,7 @@ def generate_field_markdown(
return field_text
if field_description != "":
field_text += f" Description: " + field_description + "\n"
field_text += f" Description: {field_description}\n"
# Check for and include field-specific examples if available
if hasattr(model, "Config") and hasattr(model.Config,
@@ -833,7 +837,7 @@ def generate_text_documentation(
str: Generated text documentation.
"""
documentation = ""
pyd_models = [(model, True) for model in pydantic_models]
pyd_models: list[tuple[type[BaseModel], bool]] = [(model, True) for model in pydantic_models]
for model, add_prefix in pyd_models:
if add_prefix:
documentation += f"{model_prefix}: {model.__name__}\n"
@@ -1164,7 +1168,7 @@ def create_dynamic_model_from_function(func: Callable[..., Any]):
dynamic_fields[param.name] = (
param.annotation if param.annotation != inspect.Parameter.empty else str, default_value)
# Creating the dynamic model
dynamic_model = create_model(f"{func.__name__}", **dynamic_fields) # type: ignore[call-overload]
dynamic_model = create_model(f"{func.__name__}", **dynamic_fields)
for name, param_doc in param_docs:
dynamic_model.model_fields[name].description = param_doc.description
@@ -1228,9 +1232,6 @@ def map_grammar_names_to_pydantic_model_class(pydantic_model_list):
return output
from enum import Enum
def json_schema_to_python_types(schema):
type_map = {
"any": Any,
@@ -1275,7 +1276,7 @@ def convert_dictionary_to_pydantic_model(dictionary: dict[str, Any], model_name:
if items != {}:
array = {"properties": items}
array_type = convert_dictionary_to_pydantic_model(array, f"{model_name}_{field_name}_items")
fields[field_name] = (List[array_type], ...) # type: ignore[valid-type]
fields[field_name] = (List[array_type], ...)
else:
fields[field_name] = (list, ...)
elif field_type == "object":
@@ -1285,7 +1286,8 @@ def convert_dictionary_to_pydantic_model(dictionary: dict[str, Any], model_name:
required = field_data.get("enum", [])
for key, field in fields.items():
if key not in required:
fields[key] = (Optional[fields[key][0]], ...)
optional_type = fields[key][0]
fields[key] = (Optional[optional_type], ...)
else:
field_type = json_schema_to_python_types(field_type)
fields[field_name] = (field_type, ...)
@@ -1305,6 +1307,7 @@ def convert_dictionary_to_pydantic_model(dictionary: dict[str, Any], model_name:
required = dictionary.get("required", [])
for key, field in fields.items():
if key not in required:
fields[key] = (Optional[fields[key][0]], ...)
optional_type = fields[key][0]
fields[key] = (Optional[optional_type], ...)
custom_model = create_model(model_name, **fields)
return custom_model

View File

@@ -1,6 +1,7 @@
# Function calling example using pydantic models.
from __future__ import annotations
import datetime
import importlib
import json
from enum import Enum
from typing import Optional, Union
@@ -215,9 +216,9 @@ for call in json_data:
if call["function"] == "Calculator":
print(Calculator(**call["params"]).run())
elif call["function"] == "get_current_datetime":
print(current_datetime_model(**call["params"]).run())
print(current_datetime_model(**call["params"]).run()) # pyright: ignore[reportAttributeAccessIssue]
elif call["function"] == "get_current_weather":
print(current_weather_tool_model(**call["params"]).run())
print(current_weather_tool_model(**call["params"]).run()) # pyright: ignore[reportAttributeAccessIssue]
# Should output something like this:
# 2024-01-14 13:36:06
# {"location": "London", "temperature": "42", "unit": "celsius"}

View File

@@ -1,3 +1,5 @@
from __future__ import annotations
import argparse
import json
import os
@@ -59,10 +61,11 @@ def main(args_in: list[str] | None = None) -> None:
sys.exit(1)
# start the benchmark
iterations = 0
data = {}
try:
start_benchmark(args)
iterations = 0
with open("results.github.env", 'w') as github_env:
# parse output
with open('k6-results.json', 'r') as bench_results:
@@ -129,7 +132,7 @@ def main(args_in: list[str] | None = None) -> None:
timestamps, metric_values = zip(*values)
metric_values = [float(value) for value in metric_values]
prometheus_metrics[metric] = metric_values
timestamps_dt = [datetime.fromtimestamp(int(ts)) for ts in timestamps]
timestamps_dt = [str(datetime.fromtimestamp(int(ts))) for ts in timestamps]
plt.figure(figsize=(16, 10), dpi=80)
plt.plot(timestamps_dt, metric_values, label=metric)
plt.xticks(rotation=0, fontsize=14, horizontalalignment='center', alpha=.7)
@@ -156,7 +159,7 @@ def main(args_in: list[str] | None = None) -> None:
plt.close()
# Mermaid format in case images upload failed
with (open(f"{metric}.mermaid", 'w') as mermaid_f):
with open(f"{metric}.mermaid", 'w') as mermaid_f:
mermaid = (
f"""---
config:
@@ -278,7 +281,7 @@ def start_server_background(args):
}
server_process = subprocess.Popen(
args,
**pkwargs)
**pkwargs) # pyright: ignore[reportArgumentType, reportCallIssue]
def server_log(in_stream, out_stream):
for line in iter(in_stream.readline, b''):

View File

@@ -1,5 +1,4 @@
import asyncio
import collections
import json
import os
import re
@@ -8,19 +7,23 @@ import subprocess
import sys
import threading
import time
from collections.abc import Sequence
from contextlib import closing
from re import RegexFlag
from typing import Any, Literal, cast
import aiohttp
import numpy as np
import openai
from behave import step
from openai.types.chat import ChatCompletionChunk
from behave import step # pyright: ignore[reportAttributeAccessIssue]
from behave.api.async_step import async_run_until_complete
from prometheus_client import parser
# pyright: reportRedeclaration=false
@step("a server listening on {server_fqdn}:{server_port}")
def step_server_config(context, server_fqdn, server_port):
def step_server_config(context, server_fqdn: str, server_port: str):
context.server_fqdn = server_fqdn
context.server_port = int(server_port)
context.n_threads = None
@@ -74,34 +77,34 @@ def step_server_config(context, server_fqdn, server_port):
@step('a model file {hf_file} from HF repo {hf_repo}')
def step_download_hf_model(context, hf_file, hf_repo):
def step_download_hf_model(context, hf_file: str, hf_repo: str):
context.model_hf_repo = hf_repo
context.model_hf_file = hf_file
context.model_file = os.path.basename(hf_file)
@step('a model file {model_file}')
def step_model_file(context, model_file):
def step_model_file(context, model_file: str):
context.model_file = model_file
@step('a model url {model_url}')
def step_model_url(context, model_url):
def step_model_url(context, model_url: str):
context.model_url = model_url
@step('a model alias {model_alias}')
def step_model_alias(context, model_alias):
def step_model_alias(context, model_alias: str):
context.model_alias = model_alias
@step('{seed:d} as server seed')
def step_seed(context, seed):
def step_seed(context, seed: int):
context.server_seed = seed
@step('{ngl:d} GPU offloaded layers')
def step_n_gpu_layer(context, ngl):
def step_n_gpu_layer(context, ngl: int):
if 'N_GPU_LAYERS' in os.environ:
new_ngl = int(os.environ['N_GPU_LAYERS'])
if context.debug:
@@ -111,37 +114,37 @@ def step_n_gpu_layer(context, ngl):
@step('{n_threads:d} threads')
def step_n_threads(context, n_threads):
def step_n_threads(context, n_threads: int):
context.n_thread = n_threads
@step('{draft:d} as draft')
def step_draft(context, draft):
def step_draft(context, draft: int):
context.draft = draft
@step('{n_ctx:d} KV cache size')
def step_n_ctx(context, n_ctx):
def step_n_ctx(context, n_ctx: int):
context.n_ctx = n_ctx
@step('{n_slots:d} slots')
def step_n_slots(context, n_slots):
def step_n_slots(context, n_slots: int):
context.n_slots = n_slots
@step('{n_predict:d} server max tokens to predict')
def step_server_n_predict(context, n_predict):
def step_server_n_predict(context, n_predict: int):
context.n_server_predict = n_predict
@step('{slot_save_path} as slot save path')
def step_slot_save_path(context, slot_save_path):
def step_slot_save_path(context, slot_save_path: str):
context.slot_save_path = slot_save_path
@step('using slot id {id_slot:d}')
def step_id_slot(context, id_slot):
def step_id_slot(context, id_slot: int):
context.id_slot = id_slot
@@ -191,7 +194,7 @@ def step_start_server(context):
@step("the server is {expecting_status}")
@async_run_until_complete
async def step_wait_for_the_server_to_be_started(context, expecting_status):
async def step_wait_for_the_server_to_be_started(context, expecting_status: Literal['healthy', 'ready', 'idle', 'busy'] | str):
match expecting_status:
case 'healthy':
await wait_for_health_status(context, context.base_url, 200, 'ok',
@@ -221,7 +224,7 @@ async def step_wait_for_the_server_to_be_started(context, expecting_status):
@step('all slots are {expected_slot_status_string}')
@async_run_until_complete
async def step_all_slots_status(context, expected_slot_status_string):
async def step_all_slots_status(context, expected_slot_status_string: Literal['idle', 'busy'] | str):
match expected_slot_status_string:
case 'idle':
expected_slot_status = 0
@@ -237,7 +240,7 @@ async def step_all_slots_status(context, expected_slot_status_string):
@step('a completion request with {api_error} api error')
@async_run_until_complete
async def step_request_completion(context, api_error):
async def step_request_completion(context, api_error: Literal['raised'] | str):
expect_api_error = api_error == 'raised'
seeds = await completions_seed(context, num_seeds=1)
completion = await request_completion(context.prompts.pop(),
@@ -777,8 +780,8 @@ def step_assert_metric_value(context, metric_name, metric_value):
def step_available_models(context):
# openai client always expects an api_key
openai.api_key = context.user_api_key if context.user_api_key is not None else 'nope'
openai.api_base = f'{context.base_url}/v1'
context.models = openai.Model.list().data
openai.base_url = f'{context.base_url}/v1/'
context.models = openai.models.list().data
@step('{n_model:d} models are supported')
@@ -789,7 +792,7 @@ def step_supported_models(context, n_model):
@step('model {i_model:d} is {param} {preposition} {param_value}')
def step_supported_models(context, i_model, param, preposition, param_value):
def step_supported_models(context, i_model: int, param: Literal['identified', 'trained'] | str, preposition: str, param_value: str):
assert i_model < len(context.models)
model = context.models[i_model]
@@ -798,7 +801,7 @@ def step_supported_models(context, i_model, param, preposition, param_value):
case 'identified':
value = model.id
case 'trained':
value = str(model.meta.n_ctx_train)
value = str(model.meta["n_ctx_train"])
case _:
assert False, "param {param} not supported"
assert param_value == value, f"model param {param} {value} != {param_value}"
@@ -810,6 +813,7 @@ async def concurrent_requests(context, f_completion, *args, **kwargs):
print(f"starting {context.n_prompts} concurrent completion requests...")
assert context.n_prompts > 0
seeds = await completions_seed(context)
assert seeds is not None
for prompt_no in range(context.n_prompts):
shifted_args = [context.prompts.pop(), seeds[prompt_no], *args]
context.concurrent_tasks.append(asyncio.create_task(f_completion(*shifted_args, **kwargs)))
@@ -861,7 +865,7 @@ async def request_completion(prompt,
id_slot=None,
expect_api_error=None,
user_api_key=None,
temperature=None):
temperature=None) -> int | dict[str, Any]:
if debug:
print(f"Sending completion request: {prompt}")
origin = "my.super.domain"
@@ -899,8 +903,8 @@ async def request_completion(prompt,
async def oai_chat_completions(user_prompt,
seed,
system_prompt,
base_url,
base_path,
base_url: str,
base_path: str,
async_client,
debug=False,
temperature=None,
@@ -909,7 +913,7 @@ async def oai_chat_completions(user_prompt,
enable_streaming=None,
response_format=None,
user_api_key=None,
expect_api_error=None):
expect_api_error=None) -> int | dict[str, Any]:
if debug:
print(f"Sending OAI Chat completions request: {user_prompt}")
# openai client always expects an api key
@@ -989,32 +993,35 @@ async def oai_chat_completions(user_prompt,
else:
try:
openai.api_key = user_api_key
openai.api_base = f'{base_url}{base_path}'
chat_completion = openai.Completion.create(
openai.base_url = f'{base_url}{base_path.removesuffix("chat")}'
assert model is not None
chat_completion = openai.chat.completions.create(
messages=payload['messages'],
model=model,
max_tokens=n_predict,
stream=enable_streaming,
response_format=payload.get('response_format'),
response_format=payload.get('response_format') or openai.NOT_GIVEN,
seed=seed,
temperature=payload['temperature']
)
except openai.error.AuthenticationError as e:
except openai.AuthenticationError as e:
if expect_api_error is not None and expect_api_error:
return 401
else:
assert False, f'error raised: {e}'
if enable_streaming:
chat_completion = cast(openai.Stream[ChatCompletionChunk], chat_completion)
for chunk in chat_completion:
assert len(chunk.choices) == 1
delta = chunk.choices[0].delta
if 'content' in delta:
completion_response['content'] += delta['content']
if delta.content is not None:
completion_response['content'] += delta.content
completion_response['timings']['predicted_n'] += 1
completion_response['truncated'] = chunk.choices[0].finish_reason != 'stop'
else:
assert len(chat_completion.choices) == 1
assert chat_completion.usage is not None
completion_response = {
'content': chat_completion.choices[0].message.content,
'timings': {
@@ -1028,7 +1035,7 @@ async def oai_chat_completions(user_prompt,
return completion_response
async def request_embedding(content, seed, base_url=None):
async def request_embedding(content, seed, base_url=None) -> list[list[float]]:
async with aiohttp.ClientSession() as session:
async with session.post(f'{base_url}/embedding',
json={
@@ -1041,7 +1048,7 @@ async def request_embedding(content, seed, base_url=None):
async def request_oai_embeddings(input, seed,
base_url=None, user_api_key=None,
model=None, async_client=False):
model=None, async_client=False) -> list[list[float]]:
# openai client always expects an api_key
user_api_key = user_api_key if user_api_key is not None else 'nope'
if async_client:
@@ -1063,7 +1070,7 @@ async def request_oai_embeddings(input, seed,
response_json = await response.json()
assert response_json['model'] == model, f"invalid model received: {response_json['model']}"
assert response_json['object'] == 'list'
if isinstance(input, collections.abc.Sequence):
if isinstance(input, Sequence):
embeddings = []
for an_oai_embeddings in response_json['data']:
embeddings.append(an_oai_embeddings['embedding'])
@@ -1072,19 +1079,14 @@ async def request_oai_embeddings(input, seed,
return embeddings
else:
openai.api_key = user_api_key
openai.api_base = f'{base_url}/v1'
oai_embeddings = openai.Embedding.create(
openai.base_url = f'{base_url}/v1/'
assert model is not None
oai_embeddings = openai.embeddings.create(
model=model,
input=input,
)
if isinstance(input, collections.abc.Sequence):
embeddings = []
for an_oai_embeddings in oai_embeddings.data:
embeddings.append(an_oai_embeddings.embedding)
else:
embeddings = [oai_embeddings.data.embedding]
return embeddings
return [e.embedding for e in oai_embeddings.data]
def assert_n_tokens_predicted(completion_response, expected_predicted_n=None, re_content=None):
@@ -1122,7 +1124,7 @@ def assert_all_predictions_equal(completion_responses):
if i == j:
continue
content_j = response_j['content']
assert content_i == content_j, "contents not equal"
assert content_i == content_j, "contents not equal"
def assert_all_predictions_different(completion_responses):
@@ -1136,7 +1138,7 @@ def assert_all_predictions_different(completion_responses):
if i == j:
continue
content_j = response_j['content']
assert content_i != content_j, "contents not different"
assert content_i != content_j, "contents not different"
def assert_all_token_probabilities_equal(completion_responses):
@@ -1153,7 +1155,7 @@ def assert_all_token_probabilities_equal(completion_responses):
if i == j:
continue
probs_j = response_j['completion_probabilities'][pos]['probs']
assert probs_i == probs_j, "contents not equal"
assert probs_i == probs_j, "contents not equal"
async def gather_tasks_results(context):
@@ -1343,7 +1345,7 @@ def start_server_background(context):
}
context.server_process = subprocess.Popen(
[str(arg) for arg in [context.server_path, *server_args]],
**pkwargs)
**pkwargs) # pyright: ignore[reportArgumentType, reportCallIssue]
def server_log(in_stream, out_stream):
for line in iter(in_stream.readline, b''):

View File

@@ -1,6 +1,6 @@
aiohttp~=3.9.3
behave~=1.2.6
huggingface_hub~=0.20.3
numpy~=1.24.4
openai~=0.25.0
numpy~=1.26.4
openai~=1.30.3
prometheus-client~=0.20.0

View File

@@ -1,13 +1,15 @@
import asyncio
import asyncio.threads
import requests
import numpy as np
n = 8
result = []
async def requests_post_async(*args, **kwargs):
return await asyncio.to_thread(requests.post, *args, **kwargs)
return await asyncio.threads.to_thread(requests.post, *args, **kwargs)
async def main():
model_url = "http://127.0.0.1:6900"

View File

@@ -66,7 +66,7 @@ class Tensor:
if len(self.ne) == 0:
self.nbytes = 0
else:
self.nbytes = int(np.product(self.ne)) * 4
self.nbytes = int(np.prod(self.ne)) * 4
else:
raise ValueError(f"Unhandled data type '{self.dtype}'")

20
flake.lock generated
View File

@@ -5,11 +5,11 @@
"nixpkgs-lib": "nixpkgs-lib"
},
"locked": {
"lastModified": 1717285511,
"narHash": "sha256-iKzJcpdXih14qYVcZ9QC9XuZYnPc6T8YImb6dX166kw=",
"lastModified": 1719994518,
"narHash": "sha256-pQMhCCHyQGRzdfAkdJ4cIWiw+JNuWsTX7f0ZYSyz0VY=",
"owner": "hercules-ci",
"repo": "flake-parts",
"rev": "2a55567fcf15b1b1c7ed712a2c6fadaec7412ea8",
"rev": "9227223f6d922fee3c7b190b2cc238a99527bbb7",
"type": "github"
},
"original": {
@@ -20,11 +20,11 @@
},
"nixpkgs": {
"locked": {
"lastModified": 1719506693,
"narHash": "sha256-C8e9S7RzshSdHB7L+v9I51af1gDM5unhJ2xO1ywxNH8=",
"lastModified": 1720031269,
"narHash": "sha256-rwz8NJZV+387rnWpTYcXaRNvzUSnnF9aHONoJIYmiUQ=",
"owner": "NixOS",
"repo": "nixpkgs",
"rev": "b2852eb9365c6de48ffb0dc2c9562591f652242a",
"rev": "9f4128e00b0ae8ec65918efeba59db998750ead6",
"type": "github"
},
"original": {
@@ -36,14 +36,14 @@
},
"nixpkgs-lib": {
"locked": {
"lastModified": 1717284937,
"narHash": "sha256-lIbdfCsf8LMFloheeE6N31+BMIeixqyQWbSr2vk79EQ=",
"lastModified": 1719876945,
"narHash": "sha256-Fm2rDDs86sHy0/1jxTOKB1118Q0O3Uc7EC0iXvXKpbI=",
"type": "tarball",
"url": "https://github.com/NixOS/nixpkgs/archive/eb9ceca17df2ea50a250b6b27f7bf6ab0186f198.tar.gz"
"url": "https://github.com/NixOS/nixpkgs/archive/5daf0514482af3f97abaefc78a6606365c9108e2.tar.gz"
},
"original": {
"type": "tarball",
"url": "https://github.com/NixOS/nixpkgs/archive/eb9ceca17df2ea50a250b6b27f7bf6ab0186f198.tar.gz"
"url": "https://github.com/NixOS/nixpkgs/archive/5daf0514482af3f97abaefc78a6606365c9108e2.tar.gz"
}
},
"root": {

View File

@@ -99,6 +99,8 @@ async def main():
tasks = []
base_dict = {"FLOAT_TYPE": "float"}
for fp16 in (False, True):
# MUL_MAT
matmul_shaders(tasks, fp16, False)
@@ -106,8 +108,6 @@ async def main():
matmul_shaders(tasks, fp16, True)
for tname in type_names:
base_dict = {"FLOAT_TYPE": "float"}
# mul mat vec
data_a_key = f"DATA_A_{tname.upper()}"
shader = f"mul_mat_vec_{tname}.comp" if tname.endswith("_k") else "mul_mat_vec.comp"

View File

@@ -29,6 +29,7 @@
#include "ggml-cuda/tsembd.cuh"
#include "ggml-cuda/unary.cuh"
#include "ggml-cuda/upscale.cuh"
#include "ggml-cuda/conv-transpose-1d.cuh"
#include <algorithm>
#include <array>
@@ -2261,6 +2262,9 @@ static bool ggml_cuda_compute_forward(ggml_backend_cuda_context & ctx, struct gg
case GGML_OP_IM2COL:
ggml_cuda_op_im2col(ctx, dst);
break;
case GGML_OP_CONV_TRANSPOSE_1D:
ggml_cuda_op_conv_transpose_1d(ctx,dst);
break;
case GGML_OP_POOL_2D:
ggml_cuda_op_pool2d(ctx, dst);
break;
@@ -2804,6 +2808,15 @@ GGML_CALL static bool ggml_backend_cuda_supports_op(ggml_backend_t backend, cons
ggml_type src0_type = op->src[0]->type;
return src0_type != GGML_TYPE_I32 && src0_type != GGML_TYPE_I16;
} break;
case GGML_OP_CONV_TRANSPOSE_1D:
{
ggml_type src0_type = op->src[0]->type;
ggml_type src1_type = op->src[1]->type;
if (src0_type == GGML_TYPE_F32 && src1_type == GGML_TYPE_F32) {
return true;
}
return false;
} break;
case GGML_OP_NONE:
case GGML_OP_RESHAPE:
case GGML_OP_VIEW:

View File

@@ -0,0 +1,87 @@
#include "conv-transpose-1d.cuh"
static __global__ void conv_transpose_1d_kernel(
const int s0, const int p0, const int d0, const int output_size,
const int src0_ne0, const int src0_ne1, const int src0_ne2, const int src0_ne3,
const int src1_ne0, const int src1_ne1, const int src1_ne2, const int src1_ne3,
const int dst_ne0, const int dst_ne1, const int dst_ne2, const int dst_ne3,
const float * src0, const float * src1, float * dst) {
int global_index = threadIdx.x + blockIdx.x * blockDim.x;
if (global_index >= output_size) {
return;
}
int out_index = global_index / dst_ne0;
float accumulator = 0;
for (int c = 0; c < src0_ne2; c++) {
int idx = global_index % dst_ne0;
int kernel_offset = (src0_ne0 * src0_ne1 * c) + (out_index * src0_ne0);
int input_offset = src1_ne0 * c;
for (int i = 0; i < src1_ne0; i++) {
if (!(idx >= i*s0 && idx < i*s0 + src0_ne0)) {
continue;
}
int weight_idx = idx - i*s0;
float kernel_weight = src0[kernel_offset + weight_idx];
float input_value = src1[input_offset+i];
accumulator += kernel_weight * input_value;
}
}
dst[global_index] = accumulator;
}
static void conv_transpose_1d_f32_f32_cuda(
const int s0, const int p0, const int d0, const int output_size,
const int src0_ne0, const int src0_ne1, const int src0_ne2, const int src0_ne3,
const int src1_ne0, const int src1_ne1, const int src1_ne2, const int src1_ne3,
const int dst_ne0, const int dst_ne1, const int dst_ne2, const int dst_ne3,
const float * src0, const float * src1, float * dst,
cudaStream_t stream) {
const int num_blocks = (output_size + CUDA_CONV_TRANPOSE_1D_BLOCK_SIZE - 1) / CUDA_CONV_TRANPOSE_1D_BLOCK_SIZE;
conv_transpose_1d_kernel<<<num_blocks,CUDA_CONV_TRANPOSE_1D_BLOCK_SIZE, 0, stream>>>(
s0,p0,d0,output_size,
src0_ne0, src0_ne1, src0_ne2, src0_ne3,
src1_ne0, src1_ne1, src1_ne2, src1_ne3,
dst_ne0, dst_ne1, dst_ne2, dst_ne3,
src0,src1, dst);
}
void ggml_cuda_op_conv_transpose_1d(ggml_backend_cuda_context & ctx, ggml_tensor * dst) {
const ggml_tensor * src0 = dst->src[0];
const float * src0_d = (const float *)src0->data;
const ggml_tensor * src1 = dst->src[1];
const float * src1_d = (const float *)src1->data;
float * dst_d = (float *)dst->data;
cudaStream_t stream = ctx.stream();
GGML_ASSERT(src0->type == GGML_TYPE_F32);
GGML_ASSERT( dst->type == GGML_TYPE_F32);
GGML_ASSERT(ggml_is_contiguous(src0));
GGML_ASSERT(ggml_is_contiguous(src1));
const int32_t * opts = (const int32_t *)dst->op_params;
const int s0 = opts[0];
const int p0 = 0;//opts[3];
const int d0 = 1;//opts[4];
const int64_t kernel_size = ggml_nelements(src0);
const int64_t input_size = ggml_nelements(src1);
const int64_t output_size = ggml_nelements(dst);
conv_transpose_1d_f32_f32_cuda(s0, p0, d0, output_size,
src0->ne[0], src0->ne[1], src0->ne[2], src0->ne[3],
src1->ne[0], src1->ne[1], src1->ne[2], src1->ne[3],
dst->ne[0], dst->ne[1], dst->ne[2], dst->ne[3],
src0_d, src1_d, dst_d, stream);
}

View File

@@ -0,0 +1,5 @@
#include "common.cuh"
#define CUDA_CONV_TRANPOSE_1D_BLOCK_SIZE 256
void ggml_cuda_op_conv_transpose_1d(ggml_backend_cuda_context & ctx, ggml_tensor * dst);

View File

@@ -55,7 +55,7 @@ static void rope_norm(
const int i = row*ne0 + i0;
const int i2 = row/p_delta_rows;
const float theta_base = pos[i2]*powf(theta_scale, i0/2.0f);
const float theta_base = pos[i2] * sycl::pow(theta_scale, i0 / 2.0f);
const float freq_factor = has_ff ? freq_factors[i0/2] : 1.0f;
@@ -98,7 +98,7 @@ static void rope_neox(
const int i = row*ne0 + i0/2;
const int i2 = row/p_delta_rows;
const float theta_base = pos[i2]*powf(theta_scale, i0/2.0f);
const float theta_base = pos[i2] * sycl::pow(theta_scale, i0 / 2.0f);
const float freq_factor = has_ff ? freq_factors[i0/2] : 1.0f;

View File

@@ -67,7 +67,7 @@ class ReaderTensor(NamedTuple):
class GGUFReader:
# I - same as host, S - swapped
byte_order: Literal['I'] | Literal['S'] = 'I'
byte_order: Literal['I', 'S'] = 'I'
alignment: int = GGUF_DEFAULT_ALIGNMENT
data_offset: int
@@ -86,7 +86,7 @@ class GGUFReader:
GGUFValueType.BOOL: np.bool_,
}
def __init__(self, path: os.PathLike[str] | str, mode: Literal['r'] | Literal['r+'] | Literal['c'] = 'r'):
def __init__(self, path: os.PathLike[str] | str, mode: Literal['r', 'r+', 'c'] = 'r'):
self.data = np.memmap(path, mode = mode)
offs = 0
@@ -140,7 +140,7 @@ class GGUFReader:
return self.tensors[idx]
def _get(
self, offset: int, dtype: npt.DTypeLike, count: int = 1, override_order: None | Literal['I'] | Literal['S'] | Literal['<'] = None,
self, offset: int, dtype: npt.DTypeLike, count: int = 1, override_order: None | Literal['I', 'S', '<'] = None,
) -> npt.NDArray[Any]:
count = int(count)
itemsize = int(np.empty([], dtype = dtype).itemsize)

View File

@@ -6,7 +6,6 @@ from typing import Any, Callable
from collections import deque
import numpy as np
from numpy._typing import _Shape
from numpy.typing import DTypeLike
@@ -16,16 +15,16 @@ logger = logging.getLogger(__name__)
class LazyMeta(ABCMeta):
def __new__(cls, name: str, bases: tuple[type, ...], namespace: dict[str, Any], **kwargs):
def __getattr__(self, __name: str) -> Any:
meta_attr = getattr(self._meta, __name)
def __getattr__(self, name: str) -> Any:
meta_attr = getattr(self._meta, name)
if callable(meta_attr):
return type(self)._wrap_fn(
(lambda s, *args, **kwargs: getattr(s, __name)(*args, **kwargs)),
(lambda s, *args, **kwargs: getattr(s, name)(*args, **kwargs)),
use_self=self,
)
elif isinstance(meta_attr, self._tensor_type):
# e.g. self.T with torch.Tensor should still be wrapped
return type(self)._wrap_fn(lambda s: getattr(s, __name))(self)
return type(self)._wrap_fn(lambda s: getattr(s, name))(self)
else:
# no need to wrap non-tensor properties,
# and they likely don't depend on the actual contents of the tensor
@@ -141,19 +140,21 @@ class LazyBase(ABC, metaclass=LazyMeta):
res = cls.meta_with_dtype_and_shape(meta_noop, res.shape)
if isinstance(res, cls._tensor_type):
def collect_replace(t: LazyBase):
if collect_replace.shared_lazy is None:
collect_replace.shared_lazy = t._lazy
else:
collect_replace.shared_lazy.extend(t._lazy)
t._lazy = collect_replace.shared_lazy
class CollectSharedLazy:
# emulating a static variable
shared_lazy: None | deque[LazyBase] = None
# emulating a static variable
collect_replace.shared_lazy = None
@staticmethod
def collect_replace(t: LazyBase):
if CollectSharedLazy.shared_lazy is None:
CollectSharedLazy.shared_lazy = t._lazy
else:
CollectSharedLazy.shared_lazy.extend(t._lazy)
t._lazy = CollectSharedLazy.shared_lazy
LazyBase._recurse_apply(args, collect_replace)
LazyBase._recurse_apply(args, CollectSharedLazy.collect_replace)
shared_lazy = collect_replace.shared_lazy
shared_lazy = CollectSharedLazy.shared_lazy
return cls(meta=cls.eager_to_meta(res), lazy=shared_lazy, args=args, func=lambda a: fn(*a, **kwargs))
else:
@@ -184,6 +185,7 @@ class LazyBase(ABC, metaclass=LazyMeta):
lt._args = cls._recurse_apply(lt._args, already_eager_to_eager)
lt._data = lt._func(lt._args)
# sanity check
assert lt._data is not None
assert lt._data.dtype == lt._meta.dtype
assert lt._data.shape == lt._meta.shape
@@ -216,7 +218,7 @@ class LazyNumpyTensor(LazyBase):
_tensor_type = np.ndarray
@classmethod
def meta_with_dtype_and_shape(cls, dtype: DTypeLike, shape: _Shape) -> np.ndarray[Any, Any]:
def meta_with_dtype_and_shape(cls, dtype: DTypeLike, shape: tuple[int, ...]) -> np.ndarray[Any, Any]:
# The initial idea was to use np.nan as the fill value,
# but non-float types like np.int16 can't use that.
# So zero it is.

View File

@@ -1,3 +1,5 @@
# pyright: reportUnusedImport=false
from .gguf_convert_endian import main as gguf_convert_endian_entrypoint
from .gguf_dump import main as gguf_dump_entrypoint
from .gguf_set_metadata import main as gguf_set_metadata_entrypoint

View File

@@ -63,9 +63,9 @@ def gguf_hash(reader: GGUFReader, filename: str, disable_progress_bar) -> None:
bar.update(sum_weights_in_tensor)
sha1_layer = hashlib.sha1()
sha1_layer.update(tensor.data)
sha1.update(tensor.data)
uuidv5_sha1.update(tensor.data)
sha1_layer.update(tensor.data.data)
sha1.update(tensor.data.data)
uuidv5_sha1.update(tensor.data.data)
print("sha1 {0} {1}:{2}".format(sha1_layer.hexdigest(), filename, tensor.name)) # noqa: NP100
# Flush Hash Progress Bar

View File

@@ -1,4 +1,6 @@
#!/usr/bin/env python3
from __future__ import annotations
import logging
import argparse
import os

View File

@@ -1,4 +1,4 @@
import gguf # noqa: F401
import gguf # noqa: F401 # pyright: ignore[reportUnusedImport]
# TODO: add tests

View File

@@ -1,3 +1,21 @@
{
"extraPaths": ["gguf-py"],
}
"pythonVersion": "3.9",
"pythonPlatform": "All",
"reportUnusedImport": "warning",
"reportDuplicateImport": "error",
"reportDeprecated": "warning",
"reportUnnecessaryTypeIgnoreComment": "warning",
"executionEnvironments": [
{
// TODO: make this version override work correctly
"root": "gguf-py",
"pythonVersion": "3.8",
},
{
// uses match expressions in steps.py
"root": "examples/server/tests",
"pythonVersion": "3.10",
},
],
}

View File

@@ -0,0 +1,12 @@
-r ../examples/llava/requirements.txt
-r ../examples/server/bench/requirements.txt
-r ../examples/server/tests/requirements.txt
-r ./requirements-compare-llama-bench.txt
-r ./requirements-pydantic.txt
-r ./requirements-test-tokenizer-random.txt
-r ./requirements-convert_hf_to_gguf.txt
-r ./requirements-convert_hf_to_gguf_update.txt
-r ./requirements-convert_legacy_llama.txt
-r ./requirements-convert_llama_ggml_to_gguf.txt

View File

@@ -0,0 +1,2 @@
tabulate~=0.9.0
GitPython~=3.1.43

View File

@@ -0,0 +1,2 @@
docstring_parser~=0.15
pydantic~=2.6.3

View File

@@ -0,0 +1 @@
cffi~=1.16.0

View File

@@ -108,6 +108,11 @@ check_convert_script() {
fatal "$py missing requirements. Expected: $reqs"
fi
# Check that all sub-requirements are added to top-level requirements.txt
if ! grep -qF "$reqs" requirements.txt; then
fatal "$reqs needs to be added to requirements.txt"
fi
local venv="$workdir/$pyname-venv"
python3 -m venv "$venv"
@@ -134,12 +139,7 @@ EOF
readonly ignore_eq_eq='check_requirements: ignore "=="'
for req in "$reqs_dir"/*; do
# Check that all sub-requirements are added to top-level requirements.txt
if ! grep -qF "$req" requirements.txt; then
fatal "$req needs to be added to requirements.txt"
fi
for req in */**/requirements*.txt; do
# Make sure exact release versions aren't being pinned in the requirements
# Filters out the ignore string
if grep -vF "$ignore_eq_eq" "$req" | grep -q '=='; then

View File

@@ -123,13 +123,13 @@ builds = cursor.execute("SELECT DISTINCT build_commit FROM test;").fetchall()
try:
repo = git.Repo(".", search_parent_directories=True)
except git.exc.InvalidGitRepositoryError:
except git.InvalidGitRepositoryError:
repo = None
def find_parent_in_data(commit):
def find_parent_in_data(commit: git.Commit):
"""Helper function to find the most recent parent measured in number of commits for which there is data."""
heap = [(0, commit)]
heap: list[tuple[int, git.Commit]] = [(0, commit)]
seen_hexsha8 = set()
while heap:
depth, current_commit = heapq.heappop(heap)
@@ -144,7 +144,7 @@ def find_parent_in_data(commit):
return None
def get_all_parent_hexsha8s(commit):
def get_all_parent_hexsha8s(commit: git.Commit):
"""Helper function to recursively get hexsha8 values for all parents of a commit."""
unvisited = [commit]
visited = []

View File

@@ -1,3 +1,5 @@
from __future__ import annotations
import array
import unicodedata
import requests
@@ -133,7 +135,7 @@ table_nfd.sort()
# group ranges with same flags
ranges_flags = [(0, codepoint_flags[0])] # start, flags
ranges_flags: list[tuple[int, int]] = [(0, codepoint_flags[0])] # start, flags
for codepoint, flags in enumerate(codepoint_flags):
if flags != ranges_flags[-1][1]:
ranges_flags.append((codepoint, flags))
@@ -141,11 +143,11 @@ ranges_flags.append((MAX_CODEPOINTS, 0x0000))
# group ranges with same nfd
ranges_nfd = [(0, 0, 0)] # start, last, nfd
ranges_nfd: list[tuple[int, int, int]] = [(0, 0, 0)] # start, last, nfd
for codepoint, norm in table_nfd:
start = ranges_nfd[-1][0]
if ranges_nfd[-1] != (start, codepoint - 1, norm):
ranges_nfd.append(None)
ranges_nfd.append(None) # type: ignore[arg-type] # dummy, will be replaced below
start = codepoint
ranges_nfd[-1] = (start, codepoint, norm)
@@ -179,13 +181,13 @@ for codepoint in table_whitespace:
out("};\n")
out("const std::unordered_map<uint32_t, uint32_t> unicode_map_lowercase = {")
for tuple in table_lowercase:
out("{0x%06X, 0x%06X}," % tuple)
for tuple_lw in table_lowercase:
out("{0x%06X, 0x%06X}," % tuple_lw)
out("};\n")
out("const std::unordered_map<uint32_t, uint32_t> unicode_map_uppercase = {")
for tuple in table_uppercase:
out("{0x%06X, 0x%06X}," % tuple)
for tuple_up in table_uppercase:
out("{0x%06X, 0x%06X}," % tuple_up)
out("};\n")
out("const std::vector<range_nfd> unicode_ranges_nfd = { // start, last, nfd")

View File

@@ -63,6 +63,7 @@ while read c; do
src/ggml*.metal \
src/ggml*.cu \
src/ggml-cuda/* \
src/ggml-sycl/* \
include/ggml*.h \
tests/test-opt.cpp \
tests/test-grad0.cpp \
@@ -113,6 +114,7 @@ if [ -f $SRC_LLAMA/ggml-src.patch ]; then
# src/ggml-quants.c -> ggml/src/ggml-quants.c
# src/ggml-quants.h -> ggml/src/ggml-quants.h
# src/ggml-rpc.cpp -> ggml/src/ggml-rpc.cpp
# src/ggml-sycl/* -> ggml/src/ggml-sycl/
# src/ggml-sycl.cpp -> ggml/src/ggml-sycl.cpp
# src/ggml-vulkan.cpp -> ggml/src/ggml-vulkan.cpp
#
@@ -153,6 +155,7 @@ if [ -f $SRC_LLAMA/ggml-src.patch ]; then
-e 's/([[:space:]]|[ab]\/)src\/ggml-quants\.c/\1ggml\/src\/ggml-quants.c/g' \
-e 's/([[:space:]]|[ab]\/)src\/ggml-quants\.h/\1ggml\/src\/ggml-quants.h/g' \
-e 's/([[:space:]]|[ab]\/)src\/ggml-rpc\.cpp/\1ggml\/src\/ggml-rpc.cpp/g' \
-e 's/([[:space:]]|[ab]\/)src\/ggml-sycl\//\1ggml\/src\/ggml-sycl\//g' \
-e 's/([[:space:]]|[ab]\/)src\/ggml-sycl\.cpp/\1ggml\/src\/ggml-sycl.cpp/g' \
-e 's/([[:space:]]|[ab]\/)src\/ggml-vulkan\.cpp/\1ggml\/src\/ggml-vulkan.cpp/g' \
-e 's/([[:space:]]|[ab]\/)include\/ggml\.h/\1ggml\/include\/ggml.h/g' \

View File

@@ -1 +1 @@
5378ea0d3c2f25bcd330ecb226ad2db454be86d0
e3b3846976c94163f2b3dd128cc959782653edbb

View File

@@ -18,6 +18,7 @@ cp -rpv ../ggml/src/ggml-metal.metal ./ggml/src/ggml-metal.metal
cp -rpv ../ggml/src/ggml-quants.c ./ggml/src/ggml-quants.c
cp -rpv ../ggml/src/ggml-quants.h ./ggml/src/ggml-quants.h
cp -rpv ../ggml/src/ggml-rpc.cpp ./ggml/src/ggml-rpc.cpp
cp -rpv ../ggml/src/ggml-sycl/* ./ggml/src/ggml-sycl/
cp -rpv ../ggml/src/ggml-sycl.cpp ./ggml/src/ggml-sycl.cpp
cp -rpv ../ggml/src/ggml-vulkan.cpp ./ggml/src/ggml-vulkan.cpp

View File

@@ -1266,6 +1266,32 @@ struct test_pool2d : public test_case {
}
};
// GGML_OP_CONV_TRANSPOSE_1D
struct test_conv_transpose_1d : public test_case {
const std::array<int64_t, 4> ne_input;
const std::array<int64_t, 4> ne_kernel;
const int s0; // stride
const int p0; // padding
const int d0; // dilation
std::string vars() override {
return VARS_TO_STR5(ne_input, ne_kernel, s0, p0, d0);
}
test_conv_transpose_1d(std::array<int64_t, 4> ne_input = {197, 32, 1, 1}, // [input_width, input_height, input_channels, 1]
std::array<int64_t, 4> ne_kernel = {16, 32, 32, 1}, // [kernel_width, kernel_height, input_channels, 1]
int s0 = 1, int p0 = 0, int d0 = 1)
: ne_input(ne_input), ne_kernel(ne_kernel), s0(s0), p0(p0), d0(d0) {}
ggml_tensor * build_graph(ggml_context * ctx) override {
ggml_tensor * input = ggml_new_tensor(ctx, GGML_TYPE_F32, 4, ne_input.data());
ggml_tensor * kernel = ggml_new_tensor(ctx, GGML_TYPE_F32, 4, ne_kernel.data());
ggml_tensor * out = ggml_conv_transpose_1d(ctx, kernel, input, s0, p0, d0);
return out;
}
};
// GGML_OP_IM2COL
struct test_im2col : public test_case {
const ggml_type type_input;
@@ -1279,7 +1305,7 @@ struct test_im2col : public test_case {
// padding
const int p0;
const int p1;
// dilatation
// dilation
const int d0;
const int d1;
// mode
@@ -2098,6 +2124,16 @@ static bool test_backend(ggml_backend_t backend, test_mode mode, const char * op
test_cases.emplace_back(new test_im2col(GGML_TYPE_F32, GGML_TYPE_F16, GGML_TYPE_F32));
test_cases.emplace_back(new test_im2col(GGML_TYPE_F32, GGML_TYPE_F16, GGML_TYPE_F16));
test_cases.emplace_back(new test_conv_transpose_1d());
test_cases.emplace_back(new test_conv_transpose_1d({3,2,1,1}, {2,3,2,1}, 3, 0, 1));
test_cases.emplace_back(new test_conv_transpose_1d({3,2,1,1}, {2,3,2,1}, 2, 0, 1));
test_cases.emplace_back(new test_conv_transpose_1d({3,2,1,1}, {2,3,2,1}, 1, 0, 1));
test_cases.emplace_back(new test_conv_transpose_1d({3,2,1,1}, {3,2,2,1}, 2, 0, 1));
test_cases.emplace_back(new test_conv_transpose_1d({3,2,1,1}, {3,2,2,1}, 1, 0, 1));
test_cases.emplace_back(new test_conv_transpose_1d({3,2,1,1}, {3,1,2,1}, 1, 0, 1));
test_cases.emplace_back(new test_conv_transpose_1d({2,1,1,1}, {3,1,1,1}, 1, 0, 1));
test_cases.emplace_back(new test_repeat(GGML_TYPE_F32, {10, 10, 10, 10}, {1, 1, 1, 1}));
test_cases.emplace_back(new test_repeat(GGML_TYPE_F32, {10, 10, 10, 10}, {2, 1, 1, 1}));
test_cases.emplace_back(new test_repeat(GGML_TYPE_F32, {10, 10, 10, 10}, {1, 2, 1, 1}));

View File

@@ -6,6 +6,8 @@
# python3 tests/test-tokenizer-random.py ./models/ggml-vocab-llama-bpe.gguf ./models/tokenizers/llama-bpe
#
from __future__ import annotations
import time
import logging
import argparse
@@ -13,7 +15,9 @@ import subprocess
import random
import unicodedata
from typing import Iterator
from pathlib import Path
from typing import Any, Iterator, cast
from typing_extensions import Buffer
import cffi
from transformers import AutoTokenizer
@@ -28,15 +32,15 @@ class LibLlama:
DEFAULT_PATH_INCLUDES = ["./ggml/include/", "./include/"]
DEFAULT_PATH_LIBLLAMA = "./build/src/libllama.so" # CMakeLists.txt: BUILD_SHARED_LIBS ON
def __init__(self, path_llama_h: str = None, path_includes: list[str] = [], path_libllama: str = None):
def __init__(self, path_llama_h: str | None = None, path_includes: list[str] = [], path_libllama: str | None = None):
path_llama_h = path_llama_h or self.DEFAULT_PATH_LLAMA_H
path_includes = path_includes or self.DEFAULT_PATH_INCLUDES
path_libllama = path_libllama or self.DEFAULT_PATH_LIBLLAMA
(self.ffi, self.lib) = self._load_libllama_cffi(path_llama_h, path_includes, path_libllama)
self.lib.llama_backend_init()
def _load_libllama_cffi(self, path_llama_h: str, path_includes: list[str], path_libllama: str):
cmd = ["gcc", "-E", "-P", "-D__restrict=", "-D__attribute__(x)=", "-D__asm__(x)="]
def _load_libllama_cffi(self, path_llama_h: str, path_includes: list[str], path_libllama: str) -> tuple[cffi.FFI, Any]:
cmd = ["gcc", "-O0", "-E", "-P", "-D__restrict=", "-D__attribute__(x)=", "-D__asm__(x)="]
cmd += ["-I" + path for path in path_includes] + [path_llama_h]
res = subprocess.run(cmd, stdout=subprocess.PIPE)
assert (res.returncode == 0)
@@ -68,7 +72,7 @@ class LibLlama:
class LibLlamaModel:
def __init__(self, libllama: LibLlama, path_model: str, mparams={}, cparams={}):
self.lib = libllama.lib
self.lib: Any = libllama.lib
self.ffi = libllama.ffi
if isinstance(mparams, dict):
mparams = libllama.model_default_params(**mparams)
@@ -94,11 +98,11 @@ class LibLlamaModel:
self.lib = None
def tokenize(self, text: str, add_special: bool = False, parse_special: bool = False) -> list[int]:
text = text.encode("utf-8")
num = self.lib.llama_tokenize(self.model, text, len(text), self.token_ids, len(self.token_ids), add_special, parse_special)
encoded_text: bytes = text.encode("utf-8")
num = self.lib.llama_tokenize(self.model, encoded_text, len(encoded_text), self.token_ids, len(self.token_ids), add_special, parse_special)
while num < 0 and len(self.token_ids) < (16 << 20):
self.token_ids = self.ffi.new("llama_token[]", -2 * num)
num = self.lib.llama_tokenize(self.model, text, len(text), self.token_ids, len(self.token_ids), add_special, parse_special)
num = self.lib.llama_tokenize(self.model, encoded_text, len(encoded_text), self.token_ids, len(self.token_ids), add_special, parse_special)
return list(self.token_ids[0:num])
def detokenize(self, ids: list[int], remove_special: bool = False, unparse_special: bool = False) -> str:
@@ -110,7 +114,7 @@ class LibLlamaModel:
while num < 0 and len(self.text_buff) < (16 << 20):
self.text_buff = self.ffi.new("uint8_t[]", -2 * num)
num = self.lib.llama_detokenize(self.model, self.token_ids, len(ids), self.text_buff, len(self.text_buff), remove_special, unparse_special)
return str(self.ffi.buffer(self.text_buff, num), encoding="utf-8", errors="replace") # replace errors with '\uFFFD'
return str(cast(Buffer, self.ffi.buffer(self.text_buff, num)), encoding="utf-8", errors="replace") # replace errors with '\uFFFD'
class Tokenizer:
@@ -152,7 +156,7 @@ class TokenizerGroundtruth (Tokenizer):
class TokenizerLlamaCpp (Tokenizer):
libllama: LibLlama = None
libllama: LibLlama | None = None
def __init__(self, vocab_file: str):
if not self.libllama:
@@ -404,7 +408,7 @@ def generator_random_vocab_words(tokenizer: TokenizerGroundtruth, iterations=100
def compare_tokenizers(tokenizer1: TokenizerGroundtruth, tokenizer2: TokenizerLlamaCpp, generator: Iterator[str]):
def find_first_mismatch(ids1: list[int], ids2: list[int]):
def find_first_mismatch(ids1: list[int] | str, ids2: list[int] | str):
for i, (a, b) in enumerate(zip(ids1, ids2)):
if a != b:
return i
@@ -433,7 +437,7 @@ def compare_tokenizers(tokenizer1: TokenizerGroundtruth, tokenizer2: TokenizerLl
decode_errors = 0
MAX_ERRORS = 10
logger.info("%s: %s" % (generator.__name__, "ini"))
logger.info("%s: %s" % (generator.__qualname__, "ini"))
for text in generator:
# print(repr(text), text.encode())
# print(repr(text), hex(ord(text[0])), text.encode())
@@ -472,13 +476,13 @@ def compare_tokenizers(tokenizer1: TokenizerGroundtruth, tokenizer2: TokenizerLl
break
t_total = time.perf_counter() - t_start
logger.info(f"{generator.__name__}: end, {t_encode1=:.3f} {t_encode2=:.3f} {t_decode1=:.3f} {t_decode2=:.3f} {t_total=:.3f}")
logger.info(f"{generator.__qualname__}: end, {t_encode1=:.3f} {t_encode2=:.3f} {t_decode1=:.3f} {t_decode2=:.3f} {t_total=:.3f}")
def main(argv: list[str] = None):
def main(argv: list[str] | None = None):
parser = argparse.ArgumentParser()
parser.add_argument("vocab_file", help="path to vocab 'gguf' file")
parser.add_argument("dir_tokenizer", help="directory containing 'tokenizer.model' file")
parser.add_argument("vocab_file", type=str, help="path to vocab 'gguf' file")
parser.add_argument("dir_tokenizer", type=str, help="directory containing 'tokenizer.model' file")
parser.add_argument("--verbose", action="store_true", help="increase output verbosity")
args = parser.parse_args(argv)
@@ -520,7 +524,7 @@ if __name__ == "__main__":
format = "%(levelname)s %(message)s",
)
path_tokenizers = "./models/tokenizers/"
path_tokenizers = Path("./models/tokenizers/")
path_vocab_format = "./models/ggml-vocab-%s.gguf"
tokenizers = [
@@ -556,6 +560,6 @@ if __name__ == "__main__":
for tokenizer in tokenizers:
logger.info("-" * 50)
logger.info(f"TOKENIZER: '{tokenizer}'")
vocab_file = path_vocab_format % tokenizer
dir_tokenizer = path_tokenizers + "/" + tokenizer
main([vocab_file, dir_tokenizer, "--verbose"])
vocab_file = Path(path_vocab_format % tokenizer)
dir_tokenizer = path_tokenizers / tokenizer
main([str(vocab_file), str(dir_tokenizer), "--verbose"])