Compare commits

...

4 Commits

Author SHA1 Message Date
Kawrakow
f7d05095b4 Q4_2 quantization with rmse-optimized scale and quants (#1062)
* Q4_2 quantization with rmse-optimized scale and quants

For quantize-stats we get
q4_2: rmse 0.00159301, maxerr 0.17480469, 95pct<0.0030, median<0.0012

For 7B perplexity with BLAS enabled we get 6.2038 after 655 chunks.

Quantization is slow (~90 seconds on my Mac for 7B) because it is not
multi-threaded as in PR #896.

* ggml : satisfy the sanitizer builds

Not sure why this makes them fail

* Better follow ggml conventions for function names

* Fixed type as per reviewer comment

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-04-19 20:20:14 +02:00
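For context, an rmse-optimized scale means searching over candidate scales and keeping the one that minimizes the round-trip quantization error, instead of deriving the scale directly from the block's maximum magnitude. Below is a minimal C sketch of that idea; the function name, the [-8, 7] quant range, and the brute-force candidate grid are illustrative assumptions, not the actual code from #1062:

```c
#include <math.h>
#include <stdint.h>
#include <stddef.h>

// Quantize one block of floats to 4-bit values in [-8, 7], choosing the
// per-block scale that minimizes round-trip RMSE rather than simply
// using amax/7. Illustrative sketch only -- not the ggml implementation.
static float quantize_block_4bit_rmse(const float * x, int8_t * q, size_t n) {
    float amax = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        const float ax = fabsf(x[i]);
        if (ax > amax) amax = ax;
    }
    if (amax == 0.0f) {
        for (size_t i = 0; i < n; ++i) q[i] = 0;
        return 0.0f;
    }

    const float d0 = amax / 7.0f;   // naive scale as the starting point
    float best_d   = d0;
    float best_err = INFINITY;

    // brute-force scan of candidate scales around the naive one
    for (int k = -10; k <= 10; ++k) {
        const float d = d0 * (1.0f + 0.01f * (float) k);
        float err = 0.0f;
        for (size_t i = 0; i < n; ++i) {
            int v = (int) roundf(x[i] / d);
            if (v < -8) v = -8;
            if (v >  7) v =  7;
            const float e = x[i] - d * (float) v;
            err += e * e;   // squared round-trip error
        }
        if (err < best_err) {
            best_err = err;
            best_d   = d;
        }
    }

    // emit the quants for the winning scale
    for (size_t i = 0; i < n; ++i) {
        int v = (int) roundf(x[i] / best_d);
        if (v < -8) v = -8;
        if (v >  7) v =  7;
        q[i] = (int8_t) v;
    }
    return best_d;
}
```

The quantize-stats figures quoted above (rmse, maxerr, 95th percentile, median) measure exactly this kind of round-trip error, and the extra search over candidate scales is also why the quantization step is slower than a plain amax-based scale.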
Georgi Gerganov
884e7d7a2b ggml : use 8-bit precision for Q4_1 intermediate results (#1047)
* ggml : use 8-bit precision for Q4_1 intermediate results (ARM)

* ggml : optimize ggml_vec_dot_q4_1_q8_0() via vmlaq_n_f32

56 ms/token with Q4_1 !

* ggml : AVX2 implementation of ggml_vec_dot_q4_1_q8_0 (#1051)

* gitignore : ignore ppl-*.txt files

---------

Co-authored-by: slaren <2141330+slaren@users.noreply.github.com>
2023-04-19 20:10:08 +03:00
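The key change here is that the activations are quantized to 8 bits once, so the inner loop works on small integers and only applies the float scale (and min) per block. Below is a scalar C sketch of a Q4_1 × Q8 block dot product; the block structs, packing order, and names are simplified assumptions, not the actual ggml definitions or the NEON/AVX2 kernels:

```c
#include <stdint.h>

#define QK 32  // block size; QK4_1 == QK8_0 == 32 in ggml at the time

// Simplified block layouts -- not the exact ggml structs.
typedef struct {
    float   d;           // scale
    float   m;           // min
    uint8_t qs[QK / 2];  // 4-bit quants, two adjacent elements per byte
} block_q4_1;

typedef struct {
    float  d;            // scale
    int8_t qs[QK];       // 8-bit quants
} block_q8_0;

// x_i ~= d4*q4_i + m4 and y_i ~= d8*q8_i, so
// sum(x_i*y_i) ~= d4*d8*sum(q4_i*q8_i) + m4*d8*sum(q8_i).
// n is assumed to be a multiple of QK.
static float vec_dot_q4_1_q8_0_ref(int n, const block_q4_1 * x, const block_q8_0 * y) {
    const int nb = n / QK;
    float sum = 0.0f;
    for (int b = 0; b < nb; ++b) {
        int32_t sumi  = 0;  // integer sum of q4*q8 products
        int32_t sumq8 = 0;  // sum of q8, needed for the min term
        for (int j = 0; j < QK / 2; ++j) {
            const int q4_lo = x[b].qs[j] & 0x0F;  // element 2*j
            const int q4_hi = x[b].qs[j] >> 4;    // element 2*j + 1
            const int q8_lo = y[b].qs[2*j + 0];
            const int q8_hi = y[b].qs[2*j + 1];
            sumi  += q4_lo * q8_lo + q4_hi * q8_hi;
            sumq8 += q8_lo + q8_hi;
        }
        sum += x[b].d * y[b].d * (float) sumi + x[b].m * y[b].d * (float) sumq8;
    }
    return sum;
}
```

The commit title points at vmlaq_n_f32, i.e. folding the per-block float scales in with a vector multiply-accumulate after the integer q4*q8 sums are computed; the AVX2 version from #1051 follows the same overall structure.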
Georgi Gerganov
7cd5c4a3e9 readme : add warning about Q4_2 and Q4_3 2023-04-19 19:07:54 +03:00
Stephan Walter
f3d4edf504 ggml : Q4 cleanup - remove 4-bit dot product code (#1061)
* Q4 cleanup

* Remove unused AVX512 Q4_0 code
2023-04-19 19:06:37 +03:00
5 changed files with 292 additions and 732 deletions

.gitignore (vendored, 15 lines changed)

@@ -1,11 +1,15 @@
*.o
*.a
.DS_Store
.build/
.cache/
.direnv/
.envrc
.swiftpm
.venv
.vs/
.vscode/
.DS_Store
.build/
build/
build-em/
build-debug/
@@ -30,12 +34,9 @@ models/*
arm_neon.h
compile_commands.json
.envrc
.direnv/
.venv
__pycache__
.swiftpm
zig-out/
zig-cache/
ppl-*.txt

CMakeLists.txt

@@ -174,7 +174,6 @@ if (LLAMA_ALL_WARNINGS)
-Wshadow
-Wstrict-prototypes
-Wpointer-arith
-Wno-unused-function
)
set(cxx_flags
-Wall

Makefile

@@ -36,7 +36,7 @@ CXXFLAGS = -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC
LDFLAGS =
# warnings
CFLAGS += -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wno-unused-function
CFLAGS += -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith
CXXFLAGS += -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar
# OS specific

README.md

@@ -7,6 +7,10 @@
Inference of [LLaMA](https://arxiv.org/abs/2302.13971) model in pure C/C++
**Warnings**
- `Q4_2` and `Q4_3` are still in development. Do not expect any kind of backward compatibility until they are finalized
**Hot topics:**
- [Added LoRA support](https://github.com/ggerganov/llama.cpp/pull/820)

ggml.c (1002 lines changed)

File diff suppressed because it is too large.