Compare commits

...

6 Commits

Author SHA1 Message Date
Valentyn Bezshapkin
97940520e8 fix: add POSIX functionality for Linux compilation (#51)
* fix: add POSIX functionality for Linux compilation

* fix: older standard for compatibility
2023-03-22 19:20:25 +02:00
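
For context, a small illustrative program (not part of the diff; written in C++ here, while the file actually touched is the C file ggml.c): timing code that calls `clock_gettime(CLOCK_MONOTONIC, ...)` relies on POSIX declarations, and under a strict language standard such as the `c_std_11` used for ggml in the CMake hunk below, those declarations are only exposed when a feature-test macro like `_POSIX_C_SOURCE` is defined before the first system header; that is what the one-line fix at the top of ggml.c does.

```
// Illustrative only: shows the required ordering of the feature-test macro.
// In ggml.c (a C file built with -std=c11), CLOCK_MONOTONIC is not declared
// unless a macro like this appears before any system header is included.
#define _POSIX_C_SOURCE 199309L   // must come before the first #include

#include <cstdio>
#include <ctime>

int main() {
    timespec ts {};
    clock_gettime(CLOCK_MONOTONIC, &ts);   // POSIX monotonic clock, as used by ggml's timing code
    std::printf("monotonic time: %lld.%09ld s\n", (long long) ts.tv_sec, ts.tv_nsec);
    return 0;
}
```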
tjohnman
305ba6f0e6 Don't force immediate interactive without -i (#354)
* Don't force immediate interactive without -i

Sometimes we might want to use a reverse prompt but we want to let the
model generate tokens right after the initial prompt. So we don't force
user input mode if the -i flag wasn't specified and instead let it run
until we encounter the reverse prompt.

This gives us some more flexibility, since it doesn't force the user to
enter a newline if they want to let the model generate text right after
the initial prompt and only be asked for input if the reverse prompt is
encountered.

The `--interactive-first` flag is reintroduced to force the old
behavior. `-r` behaves like `-i` plus introduces a reverse prompt (it
can be specified more than once).

* Update help output.

---------

Co-authored-by: Johnman <tjohnman@github>
2023-03-22 19:16:35 +02:00
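
As a rough sketch of the resulting behavior (the `gpt_params` field names are taken from the hunks further down this page; the surrounding plumbing is simplified and assumed): specifying a reverse prompt now only enables interactive mode, and generation starts immediately unless `--interactive-first` asks for user input right away.

```
// Simplified sketch of the flag semantics after this commit; the gpt_params
// fields mirror the utils.h hunk below, everything else is illustrative.
struct gpt_params_sketch {
    bool interactive       = false; // -i / --interactive (also implied by -r)
    bool interactive_start = false; // --interactive-first: wait for input right away
};

// -r / --reverse-prompt: enables interactive mode, but does NOT force an
// immediate prompt for input; the model keeps generating until the reverse
// prompt is encountered.
void on_reverse_prompt_flag(gpt_params_sketch & params) {
    params.interactive = true;
}

// --interactive-first: restores the old behavior of asking for input before
// any generation happens.
void on_interactive_first_flag(gpt_params_sketch & params) {
    params.interactive_start = true; // main() also enables interactive mode in this case
}

// Whether the very first turn belongs to the user now depends on
// interactive_start rather than on interactive mode alone.
bool starts_waiting_for_user(const gpt_params_sketch & params) {
    return params.interactive_start;
}
```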
Erik Scholz
4122dffff9 cmake: make llama an actual library (#392) 2023-03-22 18:37:10 +02:00
Erik Scholz
56e659a0b2 fix perplexity after c-api refactor (#390)
* preallocate a buffer of fitting size for tokenization (utils.cpp)

* don't create a new std::string (especially here, where it's usually large)
2023-03-22 18:09:38 +02:00
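
To make the two bullet points concrete, here is a hedged sketch of the tokenization helper after this change (it mirrors the utils.cpp hunk near the bottom of this page and assumes the repo's `llama.h` for the `llama_tokenize` C API):

```
#include <cassert>
#include <string>
#include <vector>

#include "llama.h" // llama_context, llama_token, llama_tokenize()

// Passing the prompt as const std::string & avoids constructing a copy of a
// potentially very large string just to hand it to the tokenizer.
std::vector<llama_token> tokenize_prompt(llama_context * ctx, const std::string & text, bool add_bos) {
    // A prompt of N characters can never produce more than N tokens (plus an
    // optional BOS token), so size the buffer from the prompt instead of
    // guessing with a fixed 8096-element allocation.
    std::vector<llama_token> res(text.size() + (add_bos ? 1 : 0));
    const int n = llama_tokenize(ctx, text.c_str(), res.data(), (int) res.size(), add_bos);
    assert(n >= 0);
    res.resize(n);
    return res;
}
```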
Gary Linscott
40ea807a97 Add details on perplexity to README.md (#395) 2023-03-22 08:53:54 -07:00
Yusuf Kağan Hanoğlu
d5850c53ca Add missing header for memcpy (#386)
fixed: memcpy is not defined
2023-03-22 10:55:45 +02:00
7 changed files with 55 additions and 7 deletions

View File

@@ -217,6 +217,7 @@ add_library(utils OBJECT
target_include_directories(utils PUBLIC .)
target_compile_features(utils PUBLIC cxx_std_11) # don't bump
target_link_libraries(utils PRIVATE ${LLAMA_EXTRA_LIBS})
add_library(ggml OBJECT
ggml.c
@@ -226,12 +227,13 @@ target_include_directories(ggml PUBLIC .)
target_compile_features(ggml PUBLIC c_std_11) # don't bump
target_link_libraries(ggml PRIVATE Threads::Threads ${LLAMA_EXTRA_LIBS})
add_library(llama OBJECT
add_library(llama
llama.cpp
llama.h)
target_include_directories(llama PUBLIC .)
target_compile_features(llama PUBLIC cxx_std_11) # don't bump
target_link_libraries(llama PRIVATE utils ggml ${LLAMA_EXTRA_LIBS})
#
# Executables

View File

@@ -240,6 +240,40 @@ or
`shasum -a 256 --ignore-missing -c SHA256SUMS` on macOS
### Perplexity (Measuring model quality)
You can pass `--perplexity` as a command line option to measure perplexity over the given prompt. For more background,
see https://huggingface.co/docs/transformers/perplexity. In short, lower perplexity is better for LLMs.
#### Measurements
https://github.com/ggerganov/llama.cpp/pull/270 is the unofficial tracking page for now. llama.cpp is measuring very well
compared to the baseline implementations. Quantization has a small negative impact on quality, but, as you can see, running
13B at q4_0 beats the 7B f16 model by a significant amount.
All measurements are done against the wikitext2 test dataset (https://paperswithcode.com/dataset/wikitext-2), with default options (512-token context length).
Note that changing the context length has a significant impact on perplexity (longer context = better perplexity).
```
Perplexity - model options
5.5985 - 13B, q4_0
5.9565 - 7B, f16
6.3001 - 7B, q4_1
6.5949 - 7B, q4_0
6.5995 - 7B, q4_0, --memory_f16
```
#### How to run
1. Download/extract: https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-raw-v1.zip?ref=salesforce-research
2. Run `./main --perplexity -m models/7B/ggml-model-q4_0.bin -f wiki.test.raw`
3. Output:
```
Calculating perplexity over 655 chunks
24.43 seconds per pass - ETA 4.45 hours
[1]4.5970,[2]5.1807,[3]6.0382,...
```
And after 4.45 hours, you will have the final perplexity.
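For intuition about those bracketed numbers, a small self-contained sketch (illustrative only; the real logic is in the `perplexity()` function touched later in this diff): each value is a running perplexity, i.e. the exponential of the average negative log-likelihood of the tokens predicted so far, evaluated chunk by chunk over the 512-token context.
```
#include <cmath>
#include <cstdio>
#include <vector>

// Given the probabilities the model assigned to the actual next tokens,
// the running perplexity is exp(total_nll / token_count); lower is better.
double running_perplexity(const std::vector<double> & token_probs) {
    double nll = 0.0; // accumulated negative log-likelihood
    for (double p : token_probs) {
        nll += -std::log(p);
    }
    return std::exp(nll / (double) token_probs.size());
}

int main() {
    // hypothetical probabilities from one chunk (shortened here)
    const std::vector<double> probs = { 0.25, 0.10, 0.40, 0.05 };
    std::printf("perplexity so far: %.4f\n", running_perplexity(probs));
    return 0;
}
```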
### Android
You can easily run `llama.cpp` on an Android device with [termux](https://play.google.com/store/apps/details?id=com.termux).
@@ -290,7 +324,6 @@ docker run -v /llama/models:/models ghcr.io/ggerganov/llama.cpp:light -m /models
## Limitations
- We don't know yet how much the quantization affects the quality of the generated text
- Probably the token sampling can be improved
- The Accelerate framework is actually currently unused since I found that for tensor shapes typical for the Decoder,
there is no benefit compared to the ARM_NEON intrinsics implementation. Of course, it's possible that I simply don't

ggml.c
View File

@@ -1,3 +1,6 @@
// Defines CLOCK_MONOTONIC on Linux
#define _POSIX_C_SOURCE 199309L
#include "ggml.h"
#if defined(_MSC_VER) || defined(__MINGW32__)

View File

@@ -9,6 +9,7 @@
#include <queue>
#include <regex>
#include <cassert>
#include <cstring>
// determine number of model parts based on the dimension
static const std::unordered_map<int, int> LLAMA_N_PARTS = {

View File

@@ -85,7 +85,7 @@ void perplexity(llama_context * ctx, const gpt_params & params) {
// Download: https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-raw-v1.zip?ref=salesforce-research
// Run `./main --perplexity -m models/7B/ggml-model-q4_0.bin -f wiki.test.raw`
// Output: `perplexity: 13.5106 [114/114]`
auto tokens = ::llama_tokenize(ctx, params.prompt.c_str(), true);
auto tokens = ::llama_tokenize(ctx, params.prompt, true);
int count = 0;
double nll = 0.0;
@@ -254,6 +254,10 @@ int main(int argc, char ** argv) {
params.interactive = true;
}
if (params.interactive_start) {
params.interactive = true;
}
fprintf(stderr, "\n");
fprintf(stderr, "%s: prompt: '%s'\n", __func__, params.prompt.c_str());
fprintf(stderr, "%s: number of tokens in prompt = %zu\n", __func__, embd_inp.size());
@@ -296,7 +300,7 @@ int main(int argc, char ** argv) {
#endif
" - Press Return to return control to LLaMa.\n"
" - If you want to submit another line, end your input in '\\'.\n\n");
is_interacting = true;
is_interacting = params.interactive_start;
}
int input_consumed = 0;

View File

@@ -63,6 +63,8 @@ bool gpt_params_parse(int argc, char ** argv, gpt_params & params) {
params.model = argv[++i];
} else if (arg == "-i" || arg == "--interactive") {
params.interactive = true;
} else if (arg == "--interactive-first") {
params.interactive_start = true;
} else if (arg == "-ins" || arg == "--instruct") {
params.instruct = true;
} else if (arg == "--color") {
@@ -96,9 +98,10 @@ void gpt_print_usage(int /*argc*/, char ** argv, const gpt_params & params) {
fprintf(stderr, "options:\n");
fprintf(stderr, " -h, --help show this help message and exit\n");
fprintf(stderr, " -i, --interactive run in interactive mode\n");
fprintf(stderr, " --interactive-first run in interactive mode and wait for input right away\n");
fprintf(stderr, " -ins, --instruct run in instruction mode (use with Alpaca models)\n");
fprintf(stderr, " -r PROMPT, --reverse-prompt PROMPT\n");
fprintf(stderr, " in interactive mode, poll user input upon seeing PROMPT (can be\n");
fprintf(stderr, " run in interactive mode and poll user input upon seeing PROMPT (can be\n");
fprintf(stderr, " specified more than once for multiple prompts).\n");
fprintf(stderr, " --color colorise output to distinguish prompt and user input from generations\n");
fprintf(stderr, " -s SEED, --seed SEED RNG seed (default: -1, use random seed for <= 0)\n");
@@ -146,8 +149,10 @@ std::string gpt_random_prompt(std::mt19937 & rng) {
// TODO: not great allocating this every time
std::vector<llama_token> llama_tokenize(struct llama_context * ctx, const std::string & text, bool add_bos) {
std::vector<llama_token> res(8096);
// initialize to prompt number of chars, since n_tokens <= n_prompt_chars
std::vector<llama_token> res(text.size() + (int)add_bos);
int n = llama_tokenize(ctx, text.c_str(), res.data(), res.size(), add_bos);
assert(n >= 0);
res.resize(n);
return res;

View File

@@ -38,7 +38,7 @@ struct gpt_params {
bool random_prompt = false; // do not randomize prompt if none provided
bool use_color = false; // use color to distinguish generations and inputs
bool interactive = false; // interactive mode
bool interactive_start = false; // reverse prompt immediately
bool interactive_start = false; // wait for user input immediately
bool instruct = false; // instruction mode (used for Alpaca models)
bool ignore_eos = false; // do not stop generating after eos
bool perplexity = false; // compute perplexity over the prompt