# Server tests

Python based server tests scenario using pytest.

Tests target GitHub workflows job runners with 4 vCPU.

Note: If the host architecture inference speed is faster than the GitHub runners', parallel scenarios may randomly fail. To mitigate this, you can increase the values of `n_predict` and `kv_size`.
## Install dependencies

```shell
pip install -r requirements.txt
```
## Run tests

1. Build the server

```shell
cd ../../..
cmake -B build
cmake --build build --target llama-server
```

2. Start the test:

```shell
./tests.sh
```
It's possible to override some scenario step values with environment variables:

| variable                | description                                                                                    |
|-------------------------|------------------------------------------------------------------------------------------------|
| `PORT`                  | `context.server_port` to set the listening port of the server during scenario, default: `8080` |
| `LLAMA_SERVER_BIN_PATH` | to change the server binary path, default: `../../../build/bin/llama-server`                   |
| `DEBUG`                 | to enable steps and server verbose mode `--verbose`                                            |
| `N_GPU_LAYERS`          | number of model layers to offload to VRAM `-ngl --n-gpu-layers`                                |
| `LLAMA_CACHE`           | by default, server tests re-download models to the `tmp` subfolder. Set this to your cache (e.g. `$HOME/Library/Caches/llama.cpp` on Mac or `$HOME/.cache/llama.cpp` on Unix) to avoid this |
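As an illustrative sketch (the function name and dict layout here are hypothetical, not the harness's actual code), the overrides in the table above amount to reading environment variables with the documented defaults:

```python
import os

def resolve_test_config() -> dict:
    """Hypothetical sketch of how the documented env overrides could be
    read; the defaults mirror the ones listed in the table above."""
    return {
        "port": int(os.environ.get("PORT", "8080")),
        "server_bin": os.environ.get(
            "LLAMA_SERVER_BIN_PATH", "../../../build/bin/llama-server"
        ),
        "debug": os.environ.get("DEBUG", "0") == "1",
        # None means "no cache configured": models are re-downloaded to tmp
        "cache_dir": os.environ.get("LLAMA_CACHE"),
    }
```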
To run slow tests (will download many models, make sure to set `LLAMA_CACHE` if needed):

```shell
SLOW_TESTS=1 ./tests.sh
```

To run with stdout/stderr displayed in real time (verbose output, but useful for debugging):

```shell
DEBUG=1 ./tests.sh -s -v -x
```

To run all the tests in a file:

```shell
./tests.sh unit/test_chat_completion.py -v -x
```

To run a single test:

```shell
./tests.sh unit/test_chat_completion.py::test_invalid_chat_completion_req
```

Hint: You can compile and run the tests in a single command, useful for local development:

```shell
cmake --build build -j --target llama-server && ./tools/server/tests/tests.sh
```
To see all available arguments, please refer to the pytest documentation.
## Debugging external llama-server

It can sometimes be useful to run the server in a debugger when investigating test failures. To do this, set the environment variable `DEBUG_EXTERNAL=1`, which causes the tests to skip starting a llama-server themselves; the server can instead be started in a debugger.
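A minimal sketch of that switch (the helper name is hypothetical; the real logic lives in the test fixtures):

```python
import os

def use_external_server() -> bool:
    """True when DEBUG_EXTERNAL=1, meaning the tests should NOT spawn
    llama-server themselves and instead expect one already listening on
    the configured host/port (e.g. launched under gdb)."""
    return os.environ.get("DEBUG_EXTERNAL", "0") == "1"
```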
Example using gdb:

```shell
$ gdb --args ../../../build/bin/llama-server \
    --host 127.0.0.1 --port 8080 \
    --temp 0.8 --seed 42 \
    --hf-repo ggml-org/models --hf-file tinyllamas/stories260K.gguf \
    --batch-size 32 --no-slots --alias tinyllama-2 --ctx-size 512 \
    --parallel 2 --n-predict 64
```
A breakpoint can then be set before running:

```console
(gdb) br server.cpp:4604
(gdb) r
main: server is listening on http://127.0.0.1:8080 - starting the main loop
srv  update_slots: all slots are idle
```
The test in question can then be run in another terminal:

```console
(venv) $ env DEBUG_EXTERNAL=1 ./tests.sh unit/test_chat_completion.py -v -x
```

This should trigger the breakpoint and allow inspection of the server state in the debugger terminal.