mirror of
https://github.com/ggml-org/llama.cpp.git
synced 2026-03-17 16:44:07 +00:00
# OpenVINO Backend for llama.cpp

[OpenVINO](https://docs.openvino.ai/) is an open-source toolkit for optimizing and deploying high-performance AI inference, specifically designed for Intel hardware, including CPUs, GPUs, and NPUs, in the cloud, on-premises, and on the edge.

This document describes the [OpenVINO backend for llama.cpp](../../ggml/src/ggml-openvino), which enables hardware-accelerated inference on **Intel® CPUs, GPUs, and NPUs** while remaining compatible with the existing **GGUF model ecosystem**. The backend translates GGML compute graphs into OpenVINO graphs and leverages graph compilation, kernel fusion, and device-specific optimizations to improve inference performance on supported Intel hardware.

The OpenVINO backend is implemented in `ggml/src/ggml-openvino` and provides a translation layer for core GGML operations. It replaces the standard GGML graph execution path with Intel's OpenVINO inference engine, so the same GGUF model file can run on Intel CPUs, Intel GPUs (integrated and discrete), and Intel NPUs without changes to the model or the rest of the llama.cpp stack. When a `ggml_cgraph` is dispatched to the OpenVINO backend, the backend:

- Walks the GGML graph and identifies inputs, outputs, weights, and KV cache tensors.
- Translates the GGML operations into an `ov::Model` using OpenVINO's frontend API.
- Compiles and caches the model for the target device.
- Binds GGML tensor memory to OpenVINO inference tensors and runs inference.

## Supported Devices

The OpenVINO backend supports the following hardware:

- Intel CPUs
- Intel GPUs (integrated and discrete)
- Intel NPUs

Although OpenVINO supports a wide range of [Intel hardware](https://docs.openvino.ai/2026/about-openvino/release-notes-openvino/system-requirements.html), the llama.cpp OpenVINO backend has been validated specifically on AI PCs such as the Intel® Core™ Ultra Series 1 and Series 2.

## Supported Model Precisions

- `FP16`
- `BF16` (on Intel Xeon)
- `Q8_0`
- `Q4_0`
- `Q4_1`
- `Q4_K`
- `Q4_K_M`
- `Q5_K` (converted to Q8_0_C at runtime)
- `Q6_K` (converted to Q8_0_C at runtime)

> [!NOTE]
> Accuracy validation and performance optimizations for quantized models are a work in progress.

## Quantization Support Details

### CPU and GPU

- **`Q4_0`, `Q4_1`, `Q4_K_M`, `Q6_K` models are supported**
- `Q5_K` and `Q6_K` tensors are converted to `Q8_0_C`

### NPU

- **Primary supported quantization scheme is `Q4_0`**
- `Q6_K` tensors are generally requantized to `Q4_0_128`. For embedding weights, `Q6_K` tensors are requantized to `Q8_0_C`, except for the token embedding matrix, which is dequantized to FP16.

### Additional Notes

- Both `Q4_0` and `Q4_1` models use `Q6_K` for the token embedding tensor and the final matmul weight tensor (often the same tensor)
- `Q4_0` models may produce some `Q4_1` tensors if an imatrix is provided during quantization using `llama-quantize`
- `Q4_K_M` models may include both `Q6_K` and `Q5_K` tensors (observed in Phi-3)

## Validated Models

The following models have been validated for functionality on Intel® Core™ Ultra Series 1 and Series 2:

- [Llama-3.2-1B-Instruct-GGUF](https://huggingface.co/unsloth/Llama-3.2-1B-Instruct-GGUF/)
- [Llama-3.1-8B-Instruct](https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF)
- [microsoft/Phi-3-mini-4k-instruct-gguf](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf)
- [Qwen/Qwen2.5-1.5B-Instruct-GGUF](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct-GGUF)
- [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B-GGUF)
- [openbmb/MiniCPM-1B-sft-bf16](https://huggingface.co/openbmb/MiniCPM-S-1B-sft-gguf)
- [tencent/Hunyuan-7B-Instruct](https://huggingface.co/bartowski/tencent_Hunyuan-7B-Instruct-GGUF)
- [mistralai/Mistral-7B-Instruct-v0.3](https://huggingface.co/bartowski/Mistral-7B-Instruct-v0.3-GGUF)
- [bartowski/DeepSeek-R1-Distill-Llama-8B-GGUF](https://huggingface.co/bartowski/DeepSeek-R1-Distill-Llama-8B-GGUF)

## Build Instructions

### Prerequisites

- Linux or Windows system with Intel hardware (CPU, GPU, or NPU)
- **For Intel GPU or NPU Usage**: Install the appropriate hardware drivers for your Intel GPU or NPU. For detailed instructions, see: [Additional Configurations for Hardware Acceleration](https://docs.openvino.ai/2025/get-started/install-openvino/configurations.html).

- **Linux:**
    - Git, CMake, and Ninja are needed for building.
        ```bash
        sudo apt-get update
        sudo apt-get install -y git build-essential libcurl4-openssl-dev libtbb12 cmake ninja-build python3-pip curl wget tar
        ```
    - OpenCL
        ```bash
        sudo apt install ocl-icd-opencl-dev opencl-headers opencl-clhpp-headers intel-opencl-icd
        ```

- **Windows:**
    - Download and install [Microsoft Visual Studio 2022 Build Tools](https://aka.ms/vs/17/release/vs_BuildTools.exe). During installation, select the **"Desktop development with C++"** workload.

    - Install required tools:
        ```powershell
        # Windows PowerShell
        winget install Git.Git
        winget install GNU.Wget
        winget install Ninja-build.Ninja
        ```

    - Install **OpenCL** using **vcpkg**:
        ```powershell
        # Windows PowerShell
        cd C:\
        git clone https://github.com/microsoft/vcpkg
        cd vcpkg
        .\bootstrap-vcpkg.bat
        .\vcpkg install opencl
        # Optional but recommended: Integrate vcpkg with Visual Studio / CMake:
        .\vcpkg integrate install
        ```

### 1. Install OpenVINO Runtime

- Follow the guide to install OpenVINO Runtime from an archive file: [Linux](https://docs.openvino.ai/2026/get-started/install-openvino/install-openvino-archive-linux.html) | [Windows](https://docs.openvino.ai/2026/get-started/install-openvino/install-openvino-archive-windows.html)

- **Linux:**

    <details>
    <summary>📦 Click to expand OpenVINO installation from an archive file on Ubuntu</summary>
    <br>

    ```bash
    wget https://raw.githubusercontent.com/ravi9/misc-scripts/main/openvino/ov-archive-install/install-openvino-from-archive.sh
    chmod +x install-openvino-from-archive.sh
    ./install-openvino-from-archive.sh
    ```

    Verify OpenVINO is initialized properly:
    ```bash
    echo $OpenVINO_DIR
    ```
    </details>

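
A quick check along the following lines can confirm that the OpenVINO environment is active in the current shell before configuring the build. This is a minimal sketch: `check_openvino_env` is a hypothetical helper name, and it assumes `OpenVINO_DIR` is exported once `setupvars.sh` has been sourced, as the verification step above implies.

```bash
# Report whether the OpenVINO environment has been set up in this shell.
# check_openvino_env is a hypothetical helper, not part of OpenVINO or llama.cpp.
check_openvino_env() {
    if [ -z "${OpenVINO_DIR:-}" ]; then
        echo "OpenVINO_DIR is not set; source setupvars.sh first" >&2
        return 1
    fi
    if [ ! -d "$OpenVINO_DIR" ]; then
        echo "OpenVINO_DIR points to a missing directory: $OpenVINO_DIR" >&2
        return 1
    fi
    echo "OpenVINO found at: $OpenVINO_DIR"
}

# Print a hint instead of failing when the environment is not ready.
check_openvino_env || echo "OpenVINO environment not ready"
```

Running the check before `cmake` can surface a missing `setupvars.sh` step early, before the configure step runs.
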
### 2. Build llama.cpp with OpenVINO Backend

Clone llama.cpp and build it with the OpenVINO backend enabled:

```bash
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
```

- **Linux:**
    ```bash
    source /opt/intel/openvino/setupvars.sh
    cmake -B build/ReleaseOV -G Ninja -DCMAKE_BUILD_TYPE=Release -DGGML_OPENVINO=ON
    cmake --build build/ReleaseOV --parallel
    ```

- **Windows:**
    ```cmd
    REM Run in the x64 Native Tools Command Prompt for VS 2022
    "C:\Program Files (x86)\Intel\openvino_2026.0\setupvars.bat"
    cmake -B build\ReleaseOV -G Ninja -DCMAKE_BUILD_TYPE=Release -DGGML_OPENVINO=ON -DLLAMA_CURL=OFF -DCMAKE_TOOLCHAIN_FILE=C:\vcpkg\scripts\buildsystems\vcpkg.cmake
    cmake --build build\ReleaseOV --parallel
    ```
    > [!NOTE]
    > Use the `x64 Native Tools Command Prompt` for the Windows build. After building, you can use either `cmd` or `PowerShell` to run the OpenVINO backend.

### 3. Download Sample Model

Download models for testing:

```bash
# Linux
mkdir -p ~/models/
wget https://huggingface.co/unsloth/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q4_0.gguf \
    -O ~/models/Llama-3.2-1B-Instruct-Q4_0.gguf

# Windows PowerShell
mkdir C:\models
Invoke-WebRequest -Uri https://huggingface.co/unsloth/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q4_0.gguf -OutFile C:\models\Llama-3.2-1B-Instruct-Q4_0.gguf

# Windows Command Line
mkdir C:\models
curl -L https://huggingface.co/unsloth/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q4_0.gguf -o C:\models\Llama-3.2-1B-Instruct-Q4_0.gguf
```

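
An interrupted or redirected download can leave an HTML error page where the model should be. As a hedged sanity check, the sketch below tests whether the file begins with the 4-byte `GGUF` magic that GGUF files start with. `is_gguf` is a hypothetical helper name, and the model path is the one used in the Linux download step above.

```bash
# Succeed only if the file exists and starts with the 4-byte GGUF magic.
# is_gguf is a hypothetical helper, not part of llama.cpp.
is_gguf() {
    file="$1"
    [ -f "$file" ] || return 1
    [ "$(head -c 4 "$file")" = "GGUF" ]
}

# Path from the Linux download step; adjust if the model was saved elsewhere.
model="$HOME/models/Llama-3.2-1B-Instruct-Q4_0.gguf"
if is_gguf "$model"; then
    echo "looks like a GGUF file: $model"
else
    echo "missing or not a GGUF file: $model"
fi
```

This only inspects the header, so it does not detect a truncated download; comparing the file size with the one listed on the model page remains a useful second check.
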
### 4. Run Inference with OpenVINO Backend

When using the OpenVINO backend, the first token may have slightly higher latency because the GGML graph is converted to an OpenVINO graph on the fly. Subsequent tokens and runs will be faster.

```bash
# If device is unset or unavailable, defaults to CPU.
# If the system has multiple GPUs, use GPU.0 or GPU.1 to explicitly target a specific GPU.

# Linux
export GGML_OPENVINO_DEVICE=GPU
# To run llama-simple:
./build/ReleaseOV/bin/llama-simple -m ~/models/Llama-3.2-1B-Instruct-Q4_0.gguf -n 50 "The story of AI is "
# To run in chat mode:
./build/ReleaseOV/bin/llama-cli -m ~/models/Llama-3.2-1B-Instruct-Q4_0.gguf

# Windows Command Line
set GGML_OPENVINO_DEVICE=GPU
# Windows PowerShell
$env:GGML_OPENVINO_DEVICE = "GPU"

# To run llama-simple:
build\ReleaseOV\bin\llama-simple.exe -m "C:\models\Llama-3.2-1B-Instruct-Q4_0.gguf" -n 50 "The story of AI is "
# To run in chat mode:
build\ReleaseOV\bin\llama-cli.exe -m "C:\models\Llama-3.2-1B-Instruct-Q4_0.gguf"
```
> [!NOTE]
> On systems with multiple GPUs, use `GPU.0` or `GPU.1` to explicitly target a specific GPU. See [OpenVINO GPU Device](https://docs.openvino.ai/2026/openvino-workflow/running-inference/inference-devices-and-modes/gpu-device.html) for more details.

### Docker Build

You can build and run llama.cpp with the OpenVINO backend using Docker.

```bash
# Build the base runtime image with compiled shared libraries and minimal dependencies.
docker build -t llama-openvino:base -f .devops/openvino.Dockerfile .

# Build the complete image with all binaries, Python tools, gguf-py library, and model conversion utilities.
docker build --target=full -t llama-openvino:full -f .devops/openvino.Dockerfile .

# Build a minimal CLI-only image containing just the llama-cli executable.
docker build --target=light -t llama-openvino:light -f .devops/openvino.Dockerfile .

# Build a server-only image with llama-server executable, health check endpoint, and REST API support.
docker build --target=server -t llama-openvino:server -f .devops/openvino.Dockerfile .

# If you are behind a proxy:
docker build --build-arg http_proxy=$http_proxy --build-arg https_proxy=$https_proxy --target=light -t llama-openvino:light -f .devops/openvino.Dockerfile .
```

Run llama.cpp in the OpenVINO backend Docker container.
Save sample models in `~/models` as [shown above](#3-download-sample-model). The directory is mounted into the container in the examples below.

```bash
# Run Docker container
docker run --rm -it -v ~/models:/models llama-openvino:light --no-warmup -m /models/Llama-3.2-1B-Instruct-Q4_0.gguf

# With Intel GPU access (iGPU or dGPU)
docker run --rm -it -v ~/models:/models \
--device=/dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) -u $(id -u):$(id -g) \
llama-openvino:light --no-warmup -m /models/Llama-3.2-1B-Instruct-Q4_0.gguf

# With Intel NPU access
docker run --rm -it --env GGML_OPENVINO_DEVICE=NPU -v ~/models:/models \
--device=/dev/accel --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) -u $(id -u):$(id -g) \
llama-openvino:light --no-warmup -m /models/Llama-3.2-1B-Instruct-Q4_0.gguf
```

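
Before passing devices into a container, it can help to confirm that the corresponding nodes exist on the host. The sketch below assumes the usual Linux device paths (`/dev/dri/renderD*` for GPU render nodes and `/dev/accel/accel*` for NPUs); `list_accel_devices` is a hypothetical helper name.

```bash
# List GPU render nodes and NPU accel nodes present on the host.
# list_accel_devices is a hypothetical helper; the device paths are assumptions
# based on the --device arguments used in the docker run examples.
list_accel_devices() {
    found=0
    for node in /dev/dri/renderD* /dev/accel/accel*; do
        # An unmatched glob stays literal, so check that the path really exists.
        if [ -e "$node" ]; then
            echo "$node"
            found=1
        fi
    done
    [ "$found" -eq 1 ] || echo "no GPU or NPU device nodes found"
    return 0
}

list_accel_devices
```

If nothing is listed, the driver for the device is probably not installed on the host, and the `--device` arguments above have nothing to pass through.
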
Run the llama.cpp server with the OpenVINO backend:
```bash
# Run the Server Docker container
docker run --rm -it -p 8080:8080 -v ~/models:/models llama-openvino:server --no-warmup -m /models/Llama-3.2-1B-Instruct-Q4_0.gguf

# In a NEW terminal, test the server with curl

# If you are behind a proxy, make sure to set NO_PROXY to avoid proxy for localhost
export NO_PROXY=localhost,127.0.0.1

# Test health endpoint
curl -f http://localhost:8080/health

# Test with a simple prompt
curl -X POST "http://localhost:8080/v1/chat/completions" -H "Content-Type: application/json" \
-d '{"messages":[{"role":"user","content":"Write a poem about OpenVINO"}],"max_tokens":100}' | jq .
```

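
The server may take a while to load the model before the health endpoint responds, so scripted tests usually need to wait. The following is a minimal sketch of such a wait loop built on the same `curl -f` health check; `wait_for_health` and its default attempt count and delay are hypothetical choices.

```bash
# Poll a health endpoint until it responds successfully or attempts run out.
# wait_for_health is a hypothetical helper. Arguments: URL, attempts, delay in seconds.
wait_for_health() {
    url="$1"
    attempts="${2:-30}"
    delay="${3:-2}"
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # -s silences progress output, -f makes curl fail on HTTP error statuses.
        if curl -sf --max-time 5 "$url" >/dev/null 2>&1; then
            echo "server is ready"
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    echo "server did not become ready after $attempts attempts" >&2
    return 1
}

# Example: wait_for_health http://localhost:8080/health
```

A script can then run `wait_for_health http://localhost:8080/health && curl ...` so that the chat request is only sent once the server is up.
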
## Runtime Configuration

The OpenVINO backend can be configured using the following environment variables at runtime to control device selection, caching, debugging, and profiling behavior.

### Configuration Options

| Variable | Default | Description |
|----------|---------|-------------|
| `GGML_OPENVINO_DEVICE` | `CPU` | Specify the target device (CPU, GPU, NPU). On systems with multiple GPUs, use `GPU.0` or `GPU.1` to explicitly target a specific GPU. See [OpenVINO GPU Device](https://docs.openvino.ai/2026/openvino-workflow/running-inference/inference-devices-and-modes/gpu-device.html). When set to **NPU**, static compilation mode is enabled for optimal performance. |
| `GGML_OPENVINO_CACHE_DIR` | not set | Directory for OpenVINO model caching (recommended: `/tmp/ov_cache`). Enables model caching when set. **Not supported on NPU devices.** |
| `GGML_OPENVINO_PREFILL_CHUNK_SIZE` | `256` | Token chunk size for **NPU** prefill. |
| `GGML_OPENVINO_STATEFUL_EXECUTION` | `0` | Enable stateful KV cache for better performance. Recommended on CPU and GPU. |
| `GGML_OPENVINO_PROFILING` | `0` | Enable execution-time profiling. |
| `GGML_OPENVINO_DUMP_CGRAPH` | `0` | Dump the GGML compute graph to `cgraph_ov.txt`. |
| `GGML_OPENVINO_DUMP_IR` | `0` | Serialize OpenVINO IR files with timestamps. |
| `GGML_OPENVINO_DEBUG_INPUT` | `0` | Enable input debugging and print input tensor info. |
| `GGML_OPENVINO_DEBUG_OUTPUT` | `0` | Enable output debugging and print output tensor info. |
| `GGML_OPENVINO_PRINT_CGRAPH_TENSOR_ADDRESS` | `0` | Print tensor address map once. |

> [!NOTE]
> `GGML_OPENVINO_STATEFUL_EXECUTION` is an **experimental** feature that manages the KV cache inside the OpenVINO model, which improves performance on CPUs and GPUs. It is not effective on NPUs, and not all models support it yet. It has been validated only with llama-simple, llama-cli, llama-bench, and llama-run, where enabling it is recommended for the best performance. Other applications, such as llama-server and llama-perplexity, are not yet supported.

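
Some of these settings interact, for example model caching and stateful execution do not apply to NPU. The sketch below shows one way to sanity-check a configuration before launching a tool. `check_ov_config` is a hypothetical helper that only encodes the rules stated in the table and note above; it is not part of the backend, and the backend itself falls back to CPU when the device is unset or unavailable.

```bash
# Print warnings for GGML_OPENVINO_* combinations the documentation flags.
# check_ov_config is a hypothetical helper, not part of llama.cpp.
check_ov_config() {
    device="${GGML_OPENVINO_DEVICE:-CPU}"
    case "$device" in
        # Accept CPU, GPU, NPU, and GPU.<digit>... such as GPU.0 or GPU.1.
        CPU|GPU|NPU|GPU.[0-9]*) ;;
        *) echo "warning: unrecognized device '$device'" ;;
    esac
    if [ "$device" = "NPU" ]; then
        [ -z "${GGML_OPENVINO_CACHE_DIR:-}" ] ||
            echo "warning: model caching is not supported on NPU"
        [ "${GGML_OPENVINO_STATEFUL_EXECUTION:-0}" = "0" ] ||
            echo "warning: stateful execution is not effective on NPU"
    fi
    echo "device: $device"
}

check_ov_config
```

Calling the helper right before a `llama-cli` or `llama-bench` invocation prints the effective device and any warnings for the current environment.
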
### Example Usage

#### GPU Inference with Profiling

```bash
# If the system has multiple GPUs, use GPU.0 or GPU.1 to explicitly target a specific GPU.

# Linux
export GGML_OPENVINO_CACHE_DIR=/tmp/ov_cache
export GGML_OPENVINO_PROFILING=1
export GGML_OPENVINO_DEVICE=GPU

./build/ReleaseOV/bin/llama-simple -m ~/models/Llama-3.2-1B-Instruct-Q4_0.gguf -n 50 "The story of AI is "

# Windows Command Line
set GGML_OPENVINO_CACHE_DIR=C:\tmp\ov_cache
set GGML_OPENVINO_PROFILING=1
set GGML_OPENVINO_DEVICE=GPU

# Windows PowerShell
$env:GGML_OPENVINO_CACHE_DIR = "C:\tmp\ov_cache"
$env:GGML_OPENVINO_PROFILING = "1"
$env:GGML_OPENVINO_DEVICE = "GPU"

build\ReleaseOV\bin\llama-simple.exe -m "C:\models\Llama-3.2-1B-Instruct-Q4_0.gguf" -n 50 "The story of AI is "
```

#### llama-bench

```bash
# -fa 1 is required when running llama-bench with the OpenVINO backend.
GGML_OPENVINO_DEVICE=GPU ./build/ReleaseOV/bin/llama-bench -m ~/models/Llama-3.2-1B-Instruct-Q4_0.gguf -fa 1
```

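
To compare devices, the same benchmark can be repeated with a different `GGML_OPENVINO_DEVICE` each time. The following is a sketch under assumptions: `bench_devices` is a hypothetical wrapper, and the default binary and model paths are the ones used elsewhere in this document; both can be overridden through the `BENCH` and `MODEL` variables.

```bash
# Run llama-bench once per device name given as an argument.
# bench_devices is a hypothetical wrapper; MODEL and BENCH may be overridden.
bench_devices() {
    model="${MODEL:-$HOME/models/Llama-3.2-1B-Instruct-Q4_0.gguf}"
    bench="${BENCH:-./build/ReleaseOV/bin/llama-bench}"
    for device in "$@"; do
        if [ -x "$bench" ]; then
            echo "== $device =="
            # -fa 1 is required with the OpenVINO backend (see above).
            GGML_OPENVINO_DEVICE="$device" "$bench" -m "$model" -fa 1
        else
            echo "skipping $device: $bench not found"
        fi
    done
}

bench_devices CPU GPU
```

Keeping the model and flags fixed across runs means that the selected device is the only thing that changes between results.
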
### NPU Notes

- Model caching is not yet supported
- `llama-server` with `-np` greater than 1 (multiple parallel sequences) is not supported
- `llama-perplexity` is supported only with `-b 512` or smaller

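
When a script targets several devices, the batch-size limit for `llama-perplexity` on NPU can be applied with a small guard. This is a minimal sketch: `npu_batch` is a hypothetical helper, and the text file path in the commented example is a placeholder.

```bash
# Cap a requested llama-perplexity batch size at the NPU limit of 512.
# npu_batch is a hypothetical helper, not part of llama.cpp.
npu_batch() {
    requested="$1"
    if [ "$requested" -gt 512 ]; then
        echo 512
    else
        echo "$requested"
    fi
}

# Example invocation (the text file path is a placeholder):
# GGML_OPENVINO_DEVICE=NPU ./build/ReleaseOV/bin/llama-perplexity \
#     -m ~/models/Llama-3.2-1B-Instruct-Q4_0.gguf -f /path/to/text.txt -b "$(npu_batch 2048)"
echo "batch size for NPU: $(npu_batch 2048)"
```
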
## Llama.cpp Tools

The following tools work with the OpenVINO backend on CPU, GPU, and NPU:
- llama-simple
- llama-run
- llama-cli
- llama-server
- llama-bench
- llama-perplexity

## Work in Progress

- Performance and memory optimizations
- Accuracy validation
- Broader quantization coverage
- Support for additional model architectures