mirror of
https://github.com/ggml-org/llama.cpp.git
synced 2026-03-17 16:44:07 +00:00
* Update build doc * Add cgraph tensor output name to OV op name * Update openvino build instructions * Add initial NPU support * draft NPU support version 2: prefill + kvcache * NPU support version 2: prefill + kvcache * Change due to ggml cgraph changes, not correct yet * Change due to ggml cgraph changes, llama-3.2 CPU work * Add AMD64 to CMakeLists * Change due to ggml cgraph changes, all device work * Refactor: clean, fix warning * Update clang-format * Statful transformation for CPU GPU * Add SwiGLU * Fuse to SDPA * Replace Concat with Broadcast in MulMat for GQA * Pull out indices creation for kv cache update * Refactor: remove past_token_len from extra_inputs * Fix Phi3 SwiGLU and SoftMax * Pull out sin cos from rope * Reduce memory: free ov weights node after graph conversion * Fix CPY due to cgraph change * Added OpenVINO CI/CD. Updated docs * Fix llama-cli * Fix Phi3 ROPE; Add test-backend-ops * Fix NPU * Fix llama-bench; Clang-format * Fix llama-perplexity * temp. changes for mark decomp * matmul in fp32 * mulmat input conversion fix * mulmat type conversion update * add mark decomp pass * Revert changes in fuse_to_sdpa * Update build.md * Fix test-backend-ops * Skip test-thread-safety; Run ctest only in ci/run.sh * Use CiD for NPU * Optimize tensor conversion, improve TTFT * Support op SET_ROWS * Fix NPU * Remove CPY * Fix test-backend-ops * Minor updates for raising PR * Perf: RMS fused to OV internal RMS op * Fix after rebasing - Layout of cache k and cache v are unified: [seq, n_head, head_size] - Add CPY and FLASH_ATTN_EXT, flash attn is not used yet - Skip test-backend-ops due to flash attn test crash - Add mutex around graph conversion to avoid test-thread-safety fali in the future - Update NPU config - Update GPU config to disable SDPA opt to make phi-3 run * Change openvino device_type to GPU; Enable flash_attn * Update supports_buft and supports_op for quantized models * Add quant weight conversion functions from genai gguf reader * Quant 
models run with accuracy issue * Fix accuracy: disable cpu_repack * Fix CI; Disable test-backend-ops * Fix Q4_1 * Fix test-backend-ops: Treat quantized tensors as weights * Add NPU Q4_0 support * NPU perf: eliminate zp * Dequantize q4_1 q4_k q6_k for NPU * Add custom quant type: q8_1_c, q4_0_128 * Set m_is_static=false as default in decoder * Simpilfy translation of get_rows * Fix after rebasing * Improve debug util; Eliminate nop ReshapeReshape * STYLE: make get_types_to_requant a function * Support BF16 model * Fix NPU compile * WA for npu 1st token acc issue * Apply EliminateZP only for npu * Add GeGLU * Fix Hunyuan * Support iSWA * Fix NPU accuracy * Fix ROPE accuracy when freq_scale != 1 * Minor: not add attention_size_swa for non-swa model * Minor refactor * Add Q5_K to support phi-3-q4_k_m * Requantize Q6_K (gs16) to gs32 on GPU * Fix after rebasing * Always apply Eliminate_ZP to fix GPU compile issue on some platforms * kvcachefusion support * env variable GGML_OPENVINO_DISABLE_SDPA_OPTIMIZATION added * Fix for Phi3 * Fix llama-cli (need to run with --no-warmup) * Fix add_sliced_mask; Revert mulmat, softmax; Remove input attention_size, iSWA model not working * fix after rebasing * Fix llama-3-8b and phi3-mini q4_0 NPU * Update to OV-2025.3 and CMakeLists.txt * Add OV CI cache * Apply CISC review and update CI to OV2025.3 * Update CI to run OV dep install before build * Update OV dockerfile to use OV2025.3 and update build docs * Style: use switch in supports_ops * Style: middle ptr and ref align, omit optional struct keyword * NPU Unify PD (#14) * Stateless. Fix llama-cli llama-server * Simplify broadcast op in attention * Replace get_output_tensor+memcpy with set_output_tensor * NPU unify PD. 
Unify dynamic and static dims * Clean placeholders in ggml-openvino.cpp * NPU unify PD (handled internally) * change graph to 4d, support multi sequences * Fix llama-bench * Fix NPU * Update ggml-decoder.cpp Hitting error while compiling on windows: error C3861: 'unsetenv': identifier not found Reason: unsetenv() is a POSIX function; it doesn’t exist on Windows. Visual Studio (MSVC) won’t recognize it. Proposed fix: Use _putenv_s() (Windows equivalent) This is supported by MSVC and achieves the same effect: it removes the environment variable from the process environment. This keeps cross-platform compatibility. * Update ggml-decoder.cpp * Update ggml-decoder.cpp * Update ggml-decoder.cpp * Update ggml-decoder.cpp * Update ggml-decoder.cpp * Remove the second decoder for node. Moving the function into the model decoder * Fix error for naive * NPU prefill chunking * NPU fix llama-bench * fallback naive run with accuracy issue * NPU support llma-perplexity -b 512 --no-warmup * Refactor: split ov_graph_compute for dynamic and static * remove unused API GgmlOvDecoder::get_output_stride(const std::string & name) * minor update due to ov 2025.4 * remove unused API GgmlOvDecoder::get_output_names() * remove unused API get_output_shape(const std::string & name) * Modified API GgmlOvDecoder::get_output_type(const std::string & name) * Removed API GgmlOvDecoder::get_output_op_params(const std::string & name) * Removed API get_output_ggml_tensor(const std::string & name) * Removed API m_outputs * Removed m_output_names * Removed API GgmlOvDecoder::get_input_names() * Removed API GgmlOvDecoder::get_input_stride(const std::string& name) * Removed API get_input_type * Removed API get_input_type * Removed API GgmlOvDecoder::get_input_shape(const std::string & name) * Removed API GgmlOvDecoder::get_input_op_params(const std::string & name) * Fix error for decoder cache * Reuse cached decoder * GPU remove Q6_K requantization * NPU fix wrong model output shape * NPU fix q4 perf 
regression * Remove unused variable nodes * Fix decoder can_reuse for llama-bench * Update build.md for Windows * backend buffer: allocate on host * Use shared_buffer for GPU NPU; Refactor * Add ov_backend_host_buffer; Use cached remote context * Put kvcache on GPU * Use ggml_aligned_malloc * only use remote tensor for kvcache * only use remote tensor for kvcache for GPU * FIX: use remote tensor from singleton * Update build.md to include OpenCL * NPU always requant to q4_0_128 * Optimize symmetric quant weight extraction: use single zp * Use Q8_0_C in token embd, lm_head, and for 5 and 6 bits quant * Update build.md * Support -ctk f32 * Initial stateful graph support * Update ggml/src/ggml-openvino/ggml-decoder.cpp Co-authored-by: Yamini Nimmagadda <yamini.nimmagadda@intel.com> * code cleanup * npu perf fix * requant to f16 for Q6 embed on NPU * Update ggml/src/ggml-openvino/ggml-decoder.cpp * Update ggml/src/ggml-openvino/ggml-openvino-extra.cpp * Create OPENVINO.md in llama.cpp backend docs * Update OPENVINO.md * Update OPENVINO.md * Update OPENVINO.md * Update build.md * Update OPENVINO.md * Update OPENVINO.md * Update OPENVINO.md * kq_mask naming fix * Syntax correction for workflows build file * Change ov backend buffer is_host to false * Fix llama-bench -p -n where p<=256 * Fix --direct-io 0 * Don't put kvcache on GPU in stateful mode * Remove hardcode names * Fix stateful shapes * Simplification for stateful and update output shape processing * Remove hardcode names * Avoid re-compilation in llama-bench * Extract zp directly instead of bias * Refactor weight tensor processing * create_weight_node accept non-ov backend buffer * remove changes in llama-graph.cpp * stateful masking fix (#38) Fix for stateful accuracy issues and cl_out_of_resources error in stateful GPU with larger context sizes. 
* Fix test-backend-ops crash glu, get_rows, scale, rms_norm, add * hardcoded name handling for rope_freqs.weight * Suppress logging and add error handling to allow test-backend-ops to complete * Fix MUL_MAT with broadcast; Add unsupported MUL_MAT FLASH_ATTN cases * Use bias instead of zp in test-backend-ops * Update OV in CI, Add OV CI Tests in GH Actions * Temp fix for multithreading bug * Update OV CI, fix review suggestions. * fix editorconfig-checker, update docs * Fix tabs to spaces for editorconfig-checker * fix editorconfig-checker * Update docs * updated model link to be GGUF model links * Remove GGML_CPU_REPACK=OFF * Skip permuted ADD and MUL * Removed static variables from utils.cpp * Removed initializing non-existing variable * Remove unused structs * Fix test-backend-ops for OV GPU * unify api calling * Update utils.cpp * When the dim is dynamic, throw an error, need to is stastic forst * Add interface compute_model_outputs(), which get the model output through computing the node use count & status in the cgraph to avoid the flag using * No need to return * Fix test-backend-ops for OV GPU LNL * Fix test-thread-safety * use the shape from infer request of output tensor create to avoid issue * fix dynamic output shape issue * fix issue for the unused node in tests * Remove unused lock * Add comment * Update openvino docs * update to OV release version 2026.0 * add ci ov-gpu self hosted runner * fix editorconfig * Fix perplexity * Rewrite the model inputs finding mechanism (#54) * Rewrite the model inputs finding logistic * Put stateful shape handle in get input shape * Put the iteration logistic in func * Added ggml-ci-intel-openvino-gpu and doc update * .hpp files converted to .h * fix ggml-ci-x64-intel-openvino-gpu * Fix for stateful execution bug in llama-bench * Minor updates after stateful llama-bench fix * Update ggml/src/ggml-openvino/utils.cpp Co-authored-by: Yamini Nimmagadda <yamini.nimmagadda@intel.com> * Remove multiple get_shape calls * Bring 
back mutex into compute * Fix VIEW op, which slice the input node * Added token_len_per_seq existence check before slicing masks and moved node retrieval inside guarded block to prevent missing-key access * Temp. fix for test requant errors * Update to OV ggml-ci to low-perf * ci : temporary disable "test-llama-archs" * ci : cache v4 -> v5, checkout v4 -> v6, fix runner tag * docs : update url * Fix OV link in docker and Update docs --------- Co-authored-by: Ravi Panchumarthy <ravi.panchumarthy@intel.com> Co-authored-by: Cavus Mustafa <mustafa.cavus@intel.com> Co-authored-by: Arshath <arshath.ramzan@intel.com> Co-authored-by: XuejunZhai <Xuejun.Zhai@intel.com> Co-authored-by: Yamini Nimmagadda <yamini.nimmagadda@intel.com> Co-authored-by: Xuejun Zhai <Xuejun.Zhai@intel> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
295 lines
11 KiB
C++
#pragma once

#include "ggml-quants.h"
#include "ggml.h"
#include "openvino/decoder.h"

#include <algorithm>
#include <cstdint>
#include <cstring>
#include <functional>
#include <map>
#include <memory>
#include <openvino/core/partial_shape.hpp>
#include <optional>
#include <string>
#include <utility>
#include <vector>

struct ModelParams {
    int ctx = -1;
    int ctx_swa = -1;
    int ctx_per_seq = -1;
    int ctx_per_seq_swa = -1;
    int n_seq = 1;
    int n_heads = -1;
    int n_heads_kv = -1;
    int head_size = -1;
    int32_t rope_params[15];
    std::vector<int> swa_layers;

    std::vector<std::string> kv_names;
    size_t kv_buffer_ctx_id = 0;

    bool same_rope_params(const ModelParams & other) const {
        return memcmp(rope_params, other.rope_params, sizeof(int32_t) * 15) == 0;
    }

    bool can_reuse_dynamically(const ModelParams & other) const { return same_rope_params(other); }

    bool can_reuse_statically(const ModelParams & other) const { return same_rope_params(other) && ctx == other.ctx; }

    bool kv_buffer_changed(const ModelParams & other) const { return kv_buffer_ctx_id != other.kv_buffer_ctx_id; }
};
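
// Illustration only (not part of the header): a minimal sketch of how the reuse predicates in
// ModelParams could drive a "keep or rebuild the cached decoder" decision. ReuseKeySketch,
// can_reuse_sketch and reuse_sketch_demo are hypothetical names, and the way the three checks
// are combined is an assumption; this header does not show how the backend actually calls them.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical, trimmed-down copy of ModelParams: only the fields the reuse checks read.
struct ReuseKeySketch {
    int     ctx              = -1;
    int32_t rope_params[15]  = {};  // zero-initialised here so memcmp compares defined bytes
    size_t  kv_buffer_ctx_id = 0;

    bool same_rope_params(const ReuseKeySketch & other) const {
        return std::memcmp(rope_params, other.rope_params, sizeof(rope_params)) == 0;
    }

    // Dynamic shapes: only the RoPE parameters have to match.
    bool can_reuse_dynamically(const ReuseKeySketch & other) const { return same_rope_params(other); }

    // Static shapes: the context size has to match as well.
    bool can_reuse_statically(const ReuseKeySketch & other) const {
        return same_rope_params(other) && ctx == other.ctx;
    }

    bool kv_buffer_changed(const ReuseKeySketch & other) const {
        return kv_buffer_ctx_id != other.kv_buffer_ctx_id;
    }
};

// Assumed policy: a changed KV buffer always forces a rebuild; otherwise the check
// depends on whether the graph was compiled with static or dynamic shapes.
inline bool can_reuse_sketch(const ReuseKeySketch & cached, const ReuseKeySketch & incoming, bool is_static) {
    if (cached.kv_buffer_changed(incoming)) {
        return false;
    }
    return is_static ? cached.can_reuse_statically(incoming) : cached.can_reuse_dynamically(incoming);
}

// Small usage example: same RoPE parameters, different context size.
inline void reuse_sketch_demo() {
    ReuseKeySketch cached;
    ReuseKeySketch incoming;
    cached.ctx   = 4096;
    incoming.ctx = 8192;
    assert(can_reuse_sketch(cached, incoming, /*is_static=*/false));
    assert(!can_reuse_sketch(cached, incoming, /*is_static=*/true));
}
```

// In this sketch a context-size change invalidates only the static-shape case, which mirrors
// the difference between can_reuse_dynamically() and can_reuse_statically() above.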

struct ComputeParams {
    int n_seq_active = 1;
    int seq_active_start = 0;
    int attention_size = -1;
    int attention_size_swa = -1;
    int input_len = -1;
    int token_len_per_seq = -1;
    int past_kv_len = -1;
    int output_len = 1;
};

class GgmlOvDecoder : public ov::frontend::ggml::GgmlDecoder {
public:
    struct NodeInfo {
        ggml_tensor * node;
        std::string node_name;
        std::string node_op_type;
        std::map<std::string, ggml_tensor *> node_inputs;
        std::vector<std::string> node_inputs_names;
        ggml_tensor * node_output;
        std::string node_output_name;
        int node_op_case = 0;
        void * data_addr;
    };
    // Graph decoder
    GgmlOvDecoder(ggml_cgraph * cgraph,
                  ModelParams & model_params,
                  ComputeParams & compute_params,
                  std::map<std::string, std::shared_ptr<ov::Node>> & model_weights,
                  bool is_static,
                  bool is_stateful = false,
                  bool is_prefill = false,
                  int prefill_chunk_size = 256);

    // Naive graph decoder
    GgmlOvDecoder(ggml_cgraph * cgraph, std::map<std::string, std::shared_ptr<ov::Node>> & model_weights);

    virtual ov::Any get_attribute(const std::string & name) const override {
        return nullptr;
        GGML_UNUSED(name);
    }

    virtual ov::PartialShape get_input_shape(int node_idx, const std::string & name) const override;

    virtual std::vector<size_t> get_input_stride(int node_idx, const std::string & name) const override;

    virtual ov::element::Type get_input_type(int node_idx, const std::string & name) const override;

    virtual size_t get_input_size() const override;

    virtual size_t get_input_size(int node_idx) const override;

    virtual void get_input_node(size_t input_port_idx,
                                std::string & producer_name,
                                std::string & producer_output_port_name,
                                size_t & producer_output_port_index) const override {
        GGML_UNUSED(input_port_idx);
        GGML_UNUSED(producer_name);
        GGML_UNUSED(producer_output_port_name);
        GGML_UNUSED(producer_output_port_index);
    }

    virtual std::vector<std::string> get_input_names(int node_idx) const override;

    virtual ov::PartialShape get_output_shape(int node_idx) const override;

    virtual ov::element::Type get_output_type(int node_idx) const override;

    virtual int32_t * get_input_op_params(int node_idx, const std::string & name) const override;

    virtual int32_t * get_output_op_params(int node_idx) const override;

    virtual std::vector<std::string> get_output_names(int node_idx) const override;

    virtual const std::string & get_op_type() const override;

    virtual const std::string & get_op_type(int node_idx) const override;

    virtual const std::string & get_op_name() const override;

    virtual const std::string & get_op_name(int node_idx) const override;

    virtual void visit_subgraph(std::function<void(std::shared_ptr<GgmlDecoder>, int node_idx)> node_visitor) const override;

    ggml_tensor * get_input_ggml_tensor(const std::string & name) const { return m_inputs.at(name); }

    virtual int get_op_case(int node_idx) const override { return m_node_info_list[node_idx].node_op_case; }

    virtual const std::map<std::string, std::shared_ptr<ov::Node>> & get_model_inputs() const override {
        return m_model_inputs;
    }

    virtual const std::map<std::string, std::shared_ptr<ov::Node>> & get_model_extra_inputs() const override {
        return m_model_extra_inputs;
    }

    virtual const std::map<std::string, std::shared_ptr<ov::Tensor>> & get_model_extra_input_values() const {
        return m_model_extra_input_values;
    }

    virtual const std::map<std::string, std::shared_ptr<ov::Node>> & get_model_weights() const override {
        return m_model_weights;
    }

    virtual std::vector<std::string> get_model_output_names() const override {
        return m_model_output_names;
    }

    const std::map<std::string, ggml_tensor *> & get_model_outputs() const { return m_model_outputs; }

    virtual int get_ctx_size() const { return m_model_params.ctx; }

    virtual int get_ctx_swa_size() const { return m_model_params.ctx_swa; }

    virtual int get_ctx_per_seq() const { return m_model_params.ctx_per_seq; }

    virtual int get_ctx_per_seq_swa() const { return m_model_params.ctx_per_seq_swa; }

    virtual int get_n_seq() const { return m_model_params.n_seq; }

    virtual int is_swa_layer(int layer) const override {
        return std::find(m_model_params.swa_layers.begin(), m_model_params.swa_layers.end(), layer) !=
               m_model_params.swa_layers.end();
    }

    int get_past_kv_len() const { return m_compute_params.past_kv_len; }

    int get_input_len() const { return m_compute_params.input_len; }

    virtual int32_t * get_rope_params() const override { return const_cast<int32_t *>(m_model_params.rope_params); }

    virtual std::map<std::string, std::string> get_kv_param_res_names() const override;

    virtual bool is_static() const override { return m_is_static; }

    virtual bool is_stateful() const override { return m_is_stateful; }

    ov::PartialShape get_graph_input_shape(const ggml_tensor * op, const ggml_tensor * input) const;

    static void dump_cgraph(const ggml_cgraph * cgraph, std::string & filename);

    static std::shared_ptr<ov::Node> create_weight_node(ggml_tensor * tensor, bool naive = false);

    static std::map<std::string, std::shared_ptr<ov::Node>> create_weight_nodes(ggml_cgraph * cgraph,
                                                                                bool naive = false);

    const ggml_tensor * get_tensor_used_op(const ggml_tensor * tensor) const;

    const ggml_tensor * get_tensor_from_name(const std::string & name) const;

    void clear_model_weights() { m_model_weights.clear(); }

    static std::pair<ModelParams, ComputeParams> compute_llm_params(ggml_cgraph * cgraph, bool is_static);

    ModelParams get_model_params() const { return m_model_params; }

    ComputeParams get_compute_params() const { return m_compute_params; }

    void set_model_params(const ModelParams & model_params) { m_model_params = model_params; }

    void set_compute_params(const ComputeParams & compute_params) { m_compute_params = compute_params; }

    bool m_is_static = false;
    bool m_is_stateful = false;
    bool m_is_prefill = false;
    bool m_naive = false;
    int m_prefill_chunk_size = 0;

    static ov::Shape get_shape(const ggml_tensor * tensor);
    static std::vector<size_t> get_stride(const ggml_tensor * tensor);
    static ov::element::Type get_ov_type(const ggml_tensor * tensor);
    static std::string compute_op_type(const ggml_tensor * node);
    void add_extra_inputs();

    void update_io(ggml_cgraph * cgraph);

    // Token ids: the src[1] operand of a GET_ROWS whose src[0] has no producing op (GGML_OP_NONE).
    inline static bool is_inp_tok(const ggml_tensor * tensor, const ggml_tensor * op) {
        return op->op == GGML_OP_GET_ROWS && tensor == op->src[1] && op->src[0]->op == GGML_OP_NONE;
    }

    // Positions: the src[1] operand of a ROPE op.
    inline static bool is_inp_pos(const ggml_tensor * tensor, const ggml_tensor * op) {
        return op->op == GGML_OP_ROPE && tensor == op->src[1];
    }

    // Embeddings: a tensor produced by GET_ROWS and consumed by RMS_NORM.
    inline static bool is_inp_emb(const ggml_tensor * tensor, const ggml_tensor * op) {
        return tensor->op == GGML_OP_GET_ROWS && op->op == GGML_OP_RMS_NORM;
    }

    // Attention mask: any operand of a CPY op, or the src[3] operand of FLASH_ATTN_EXT.
    inline static bool is_inp_mask(const ggml_tensor * tensor, const ggml_tensor * op) {
        return op->op == GGML_OP_CPY || (op->op == GGML_OP_FLASH_ATTN_EXT && tensor == op->src[3]);
    }

    // RoPE frequency factors: the src[2] operand of a ROPE op.
    inline static bool is_rope_freqs_weight(const ggml_tensor * tensor, const ggml_tensor * op) {
        return op->op == GGML_OP_ROPE && tensor == op->src[2];
    }

    // KV cache: the src[2] operand of a SET_ROWS op.
    inline static bool is_kvcache(const ggml_tensor * tensor, const ggml_tensor * op) {
        return op->op == GGML_OP_SET_ROWS && op->src[2] == tensor;
    }

    // KV cache row indices: the src[1] operand of a SET_ROWS op.
    inline static bool is_kv_idx(const ggml_tensor * tensor, const ggml_tensor * op) {
        return op->op == GGML_OP_SET_ROWS && op->src[1] == tensor;
    }

    // Output row indices: the src[1] operand of a GET_ROWS whose src[0] is produced by another op.
    inline static bool is_output_idx(const ggml_tensor * tensor, const ggml_tensor * op) {
        return op->op == GGML_OP_GET_ROWS && tensor == op->src[1] && op->src[0]->op != GGML_OP_NONE;
    }

    // Maps a graph input to its OpenVINO input name; falls back to the ggml tensor name.
    static std::string get_graph_input_ov_name(const ggml_tensor * tensor, const ggml_tensor * op) {
        if (is_inp_tok(tensor, op)) {
            return "inp_tokens";
        }
        if (is_inp_pos(tensor, op)) {
            return "inp_pos";
        }
        if (is_inp_emb(tensor, op)) {
            return "embd";
        }
        if (is_output_idx(tensor, op)) {
            return "inp_out_ids";
        }
        if (is_inp_mask(tensor, op)) {
            return std::string(tensor->name).find("swa") == std::string::npos ? "self_kq_mask" : "self_kq_mask_swa";
        }
        return tensor->name;
    }

private:
    void set_input_output();
    int compute_op_case(const ggml_tensor * node) const;
    bool node_is_used_as_src(const int node_idx);
    void compute_model_inputs();
    void compute_model_outputs();

    void validate_cgraph() const;

    ggml_cgraph * m_cgraph = nullptr;
    std::map<std::string, ggml_tensor *> m_inputs;

    std::map<std::string, std::shared_ptr<ov::Node>> m_model_inputs;
    std::map<std::string, std::shared_ptr<ov::Node>> m_model_extra_inputs;
    std::map<std::string, std::shared_ptr<ov::Tensor>> m_model_extra_input_values;
    std::map<std::string, std::shared_ptr<ov::Node>> m_model_weights;
    std::map<std::string, ggml_tensor *> m_model_outputs;
    std::vector<std::string> m_model_output_names;
    std::vector<NodeInfo> m_node_info_list;

    ModelParams m_model_params;
    ComputeParams m_compute_params;
};

void print_tensor_address_map(const ggml_cgraph * cgraph);

int extract_layer_from_name(const std::string & name);
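
// Illustration only (not part of the header): a stand-alone sketch of the mask branch in
// get_graph_input_ov_name(), which picks "self_kq_mask_swa" when the ggml tensor name contains
// "swa" and "self_kq_mask" otherwise. mask_input_name_sketch and the tensor names in the demo
// are hypothetical; the real tensor names depend on how the ggml graph was built.

```cpp
#include <cassert>
#include <string>

// Same substring test as the mask branch of get_graph_input_ov_name(),
// but taking a plain C string instead of a ggml_tensor.
inline std::string mask_input_name_sketch(const char * tensor_name) {
    return std::string(tensor_name).find("swa") == std::string::npos ? "self_kq_mask" : "self_kq_mask_swa";
}

// Small usage example with made-up tensor names.
inline void mask_name_sketch_demo() {
    assert(mask_input_name_sketch("example_mask") == "self_kq_mask");
    assert(mask_input_name_sketch("example_mask_swa") == "self_kq_mask_swa");
}
```

// Because the test is a plain substring search, any tensor name that contains "swa" anywhere
// is routed to the sliding-window mask input.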