← all repositories
openvinotoolkit/model_server

Intel's inference server that speaks OpenAI, KServe, and TensorFlow

A C++ model server built for Intel hardware that exposes multiple API dialects so clients don't need to know what backend runs their inference.

886 stars C++ Inference · Serving
model_server
Velocity · 7d
+0.3
★ / day
Trend
steady
star history

What it does

OpenVINO Model Server (OVMS) hosts ML models and serves inference over gRPC or REST. Clients send requests; the server runs inference via OpenVINO and returns results. It targets Docker, bare metal, and Kubernetes deployments with horizontal and vertical scaling.

The interesting bit

The generative API compatibility is the real pivot: it exposes OpenAI-style endpoints for LLMs, embeddings, reranking, image generation, and now speech tasks, plus KServe and TensorFlow Serving protocols. That’s a lot of API surface for a C++ server optimized around Intel’s inference stack. The DAG scheduler and MediaPipe graph support also let you chain preprocessing, custom nodes, and model execution without leaving the server.

Key highlights

  • OpenAI-compatible endpoints for text generation, embeddings, image generation, and speech (new)
  • KServe and TensorFlow Serving APIs for traditional model serving
  • C++ implementation, Intel-architecture optimized
  • Model versioning and runtime config updates without restarts
  • Prometheus-compatible metrics
  • Supports TensorFlow, PaddlePaddle, ONNX, and AI accelerators
  • Windows, Ubuntu, and RedHat tested; Docker images on Docker Hub and RedHat catalog

Caveats

  • The README notes testing on specific platforms but doesn’t detail performance characteristics versus other servers
  • “Efficient resource utilization” is claimed but no specific benchmarks or comparisons are included in the README itself

Verdict

Worth evaluating if you’re already in the Intel/OpenVINO ecosystem or need a single server that exposes OpenAI-style APIs without the usual Python serving stack. Less compelling if you’re committed to GPU-centric inference or non-Intel hardware.

heatdrop uses Google Analytics to see which pages get read — nothing else. Your call. How we handle data.