Is intel-extension-for-transformers open source?

Yes — intel/intel-extension-for-transformers is open source, released under the Apache-2.0 license.

What language is intel-extension-for-transformers written in?

intel/intel-extension-for-transformers is primarily written in Python.

How popular is intel-extension-for-transformers?

intel/intel-extension-for-transformers has 2.2k stars on GitHub.

Where can I find intel-extension-for-transformers?

intel/intel-extension-for-transformers is on GitHub at https://github.com/intel/intel-extension-for-transformers.

← all repositories

intel/intel-extension-for-transformers

Intel's toolkit for squeezing LLMs onto its own silicon

It wraps Hugging Face transformers with Intel-specific compression, a chatbot framework, and C++ inference kernels tuned for Xeon, Arc, and Gaudi2.

★2.2k stars Python Inference · Serving LLMOps · Eval Language Models RAG · Search Chat Assistants

View on GitHub ↗

Not currently ranked — collecting fresh signals.

star history

What it does Intel Extension for Transformers is a toolkit that layers Intel-specific optimizations onto standard Hugging Face transformers workflows. It bundles model compression via Intel Neural Compressor, a chatbot framework called NeuralChat with plugins for retrieval and speech, and a separate C++ inference engine called Neural Speed that uses weight-only quantization for CPU and GPU. The goal is to keep you inside the familiar transformers API while squeezing performance out of Intel hardware from Sapphire Rapids Xeons to Arc GPUs.

The interesting bit The project does not just patch PyTorch; it ships a compression-aware runtime backed by NeurIPS research and offers OpenAI-compatible REST APIs through NeuralChat, turning a local Intel box into a drop-in chat backend. It also pushes QLoRA fine-tuning down to client CPUs, which is unusual for a vendor toolkit usually obsessed with data-center inference.

Key highlights

Extends Hugging Face transformers APIs rather than replacing them, leveraging Intel Neural Compressor for quantization.
NeuralChat framework supports plugins like Knowledge Retrieval, Speech Interaction, Query Caching, and Security Guardrails.
LLM inference kernels written in pure C/C++ (Neural Speed) with weight-only quantization, targeting AMX, VNNI, AVX512F, and AVX2 instruction sets.
Validated on Intel Gaudi2, Xeon Scalable (4th–6th Gen), Xeon CPU Max, Data Center GPU Max, Arc A-Series, and Core processors, with INT4/FP4/NF4 and INT8/FP8 support varying by chip.
Includes optimized examples for Stable Diffusion, GPT-J, BLOOM-176B, Llama 3, Qwen2, and others.

Caveats

4-bit inference is not available on Gaudi2, and several GPU fine-tuning paths remain marked “WIP” in the hardware matrix.
The validated OS list stops at Ubuntu 22.04 and CentOS 8, which feels dated if you are on a newer distro.
Dependency versions are tightly pinned; stray from the exact PyTorch, driver, and transformers combinations and you are on your own.

Verdict Worth a look if you are already committed to Intel silicon and want a vendor-supported path to quantized inference and chatbot serving without leaving the Hugging Face ecosystem. If your hardware is AMD, ARM, or NVIDIA, this is essentially a very detailed compatibility wall.

Frequently asked

What is intel/intel-extension-for-transformers?: It wraps Hugging Face transformers with Intel-specific compression, a chatbot framework, and C++ inference kernels tuned for Xeon, Arc, and Gaudi2.
Is intel-extension-for-transformers open source?: Yes — intel/intel-extension-for-transformers is open source, released under the Apache-2.0 license.
What language is intel-extension-for-transformers written in?: intel/intel-extension-for-transformers is primarily written in Python.
How popular is intel-extension-for-transformers?: intel/intel-extension-for-transformers has 2.2k stars on GitHub.
Where can I find intel-extension-for-transformers?: intel/intel-extension-for-transformers is on GitHub at https://github.com/intel/intel-extension-for-transformers.