Neural nets on a diet: 8-bit math for your phone
A C library that shoves quantized neural network operators into mobile CPUs without the bloat of full-precision inference.

What it does
QNNPACK is a low-level C library that implements common neural network operators—convolution, pooling, fully-connected layers, activations—using 8-bit quantized tensors. It’s designed as plumbing for deep learning frameworks, not for direct use by researchers. PyTorch 1.0’s Caffe2 backend already drinks from this well.
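For readers new to the term, "8-bit quantized" generally means each real value is stored as a uint8 plus a shared scale and zero point. The sketch below illustrates that general affine scheme; it is not code from QNNPACK, and the names qparams, quantize_u8, and dequantize_u8 are hypothetical, chosen only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Affine quantization parameters: real = scale * (q - zero_point).
   Hypothetical struct for illustration, not a QNNPACK type. */
struct qparams {
    float scale;        /* size of one quantization step, must be > 0 */
    uint8_t zero_point; /* quantized value that represents real 0.0 */
};

/* Map a finite real value to uint8: shift by the zero point,
   saturate to [0, 255], then round to nearest. */
static uint8_t quantize_u8(float x, struct qparams p) {
    assert(p.scale > 0.0f);
    float q = x / p.scale + (float)p.zero_point;
    if (q < 0.0f) q = 0.0f;
    if (q > 255.0f) q = 255.0f;
    /* q is non-negative here, so adding 0.5 and truncating rounds to nearest. */
    return (uint8_t)(q + 0.5f);
}

/* Recover the approximate real value from its 8-bit code. */
static float dequantize_u8(uint8_t q, struct qparams p) {
    return p.scale * (float)((int32_t)q - (int32_t)p.zero_point);
}
```

With a scale of 0.5 and a zero point of 128, the real value 1.0 maps to code 130 and back to 1.0, while anything far outside the representable range saturates to 0 or 255. That saturation is the price of squeezing every value into one byte.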
The interesting bit
The build system treats mobile as a first-class citizen, not an afterthought. There are dedicated cross-compilation scripts for six iOS targets (simulators included, and reaching down to iPhone 3GS-era armv7) and three Android ABIs. On armeabi-v7a the library checks for NEON at run time, so initialization fails cleanly on older hardware instead of crashing. That’s the kind of defensive engineering you need when you’re shipping to a billion pocket computers.
Key highlights
- 8-bit quantized implementations of core operators: 2D conv/deconv, channel shuffle, fully connected, max pooling, average pooling, global average pooling, sigmoid, Leaky ReLU, clamp, SoftMax
- CMake-based builds with ready-made scripts for native, Android (armeabi-v7a with NEON requirement, arm64-v8a, x86), and iOS (armv7 through arm64e, plus simulators)
- End-to-end benchmarking via Caffe2/PyTorch 1.0 using a pre-trained quantized MobileNet v2
- Integration with Facebook’s FAI-PEP benchmarking platform for cross-backend comparison
- BSD licensed
Caveats
- Two operators are still on the wishlist: Locally Connected and Group Normalization
- The README warns explicitly against setting -DANDROID_ARM_NEON=1 during compilation: doing so can crash on non-NEON hardware rather than failing gracefully at initialization
- iOS builds require pulling in a separate ios-cmake toolchain repository
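The NEON caveat comes down to a familiar pattern: check the hardware when the backend initializes, and fall back instead of crashing. Below is a minimal sketch of the calling side, assuming a status-returning initializer. Every name in it (backend_status, init_quantized_backend, select_backend) is a hypothetical stand-in rather than QNNPACK's API, and the cpu_has_neon flag replaces real CPU detection so the example stays self-contained.

```c
#include <stdbool.h>

/* Hypothetical status codes for illustration; not QNNPACK's enum. */
enum backend_status {
    BACKEND_OK = 0,
    BACKEND_UNSUPPORTED_HARDWARE = 1
};

enum backend_kind {
    BACKEND_QUANTIZED,
    BACKEND_FLOAT_FALLBACK
};

/* Hypothetical initializer: reports unsupported hardware as a status
   instead of executing instructions the CPU may not have. */
static enum backend_status init_quantized_backend(bool cpu_has_neon) {
    return cpu_has_neon ? BACKEND_OK : BACKEND_UNSUPPORTED_HARDWARE;
}

/* Caller side: use the quantized path only if initialization succeeded. */
static enum backend_kind select_backend(bool cpu_has_neon) {
    if (init_quantized_backend(cpu_has_neon) == BACKEND_OK) {
        return BACKEND_QUANTIZED;
    }
    return BACKEND_FLOAT_FALLBACK;
}
```

In a real integration the check would be the library's own initialization call. Forcing NEON on at compile time removes that safety net, because the compiler may then emit NEON instructions before any runtime check has a chance to run.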
Verdict
Worth a look if you’re building mobile inference pipelines in PyTorch/Caffe2 and need to squeeze performance from quantized models. Skip it if you’re on a framework that doesn’t integrate with QNNPACK or if you need the missing operators.