A TPU you can tape to a Raspberry Pi
A resource-sipping matrix-math coprocessor in VHDL that brings Google's TPU architecture down to FPGA scale.

What it does
tinyTPU is a VHDL implementation of a Google-style Tensor Processing Unit shrunk to fit commodity FPGAs. It attaches to a host processor over AXI, accepts 10-byte instructions, and runs fixed-point matrix multiplies through a configurable systolic array. It includes a complete MNIST inference pipeline: weight buffers, unified memory, diagonalization registers, accumulators, and fused activations (ReLU, sigmoid).
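To make the systolic dataflow concrete, here is a minimal Python sketch of a weight-stationary multiply-add grid with skewed ("diagonalized") inputs. It is a generic textbook model, not a transcription of tinyTPU's VHDL: the function name systolic_matmul, the register layout, and the choice to delay input row i by i cycles are all assumptions made for illustration.

```python
from typing import List

Matrix = List[List[int]]


def systolic_matmul(x: Matrix, w: Matrix) -> Matrix:
    """Cycle-by-cycle model of a weight-stationary systolic array (hypothetical).

    x is T x N (one input vector per row), w is N x M. Cell (i, j) holds
    w[i][j]; activations move right, partial sums move down.
    """
    t_rows, n, m = len(x), len(w), len(w[0])
    a_reg = [[0] * m for _ in range(n)]  # activation register of each cell
    p_reg = [[0] * m for _ in range(n)]  # partial-sum register of each cell
    y = [[0] * m for _ in range(t_rows)]

    # In this model the last result leaves the array at cycle T + N + M - 3.
    for cycle in range(t_rows + n + m - 2):
        new_a = [[0] * m for _ in range(n)]
        new_p = [[0] * m for _ in range(n)]
        for i in range(n):
            for j in range(m):
                if j == 0:
                    # Diagonalized staging (assumed): row i is delayed i cycles.
                    t = cycle - i
                    a_in = x[t][i] if 0 <= t < t_rows else 0
                else:
                    a_in = a_reg[i][j - 1]  # value from the left neighbour
                p_in = p_reg[i - 1][j] if i > 0 else 0  # sum from above
                new_a[i][j] = a_in
                new_p[i][j] = p_in + w[i][j] * a_in
        a_reg, p_reg = new_a, new_p

        # The bottom row emits finished dot products, one column per cycle.
        for j in range(m):
            t = cycle - (n - 1) - j
            if 0 <= t < t_rows:
                y[t][j] = p_reg[n - 1][j]
    return y


def reference_matmul(x: Matrix, w: Matrix) -> Matrix:
    """Plain triple-loop product used to check the model."""
    return [[sum(x[t][k] * w[k][j] for k in range(len(w)))
             for j in range(len(w[0]))] for t in range(len(x))]
```

The skew is what lets every cell see the activation and the partial sum belonging to the same input vector in the same cycle. Without it, the sums flowing down a column would mix values from different vectors.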
The interesting bit
The whole point is configurability: the matrix multiply unit, buffers, and datapath widths are independently tunable, so the same design spans Zynq IoT boards and (in theory) larger FPGA fabric. The author validated this by benchmarking MNIST across 6×6 to 14×14 systolic grids at 177.77 MHz, then comparing inference latency against an Intel i5 and a Raspberry Pi 3.
Key highlights
- Fixed-point only: weights and activations must lie in the signed range -1 to 127/128 or the unsigned range 0 to 255/256; there is no floating-point support
- Systolic 2-D multiply-add grid with diagonalized input staging and accumulator bypass/merge
- 10-byte instruction set (documented in doc/TPU_ISA.md) fed over AXI into a small FIFO
- Evaluated on Xilinx Zynq-7020; getting-started guide targets Vivado and Zynq SoCs
- Bachelor thesis from HAW Hamburg (German); getting_started.pdf included
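The fixed-point ranges in the first highlight are consistent with 8-bit values scaled by 128 (signed) or 256 (unsigned). A rough Python sketch of preparing model values for such a format follows. The 8-bit reading is an inference from the stated ranges, and the helper names, the round-to-nearest step, and the saturation behaviour are assumptions, not something taken from the project.

```python
def quantize_signed(value: float) -> int:
    """Map a real value to a signed 8-bit code with 7 fractional bits.

    Representable range is -1 .. 127/128; anything outside saturates
    (saturation is an assumption, not documented behaviour).
    """
    code = round(value * 128)
    return max(-128, min(127, code))


def quantize_unsigned(value: float) -> int:
    """Map a real value to an unsigned 8-bit code with 8 fractional bits.

    Representable range is 0 .. 255/256; anything outside saturates.
    """
    code = round(value * 256)
    return max(0, min(255, code))


def dequantize_signed(code: int) -> float:
    """Recover the real value represented by a signed code."""
    return code / 128


def dequantize_unsigned(code: int) -> float:
    """Recover the real value represented by an unsigned code."""
    return code / 256
```

Note that +1.0 is not representable in either format, so a weight of exactly 1.0 would saturate to 127/128 (signed) or 255/256 (unsigned) under these assumptions.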
Caveats
- “Theorethical” [sic] 72.18 GOPS is the peak; real throughput depends heavily on matrix dimensions and instruction count
- No mention of toolchain versions, resource utilization numbers, or power draw in the README
- Only two activations implemented; no batch normalization or other modern layers mentioned
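To see why real throughput depends on matrix dimensions, consider a simplified pipeline model in which streaming a batch of input rows through an n×m systolic grid takes rows + n + m - 2 cycles (the extra cycles fill and drain the array). The sketch below computes the resulting utilization. The model and the function name array_utilization are assumptions for illustration; it ignores weight loading, instruction overhead, and memory stalls, and it is not a reproduction of the README's GOPS figure.

```python
def array_utilization(rows: int, n: int, m: int) -> float:
    """Fraction of cycles in which the grid does useful work.

    Assumed model: rows input vectors need rows + n + m - 2 cycles to
    pass through an n x m grid, of which only rows cycles' worth of
    multiply-add slots carry real data.
    """
    if rows <= 0 or n <= 0 or m <= 0:
        raise ValueError("rows, n and m must be positive")
    cycles = rows + n + m - 2
    return rows / cycles


# Compare short and long input streams on a 14 x 14 grid.
for rows in (1, 14, 1000):
    print(rows, round(array_utilization(rows, 14, 14), 3))
```

Under this model a single input row keeps a 14×14 grid busy for only 1 of 27 cycles, while a stream of 1000 rows reaches about 97% utilization, so small layers and short instruction runs sit far below the peak.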
Verdict
Grab this if you’re building embedded vision or sensor fusion on Zynq and can live with quantized models. Skip it if you need floating point, out-of-the-box TensorFlow Lite compatibility, or a maintained software stack: the project appears to be thesis-complete.