Training transformers on Apple's secret silicon
A weekend research hack that reverse-engineers private APIs to run backpropagation on the Neural Engine Apple reserves for inference only.

What it does
This project trains small transformers — Stories110M and Qwen3-0.6B — directly on Apple’s Neural Engine (ANE), the inference-only accelerator locked behind CoreML. It bypasses Apple’s restrictions by reverse-engineering undocumented _ANEClient and _ANECompiler private APIs, then feeding them hand-rolled MIL (Model Intermediate Language) programs that include forward and backward passes. No Metal, no GPU, no CoreML training APIs.
The interesting bit
The author treats the ANE less like a black-box coprocessor and more like a programmable DSP that Apple simply won’t document. The real craft is in the workaround architecture: weights are packed into spatial dimensions so they can change without recompilation, forward “taps” expose hidden states via concat outputs to avoid CPU recompute, and the whole thing restarts via exec() every ~119 compiles to dodge a resource leak in Apple’s compiler. It’s less a framework than a detailed field report on what this silicon can actually do if you speak to it directly.
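The restart trick can be sketched in a few lines. This is a minimal Python illustration of the pattern, not the project's own code: the checkpoint format and the names `restart_if_needed`, `save_checkpoint`, and `load_checkpoint` are hypothetical, and only the budget of about 119 compiles comes from the write-up.

```python
import json
import os
import sys

COMPILE_BUDGET = 119  # from the write-up: restart roughly every 119 compiles


def save_checkpoint(path, state):
    """Persist the training state so the fresh process can resume."""
    with open(path, "w") as f:
        json.dump(state, f)


def load_checkpoint(path):
    """Return saved state, or a fresh one if no checkpoint exists yet."""
    if not os.path.exists(path):
        return {"step": 0}
    with open(path) as f:
        return json.load(f)


def restart_if_needed(compiles, state, ckpt_path, exec_fn=os.execv):
    """Re-exec the current program once the compile budget is used up.

    exec_fn is injectable so the logic can be exercised without actually
    replacing the process. Returns False if no restart was needed.
    """
    if compiles < COMPILE_BUDGET:
        return False
    save_checkpoint(ckpt_path, state)
    # execv replaces the process image, so anything the compiler leaked
    # inside this process is released by the OS.
    exec_fn(sys.executable, [sys.executable] + sys.argv)
    return True  # only reached when exec_fn is a test stub
```

On startup the program would call `load_checkpoint` and continue from the saved step, so a restart looks like a brief pause in training and loses no work.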
Key highlights
- 91 ms/step for Stories110M (109M params), 412 ms/step for Qwen3-0.6B (596M params) on M4
- INT8 W8A8 quantization hits 35.1 TOPS, 1.88× over FP16, by halving the L2 SRAM bandwidth needed between tiles
- GPU↔ANE zero-copy pipeline via shared IOSurface: GPU prefill (6.7 ms) → ANE decode (1.9 ms)
- GCD-async cblas overlap for dW gradients, deferred waits pushed into the next forward pass
- No dependencies beyond system frameworks; private APIs resolved at runtime via objc_msgSend
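The deferred-wait idea behind the dW overlap can be shown with a toy example. The sketch below is an analogy under stated assumptions, not the project's implementation: a `ThreadPoolExecutor` stands in for a GCD queue, a plain-Python `matmul` stands in for cblas, and the one-layer model and `train` function are hypothetical. It shows only the ordering (submit dW, keep going, wait just before the weights are next read). It does not show a real speedup, since pure-Python work does not run in parallel under the GIL.

```python
from concurrent.futures import ThreadPoolExecutor


def matmul(a, b):
    """Plain-Python matrix product; stands in for a cblas call."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def apply_update(weights, dw, lr):
    """SGD step: W <- W - lr * dW."""
    return [[w - lr * g for w, g in zip(wr, gr)] for wr, gr in zip(weights, dw)]


def train(weights, batches, lr=0.1):
    """Toy linear layer y = x @ W with loss 0.5 * ||y||^2 (so dy = y)."""
    pending = None  # future for the previous step's dW, if any
    with ThreadPoolExecutor(max_workers=1) as pool:
        for x in batches:
            # Deferred wait: the previous dW is only collected right before
            # the forward pass that reads the updated weights.
            if pending is not None:
                weights = apply_update(weights, pending.result(), lr)
            y = matmul(x, weights)  # forward
            dy = y                  # gradient of 0.5 * ||y||^2 w.r.t. y
            # dW = x^T @ dy is handed to the background worker.
            pending = pool.submit(matmul, transpose(x), dy)
        if pending is not None:     # flush the last gradient
            weights = apply_update(weights, pending.result(), lr)
    return weights
```

For example, `train([[1.0]], [[[1.0]], [[1.0]]])` applies two updates in sequence. The result matches a fully synchronous loop, which is the property the deferred wait has to preserve.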
Caveats
- ANE utilization is only ~5-9% of peak; many element-wise ops still fall back to CPU
- Causal SDPA masking is split across ANE and CPU because the hardware ignores attn_mask
- FP16 backward matmuls underflow without global loss scaling (256 * NLAYERS)
- The author explicitly calls this a research project, not a maintained framework, and warns it does not replace GPU training for anything beyond small research models
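The underflow caveat is easy to reproduce without any ANE hardware. The following is a small sketch that uses Python's `struct` module to round values to IEEE half precision. The scale factor `256 * NLAYERS` is taken from the write-up, while `NLAYERS = 12` and the two gradient magnitudes are assumptions chosen only to make the effect visible.

```python
import struct

NLAYERS = 12                 # assumed layer count, for illustration only
LOSS_SCALE = 256 * NLAYERS   # the global scale quoted above


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def backward_fp16(upstream_grad, local_grad):
    """One backward multiply with inputs and result stored in FP16."""
    return to_fp16(to_fp16(upstream_grad) * to_fp16(local_grad))


tiny_grad = 1e-4   # hypothetical upstream gradient
local = 1e-4       # hypothetical local derivative

# The true product is about 1e-8, below the smallest FP16 subnormal,
# so the unscaled result is flushed to zero.
unscaled = backward_fp16(tiny_grad, local)

# Scaling the upstream gradient first keeps the product representable.
scaled = backward_fp16(tiny_grad * LOSS_SCALE, local)

# Unscale in higher precision (Python floats are 64-bit).
recovered = scaled / LOSS_SCALE
```

Here `unscaled` comes out as exactly zero while `recovered` stays close to the true 1e-8, which is why the scale has to be applied to the loss before the backward pass and divided out before the weight update.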
Verdict
Grab this if you’re researching edge AI compilers, probing Apple Silicon’s undocumented corners, or need a reference for direct ANE access outside CoreML. Skip it if you want a production training stack — the author will tell you the same.