maderix/ANE

Training transformers on Apple's secret silicon

A weekend research hack that reverse-engineers private APIs to run backpropagation on the Neural Engine Apple reserves for inference only.

6.7k stars · Objective-C · ML Frameworks
Velocity (7d): +67 ★/day · Trend: steady

What it does

This project trains small transformers — Stories110M and Qwen3-0.6B — directly on Apple’s Neural Engine (ANE), the inference-only accelerator locked behind CoreML. It bypasses Apple’s restrictions by reverse-engineering undocumented _ANEClient and _ANECompiler private APIs, then feeding them hand-rolled MIL (Model Intermediate Language) programs that include forward and backward passes. No Metal, no GPU, no CoreML training APIs.

The interesting bit

The author treats the ANE less like a black-box coprocessor and more like a programmable DSP that Apple simply won’t document. The real craft is in the workaround architecture: weights are packed into spatial dimensions so they can change without recompilation, forward “taps” expose hidden states via concat outputs to avoid CPU recompute, and the whole thing restarts via exec() every ~119 compiles to dodge a resource leak in Apple’s compiler. It’s less a framework than a detailed field report on what this silicon can actually do if you speak to it directly.
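The restart trick is easiest to see in miniature. The sketch below is not the project's code (the repository is Objective-C); it is a Python illustration of the general pattern of counting compiles and replacing the process image once a budget is spent. The class name `CompileBudget`, the `save_checkpoint` hook, and the idea that state is checkpointed before the restart are assumptions made for the example; only the figure of roughly 119 compiles comes from the summary above.

```python
import os
import sys

COMPILE_BUDGET = 119  # "every ~119 compiles" per the summary; treat as approximate


class CompileBudget:
    """Counts compiles and re-execs the process once the budget is spent."""

    def __init__(self, budget=COMPILE_BUDGET, restart=None):
        self.budget = budget
        self.used = 0
        # The default restart replaces the current process image in place.
        self.restart = restart or self._reexec

    @staticmethod
    def _reexec():
        # os.execv does not return on success: same PID, fresh address space,
        # so anything leaked by earlier compiles is released.
        os.execv(sys.executable, [sys.executable] + sys.argv)

    def before_compile(self, save_checkpoint):
        """Call before every compile; restarts first if the budget is spent."""
        if self.used >= self.budget:
            save_checkpoint()  # hypothetical hook: persist whatever must survive
            self.restart()
            self.used = 0      # only reached when restart is stubbed out
        self.used += 1
```

Making `restart` injectable keeps the counter testable without actually replacing the process; a real training loop would leave the default and reload its checkpoint on startup.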

Key highlights

  • 91 ms/step for Stories110M (109M params), 412 ms/step for Qwen3-0.6B (596M params) on M4
  • INT8 W8A8 quantization hits 35.1 TOPS, 1.88× over FP16, by halving the L2 SRAM traffic between tiles
  • GPU↔ANE zero-copy pipeline via shared IOSurface: GPU prefill (6.7ms) → ANE decode (1.9ms)
  • GCD-async cblas overlap for dW gradients, deferred waits pushed into the next forward pass
  • No dependencies beyond system frameworks; private APIs resolved at runtime via objc_msgSend
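The deferred-wait idea in the dW bullet can be sketched as follows. This swaps in Python's `ThreadPoolExecutor` for GCD and a pure-Python matrix product for cblas, so it shows the scheduling pattern only, not the project's implementation. The names `DeferredGrads`, `submit`, and `drain` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def matmul_t(x, dy):
    """dW = x^T @ dy for row-major lists: x is (n, d_in), dy is (n, d_out)."""
    n, d_in, d_out = len(x), len(x[0]), len(dy[0])
    return [[sum(x[k][i] * dy[k][j] for k in range(n)) for j in range(d_out)]
            for i in range(d_in)]


class DeferredGrads:
    """Queue dW products on a worker thread; block on them only when needed."""

    def __init__(self):
        self.pool = ThreadPoolExecutor(max_workers=1)
        self.pending = []  # list of (layer name, future)

    def submit(self, name, x, dy):
        # Called during backward: returns immediately, work runs in background.
        self.pending.append((name, self.pool.submit(matmul_t, x, dy)))

    def drain(self):
        # Called at the start of the next forward pass: the deferred wait.
        grads = {name: fut.result() for name, fut in self.pending}
        self.pending.clear()
        return grads

    def close(self):
        self.pool.shutdown(wait=True)
```

A training loop built this way would call `submit` once per layer during backward, keep the accelerator busy with the rest of the backward pass, and call `drain` just before the weights are next read.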

Caveats

  • ANE utilization is only ~5-9% of peak; many element-wise ops still fall back to CPU
  • Causal SDPA masking is split across ANE and CPU because the hardware ignores attn_mask
  • FP16 backward matmuls underflow unless the loss is scaled globally by 256 × NLAYERS
  • The author explicitly calls this a research project, not a maintained framework, and warns it does not replace GPU training for anything beyond small research models
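The loss-scaling caveat comes down to FP16's limited range: values smaller than half of the smallest subnormal (2^-24, about 5.96e-8) round to zero. The sketch below demonstrates that effect with Python's `struct` half-precision format. `NLAYERS = 12` and the gradient value are illustrative assumptions, not figures from the repository; only the 256 × NLAYERS scale comes from the caveat above.

```python
import struct

NLAYERS = 12                # illustrative value, not taken from the repository
LOSS_SCALE = 256 * NLAYERS  # the scale quoted in the caveat above


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


tiny_grad = 1e-8  # illustrative gradient, below FP16's smallest subnormal

# Without scaling, the gradient is lost when stored as FP16.
unscaled = to_fp16(tiny_grad)

# With scaling, the same gradient survives the FP16 round trip. In real
# training the loss is multiplied once, so every gradient arrives pre-scaled.
scaled = to_fp16(tiny_grad * LOSS_SCALE)

# Unscale in higher precision before the optimizer update.
recovered = scaled / LOSS_SCALE

print(unscaled, scaled, recovered)
```

The recovered value is not exact, because the scaled gradient still lands in FP16's coarse subnormal range, but it is nonzero and close to the original, which is the point of scaling.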

Verdict

Grab this if you’re researching edge AI compilers, probing Apple Silicon’s undocumented corners, or need a reference for direct ANE access outside CoreML. Skip it if you want a production training stack — the author will tell you the same.
