Your phone, but it listens to natural language
An open-source framework that lets a vision-language model see your screen and control your Android, HarmonyOS, or iOS device via ADB.

What it does
Open-AutoGLM is a phone agent framework from Zhipu AI. You type a command like “open Xiaohongshu and search for food” in Chinese or English; a 9B vision-language model looks at your screen through ADB (or HDC for HarmonyOS, WebDriverAgent for iOS), plans the next tap or swipe, and executes it. The model runs either via third-party APIs (BigModel, ModelScope) or self-hosted through vLLM/SGLang.
The interesting bit
The model doesn’t just OCR your screen—it reasons about UI layout visually. The README shows a chain-of-thought example where the agent compares shampoo prices across JD.com and Taobao, launching apps, searching, and deciding where to buy. There’s also a human-handoff mechanism for logins and CAPTCHAs, which is the pragmatic admission that full autonomy still breaks at the payment wall.
Key highlights
- Supports Android 7.0+, HarmonyOS NEXT, and iOS through WebDriverAgent
- Two model variants: Chinese-optimized
AutoGLM-Phone-9Band a multilingual version - Can self-host with vLLM or SGLang; exact launch parameters provided in docs
- Integrates with Midscene.js for cross-platform UI automation workflows
- Includes a deployment check script that validates model output quality (short or garbled chains mean your setup is wrong)
Caveats
- Requires developer mode, USB debugging, and for Android, a separate ADB Keyboard APK installed and enabled
- The README is primarily in Chinese; English documentation exists but is secondary
- Self-hosting demands careful dependency management (noted transformer conflicts, specific CUDA/cuDNN versions)
Verdict
Worth a look if you’re building mobile automation, accessibility tools, or testing vision-language agents in the wild. Skip it if you need something that works out of the box without sideloading keyboards and toggling developer settings.