← all repositories
zai-org/Open-AutoGLM

Your phone, but it listens to natural language

An open-source framework that lets a vision-language model see your screen and control your Android, HarmonyOS, or iOS device via ADB.

25.5k stars Python Agents
Open-AutoGLM
Velocity · 7d
+140
★ / day
Trend
steady
star history

What it does

Open-AutoGLM is a phone agent framework from Zhipu AI. You type a command like “open Xiaohongshu and search for food” in Chinese or English; a 9B vision-language model looks at your screen through ADB (or HDC for HarmonyOS, WebDriverAgent for iOS), plans the next tap or swipe, and executes it. The model runs either via third-party APIs (BigModel, ModelScope) or self-hosted through vLLM/SGLang.

The interesting bit

The model doesn’t just OCR your screen—it reasons about UI layout visually. The README shows a chain-of-thought example where the agent compares shampoo prices across JD.com and Taobao, launching apps, searching, and deciding where to buy. There’s also a human-handoff mechanism for logins and CAPTCHAs, which is the pragmatic admission that full autonomy still breaks at the payment wall.

Key highlights

  • Supports Android 7.0+, HarmonyOS NEXT, and iOS through WebDriverAgent
  • Two model variants: Chinese-optimized AutoGLM-Phone-9B and a multilingual version
  • Can self-host with vLLM or SGLang; exact launch parameters provided in docs
  • Integrates with Midscene.js for cross-platform UI automation workflows
  • Includes a deployment check script that validates model output quality (short or garbled chains mean your setup is wrong)

Caveats

  • Requires developer mode, USB debugging, and for Android, a separate ADB Keyboard APK installed and enabled
  • The README is primarily in Chinese; English documentation exists but is secondary
  • Self-hosting demands careful dependency management (noted transformer conflicts, specific CUDA/cuDNN versions)

Verdict

Worth a look if you’re building mobile automation, accessibility tools, or testing vision-language agents in the wild. Skip it if you need something that works out of the box without sideloading keyboards and toggling developer settings.

heatdrop uses Google Analytics to see which pages get read — nothing else. Your call. How we handle data.