Your UI tests just got eyes and a natural-language mouth
Midscene.js lets you automate web, mobile, and desktop interfaces by describing goals in plain English, then watches screenshots to decide where to click.

What it does
Midscene.js is a UI automation framework that replaces brittle selectors with vision-language models. You write goals like “register a GitHub account and pass all validations” in JavaScript or YAML; the library takes screenshots, reasons about the interface, and executes clicks, scrolls, and text input. It wraps Puppeteer and Playwright for web, adb for Android, and WebDriverAgent for iOS, with a bridge mode for desktop browsers.
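To make "goals instead of selectors" concrete, here is a minimal sketch of what such a test might look like. The method names aiAction, aiQuery, and aiAssert come from the project's API list; the UiAgent interface, the RecordingAgent stub, and every signature below are assumptions made for illustration, not Midscene's actual types. The stub only records calls, so the snippet runs without a browser or a model.

```typescript
// Hypothetical minimal agent shape. The method names come from the article;
// the signatures are assumptions made for illustration.
interface UiAgent {
  aiAction(goal: string): Promise<void>;
  aiQuery<T>(description: string): Promise<T>;
  aiAssert(claim: string): Promise<void>;
}

// Stub that records calls instead of driving a browser or calling a model.
class RecordingAgent implements UiAgent {
  readonly calls: string[] = [];
  private readonly cannedQueryResult: unknown;

  constructor(cannedQueryResult: unknown) {
    this.cannedQueryResult = cannedQueryResult;
  }

  async aiAction(goal: string): Promise<void> {
    this.calls.push(`action: ${goal}`);
  }

  async aiQuery<T>(description: string): Promise<T> {
    this.calls.push(`query: ${description}`);
    return this.cannedQueryResult as T;
  }

  async aiAssert(claim: string): Promise<void> {
    this.calls.push(`assert: ${claim}`);
  }
}

// A test expressed as goals rather than selectors.
async function signUpFlow(agent: UiAgent): Promise<string[]> {
  await agent.aiAction("register a GitHub account and pass all validations");
  const errors = await agent.aiQuery<string[]>("visible validation error messages");
  await agent.aiAssert("the page shows a signed-in state");
  return errors;
}
```

Nothing in signUpFlow mentions a CSS selector or an element ID. In a real setup the stub would be swapped for one of the library's agents, with constructor and options taken from its documentation.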
The interesting bit
The project is deliberately all-in on pure-vision for actions — no DOM parsing for clicking, no XPath, no element IDs. That means it works on <canvas>-heavy apps and mobile UIs where the DOM is opaque or nonexistent. DOM is still available as an opt-in for data extraction, but the core philosophy is: look at the screen, decide, act. The README claims this cuts token usage and speeds up runs, which is plausible since you’re not shipping HTML trees to the model.
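The "look at the screen, decide, act" loop can be sketched in a few lines. This is not Midscene's implementation: UiSurface, UiStep, Decide, and runGoal are hypothetical names, and the decide function stands in for a vision-language model call that sees only a screenshot and the goal text.

```typescript
// All types here are hypothetical; they only illustrate the shape of the loop.
type Screenshot = { width: number; height: number; label: string };

type UiStep =
  | { kind: "tap"; x: number; y: number }
  | { kind: "type"; text: string }
  | { kind: "done" };

interface UiSurface {
  capture(): Screenshot;
  perform(step: UiStep): void;
}

// Stand-in for a model call: it receives pixels and a goal, never a DOM tree.
type Decide = (goal: string, shot: Screenshot) => UiStep;

function runGoal(goal: string, surface: UiSurface, decide: Decide, maxSteps = 10): UiStep[] {
  const performed: UiStep[] = [];
  for (let i = 0; i < maxSteps; i++) {
    const shot = surface.capture();   // look
    const step = decide(goal, shot);  // decide
    if (step.kind === "done") return performed;
    surface.perform(step);            // act
    performed.push(step);
  }
  throw new Error(`goal not reached within ${maxSteps} steps: ${goal}`);
}
```

Because the loop only needs capture and perform, the same shape fits a canvas app, an Android screen, or a browser tab. The step cap is a design choice in this sketch, there to stop a confused model from looping forever.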
Key highlights
- Supports Qwen3-VL, Doubao-1.6-vision, Gemini 3 Pro, and ByteDance’s own UI-TARS model for self-hosting
- Three API layers: interaction (aiAction), extraction (aiQuery), and assertion (aiAssert, aiWaitFor)
- MCP server exposes atomic actions so upstream agents can compose Midscene into larger workflows
- Built-in caching to replay scripts without re-querying the model
- Chrome Extension and device playgrounds for zero-code experimentation
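The caching highlight is easiest to see in miniature. The sketch below is one simple way such replay could work, assuming planned steps are keyed by prompt text. The project's real cache format and invalidation rules are not described here, so PlanCache, PlannedStep, and Planner are hypothetical names.

```typescript
// Hypothetical types; the real cache format is not covered in the article.
type PlannedStep = { kind: "tap"; x: number; y: number };
type Planner = (prompt: string) => PlannedStep;

class PlanCache {
  private readonly entries = new Map<string, PlannedStep>();
  private readonly planner: Planner;
  modelCalls = 0;

  constructor(planner: Planner) {
    this.planner = planner;
  }

  // Return the stored step for a prompt, calling the planner only on a miss.
  resolve(prompt: string): PlannedStep {
    const cached = this.entries.get(prompt);
    if (cached !== undefined) return cached;
    this.modelCalls += 1;
    const step = this.planner(prompt);
    this.entries.set(prompt, step);
    return step;
  }
}
```

A cache keyed on the prompt alone goes stale as soon as the layout changes, so a practical version would need some check that the stored step still matches the screen before replaying it.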
Caveats
- The “pure-vision” approach means you’re paying for image tokens on every action; costs scale with screenshot frequency, not page complexity
- Mobile automation requires local device setup (adb, WebDriverAgent) — no cloud farm integration is mentioned
- The community has already produced Python and Java SDKs, which suggests the core TypeScript API does not suit every team
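For the cost caveat, a back-of-the-envelope estimate helps. The function below is a rough sketch that assumes image tokens dominate the bill and that each screenshot costs a fixed number of tokens. Both are simplifications, and every input is a placeholder to replace with your model provider's real figures.

```typescript
// Hypothetical inputs; none of these values come from Midscene or any provider.
interface CostInputs {
  actions: number;                  // UI actions in one test run
  screenshotsPerAction: number;     // screenshots sent to the model per action
  imageTokensPerScreenshot: number; // depends on the model and image size
  usdPerMillionInputTokens: number; // provider list price
}

// Estimated image-token cost of one run, in US dollars.
function estimateImageCostUsd(c: CostInputs): number {
  const tokens = c.actions * c.screenshotsPerAction * c.imageTokensPerScreenshot;
  return (tokens / 1e6) * c.usdPerMillionInputTokens;
}
```

Page complexity does not appear anywhere in the formula. In this model, the only levers are how many actions a run takes and how many screenshots each one sends.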
Verdict
Worth a look if your test suite keeps breaking because someone moved a div three pixels left. Less compelling if you already have stable selectors and want the cheapest possible CI run. The open-source model support is the real differentiator against closed alternatives.