reworkd/tarsier
Vision utility library that tags interactable webpage elements with IDs so that LLMs can issue commands like CLICK [23].

Tarsier provides visual perception for web interaction agents. It tags buttons, links, and input fields with bracketed IDs, creating a mapping between DOM elements and identifiers that an LLM can reference in its commands. It integrates with Playwright and Selenium to capture webpage state, supports OCR for understanding a page's visual structure, and works alongside GPT-4V for multimodal page understanding. The library is published as a PyPI package and is used by Reworkd to power autonomous web agents across thousands of real tasks.
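
To make the tag-to-element mapping concrete, here is a minimal, self-contained Python sketch of the idea. It is not Tarsier's API: the `Element` class, `tag_elements`, `resolve_command`, and the `[N] label` text format are all hypothetical names and conventions chosen for illustration, and the sketch works on a hand-written element list rather than a live Playwright or Selenium page.

```python
import re
from dataclasses import dataclass


@dataclass
class Element:
    """Hypothetical stand-in for an interactable DOM element."""
    kind: str   # e.g. "button", "link", or "input"
    xpath: str  # locator a browser driver could use to find the element
    label: str  # visible text shown to the model


def tag_elements(elements):
    """Assign sequential IDs; return (tagged text, ID -> XPath mapping)."""
    tag_to_xpath = {}
    lines = []
    for tag_id, element in enumerate(elements):
        tag_to_xpath[tag_id] = element.xpath
        lines.append(f"[{tag_id}] {element.label}")
    return "\n".join(lines), tag_to_xpath


# Illustrative command grammar: an action, a bracketed ID, optional argument.
COMMAND_RE = re.compile(r"^(CLICK|TYPE)\s+\[(\d+)\](?:\s+(.*))?$")


def resolve_command(command, tag_to_xpath):
    """Turn a model command such as 'CLICK [1]' into (action, xpath, argument)."""
    match = COMMAND_RE.match(command.strip())
    if match is None:
        raise ValueError(f"Unrecognized command: {command!r}")
    action, tag_id, argument = match.group(1), int(match.group(2)), match.group(3)
    if tag_id not in tag_to_xpath:
        raise KeyError(f"No element tagged [{tag_id}]")
    return action, tag_to_xpath[tag_id], argument


elements = [
    Element("button", "//button[1]", "Sign in"),
    Element("link", "//a[1]", "Docs"),
    Element("input", "//input[1]", "Search"),
]
page_text, tag_to_xpath = tag_elements(elements)
```

In this sketch, `page_text` is what would be sent to the model, and `resolve_command("CLICK [1]", tag_to_xpath)` translates the model's reply back into a locator that a browser driver could act on. Keeping the mapping outside the prompt means the model only ever handles short IDs rather than full XPaths.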