Copy-paste detective that speaks fluent LLM
A 5.7k-star duplication detector rebuilt itself for the agentic era: token-efficient reporters, MCP server, and skills your AI assistant can actually use.
What it does
jscpd hunts down duplicated code across 223 programming languages and document formats using the Rabin-Karp algorithm. Run it as a CLI tool, embed it via its TypeScript API, or spin up a local server to check snippets over HTTP.
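For readers who have not met Rabin-Karp before, here is a minimal sketch of the general idea applied to token streams: hash every window of k tokens with a rolling hash, index one file's windows by hash, then look up the other file's windows and verify each candidate token by token. This is not jscpd's actual code. The names (tokenCode, windowHashes, findClones) and the constants BASE and MOD are hypothetical, chosen only for illustration, and real tokenization is replaced by a whitespace split.

```typescript
// Illustrative only: these names are hypothetical and not part of jscpd's API.
const BASE = 257;
const MOD = 1_000_003; // small prime keeps every product below 2^53

interface CloneMatch {
  aStart: number; // index of the first matching token in file A
  bStart: number; // index of the first matching token in file B
}

// Map a token string to a small integer so it can feed the rolling hash.
function tokenCode(token: string): number {
  let h = 0;
  for (let i = 0; i < token.length; i++) {
    h = (h * 31 + token.charCodeAt(i)) % MOD;
  }
  return h;
}

// Hash of every window of k consecutive tokens, indexed by window start.
function windowHashes(tokens: string[], k: number): number[] {
  const hashes: number[] = [];
  if (k <= 0 || tokens.length < k) return hashes;
  const codes = tokens.map(tokenCode);

  // BASE^(k-1) mod MOD, used to remove the token that leaves the window.
  let highPow = 1;
  for (let i = 1; i < k; i++) highPow = (highPow * BASE) % MOD;

  let hash = 0;
  for (let i = 0; i < k; i++) hash = (hash * BASE + codes[i]) % MOD;
  hashes.push(hash);

  // Slide the window: drop the oldest token, append the newest, in O(1).
  for (let i = k; i < codes.length; i++) {
    const leaving = (codes[i - k] * highPow) % MOD;
    hash = ((hash - leaving + MOD) * BASE + codes[i]) % MOD;
    hashes.push(hash);
  }
  return hashes;
}

// Find every pair of windows of minTokens tokens that is identical in a and b.
function findClones(a: string[], b: string[], minTokens: number): CloneMatch[] {
  const index = new Map<number, number[]>();
  windowHashes(a, minTokens).forEach((h, start) => {
    const starts = index.get(h) ?? [];
    starts.push(start);
    index.set(h, starts);
  });

  const matches: CloneMatch[] = [];
  windowHashes(b, minTokens).forEach((h, bStart) => {
    for (const aStart of index.get(h) ?? []) {
      // Hashes can collide, so confirm that the tokens really match.
      let same = true;
      for (let i = 0; i < minTokens; i++) {
        if (a[aStart + i] !== b[bStart + i]) {
          same = false;
          break;
        }
      }
      if (same) matches.push({ aStart, bStart });
    }
  });
  return matches;
}
```

Calling findClones on "const x = 1 ; return x ;" and "let y ; const x = 1 ;" (each split on spaces) with minTokens set to 4 reports two overlapping windows, starting at tokens 0 and 1 of the first input and tokens 3 and 4 of the second. A production detector would merge such overlapping windows into one clone and map token indices back to source lines; the sketch leaves both out.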
The interesting bit
The project didn’t just slap an “AI-ready” sticker on the box. It built three distinct integration paths: an ai reporter that compresses output to ~1,100 tokens (79% fewer than the default console reporter), installable agent skills that teach assistants how to invoke jscpd and refactor what it finds, and a full MCP server so Claude Desktop et al. can call check_duplication as a native tool. The v4.2.x release also replaced prismjs with a custom reprism-based tokenizer, yielding an 11.5% speedup and enabling cross-format detection — a <script> block in a .vue file can now match a plain .ts file.
Key highlights
- Supports 223 formats, up from 152 in recent releases, including shebang detection for extensionless scripts
- Monorepo architecture: core algorithm, finder, tokenizer, and reporters are separately installable packages
- Multiple output formats: console, HTML, badge, SARIF (GitHub Code Scanning compatible), and the token-efficient ai reporter
- LevelDB-backed store option for large repositories, plus a persistent memory store for incremental scans
- Used by GitHub Super Linter, Mega-Linter, Codacy, and Code-Inspector
Caveats
- The MCP server and AI skills are relatively new; the README notes them but doesn’t show real-world agent integration examples beyond config snippets
- LevelDB store is explicitly marked “slower than default store” — the trade-off for handling bigger repos is performance
- Recent bug fixes reveal the codebase has had edge-case issues: entire-file duplicates were silently dropped until #728, and a ReDoS vulnerability in Lisp tokenization required a regex rewrite
Verdict
Worth a look if you’re running a polyglot codebase or wiring duplication checks into an AI-assisted workflow. Skip if you only need basic clone detection in a single language: simpler tools will do without the cognitive overhead of MCP configs and skill installation.