adbar/trafilatura
Python library for crawling web pages and extracting clean text, metadata, and structured data from raw HTML.

Velocity (7d): +2.3 ★/day · Trend: steady
Trafilatura is a Python package and command-line tool that crawls, downloads, and extracts the main text, metadata, and comments from web pages. It converts raw HTML into JSON, CSV, XML, Markdown, or plain text, applying readability heuristics to filter out boilerplate and keep the article content. It is commonly used to build text corpora and to feed web-sourced data into LLM pipelines, RAG systems, and NLP workflows.
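As a rough illustration of the extraction step described above, the following minimal sketch passes an HTML string to `trafilatura.extract`. The sample page and the helper name `extract_main_text` are made up for this example and are not part of the library. It assumes the package is installed (`pip install trafilatura`), and the result may be `None` when the extractor finds too little content.

```python
from typing import Optional

# Hypothetical sample page, invented for this example: an article
# surrounded by navigation and footer boilerplate.
SAMPLE_HTML = """
<html>
  <head><title>Example article</title></head>
  <body>
    <nav><a href="/">Home</a> <a href="/about">About</a></nav>
    <article>
      <h1>Example article</h1>
      <p>This paragraph stands in for the main body of a web article. It is
      deliberately long enough to look like real content, because very short
      pages may be rejected by the extractor as having no usable text.</p>
      <p>A second paragraph continues the article so that the main content
      clearly outweighs the navigation links and the footer around it.</p>
    </article>
    <footer>Copyright notice and unrelated links</footer>
  </body>
</html>
"""


def extract_main_text(html: str, output_format: str = "txt") -> Optional[str]:
    """Return the main content of an HTML document, or None if nothing is found.

    `extract_main_text` is an illustrative wrapper, not a trafilatura function.
    """
    # Imported lazily so this module can be loaded without trafilatura installed.
    import trafilatura

    # output_format selects the serialization, e.g. "txt", "json" or "xml";
    # include_comments=False leaves out user comment sections.
    return trafilatura.extract(
        html,
        output_format=output_format,
        include_comments=False,
    )


if __name__ == "__main__":
    print(extract_main_text(SAMPLE_HTML))
```

To process a live page instead of a local string, the HTML can first be downloaded with `trafilatura.fetch_url(url)`, which returns `None` on failure, and then passed to the same helper.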