adbar/trafilatura

Python library for crawling web pages and extracting clean text, metadata, and structured data from raw HTML.

6.1k stars · Python · Data Tooling · RAG · Search
Velocity (7d): +2.3 ★/day · Trend: steady

Trafilatura is a Python package and CLI tool that crawls, downloads, and extracts main text, metadata, and comments from web pages. It converts raw HTML into output formats such as JSON, CSV, XML, Markdown, or plain text, applying readability heuristics to filter out boilerplate and keep the article content. It is commonly used to build text corpora and to feed web-sourced data into LLM pipelines, RAG systems, and NLP workflows.
