approximatelabs/sketch

Pandas copilot that actually reads your data first

Sketch feeds column summaries into LLMs so its code suggestions know what they're working with.

2.3k stars · Python · Coding Assistants · LLMOps · Eval
Velocity (7d): +1.6 ★/day · Trend: steady

What it does

Sketch is a Python library that bolts a .sketch accessor onto any pandas DataFrame. You can ask natural-language questions about your data (df.sketch.ask), request generated code snippets (df.sketch.howto), or run LLM-powered transformations row by row (df.sketch.apply). No IDE plugin is required: import sketch and the accessor is available on every DataFrame.
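To make the accessor idea concrete, here is a minimal sketch of how a library can attach its own namespace to every DataFrame with pandas' register_dataframe_accessor. This is not Sketch's source code: the "demo" namespace, the DemoAccessor class, and its describe_for_prompt method are hypothetical names chosen for illustration, and no LLM is called.

```python
import pandas as pd


# Hypothetical "demo" namespace for illustration; Sketch exposes its own as .sketch.
@pd.api.extensions.register_dataframe_accessor("demo")
class DemoAccessor:
    def __init__(self, pandas_obj: pd.DataFrame) -> None:
        # pandas passes in the DataFrame the accessor was reached from.
        self._df = pandas_obj

    def describe_for_prompt(self, question: str) -> str:
        """Build the text a tool might send to an LLM (nothing is sent here)."""
        columns = ", ".join(
            f"{name} ({dtype})" for name, dtype in self._df.dtypes.items()
        )
        return f"Columns: {columns}\nRows: {len(self._df)}\nQuestion: {question}"


df = pd.DataFrame({"city": ["Oslo", "Lima"], "population": [700_000, 10_000_000]})
prompt = df.demo.describe_for_prompt("Which column is numeric?")
print(prompt)
```

Once the class is registered, any DataFrame in the session gains the namespace, which is why a single import is enough to make the methods available.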

The interesting bit

The hook is in the name: “sketch” refers to data sketches, the approximation algorithms that summarize your columns cheaply. Rather than dumping the whole DataFrame into the prompt (expensive, slow, privacy nightmare), Sketch compresses the schema and statistics into context the LLM can actually use. It’s a pragmatic compression layer between your data and a language model that otherwise works blind.
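The compression idea can be illustrated with a small, standard-library-only example. It is an assumption-laden sketch, not Sketch's implementation: summarize_column and build_context are hypothetical helpers, and they use exact counts where real data sketches (HyperLogLog, for example) would use approximate, fixed-memory estimates.

```python
from collections import Counter


def summarize_column(name, values, top_k=3):
    """Reduce a column to a few cheap statistics instead of its raw values."""
    present = [v for v in values if v is not None]
    counts = Counter(present)  # exact counts; a true sketch would approximate these
    return {
        "name": name,
        "type": type(present[0]).__name__ if present else "unknown",
        "rows": len(values),
        "missing": len(values) - len(present),
        "distinct": len(counts),
        "most_common": [value for value, _ in counts.most_common(top_k)],
    }


def build_context(table):
    """Render one summary line per column, ready to prepend to a prompt."""
    lines = []
    for name, values in table.items():
        s = summarize_column(name, values)
        lines.append(
            f"{s['name']}: type={s['type']}, rows={s['rows']}, "
            f"missing={s['missing']}, distinct={s['distinct']}, "
            f"most_common={s['most_common']}"
        )
    return "\n".join(lines)


# Toy table stored as column name -> list of values (None marks a missing cell).
table = {
    "state": ["CA", "NY", "CA", None, "TX", "CA"],
    "sales": [120, 80, 95, 60, 80, 110],
}
context = build_context(table)
print(context)
```

The context grows with the number of columns rather than the number of rows, which is the property that keeps prompts small and keeps most raw values out of them.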

Key highlights

  • Three modes: ask for exploration, howto for code generation, apply for data generation/transforms
  • Runs against a hosted endpoint by default (prompts.approx.dev) for zero-config startup
  • Can switch to local Hugging Face models (MPT-7B, StarCoder) or your own OpenAI key via environment variables
  • Built on the team’s own lambdaprompt library for templated LLM calls
  • Explicitly targets the “glue work” of data cleaning, feature extraction, and compliance masking

Caveats

  • The apply mode requires an OpenAI API key; the free hosted endpoint won’t cover everything
  • Local model setup involves three environment variables and a weights download, so “usable in seconds” really means “usable in seconds if you use their cloud endpoint”
  • The README’s future hope of “custom made data + language foundation models” is just that: future hope

Verdict

Worth a spin if you live in pandas and want quick, context-aware code stubs without leaving your notebook. Skip it if you need deterministic, auditable data pipelines: this is exploratory acceleration, not production infrastructure.
