chatopera/insuranceqa-corpus-zh

A Chinese insurance Q&A dataset that makes you buy a license

Real-world insurance questions and expert answers, translated into Chinese and packaged for machine learning, but the data itself sits behind a store checkout.

What it does

This is a Chinese question-answering corpus for the insurance domain, translated from the original English insuranceQA dataset. The training split holds 12,889 questions and 21,325 answers, with separate validation and test splits, and every question comes with 200 hard negative candidates for answer-selection tasks.

The interesting bit

The project ships two flavors: raw translated Q&A text, and a “pool” version that’s already tokenized, stop-word-stripped, and labeled—ready to feed into models without the usual NLP janitorial work. The negatives are retrieval-based, so they’re plausible distractors rather than random noise.
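To make "retrieval-based negatives" concrete, here is a minimal sketch of the general idea: rank every non-matching answer by how much it overlaps with the question and keep the top few as distractors. The corpus does not document its retrieval method here, so the token-overlap scoring, the function names, and the toy English data are all assumptions made for illustration.

```python
from collections import Counter


def overlap_score(question_tokens, answer_tokens):
    """Count tokens shared by question and answer (multiset intersection)."""
    return sum((Counter(question_tokens) & Counter(answer_tokens)).values())


def hard_negatives(question_tokens, positive_ids, answers, k):
    """Return ids of the k highest-overlap answers that are not positives.

    `answers` maps a hypothetical answer id to its token list.
    """
    candidates = [
        (overlap_score(question_tokens, tokens), answer_id)
        for answer_id, tokens in answers.items()
        if answer_id not in positive_ids
    ]
    # Highest score first; ties broken by id so the result is deterministic.
    candidates.sort(key=lambda pair: (-pair[0], pair[1]))
    return [answer_id for _, answer_id in candidates[:k]]


# Invented toy data, not taken from the corpus.
answers = {
    "a1": ["term", "life", "insurance", "covers", "a", "fixed", "period"],
    "a2": ["whole", "life", "insurance", "builds", "cash", "value"],
    "a3": ["auto", "insurance", "covers", "collision", "damage"],
    "a4": ["deductibles", "apply", "per", "claim"],
}
question = ["what", "does", "term", "life", "insurance", "cover"]
negatives = hard_negatives(question, {"a1"}, answers, k=2)
```

In this toy pool the whole-life answer outranks the auto and deductible answers because it shares more vocabulary with the question, which is what makes it a harder distractor than a randomly sampled one.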

Key highlights

  • ~27K expert-curated answers across train/valid/test splits
  • Each question carries 1–5 positive answers and 200 retrieval-built negatives
  • Dual-format release: raw translated corpus or preprocessed ML-ready data
  • Python package (insuranceqa_data) handles loading via simple API calls
  • Bundled baseline models: deep QA, CNN/TensorFlow, n-grams, word2vec
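The positive/negative structure described above maps naturally onto an answer-selection setup. The sketch below expands one question into labeled pairs and measures top-1 accuracy for any scoring function. The record layout (`question`, `positives`, `negatives`), the function names, and the word-overlap scorer are hypothetical; they are not the `insuranceqa_data` package's actual API or file format.

```python
def to_labeled_pairs(record):
    """Expand one record into (question, answer, label) training triples."""
    pairs = [(record["question"], answer, 1) for answer in record["positives"]]
    pairs += [(record["question"], answer, 0) for answer in record["negatives"]]
    return pairs


def top1_accuracy(records, score):
    """Fraction of questions whose best-scored candidate is a positive."""
    hits = 0
    for record in records:
        pool = record["positives"] + record["negatives"]
        best = max(pool, key=lambda answer: score(record["question"], answer))
        hits += best in record["positives"]
    return hits / len(records)


def word_overlap(question, answer):
    """Toy scorer: number of distinct words the two strings share."""
    return len(set(question.split()) & set(answer.split()))


# Invented example record; the real corpus has 200 negatives per question.
records = [
    {
        "question": "does home insurance cover floods",
        "positives": ["standard home insurance usually excludes floods"],
        "negatives": [
            "auto insurance covers collision",
            "term life pays a death benefit",
        ],
    }
]
pairs = to_labeled_pairs(records[0])
accuracy = top1_accuracy(records, word_overlap)
```

A trained model would replace `word_overlap` with its own scoring function; the evaluation loop stays the same.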

Caveats

  • The actual corpus download requires purchasing a license from the Chatopera store; the PyPI package is just a downloader stub
  • Data is research-use-only with attribution requirements (Chunsong License + original paper citation)
  • README still lists Python 2.x as supported, which may signal stale maintenance

Verdict

Worth a look if you’re building Chinese insurance chatbots or benchmarking answer-selection models in a narrow domain. Skip it if you need open, frictionless data or a modern, actively maintained pipeline.
