budzianowski/multiwoz

The dataset that launched a thousand chatbots

MultiWOZ is the standard benchmark for task-oriented dialogue systems: roughly 10,000 conversations across hotels, restaurants, trains, and more, with annotated belief states that track what the user actually wants.

949 stars · Python · Data Tooling · Chat Assistants

What it does

MultiWOZ provides about 10,000 human-human dialogues spanning seven domains (hotel, restaurant, attraction, train, taxi, hospital, police). Each dialogue includes the user’s goal, the user and system utterances, and belief states tracking slot values across turns. It’s designed to train and evaluate dialogue systems that must understand user intent, track state, and generate appropriate responses. The data is split into train, validation, and test sets, with 1,000 dialogues each in validation and test.

The interesting bit

The dataset has been through two major rounds of corrections (2.0 → 2.1 by Amazon, then 2.2 by Google), which tells you something about how messy real dialogue annotation is. The README openly notes that “the goal sometimes was wrongly followed by the turkers” and that some dialogues weren’t finished, which is rare honesty in benchmark documentation. The joint accuracy metric counts a turn as correct only if ALL slots match, so there’s nowhere to hide partial understanding.
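To make the all-or-nothing nature of joint accuracy concrete, here is a minimal sketch. It assumes each turn's belief state has already been flattened into a dict of `domain-slot` keys; the function name, the key format, and the example values are illustrative and not taken from the repository's evaluation scripts.

```python
from typing import Dict, List

# Assumed representation: one flat dict per turn, e.g. {"hotel-area": "north"}.
BeliefState = Dict[str, str]


def joint_goal_accuracy(predicted: List[BeliefState], gold: List[BeliefState]) -> float:
    """Fraction of turns whose predicted state matches the gold state on every slot."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must cover the same turns")
    if not gold:
        return 0.0
    # A turn only counts when the whole dict is identical: no partial credit.
    exact = sum(1 for p, g in zip(predicted, gold) if p == g)
    return exact / len(gold)


# Illustrative two-turn dialogue (made-up values).
gold_turns = [
    {"hotel-area": "north"},
    {"hotel-area": "north", "hotel-pricerange": "cheap"},
]
predicted_turns = [
    {"hotel-area": "north"},
    {"hotel-area": "north", "hotel-pricerange": "moderate"},
]

score = joint_goal_accuracy(predicted_turns, gold_turns)
print(score)  # 0.5
```

The second turn gets one of its two slots right but still scores zero, which is why joint accuracy numbers look so much lower than per-slot accuracy would.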

Key highlights

  • 3,406 single-domain + 7,032 multi-domain dialogues (up to 5 domains)
  • Belief state structure: semi (domain slots), book (booking slots), booked (confirmed booking)
  • Hospital and police domains excluded from validation/test sets for fair comparison
  • Only system utterances have manual dialogue-act annotations; user acts were added heuristically in 2.1 via ConvLab
  • Benchmark tables track DST progress from 15.57% joint accuracy (MDBT, 2018) to 63.79% (TOATOD, 2023) on 2.2
  • Zero-shot LLM results now included (GPT-3.5, Codex) for comparison against fine-tuned models
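As a rough illustration of the semi / book / booked structure listed above, the sketch below flattens one turn's belief state into `domain-slot` pairs. The exact nesting (booked stored inside book), the empty-value markers, and all slot values are assumptions made for the example; check them against the JSON files of the version you download.

```python
from typing import Any, Dict

# Assumption: unfilled slots are marked with an empty string or "not mentioned".
EMPTY_VALUES = {"", "not mentioned"}


def flatten_belief_state(state: Dict[str, Dict[str, Any]]) -> Dict[str, str]:
    """Turn a nested per-domain belief state into flat 'domain-slot' -> value pairs."""
    flat: Dict[str, str] = {}
    for domain, sections in state.items():
        # "semi": slots describing the entity the user is looking for.
        for slot, value in sections.get("semi", {}).items():
            if value not in EMPTY_VALUES:
                flat[f"{domain}-{slot}"] = value
        # "book": slots needed to make a booking; "booked" holds confirmed bookings.
        for slot, value in sections.get("book", {}).items():
            if slot == "booked":
                continue  # confirmed bookings are results, not user constraints
            if value not in EMPTY_VALUES:
                flat[f"{domain}-book {slot}"] = value
    return flat


# Hypothetical hotel turn with made-up values.
example_state = {
    "hotel": {
        "semi": {"area": "north", "pricerange": "cheap", "name": "not mentioned"},
        "book": {"people": "2", "day": "", "booked": []},
    }
}

flat_state = flatten_belief_state(example_state)
print(flat_state)
```

Flattening like this is one common way to get states into a shape where two turns can be compared slot by slot.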

Caveats

  • No 1-to-1 mapping between dialogue acts and sentences
  • MUL/PMUL vs SNG/SSNG/WOZ filename conventions are easy to mix up
  • Some evaluation scripts (like SimpleTOD’s) inflate scores by conflating dontcare and none
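Two of the caveats above are easy to show in a few lines. The sketch below classifies a dialogue file by its name prefix and then contrasts a strict slot comparison with one that folds dontcare into none. The helper names and sample filenames are hypothetical, and the lenient normalizer is a generic illustration of the conflation, not a copy of any particular evaluation script.

```python
def dialogue_kind(filename: str) -> str:
    """Classify a dialogue file as multi- or single-domain from its name prefix."""
    name = filename.upper()
    if name.startswith(("MUL", "PMUL")):
        return "multi"
    if name.startswith(("SNG", "SSNG", "WOZ")):
        return "single"
    return "unknown"


def normalize_strict(value: str) -> str:
    """Keep 'dontcare' and 'none' as distinct values."""
    return value.strip().lower()


def normalize_lenient(value: str) -> str:
    """Fold 'dontcare' into 'none', which hides a class of tracking errors."""
    value = value.strip().lower()
    return "none" if value == "dontcare" else value


def slot_matches(predicted: str, gold: str, normalize) -> bool:
    """Compare one predicted slot value to the gold value after normalization."""
    return normalize(predicted) == normalize(gold)


# Illustrative filenames.
print(dialogue_kind("PMUL1234.json"))  # multi
print(dialogue_kind("SNG01856.json"))  # single

# The user said they don't care about the area; the model predicted nothing.
print(slot_matches("none", "dontcare", normalize_strict))   # False
print(slot_matches("none", "dontcare", normalize_lenient))  # True
```

Under the lenient comparison a model that never tracks "don't care" preferences still gets credit for those slots, so scores produced that way are not comparable with strictly evaluated ones.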

Verdict

Essential if you’re building or benchmarking task-oriented dialogue systems; skip if you’re doing open-domain chitchat or don’t want to wrestle with six-year-old data collection artifacts. The corrected 2.2 version is what you actually want to use.
