The dataset that launched a thousand chatbots
MultiWOZ is the standard benchmark for task-oriented dialogue systems—10k conversations across hotels, restaurants, trains, and more, with annotated belief states that track what the user actually wants.

What it does
MultiWOZ provides roughly 10,000 human-human dialogues spanning multiple domains (hotel, restaurant, attraction, train, taxi, hospital, police). Each dialogue includes goals, user/system utterances, and belief states tracking slot values across turns. It’s designed to train and evaluate dialogue systems that must understand user intent, track state, and generate appropriate responses. The data is split into training, validation, and test sets, with 1k dialogues each in validation and test.
The interesting bit
The dataset has been through multiple major corrections (2.0 → 2.1 by Amazon, then 2.2 by Google), which tells you something about how messy real dialogue annotation is. The README openly notes that “the goal sometimes was wrongly followed by the turkers” and that some dialogues weren’t finished, a rare bit of honesty in benchmark documentation. The joint accuracy metric includes ALL slots, so there’s nowhere to hide partial understanding.
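To make the "nowhere to hide" point concrete, here is a minimal sketch of joint accuracy as it is commonly defined for dialogue state tracking: a turn counts as correct only if every slot in the predicted belief state matches the reference. The function name, the flat `domain-slot` keys, and the example values are illustrative assumptions, not the benchmark's official scoring script.

```python
from typing import Dict, List

# Assumed representation: one flat dict per turn, e.g. {"hotel-area": "north"}.
BeliefState = Dict[str, str]


def joint_goal_accuracy(predictions: List[BeliefState],
                        references: List[BeliefState]) -> float:
    """Fraction of turns whose predicted belief state equals the reference exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must cover the same turns")
    if not references:
        return 0.0
    # Dict equality requires every slot and every value to match.
    correct = sum(1 for pred, gold in zip(predictions, references) if pred == gold)
    return correct / len(references)


gold = [
    {"hotel-area": "north"},
    {"hotel-area": "north", "hotel-stars": "4"},
]
pred = [
    {"hotel-area": "north"},
    {"hotel-area": "north", "hotel-stars": "3"},  # one wrong slot fails the whole turn
]
print(joint_goal_accuracy(pred, gold))  # 0.5
```

In the second turn the model gets one of two slots right, but the turn still scores zero, which is why joint accuracy numbers look so much lower than per-slot accuracy.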
Key highlights
- 3,406 single-domain + 7,032 multi-domain dialogues (up to 5 domains)
- Belief state structure: semi (domain slots), book (booking slots), booked (confirmed booking)
- Hospital and police domains excluded from validation/test sets for fair comparison
- System utterances only have manual dialogue-act annotations; user acts added heuristically in 2.1 via ConvLab
- Benchmark tables track DST progress from 15.57% joint accuracy (MDBT, 2018) to 63.79% (TOATOD, 2023, on 2.2)
- Zero-shot LLM results now included (GPT-3.5, Codex) for comparison against fine-tuned models
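The semi/book/booked belief state structure listed above can be sketched as a nested dictionary. Everything in this example is an assumption made for illustration: the slot names, the values, the placement of booked inside book, the use of an empty string for an unfilled slot, and the `flatten_belief_state` helper. Check the actual data files before relying on any of it.

```python
from typing import Any, Dict

# Hypothetical belief state for one turn; names and values are illustrative only.
belief_state: Dict[str, Dict[str, Any]] = {
    "hotel": {
        "semi": {"area": "north", "stars": "4", "parking": ""},
        "book": {
            "people": "2",
            "day": "",
            # Assumed layout: confirmed bookings kept as a list inside "book".
            "booked": [{"name": "example hotel", "reference": "ABC123"}],
        },
    },
}


def flatten_belief_state(state: Dict[str, Dict[str, Any]]) -> Dict[str, str]:
    """Collect filled slots into flat 'domain-slot' keys, skipping empty values."""
    flat: Dict[str, str] = {}
    for domain, sections in state.items():
        for slot, value in sections.get("semi", {}).items():
            if value:
                flat[f"{domain}-{slot}"] = value
        for slot, value in sections.get("book", {}).items():
            if slot == "booked":
                continue  # confirmed bookings are records, not slot values
            if value:
                flat[f"{domain}-book-{slot}"] = value
    return flat


print(flatten_belief_state(belief_state))
# {'hotel-area': 'north', 'hotel-stars': '4', 'hotel-book-people': '2'}
```

A flat view like this is convenient for comparing predicted and reference states slot by slot.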
Caveats
- No 1-to-1 mapping between dialogue acts and sentences
- MUL/PMUL vs SNG/SSNG/WOZ filename conventions are easy to mix up
- Some evaluation scripts (like SimpleTOD’s) inflate scores by conflating dontcare and none
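A toy example shows how conflating dontcare and none can inflate joint accuracy. This is a sketch under stated assumptions and does not reproduce any particular evaluation script: the `strict` and `conflating` normalizers, the `joint_accuracy` helper, and the slot values are all hypothetical.

```python
from typing import Callable, Dict, List

BeliefState = Dict[str, str]


def strict(value: str) -> str:
    """Keep 'dontcare' and 'none' as distinct values."""
    return value


def conflating(value: str) -> str:
    """Lossy normalization: treat 'dontcare' as if the slot were unset."""
    return "none" if value == "dontcare" else value


def joint_accuracy(preds: List[BeliefState], golds: List[BeliefState],
                   normalize: Callable[[str], str]) -> float:
    """Share of turns where every slot matches after normalization."""
    if not golds:
        return 0.0
    matches = 0
    for pred, gold in zip(preds, golds):
        slots = set(pred) | set(gold)
        # A slot missing from a state is read as "none".
        if all(normalize(pred.get(s, "none")) == normalize(gold.get(s, "none"))
               for s in slots):
            matches += 1
    return matches / len(golds)


golds = [{"hotel-area": "dontcare"}, {"hotel-area": "north"}]
preds = [{}, {"hotel-area": "north"}]  # the model missed the user's "dontcare"

print(joint_accuracy(preds, golds, strict))      # 0.5
print(joint_accuracy(preds, golds, conflating))  # 1.0
```

The model never registered that the user expressed no preference, yet the conflating scorer gives it full marks. When comparing published numbers, it is worth checking which normalization the scorer applied.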
Verdict
Essential if you’re building or benchmarking task-oriented dialogue systems; skip if you’re doing open-domain chitchat or don’t want to wrestle with six-year-old data collection artifacts. The corrected 2.2 version is what you actually want to use.