DestinyLinker/MingLi-Bench

LLMs face their toughest test: passing a fortune-telling exam

A benchmark that grades AI models on Chinese astrology, because reasoning about career and marriage from birth charts is apparently a legitimate ML evaluation now.

1.6k stars · Python · LLMOps · Eval · Language Models
Velocity (7d): +34 ★ / day · Trend: steady

What it does

MingLi-Bench runs multiple-choice questions from the annual Global Fortune Teller Competition (2022–2025) against LLMs via a tidy Python CLI. It covers Bazi (八字) and Ziwei Doushu (紫微斗数) across twelve life categories: career, health, marriage, wealth, and the rest of the human condition. Scoring is exact-match against ground truth, with no partial credit for poetic ambiguity.
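For a sense of what exact-match scoring on multiple-choice answers involves, here is a minimal sketch. It is not the repository's code: the function names, the four-option A-D format, and the convention that a response ends with "Answer: <letter>" are all assumptions made for illustration.

```python
import re
from typing import Optional, Sequence

# Assumption: the model is asked to finish with "Answer: <letter>".
# The lookahead stops words like "answer: definitely" from matching "d".
ANSWER_RE = re.compile(r"answer\s*[:：]\s*\(?([A-D])(?![A-Za-z])", re.IGNORECASE)


def extract_choice(response: str) -> Optional[str]:
    """Return the last declared option letter in a response, or None."""
    matches = ANSWER_RE.findall(response)
    return matches[-1].upper() if matches else None


def exact_match_accuracy(responses: Sequence[str], truths: Sequence[str]) -> float:
    """Fraction of responses whose extracted letter equals the ground truth."""
    if len(responses) != len(truths):
        raise ValueError("responses and truths must have the same length")
    if not truths:
        return 0.0
    # An unparseable response extracts to None and simply counts as wrong.
    correct = sum(extract_choice(r) == t for r, t in zip(responses, truths))
    return correct / len(truths)
```

Under this scheme a response that never states a letter scores zero, which is the "no partial credit" behaviour the write-up describes.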

The interesting bit

The --astro flag is the clever isolation layer: it injects pre-computed astrological charts so you’re testing reasoning, not whether the model can correctly convert a lunar birth date into heavenly stems and earthly branches. The authors also recommend --cot so the model can talk itself through the chart before committing to an answer, which is essentially chain-of-thought for chi distribution.
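The idea behind the two flags can be sketched as a prompt builder that optionally prepends a pre-computed chart and optionally asks for step-by-step reasoning. This is a hypothetical illustration only: `build_prompt`, its parameters, and the prompt wording are assumptions, not the project's actual prompt templates.

```python
from typing import Optional, Sequence


def build_prompt(
    question: str,
    options: Sequence[str],
    chart: Optional[str] = None,  # hypothetical stand-in for what --astro injects
    cot: bool = False,            # hypothetical stand-in for what --cot enables
) -> str:
    """Assemble a multiple-choice prompt, optionally with chart and CoT cue."""
    parts = []
    if chart is not None:
        # Supplying the chart means the model does not derive it from the birth date.
        parts.append("Pre-computed chart:\n" + chart)
    parts.append("Question: " + question)
    parts.extend(f"{letter}. {text}" for letter, text in zip("ABCD", options))
    if cot:
        parts.append(
            "Reason step by step through the chart, "
            "then finish with 'Answer: <letter>'."
        )
    else:
        parts.append("Reply with 'Answer: <letter>' only.")
    return "\n".join(parts)
```

Comparing accuracy with and without the chart argument is one way to separate chart-derivation errors from interpretive ones.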

Key highlights

  • 160 normalized questions from an actual professional competition, not synthetic fluff
  • Pre-computed charts via iztro separate chart derivation from interpretive reasoning
  • CLI auto-routes through OpenRouter or native providers (OpenAI, Anthropic, Google, DeepSeek, Doubao/Volcengine)
  • Filter by year, category, or sample size; shuffle options to catch position bias
  • Outputs per-question JSON, summary text, and raw response files for post-mortem debugging
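Shuffling options to catch position bias comes down to permuting the choices while tracking where the correct answer lands. The sketch below shows one way to do that; `shuffle_options` and `position_counts` are hypothetical helpers, and the repository may implement shuffling differently.

```python
import random
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def shuffle_options(
    options: Sequence[str], answer_index: int, seed: int
) -> Tuple[List[str], int]:
    """Shuffle options reproducibly and return the answer's new index."""
    rng = random.Random(seed)  # local RNG so a seed always gives the same order
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    # shuffled[j] == options[order[j]], so the answer sits where order[j] == answer_index.
    return shuffled, order.index(answer_index)


def position_counts(picked_letters: Sequence[str], letters: str = "ABCD") -> Dict[str, int]:
    """Count how often each option position was chosen across a run."""
    counts = Counter(picked_letters)
    return {letter: counts.get(letter, 0) for letter in letters}
```

If a model's picks stay heavily skewed toward one letter after shuffling, that points to position bias rather than chart reading.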

Caveats

  • The README doesn’t publish any actual model scores or leaderboards, so you’ll be running your own comparisons blind
  • 160 questions is modest; year-filtering drops it further
  • No mention of how human fortune tellers score on the same set, so “benchmark” is a generous framing

Verdict

Grab this if you’re building Chinese-cultural LLM evals or just want to watch GPT-4o reason about someone’s 灾劫 (calamity) cycle. Skip it if you need established, peer-reviewed benchmarks with published baselines; this is more niche tooling than settled science.
