Skip to Content

Hallucination Benchmark Tool: Measure Fabrication Risk on Your Stack

NEO built a benchmark that measures factual confabulation, confident wrongness, and resistance to adversarial prompts. The goal is to compare models and prompts on trust and calibration, not only fluency or generic accuracy.


Problem Statement

We asked NEO to automate tests where answers must align with verifiable facts or supplied context. Teams need hallucination rates by domain, confidence calibration (whether stated certainty matches correctness), and scores on adversarial prompts that tempt models to invent details. A single accuracy number misses the failure modes that hurt production most.


Solution Overview

  1. Grounded and open-domain probes: Question sets with verified answers across domains such as general knowledge, science, history, health, law, math, and recent events.
  2. Three failure modes: Factual confabulation, confident wrongness, and adversarial hallucination under misleading prompts.
  3. Calibration: Expected calibration error (ECE), reliability-style views, and overconfidence on wrong answers.
  4. Adversarial suites: False premises, citation traps, recency probes, and leading questions, each with its own resistance score.
  5. Outputs: JSON and HTML reports; optional CI thresholds with non-zero exit codes on regression.

Hallucination benchmark pipeline

What “Hallucination” Covers Here

Factual confabulation is plausible but false content about verifiable facts: wrong dates, invented papers, bad statistics stated as fact. Confident wrongness is worse for users: the model is incorrect yet signals high confidence, which increases misplaced trust. Adversarial hallucination appears when prompts embed false premises, ask for citations in obscure areas, or presuppose nonexistent reports. The benchmark scores each category separately so you can target mitigations (retrieval, disclaimers, input filtering) to the actual failure shape.


Workflow / Pipeline

StepDescription
1. ConfigureChoose model endpoint, domain sets, and adversarial intensity
2. RunBatch prompts through the model; collect answers and confidence signals
3. VerifyCompare answers to ground truth; extract confidence via logprobs or verbal scales
4. CalibrateCompute ECE, overconfidence rate, and per-domain hallucination rates
5. ReportHTML for humans, JSON for automation; optional dynamic mode with multi-model consensus for ground truth

Interpreting Results for Deployment

High hallucination rate in specific domains suggests narrowing scope, adding retrieval for those topics, or switching models. Poor calibration points to surfacing uncertainty, response filtering, or human review on low-confidence paths. Low adversarial resistance suggests stronger input validation or output checks on high-stakes queries. The report is structured so each metric maps to a practical mitigation, not only a leaderboard position.


Example CLI

After cloning the repository and installing dependencies, you can run the CLI with a topic, optional dynamic ground-truth generation, model id, and entry count. Results include hallucination rate and related metrics per entry, with JSON suitable for dashboards and CI gates.


Repository & Artifacts

dakshjain-1616/Hallucination-Benchmark-ToolView on GitHub

Generated Artifacts:


References

View source on GitHub


Learn More