Prompt A/B Testing Tool
Scientific A/B testing for AI prompts — compare prompt variants with statistical rigor, LLM-as-judge scoring, and beautiful self-contained HTML reports
Problem Statement
We asked NEO to build a scientific A/B testing framework for AI prompts with:
- Side-by-side comparison of two prompt variants across real datasets
- LLM-as-judge quality scoring on a 1–10 scale
- Rigorous statistical analysis — t-tests, p-values, Cohen’s d, and confidence intervals
- Per-request tracking of response time, token count, and cost
- Interactive HTML reports with ROI projections — all from a single CLI command
Solution Overview
NEO built a comprehensive prompt evaluation framework that brings statistical rigor to prompt engineering:
- Multi-Provider Test Engine runs both variants concurrently against Anthropic, OpenAI, and OpenRouter across built-in or custom datasets
- LLM-as-Judge Evaluator uses Claude to score each response 1–10 with structured per-response reasoning
- Statistical Analysis Pipeline computes t-tests, p-values, Cohen’s d, and 95% confidence intervals to determine winners with quantified confidence
- Interactive Report Builder generates self-contained HTML reports with Chart.js visualizations, side-by-side comparisons, and cost savings projections
The system identifies winning prompts with up to 97% statistical confidence — turning gut-feel prompt decisions into data-driven engineering.
Workflow / Pipeline
| Step | Description |
|---|---|
| 1. Prompt & Dataset Input | Accept two prompt variants (with `{input}` placeholder) and a dataset — built-in (customer_support, code_tasks, creative_prompts) or custom JSON |
| 2. Provider Selection | Route requests to Anthropic (Claude Sonnet 4), OpenAI (GPT-4o), or OpenRouter — configurable via CLI flag or interactive mode |
| 3. Parallel Response Generation | Both variants run against every dataset item, capturing response text, latency, token counts, and estimated cost per call |
| 4. LLM-as-Judge Scoring | Claude evaluates each response 1–10 with structured reasoning — consistent, bias-controlled evaluation across all test cases |
| 5. Statistical Analysis | Independent samples t-test, p-value against 0.05 threshold, Cohen’s d effect size, and 95% confidence intervals via stats_calculator.py |
| 6. Winner Declaration | Statistical significance check declares the winner — confidence level, p-value, and quality improvement percentage surfaced in terminal and report |
| 7. ROI Analysis | Cost-per-response delta projected to scale (10K, 100K, 1M requests) — quantifies the business case for switching prompts |
| 8. HTML Report Generation | Self-contained report with Chart.js visualizations, expandable test case details, metrics comparison table, and PDF/Markdown export |
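Step 7's ROI projection reduces to multiplying the per-response cost delta by each target volume. A minimal sketch of that calculation (the function name and signature are assumptions for illustration, not taken from the repository):

```python
def project_savings(cost_per_response_a: float, cost_per_response_b: float,
                    scales=(10_000, 100_000, 1_000_000)) -> dict:
    """Project the per-response cost delta (B minus A) to production volumes.

    A positive value means switching to Prompt A saves money at that scale.
    """
    delta = cost_per_response_b - cost_per_response_a
    return {n: round(delta * n, 2) for n in scales}
```

With the per-response costs from the example run below ($0.0042 vs $0.0051), this yields $900 at 1M requests.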
Repository & Artifacts
Generated Artifacts:
- CLI test runner with interactive and argument modes (`neo_test.py`)
- Multi-provider API integration — Anthropic, OpenAI, OpenRouter (`evaluator.py`)
- LLM-as-judge scoring engine with 1–10 quality scale and reasoning
- Statistical analysis module — t-tests, p-values, Cohen's d, confidence intervals (`stats_calculator.py`)
- Self-contained HTML report builder with Chart.js visualizations (`report_builder.py`)
- 3 built-in datasets: customer support (20 cases), code tasks (20 cases), creative prompts (20 cases)
- Custom dataset support via JSON file input
- ROI calculator projecting cost savings at 10K / 100K / 1M request scale
- Jinja2 HTML template with PDF and Markdown export buttons
- Shell script runner for batch test automation (`run_test.sh`)
Technical Details
- Test Engine:
  - Both variants run against the full dataset for every test
  - `{input}` placeholder for variable injection into prompts
  - Per-request metrics: response time, token counts, cost estimate
  - Supports inline prompts or file path references for long system prompts
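Placeholder injection amounts to a guarded string substitution. A minimal sketch, assuming the tool rejects templates missing the placeholder (the function name is illustrative, not the repository's actual API):

```python
def render_prompt(template: str, item_input: str) -> str:
    """Inject a dataset item into a prompt variant via the {input} placeholder."""
    if "{input}" not in template:
        # Mirrors the tool's documented behavior of rejecting configs
        # that lack the placeholder.
        raise ValueError("prompt template must contain the {input} placeholder")
    return template.replace("{input}", item_input)
```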
- LLM-as-Judge:
  - Claude scores each response 1–10 with structured reasoning per case
  - Same evaluation criteria applied uniformly across both variants
  - Judge scores are the primary quality signal for statistical comparison
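Structured judge output has to be parsed and range-checked before it feeds the statistics. A sketch of one way to do this, assuming the judge is instructed to reply with a JSON object like `{"score": 8, "reasoning": "..."}` (this schema and the function name are assumptions, not taken from `evaluator.py`):

```python
import json
import re

def parse_judge_response(raw: str) -> tuple[int, str]:
    """Extract a 1-10 score and its reasoning from the judge's reply text."""
    # Models often wrap JSON in prose; grab the outermost braces.
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if not match:
        raise ValueError("no JSON object found in judge response")
    payload = json.loads(match.group(0))
    score = int(payload["score"])
    if not 1 <= score <= 10:
        raise ValueError(f"score out of range: {score}")
    return score, payload.get("reasoning", "")
```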
- Statistical Analysis:
  - Independent samples t-test via `scipy.stats.ttest_ind`
  - Significance threshold: p-value < 0.05
  - Cohen's d effect size (small: 0.2, medium: 0.5, large: 0.8)
  - 95% confidence intervals on mean quality scores
  - Percentage improvement: `(mean_A - mean_B) / mean_B × 100`
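The statistics above fit in a short function. A sketch of the pipeline using `scipy.stats.ttest_ind` as named in the list (the function name and return shape are assumptions; `stats_calculator.py` may organize this differently):

```python
from statistics import mean, stdev
from scipy import stats

def analyze(scores_a: list, scores_b: list, alpha: float = 0.05) -> dict:
    """Compare two sets of judge scores with a t-test, Cohen's d, and a 95% CI."""
    t_stat, p_value = stats.ttest_ind(scores_a, scores_b)
    na, nb = len(scores_a), len(scores_b)
    # Cohen's d with pooled standard deviation
    pooled_sd = (((na - 1) * stdev(scores_a) ** 2 +
                  (nb - 1) * stdev(scores_b) ** 2) / (na + nb - 2)) ** 0.5
    cohens_d = (mean(scores_a) - mean(scores_b)) / pooled_sd
    # 95% confidence interval on Prompt A's mean, via the t distribution
    ci_a = stats.t.interval(0.95, na - 1, loc=mean(scores_a),
                            scale=stats.sem(scores_a))
    improvement = (mean(scores_a) - mean(scores_b)) / mean(scores_b) * 100
    return {
        "p_value": p_value,
        "significant": p_value < alpha,
        "cohens_d": cohens_d,
        "ci_a": ci_a,
        "improvement_pct": improvement,
    }
```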
- Multi-Provider Support:
  - Anthropic: `claude-sonnet-4-20250514` (default)
  - OpenAI: `gpt-4o` (default)
  - OpenRouter: unified API for Claude, GPT, Mistral, Llama, and more
  - Provider and model overridable via `--provider` and `--model` flags
- Reporting:
  - Jinja2 + inline Chart.js — zero external dependencies at view time
  - Quality score bar charts and response time comparisons
  - Expandable accordion for per-test-case response details
  - ROI projection table at 10K, 100K, and 1M request scale
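The "self-contained at view time" property comes from emitting one HTML string with everything inlined. A stripped-down sketch of the metrics-table portion using stdlib escaping in place of the Jinja2 template (`report_builder.py` presumably renders a full Jinja2 template instead; this function and its signature are illustrative):

```python
import html

def build_metrics_table(metrics: list) -> str:
    """Render a self-contained HTML page with a Prompt A vs Prompt B metrics table.

    metrics: list of (name, value_a, value_b) tuples.
    """
    rows = "".join(
        f"<tr><td>{html.escape(name)}</td><td>{a}</td><td>{b}</td></tr>"
        for name, a, b in metrics
    )
    return (
        "<!DOCTYPE html><html><body>"
        "<table><tr><th>Metric</th><th>Prompt A</th><th>Prompt B</th></tr>"
        f"{rows}</table></body></html>"
    )
```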
Results
- Statistical Confidence: Up to 97.45% confidence in winner identification on 20-case datasets
- Quality Signal: LLM-as-judge scoring surfaced an 8.06% quality improvement that human review would miss at scale
- Cost Visibility: Per-response cost delta projected to dollar-level savings at production scale
- Report Portability: Self-contained HTML — shareable over Slack, email, or committed to repos
- Provider Flexibility: Re-run the same test against Claude, GPT-4o, or OpenRouter without code changes
Example Terminal Output
```
📊 Test Results Summary
┌─────────────────────┬───────────┬───────────┐
│ Metric              │ Prompt A  │ Prompt B  │
├─────────────────────┼───────────┼───────────┤
│ Quality Score       │ 8.45      │ 7.82      │
│ Response Time (s)   │ 1.234     │ 1.456     │
│ Tokens/Response     │ 245       │ 312       │
│ Cost/Response ($)   │ 0.0042    │ 0.0051    │
└─────────────────────┴───────────┴───────────┘

🏆 Winner: Prompt A
   Confidence: 97.45%
   p-value: 0.0255
   Effect Size (Cohen's d): 0.68 (medium)
   Quality Improvement: 8.06%
   Cost Savings at 1M requests: $900
```
Common Test Scenarios & Results
Tone Test (Professional vs Conversational):
Winner: Professional | Confidence: 94.2% | Δ Quality: +6.3%
Length Test (Concise vs Detailed):
Winner: Detailed | Confidence: 89.7% | Δ Quality: +4.1%
Structure Test (Freeform vs Formatted):
Winner: Formatted | Confidence: 96.1% | Δ Quality: +9.4%
Empathy Test (Empathetic vs Direct):
Winner: Empathetic | Confidence: 91.3% | Δ Quality: +5.7%
Best Practices & Lessons Learned
- Test one variable at a time — changing tone, length, and structure together makes it impossible to know what drove the result
- LLM-as-judge beats human review at scale — Claude applies the same rubric to every response, eliminating rater fatigue and bias
- 20 test cases is the minimum — below this, p-values become unreliable and Cohen’s d loses meaning
- Always include `{input}` — the tool rejects configs without it, enforcing variable injection discipline
- Test across providers before committing — a winning prompt on Claude may underperform on GPT-4o
- Lead with ROI when talking to stakeholders — "8% quality improvement" is abstract; "$900 saved per million requests" gets budget approved
- Use custom datasets for production decisions — built-ins are great for exploration, not for shipping
- Check Cohen’s d alongside p-value — a significant result with d < 0.2 may not be worth a prompt migration
- Re-run after model updates — prompt rankings can shift when providers release new versions
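The "check Cohen's d alongside p-value" guidance maps onto the standard thresholds listed in the Technical Details (0.2 / 0.5 / 0.8). A small helper sketch for labeling effect sizes (the function name is illustrative, not from the repository):

```python
def effect_size_label(d: float) -> str:
    """Classify a Cohen's d value using the conventional 0.2/0.5/0.8 cutoffs."""
    d = abs(d)  # direction doesn't matter for magnitude
    if d < 0.2:
        return "negligible"
    if d < 0.5:
        return "small"
    if d < 0.8:
        return "medium"
    return "large"
```

A statistically significant result labeled "negligible" here is exactly the case where a prompt migration may not be worth the effort.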
Next Steps
- Add A/B/n testing — rank three or more prompts in a single run
- Implement Bayesian A/B testing as an alternative for early stopping
- Build GitHub Actions integration to gate prompt library PRs on test results
- Add multi-turn conversation testing — currently single-turn only
- Create a prompt diff viewer in the HTML report
- Add Weights & Biases integration for experiment history tracking
- Support vision and multimodal prompt inputs (image + text)
- Build a prompt leaderboard backed by a persistent SQLite database
- Add confidence interval visualization in terminal output via Rich