Prompt A/B Testing Tool
Scientific A/B testing for AI prompts — compare prompt variants with statistical rigor, LLM-as-judge scoring, and beautiful self-contained HTML reports
Problem Statement
We asked NEO to build a scientific A/B testing framework for AI prompts with:
- Side-by-side comparison of two prompt variants across real datasets
- LLM-as-judge quality scoring on a 1–10 scale
- Rigorous statistical analysis — t-tests, p-values, Cohen’s d, and confidence intervals
- Per-request tracking of response time, token count, and cost
- Interactive HTML reports with ROI projections — all from a single CLI command
Solution Overview
NEO built a comprehensive prompt evaluation framework that brings statistical rigor to prompt engineering:
- Multi-Provider Test Engine runs both variants concurrently against Anthropic, OpenAI, and OpenRouter across built-in or custom datasets
- LLM-as-Judge Evaluator uses Claude to score each response 1–10 with structured per-response reasoning
- Statistical Analysis Pipeline computes t-tests, p-values, Cohen’s d, and 95% confidence intervals to determine winners with quantified confidence
- Interactive Report Builder generates self-contained HTML reports with Chart.js visualizations, side-by-side comparisons, and cost savings projections
The system identifies winning prompts with up to 97% statistical confidence — turning gut-feel prompt decisions into data-driven engineering.
Workflow / Pipeline
| Step | Description |
|---|---|
| 1. Prompt & Dataset Input | Accept two prompt variants (with `{input}` placeholder) and a dataset — built-in (customer_support, code_tasks, creative_prompts) or custom JSON |
| 2. Provider Selection | Route requests to Anthropic (Claude Sonnet 4), OpenAI (GPT-4o), or OpenRouter — configurable via CLI flag or interactive mode |
| 3. Parallel Response Generation | Both variants run against every dataset item, capturing response text, latency, token counts, and estimated cost per call |
| 4. LLM-as-Judge Scoring | Claude evaluates each response 1–10 with structured reasoning — consistent, bias-controlled evaluation across all test cases |
| 5. Statistical Analysis | Independent samples t-test, p-value against 0.05 threshold, Cohen’s d effect size, and 95% confidence intervals via stats_calculator.py |
| 6. Winner Declaration | Statistical significance check declares the winner — confidence level, p-value, and quality improvement percentage surfaced in terminal and report |
| 7. ROI Analysis | Cost-per-response delta projected to scale (10K, 100K, 1M requests) — quantifies the business case for switching prompts |
| 8. HTML Report Generation | Self-contained report with Chart.js visualizations, expandable test case details, metrics comparison table, and PDF/Markdown export |
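Step 7's ROI projection reduces to multiplying the per-response cost delta by each target volume. A minimal sketch of that calculation (the function name and signature are assumptions for illustration, not taken from the repository):

```python
def project_savings(cost_per_response_a: float, cost_per_response_b: float,
                    scales=(10_000, 100_000, 1_000_000)) -> dict:
    """Project the per-response cost delta (B minus A) to production volumes.

    A positive value means switching to Prompt A saves money at that scale.
    """
    delta = cost_per_response_b - cost_per_response_a
    return {n: round(delta * n, 2) for n in scales}
```

With the per-response costs from the example run below ($0.0042 vs $0.0051), this yields $900 at 1M requests.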
Repository & Artifacts
Generated Artifacts:
- CLI test runner with interactive and argument modes (`neo_test.py`)
- Multi-provider API integration — Anthropic, OpenAI, OpenRouter (`evaluator.py`)
- LLM-as-judge scoring engine with 1–10 quality scale and reasoning
- Statistical analysis module — t-tests, p-values, Cohen's d, confidence intervals (`stats_calculator.py`)
- Self-contained HTML report builder with Chart.js visualizations (`report_builder.py`)
- 3 built-in datasets: customer support (20 cases), code tasks (20 cases), creative prompts (20 cases)
- Custom dataset support via JSON file input
- ROI calculator projecting cost savings at 10K / 100K / 1M request scale
- Jinja2 HTML template with PDF and Markdown export buttons
- Shell script runner for batch test automation (`run_test.sh`)
Technical Details
- Test Engine:
  - Both variants run against the full dataset for every test
  - `{input}` placeholder for variable injection into prompts
  - Per-request metrics: response time, token counts, cost estimate
  - Supports inline prompts or file path references for long system prompts
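Placeholder injection amounts to a guarded string substitution. A minimal sketch, assuming the tool rejects templates missing the placeholder (the function name is illustrative, not the repository's actual API):

```python
def render_prompt(template: str, item_input: str) -> str:
    """Inject a dataset item into a prompt variant via the {input} placeholder."""
    if "{input}" not in template:
        # Mirrors the tool's documented behavior of rejecting configs
        # that lack the placeholder.
        raise ValueError("prompt template must contain the {input} placeholder")
    return template.replace("{input}", item_input)
```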
- LLM-as-Judge:
  - Claude scores each response 1–10 with structured reasoning per case
  - Same evaluation criteria applied uniformly across both variants
  - Judge scores are the primary quality signal for statistical comparison
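Structured judge output has to be parsed and range-checked before it feeds the statistics. A sketch of one way to do this, assuming the judge is instructed to reply with a JSON object like `{"score": 8, "reasoning": "..."}` (this schema and the function name are assumptions, not taken from `evaluator.py`):

```python
import json
import re

def parse_judge_response(raw: str) -> tuple[int, str]:
    """Extract a 1-10 score and its reasoning from the judge's reply text."""
    # Models often wrap JSON in prose; grab the outermost braces.
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if not match:
        raise ValueError("no JSON object found in judge response")
    payload = json.loads(match.group(0))
    score = int(payload["score"])
    if not 1 <= score <= 10:
        raise ValueError(f"score out of range: {score}")
    return score, payload.get("reasoning", "")
```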
- Statistical Analysis:
  - Independent samples t-test via `scipy.stats.ttest_ind`
  - Significance threshold: p-value < 0.05
  - Cohen's d effect size (small: 0.2, medium: 0.5, large: 0.8)
  - 95% confidence intervals on mean quality scores
  - Percentage improvement: `(mean_A - mean_B) / mean_B × 100`
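The statistics above fit in a short function. A sketch of the pipeline using `scipy.stats.ttest_ind` as named in the list (the function name and return shape are assumptions; `stats_calculator.py` may organize this differently):

```python
from statistics import mean, stdev
from scipy import stats

def analyze(scores_a: list, scores_b: list, alpha: float = 0.05) -> dict:
    """Compare two sets of judge scores with a t-test, Cohen's d, and a 95% CI."""
    t_stat, p_value = stats.ttest_ind(scores_a, scores_b)
    na, nb = len(scores_a), len(scores_b)
    # Cohen's d with pooled standard deviation
    pooled_sd = (((na - 1) * stdev(scores_a) ** 2 +
                  (nb - 1) * stdev(scores_b) ** 2) / (na + nb - 2)) ** 0.5
    cohens_d = (mean(scores_a) - mean(scores_b)) / pooled_sd
    # 95% confidence interval on Prompt A's mean, via the t distribution
    ci_a = stats.t.interval(0.95, na - 1, loc=mean(scores_a),
                            scale=stats.sem(scores_a))
    improvement = (mean(scores_a) - mean(scores_b)) / mean(scores_b) * 100
    return {
        "p_value": p_value,
        "significant": p_value < alpha,
        "cohens_d": cohens_d,
        "ci_a": ci_a,
        "improvement_pct": improvement,
    }
```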
- Multi-Provider Support:
  - Anthropic: `claude-sonnet-4-20250514` (default)
  - OpenAI: `gpt-4o` (default)
  - OpenRouter: unified API for Claude, GPT, Mistral, Llama, and more
  - Provider and model overridable via `--provider` and `--model` flags
- Reporting:
  - Jinja2 + inline Chart.js — zero external dependencies at view time
  - Quality score bar charts and response time comparisons
  - Expandable accordion for per-test-case response details
  - ROI projection table at 10K, 100K, and 1M request scale
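The "self-contained at view time" property comes from emitting one HTML string with everything inlined. A stripped-down sketch of the metrics-table portion using stdlib escaping in place of the Jinja2 template (`report_builder.py` presumably renders a full Jinja2 template instead; this function and its signature are illustrative):

```python
import html

def build_metrics_table(metrics: list) -> str:
    """Render a self-contained HTML page with a Prompt A vs Prompt B metrics table.

    metrics: list of (name, value_a, value_b) tuples.
    """
    rows = "".join(
        f"<tr><td>{html.escape(name)}</td><td>{a}</td><td>{b}</td></tr>"
        for name, a, b in metrics
    )
    return (
        "<!DOCTYPE html><html><body>"
        "<table><tr><th>Metric</th><th>Prompt A</th><th>Prompt B</th></tr>"
        f"{rows}</table></body></html>"
    )
```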
Results
- Statistical Confidence: Up to 97.45% confidence in winner identification on 20-case datasets
- Quality Signal: LLM-as-judge scoring surfaced an 8.06% quality improvement that human review would miss at scale
- Cost Visibility: Per-response cost delta projected to dollar-level savings at production scale
- Report Portability: Self-contained HTML — shareable over Slack, email, or committed to repos
- Provider Flexibility: Re-run the same test against Claude, GPT-4o, or OpenRouter without code changes
Example Terminal Output
```
📊 Test Results Summary
┌─────────────────────┬───────────┬───────────┐
│ Metric              │ Prompt A  │ Prompt B  │
├─────────────────────┼───────────┼───────────┤
│ Quality Score       │ 8.45      │ 7.82      │
│ Response Time (s)   │ 1.234     │ 1.456     │
│ Tokens/Response     │ 245       │ 312       │
│ Cost/Response ($)   │ 0.0042    │ 0.0051    │
└─────────────────────┴───────────┴───────────┘

🏆 Winner: Prompt A
   Confidence: 97.45%
   p-value: 0.0255
   Effect Size (Cohen's d): 0.68 (medium)
   Quality Improvement: 8.06%
   Cost Savings at 1M requests: $900
```
Common Test Scenarios & Results
Tone Test (Professional vs Conversational):
Winner: Professional | Confidence: 94.2% | Δ Quality: +6.3%
Length Test (Concise vs Detailed):
Winner: Detailed | Confidence: 89.7% | Δ Quality: +4.1%
Structure Test (Freeform vs Formatted):
Winner: Formatted | Confidence: 96.1% | Δ Quality: +9.4%
Empathy Test (Empathetic vs Direct):
Winner: Empathetic | Confidence: 91.3% | Δ Quality: +5.7%
Best Practices & Lessons Learned
- Test one variable at a time — changing tone, length, and structure together makes it impossible to know what drove the result
- LLM-as-judge beats human review at scale — Claude applies the same rubric to every response, eliminating rater fatigue and bias
- 20 test cases is the minimum — below this, p-values become unreliable and Cohen’s d loses meaning
- Always include `{input}` — the tool rejects configs without it, enforcing variable injection discipline
- Test across providers before committing — a winning prompt on Claude may underperform on GPT-4o
- Lead with ROI when talking to stakeholders — "8% quality improvement" is abstract; "$900 saved per million requests" gets budget approved
- Use custom datasets for production decisions — built-ins are great for exploration, not for shipping
- Check Cohen’s d alongside p-value — a significant result with d < 0.2 may not be worth a prompt migration
- Re-run after model updates — prompt rankings can shift when providers release new versions
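The "check Cohen's d alongside p-value" guidance maps onto the standard thresholds listed in the Technical Details (0.2 / 0.5 / 0.8). A small helper sketch for labeling effect sizes (the function name is illustrative, not from the repository):

```python
def effect_size_label(d: float) -> str:
    """Classify a Cohen's d value using the conventional 0.2/0.5/0.8 cutoffs."""
    d = abs(d)  # direction doesn't matter for magnitude
    if d < 0.2:
        return "negligible"
    if d < 0.5:
        return "small"
    if d < 0.8:
        return "medium"
    return "large"
```

A statistically significant result labeled "negligible" here is exactly the case where a prompt migration may not be worth the effort.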
Next Steps
- Add A/B/n testing — rank three or more prompts in a single run
- Implement Bayesian A/B testing as an alternative for early stopping
- Build GitHub Actions integration to gate prompt library PRs on test results
- Add multi-turn conversation testing — currently single-turn only
- Create a prompt diff viewer in the HTML report
- Add Weights & Biases integration for experiment history tracking
- Support vision and multimodal prompt inputs (image + text)
- Build a prompt leaderboard backed by a persistent SQLite database
- Add confidence interval visualization in terminal output via Rich