
Prompt A/B Testing Tool

Scientific A/B testing for AI prompts — compare prompt variants with statistical rigor, LLM-as-judge scoring, and beautiful self-contained HTML reports


Problem Statement

We asked NEO to build a scientific A/B testing framework for AI prompts.


Solution Overview

NEO built a comprehensive prompt evaluation framework that brings statistical rigor to prompt engineering:

  1. Multi-Provider Test Engine runs both variants concurrently against Anthropic, OpenAI, and OpenRouter across built-in or custom datasets
  2. LLM-as-Judge Evaluator uses Claude to score each response 1–10 with structured per-response reasoning
  3. Statistical Analysis Pipeline computes t-tests, p-values, Cohen’s d, and 95% confidence intervals to determine winners with quantified confidence
  4. Interactive Report Builder generates self-contained HTML reports with Chart.js visualizations, side-by-side comparisons, and cost savings projections

In the example run reported under Results, the system identified the winning prompt with 97.45% statistical confidence (p = 0.0255), turning gut-feel prompt decisions into data-driven engineering.
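
The statistical step can be illustrated with a minimal, standard-library-only sketch. It assumes the equal-variance independent-samples t-test named in the pipeline; the function names (`pooled_sd`, `cohens_d`, `t_statistic`, `two_sided_p`) are hypothetical and are not taken from the repository's stats_calculator.py. The p-value is obtained here by numerically integrating the Student t density, which is a stand-in for a library routine, and the scores are made up for illustration. The 95% confidence interval is omitted because it needs a t critical value.

```python
import math
from statistics import mean, variance


def pooled_sd(a, b):
    """Pooled standard deviation of two independent samples."""
    na, nb = len(a), len(b)
    return math.sqrt(((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2))


def cohens_d(a, b):
    """Effect size: mean difference in units of the pooled standard deviation."""
    return (mean(a) - mean(b)) / pooled_sd(a, b)


def t_statistic(a, b):
    """Equal-variance independent-samples t statistic and its degrees of freedom."""
    na, nb = len(a), len(b)
    se = pooled_sd(a, b) * math.sqrt(1 / na + 1 / nb)
    return (mean(a) - mean(b)) / se, na + nb - 2


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = math.lgamma((df + 1) / 2) - math.lgamma(df / 2) - 0.5 * math.log(df * math.pi)
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p(t, df, steps=2000):
    """Two-sided p-value via Simpson's rule on the t density over [0, |t|]."""
    upper = abs(t)
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return max(0.0, 1.0 - 2.0 * area)


scores_a = [9, 8, 9, 8, 9, 8]  # made-up judge scores, for illustration only
scores_b = [8, 7, 8, 7, 8, 8]
t, df = t_statistic(scores_a, scores_b)
p = two_sided_p(t, df)
print(f"t={t:.3f} df={df} p={p:.4f} d={cohens_d(scores_a, scores_b):.2f}")
print("significant at 0.05" if p < 0.05 else "not significant at 0.05")
```

A variant would be declared the winner only when `p` falls below the 0.05 threshold; Cohen's d then indicates how large the difference is, independent of sample size.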


Workflow / Pipeline

Step | Description
1. Prompt & Dataset Input | Accept two prompt variants (each with an {input} placeholder) and a dataset: built-in (customer_support, code_tasks, creative_prompts) or custom JSON
2. Provider Selection | Route requests to Anthropic (Claude Sonnet 4), OpenAI (GPT-4o), or OpenRouter; configurable via CLI flag or interactive mode
3. Parallel Response Generation | Both variants run against every dataset item, capturing response text, latency, token counts, and estimated cost per call
4. LLM-as-Judge Scoring | Claude evaluates each response 1–10 with structured reasoning, giving consistent, bias-controlled evaluation across all test cases
5. Statistical Analysis | Independent samples t-test, p-value against 0.05 threshold, Cohen’s d effect size, and 95% confidence intervals via stats_calculator.py
6. Winner Declaration | Statistical significance check declares the winner; confidence level, p-value, and quality improvement percentage are surfaced in terminal and report
7. ROI Analysis | Cost-per-response delta projected to scale (10K, 100K, 1M requests) to quantify the business case for switching prompts
8. HTML Report Generation | Self-contained report with Chart.js visualizations, expandable test case details, metrics comparison table, and PDF/Markdown export
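
Steps 1 and 3 can be sketched as follows. This is an illustration under stated assumptions, not the repository's code: `fake_provider` is a hypothetical stand-in for a real Anthropic, OpenAI, or OpenRouter call, and `render`, `run_variant`, and the record fields are invented names. The sketch runs sequentially rather than in parallel, and counting tokens by whitespace split is a crude placeholder for a real tokenizer.

```python
import time


def render(template: str, item: str) -> str:
    """Substitute one dataset item for the {input} placeholder."""
    # str.replace (not str.format) so other braces in the prompt are left alone
    return template.replace("{input}", item)


def fake_provider(prompt: str) -> str:
    """Hypothetical stand-in for a model API call; simply echoes the prompt."""
    return f"echo: {prompt}"


def run_variant(template, dataset, provider=fake_provider):
    """Run one prompt variant over every dataset item and record basic metrics."""
    records = []
    for item in dataset:
        prompt = render(template, item)
        start = time.perf_counter()
        response = provider(prompt)
        latency = time.perf_counter() - start
        records.append({
            "input": item,
            "response": response,
            "latency_s": latency,
            "tokens": len(response.split()),  # placeholder, not a real token count
        })
    return records


dataset = ["My order arrived damaged.", "How do I reset my password?"]
results = {
    "A": run_variant("Reply professionally to: {input}", dataset),
    "B": run_variant("Reply casually to: {input}", dataset),
}
for name, rows in results.items():
    print(name, len(rows), "responses")
```

Because both variants see exactly the same dataset items, the per-variant records can be handed to the judge and then to the statistical step without further alignment.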

Repository & Artifacts

Dakshjain1604/Prompt-AB-testing-Tool (View on GitHub)

Generated Artifacts:


Technical Details


Results

Example Terminal Output

📊 Test Results Summary
┌─────────────────────┬───────────┬───────────┐
│ Metric              │ Prompt A  │ Prompt B  │
├─────────────────────┼───────────┼───────────┤
│ Quality Score       │ 8.45      │ 7.82      │
│ Response Time (s)   │ 1.234     │ 1.456     │
│ Tokens/Response     │ 245       │ 312       │
│ Cost/Response ($)   │ 0.0042    │ 0.0051    │
└─────────────────────┴───────────┴───────────┘

🏆 Winner: Prompt A
Confidence: 97.45%
p-value: 0.0255
Effect Size (Cohen's d): 0.68 (medium)
Quality Improvement: 8.06%
Cost Savings at 1M requests: $900
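
The derived figures in this output can be checked against the summary table. The sketch below assumes that quality improvement is the winner's relative score gain over the loser, that savings are the per-response cost difference multiplied by request volume, and that confidence is reported as (1 - p) × 100. These formulas are inferred from the numbers shown, not confirmed from the source code.

```python
from decimal import Decimal

# Values copied from the example terminal output
quality_a, quality_b = Decimal("8.45"), Decimal("7.82")
cost_a, cost_b = Decimal("0.0042"), Decimal("0.0051")
p_value = Decimal("0.0255")

# Assumed formulas (inferred from the reported figures)
improvement_pct = (quality_a - quality_b) / quality_b * 100
confidence_pct = (1 - p_value) * 100

print(f"Quality improvement: {improvement_pct:.2f}%")
print(f"Confidence: {confidence_pct:.2f}%")
for volume in (10_000, 100_000, 1_000_000):
    savings = (cost_b - cost_a) * volume
    print(f"Savings at {volume:,} requests: ${savings:,.0f}")
```

Under these assumptions the script reproduces the 8.06%, 97.45%, and $900 figures reported in the output.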

Common Test Scenarios & Results

Tone Test (Professional vs Conversational):
  Winner: Professional  |  Confidence: 94.2%  |  Δ Quality: +6.3%

Length Test (Concise vs Detailed):
  Winner: Detailed      |  Confidence: 89.7%  |  Δ Quality: +4.1%

Structure Test (Freeform vs Formatted):
  Winner: Formatted     |  Confidence: 96.1%  |  Δ Quality: +9.4%

Empathy Test (Empathetic vs Direct):
  Winner: Empathetic    |  Confidence: 91.3%  |  Δ Quality: +5.7%

Best Practices & Lessons Learned


Next Steps


References

View source on GitHub


Learn More