
Prompt A/B Testing Tool

Scientific A/B testing for AI prompts — compare prompt variants with statistical rigor, LLM-as-judge scoring, and beautiful self-contained HTML reports


Problem Statement

We asked NEO to build a scientific A/B testing framework for AI prompts, with statistical analysis, LLM-as-judge evaluation, and self-contained HTML reporting.


Solution Overview

NEO built a comprehensive prompt evaluation framework that brings statistical rigor to prompt engineering:

  1. Multi-Provider Test Engine runs both variants concurrently against Anthropic, OpenAI, and OpenRouter across built-in or custom datasets
  2. LLM-as-Judge Evaluator uses Claude to score each response 1–10 with structured per-response reasoning
  3. Statistical Analysis Pipeline computes t-tests, p-values, Cohen’s d, and 95% confidence intervals to determine winners with quantified confidence
  4. Interactive Report Builder generates self-contained HTML reports with Chart.js visualizations, side-by-side comparisons, and cost savings projections

The system identifies winning prompts with statistical confidence as high as 97% — turning gut-feel prompt decisions into data-driven engineering.
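The statistical core of step 3 can be sketched in a few lines of Python. This is an illustrative implementation of the independent-samples t-test and Cohen's d the pipeline describes — not the tool's actual stats_calculator.py — and the sample judge scores are made up:

```python
import math
import statistics

def cohens_d(a, b):
    """Pooled-standard-deviation effect size for two independent samples."""
    na, nb = len(a), len(b)
    pooled_sd = math.sqrt(((na - 1) * statistics.variance(a)
                           + (nb - 1) * statistics.variance(b)) / (na + nb - 2))
    return (statistics.mean(a) - statistics.mean(b)) / pooled_sd

def t_statistic(a, b):
    """Independent-samples (pooled-variance) t statistic."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    se = math.sqrt(pooled_var * (1 / na + 1 / nb))
    return (statistics.mean(a) - statistics.mean(b)) / se

# Made-up 1-10 judge scores for two prompt variants
scores_a = [9, 8, 8.5, 9, 8, 8.5, 9, 8.5]
scores_b = [8, 7.5, 8, 7, 8, 7.5, 8, 8]

print(round(cohens_d(scores_a, scores_b), 2))    # effect size
print(round(t_statistic(scores_a, scores_b), 2))  # t statistic
```

In practice the p-value would come from the t distribution with n_a + n_b − 2 degrees of freedom (e.g. via scipy.stats.ttest_ind); the point here is only that the winner call reduces to a handful of textbook formulas.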


Workflow / Pipeline

  1. Prompt & Dataset Input: Accept two prompt variants (with an {input} placeholder) and a dataset — built-in (customer_support, code_tasks, creative_prompts) or custom JSON
  2. Provider Selection: Route requests to Anthropic (Claude Sonnet 4), OpenAI (GPT-4o), or OpenRouter — configurable via CLI flag or interactive mode
  3. Parallel Response Generation: Both variants run against every dataset item, capturing response text, latency, token counts, and estimated cost per call
  4. LLM-as-Judge Scoring: Claude evaluates each response on a 1–10 scale with structured reasoning — consistent, bias-controlled evaluation across all test cases
  5. Statistical Analysis: Independent-samples t-test, p-value against a 0.05 threshold, Cohen's d effect size, and 95% confidence intervals via stats_calculator.py
  6. Winner Declaration: A statistical significance check declares the winner — confidence level, p-value, and quality improvement percentage are surfaced in the terminal and the report
  7. ROI Analysis: The cost-per-response delta is projected to scale (10K, 100K, 1M requests), quantifying the business case for switching prompts
  8. HTML Report Generation: Self-contained report with Chart.js visualizations, expandable test case details, a metrics comparison table, and PDF/Markdown export
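Step 4's scoring depends on reliably extracting the judge's verdict from model output. Below is a minimal, hypothetical sketch of that parsing step — the JSON schema ({"score": ..., "reasoning": ...}) and the function name are illustrative assumptions, not the tool's actual format:

```python
import json
import re

def parse_judge_reply(reply: str):
    """Pull a structured score/reasoning object out of the judge model's
    reply, tolerating any prose wrapped around the JSON."""
    match = re.search(r"\{.*\}", reply, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in judge reply")
    data = json.loads(match.group())
    score = float(data["score"])
    if not 1 <= score <= 10:
        raise ValueError(f"score out of 1-10 range: {score}")
    return score, data.get("reasoning", "")

score, reasoning = parse_judge_reply(
    'Sure, here is my rating: {"score": 8.5, "reasoning": "Clear and accurate."}'
)
print(score)  # 8.5
```

Validating the score range and failing loudly on malformed replies matters here: a silently dropped or mis-parsed score would bias the downstream t-test.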

Repository & Artifacts

Dakshjain1604/Prompt-AB-testing-Tool (View on GitHub)

Generated Artifacts:


Technical Details


Results

Example Terminal Output

📊 Test Results Summary
┌─────────────────────┬───────────┬───────────┐
│ Metric              │ Prompt A  │ Prompt B  │
├─────────────────────┼───────────┼───────────┤
│ Quality Score       │ 8.45      │ 7.82      │
│ Response Time (s)   │ 1.234     │ 1.456     │
│ Tokens/Response     │ 245       │ 312       │
│ Cost/Response ($)   │ 0.0042    │ 0.0051    │
└─────────────────────┴───────────┴───────────┘

🏆 Winner: Prompt A
   Confidence: 97.45%
   p-value: 0.0255
   Effect Size (Cohen's d): 0.68 (medium)
   Quality Improvement: 8.06%
   Cost Savings at 1M requests: $900
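The summary numbers above are internally consistent, assuming confidence is reported as (1 − p) × 100, quality improvement is measured relative to the losing variant's mean, and savings are the per-response cost delta times request volume (the ROI projection of step 7):

```python
# Figures from the example terminal output above
p_value = 0.0255
quality_a, quality_b = 8.45, 7.82
cost_a, cost_b = 0.0042, 0.0051

confidence = (1 - p_value) * 100                         # -> 97.45
improvement = (quality_a - quality_b) / quality_b * 100  # -> ~8.06
savings_at_1m = (cost_b - cost_a) * 1_000_000            # -> ~900

print(f"{confidence:.2f}% | {improvement:.2f}% | ${savings_at_1m:.0f}")
```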

Common Test Scenarios & Results

Tone Test (Professional vs Conversational): Winner: Professional | Confidence: 94.2% | Δ Quality: +6.3%
Length Test (Concise vs Detailed): Winner: Detailed | Confidence: 89.7% | Δ Quality: +4.1%
Structure Test (Freeform vs Formatted): Winner: Formatted | Confidence: 96.1% | Δ Quality: +9.4%
Empathy Test (Empathetic vs Direct): Winner: Empathetic | Confidence: 91.3% | Δ Quality: +5.7%

Best Practices & Lessons Learned


Next Steps


References

View source on GitHub
