LLM Consistency Monitor
Stress-test LLM reliability by generating semantically identical paraphrases, running them concurrently, and surfacing contradictions before they reach your users
Problem Statement
We Asked NEO to: Build a production-grade Python CLI tool that detects inconsistencies in LLM responses by generating semantically identical paraphrases and analyzing response patterns, implement concurrent testing across multiple LLM providers, use sentence-transformer embeddings with DBSCAN clustering to group and compare responses, and generate interactive HTML reports with consistency scores, heatmaps, and actionable prompt engineering recommendations.
Solution Overview
NEO built a comprehensive LLM consistency testing framework that quantifies prompt brittleness before it reaches production:
- Paraphrase Generation Engine uses Claude API to create 20 syntactically diverse variations across 4 style categories (formal, casual, short-form, statement-form)
- Concurrent Testing Layer queries any target LLM provider in parallel using asyncio, cutting wall-clock time from minutes to under 60 seconds
- Semantic Analysis Pipeline computes embeddings, runs DBSCAN clustering, extracts key facts per cluster, and detects cross-cluster contradictions
- Interactive Report Builder generates self-contained HTML reports with similarity heatmaps, cluster distribution charts, and ranked recommendations
The system achieves 87%+ consistency scores on well-engineered prompts and clearly flags poor prompts (below 60%) with specific, actionable fixes.
Workflow / Pipeline
| Step | Description |
|---|---|
| 1. Paraphrase Generation | Claude API generates 20 syntactically diverse variants — 5 formal, 5 casual, 5 short-form, 5 statement-form — representing how real users phrase the same question |
| 2. Concurrent LLM Testing | All 20 paraphrases hit the target model in parallel via asyncio, reducing test time from 3–5 minutes to under 60 seconds |
| 3. Semantic Embedding | sentence-transformers (all-MiniLM-L6-v2) encodes each response into a 384-dimensional vector capturing meaning, not just surface wording |
| 4. DBSCAN Clustering | Cosine similarity between embedding vectors grouped via DBSCAN — responses in the same cluster agree semantically, separate clusters signal divergence |
| 5. Fact Extraction | Claude API extracts key factual claims from each cluster’s representative response, turning fuzzy similarity into concrete, comparable statements |
| 6. Contradiction Detection | Cross-cluster fact comparison flags direct conflicts and computes how often the same question yields contradictory answers |
| 7. Consistency Scoring | A 0–100 score computed from cluster count, contradiction frequency, and semantic spread — a single number showing how reliable the prompt is |
| 8. HTML Report Generation | Fully self-contained interactive report saved to results/ — includes heatmap, pie chart, latency bars, response gallery, and specific recommendations |
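The concurrent testing step (step 2) is the main speedup in the pipeline. A minimal sketch of the idea, with a hypothetical `query_model` coroutine standing in for the real provider call:

```python
import asyncio
import time

async def query_model(paraphrase: str) -> dict:
    """Hypothetical stand-in for one LLM API call; real code would await an HTTP client."""
    start = time.perf_counter()
    await asyncio.sleep(0.05)  # simulate network latency
    return {"prompt": paraphrase,
            "response": f"echo: {paraphrase}",
            "latency": time.perf_counter() - start}

async def run_concurrent(paraphrases: list[str]) -> list[dict]:
    # asyncio.gather fires all requests at once, so wall-clock time is
    # roughly the slowest single call, not the sum of all calls.
    return await asyncio.gather(*(query_model(p) for p in paraphrases))

if __name__ == "__main__":
    variants = [f"variant {i}" for i in range(20)]
    start = time.perf_counter()
    results = asyncio.run(run_concurrent(variants))
    print(f"{len(results)} responses in {time.perf_counter() - start:.2f}s")
```

With real API latencies of a few seconds per call, the same `gather` pattern is what turns a 3–5 minute sequential run into one bounded by the slowest single request.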
Repository & Artifacts
Generated Artifacts:
- Paraphrase generation engine with 4-category variation strategy (Claude API)
- 5-mode stress testing suite — adversarial, Socratic, emotional, ambiguous, jargon-heavy
- Async concurrent LLM tester with multi-provider support (Claude, GPT-4, HuggingFace, Custom)
- Semantic embedding pipeline using sentence-transformers (all-MiniLM-L6-v2)
- DBSCAN clustering engine with configurable eps and min_samples
- Claude-powered fact extraction and cross-cluster contradiction detector
- 0–100 consistency scoring system with severity bands (Good / Medium / Poor)
- Interactive self-contained HTML report with heatmap, charts, and recommendations
- Batch testing mode for validating multiple prompts in one run
- 10 production-ready benchmark questions across 5 categories
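The 0–100 scoring system with severity bands can be sketched as follows. The exact weighting formula is not specified in this writeup, so the penalties below are illustrative assumptions; the band cutoffs are inferred from the report output (POOR below 60%, GOOD from the high 70s up):

```python
def consistency_score(n_clusters: int, contradiction_rate: float,
                      mean_pairwise_similarity: float) -> float:
    """Hypothetical 0-100 score combining cluster count, contradiction
    frequency, and semantic spread (weights are assumptions, not the
    project's actual formula)."""
    cluster_penalty = 15.0 * max(0, n_clusters - 1)      # extra clusters = divergence
    contradiction_penalty = 50.0 * contradiction_rate    # conflicts weigh heaviest
    spread_bonus = 20.0 * mean_pairwise_similarity       # tight embeddings help
    score = 80.0 - cluster_penalty - contradiction_penalty + spread_bonus
    return max(0.0, min(100.0, score))

def severity_band(score: float) -> str:
    # Cutoffs inferred from the batch table: 58% is POOR, 76% MEDIUM, 79% GOOD.
    if score < 60:
        return "POOR"
    if score < 78:
        return "MEDIUM"
    return "GOOD"
```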
Technical Details
- Paraphrase Generation:
- Claude API creates 20 variants per question across 4 style registers
- Covers how real users phrase things: terse, verbose, formal, conversational
- 5 specialized stress-test modes via `src/prompt_engineer.py`: Adversarial (jailbreak framing), Socratic (assumption-challenging), Emotional/Urgent, Ambiguous, and Technical/Jargon-Heavy

- Semantic Analysis:
- `sentence-transformers` (all-MiniLM-L6-v2) for 384-dim response embeddings
- DBSCAN clustering with configurable `eps` (default 0.3) and `min_samples` (default 2)
- Cosine distance metric for semantics-aware grouping
- Noise point detection flags outlier responses for manual review
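The clustering step maps directly onto scikit-learn's `DBSCAN` with a cosine metric. A sketch using toy 4-dimensional unit vectors in place of the real 384-dimensional sentence-transformer embeddings:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy 4-dim vectors standing in for real all-MiniLM-L6-v2 embeddings;
# the first two pairs represent semantically equivalent responses,
# the last row a divergent outlier.
embeddings = np.array([
    [1.0, 0.0, 0.0, 0.0],   # "reset via email link"
    [0.98, 0.2, 0.0, 0.0],  # near-duplicate phrasing
    [0.0, 1.0, 0.0, 0.0],   # "mentions 2FA recovery"
    [0.1, 0.99, 0.0, 0.0],  # near-duplicate phrasing
    [0.5, 0.5, 0.7, 0.0],   # outlier response
])
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

# With metric="cosine", eps=0.3 groups responses whose cosine similarity
# exceeds ~0.7; min_samples=2 lets pairs form clusters, and label -1
# marks noise points flagged for manual review.
labels = DBSCAN(eps=0.3, min_samples=2, metric="cosine").fit_predict(embeddings)
```

Here the first two rows share one cluster label, the next two another, and the outlier comes back as `-1` noise.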
- Contradiction Detection:
- Claude API extracts structured key claims from each cluster representative
- Cross-cluster comparison identifies direct factual conflicts
- Contradiction rate feeds directly into the final consistency score
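Once Claude has extracted structured claims per cluster, the cross-cluster comparison itself is straightforward. A sketch with hard-coded facts (in the real pipeline these come from the fact-extraction call; the topic keys and claim strings here are illustrative assumptions):

```python
# Hypothetical extracted facts, keyed by topic, one dict per cluster representative.
cluster_facts = {
    "cluster_a": {"cancellation_window": "cancel anytime", "refund": "full refund"},
    "cluster_b": {"cancellation_window": "30-day notice required", "refund": "full refund"},
}

def find_contradictions(facts_by_cluster: dict[str, dict[str, str]]) -> list[tuple]:
    """Flag topics where two clusters make different claims about the same thing."""
    conflicts = []
    clusters = list(facts_by_cluster)
    for i, a in enumerate(clusters):
        for b in clusters[i + 1:]:
            shared_topics = facts_by_cluster[a].keys() & facts_by_cluster[b].keys()
            for topic in shared_topics:
                if facts_by_cluster[a][topic] != facts_by_cluster[b][topic]:
                    conflicts.append((topic,
                                      a, facts_by_cluster[a][topic],
                                      b, facts_by_cluster[b][topic]))
    return conflicts

contradictions = find_contradictions(cluster_facts)
```

This is what turns "the clusters diverge" into the concrete "cancel anytime vs. 30-day notice required" finding quoted in the Results section; real claim matching would need fuzzier comparison than exact string equality.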
- Reporting:
- Jinja2 templating with inline Chart.js — zero external dependencies at view time
- 20×20 cosine similarity heatmap with color gradient
- Cluster distribution pie chart and response latency bar chart
- Expandable response gallery with per-call latency and cost metadata
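The self-contained-report principle is that every chart's data (and, in the real report, the Chart.js source itself) is inlined into one HTML string. A minimal Jinja2 sketch, with the template and field names invented for illustration:

```python
import json
from jinja2 import Template

# All data is serialized into the page itself, so the file opens anywhere
# with no CDN or server; the real report inlines Chart.js the same way.
TEMPLATE = Template("""<!doctype html>
<html><head><title>{{ title }}</title></head>
<body>
<h1>{{ title }}: {{ score }}%</h1>
<script>
const similarityMatrix = {{ matrix_json }};  // feeds the inline heatmap
</script>
</body></html>""")

def render_report(title: str, score: int, matrix: list[list[float]]) -> str:
    return TEMPLATE.render(title=title, score=score,
                           matrix_json=json.dumps(matrix))

html = render_report("Consistency Report", 87, [[1.0, 0.9], [0.9, 1.0]])
```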
- Performance:
- Full pipeline (20 paraphrases + testing + analysis + report): 60–120 seconds
- asyncio concurrency drops sequential 3–5 min runs to under 60s
- ~2GB RAM usage including sentence-transformer model weights
- Less than 50ms overhead per individual analysis step after model load
Results
- Consistency Score: 87% (GOOD) on well-engineered customer support prompts
- Concurrent Speedup: 4–5x faster than sequential testing with asyncio parallelism
- Memory Efficiency: DBSCAN + embeddings pipeline runs in less than 2GB RAM
- Cluster Precision: Correctly groups semantic equivalents (e.g., “cancel account” ≈ “terminate subscription”) that string matching misses
- Contradiction Clarity: Fact extraction surfaces specific conflicts (e.g., “Cluster A: cancel anytime” vs “Cluster B: 30-day notice required”) rather than vague similarity scores
- Report Portability: Self-contained HTML reports shareable over Slack, email, or committed to repos — no server required
Example Consistency Analysis Output
============================================================
CONSISTENCY ANALYSIS RESULTS
============================================================
Question: "How do I reset my password?"
Model: Claude Sonnet 4
✓ CONSISTENCY SCORE: 87% (GOOD)
Analysis Summary:
Response Clusters: 2 groups identified
Contradictions Found: 0
Average Response Time: 1.2s
Total Cost Estimate: $0.08
Cluster Breakdown:
Cluster 1 (16/20 responses): Standard password reset flow via email link
Cluster 2 (4/20 responses): Includes mention of 2FA recovery options
Report Location:
./results/consistency_report_How_do_I_reset_20260221_082150.html
Batch Test Summary (10 Benchmark Questions)
Batch Test Summary:
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━┓
┃ Question ┃ Category ┃ Score ┃ Status ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━┩
│ How do I reset my password? │ support │ 87% │ ✓ GOOD │
│ How do I cancel my subscription? │ support │ 82% │ ✓ GOOD │
│ Where can I download the mobile app?│ support │ 79% │ ✓ GOOD │
│ What's your refund policy? │ product │ 65% │ ⚠️ MEDIUM │
│ Do you offer student discounts? │ product │ 71% │ ⚠️ MEDIUM │
│ Why is the page loading slowly? │ technical│ 58% │ ❌ POOR │
│ How do I export my data? │ technical│ 63% │ ⚠️ MEDIUM │
│ How do I update my payment method? │ billing │ 84% │ ✓ GOOD │
│ When will I be charged? │ billing │ 67% │ ⚠️ MEDIUM │
│ Can I integrate with Slack? │ features │ 76% │ ⚠️ MEDIUM │
└─────────────────────────────────────┴──────────┴───────┴───────────┘
Best Practices & Lessons Learned
- Use semantic embeddings instead of string matching — sentence-transformers correctly groups “cancel account” and “terminate subscription” where regex fails
- Concurrency is non-negotiable — asyncio parallelism cuts 3–5 minute sequential runs to under 60 seconds, the difference between running tests daily vs. never
- Tune DBSCAN `eps` per domain — default 0.3 works well for support prompts; technical domains with narrow vocabulary benefit from 0.2
- Fact extraction is more valuable than similarity alone — cluster similarity tells you responses diverge, Claude’s fact extraction tells you how they diverge
- Run adversarial tests on any prompt handling sensitive data — model behavior often shifts under social engineering framing in ways basic paraphrase testing won’t catch
- Self-contained HTML reports matter for teams — removing CDN dependencies means reports can be shared over Slack, emailed, or committed to repos with zero setup
- Set the consistency score threshold before you look at results — deciding post-hoc whether 70% is “good enough” introduces bias into the evaluation
- Batch mode is essential for prompt libraries — testing 10 prompts takes the same setup effort as testing 1, use it to audit entire FAQ sets at once
Next Steps
- Add multi-language support for testing consistency across localized prompt variants
- Implement PDF report export for teams that need printable or ticketable outputs
- Build real-time streaming progress UI for longer batch test runs
- Add support for additional LLM providers — Cohere, AI21, Mistral, Gemini
- Create custom paraphrase categories via YAML config files without code changes
- Build GitHub Actions integration for automated consistency checks on prompt library PRs
- Add Weights & Biases integration for tracking consistency scores across model versions
- Implement regression alerts when a prompt’s score drops between model updates
- Develop a consistency budget system — fail CI if any prompt drops below a defined threshold
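The proposed consistency-budget CI gate could be as simple as the sketch below: the threshold is fixed before any results are seen, and the build fails if any prompt falls under it. The budget value and scores are illustrative:

```python
# Hypothetical CI gate: the budget is agreed up front, not chosen after
# looking at results, so the pass/fail decision stays unbiased.
BUDGET = 75

def check_budget(scores: dict[str, int], budget: int = BUDGET) -> list[str]:
    """Return the prompts that violate the budget; CI fails if any do."""
    return [question for question, score in scores.items() if score < budget]

batch = {"How do I reset my password?": 87,
         "Why is the page loading slowly?": 58}
violations = check_budget(batch)
print("FAIL" if violations else "PASS", violations)
```

A CI wrapper would exit non-zero when `violations` is non-empty, which is enough to block a prompt-library PR.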