
LLM Consistency Monitor

Stress-test LLM reliability by generating semantically identical paraphrases, running them concurrently, and surfacing contradictions before they reach your users.


Problem Statement

We asked NEO to build a production-grade Python CLI tool that detects inconsistencies in LLM responses by generating semantically identical paraphrases and analyzing response patterns. The tool had to run tests concurrently across multiple LLM providers, use sentence-transformer embeddings with DBSCAN clustering to group and compare responses, and generate interactive HTML reports with consistency scores, heatmaps, and actionable prompt engineering recommendations.


Solution Overview

NEO built a comprehensive LLM consistency testing framework that quantifies prompt brittleness before it reaches production:

  1. Paraphrase Generation Engine: uses the Claude API to create 20 syntactically diverse variations across 4 style categories (formal, casual, short-form, statement-form)
  2. Concurrent Testing Layer: queries any target LLM provider in parallel using asyncio, cutting wall-clock time from 3–5 minutes to under 60 seconds
  3. Semantic Analysis Pipeline: computes embeddings, runs DBSCAN clustering, extracts key facts per cluster, and detects cross-cluster contradictions
  4. Interactive Report Builder: generates self-contained HTML reports with similarity heatmaps, cluster distribution charts, and ranked recommendations
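
The concurrent testing layer can be illustrated with a minimal asyncio sketch. It is not the project's code: `query_model` is a hypothetical stand-in that sleeps instead of calling a provider, and the `max_concurrency` cap is an assumption added to show how rate limits might be respected.

```python
import asyncio
import time


async def query_model(prompt: str, delay: float = 0.05) -> dict:
    """Hypothetical stand-in for one provider call; sleeps instead of using the network."""
    start = time.perf_counter()
    await asyncio.sleep(delay)  # simulated provider latency
    return {
        "prompt": prompt,
        "response": f"answer to: {prompt}",
        "latency": time.perf_counter() - start,
    }


async def run_all(prompts: list[str], max_concurrency: int = 10) -> list[dict]:
    """Query every paraphrase concurrently, capped by a semaphore."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(prompt: str) -> dict:
        async with semaphore:
            return await query_model(prompt)

    # gather preserves input order, so results[i] belongs to prompts[i]
    return await asyncio.gather(*(guarded(p) for p in prompts))


paraphrases = [f"paraphrase {i}" for i in range(20)]
started = time.perf_counter()
results = asyncio.run(run_all(paraphrases))
elapsed = time.perf_counter() - started
print(f"{len(results)} responses in {elapsed:.2f}s")
```

With 20 simulated calls of 0.05 s each, a sequential loop would take about 1 s, while the concurrent version finishes in roughly two batches of 10. The same effect, at larger scale, is what takes the real run from minutes to under a minute.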

The system achieves 87%+ consistency scores on well-engineered prompts and clearly flags poor prompts (below 60%) with specific, actionable fixes.


Workflow / Pipeline

| Step | Description |
| --- | --- |
| 1. Paraphrase Generation | The Claude API generates 20 syntactically diverse variants (5 formal, 5 casual, 5 short-form, 5 statement-form) that represent how real users phrase the same question |
| 2. Concurrent LLM Testing | All 20 paraphrases hit the target model in parallel via asyncio, reducing test time from 3–5 minutes to under 60 seconds |
| 3. Semantic Embedding | sentence-transformers (all-MiniLM-L6-v2) encodes each response into a 384-dimensional vector capturing meaning, not just surface wording |
| 4. DBSCAN Clustering | DBSCAN groups the embedding vectors by cosine similarity; responses in the same cluster agree semantically, while separate clusters signal divergence |
| 5. Fact Extraction | The Claude API extracts key factual claims from each cluster's representative response, turning fuzzy similarity into concrete, comparable statements |
| 6. Contradiction Detection | Cross-cluster fact comparison flags direct conflicts and computes how often the same question yields contradictory answers |
| 7. Consistency Scoring | A 0–100 score computed from cluster count, contradiction frequency, and semantic spread: a single number showing how reliable the prompt is |
| 8. HTML Report Generation | Fully self-contained interactive report saved to results/, including a heatmap, pie chart, latency bars, response gallery, and specific recommendations |
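
Steps 3 and 4 can be sketched without any dependencies. The block below is a simplified pure-Python DBSCAN over cosine distance, standing in for a library implementation such as scikit-learn's `DBSCAN` with `metric="cosine"`. The 3-dimensional vectors are toy stand-ins for the 384-dimensional embeddings, and the `eps` and `min_samples` values are assumptions chosen for the toy data, not the project's settings.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def dbscan(vectors, eps=0.3, min_samples=2):
    """Minimal DBSCAN; returns one label per vector, -1 meaning noise."""
    n = len(vectors)
    # Neighbourhoods include the point itself, as in the usual definition.
    neighbours = [
        [j for j in range(n) if cosine_distance(vectors[i], vectors[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n  # None = not visited yet
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point becomes border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # j is a core point: keep expanding
    return labels


# Toy embeddings: three responses that agree, two that agree on something else,
# and one outlier.
embeddings = [
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.0],
    [1.0, 0.0, 0.1],
    [0.0, 1.0, 0.1],
    [0.1, 0.9, 0.0],
    [0.0, 0.0, 1.0],
]
labels = dbscan(embeddings)
n_clusters = len(set(labels) - {-1})
print(labels)  # [0, 0, 0, 1, 1, -1]
```

One reason DBSCAN suits this task is that the number of clusters does not have to be fixed in advance: a consistent prompt yields a single cluster, and each additional cluster or noise point is itself a signal of divergence.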

Repository & Artifacts

Dakshjain1604/LLM-consistency-Monitor (View on GitHub)

Generated Artifacts:


Technical Details


Results

Example Consistency Analysis Output

============================================================
CONSISTENCY ANALYSIS RESULTS
============================================================
Question: "How do I reset my password?"
Model: Claude Sonnet 4

✓ CONSISTENCY SCORE: 87% (GOOD)

Analysis Summary:
  Response Clusters: 2 groups identified
  Contradictions Found: 0
  Average Response Time: 1.2s
  Total Cost Estimate: $0.08

Cluster Breakdown:
  Cluster 1 (16/20 responses): Standard password reset flow via email link
  Cluster 2 (4/20 responses): Includes mention of 2FA recovery options

Report Location: ./results/consistency_report_How_do_I_reset_20260221_082150.html
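
The score is described as a combination of cluster count, contradiction frequency, and semantic spread, but the exact formula is not given here. The sketch below shows one way such a score could be assembled. The `RunStats` fields, the weighted-penalty form, and the weights are all assumptions for illustration, and the function is not expected to reproduce the 87% shown above.

```python
from dataclasses import dataclass


@dataclass
class RunStats:
    n_responses: int           # responses collected, e.g. 20
    n_clusters: int            # semantic clusters found
    contradiction_rate: float  # share of cluster pairs with conflicting facts, 0..1
    semantic_spread: float     # mean pairwise cosine distance, clipped to 0..1


def consistency_score(stats, w_clusters=0.4, w_contra=0.4, w_spread=0.2):
    """Hypothetical weighted penalty model; the weights are illustrative only."""
    # 0 when every response lands in one cluster, 1 when each is its own cluster
    fragmentation = (stats.n_clusters - 1) / max(stats.n_responses - 1, 1)
    penalty = (
        w_clusters * fragmentation
        + w_contra * stats.contradiction_rate
        + w_spread * stats.semantic_spread
    )
    # Clamp so the result always stays inside the 0-100 range
    return round(100 * (1 - min(max(penalty, 0.0), 1.0)), 1)


perfect = consistency_score(RunStats(20, 1, 0.0, 0.0))
worst = consistency_score(RunStats(20, 20, 1.0, 1.0))
two_clusters = consistency_score(RunStats(20, 2, 0.0, 0.15))
print(perfect, worst, two_clusters)
```

Whatever the real weighting is, a formula of this shape has a useful property: each extra cluster, contradiction, or increase in spread can only lower the score, so the number is easy to interpret as a regression signal between prompt revisions.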

Batch Test Summary (10 Benchmark Questions)

Batch Test Summary:
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━┓
┃ Question                            ┃ Category ┃ Score ┃ Status    ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━┩
│ How do I reset my password?         │ support  │ 87%   │ ✓ GOOD    │
│ How do I cancel my subscription?    │ support  │ 82%   │ ✓ GOOD    │
│ Where can I download the mobile app?│ support  │ 79%   │ ✓ GOOD    │
│ What's your refund policy?          │ product  │ 65%   │ ⚠️ MEDIUM │
│ Do you offer student discounts?     │ product  │ 71%   │ ⚠️ MEDIUM │
│ Why is the page loading slowly?     │ technical│ 58%   │ ❌ POOR   │
│ How do I export my data?            │ technical│ 63%   │ ⚠️ MEDIUM │
│ How do I update my payment method?  │ billing  │ 84%   │ ✓ GOOD    │
│ When will I be charged?             │ billing  │ 67%   │ ⚠️ MEDIUM │
│ Can I integrate with Slack?         │ features │ 76%   │ ⚠️ MEDIUM │
└─────────────────────────────────────┴──────────┴───────┴───────────┘
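
The status column can be reproduced with a small banding function. The POOR cutoff of 60 comes from the overview. The GOOD cutoff is not stated anywhere, and the table only implies that it lies above 76 and at or below 79, so the value 78 used here is an assumption. The function and variable names are likewise hypothetical.

```python
from collections import Counter


def status_for(score, good_cutoff=78, poor_cutoff=60):
    """Map a 0-100 score to a status label; good_cutoff is an assumed value."""
    if score < poor_cutoff:
        return "POOR"
    if score >= good_cutoff:
        return "GOOD"
    return "MEDIUM"


# Scores copied from the batch table above
batch = {
    "How do I reset my password?": 87,
    "How do I cancel my subscription?": 82,
    "Where can I download the mobile app?": 79,
    "What's your refund policy?": 65,
    "Do you offer student discounts?": 71,
    "Why is the page loading slowly?": 58,
    "How do I export my data?": 63,
    "How do I update my payment method?": 84,
    "When will I be charged?": 67,
    "Can I integrate with Slack?": 76,
}

counts = Counter(status_for(score) for score in batch.values())
mean_score = sum(batch.values()) / len(batch)
print(dict(counts))  # {'GOOD': 4, 'MEDIUM': 5, 'POOR': 1}
print(mean_score)    # 73.2
```

Aggregating this way makes the weak spot visible at a glance: only one benchmark question falls in the POOR band, but half of them sit in MEDIUM, which is where prompt revisions are likely to pay off first.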

Best Practices & Lessons Learned


Next Steps


References

View source on GitHub

