LLM Consistency Monitor
Stress-test LLM reliability by generating semantically identical paraphrases, running them concurrently, and surfacing contradictions before they reach your users
Problem Statement
We Asked NEO to: Build a production-grade Python CLI tool that detects inconsistencies in LLM responses by generating semantically identical paraphrases and analyzing response patterns, implement concurrent testing across multiple LLM providers, use sentence-transformer embeddings with DBSCAN clustering to group and compare responses, and generate interactive HTML reports with consistency scores, heatmaps, and actionable prompt engineering recommendations.
Solution Overview
NEO built a comprehensive LLM consistency testing framework that quantifies prompt brittleness before it reaches production:
- Paraphrase Generation Engine uses Claude API to create 20 syntactically diverse variations across 4 style categories (formal, casual, short-form, statement-form)
- Concurrent Testing Layer queries any target LLM provider in parallel using asyncio, cutting wall-clock time from minutes to under 60 seconds
- Semantic Analysis Pipeline computes embeddings, runs DBSCAN clustering, extracts key facts per cluster, and detects cross-cluster contradictions
- Interactive Report Builder generates self-contained HTML reports with similarity heatmaps, cluster distribution charts, and ranked recommendations
The system achieves 87%+ consistency scores on well-engineered prompts and clearly flags poor prompts (below 60%) with specific, actionable fixes.
Workflow / Pipeline
| Step | Description |
|---|---|
| 1. Paraphrase Generation | Claude API generates 20 syntactically diverse variants — 5 formal, 5 casual, 5 short-form, 5 statement-form — representing how real users phrase the same question |
| 2. Concurrent LLM Testing | All 20 paraphrases hit the target model in parallel via asyncio, reducing test time from 3–5 minutes to under 60 seconds |
| 3. Semantic Embedding | sentence-transformers (all-MiniLM-L6-v2) encodes each response into a 384-dimensional vector capturing meaning, not just surface wording |
| 4. DBSCAN Clustering | Cosine similarity between embedding vectors grouped via DBSCAN — responses in the same cluster agree semantically, separate clusters signal divergence |
| 5. Fact Extraction | Claude API extracts key factual claims from each cluster’s representative response, turning fuzzy similarity into concrete, comparable statements |
| 6. Contradiction Detection | Cross-cluster fact comparison flags direct conflicts and computes how often the same question yields contradictory answers |
| 7. Consistency Scoring | A 0–100 score computed from cluster count, contradiction frequency, and semantic spread — a single number showing how reliable the prompt is |
| 8. HTML Report Generation | Fully self-contained interactive report saved to results/ — includes heatmap, pie chart, latency bars, response gallery, and specific recommendations |
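The concurrent testing step (step 2) is the main speedup in the pipeline. A minimal sketch of the idea, with a hypothetical `query_model` coroutine standing in for the real provider call:

```python
import asyncio
import time

async def query_model(paraphrase: str) -> dict:
    """Hypothetical stand-in for one LLM API call; real code would await an HTTP client."""
    start = time.perf_counter()
    await asyncio.sleep(0.05)  # simulate network latency
    return {"prompt": paraphrase,
            "response": f"echo: {paraphrase}",
            "latency": time.perf_counter() - start}

async def run_concurrent(paraphrases: list[str]) -> list[dict]:
    # asyncio.gather fires all requests at once, so wall-clock time is
    # roughly the slowest single call, not the sum of all calls.
    return await asyncio.gather(*(query_model(p) for p in paraphrases))

if __name__ == "__main__":
    variants = [f"variant {i}" for i in range(20)]
    start = time.perf_counter()
    results = asyncio.run(run_concurrent(variants))
    print(f"{len(results)} responses in {time.perf_counter() - start:.2f}s")
```

With real API latencies of a few seconds per call, the same `gather` pattern is what turns a 3–5 minute sequential run into one bounded by the slowest single request.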
Repository & Artifacts
Generated Artifacts:
- Paraphrase generation engine with 4-category variation strategy (Claude API)
- 5-mode stress testing suite — adversarial, Socratic, emotional, ambiguous, jargon-heavy
- Async concurrent LLM tester with multi-provider support (Claude, GPT-4, HuggingFace, Custom)
- Semantic embedding pipeline using sentence-transformers (all-MiniLM-L6-v2)
- DBSCAN clustering engine with configurable eps and min_samples
- Claude-powered fact extraction and cross-cluster contradiction detector
- 0–100 consistency scoring system with severity bands (Good / Medium / Poor)
- Interactive self-contained HTML report with heatmap, charts, and recommendations
- Batch testing mode for validating multiple prompts in one run
- 10 production-ready benchmark questions across 5 categories
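The 0–100 scoring system with severity bands can be sketched as follows. The exact weighting formula is not specified in this writeup, so the penalties below are illustrative assumptions; the band cutoffs are inferred from the report output (POOR below 60%, GOOD from the high 70s up):

```python
def consistency_score(n_clusters: int, contradiction_rate: float,
                      mean_pairwise_similarity: float) -> float:
    """Hypothetical 0-100 score combining cluster count, contradiction
    frequency, and semantic spread (weights are assumptions, not the
    project's actual formula)."""
    cluster_penalty = 15.0 * max(0, n_clusters - 1)      # extra clusters = divergence
    contradiction_penalty = 50.0 * contradiction_rate    # conflicts weigh heaviest
    spread_bonus = 20.0 * mean_pairwise_similarity       # tight embeddings help
    score = 80.0 - cluster_penalty - contradiction_penalty + spread_bonus
    return max(0.0, min(100.0, score))

def severity_band(score: float) -> str:
    # Cutoffs inferred from the batch table: 58% is POOR, 76% MEDIUM, 79% GOOD.
    if score < 60:
        return "POOR"
    if score < 78:
        return "MEDIUM"
    return "GOOD"
```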
Technical Details
- Paraphrase Generation:
- Claude API creates 20 variants per question across 4 style registers
- Covers how real users phrase things: terse, verbose, formal, conversational
- 5 specialized stress-test modes via `src/prompt_engineer.py`: Adversarial (jailbreak framing), Socratic (assumption-challenging), Emotional/Urgent, Ambiguous, and Technical/Jargon-Heavy

- Semantic Analysis:
- `sentence-transformers` (all-MiniLM-L6-v2) for 384-dim response embeddings
- DBSCAN clustering with configurable `eps` (default 0.3) and `min_samples` (default 2)
- Cosine distance metric for semantics-aware grouping
- Noise point detection flags outlier responses for manual review
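The clustering step maps directly onto scikit-learn's `DBSCAN` with a cosine metric. A sketch using toy 4-dimensional unit vectors in place of the real 384-dimensional sentence-transformer embeddings:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy 4-dim vectors standing in for real all-MiniLM-L6-v2 embeddings;
# the first two pairs represent semantically equivalent responses,
# the last row a divergent outlier.
embeddings = np.array([
    [1.0, 0.0, 0.0, 0.0],   # "reset via email link"
    [0.98, 0.2, 0.0, 0.0],  # near-duplicate phrasing
    [0.0, 1.0, 0.0, 0.0],   # "mentions 2FA recovery"
    [0.1, 0.99, 0.0, 0.0],  # near-duplicate phrasing
    [0.5, 0.5, 0.7, 0.0],   # outlier response
])
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

# With metric="cosine", eps=0.3 groups responses whose cosine similarity
# exceeds ~0.7; min_samples=2 lets pairs form clusters, and label -1
# marks noise points flagged for manual review.
labels = DBSCAN(eps=0.3, min_samples=2, metric="cosine").fit_predict(embeddings)
```

Here the first two rows share one cluster label, the next two another, and the outlier comes back as `-1` noise.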
- Contradiction Detection:
- Claude API extracts structured key claims from each cluster representative
- Cross-cluster comparison identifies direct factual conflicts
- Contradiction rate feeds directly into the final consistency score
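Once Claude has extracted structured claims per cluster, the cross-cluster comparison itself is straightforward. A sketch with hard-coded facts (in the real pipeline these come from the fact-extraction call; the topic keys and claim strings here are illustrative assumptions):

```python
# Hypothetical extracted facts, keyed by topic, one dict per cluster representative.
cluster_facts = {
    "cluster_a": {"cancellation_window": "cancel anytime", "refund": "full refund"},
    "cluster_b": {"cancellation_window": "30-day notice required", "refund": "full refund"},
}

def find_contradictions(facts_by_cluster: dict[str, dict[str, str]]) -> list[tuple]:
    """Flag topics where two clusters make different claims about the same thing."""
    conflicts = []
    clusters = list(facts_by_cluster)
    for i, a in enumerate(clusters):
        for b in clusters[i + 1:]:
            shared_topics = facts_by_cluster[a].keys() & facts_by_cluster[b].keys()
            for topic in shared_topics:
                if facts_by_cluster[a][topic] != facts_by_cluster[b][topic]:
                    conflicts.append((topic,
                                      a, facts_by_cluster[a][topic],
                                      b, facts_by_cluster[b][topic]))
    return conflicts

contradictions = find_contradictions(cluster_facts)
```

This is what turns "the clusters diverge" into the concrete "cancel anytime vs. 30-day notice required" finding quoted in the Results section; real claim matching would need fuzzier comparison than exact string equality.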
- Reporting:
- Jinja2 templating with inline Chart.js — zero external dependencies at view time
- 20×20 cosine similarity heatmap with color gradient
- Cluster distribution pie chart and response latency bar chart
- Expandable response gallery with per-call latency and cost metadata
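The self-contained-report principle is that every chart's data (and, in the real report, the Chart.js source itself) is inlined into one HTML string. A minimal Jinja2 sketch, with the template and field names invented for illustration:

```python
import json
from jinja2 import Template

# All data is serialized into the page itself, so the file opens anywhere
# with no CDN or server; the real report inlines Chart.js the same way.
TEMPLATE = Template("""<!doctype html>
<html><head><title>{{ title }}</title></head>
<body>
<h1>{{ title }}: {{ score }}%</h1>
<script>
const similarityMatrix = {{ matrix_json }};  // feeds the inline heatmap
</script>
</body></html>""")

def render_report(title: str, score: int, matrix: list[list[float]]) -> str:
    return TEMPLATE.render(title=title, score=score,
                           matrix_json=json.dumps(matrix))

html = render_report("Consistency Report", 87, [[1.0, 0.9], [0.9, 1.0]])
```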
- Performance:
- Full pipeline (20 paraphrases + testing + analysis + report): 60–120 seconds
- asyncio concurrency drops sequential 3–5 min runs to under 60s
- ~2GB RAM usage including sentence-transformer model weights
- Less than 50ms overhead per individual analysis step after model load
Results
- Consistency Score: 87% (GOOD) on well-engineered customer support prompts
- Concurrent Speedup: 4–5x faster than sequential testing with asyncio parallelism
- Memory Efficiency: DBSCAN + embeddings pipeline runs in less than 2GB RAM
- Cluster Precision: Correctly groups semantic equivalents (e.g., “cancel account” ≈ “terminate subscription”) that string matching misses
- Contradiction Clarity: Fact extraction surfaces specific conflicts (e.g., “Cluster A: cancel anytime” vs “Cluster B: 30-day notice required”) rather than vague similarity scores
- Report Portability: Self-contained HTML reports shareable over Slack, email, or committed to repos — no server required
Example Consistency Analysis Output
============================================================
CONSISTENCY ANALYSIS RESULTS
============================================================
Question: "How do I reset my password?"
Model: Claude Sonnet 4
✓ CONSISTENCY SCORE: 87% (GOOD)
Analysis Summary:
Response Clusters: 2 groups identified
Contradictions Found: 0
Average Response Time: 1.2s
Total Cost Estimate: $0.08
Cluster Breakdown:
Cluster 1 (16/20 responses): Standard password reset flow via email link
Cluster 2 (4/20 responses): Includes mention of 2FA recovery options
Report Location:
./results/consistency_report_How_do_I_reset_20260221_082150.html
Batch Test Summary (10 Benchmark Questions)
Batch Test Summary:
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━┓
┃ Question ┃ Category ┃ Score ┃ Status ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━┩
│ How do I reset my password? │ support │ 87% │ ✓ GOOD │
│ How do I cancel my subscription? │ support │ 82% │ ✓ GOOD │
│ Where can I download the mobile app?│ support │ 79% │ ✓ GOOD │
│ What's your refund policy? │ product │ 65% │ ⚠️ MEDIUM │
│ Do you offer student discounts? │ product │ 71% │ ⚠️ MEDIUM │
│ Why is the page loading slowly? │ technical│ 58% │ ❌ POOR │
│ How do I export my data? │ technical│ 63% │ ⚠️ MEDIUM │
│ How do I update my payment method? │ billing │ 84% │ ✓ GOOD │
│ When will I be charged? │ billing │ 67% │ ⚠️ MEDIUM │
│ Can I integrate with Slack? │ features │ 76% │ ⚠️ MEDIUM │
└─────────────────────────────────────┴──────────┴───────┴───────────┘
Best Practices & Lessons Learned
- Use semantic embeddings instead of string matching — sentence-transformers correctly groups “cancel account” and “terminate subscription” where regex fails
- Concurrency is non-negotiable — asyncio parallelism cuts 3–5 minute sequential runs to under 60 seconds, the difference between running tests daily vs. never
- Tune DBSCAN `eps` per domain — default 0.3 works well for support prompts; technical domains with narrow vocabulary benefit from 0.2
- Fact extraction is more valuable than similarity alone — cluster similarity tells you responses diverge, Claude’s fact extraction tells you how they diverge
- Run adversarial tests on any prompt handling sensitive data — model behavior often shifts under social engineering framing in ways basic paraphrase testing won’t catch
- Self-contained HTML reports matter for teams — removing CDN dependencies means reports can be shared over Slack, emailed, or committed to repos with zero setup
- Set the consistency score threshold before you look at results — deciding post-hoc whether 70% is “good enough” introduces bias into the evaluation
- Batch mode is essential for prompt libraries — testing 10 prompts takes the same setup effort as testing 1, use it to audit entire FAQ sets at once
Next Steps
- Add multi-language support for testing consistency across localized prompt variants
- Implement PDF report export for teams that need printable or ticketable outputs
- Build real-time streaming progress UI for longer batch test runs
- Add support for additional LLM providers — Cohere, AI21, Mistral, Gemini
- Create custom paraphrase categories via YAML config files without code changes
- Build GitHub Actions integration for automated consistency checks on prompt library PRs
- Add Weights & Biases integration for tracking consistency scores across model versions
- Implement regression alerts when a prompt’s score drops between model updates
- Develop a consistency budget system — fail CI if any prompt drops below a defined threshold
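The proposed consistency-budget CI gate could be as simple as the sketch below: the threshold is fixed before any results are seen, and the build fails if any prompt falls under it. The budget value and scores are illustrative:

```python
# Hypothetical CI gate: the budget is agreed up front, not chosen after
# looking at results, so the pass/fail decision stays unbiased.
BUDGET = 75

def check_budget(scores: dict[str, int], budget: int = BUDGET) -> list[str]:
    """Return the prompts that violate the budget; CI fails if any do."""
    return [question for question, score in scores.items() if score < budget]

batch = {"How do I reset my password?": 87,
         "Why is the page loading slowly?": 58}
violations = check_budget(batch)
print("FAIL" if violations else "PASS", violations)
```

A CI wrapper would exit non-zero when `violations` is non-empty, which is enough to block a prompt-library PR.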