LLM Response Judge
Auto-grade LLM responses with customizable rubrics, multi-provider evaluation, AI-powered rewrites, and a real-time React dashboard — no manual review required
Problem Statement
We asked NEO to build a full-stack LLM evaluation web app with:
- Customizable quality rubrics powered by real LLM scoring
- Support for Anthropic, OpenAI, Google Gemini, and local Ollama models
- Auto-improvement engine that rewrites poor responses with predicted score gains
- Demo mode with pre-evaluated data — no API key needed
- Dockerized React + FastAPI architecture for one-command deployment
Solution Overview
NEO built a production-ready LLM evaluation app that replaces manual review with automated rubric-driven grading:
- Multi-Provider Evaluation Engine routes to Claude, GPT-4, Gemini, OpenRouter, or Ollama — same rubric, same scoring, any provider
- Customizable Rubric System offers 3 built-in presets plus a full rubric editor with weighted criteria
- Auto-Improvement Pipeline rewrites bottom-performing responses with predicted score improvements
- Real-Time React Dashboard shows live progress, per-criterion breakdowns, critical issue flags (bottom 10%), and Chart.js visualizations
- Demo Mode loads 20 pre-evaluated responses instantly — no API key, full feature exploration in under 60 seconds
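The multi-provider engine above boils down to a thin dispatch layer: one rubric-driven judge prompt, routed to whichever provider the user selects. A minimal sketch (the client classes and module layout here are hypothetical stand-ins, not the repo's actual code):

```python
# Sketch of provider routing: same rubric prompt, any provider.
# MockClient is a placeholder for the real per-provider SDK wrappers.
from dataclasses import dataclass

@dataclass
class JudgeRequest:
    question: str
    response: str
    rubric_prompt: str  # rendered rubric + scoring instructions

class MockClient:
    """Placeholder for a provider SDK wrapper (Anthropic, OpenAI, ...)."""
    def __init__(self, model: str):
        self.model = model

    def complete(self, prompt: str) -> str:
        # A real client would call the provider's API here.
        return f"[{self.model}] scored"

PROVIDERS = {
    "anthropic": MockClient("claude-3-5-sonnet"),
    "openai": MockClient("gpt-4-turbo"),
    "gemini": MockClient("gemini-pro"),
    "ollama": MockClient("llama3"),
}

def evaluate(req: JudgeRequest, provider: str) -> str:
    client = PROVIDERS[provider]  # same rubric, same prompt, any provider
    prompt = f"{req.rubric_prompt}\n\nQ: {req.question}\nA: {req.response}"
    return client.complete(prompt)
```

The key design point is that the rubric prompt is built once and is provider-agnostic; swapping providers never changes the scoring instructions.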
Workflow / Pipeline
| Step | Description |
|---|---|
| 1. File Upload & Parsing | FileUpload.jsx validates CSV/JSON structure and checks required fields (question, response) before evaluation begins |
| 2. Provider & Rubric Selection | Choose any LLM provider and an evaluation rubric — 3 built-in presets or a custom rubric with weighted criteria via RubricEditor.jsx |
| 3. API Key Validation | Key validated via /validate-key before evaluation starts — stored in browser localStorage only, never persisted server-side |
| 4. Batch Evaluation | FastAPI routes each response through the evaluation prompt (backend/prompts/judge.py) — scores each criterion with weighted aggregation |
| 5. Per-Criterion Scoring | LLM returns structured scores per rubric criterion with justification text — granular feedback beyond a single composite score |
| 6. Critical Issue Detection | Dashboard automatically flags the bottom 10% of responses for priority review — no manual sorting needed |
| 7. Auto-Improvement | /improve endpoint sends the original Q&A + rubric context to the LLM and returns a rewrite with predicted score gain |
| 8. Dashboard & Export | Real-time Chart.js visualizations, sortable results table, expandable per-response breakdowns — exportable to PDF or Markdown |
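Steps 4-6 reduce to weighted aggregation plus a percentile cut. A minimal sketch, assuming the 0-10 per-criterion scale shown in the example output below (criterion names and weights here are illustrative, not taken from the repo):

```python
# Weighted aggregation of per-criterion scores into a 0-100 composite,
# then flagging the bottom 10% of a batch for priority review.

def composite_score(scores: dict[str, int], weights: dict[str, float]) -> float:
    """scores: criterion -> 0-10; weights: criterion -> fractions summing to 1.0"""
    return sum(scores[c] * 10 * weights[c] for c in weights)

def flag_bottom_decile(composites: list[float]) -> set[int]:
    """Return indices of the lowest-scoring 10% of responses."""
    n_flag = max(1, len(composites) // 10)
    order = sorted(range(len(composites)), key=lambda i: composites[i])
    return set(order[:n_flag])

weights = {"accuracy": 0.2, "clarity": 0.2, "empathy": 0.2,
           "completeness": 0.2, "actionability": 0.2}
scores = {"accuracy": 9, "clarity": 8, "empathy": 6,
          "completeness": 7, "actionability": 9}
print(composite_score(scores, weights))  # 78.0, matching the worked example
```

With equal weights this reproduces the 78/100 composite in the single-response example; unequal weights shift the composite toward whichever criteria matter most for your use case.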
Repository & Artifacts
Generated Artifacts:
- React frontend — dashboard, file upload, rubric editor, settings (Vite + Tailwind CSS)
- FastAPI backend — evaluation, batch processing, rubric management, auto-improvement endpoints
- Multi-provider LLM clients — Anthropic, OpenAI, Google Gemini, OpenRouter, Ollama
- Rubric-based evaluation prompt system with weighted per-criterion scoring (`backend/prompts/judge.py`)
- Auto-improvement engine with score prediction (`/improve` endpoint)
- Demo mode with 20 pre-evaluated responses — no API key required
- 100 sample Q&A pairs across technical, policy, and support categories
- Docker Compose multi-container setup for one-command startup
- GitHub Actions CI/CD workflow
- Deployment guides for Vercel (frontend) and Railway (backend)
Technical Details
- Frontend:
  - React 18+ with Vite, Tailwind CSS dark mode, Chart.js visualizations
  - Custom hooks for evaluation state and live progress polling
  - API keys stored in `localStorage` — never sent to server storage
- Backend:
  - FastAPI with async handlers for non-blocking batch evaluation
  - Pydantic schemas for validation, SQLAlchemy for result persistence
  - CORS protection and rate limiting
- LLM Providers:
  - Anthropic Claude 3.5 Sonnet — recommended, highest evaluation quality
  - OpenAI GPT-4 Turbo — high quality, fast
  - Google Gemini Pro — cost-effective
  - OpenRouter — unified API for Llama, Mistral, and more
  - Ollama — local models, zero API cost, full privacy
- Rubric System:
  - 3 built-in presets: customer support, technical accuracy, creative writing
  - Custom builder with weighted criteria summing to 100%
  - Per-criterion scores with justification text per response
  - Weighted aggregation into a single 0–100 composite score
- API Endpoints:
  - `POST /evaluate` — single response
  - `POST /evaluate/batch` — CSV/JSON batch
  - `GET /rubrics/`, `GET /rubrics/{id}` — rubric retrieval
  - `POST /improve` — rewrite with score prediction
  - `POST /validate-key` — API key validation
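A call against the single-response endpoint might look like the following. The payload field names are assumptions inferred from the workflow above, not verified against the backend's actual Pydantic schemas:

```python
# Hypothetical request to POST /evaluate using only the standard library.
# Adjust field names to match the backend's real request model.
import json
from urllib import request

payload = {
    "question": "How do I reset my password?",
    "response": "Go to Settings > Security and click 'Reset password'.",
    "rubric_id": "customer_support",  # one of the built-in presets (assumed id)
    "provider": "anthropic",
    "api_key": "sk-...",              # validated via /validate-key first
}

req = request.Request(
    "http://localhost:8000/evaluate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# With the backend running:
# with request.urlopen(req) as resp:
#     result = json.load(resp)
```

The actual network call is left commented out since it requires a running backend and a valid key.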
Results
- Throughput: ~100 responses in 3 minutes (varies by provider)
- Scalability: Handles 500+ response batches without issues
- Demo Load: Under 2 seconds — 20 full breakdowns, zero API calls
- Auto-Improvement: Rewrites flagged responses with a predicted score gain reported for each (e.g. +13 points in the example below)
- Privacy: Zero server-side API key storage — credentials stay in `localStorage`
Example Evaluation Output (Single Response)
Question: "How do I reset my password?"
Evaluation Results:

```
┌──────────────────┬───────┬────────────────────────────────────┐
│ Criterion        │ Score │ Justification                      │
├──────────────────┼───────┼────────────────────────────────────┤
│ Accuracy         │ 9/10  │ Correct steps, no misleading info  │
│ Clarity          │ 8/10  │ Clear and concise, minor gaps      │
│ Empathy          │ 6/10  │ Lacks warm acknowledgement         │
│ Completeness     │ 7/10  │ Missing 2FA recovery mention       │
│ Actionability    │ 9/10  │ User can act immediately           │
└──────────────────┴───────┴────────────────────────────────────┘
Composite Score: 78/100 | ⚠️ Needs Improvement
Predicted Score After Auto-Improvement: 91/100 (+13 points)
```

Batch Evaluation Summary (100 Responses)
```
Batch Evaluation Complete
─────────────────────────────────────────────
Total Responses:     100 ✓
Average Score:       74.3 / 100
Score Distribution:
  Excellent (90+):   18  ██████
  Good (75-89):      41  █████████████
  Fair (60-74):      29  ██████████
  Poor (<60):        12  ████
Critical Issues Flagged: 10 (bottom 10%)
Evaluation Time: 2m 47s | Provider: Claude 3.5 Sonnet
─────────────────────────────────────────────
```

Best Practices & Lessons Learned
- Start with Demo Mode — full feature exploration with zero setup cost before touching an API key
- Match rubric criteria to your quality bar — “references correct policy version” is far more actionable than just “accuracy”
- Weight criteria intentionally — equal weights treat empathy the same as factual accuracy; align with what drives real user satisfaction
- Use Claude for evaluation — even if your app uses a different provider, Claude produces the most consistent rubric scores
- Auto-improve flagged responses only — the highest scorers don’t need rewrites; save the API tokens
- Store every batch result — scoring drift across model updates is invisible without historical baselines
- Use Ollama for privacy-sensitive evals — zero cloud API cost, full local data residency
- Per-criterion scores beat composite scores — 78/100 tells you nothing; empathy at 6/10 tells you exactly what to fix
- Validate your file format first — a malformed CSV caught at row 300 means re-running the entire batch
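The last point above is cheap to enforce up front. A minimal pre-flight check, assuming the two required columns named in the workflow (question, response):

```python
# Pre-flight CSV validation: surface every malformed row *before*
# spending any API tokens on evaluation.
import csv
import io

REQUIRED = {"question", "response"}

def validate_csv(text: str) -> list[str]:
    """Return a list of problems; an empty list means the file is safe to run."""
    reader = csv.DictReader(io.StringIO(text))
    missing = REQUIRED - set(reader.fieldnames or [])
    if missing:
        return [f"missing required columns: {sorted(missing)}"]
    problems = []
    for i, row in enumerate(reader, start=2):  # row 1 is the header
        for col in REQUIRED:
            if not (row.get(col) or "").strip():
                problems.append(f"row {i}: empty '{col}'")
    return problems

good = "question,response\nHow do I reset?,Go to Settings.\n"
bad = "question,response\nHow do I reset?,\n"
print(validate_csv(good))  # []
print(validate_csv(bad))   # ["row 2: empty 'response'"]
```

Collecting all problems in one pass (rather than failing on the first) lets the user fix the whole file in a single edit.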
Next Steps
- Add multi-turn conversation evaluation — currently single-turn Q&A only
- Implement evaluation history with cross-batch trend analysis
- Build a rubric marketplace for sharing and version-controlling criteria
- Add real-time collaboration for team batch reviews
- Implement provider agreement scoring — surface where two providers disagree
- GitHub Actions integration for automated evaluation on dataset PRs
- Webhook support for triggering runs from external pipelines
- Extend auto-improvement with style constraints (length limits, tone rules)