LLM Response Judge
Auto-grade LLM responses with customizable rubrics, multi-provider evaluation, AI-powered rewrites, and a real-time React dashboard — no manual review required
Problem Statement
We asked NEO to build a full-stack LLM evaluation web app with:
- Customizable quality rubrics powered by real LLM scoring
- Support for Anthropic, OpenAI, Google Gemini, and local Ollama models
- Auto-improvement engine that rewrites poor responses with predicted score gains
- Demo mode with pre-evaluated data — no API key needed
- Dockerized React + FastAPI architecture for one-command deployment
Solution Overview
NEO built a production-ready LLM evaluation app that replaces manual review with automated rubric-driven grading:
- Multi-Provider Evaluation Engine routes to Claude, GPT-4, Gemini, OpenRouter, or Ollama — same rubric, same scoring, any provider
- Customizable Rubric System offers 3 built-in presets plus a full rubric editor with weighted criteria
- Auto-Improvement Pipeline rewrites bottom-performing responses with predicted score improvements
- Real-Time React Dashboard shows live progress, per-criterion breakdowns, critical issue flags (bottom 10%), and Chart.js visualizations
- Demo Mode loads 20 pre-evaluated responses instantly — no API key, full feature exploration in under 60 seconds
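The multi-provider engine above boils down to a thin dispatch layer: one rubric-driven judge prompt, routed to whichever provider the user selects. A minimal sketch (the client classes and module layout here are hypothetical stand-ins, not the repo's actual code):

```python
# Sketch of provider routing: same rubric prompt, any provider.
# MockClient is a placeholder for the real per-provider SDK wrappers.
from dataclasses import dataclass

@dataclass
class JudgeRequest:
    question: str
    response: str
    rubric_prompt: str  # rendered rubric + scoring instructions

class MockClient:
    """Placeholder for a provider SDK wrapper (Anthropic, OpenAI, ...)."""
    def __init__(self, model: str):
        self.model = model

    def complete(self, prompt: str) -> str:
        # A real client would call the provider's API here.
        return f"[{self.model}] scored"

PROVIDERS = {
    "anthropic": MockClient("claude-3-5-sonnet"),
    "openai": MockClient("gpt-4-turbo"),
    "gemini": MockClient("gemini-pro"),
    "ollama": MockClient("llama3"),
}

def evaluate(req: JudgeRequest, provider: str) -> str:
    client = PROVIDERS[provider]  # same rubric, same prompt, any provider
    prompt = f"{req.rubric_prompt}\n\nQ: {req.question}\nA: {req.response}"
    return client.complete(prompt)
```

The key design point is that the rubric prompt is built once and is provider-agnostic; swapping providers never changes the scoring instructions.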
Workflow / Pipeline
| Step | Description |
|---|---|
| 1. File Upload & Parsing | FileUpload.jsx validates CSV/JSON structure and checks required fields (question, response) before evaluation begins |
| 2. Provider & Rubric Selection | Choose any LLM provider and an evaluation rubric — 3 built-in presets or a custom rubric with weighted criteria via RubricEditor.jsx |
| 3. API Key Validation | Key validated via /validate-key before evaluation starts — stored in browser localStorage only, never persisted server-side |
| 4. Batch Evaluation | FastAPI routes each response through the evaluation prompt (backend/prompts/judge.py) — scores each criterion with weighted aggregation |
| 5. Per-Criterion Scoring | LLM returns structured scores per rubric criterion with justification text — granular feedback beyond a single composite score |
| 6. Critical Issue Detection | Dashboard automatically flags the bottom 10% of responses for priority review — no manual sorting needed |
| 7. Auto-Improvement | /improve endpoint sends the original Q&A + rubric context to the LLM and returns a rewrite with predicted score gain |
| 8. Dashboard & Export | Real-time Chart.js visualizations, sortable results table, expandable per-response breakdowns — exportable to PDF or Markdown |
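Steps 4-6 reduce to weighted aggregation plus a percentile cut. A minimal sketch, assuming the 0-10 per-criterion scale shown in the example output below (criterion names and weights here are illustrative, not taken from the repo):

```python
# Weighted aggregation of per-criterion scores into a 0-100 composite,
# then flagging the bottom 10% of a batch for priority review.

def composite_score(scores: dict[str, int], weights: dict[str, float]) -> float:
    """scores: criterion -> 0-10; weights: criterion -> fractions summing to 1.0"""
    return sum(scores[c] * 10 * weights[c] for c in weights)

def flag_bottom_decile(composites: list[float]) -> set[int]:
    """Return indices of the lowest-scoring 10% of responses."""
    n_flag = max(1, len(composites) // 10)
    order = sorted(range(len(composites)), key=lambda i: composites[i])
    return set(order[:n_flag])

weights = {"accuracy": 0.2, "clarity": 0.2, "empathy": 0.2,
           "completeness": 0.2, "actionability": 0.2}
scores = {"accuracy": 9, "clarity": 8, "empathy": 6,
          "completeness": 7, "actionability": 9}
print(composite_score(scores, weights))  # 78.0, matching the worked example
```

With equal weights this reproduces the 78/100 composite in the single-response example; unequal weights shift the composite toward whichever criteria matter most for your use case.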
Repository & Artifacts
Generated Artifacts:
- React frontend — dashboard, file upload, rubric editor, settings (Vite + Tailwind CSS)
- FastAPI backend — evaluation, batch processing, rubric management, auto-improvement endpoints
- Multi-provider LLM clients — Anthropic, OpenAI, Google Gemini, OpenRouter, Ollama
- Rubric-based evaluation prompt system with weighted per-criterion scoring (`backend/prompts/judge.py`)
- Auto-improvement engine with score prediction (`/improve` endpoint)
- Demo mode with 20 pre-evaluated responses — no API key required
- 100 sample Q&A pairs across technical, policy, and support categories
- Docker Compose multi-container setup for one-command startup
- GitHub Actions CI/CD workflow
- Deployment guides for Vercel (frontend) and Railway (backend)
Technical Details
- Frontend:
  - React 18+ with Vite, Tailwind CSS dark mode, Chart.js visualizations
  - Custom hooks for evaluation state and live progress polling
  - API keys stored in `localStorage` — never sent to server storage
- Backend:
  - FastAPI with async handlers for non-blocking batch evaluation
  - Pydantic schemas for validation, SQLAlchemy for result persistence
  - CORS protection and rate limiting
- LLM Providers:
  - Anthropic Claude 3.5 Sonnet — recommended, highest evaluation quality
  - OpenAI GPT-4 Turbo — high quality, fast
  - Google Gemini Pro — cost-effective
  - OpenRouter — unified API for Llama, Mistral, and more
  - Ollama — local models, zero API cost, full privacy
- Rubric System:
  - 3 built-in presets: customer support, technical accuracy, creative writing
  - Custom builder with weighted criteria summing to 100%
  - Per-criterion scores with justification text per response
  - Weighted aggregation into a single 0–100 composite score
- API Endpoints:
  - `POST /evaluate` — single response
  - `POST /evaluate/batch` — CSV/JSON batch
  - `GET /rubrics/`, `GET /rubrics/{id}` — rubric retrieval
  - `POST /improve` — rewrite with score prediction
  - `POST /validate-key` — API key validation
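A call against the single-response endpoint might look like the following. The payload field names are assumptions inferred from the workflow above, not verified against the backend's actual Pydantic schemas:

```python
# Hypothetical request to POST /evaluate using only the standard library.
# Adjust field names to match the backend's real request model.
import json
from urllib import request

payload = {
    "question": "How do I reset my password?",
    "response": "Go to Settings > Security and click 'Reset password'.",
    "rubric_id": "customer_support",  # one of the built-in presets (assumed id)
    "provider": "anthropic",
    "api_key": "sk-...",              # validated via /validate-key first
}

req = request.Request(
    "http://localhost:8000/evaluate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# With the backend running:
# with request.urlopen(req) as resp:
#     result = json.load(resp)
```

The actual network call is left commented out since it requires a running backend and a valid key.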
Results
- Throughput: ~100 responses in 3 minutes (varies by provider)
- Scalability: Handles 500+ response batches without issues
- Demo Load: Under 2 seconds — 20 full breakdowns, zero API calls
- Auto-Improvement: Rewrites flagged responses with a predicted score gain reported for each (e.g. +13 points in the example below)
- Privacy: Zero server-side API key storage — credentials stay in `localStorage`
Example Evaluation Output (Single Response)
Question: "How do I reset my password?"
Evaluation Results:

```
┌──────────────────┬───────┬────────────────────────────────────┐
│ Criterion        │ Score │ Justification                      │
├──────────────────┼───────┼────────────────────────────────────┤
│ Accuracy         │ 9/10  │ Correct steps, no misleading info  │
│ Clarity          │ 8/10  │ Clear and concise, minor gaps      │
│ Empathy          │ 6/10  │ Lacks warm acknowledgement         │
│ Completeness     │ 7/10  │ Missing 2FA recovery mention       │
│ Actionability    │ 9/10  │ User can act immediately           │
└──────────────────┴───────┴────────────────────────────────────┘
Composite Score: 78/100 | ⚠️ Needs Improvement
Predicted Score After Auto-Improvement: 91/100 (+13 points)
```

Batch Evaluation Summary (100 Responses)
```
Batch Evaluation Complete
─────────────────────────────────────────────
Total Responses:     100 ✓
Average Score:       74.3 / 100
Score Distribution:
  Excellent (90+):   18  ██████
  Good (75-89):      41  █████████████
  Fair (60-74):      29  ██████████
  Poor (<60):        12  ████
Critical Issues Flagged: 10 (bottom 10%)
Evaluation Time: 2m 47s | Provider: Claude 3.5 Sonnet
─────────────────────────────────────────────
```

Best Practices & Lessons Learned
- Start with Demo Mode — full feature exploration with zero setup cost before touching an API key
- Match rubric criteria to your quality bar — “references correct policy version” is far more actionable than just “accuracy”
- Weight criteria intentionally — equal weights treat empathy the same as factual accuracy; align with what drives real user satisfaction
- Use Claude for evaluation — even if your app uses a different provider, Claude produces the most consistent rubric scores
- Auto-improve flagged responses only — the highest scorers don’t need rewrites; save the API tokens
- Store every batch result — scoring drift across model updates is invisible without historical baselines
- Use Ollama for privacy-sensitive evals — zero cloud API cost, full local data residency
- Per-criterion scores beat composite scores — 78/100 tells you nothing; empathy at 6/10 tells you exactly what to fix
- Validate your file format first — a malformed CSV caught at row 300 means re-running the entire batch
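The last point above is cheap to enforce up front. A minimal pre-flight check, assuming the two required columns named in the workflow (question, response):

```python
# Pre-flight CSV validation: surface every malformed row *before*
# spending any API tokens on evaluation.
import csv
import io

REQUIRED = {"question", "response"}

def validate_csv(text: str) -> list[str]:
    """Return a list of problems; an empty list means the file is safe to run."""
    reader = csv.DictReader(io.StringIO(text))
    missing = REQUIRED - set(reader.fieldnames or [])
    if missing:
        return [f"missing required columns: {sorted(missing)}"]
    problems = []
    for i, row in enumerate(reader, start=2):  # row 1 is the header
        for col in REQUIRED:
            if not (row.get(col) or "").strip():
                problems.append(f"row {i}: empty '{col}'")
    return problems

good = "question,response\nHow do I reset?,Go to Settings.\n"
bad = "question,response\nHow do I reset?,\n"
print(validate_csv(good))  # []
print(validate_csv(bad))   # ["row 2: empty 'response'"]
```

Collecting all problems in one pass (rather than failing on the first) lets the user fix the whole file in a single edit.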
Next Steps
- Add multi-turn conversation evaluation — currently single-turn Q&A only
- Implement evaluation history with cross-batch trend analysis
- Build a rubric marketplace for sharing and version-controlling criteria
- Add real-time collaboration for team batch reviews
- Implement provider agreement scoring — surface where two providers disagree
- GitHub Actions integration for automated evaluation on dataset PRs
- Webhook support for triggering runs from external pipelines
- Extend auto-improvement with style constraints (length limits, tone rules)