
LLM Response Judge

Auto-grade LLM responses with customizable rubrics, multi-provider evaluation, AI-powered rewrites, and a real-time React dashboard — no manual review required


Problem Statement

We asked NEO to build a full-stack LLM evaluation web app with customizable grading rubrics, multi-provider evaluation, AI-powered response rewrites, and a real-time dashboard.


Solution Overview

NEO built a production-ready LLM evaluation app that replaces manual review with automated rubric-driven grading:

  1. Multi-Provider Evaluation Engine routes to Claude, GPT-4, Gemini, OpenRouter, or Ollama — same rubric, same scoring, any provider
  2. Customizable Rubric System offers 3 built-in presets plus a full rubric editor with weighted criteria
  3. Auto-Improvement Pipeline rewrites bottom-performing responses with predicted score improvements
  4. Real-Time React Dashboard shows live progress, per-criterion breakdowns, critical issue flags (bottom 10%), and Chart.js visualizations
  5. Demo Mode loads 20 pre-evaluated responses instantly — no API key, full feature exploration in under 60 seconds
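The weighted-criteria scoring in item 2 can be sketched in a few lines of Python. This is a minimal illustration, not the repo's actual implementation; the function name, criterion names, and equal weights are all assumptions, though equal weights do reproduce the 78/100 composite shown in the example output below.

```python
def composite_score(scores: dict[str, int], weights: dict[str, float]) -> float:
    """Weighted average of 0-10 per-criterion scores, normalized to 0-100."""
    total_weight = sum(weights.values())
    weighted = sum(scores[c] * weights[c] for c in scores)
    return round(weighted / (10 * total_weight) * 100, 1)

scores = {"accuracy": 9, "clarity": 8, "empathy": 6,
          "completeness": 7, "actionability": 9}
weights = {c: 1.0 for c in scores}  # equal weights for illustration
print(composite_score(scores, weights))  # → 78.0
```

A custom rubric would simply supply different weights, e.g. doubling `accuracy` to make factual errors dominate the composite.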

Workflow / Pipeline

Step | Description
1. File Upload & Parsing | FileUpload.jsx validates CSV/JSON structure and checks required fields (question, response) before evaluation begins
2. Provider & Rubric Selection | Choose any LLM provider and an evaluation rubric — 3 built-in presets or a custom rubric with weighted criteria via RubricEditor.jsx
3. API Key Validation | Key validated via /validate-key before evaluation starts — stored in browser localStorage only, never persisted server-side
4. Batch Evaluation | FastAPI routes each response through the evaluation prompt (backend/prompts/judge.py) — scores each criterion with weighted aggregation
5. Per-Criterion Scoring | LLM returns structured scores per rubric criterion with justification text — granular feedback beyond a single composite score
6. Critical Issue Detection | Dashboard automatically flags the bottom 10% of responses for priority review — no manual sorting needed
7. Auto-Improvement | /improve endpoint sends the original Q&A + rubric context to the LLM and returns a rewrite with predicted score gain
8. Dashboard & Export | Real-time Chart.js visualizations, sortable results table, expandable per-response breakdowns — exportable to PDF or Markdown
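Step 1's upload validation could look roughly like the sketch below. The real check runs client-side in FileUpload.jsx; this Python version, with a hypothetical `validate_upload` helper, just shows the same rule: parse CSV or JSON and reject rows missing the required `question` and `response` fields.

```python
import csv
import io
import json

REQUIRED_FIELDS = {"question", "response"}

def validate_upload(filename: str, raw: bytes) -> list[dict]:
    """Parse a CSV/JSON upload and verify required fields (illustrative sketch)."""
    if filename.endswith(".json"):
        rows = json.loads(raw.decode("utf-8"))
    elif filename.endswith(".csv"):
        rows = list(csv.DictReader(io.StringIO(raw.decode("utf-8"))))
    else:
        raise ValueError("unsupported file type: expected .csv or .json")
    for i, row in enumerate(rows):
        missing = REQUIRED_FIELDS - row.keys()
        if missing:
            raise ValueError(f"row {i}: missing field(s) {sorted(missing)}")
    return rows

rows = validate_upload("batch.csv", b"question,response\nQ1,A1\nQ2,A2\n")
print(len(rows))  # → 2
```

Validating before the batch run starts means a malformed file fails fast instead of burning provider API calls on rows the judge prompt can't score.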

Repository & Artifacts

Dakshjain1604/LLM-response-Judge (View on GitHub)

Generated Artifacts:


Technical Details


Results

Example Evaluation Output (Single Response)

Question: "How do I reset my password?"

Evaluation Results:
┌──────────────────┬───────┬────────────────────────────────────┐
│ Criterion        │ Score │ Justification                      │
├──────────────────┼───────┼────────────────────────────────────┤
│ Accuracy         │ 9/10  │ Correct steps, no misleading info  │
│ Clarity          │ 8/10  │ Clear and concise, minor gaps      │
│ Empathy          │ 6/10  │ Lacks warm acknowledgement         │
│ Completeness     │ 7/10  │ Missing 2FA recovery mention       │
│ Actionability    │ 9/10  │ User can act immediately           │
└──────────────────┴───────┴────────────────────────────────────┘
Composite Score: 78/100 | ⚠️ Needs Improvement
Predicted Score After Auto-Improvement: 91/100 (+13 points)

Batch Evaluation Summary (100 Responses)

Batch Evaluation Complete
─────────────────────────────────────────────
Total Responses:  100 ✓
Average Score:    74.3 / 100

Score Distribution:
  Excellent (90+):  18 ██████
  Good (75-89):     41 █████████████
  Fair (60-74):     29 ██████████
  Poor (<60):       12 ████

Critical Issues Flagged: 10 (bottom 10%)
Evaluation Time: 2m 47s | Provider: Claude 3.5 Sonnet
─────────────────────────────────────────────
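The "bottom 10%" flagging from step 6 is a simple rank-and-cut over composite scores. A minimal sketch, assuming each result carries a `score` field (the function name and dict shape are illustrative, not taken from the repo):

```python
def flag_critical(results: list[dict], fraction: float = 0.10) -> list[dict]:
    """Mark the lowest-scoring fraction of responses for priority review."""
    n = max(1, int(len(results) * fraction))  # always flag at least one
    worst = sorted(results, key=lambda r: r["score"])[:n]
    worst_ids = {id(r) for r in worst}
    return [{**r, "critical": id(r) in worst_ids} for r in results]

batch = [{"id": i, "score": s}
         for i, s in enumerate([95, 40, 72, 88, 55, 61, 99, 35, 80, 67])]
flagged = [r["id"] for r in flag_critical(batch) if r["critical"]]
print(flagged)  # → [7]  (the single lowest score, 35, in a batch of 10)
```

For the 100-response run above this yields exactly the 10 flagged responses reported in the summary, surfaced in the dashboard without any manual sorting.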

Best Practices & Lessons Learned


Next Steps



