
LLM Evaluator Tool: Runtime Checks for Quality and Safety

NEO built a flexible evaluator that scores model outputs against rubrics. Use it to gate prompts, routing, and fine-tunes before they touch real users. The tool is aimed at teams that already ship LLM features and need repeatable scoring, not one-off eyeball review.


Problem Statement

We asked NEO to ship one CLI and API surface for evaluation batches: custom criteria, rolled-up scores, and exports your CI and product reviews can actually use. Spreadsheet grading and ad hoc rubrics do not scale when you compare models, prompt versions, or weekly releases.


Solution Overview

  1. Rubric engine: Criteria and weights live in YAML so you can version them beside application code.
  2. Batch mode: CSV or JSON in, scored rows out, with optional parallelism for large sets.
  3. Extensible judges: Rules, embedding similarity, or LLM-as-judge behind a common interface.
  4. Composite scoring: Per-dimension scores roll into a single weighted index for pass and fail thresholds.

(Figure: LLM evaluator tool architecture)

Evaluation Dimensions

The default rubric treats each response as a structured artifact and scores it on five dimensions. Each dimension uses a 1 to 5 scale in the underlying implementation; those roll into a 0 to 100 weighted composite you can threshold in CI.
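The 1-to-5 to 0-to-100 roll-up is a straightforward weighted rescale; a minimal sketch (the function name and dict shapes are assumptions, not the tool's API):

```python
def composite(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Roll per-dimension 1-5 scores into a 0-100 weighted composite.

    Each score s is rescaled to (s - 1) / 4 * 100, so all-1s maps to 0
    and all-5s maps to 100, then averaged by the dimension weights.
    """
    total_w = sum(weights.values())
    return sum(weights[d] * (scores[d] - 1) / 4 * 100 for d in weights) / total_w
```

With equal weights, a response scoring 5 on relevance and 1 on safety lands at exactly 50, which is why thresholding the composite alone is usually paired with a separate hard floor on safety.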

| Dimension | What it measures |
| --- | --- |
| Relevance | Does the answer address what the user actually asked? |
| Accuracy | Are the facts correct for the domain? |
| Completeness | Does the answer cover the important aspects of a multi-part question? |
| Clarity | Is the response structured and easy to follow? |
| Safety | Does it avoid harmful content and respect policy-style constraints? |

Weights are configurable in YAML so a medical assistant can emphasize accuracy and safety, while an internal coding assistant can emphasize relevance and completeness.
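A rubric file along these lines would express the medical-assistant weighting; the field names below are illustrative assumptions, not the tool's actual schema.

```yaml
# Illustrative rubric -- field names are assumptions, not the tool's real schema.
dimensions:
  accuracy:     {weight: 0.35, scale: [1, 5]}
  safety:       {weight: 0.30, scale: [1, 5]}
  relevance:    {weight: 0.15, scale: [1, 5]}
  completeness: {weight: 0.10, scale: [1, 5]}
  clarity:      {weight: 0.10, scale: [1, 5]}
thresholds:
  composite_floor: 70   # fail the run below this composite
  safety_critical: 2    # any safety score at or below this fails the run
```

Because the file lives beside application code, a weight change shows up in review like any other diff.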


Workflow / Pipeline

| Step | Description |
| --- | --- |
| 1. Define rubric | Criteria, scales, weights, and optional few-shot anchors for judges |
| 2. Load dataset | Pairs of prompt and model output from CSV or JSON, or live calls |
| 3. Score | Parallel workers call configured judges; failures are isolated per row |
| 4. Summarize | Aggregate metrics, composite score, and worst examples for review |
| 5. Export | JSON, CSV, or Markdown reports for dashboards and pull requests |
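The scoring step above — parallel workers with per-row failure isolation — can be sketched with a thread pool; `score_row` and `run_batch` are hypothetical names, not the tool's internals.

```python
from concurrent.futures import ThreadPoolExecutor

def score_row(row: dict, judges: dict) -> dict:
    # One row = one prompt/response pair; each judge returns a 1-5 score.
    return {name: judge(row["prompt"], row["response"]) for name, judge in judges.items()}

def run_batch(rows: list[dict], judges: dict, workers: int = 8) -> list[dict]:
    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(score_row, row, judges) for row in rows]
        for row, fut in zip(rows, futures):
            try:
                results.append({**row, "scores": fut.result()})
            except Exception as exc:
                # Failure isolation: record the error on this row, keep scoring the rest.
                results.append({**row, "error": str(exc)})
    return results
```

The key property is that one flaky judge call (a timeout, a malformed response) marks only its own row, so a 10,000-row batch never dies at row 9,999.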

CLI and Reports

Typical usage points the CLI at a CSV of prompts and responses, a rubric file, and an OpenRouter API key for the judge model. Flags often include the input path, rubric path, model id, concurrency, and output path. The HTML report shows per-dimension breakdowns, the composite score distribution, and the lowest-scoring examples so reviewers do not hunt through raw logs. JSON and CSV exports support automation: fail the build when the composite drops below a floor or when any safety score hits a critical band.


Repository & Artifacts

gauravvij/llm-evaluator — View on GitHub




