NeuroLens: Concept Neuron Discovery and Validation

Discover and validate which neurons in a neural network encode specific concepts—from “royalty” and “negation” to sentiment and tense—with automated probing, scoring, and causal ablation.


Problem Statement

We asked NEO to: Build a tool that automatically finds, labels, and validates “concept neurons” in neural networks—the individual neurons that fire selectively for semantic concepts like “royalty,” “negation,” or “past tense.” Manually identifying which neurons are responsible for which behaviors is tedious and error-prone; we wanted a repeatable pipeline that runs targeted probes, scores neurons by selectivity, and optionally confirms findings through causal ablation.


Solution Overview

NEO built NeuroLens, an interpretability tool that makes concept neurons visible and actionable:

  1. Targeted Probing: Run concept-specific probes against the model and capture activations layer by layer
  2. Neuron Scoring: Score each neuron on selectivity (how much it responds to the concept vs. non-concept examples), consistency, and layer importance
  3. Optional Causal Validation: Zero out top neurons and measure the drop in model confidence to confirm they actually drive behavior
  4. Multi-Model Support: Works across GPT-2 (small, medium), BERT, RoBERTa, and vision models (ResNet-18, ResNet-50)

The tool produces interactive HTML reports with heatmaps, scatter plots, Sankey diagrams, and per-neuron “passport” cards, so you can see exactly where and how a concept is encoded.
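The neuron-scoring step can be sketched in a few lines. The exact formulas NeuroLens uses are not shown above, so the normalized mean-difference selectivity and threshold-based consistency below are illustrative assumptions:

```python
import numpy as np

def score_neurons(pos_acts, neg_acts):
    """Score each neuron on selectivity and consistency.

    pos_acts, neg_acts: arrays of shape (n_examples, n_neurons)
    holding activations for concept and non-concept probe examples.
    (Illustrative metrics, not NeuroLens's exact formulas.)
    """
    pos_mean = pos_acts.mean(axis=0)
    neg_mean = neg_acts.mean(axis=0)
    # Selectivity: normalized gap between concept and non-concept means.
    selectivity = (pos_mean - neg_mean) / (np.abs(pos_mean) + np.abs(neg_mean) + 1e-8)
    # Consistency: fraction of positive examples where the neuron
    # fires above the non-concept mean.
    consistency = (pos_acts > neg_mean).mean(axis=0)
    return selectivity, consistency

# Toy example: neuron 0 is concept-selective, neuron 1 is not.
pos = np.array([[0.9, 0.1], [0.8, 0.2], [1.0, 0.1]])
neg = np.array([[0.1, 0.1], [0.2, 0.2], [0.0, 0.1]])
sel, con = score_neurons(pos, neg)
```

A neuron that both separates the classes (high selectivity) and fires on nearly all positive examples (high consistency) is a concept-neuron candidate worth ablating.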


Workflow / Pipeline

| Step | Description |
| --- | --- |
| 1. Concept & Model Selection | Choose a concept (e.g., royalty, negation, past_tense) and a model (GPT-2, BERT, RoBERTa, ResNet) |
| 2. Probe Construction | Build or load balanced positive/negative examples for the concept (built-in or custom probe dataset) |
| 3. Activation Hooks | Register hooks on MLP/FFN layers to capture neuron activations for each probe example |
| 4. Selectivity & Consistency Scoring | Compute per-neuron selectivity (activation difference) and consistency across positive examples |
| 5. Optional Causal Ablation | Zero out top neurons and measure the drop in model confidence to validate behavioral impact |
| 6. Clustering & Labeling | Group neurons into concept families and optionally generate plain-English descriptions per neuron |
| 7. Report Generation | Produce a self-contained HTML report with heatmaps, scatter plots, Sankey diagrams, and neuron cards |
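Step 3's activation capture can be sketched with PyTorch forward hooks. The tiny `nn.Sequential` stand-in below is an assumption for illustration, not the real hook-registration logic:

```python
import torch
import torch.nn as nn

# Tiny stand-in for a model with a nonlinearity; in practice the hooks
# would target the real model's MLP/FFN sublayers.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))

captured = {}

def make_hook(name):
    def hook(module, inputs, output):
        # Detach so stored activations don't keep the autograd graph alive.
        captured[name] = output.detach()
    return hook

# Hook every ReLU output (post-nonlinearity neuron activations).
handles = [
    m.register_forward_hook(make_hook(f"layer{i}"))
    for i, m in enumerate(model) if isinstance(m, nn.ReLU)
]

with torch.no_grad():
    model(torch.randn(5, 8))  # one batch of probe examples

for h in handles:  # always remove hooks when done
    h.remove()
```

After the forward pass, `captured` holds a `(batch, neurons)` tensor per hooked layer, which is exactly the shape the scoring step consumes.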

Repository & Artifacts

dakshjain-1616/Neuron-Activation-Mapper (View on GitHub)

Generated Artifacts:


Technical Details

Core modules

| Module | Role |
| --- | --- |
| probe_builder.py | Generates balanced concept / non-concept datasets |
| loader.py | Registers activation hooks on MLP/FFN layers |
| activation_scorer.py | Computes selectivity, consistency, and importance metrics |
| concept_labeler.py | Auto-labels neurons and clusters them into concept families |
| causal_ablator.py | Zeros out neurons and measures the impact on confidence |
| visualizer.py | Heatmaps, scatter plots, Sankey diagrams |
| report_generator.py | Assembles self-contained HTML reports |
| cli.py | Command-line interface |
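The idea behind causal ablation (step 5, `causal_ablator.py`) can be sketched as a forward hook that zeroes a chosen neuron mid-forward and re-measures the model's confidence. The toy model and neuron index below are assumptions, not the module's actual implementation:

```python
import torch
import torch.nn as nn

# Toy model standing in for the network under study (an assumption).
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
x = torch.randn(1, 4)

def confidence(logits):
    # Softmax probability of the target class (class 0 here).
    return torch.softmax(logits, dim=-1)[0, 0].item()

with torch.no_grad():
    baseline = confidence(model(x))

def ablate(module, inputs, output):
    # Zero out neuron 3 in the hidden layer; the returned tensor
    # replaces the layer's output for the rest of the forward pass.
    output = output.clone()
    output[:, 3] = 0.0
    return output

handle = model[1].register_forward_hook(ablate)
with torch.no_grad():
    ablated = confidence(model(x))
handle.remove()

drop = baseline - ablated  # a clear drop suggests causal involvement
```

Repeating this over the top-k scored neurons and averaging the confidence drop gives the kind of ">15% causal ablation drop" figure the results below report.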

Results: Validated Concept Neurons

| Concept | Model | Finding |
| --- | --- | --- |
| Royalty | GPT-2 | L5–L7 neurons, selectivity > 0.7 for king/queen/crown |
| Negation | GPT-2 | L3–L5 neurons firing on not/never/didn’t |
| Past tense | GPT-2 | L2–L4 neurons correlating with the -ed suffix |
| Dog | ResNet-18 | Conv4/5 neurons for fur texture and ear shape (GradCAM) |
| Sentiment | GPT-2 | L8–L10 neurons with a >15% causal ablation drop |

Best Practices & Lessons Learned


Next Steps


