NeuroLens: Concept Neuron Discovery and Validation
Discover and validate which neurons in a neural network encode specific concepts—from “royalty” and “negation” to sentiment and tense—with automated probing, scoring, and causal ablation
Problem Statement
We asked NEO to build a tool that automatically finds, labels, and validates "concept neurons" in neural networks—the individual neurons that fire selectively for semantic concepts like "royalty," "negation," or "past tense." Manually identifying which neurons are responsible for which behaviors is tedious and error-prone; we wanted a repeatable pipeline that runs targeted probes, scores neurons by selectivity, and optionally confirms findings through causal ablation.
Solution Overview
NEO built NeuroLens, an interpretability tool that makes concept neurons visible and actionable:
- Targeted Probing: Run concept-specific probes against the model and capture activations layer by layer
- Neuron Scoring: Score each neuron on selectivity (how much it responds to the concept vs. non-concept examples), consistency, and layer importance
- Optional Causal Validation: Zero out top neurons and measure the drop in model confidence to confirm they actually drive behavior
- Multi-Model Support: Works across GPT-2 (small, medium), BERT, RoBERTa, and vision models (ResNet-18, ResNet-50)
The tool produces interactive HTML reports with heatmaps, scatter plots, Sankey diagrams, and per-neuron “passport” cards, so you can see exactly where and how a concept is encoded.
Workflow / Pipeline
| Step | Description |
|---|---|
| 1. Concept & Model Selection | Choose a concept (e.g., royalty, negation, past_tense) and model (GPT-2, BERT, RoBERTa, ResNet) |
| 2. Probe Construction | Build or load balanced positive/negative examples for the concept (built-in or custom probe dataset) |
| 3. Activation Hooks | Register hooks on MLP/FFN layers to capture neuron activations for each probe example |
| 4. Selectivity & Consistency Scoring | Compute per-neuron selectivity (activation difference) and consistency across positive examples |
| 5. Optional Causal Ablation | Zero out top neurons and measure drop in model confidence to validate behavioral impact |
| 6. Clustering & Labeling | Group neurons into concept families and optionally generate plain-English descriptions per neuron |
| 7. Report Generation | Produce self-contained HTML report with heatmaps, scatter plots, Sankey diagrams, and neuron cards |
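The hook-based capture in step 3 can be sketched with PyTorch forward hooks. The two-layer toy model below stands in for a transformer's MLP blocks; it is a minimal illustration of the pattern, not NeuroLens's actual loader code:

```python
import torch
import torch.nn as nn

# Stand-in "model": two MLP blocks playing the role of transformer FFN layers.
# (NeuroLens hooks real GPT-2/BERT MLP sub-modules; the shapes here are illustrative.)
model = nn.Sequential(
    nn.Sequential(nn.Linear(8, 16), nn.GELU()),  # "layer 0" MLP
    nn.Sequential(nn.Linear(16, 8), nn.GELU()),  # "layer 1" MLP
)

activations = {}  # layer index -> captured activation tensor

def make_hook(layer_idx):
    def hook(module, inputs, output):
        # Detach so captured activations don't keep the autograd graph alive.
        activations[layer_idx] = output.detach()
    return hook

handles = [layer.register_forward_hook(make_hook(i)) for i, layer in enumerate(model)]

probe_batch = torch.randn(4, 8)  # 4 probe examples, 8 features each
with torch.no_grad():
    model(probe_batch)

for h in handles:
    h.remove()  # always unhook when done

print(tuple(activations[0].shape))  # (4, 16): one activation vector per probe example
```

The same pattern scales to a real model by iterating over its MLP sub-modules (step 3's "Activation Hooks") instead of this toy Sequential.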
Repository & Artifacts
Generated Artifacts:
- report.html: self-contained interactive report with heatmaps, scatter plots, Sankey diagrams, and per-neuron passport cards
- ablation_log.json: raw causal ablation results, written when causal validation is enabled
Notes:
- Custom probe datasets are supported (text or vision, with positive/negative example splits)
- The pipeline runs fully offline after the initial model download from Hugging Face or torchvision
Technical Details
- Supported Models: GPT-2, GPT-2-medium, BERT-base-uncased, RoBERTa-base, ResNet-18, ResNet-50
- Modalities: Text and vision (default: text)
- Metrics: Selectivity score [-1, +1], Consistency score [0, 1], Layer importance (correlation with output logits)
- Causal Ablation: Optional zero-out of neurons with reported confidence drop
- Layer Detection: Automatic for GPT-2 (.h[i]) and BERT (.encoder.layer[i])
- Stack: Python, PyTorch, and Hugging Face Transformers; runs fully offline after initial model download
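The exact formulas behind the selectivity and consistency metrics are not spelled out above, so the definitions below are plausible minimal versions (a normalized activation difference bounded in [-1, +1], and a threshold-crossing rate bounded in [0, 1]), not the repo's confirmed math:

```python
from statistics import mean

EPS = 1e-8  # guards against division by zero for dead neurons

def selectivity(pos_acts, neg_acts):
    """Normalized mean-activation difference, bounded in (-1, +1).
    Assumed formula; the tool's exact definition may differ."""
    mu_p, mu_n = mean(pos_acts), mean(neg_acts)
    return (mu_p - mu_n) / (abs(mu_p) + abs(mu_n) + EPS)

def consistency(pos_acts, neg_acts):
    """Fraction of positive examples whose activation exceeds the overall
    mean activation -- a [0, 1] proxy for 'fires reliably on the concept'."""
    thresh = mean(pos_acts + neg_acts)
    return sum(a > thresh for a in pos_acts) / len(pos_acts)

# A neuron that fires on 'royalty' probes but stays quiet on controls:
pos = [2.1, 1.9, 2.4, 2.0]   # activations on concept examples
neg = [0.1, 0.0, 0.2, 0.1]   # activations on non-concept examples
print(round(selectivity(pos, neg), 2))  # 0.91 -> strongly selective
print(consistency(pos, neg))            # 1.0  -> fires on every positive example
```

A neuron scoring high on both metrics is a candidate concept neuron; causal ablation (next section) then checks whether it actually drives behavior.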
Core modules
| Module | Role |
|---|---|
| probe_builder.py | Generates balanced concept / non-concept datasets |
| loader.py | Registers activation hooks on MLP/FFN layers |
| activation_scorer.py | Computes selectivity, consistency, and importance metrics |
| concept_labeler.py | Auto-labels neurons and clusters them into concept families |
| causal_ablator.py | Zeros out neurons and measures confidence impact |
| visualizer.py | Heatmaps, scatter plots, Sankey diagrams |
| report_generator.py | Assembles self-contained HTML reports |
| cli.py | Command-line interface |
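The zero-out intervention behind causal_ablator.py can be approximated with a forward hook that clamps selected neurons to zero and compares model confidence before and after. The tiny classifier below is a stand-in for a real transformer layer, not the tool's actual code:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny stand-in classifier; in NeuroLens the hook targets a real MLP/FFN layer.
hidden = nn.Sequential(nn.Linear(8, 16), nn.ReLU())
head = nn.Linear(16, 2)

def ablate(neuron_ids):
    """Register a hook that zeros the given neurons in the hidden layer's output."""
    def hook(module, inputs, output):
        output[:, neuron_ids] = 0.0  # zero out the target neurons
        return output
    return hidden.register_forward_hook(hook)

x = torch.randn(1, 8)  # one probe example

def confidence():
    with torch.no_grad():
        return torch.softmax(head(hidden(x)), dim=-1).max().item()

baseline = confidence()
handle = ablate([0, 3, 7])          # e.g. the top-scoring neurons from the ranking step
drop = baseline - confidence()      # positive drop = the neurons mattered
handle.remove()
print(f"confidence drop: {drop:+.3f}")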
Results: Validated Concept Neurons
| Concept | Model | Finding |
|---|---|---|
| Royalty | GPT-2 | L5–L7 neurons, selectivity > 0.7 for king/queen/crown |
| Negation | GPT-2 | L3–L5 neurons firing on not/never/didn’t |
| Past tense | GPT-2 | L2–L4 neurons correlating with -ed suffix |
| Dog | ResNet-18 | Conv4/5 neurons for fur texture and ear shape (GradCAM) |
| Sentiment | GPT-2 | L8–L10 neurons with >15% causal ablation drop |
Best Practices & Lessons Learned
- Start with built-in concepts (e.g., royalty, negation, past tense) to validate the pipeline before defining custom probes
- Use the explainability features to get human-readable neuron descriptions and spot-check findings
- Turn on causal validation when you need to confirm that a neuron actually drives model behavior, not just correlates with it
- Focus on specific layers when you already know which layers matter—it speeds up runs and keeps analysis tight
- Compare across architectures (e.g., same concept on GPT-2 vs. BERT) to see how different models encode the same idea
- Custom probes (text or vision, with positive/negative splits) let you target domain-specific concepts
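A custom probe is essentially a balanced set of positive and negative examples. A sketch of what one might look like follows; the schema, concept name, and file path are assumptions for illustration, not the repo's documented format:

```python
import json

# Hypothetical probe-dataset layout: the repo's exact schema is not shown,
# but any custom probe needs balanced positive/negative example splits.
probe = {
    "concept": "medical_negation",   # illustrative concept name
    "modality": "text",
    "positive": [
        "The scan showed no sign of a fracture.",
        "The patient denies any chest pain.",
    ],
    "negative": [
        "The scan showed a hairline fracture.",
        "The patient reports mild chest pain.",
    ],
}

# Balanced splits keep selectivity scores comparable across concepts.
assert len(probe["positive"]) == len(probe["negative"])
serialized = json.dumps(probe, indent=2)
```

Keeping the splits minimally different (same sentence structure, only the concept flipped) makes the resulting selectivity scores easier to trust.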
Next Steps
- Add support for more architectures (LLaMA, Mistral, ViT) with automatic layer detection
- Export neuron rankings and metrics to CSV/JSON for downstream analysis and dashboards
- Integrate with training pipelines to monitor concept drift across checkpoints
- Build a small UI for browsing reports and comparing runs side by side
- Extend causal ablation to multi-neuron interventions and minimal sufficient sets
- Publish a small benchmark of concept-neuron findings for reproducibility