NeuroLens: Concept Neuron Discovery and Validation
Discover and validate which neurons in a neural network encode specific concepts—from “royalty” and “negation” to sentiment and tense—with automated probing, scoring, and causal ablation
Problem Statement
We asked NEO to build a tool that automatically finds, labels, and validates "concept neurons" in neural networks—the individual neurons that fire selectively for semantic concepts like "royalty," "negation," or "past tense." Manually identifying which neurons are responsible for which behaviors is tedious and error-prone; we wanted a repeatable pipeline that runs targeted probes, scores neurons by selectivity, and optionally confirms findings through causal ablation.
Solution Overview
NEO built NeuroLens, an interpretability tool that makes concept neurons visible and actionable:
- Targeted Probing: Run concept-specific probes against the model and capture activations layer by layer
- Neuron Scoring: Score each neuron on selectivity (how much it responds to the concept vs. non-concept examples), consistency, and layer importance
- Optional Causal Validation: Zero out top neurons and measure the drop in model confidence to confirm they actually drive behavior
- Multi-Model Support: Works across GPT-2 (small, medium), BERT, RoBERTa, and vision models (ResNet-18, ResNet-50)
The tool produces interactive HTML reports with heatmaps, scatter plots, Sankey diagrams, and per-neuron “passport” cards, so you can see exactly where and how a concept is encoded.
Workflow / Pipeline
| Step | Description |
|---|---|
| 1. Concept & Model Selection | Choose a concept (e.g., royalty, negation, past_tense) and model (GPT-2, BERT, RoBERTa, ResNet) |
| 2. Probe Construction | Build or load balanced positive/negative examples for the concept (built-in or custom probe dataset) |
| 3. Activation Hooks | Register hooks on MLP/FFN layers to capture neuron activations for each probe example |
| 4. Selectivity & Consistency Scoring | Compute per-neuron selectivity (activation difference) and consistency across positive examples |
| 5. Optional Causal Ablation | Zero out top neurons and measure drop in model confidence to validate behavioral impact |
| 6. Clustering & Labeling | Group neurons into concept families and optionally generate plain-English descriptions per neuron |
| 7. Report Generation | Produce self-contained HTML report with heatmaps, scatter plots, Sankey diagrams, and neuron cards |
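The hook-based capture in step 3 can be sketched with PyTorch forward hooks. The two-layer toy model below stands in for a transformer's MLP blocks; it is a minimal illustration of the pattern, not NeuroLens's actual loader code:

```python
import torch
import torch.nn as nn

# Stand-in "model": two MLP blocks playing the role of transformer FFN layers.
# (NeuroLens hooks real GPT-2/BERT MLP sub-modules; the shapes here are illustrative.)
model = nn.Sequential(
    nn.Sequential(nn.Linear(8, 16), nn.GELU()),  # "layer 0" MLP
    nn.Sequential(nn.Linear(16, 8), nn.GELU()),  # "layer 1" MLP
)

activations = {}  # layer index -> captured activation tensor

def make_hook(layer_idx):
    def hook(module, inputs, output):
        # Detach so captured activations don't keep the autograd graph alive.
        activations[layer_idx] = output.detach()
    return hook

handles = [layer.register_forward_hook(make_hook(i)) for i, layer in enumerate(model)]

probe_batch = torch.randn(4, 8)  # 4 probe examples, 8 features each
with torch.no_grad():
    model(probe_batch)

for h in handles:
    h.remove()  # always unhook when done

print(tuple(activations[0].shape))  # (4, 16): one activation vector per probe example
```

The same pattern scales to a real model by iterating over its MLP sub-modules (step 3's "Activation Hooks") instead of this toy Sequential.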
Repository & Artifacts
Generated Artifacts:
- report.html: self-contained interactive report with heatmaps, scatter plots, Sankey diagrams, and per-neuron passport cards
- ablation_log.json: raw causal ablation results, written when causal validation is enabled
Notes:
- Custom probe datasets are supported (text or vision, with positive/negative example splits)
- The pipeline runs fully offline after the initial model download from Hugging Face or torchvision
Technical Details
- Supported Models: GPT-2, GPT-2-medium, BERT-base-uncased, RoBERTa-base, ResNet-18, ResNet-50
- Modalities: Text and vision (default: text)
- Metrics: Selectivity score [-1, +1], Consistency score [0, 1], Layer importance (correlation with output logits)
- Causal Ablation: Optional zero-out of neurons with reported confidence drop
- Layer Detection: Automatic for GPT-2 (.h[i]) and BERT (.encoder.layer[i])
- Stack: Python, PyTorch, and Hugging Face Transformers; runs fully offline after initial model download
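The exact formulas behind the selectivity and consistency metrics are not spelled out above, so the definitions below are plausible minimal versions (a normalized activation difference bounded in [-1, +1], and a threshold-crossing rate bounded in [0, 1]), not the repo's confirmed math:

```python
from statistics import mean

EPS = 1e-8  # guards against division by zero for dead neurons

def selectivity(pos_acts, neg_acts):
    """Normalized mean-activation difference, bounded in (-1, +1).
    Assumed formula; the tool's exact definition may differ."""
    mu_p, mu_n = mean(pos_acts), mean(neg_acts)
    return (mu_p - mu_n) / (abs(mu_p) + abs(mu_n) + EPS)

def consistency(pos_acts, neg_acts):
    """Fraction of positive examples whose activation exceeds the overall
    mean activation -- a [0, 1] proxy for 'fires reliably on the concept'."""
    thresh = mean(pos_acts + neg_acts)
    return sum(a > thresh for a in pos_acts) / len(pos_acts)

# A neuron that fires on 'royalty' probes but stays quiet on controls:
pos = [2.1, 1.9, 2.4, 2.0]   # activations on concept examples
neg = [0.1, 0.0, 0.2, 0.1]   # activations on non-concept examples
print(round(selectivity(pos, neg), 2))  # 0.91 -> strongly selective
print(consistency(pos, neg))            # 1.0  -> fires on every positive example
```

A neuron scoring high on both metrics is a candidate concept neuron; causal ablation (next section) then checks whether it actually drives behavior.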
Core modules
| Module | Role |
|---|---|
| probe_builder.py | Generates balanced concept / non-concept datasets |
| loader.py | Registers activation hooks on MLP/FFN layers |
| activation_scorer.py | Computes selectivity, consistency, and importance metrics |
| concept_labeler.py | Auto-labels neurons and clusters them into concept families |
| causal_ablator.py | Zeros out neurons and measures confidence impact |
| visualizer.py | Heatmaps, scatter plots, Sankey diagrams |
| report_generator.py | Assembles self-contained HTML reports |
| cli.py | Command-line interface |
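The zero-out intervention behind causal_ablator.py can be approximated with a forward hook that clamps selected neurons to zero and compares model confidence before and after. The tiny classifier below is a stand-in for a real transformer layer, not the tool's actual code:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny stand-in classifier; in NeuroLens the hook targets a real MLP/FFN layer.
hidden = nn.Sequential(nn.Linear(8, 16), nn.ReLU())
head = nn.Linear(16, 2)

def ablate(neuron_ids):
    """Register a hook that zeros the given neurons in the hidden layer's output."""
    def hook(module, inputs, output):
        output[:, neuron_ids] = 0.0  # zero out the target neurons
        return output
    return hidden.register_forward_hook(hook)

x = torch.randn(1, 8)  # one probe example

def confidence():
    with torch.no_grad():
        return torch.softmax(head(hidden(x)), dim=-1).max().item()

baseline = confidence()
handle = ablate([0, 3, 7])          # e.g. the top-scoring neurons from the ranking step
drop = baseline - confidence()      # positive drop = the neurons mattered
handle.remove()
print(f"confidence drop: {drop:+.3f}")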
Results: Validated Concept Neurons
| Concept | Model | Finding |
|---|---|---|
| Royalty | GPT-2 | L5–L7 neurons, selectivity > 0.7 for king/queen/crown |
| Negation | GPT-2 | L3–L5 neurons firing on not/never/didn’t |
| Past tense | GPT-2 | L2–L4 neurons correlating with -ed suffix |
| Dog | ResNet-18 | Conv4/5 neurons for fur texture and ear shape (GradCAM) |
| Sentiment | GPT-2 | L8–L10 neurons with >15% causal ablation drop |
Best Practices & Lessons Learned
- Start with built-in concepts (e.g., royalty, negation, past tense) to validate the pipeline before defining custom probes
- Use the explainability features to get human-readable neuron descriptions and spot-check findings
- Turn on causal validation when you need to confirm that a neuron actually drives model behavior, not just correlates with it
- Focus on specific layers when you already know which layers matter—it speeds up runs and keeps analysis tight
- Compare across architectures (e.g., same concept on GPT-2 vs. BERT) to see how different models encode the same idea
- Custom probes (text or vision, with positive/negative splits) let you target domain-specific concepts
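A custom probe is essentially a balanced set of positive and negative examples. A sketch of what one might look like follows; the schema, concept name, and file path are assumptions for illustration, not the repo's documented format:

```python
import json

# Hypothetical probe-dataset layout: the repo's exact schema is not shown,
# but any custom probe needs balanced positive/negative example splits.
probe = {
    "concept": "medical_negation",   # illustrative concept name
    "modality": "text",
    "positive": [
        "The scan showed no sign of a fracture.",
        "The patient denies any chest pain.",
    ],
    "negative": [
        "The scan showed a hairline fracture.",
        "The patient reports mild chest pain.",
    ],
}

# Balanced splits keep selectivity scores comparable across concepts.
assert len(probe["positive"]) == len(probe["negative"])
serialized = json.dumps(probe, indent=2)
```

Keeping the splits minimally different (same sentence structure, only the concept flipped) makes the resulting selectivity scores easier to trust.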
Next Steps
- Add support for more architectures (LLaMA, Mistral, ViT) with automatic layer detection
- Export neuron rankings and metrics to CSV/JSON for downstream analysis and dashboards
- Integrate with training pipelines to monitor concept drift across checkpoints
- Build a small UI for browsing reports and comparing runs side by side
- Extend causal ablation to multi-neuron interventions and minimal sufficient sets
- Publish a small benchmark of concept-neuron findings for reproducibility