🧠 SAE Cross-Domain Analysis

Gemma 4B IT · Layer 22 · Residual Post · 65K SAE · L0 Medium
Metric: Pearson correlation · 8 cognitive domains · Adaptive dimension selection

Dimension Selection

Instead of a fixed top-K, each domain selects its own reliable SAE dimensions:

CV < 0.5: coefficient of variation across 120 questions (consistent activation, not sporadic)
mean > 10: meaningful activation magnitude (above noise floor)

Union of per-domain selections = dimensions for visualization.
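The selection rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not the project's actual code: the function names, the toy arrays, and the use of the population standard deviation (NumPy's default) for the CV are all assumptions.

```python
import numpy as np

def select_reliable_dims(acts, cv_max=0.5, mean_min=10.0):
    """Return indices of features that pass both thresholds for one domain.

    acts: array of shape (n_questions, n_features) for a single domain.
    """
    acts = np.asarray(acts, dtype=float)
    mean = acts.mean(axis=0)
    std = acts.std(axis=0)  # population std (ddof=0); an assumption
    # CV = std / mean, left at +inf wherever the mean is not positive.
    cv = np.full_like(mean, np.inf)
    np.divide(std, mean, out=cv, where=mean > 0)
    return np.flatnonzero((cv < cv_max) & (mean > mean_min))

def union_of_selections(domain_acts):
    """Union of the per-domain selections, as sorted unique indices."""
    selected = [select_reliable_dims(a) for a in domain_acts.values()]
    return np.unique(np.concatenate(selected))

# Toy data: 3 questions x 4 features per domain (hypothetical values).
toy = {
    "domain_a": np.array([[20, 0, 5, 30], [22, 0, 5, 0], [18, 0, 5, 60]]),
    "domain_b": np.array([[1, 50, 5, 12], [1, 55, 5, 12], [1, 45, 5, 12]]),
}
dims_a = select_reliable_dims(toy["domain_a"])
union_dims = union_of_selections(toy)
```

In the toy data, feature 2 is perfectly consistent but fails the mean threshold, and feature 3 of `domain_a` has a high mean but fails the CV threshold, so neither is selected for that domain.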

Silhouette score: measures domain separability in UMAP space
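For reference, the silhouette score can be computed directly from 2-D coordinates and domain labels. The sketch below is a plain NumPy version of the standard definition with Euclidean distance; the function name and the toy points are illustrative, and it is not necessarily the implementation used to produce the reported score.

```python
import numpy as np

def silhouette(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    # Pairwise distance matrix, shape (n_points, n_points).
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    unique = np.unique(labels)
    scores = np.zeros(len(points))
    for i in range(len(points)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own < 2:
            continue  # singleton cluster: score is defined as 0
        # a: mean distance to the other points in the same cluster.
        a = dists[i, own].sum() / (n_own - 1)
        # b: smallest mean distance to the points of any other cluster.
        b = min(dists[i, labels == other].mean()
                for other in unique if other != labels[i])
        scores[i] = (b - a) / max(a, b)
    return float(scores.mean())

# Toy 2-D embedding: two tight groups far apart (hypothetical coordinates).
embedding = [[0, 0], [0, 1], [10, 0], [10, 1]]
well_separated = silhouette(embedding, [0, 0, 1, 1])
badly_labelled = silhouette(embedding, [0, 1, 0, 1])
```

Scores near 1 indicate tight, well-separated domains; scores near 0 or below indicate overlapping or mislabelled groups.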

UMAP Projection

Each point = 1 question. Colored by domain. Hover for details.

Cross-Domain Pearson Correlation

Pearson correlation of domain mean activations on the union of reliable dimensions.
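A minimal sketch of this computation, assuming per-domain activation matrices and a precomputed index array for the union of reliable dimensions (both hypothetical names and toy values here):

```python
import numpy as np

def cross_domain_pearson(domain_acts, dims):
    """Pearson correlation matrix between domain mean activation vectors.

    domain_acts: dict mapping domain name -> (n_questions, n_features) array.
    dims: indices of the union of reliable dimensions.
    """
    names = list(domain_acts)
    # One mean vector per domain, restricted to the selected dimensions.
    means = np.stack([np.asarray(domain_acts[n], dtype=float)[:, dims].mean(axis=0)
                      for n in names])
    # np.corrcoef treats each row as one variable.
    return names, np.corrcoef(means)

# Toy data: four features, all of them "selected" (hypothetical values).
toy = {
    "x": np.array([[1, 2, 3, 4], [3, 4, 5, 6]]),
    "y": np.array([[4, 6, 8, 10]]),
    "z": np.array([[5, 4, 3, 2]]),
}
names, corr = cross_domain_pearson(toy, dims=np.arange(4))
```

The result is a symmetric matrix with ones on the diagonal; entry `corr[i, j]` is the correlation between the mean vectors of `names[i]` and `names[j]`.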

Key Findings

coding is the most distinctive domain. Its correlations with every other domain are the lowest in the matrix (0.23–0.44), and this holds across all metrics and datasets.

geography ↔ language shows strong alignment (0.84). These two domains share many high-activation SAE features. The result is cross-validated with multiple datasets: Pearson profile consistency ≥ 0.94.

symbolic ↔ math alignment (0.83) is also strong and robust across datasets (ReClor ↔ OpenBookQA profile r = 0.82).

math is heterogeneous. GSM8K (grade school arithmetic) and MMLU College Math show near-zero profile correlation (r ≈ 0.13), which indicates that they activate largely different SAE features. "Math" may not be a single cognitive domain.

spatial lacks a reliable second dataset. Only HellaSwag was available, and it tests commonsense/physical reasoning rather than true spatial navigation. Self-built navigation tasks were excluded per project rules.

sentiment has the most reliable dimensions (370) while coding has the fewest (55). This is itself a finding: sentiment tasks consistently activate many SAE features, whereas coding tasks rely on fewer but more specific ones.

Cross-Validation: Within-Domain Consistency

For domains with 2+ independent datasets: how similar is each dataset's "similarity profile" to other datasets in the same domain? A high value means the domain's position in the similarity space is stable regardless of which benchmark you use.

✅ > 0.7 consistent · ⚠️ 0.4–0.7 mixed · ❌ < 0.4 unstable
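One plausible reading of this check can be sketched as follows. It assumes that a dataset's "similarity profile" is the vector of Pearson correlations between that dataset's mean activation vector and a set of reference domain mean vectors, and that consistency is the Pearson correlation between two such profiles. The function names, the choice of references, and the toy vectors are assumptions, not details taken from the analysis.

```python
import numpy as np

def similarity_profile(dataset_mean, reference_means):
    """Pearson r of one dataset's mean vector against each reference vector."""
    return np.array([np.corrcoef(dataset_mean, ref)[0, 1]
                     for ref in reference_means])

def profile_consistency(mean_a, mean_b, reference_means):
    """Pearson r between the similarity profiles of two datasets."""
    pa = similarity_profile(mean_a, reference_means)
    pb = similarity_profile(mean_b, reference_means)
    return float(np.corrcoef(pa, pb)[0, 1])

def consistency_label(r):
    """Map a consistency value onto the three bands in the legend above."""
    if r > 0.7:
        return "consistent"
    if r >= 0.4:
        return "mixed"
    return "unstable"

# Toy reference vectors and two datasets from the same domain (hypothetical).
refs = np.array([[1, 2, 3, 4, 5], [5, 4, 3, 2, 1], [1, 3, 2, 5, 4]], dtype=float)
mean_a = np.array([1, 2, 3, 4, 6], dtype=float)
mean_b = 2 * mean_a + 1  # same activation pattern at a different scale
r_ab = profile_consistency(mean_a, mean_b, refs)
```

Because Pearson correlation is invariant to positive rescaling and shifting, `mean_a` and `mean_b` produce identical profiles here, so their consistency is 1 and falls in the "consistent" band.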

Methodology

SAE Activation Extraction

For each question in each dataset, we run Gemma 4B IT, hook into the Layer 22 residual stream, and apply the SAE encoder (65K features, L0 Medium). We then zero out activations that do not exceed the threshold (feature_acts × (feature_acts > threshold)) and take the mean across all tokens. This yields a 65,536-dimensional vector per question.
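The pooling step can be illustrated on its own, independently of the model and SAE. The sketch below assumes the encoder output for one question is already available as a (tokens × features) array; the function name, the threshold value, and the toy numbers are illustrative, and the threshold may be a scalar or a per-feature vector since either broadcasts.

```python
import numpy as np

def question_vector(feature_acts, threshold):
    """Threshold SAE activations, then average over tokens.

    feature_acts: (n_tokens, n_features) encoder outputs for one question.
    threshold: scalar or (n_features,) array; values <= threshold become 0.
    """
    feature_acts = np.asarray(feature_acts, dtype=float)
    gated = feature_acts * (feature_acts > threshold)
    # Zeroed entries still count toward the mean over all tokens.
    return gated.mean(axis=0)

# Toy example: 3 tokens, 2 features (the real vectors have 65,536 features).
toy_acts = [[0.5, 12.0],
            [3.0, 0.2],
            [0.0, 6.0]]
vec = question_vector(toy_acts, threshold=1.0)
```

With a threshold of 1.0, the first feature keeps only the value 3.0 and the second keeps 12.0 and 6.0, so the pooled vector is [1.0, 6.0].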

Dimension Selection (Adaptive)

Previous versions used a fixed top-K (e.g., top-200 active dimensions). This was problematic because:

Most distinguishing signal is concentrated in ~30 dimensions (48% of signal in top-10)
Different domains have different numbers of meaningful dimensions
Adding noise dimensions degrades UMAP structure (silhouette drops above 1000 dims)

Now, each domain independently selects dimensions where activation is consistent across its 120 questions (CV < 0.5, mean > 10). The same criterion applies to every domain, so the number of selected dimensions is an outcome rather than a fixed K. Coding gets 55 dims and sentiment gets 370, reflecting genuine cognitive structure.

Pearson as Primary Metric

Pearson correlation is sensitive to activation magnitude, not just rank order. If two domains both strongly activate the same SAE feature (e.g., dim 839: coding=187, geography=579), that feature pulls the Pearson correlation upward, which correctly reflects that they share a "brain region." Spearman operates on ranks and discards this magnitude information.
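The difference can be seen on a small constructed example. The vectors below are hypothetical, not measured activations: `b` and `c` have the same rank order, but only `b` shares a very large activation with `a` on the last feature. The Spearman helper is a simple rank-then-Pearson version without tie handling.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation between two 1-D vectors."""
    return float(np.corrcoef(x, y)[0, 1])

def spearman(x, y):
    """Spearman correlation as Pearson on ranks (assumes no tied values)."""
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return pearson(rank_x, rank_y)

a = np.array([1.0, 2.0, 3.0, 4.0, 500.0])
b = np.array([2.0, 1.0, 4.0, 3.0, 600.0])  # shares the strong last feature
c = np.array([2.0, 1.0, 4.0, 3.0, 5.0])    # same ranks as b, no strong feature

pearson_ab, pearson_ac = pearson(a, b), pearson(a, c)
spearman_ab, spearman_ac = spearman(a, b), spearman(a, c)
```

Spearman gives the same value for both pairs because the ranks are identical, while Pearson is much higher for the pair that shares the large activation.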

Datasets

Domain	Datasets	Benchmark type
coding	HumanEval, MBPP	Code generation benchmarks
math	MMLU College Math, GSM8K, MMLU HS Math	Math reasoning benchmarks
geography	MMLU Geography, MMLU HS Geography	Geography knowledge
history	MMLU US History, MMLU HS History	Historical knowledge
sentiment	DAIR-Emotion, IMDb, SST-2	Sentiment analysis
language	GLUE/CoLA, GLUE/RTE	Language understanding
symbolic	ReClor, OpenBookQA	Logical/scientific reasoning
spatial	HellaSwag	Commonsense/physical reasoning

All datasets are from public benchmarks. No self-built questions are included.