🧠 SAE Cross-Domain Analysis

Gemma 4B IT · Layer 22 · Residual Post · 65K SAE · L0 Medium
Metric: Pearson correlation · 8 cognitive domains · Adaptive dimension selection

Dimension Selection

Instead of a fixed top-K, each domain selects its own reliable SAE dimensions: those whose activation is consistent across the domain's questions (CV < 0.5, mean > 10).

Union of per-domain selections = dimensions for visualization.

Silhouette score: (domain separability in UMAP space)
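As a rough illustration of what a silhouette score measures, the sketch below computes it for 2-D points with domain labels. It assumes Euclidean distance on the UMAP coordinates and at least two distinct labels; the report does not say which implementation or distance was used, and the toy points are invented for the example.

```python
from math import dist
from statistics import mean

def silhouette_score(points, labels):
    """Mean silhouette over all points: s = (b - a) / max(a, b).

    a = mean distance to other points with the same label,
    b = smallest mean distance to the points of any other label.
    Assumes at least two distinct labels are present.
    """
    scores = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        dists_by_label = {}
        for j, (q, other) in enumerate(zip(points, labels)):
            if j != i:
                dists_by_label.setdefault(other, []).append(dist(p, q))
        if lab not in dists_by_label:
            scores.append(0.0)  # common convention for a single-point cluster
            continue
        a = mean(dists_by_label[lab])
        b = min(mean(d) for name, d in dists_by_label.items() if name != lab)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return mean(scores)

# Toy 2-D coordinates (invented): two tight, well-separated groups.
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
toy_labels = ["coding", "coding", "sentiment", "sentiment"]
toy_score = silhouette_score(toy_points, toy_labels)
```

Values near 1 mean points sit much closer to their own domain than to any other; values near 0 mean the domains overlap.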

UMAP Projection

Each point = 1 question. Colored by domain.

Cross-Domain Pearson Correlation

Pearson correlation of domain mean activations on the union of reliable dimensions.
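A minimal sketch of this computation, assuming each domain is given as a list of per-question activation vectors and `dims` is the union of reliable dimensions. The function names and toy numbers are hypothetical, and plain Python is used only to keep the example self-contained.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def domain_mean(vectors, dims):
    """Mean activation over a domain's questions, restricted to dims."""
    n = len(vectors)
    return [sum(v[j] for v in vectors) / n for j in dims]

def correlation_matrix(per_domain_vectors, dims):
    """Return (domain names, matrix of pairwise Pearson r between domain means)."""
    names = list(per_domain_vectors)
    means = {d: domain_mean(per_domain_vectors[d], dims) for d in names}
    matrix = [[pearson(means[a], means[b]) for b in names] for a in names]
    return names, matrix

# Toy data (invented): 2 questions per domain, 4 features, dims 0-2 selected.
toy = {
    "coding":    [[20, 1, 3, 0], [22, 3, 5, 0]],
    "geography": [[5, 30, 28, 9], [7, 34, 32, 9]],
    "language":  [[4, 28, 30, 9], [6, 30, 34, 9]],
}
names, matrix = correlation_matrix(toy, dims=[0, 1, 2])
```

In this toy case the geography and language means follow the same pattern across the selected dimensions, so their entry is close to 1, while coding correlates negatively with both.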

Key Findings

coding is the most distinctive domain. Its correlations with every other domain are the lowest in the matrix (0.23–0.44). Robust across all metrics and datasets.
geography ↔ language shows strong alignment (0.84). These two domains share many high-activation SAE features. Cross-validated with multiple datasets — Pearson profile consistency ≥ 0.94.
symbolic ↔ math alignment (0.83) is also strong and robust across datasets (ReClor ↔ OpenBookQA profile r = 0.82).
math is heterogeneous. GSM8K (grade school arithmetic) and MMLU College Math show near-zero profile correlation (r ≈ 0.13). They activate fundamentally different SAE features. "Math" may not be a single cognitive domain.
spatial lacks a reliable second dataset. Only HellaSwag was available (commonsense/physical reasoning, not true spatial navigation). Self-built navigation tasks were excluded per project rules.
sentiment has the most reliable dimensions (370) while coding has the fewest (55). This itself is a finding — sentiment tasks consistently activate many SAE features, coding tasks rely on fewer but more specific ones.

Cross-Validation: Within-Domain Consistency

For domains with 2+ independent datasets: how similar is each dataset's "similarity profile" to other datasets in the same domain? A high value means the domain's position in the similarity space is stable regardless of which benchmark you use.

✅ > 0.7 consistent · ⚠️ 0.4–0.7 mixed · ❌ < 0.4 unstable
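One way to make this concrete is sketched below. It assumes a dataset's "similarity profile" is the vector of its Pearson correlations with each of the other domains, and that consistency is the mean pairwise Pearson correlation between the profiles of datasets in the same domain. That definition is an assumption for illustration; the profile values are made-up numbers, not results from this analysis.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def profile_consistency(profiles):
    """Mean pairwise Pearson r between the datasets' similarity profiles."""
    pairs = list(combinations(profiles.values(), 2))
    return sum(pearson(p, q) for p, q in pairs) / len(pairs)

def consistency_label(r):
    """Map a consistency value onto the three bands of the legend."""
    if r > 0.7:
        return "consistent"
    if r >= 0.4:
        return "mixed"
    return "unstable"

# Hypothetical profiles: correlation of each dataset with four other domains.
toy_profiles = {
    "dataset_a": [0.30, 0.80, 0.60, 0.50],
    "dataset_b": [0.25, 0.85, 0.55, 0.60],
}
r = profile_consistency(toy_profiles)
band = consistency_label(r)
```

With more than two datasets in a domain, averaging over all pairs is one choice among several; reporting the minimum pairwise value would be a stricter alternative.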

Methodology

SAE Activation Extraction

For each question in each dataset, we run Gemma 4B IT, hook into the Layer 22 residual stream, apply the SAE encoder (65K features, L0 Medium), threshold the activations (feature_acts × (feature_acts > threshold)), and take the mean across all tokens. This yields a 65,536-dimensional vector per question.
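The pooling step can be sketched as follows. This covers only the thresholding and token averaging, not the model hook or the SAE encoder. It assumes a single scalar threshold (the report does not say whether the threshold is per-feature), and `pool_question` is a hypothetical name.

```python
def pool_question(feature_acts, threshold):
    """Mean over tokens of thresholded SAE activations.

    feature_acts: per-token activation lists, shape [n_tokens][n_features].
    Computes mean_t(a * (a > threshold)) for every feature.
    """
    n_tokens = len(feature_acts)
    n_features = len(feature_acts[0])
    pooled = [0.0] * n_features
    for token_acts in feature_acts:
        for j, a in enumerate(token_acts):
            if a > threshold:  # values at or below the threshold count as 0
                pooled[j] += a
    # Divide by all tokens, including those where the feature was zeroed.
    return [s / n_tokens for s in pooled]

# Toy example: 3 tokens, 4 features (the real vector has 65,536 entries).
acts = [
    [0.0, 2.0, 9.0, 0.5],
    [0.0, 4.0, 0.2, 0.5],
    [0.0, 6.0, 0.0, 0.5],
]
question_vec = pool_question(acts, threshold=1.0)
# question_vec == [0.0, 4.0, 3.0, 0.0]
```

Note that a feature firing strongly on one token (feature 2 here) still contributes to the mean, diluted by the number of tokens.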

Dimension Selection (Adaptive)

Previous versions used a fixed top-K (e.g., top-200 active dimensions). This was problematic because:

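Now each domain independently selects the dimensions whose activation is consistent across its 120 questions: coefficient of variation (CV, the standard deviation divided by the mean) below 0.5 and mean activation above 10. This is a single criterion applied uniformly to every domain, not a fixed K. Coding gets 55 dims and sentiment gets 370, which reflects genuine cognitive structure rather than an arbitrary cutoff.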
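A minimal sketch of this selection rule and of the union used for visualization. It assumes the population standard deviation for the CV (the report does not say whether population or sample is used); the function names and toy vectors are hypothetical.

```python
from statistics import mean, pstdev

def select_reliable_dims(domain_vectors, max_cv=0.5, min_mean=10.0):
    """Indices of features with mean > min_mean and CV = std / mean < max_cv.

    domain_vectors: one activation vector per question in the domain.
    """
    n_features = len(domain_vectors[0])
    selected = set()
    for j in range(n_features):
        column = [vec[j] for vec in domain_vectors]
        mu = mean(column)
        if mu <= min_mean:
            continue  # too weakly active (also avoids dividing by ~0)
        if pstdev(column) / mu < max_cv:
            selected.add(j)
    return selected

def union_of_selections(per_domain_vectors):
    """Sorted union of every domain's reliable dimensions."""
    union = set()
    for vectors in per_domain_vectors.values():
        union |= select_reliable_dims(vectors)
    return sorted(union)

# Toy data (invented): 3 questions per domain, 3 features.
toy = {
    "coding":    [[20.0, 0.0, 12.0], [22.0, 50.0, 1.0], [18.0, 0.0, 40.0]],
    "sentiment": [[1.0, 30.0, 15.0], [2.0, 33.0, 14.0], [0.0, 27.0, 16.0]],
}
coding_dims = select_reliable_dims(toy["coding"])        # {0}
sentiment_dims = select_reliable_dims(toy["sentiment"])  # {1, 2}
viz_dims = union_of_selections(toy)                      # [0, 1, 2]
```

Because the number of selected dimensions falls out of the data, domains can end up with very different counts, as with coding and sentiment above.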

Pearson as Primary Metric

Pearson correlation is sensitive to activation magnitude, not only to rank order. If two domains both strongly activate the same SAE feature (e.g., dim 839: coding=187, geography=579), that large shared value pulls the correlation up, which correctly reflects that they share a "brain region." Spearman operates on ranks alone and discards this magnitude information.
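The difference can be seen on a toy pair of profiles with identical rank order but very different magnitudes. This is an illustrative sketch with invented numbers; the simple `ranks` helper assumes there are no tied values.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def ranks(x):
    """1-based ranks of the values in x (assumes no ties)."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    result = [0.0] * len(x)
    for rank, i in enumerate(order, start=1):
        result[i] = float(rank)
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Same ordering of features, but profile_a has one very large activation.
profile_a = [1.0, 2.0, 3.0, 200.0]
profile_b = [10.0, 20.0, 30.0, 40.0]

r_pearson = pearson(profile_a, profile_b)
r_spearman = spearman(profile_a, profile_b)
```

Spearman sees the two profiles as perfectly aligned because their orderings match, while Pearson reports a lower value because the single dominant activation in `profile_a` has no counterpart of similar relative size in `profile_b`.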

Datasets

Domain | Datasets | Source
coding | HumanEval, MBPP | Code generation benchmarks
math | MMLU College Math, GSM8K, MMLU HS Math | Math reasoning benchmarks
geography | MMLU Geography, MMLU HS Geography | Geography knowledge
history | MMLU US History, MMLU HS History | Historical knowledge
sentiment | DAIR-Emotion, IMDb, SST-2 | Sentiment analysis
language | GLUE/CoLA, GLUE/RTE | Language understanding
symbolic | ReClor, OpenBookQA | Logical/scientific reasoning
spatial | HellaSwag | Commonsense/physical reasoning

All datasets are from public benchmarks. No self-built questions included.