SAE features scale predictably with model size

active replicated empirical positive falsifiable
v1 · From Anthropic, Scaling Monosemanticity (2024)

summary
The number and quality of interpretable features extracted by sparse autoencoders (SAEs) increase predictably as model size grows, suggesting that feature-level interpretability is not an artifact of small-scale experiments.
Anthropic (2024) demonstrated that SAE features found in small models have analogues in larger models, and that feature-quality metrics improve with scale. This result has been independently confirmed across multiple model families.
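To make "features extracted by sparse autoencoders" concrete, here is a minimal NumPy sketch of a ReLU sparse autoencoder of the general kind used in this line of work. It assumes a common formulation: an encoder maps a model activation to non-negative feature activations, a decoder reconstructs the activation, and the loss adds an L1 sparsity penalty weighted by each feature's decoder-vector norm. The function names, initialization scheme, and `l1_coeff` value are hypothetical choices for illustration, not Anthropic's training code. `n_features` sets the dictionary size, i.e. how many candidate features the SAE can learn.

```python
import numpy as np

def init_sae(d_model, n_features, seed=0):
    """Hypothetical random initialization of SAE parameters."""
    rng = np.random.default_rng(seed)
    W_enc = rng.normal(0.0, 1.0 / np.sqrt(d_model), size=(d_model, n_features))
    b_enc = np.zeros(n_features)
    W_dec = rng.normal(0.0, 1.0 / np.sqrt(n_features), size=(n_features, d_model))
    b_dec = np.zeros(d_model)
    return W_enc, b_enc, W_dec, b_dec

def sae_forward(x, params):
    """Encode activations x (batch, d_model) into features, then reconstruct."""
    W_enc, b_enc, W_dec, b_dec = params
    f = np.maximum(0.0, x @ W_enc + b_enc)  # non-negative feature activations
    x_hat = f @ W_dec + b_dec               # reconstruction as a sum of decoder vectors
    return f, x_hat

def sae_loss(x, params, l1_coeff=5.0):
    """Squared reconstruction error plus an L1 penalty scaled by decoder-row norms."""
    W_dec = params[2]
    f, x_hat = sae_forward(x, params)
    mse = np.mean(np.sum((x - x_hat) ** 2, axis=-1))
    decoder_norms = np.linalg.norm(W_dec, axis=1)  # one norm per feature
    sparsity = np.mean(np.sum(f * decoder_norms, axis=-1))
    return mse + l1_coeff * sparsity
```

Counting `(f > 0).sum(axis=-1)` on the encoder output gives the number of active features per input, one simple measure of how sparse the learned representation is.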
trust profile
dimensions
evid 89% · repl 74% · cons 81% · meth 82%
cred 91% · scop 78% · brdg 28% · cont 8%
derived scores
supp 86% · fron 47% · stab 76%
claim_support_vector v1.0 · 2026-03-09 19:58 UTC
evidence 3
supporting 2
supports · artifact
Scaling Monosemanticity (Anthropic, 2024) · 91% · Feature scaling results
The 2024 scaling paper demonstrates that SAE features found in small models have analogues in larger models, with quality metrics that improve with scale.
supports · artifact
Toy Models of Superposition (Elhage et al., 2022) · 88% · Superposition emergence in toy models
The 2022 toy models paper shows that superposition and polysemanticity emerge predictably as a function of feature dimensionality and sparsity.
asserting 1
asserts · artifact
Scaling Monosemanticity (Anthropic, 2024)
Primary assertion from Anthropic (2024).
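The Toy Models of Superposition evidence concerns a small ReLU model in which more features than dimensions are stored. A minimal NumPy sketch of that setup follows, assuming the common formulation x' = ReLU(W^T W x + b), where W maps n features into m < n hidden dimensions, synthetic features are zero with probability `sparsity` and otherwise uniform on [0, 1], and the loss is an importance-weighted squared error. Function names and the geometric importance decay used in the check are illustrative assumptions; no training loop is included.

```python
import numpy as np

def sample_sparse_features(n_features, batch, sparsity, rng):
    """Each feature is zero with probability `sparsity`, else uniform in [0, 1)."""
    values = rng.uniform(0.0, 1.0, size=(batch, n_features))
    mask = rng.uniform(size=(batch, n_features)) >= sparsity
    return values * mask

def toy_forward(x, W, b):
    """Compute x' = ReLU(W^T W x + b) row-wise; W has shape (m_hidden, n_features)."""
    h = x @ W.T                       # compress n features into m hidden dimensions
    return np.maximum(0.0, h @ W + b) # project back out and apply ReLU

def toy_loss(x, x_hat, importance):
    """Importance-weighted squared reconstruction error, averaged over the batch."""
    return np.mean(np.sum(importance * (x - x_hat) ** 2, axis=-1))
```

Training W and b to minimize `toy_loss` at different `sparsity` values is the kind of experiment in which superposition appears once features become sparse enough.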
scope
holds_in Large-scale language models (>1B parameters) model_scale
neighborhood 2
supports Feature universality concept
attestations
Curator (Human) verifies 0.92
Claim wording accurately reflects the scaling monosemanticity findings.
Anthropic endorses 0.88
Anthropic endorses the scaling result.
domains
Mechanistic Interpretability 100%
view status
Strict Empirical included
computation trace
{
  "note": "Well-replicated across multiple papers and model scales",
  "inputs": {
    "artifacts": 2,
    "attestations": 2,
    "disputing_edges": 0,
    "supporting_edges": 2
  }
}