The number and quality of interpretable features extracted by sparse autoencoders increase predictably as model size grows, suggesting that feature-level interpretability is not an artifact of small-scale experiments.
Anthropic (2024) demonstrated that SAE features found in small models have analogues in larger models, and that feature quality metrics improve with scale. This was independently confirmed across multiple model families.
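To make the object under discussion concrete, the following is a minimal sketch of what a sparse autoencoder computes over model activations: a linear encoder with a ReLU produces a sparse, overcomplete feature vector, and a linear decoder reconstructs the activations from it. The dimensions, initialization, and L1 coefficient here are illustrative assumptions, not values from any cited work.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: project d_model activations into an
# overcomplete dictionary of d_features sparse features.
d_model, d_features = 16, 64
W_enc = rng.normal(0, 0.1, (d_model, d_features))
b_enc = np.zeros(d_features)
W_dec = rng.normal(0, 0.1, (d_features, d_model))
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Encode activations into sparse features, then reconstruct."""
    f = np.maximum(x @ W_enc + b_enc, 0.0)  # ReLU yields non-negative codes
    x_hat = f @ W_dec + b_dec               # linear reconstruction
    return f, x_hat

x = rng.normal(size=(8, d_model))  # stand-in batch of model activations
f, x_hat = sae_forward(x)

# Training would minimize reconstruction error plus an L1 penalty
# that pushes most feature activations to exactly zero.
recon_loss = np.mean((x - x_hat) ** 2)
sparsity_loss = np.mean(np.abs(f))
loss = recon_loss + 1e-3 * sparsity_loss
```

In this framing, "feature quality" claims concern properties of the learned columns of `W_dec` (the feature directions) and how interpretably individual entries of `f` activate, which is what makes cross-scale comparisons of features meaningful.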