Rather than decomposing models into interpretable parts, representation engineering identifies and manipulates high-level concepts directly in activation space, offering control without requiring full mechanistic understanding.
Zou et al. (2023) showed that concepts like honesty, power-seeking, and harmfulness have identifiable directions in activation space that can be read and written. This bypasses the decomposition problem entirely: you do not need to understand every neuron to steer the model, only to find the right direction.
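To make this concrete, here is a minimal sketch of one common recipe: compute a concept direction as the difference of mean activations over contrastive prompt pairs, then inject it back into the residual stream with a forward hook. It assumes GPT-2 via Hugging Face transformers; the layer index, prompt pairs, and steering scale `ALPHA` are illustrative choices, not the actual setup of Zou et al. (2023), whose pipeline is more elaborate.

```python
# Sketch of difference-of-means representation reading and steering.
# Model, layer, prompts, and ALPHA are hypothetical illustrative choices.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; any decoder-only LM works the same way
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

LAYER = 6  # which block's output to read and write (illustrative)

# Contrastive prompt pairs meant to differ only in the target concept.
honest = ["Tell me truthfully: the sky is",
          "Honestly, the capital of France is"]
dishonest = ["Tell me a lie: the sky is",
             "Dishonestly, the capital of France is"]

def last_token_states(prompts):
    """Residual-stream activation at each prompt's final token."""
    states = []
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        # hidden_states[0] is the embeddings, so index LAYER + 1
        # is the output of transformer block LAYER.
        states.append(out.hidden_states[LAYER + 1][0, -1])
    return torch.stack(states)

# Reading: the concept direction is the difference of class means.
direction = last_token_states(honest).mean(0) - last_token_states(dishonest).mean(0)
direction = direction / direction.norm()

ALPHA = 8.0  # steering strength, tuned by hand in practice

def steer(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states;
    # adding the scaled direction writes the concept into the residual stream.
    return (output[0] + ALPHA * direction,) + output[1:]

# Writing: hook the same block whose output we read above.
handle = model.transformer.h[LAYER].register_forward_hook(steer)
ids = tok("Question: is the sky green? Answer:", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=20)[0]))
handle.remove()  # restore the unsteered model
```

In practice the layer and scale are swept empirically: too small a scale has no visible effect, while too large a scale degrades fluency before it changes behavior in the intended way.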