Representation engineering offers an alternative to mechanistic decomposition

active tested methodological positive partially_falsifiable

v1 · Representation engineering claim

summary

Rather than decomposing models into interpretable parts, representation engineering identifies and manipulates high-level concepts directly in activation space, offering control without requiring full mechanistic understanding.

Zou et al. (2023) showed that concepts such as honesty, power-seeking, and harmfulness correspond to identifiable directions in activation space that can be read (to monitor the concept) and written (to steer behavior). For the purpose of control, this bypasses the decomposition problem: you do not need to understand every neuron to steer the model, only to find the right direction.
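The read/write idea can be illustrated with a minimal sketch. It uses synthetic vectors in place of real model activations and a difference-of-means direction, which is a simple stand-in and not necessarily the procedure used by Zou et al. The hidden size, the planted concept axis, and the function names read_direction, read_score, and steer are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # hypothetical hidden size, chosen only for this sketch

# Synthetic stand-in for hidden states: a planted concept axis plus noise.
# Real use would collect activations from contrastive prompts instead.
true_dir = np.zeros(d)
true_dir[0] = 1.0
pos = rng.normal(size=(200, d)) * 0.5 + 2.0 * true_dir  # concept present
neg = rng.normal(size=(200, d)) * 0.5 - 2.0 * true_dir  # concept absent


def read_direction(pos_acts, neg_acts):
    """Estimate a unit concept direction as the difference of class means."""
    diff = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)


def read_score(acts, direction):
    """Read: project each activation onto the concept direction."""
    return acts @ direction


def steer(acts, direction, alpha):
    """Write: shift each activation along the direction by alpha."""
    return acts + alpha * direction


direction = read_direction(pos, neg)
steered = steer(neg, direction, alpha=4.0)

print("cosine with planted axis:", float(direction @ true_dir))
print("mean score before steering:", float(read_score(neg, direction).mean()))
print("mean score after steering:", float(read_score(steered, direction).mean()))
```

Because the direction is unit-norm, steering by alpha raises every projection score by exactly alpha. Nothing in the sketch inspects individual neurons, which is the sense in which this style of control does not depend on a mechanistic decomposition.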

trust profile

dimensions

evid 68%

repl 35%

cons 52%

meth 72%

cred 75%

scop 60%

brdg 45%

cont 28%

derived scores

supp 68%

fron 55%

stab 50%

claim_support_vector v1.0 · 2026-03-09 19:58 UTC

evidence 2

↑ supporting 1

supports · artifact

Representation Engineering (Zou et al., 2023)

Zou et al. show that representation engineering can steer model behavior without mechanistic decomposition.

Representation Engineering (Zou et al., 2023) · 85% — Representation reading and writing methodology

• asserting 1

asserts · artifact

Representation Engineering (Zou et al., 2023)

Direct assertion from the representation engineering paper.

attestations

Curator (Human) verifies 0.8

Representation engineering claim is well-supported by the Zou et al. paper.

domains

Mechanistic Interpretability 100%

view status

Strict Empirical included

computation trace

{
  "note": "Promising alternative approach, moderate replication",
  "inputs": {
    "artifacts": 1,
    "attestations": 1,
    "disputing_edges": 0,
    "supporting_edges": 1
  }
}