
An MRI For LLMs

Anthropic's Natural Language Autoencoders are not mind-reading. They are a new interpretability instrument for translating internal model states into text.
Petko D. Petkov, on a break from CISO duties, building cbk.ai

Anthropic just released work on Natural Language Autoencoders, a way of turning Claude's internal activation states into readable text.

The headline framing is that Claude's thoughts can now be read. That is not quite right. This is not Claude confessing what it really thinks. It is a trained translator of sorts. The system takes an activation, turns it into a natural language explanation, then checks whether that explanation can reconstruct the original activation.

Activation, explanation, reconstructed activation. If the reconstruction works, the explanation probably captured something real.
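To make the loop concrete, here is a minimal sketch. The helpers explain_activation and reconstruct_from_text are hypothetical stubs standing in for trained models, and cosine similarity is just one plausible fidelity check; this is an illustration of the idea, not Anthropic's actual NLA implementation.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two activation vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def explain_activation(activation: np.ndarray) -> str:
    """Hypothetical decoder: activation -> natural language explanation.
    In the real system this is a trained model; here it is a stub."""
    return "placeholder explanation of what the activation encodes"

def reconstruct_from_text(explanation: str, dim: int) -> np.ndarray:
    """Hypothetical encoder: explanation -> point in activation space.
    Also a trained model in the real system; here just a deterministic stub."""
    rng = np.random.default_rng(abs(hash(explanation)) % (2**32))
    return rng.normal(size=dim)

# The loop: activation -> explanation -> reconstructed activation.
original = np.random.default_rng(0).normal(size=512)
explanation = explain_activation(original)
reconstruction = reconstruct_from_text(explanation, dim=original.shape[0])

# If the reconstruction lands close to the original, the explanation
# probably captured something real about the internal state.
fidelity = cosine_similarity(original, reconstruction)
print(f"explanation: {explanation!r}")
print(f"reconstruction fidelity: {fidelity:.3f}")
```

With stub functions the fidelity score is meaningless; the point is the shape of the check, where the explanation only counts if it carries enough information to rebuild the activation it describes.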

Anthropic reports that these explanations can surface internal content the model did not say out loud: planned rhymes, task strategies, training-data associations, evaluation awareness, and even hidden motivations in intentionally misaligned models. Pretty impressive, if true.

But this is not mind-reading. It is more like an MRI for LLMs with captions.

The captions can be wrong! Anthropic says so directly: NLAs can hallucinate and need corroboration. Still, the generated captions do not collapse into nonsense. They are coherent, relevant, and useful.

Perhaps this tells us that model internals are not just numerical mush; they appear to contain some structured representation. But a coherent caption is not automatically a truthful one. The text is evidence, not a confession.

The technology might be moving from hand-labeling features to building translators for model cognition, which would make audits far more scalable. That could be a significant step forward not just for AI safety but also for fine-tuning and alignment. It all depends, though, on how reliable the decoder is and whether its claims can be independently verified.