How do large language models integrate concepts, words, and patterns? Existing theory assumes they store a large dictionary of concepts and do simple computation with it, combining dictionary entries by linear addition. I show multiple kinds of evidence that LLMs also store and use complex integrations (non-linear relationships) between concepts, words, and patterns, and I reveal this ability with a relatively simple mechanistic interpretability method. This kind of non-linear integration has so far been overlooked in the mechanistic interpretability field.
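To make the contrast concrete, here is a minimal toy sketch (my illustration, not the paper's method): under the "dictionary plus linear addition" view, a mixed representation is just a weighted sum of concept directions, so a linear fit over the dictionary explains it fully. If the model also encodes a non-linear interaction between concepts, that fit leaves an unexplained residual. The concept vectors and the tanh interaction term below are hypothetical stand-ins chosen only to illustrate the distinction.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64

# Hypothetical "dictionary" of concept directions, as the linear view assumes.
concept_A = rng.standard_normal(d_model)
concept_B = rng.standard_normal(d_model)

# Linear view: a representation mixing two concepts is just their weighted sum.
linear_repr = 0.7 * concept_A + 0.3 * concept_B

# Non-linear integration (illustrative only): the combined representation also
# carries an interaction term that no weighted sum of dictionary entries produces.
interaction = np.tanh(concept_A * concept_B)  # elementwise product, squashed
nonlinear_repr = 0.7 * concept_A + 0.3 * concept_B + interaction

# A purely linear probe (least squares over the dictionary) recovers the linear
# part but leaves the interaction unexplained.
dictionary = np.stack([concept_A, concept_B], axis=1)  # shape (d_model, 2)
coeffs, *_ = np.linalg.lstsq(dictionary, nonlinear_repr, rcond=None)
residual = nonlinear_repr - dictionary @ coeffs

print("recovered linear coefficients:", np.round(coeffs, 2))
print("unexplained (non-linear) residual norm:", np.round(np.linalg.norm(residual), 2))
```

In this toy setup the residual norm is clearly non-zero, which is the signature a purely linear dictionary account would miss.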
Link to blog on project: https://omarclaflin.com/llm-interpretability-project-dual-encoding-in-neural-network-representations/
https://arxiv.org/abs/2507.00269
[I’ll add AAAI link here later]
Quick presentation for AAAI 2026 conference
(Edit: YouTube video link added)
Poster:
AAAI 2026 Website link: