How do large language models integrate concepts, words, and patterns? Existing theory assumes they store a large dictionary of concepts and do simple computation with it, combining dictionary entries by linear addition. I show multiple kinds of evidence that LLMs also store and use complex integrations (non-linear relationships) between concepts, words, and patterns, and I reveal this ability with a relatively simple mechanistic interpretability method. This kind of non-linear integration has so far been overlooked in the mechanistic interpretability field.
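To make the contrast concrete, here is a minimal toy sketch (my illustration, not the paper's method): under the "dictionary plus linear addition" view, a mixed representation is just a weighted sum of concept directions, so a linear fit over the dictionary explains it fully. If the model also encodes a non-linear interaction between concepts, that fit leaves an unexplained residual. The concept vectors and the tanh interaction term below are hypothetical stand-ins chosen only to illustrate the distinction.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64

# Hypothetical "dictionary" of concept directions, as the linear view assumes.
concept_A = rng.standard_normal(d_model)
concept_B = rng.standard_normal(d_model)

# Linear view: a representation mixing two concepts is just their weighted sum.
linear_repr = 0.7 * concept_A + 0.3 * concept_B

# Non-linear integration (illustrative only): the combined representation also
# carries an interaction term that no weighted sum of dictionary entries produces.
interaction = np.tanh(concept_A * concept_B)  # elementwise product, squashed
nonlinear_repr = 0.7 * concept_A + 0.3 * concept_B + interaction

# A purely linear probe (least squares over the dictionary) recovers the linear
# part but leaves the interaction unexplained.
dictionary = np.stack([concept_A, concept_B], axis=1)  # shape (d_model, 2)
coeffs, *_ = np.linalg.lstsq(dictionary, nonlinear_repr, rcond=None)
residual = nonlinear_repr - dictionary @ coeffs

print("recovered linear coefficients:", np.round(coeffs, 2))
print("unexplained (non-linear) residual norm:", np.round(np.linalg.norm(residual), 2))
```

In this toy setup the residual norm is clearly non-zero, which is the signature a purely linear dictionary account would miss.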
Link to blog on project: https://omarclaflin.com/llm-interpretability-project-dual-encoding-in-neural-network-representations/
https://arxiv.org/abs/2507.00269
[I’ll add AAAI link here later]
Quick presentation for AAAI 2026 conference
(Edit: YouTube video link added)
Poster:
AAAI 2026 Website link: