layman’s tl;dr: Scientists attempt to read the “thinking” of AI using a technique called autoencoders. I attempt to improve that tool, resulting in improved accuracy and nuance in the interpreted thoughts. Specifically, rather than just mind-reading (decoding) the words or concepts, I improve the tool so the relationships between the words or concepts can be read out as well, giving us greater access to their thoughts.
broad tl;dr: Typical LLM interpretability tools rely on autoencoders (e.g. SAEs), which use a linear decomposition technique to pull feature identities (e.g. “Golden Gate Bridge”, a plot twist, etc.) out of the neural activation space (e.g. an MLP layer of an LLM). While surprisingly effective, a host of problems still plague this approach and warp our understanding of how neural network spaces encode feature identities (a minimal sketch of the standard SAE decomposition follows the list below):
- non-orthogonality of feature identities, currently accepted as unavoidable ‘noise’ due to the compressed encoding space,
- polysemanticity [features responding to multiple concepts/words/etc]
- a missing ‘dark matter’ of computation: even beyond recent circuit-analysis approaches, it is still unknown where the computation of ‘thought’ is happening,
- “pathological” logit distributions (measured via KL divergence), revealing potentially overfit SAEs that haven’t captured the actual learned features,
- where are the ‘inhibition circuits’ of LLM thought?,
- is non-linear computation truly negligible?
- other observations and inconsistencies arising from the current point of view, which might be due to missing feature integration maps
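To make the baseline concrete, here is a minimal sketch of the kind of linear SAE decomposition referred to above, written in PyTorch. The dimensions, ReLU encoder, and L1 sparsity penalty are illustrative assumptions, not the exact configuration from the paper or repo:

```python
# Minimal sketch of a standard sparse autoencoder (SAE) over LLM activations.
# Dimensions, names, and the L1 penalty are illustrative assumptions.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 768, n_features: int = 8192):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)  # activations -> feature magnitudes
        self.decoder = nn.Linear(n_features, d_model)  # features -> reconstructed activations

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))  # sparse, non-negative feature activations
        x_hat = self.decoder(f)          # purely linear recombination of dictionary features
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    # Reconstruction error plus a sparsity penalty. The "feature identities" live
    # in the decoder columns; anything this linear recombination cannot express
    # ends up in the residual x - x_hat, which is where the problems above show up.
    recon = ((x - x_hat) ** 2).mean()
    sparsity = f.abs().mean()
    return recon + l1_coeff * sparsity
```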
specific tl;dr: Joint training of SAE and NFM components achieves a 41.3% reconstruction improvement and a 51.6% KL divergence reduction, while spontaneously developing a bimodal distribution of squared feature norms that validates the dual encoding hypothesis (with the more subtle, distributed features contributing more to the feature interactions). The architecture also demonstrates systematic behavioral effects through controlled intervention experiments.
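For intuition, here is a hedged sketch of what a joint SAE + NFM forward pass could look like, using standard NFM bi-interaction pooling over the SAE’s feature activations. The layer sizes, the interaction head, and the way the two reconstructions are summed are assumptions for illustration, not the paper’s exact code (see the repo for that):

```python
# Sketch of joint SAE + Neural Factorization Machine (NFM) training, where pairwise
# feature interactions help reconstruct what the linear SAE decode misses.
# All sizes and the combination scheme are assumptions, not the paper's exact code.
import torch
import torch.nn as nn

class JointSAENFM(nn.Module):
    def __init__(self, d_model: int = 768, n_features: int = 8192, k_embed: int = 64):
        super().__init__()
        # SAE part: linear dictionary decomposition (feature identities)
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)
        # NFM part: linear term + bi-interaction (pairwise) term over the SAE features
        self.nfm_linear = nn.Linear(n_features, d_model)
        self.feature_embed = nn.Parameter(torch.randn(n_features, k_embed) * 0.01)
        self.interaction_head = nn.Linear(k_embed, d_model)

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))        # sparse feature activations
        sae_recon = self.decoder(f)            # linear (identity-only) reconstruction
        # Bi-interaction pooling: sums all pairwise products f_i * f_j * (v_i ⊙ v_j)
        # in O(n_features * k_embed) instead of O(n_features^2).
        summed = f @ self.feature_embed                      # (batch, k_embed)
        sum_of_squares = (f ** 2) @ (self.feature_embed ** 2)
        bi_interaction = 0.5 * (summed ** 2 - sum_of_squares)
        nfm_recon = self.nfm_linear(f) + self.interaction_head(bi_interaction)
        return sae_recon + nfm_recon, f        # joint reconstruction of the activation x
```

Training both parts together on a reconstruction-plus-sparsity objective is what “joint training” refers to here; the KL divergence numbers in this literature typically come from substituting the reconstruction back into the LLM’s forward pass and comparing the resulting next-token distribution against the unmodified model.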
All Posts:
(6/14/25) First (idea) post: https://omarclaflin.com/2025/06/14/information-space-contains-computations-not-just-features/
(6/19/25) Using NFMs to explore non-linear interactions: https://omarclaflin.com/2025/06/19/updated-nfm-approach-methodology/
(6/23/25) Architecting a new, joint-training SAE + NFM: https://omarclaflin.com/2025/06/23/llm-intervention-experiments-with-integrated-features-part-3/
(6/25/25) Current problem in interpretability solved by this approach: https://omarclaflin.com/2025/06/25/kl-divergence-the-pathology-of-saes-partially-solved-by-feature-interactions/
(6/29/25) Dual encoding features seen as a bimodal distribution, final complete paper: https://omarclaflin.com/2025/06/29/joint-training-breakthrough-from-sequential-to-integrated-feature-learning/
paper (submitted, 6/30/25): https://arxiv.org/abs/2507.00269
[update 11/10/25, accepted into AAAI 2026, Main Technical Track]
github: https://github.com/omarclaflin/LLM_Intrepretability_Integration_Features_v2
(11/17/25) retrospective notes on using AI to accelerate this research project: https://omarclaflin.com/2025/11/17/ai-hype-post-how-llm-chatbots-helped-me-get-a-research-paper-accepted-at-a-very-competitive-conference-over-a-couple-weekends/
FAQ (Added 11/10/2025, from reviewers)
- Can you explain the difference between feature coactivation and feature integration?
- Definitions:
- Dictionary feature: concept identity (f = a1 + a2 + a3 + …); “surprise” detected
- Co-activation: statistical correlation (corr(f1, f2)); “surprise” + “unknown” co-occur
- Integration feature: computational combination (I(f1, f2, …), e.g. I = f1 + f2); “happy” + “birthday” → JOY
- Nonlinear integration feature: nonlinear computational combination (e.g. I = f1 * f2); “surprise” + “diagnosis” → DREAD
- In short: co-activation is just the same features being activated together by different inputs, while an integration feature is a computational combination of two co-occurring features (a toy numeric illustration follows below).
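A toy numeric illustration of the distinction (entirely made-up numbers, not data from the paper): co-activation is just a correlation between feature activations, whereas an integration feature, linear or nonlinear, is a new quantity computed from them:

```python
# Toy illustration (made-up data) of co-activation vs. integration features.
import numpy as np

rng = np.random.default_rng(0)

# Two dictionary features that tend to fire on the same inputs ("surprise", "unknown"):
f_surprise = rng.random(1000)
f_unknown = 0.8 * f_surprise + 0.2 * rng.random(1000)

# Co-activation: a purely statistical observation, corr(f1, f2)
coactivation = np.corrcoef(f_surprise, f_unknown)[0, 1]

# Integration features: quantities COMPUTED from the features
f_happy, f_birthday = rng.random(1000), rng.random(1000)
linear_integration = f_happy + f_birthday        # I = f1 + f2   ("happy" + "birthday" -> joy)
nonlinear_integration = f_surprise * f_unknown   # I = f1 * f2   (large only when BOTH fire)

print(f"co-activation corr(f1, f2): {coactivation:.2f}")
print(f"linear integration (mean): {linear_integration.mean():.2f}")
print(f"nonlinear integration (mean): {nonlinear_integration.mean():.2f}")
```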
- How does this differ from Engels’ work? Also, they suggest most reconstruction error is linear.
- Engels’ paper inferred types of error from indirect modelling; we explicitly resolved error by directly modelling it from specific hypothesized sources.
- BOTH of our NFM components (linear, non-linear) would be considered “nonlinear” error by their definition (they model SAE error from the activations, not SAE features)
- The last figure in their paper (unless I’m misunderstanding) suggests ~85% of error is NON-linear with a large enough SAE, strengthening our case that adding more dictionary features cannot explain the entire encoding of an LLM.
- Our paper focuses on the computation missing from SAE features, with methods that interpretably & accurately explain the behavioral output of the LLM
- More follow-up on the nonlinear component would be appreciated, given the suggestion that it has an outsized impact on reconstruction accuracy.
- A follow-up nonlinear component analysis was done (page 6), which led to the discovery of the Gram matrix phenomenon: a bimodal distribution in which the lowest-norm features contribute to the nonlinear interactions while the high-norm features contribute directly to the residual. We can try to clarify this more in the future. From initial analysis, the distribution of the high-norm Gram features roughly overlaps with a standard SAE’s distribution of Gram features; our joint architecture seems to reveal previously unseen lower-norm features that contribute directly to the feature integrations (see the sketch below).
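For concreteness, a hedged sketch of that squared-norm check: the squared norms are the diagonal of the Gram matrix of the decoder’s feature directions, and the bimodality shows up in their histogram. Attribute names and the plotting step are assumptions, not the repo’s exact analysis code:

```python
# Sketch of the Gram-matrix / squared-norm analysis: low-norm features are the ones
# hypothesized to feed the NFM interactions, high-norm features decode directly.
# Attribute names (joint_model.decoder) are assumptions.
import torch

def feature_squared_norms(decoder_weight: torch.Tensor) -> torch.Tensor:
    # decoder_weight: (d_model, n_features); each column is one feature's direction.
    gram = decoder_weight.T @ decoder_weight   # (n_features, n_features) Gram matrix
    return torch.diagonal(gram)                # squared norm of each feature direction

# Hypothetical usage with a trained joint model:
# sq = feature_squared_norms(joint_model.decoder.weight.detach())
# import matplotlib.pyplot as plt
# plt.hist(sq.cpu().numpy(), bins=100)         # bimodal: low-norm vs. high-norm modes
# plt.xlabel("squared feature norm"); plt.show()
```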
- Is monosemanticity, as a concept, related to this paper?
- Yep, semanticity is related. Semanticity, as currently defined, is an explanation focusing only on the trade-off between polysemanticity (noisy interference between features) and compression space; but as our paper shows, LLM encoding also considers computational relationships BETWEEN features during compression, which suggests polysemanticity might reflect badly decoded but useful LLM function.
- Have you tried this on industry-sized models? Also, can’t we just use a larger SAE to simply solve these problems?
- Larger, more sophisticated SAEs could be compared, but our approach can also be scaled up in size or combined with any SAE method. This paper was intended as a demonstration of fundamental phenomena. As mentioned in the discussion, NFMs do not scale linearly, so that is one challenge limiting this approach, but not an insurmountable one. Also, other techniques for modelling non-linear feature interactions may ultimately prove superior to NFMs in production. As an enthusiast with a single RTX 3090, the model size and datasets presented were what was reasonably possible for me to test. However, I put in the effort to establish the fundamental phenomena via reasonable comparisons, and tried to demonstrate their likely existence (1) statistically, (2) behaviorally, and (3) via computational metrics.