Category: Science Stuff

  • I’ve been trying to catch up on the field of interpretability these last few weeks in my free time, and have been going back over some materials that I either skimmed or simply referenced from another source. One such interesting paper/post was Gurnee, W., “SAE reconstruction errors are (empirically) pathological” (AI Alignment Forum, March…): https://www.alignmentforum.org/posts/rZPiuFxESMxCDHe4B/sae-reconstruction-errors-are-empirically-pathological

  • (This is a continuation of the previous posts. Methodology update: https://omarclaflin.com/2025/06/19/updated-nfm-approach-methodology/; intro to the idea: https://omarclaflin.com/2025/06/14/information-space-contains-computations-not-just-feature/) Paper: Feature Integration Beyond Sparse Coding: Evidence for Non-Linear Computation Spaces in Neural Networks. Background: LLMs are commonly decomposed via SAEs (or other encoders) into linearly separable features, which turn out to be surprisingly interpretable. This project aims to explore the non-linear…
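
  As a point of reference for the “linearly separable” decomposition mentioned above, here is a minimal sketch of a sparse autoencoder of the kind typically trained on LLM activations (a hypothetical, self-contained PyTorch module, not the specific encoder used in these posts): activations are encoded into a sparse, overcomplete feature code and linearly decoded back, with an L1 penalty encouraging sparsity.

  ```python
  import torch
  import torch.nn as nn

  class SparseAutoencoder(nn.Module):
      """Minimal SAE sketch: model activations -> sparse, overcomplete feature code -> reconstruction."""

      def __init__(self, d_model: int, d_features: int, l1_coef: float = 1e-3):
          super().__init__()
          self.enc = nn.Linear(d_model, d_features)   # d_features >> d_model (overcomplete dictionary)
          self.dec = nn.Linear(d_features, d_model)
          self.l1_coef = l1_coef

      def encode(self, x: torch.Tensor) -> torch.Tensor:
          # Non-negative, (hopefully) sparse feature activations.
          return torch.relu(self.enc(x))

      def decode(self, f: torch.Tensor) -> torch.Tensor:
          # Linear readback: each feature contributes one fixed direction in activation space.
          return self.dec(f)

      def loss(self, x: torch.Tensor) -> torch.Tensor:
          f = self.encode(x)
          x_hat = self.decode(f)
          recon = (x - x_hat).pow(2).mean()   # reconstruction error
          sparsity = f.abs().mean()           # L1 penalty pushing most features toward zero
          return recon + self.l1_coef * sparsity
  ```

  The linear decoder is what makes the decomposition “linearly separable”: each learned feature corresponds to a fixed direction in activation space, which is what lets the features be inspected individually.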

  • This post is an update on https://omarclaflin.com/2025/06/14/information-space-contains-computations-not-just-features/ and relates to this repo: https://github.com/omarclaflin/LLM_Intrepretability_Integration_Neurons This post covers NFM tricks and tips applied to LLMs. I will update the new repo link here when I make my next post. Summary: Can we model feature integrations in a scalable and interpretable way? We assume feature integrations are feature interactions…
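
  Since the post frames feature integrations as pairwise feature interactions, a hedged sketch of NFM-style bi-interaction pooling may help make that concrete (hypothetical module and parameter names, not the actual code in the linked repo): it aggregates all pairwise interactions among feature activations in O(n·k) rather than enumerating the O(n²) pairs explicitly.

  ```python
  import torch
  import torch.nn as nn

  class NFMInteraction(nn.Module):
      """Sketch of Neural Factorization Machine bi-interaction pooling over feature activations."""

      def __init__(self, n_features: int, k: int, hidden: int = 64):
          super().__init__()
          # One k-dimensional embedding per feature (e.g. per SAE feature).
          self.embed = nn.Parameter(torch.randn(n_features, k) * 0.01)
          self.mlp = nn.Sequential(nn.Linear(k, hidden), nn.ReLU(), nn.Linear(hidden, 1))

      def forward(self, x: torch.Tensor) -> torch.Tensor:
          # x: (batch, n_features) feature activations.
          weighted = x.unsqueeze(-1) * self.embed          # (batch, n_features, k)
          sum_sq = weighted.sum(dim=1).pow(2)              # (sum_i x_i * v_i)^2
          sq_sum = weighted.pow(2).sum(dim=1)              # sum_i (x_i * v_i)^2
          bi = 0.5 * (sum_sq - sq_sum)                     # all pairwise x_i x_j <v_i, v_j> terms, pooled
          return self.mlp(bi)                              # non-linear readout of the interactions
  ```

  The identity 0.5·((Σᵢ xᵢvᵢ)² − Σᵢ (xᵢvᵢ)²) = Σᵢ<ⱼ xᵢxⱼ (vᵢ ⊙ vⱼ) is what makes this a cheap stand-in for an explicit pairwise interaction term.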

  • GitHub here: https://github.com/omarclaflin/LLM_Intrepretability_Integration_Neurons Last weekend, I read Anthropic’s impressive piece of work, https://transformer-circuits.pub/2025/attribution-graphs/biology.html, on a flight (which made the flight go faster). I think their most superficial takeaways are intuitive (e.g. an LLM’s self-description of its own thought process doesn’t match its actual thinking process), and the examples are very clean (and the methodology behind…