Category: Science Stuff

  • I’ve been trying to catch up on the field of interpretability these last few weeks in my free time, and have been going back over some materials that I either skimmed or simply referenced from another source. One such interesting paper/post was Gurnee, W., “SAE reconstruction errors are (empirically) pathological” (AI Alignment Forum, March…): https://www.alignmentforum.org/posts/rZPiuFxESMxCDHe4B/sae-reconstruction-errors-are-empirically-pathological

  • (This is a continuation of the previous posts. Methodology update: https://omarclaflin.com/2025/06/19/updated-nfm-approach-methodology/; intro to the idea: https://omarclaflin.com/2025/06/14/information-space-contains-computations-not-just-feature/) Paper: Feature Integration Beyond Sparse Coding: Evidence for Non-Linear Computation Spaces in Neural Networks. Background: LLMs are commonly decomposed via SAEs (or other encoders) into linearly separable features, which turn out to be surprisingly interpretable. This project aims to explore the non-linear…
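
  As a point of reference for the “linearly separable” decomposition mentioned above, here is a minimal sketch of a sparse autoencoder of the kind typically trained on LLM activations (a hypothetical, self-contained PyTorch module, not the specific encoder used in these posts): activations are encoded into a sparse, overcomplete feature code and linearly decoded back, with an L1 penalty encouraging sparsity.

  ```python
  import torch
  import torch.nn as nn

  class SparseAutoencoder(nn.Module):
      """Minimal SAE sketch: model activations -> sparse, overcomplete feature code -> reconstruction."""

      def __init__(self, d_model: int, d_features: int, l1_coef: float = 1e-3):
          super().__init__()
          self.enc = nn.Linear(d_model, d_features)   # d_features >> d_model (overcomplete dictionary)
          self.dec = nn.Linear(d_features, d_model)
          self.l1_coef = l1_coef

      def encode(self, x: torch.Tensor) -> torch.Tensor:
          # Non-negative, (hopefully) sparse feature activations.
          return torch.relu(self.enc(x))

      def decode(self, f: torch.Tensor) -> torch.Tensor:
          # Linear readback: each feature contributes one fixed direction in activation space.
          return self.dec(f)

      def loss(self, x: torch.Tensor) -> torch.Tensor:
          f = self.encode(x)
          x_hat = self.decode(f)
          recon = (x - x_hat).pow(2).mean()   # reconstruction error
          sparsity = f.abs().mean()           # L1 penalty pushing most features toward zero
          return recon + self.l1_coef * sparsity
  ```

  The linear decoder is what makes the decomposition “linearly separable”: each learned feature corresponds to a fixed direction in activation space, which is what lets the features be inspected individually.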

  • This post is an update on https://omarclaflin.com/2025/06/14/information-space-contains-computations-not-just-features/ and relates to this repo: https://github.com/omarclaflin/LLM_Intrepretability_Integration_Neurons This post covers NFM tricks and tips applied to LLMs. I will update the new repo link here when I make my next post. Summary: Can we model feature integrations in a scalable and interpretable way? We assume feature integrations are feature interactions…
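
  Since the post frames feature integrations as pairwise feature interactions, a hedged sketch of NFM-style bi-interaction pooling may help make that concrete (hypothetical module and parameter names, not the actual code in the linked repo): it aggregates all pairwise interactions among feature activations in O(n·k) rather than enumerating the O(n²) pairs explicitly.

  ```python
  import torch
  import torch.nn as nn

  class NFMInteraction(nn.Module):
      """Sketch of Neural Factorization Machine bi-interaction pooling over feature activations."""

      def __init__(self, n_features: int, k: int, hidden: int = 64):
          super().__init__()
          # One k-dimensional embedding per feature (e.g. per SAE feature).
          self.embed = nn.Parameter(torch.randn(n_features, k) * 0.01)
          self.mlp = nn.Sequential(nn.Linear(k, hidden), nn.ReLU(), nn.Linear(hidden, 1))

      def forward(self, x: torch.Tensor) -> torch.Tensor:
          # x: (batch, n_features) feature activations.
          weighted = x.unsqueeze(-1) * self.embed          # (batch, n_features, k)
          sum_sq = weighted.sum(dim=1).pow(2)              # (sum_i x_i * v_i)^2
          sq_sum = weighted.pow(2).sum(dim=1)              # sum_i (x_i * v_i)^2
          bi = 0.5 * (sum_sq - sq_sum)                     # all pairwise x_i x_j <v_i, v_j> terms, pooled
          return self.mlp(bi)                              # non-linear readout of the interactions
  ```

  The identity 0.5·((Σᵢ xᵢvᵢ)² − Σᵢ (xᵢvᵢ)²) = Σᵢ<ⱼ xᵢxⱼ (vᵢ ⊙ vⱼ) is what makes this a cheap stand-in for an explicit pairwise interaction term.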

  • GitHub here: https://github.com/omarclaflin/LLM_Intrepretability_Integration_Neurons Last weekend, I read Anthropic’s impressive piece of work, https://transformer-circuits.pub/2025/attribution-graphs/biology.html, on a flight (which made the flight go faster). I think their most superficial takeaways are intuitive (e.g. an LLM’s self-description of its own thought process doesn’t match its actual thinking process), and the examples are very clean (and the methodology behind…