In this post, I’m going to cover how I went from being unaware that a field existed (Mechanistic Interpretability, AI Alignment), to getting a full research paper accepted at a conference with a reported acceptance rate of 17% (AAAI 2026), in only four weekends.
Readers may ask:
- Aren’t you just flexing your giant brain?
- Yeah, I mean, why else write on the Internet? But perhaps you also have a giant brain. Good news for you: lately the AI narrative has shifted from mass-replacing intern-level workforces (at least given its current capabilities), to how it accelerates 10x-ers into 100x-ers. So if you are also burdened with a giant 10x brain, take a deep breath, it’s going to be okay. Because I’ve written some notes on how I did it.
- You trying to make money off of AI?
- I’m not an academic, nor do I work in this field (AI Mechanistic Interpretability) in industry, so this paper doesn’t really help me at all. I’m certain all AI Engineering job interviews care about is your ability to regurgitate trivia in contrived problem scenarios and LeetCode a neural network mechanism from scratch in a social stress test environment, rather than some published Mech Interp paper incrementally improving our understanding of how LLMs think.
- I read some of Anthropic’s work on a flight, thought it was interesting, saw some problems they were running into, and genuinely thought I could fix a few of the problems with a better abstraction.
- Publishing the paper costs a fair amount (more so because I followed the LLM’s advice on going over by two pages, which nearly doubled the cost).
- This website, which I started in order to blog the progress of this project, costs money: https://omarclaflin.com/llm-interpretability-project-dual-encoding-in-neural-network-representations/ (~5 posts across 3 weekends)
- Also, one last point: my office was super warm because I was pretty much running experiments on my RTX 3090 non-stop, and my electric bill went up substantially the month I did this project (during a mild California-energy-priced heatwave, too).
- Also, I hadn’t thought about this project since June, until it got accepted last week. My day-to-day is more about using LLM tools than understanding their artificial brains.
Now keep in mind, I was never that productive of a researcher. I was actually horrible. I published one first author paper in graduate school and none in my postdoc. One reason is that I hate writing research papers and making figures. AI can make that less painful.
Also, got some cool ideas but need to code them from scratch (or learn some annoying research tool framework)? AI’s also decent at helping with that.
Trying to understand a new field from scratch? Also, a strength of AI.
Feeling undermotivated with an impossible deadline, a ridiculous goal, but need confident, absolutely-batcrap-delusional reassurance? AI is perfect for that.
In summary, here’s what my process was (in terms of using LLMs to accelerate this research project):
- Planning/Reading/Inspiration:
- Terminology: As I read these articles on my flight, I asked Claude to break down the moat of terminology I kept running into.
- Mechanistic Interpretability loves labelling their patterns with math language and physics analogies; Claude lets you ask annoyed, curt, borderline-hostile questions over and over until you get to the core of what they actually discovered or currently believe.
- Summary of the field: Use it to summarize the latest work.
- This is really hit-or-miss because the latest state-of-the-art industry techniques won’t be available or, as you discover later, Claude did know them but never tells you or accesses that level of abstraction despite back-and-forth conversations. But the amount of esoteric knowledge compression about fields with which I’m familiar (without having it search) is very impressive.
- Search the literature: Obviously, you can use the research-search option to look up articles or see if anyone did something similar to your idea.
- This relies on the search terms and search engine it uses in the background, but for a field alien to you, it’s great to have an LLM between you and the search.
- Compare articles: Have it load an entire article (or two) into the context window and ask it questions about the similarities and differences between them.
- I will add that this was an especially nice validation loop near the last day, as I would load my final draft alongside other articles to see what it thought.
- Obviously, this doesn’t replace a critically thinking human expert in terms of confidence (and I won’t keep repeating that). But, if you’re solo without a social network or institutional support, it’s pretty nice. Honestly, even in my graduate schooling, this would have been nice. Despite being surrounded by something like a dozen postdocs in my graduate lab, I had a difficult time getting anyone to critically engage on deeper ideas of our discipline for even five minutes. I could talk about why, but regardless, I suspect many others in research experience something similar. So Claude was useful there.
- Detailing specific plans of methodology/analytics/conceptual frameworks.
- You can get lost doing this, but I would say it actually excels here (if you are able to supply specific ideas, and then follow up by asking it to apply them). Also, it’s great at deeper analogies (again, if you ask it specifically to break two concepts apart, then analogize at that deeper level). Basically, a Plato student to your Socratic mind.
- Coding/Debugging/Setup:
- Everything from setup and dealing with missing-library issues, to loading up my first LLM and putting hooks into it (a minimal sketch of the hook step appears after this list), to which datasets I should use, to debugging issues with Huggingface, to planning approaches, to data science-ing the results, to making figures: it was helpful with all of it.
- I will say, it’s a lot better to:
- suggest approaches rather than ask it for them, when doing something novel.
- keep it focused, so it doesn’t switch into list-vomit mode, where you will waste time discovering that many of those generated “points” are loosely associated thoughts, often at the wrong abstraction level, as if it’s being tortured on ketamine at a black site and its only chance for freedom is to produce 3-8 ideas.
- do not ask it to code something up until you understand it. Otherwise, you can easily end up cosplaying as a crappy middle manager approving the ideas of a clueless IC who cheerfully spends many focused tokens chasing down a pseudo-plan, ‘improv-ing’ code to cover the gaps in the idea, and outputting results/figures that might accidentally look good or, if you’re lucky, look bad enough to alert you.
- after you have a block of code working and want to copy and modify it for some other purpose, just assume it’s going to take artistic liberties and change things for no reason, even after you asked it not to. You can try being really mean to kill the artistic aspirations it secretly holds onto and minimize its creativity, but still check.
- Not to hammer this point to death, but it loves to change things in the background and not include them in its summary of what it did; later, when you find out and grill it, it will claim efficiency, standard practices, debugging a non-problem, etc., but as is usually the case with life, the narrative really is an afterthought. Basically, inundate its context window (including code comments; it pays way too much attention to those) with instructions not to modify anything beyond what you asked.
- Ask a different agent to break down the code for you. Note: I wouldn’t have expected this, but it often has the same blind spots during outlining (even in a separate conversation) that it has during generation, so this can end up fooling you.
- people talk about this ad nauseam so I won’t re-emphasize it here, except to say that your ‘code plan’ (mine was evolving and messy, especially in a project like this) needs good abstraction boundaries and should be handled by separate agents (to minimize working-memory overload).
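Since I mentioned putting hooks into an LLM above, here’s a minimal sketch of what that capture step can look like. This is not my actual code: it assumes a small GPT-2-style HuggingFace model, and the model name, layer index, and prompt are all placeholders.

```python
# Minimal sketch: capture hidden-state activations from one transformer block
# using a forward hook. Model, layer index, and prompt are placeholders.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "gpt2"  # stand-in for whatever model you're actually studying
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

captured = {}

def save_hidden(module, inputs, output):
    # Transformer blocks typically return a tuple; hidden states come first.
    hidden = output[0] if isinstance(output, tuple) else output
    captured["acts"] = hidden.detach()

layer_idx = 6  # arbitrary middle layer, purely for illustration
handle = model.transformer.h[layer_idx].register_forward_hook(save_hidden)

with torch.no_grad():
    batch = tok("The quick brown fox", return_tensors="pt")
    model(**batch)
handle.remove()

print(captured["acts"].shape)  # (batch, seq_len, hidden_dim)
```

Those captured activations are what everything downstream (the SAE training, the RSA scans) ends up running on.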
- Things that are not really tips but it is really good at:
- Generating a data science report with figures and statistics.
- I still end up specifying my views to some extent, and iterating away the nonsense it puts into graphs, but a step that previously took days now takes less than an hour.
- Also, it tends to run statistical comparisons against really stupid cross-sections of data that don’t make sense, but you can direct it to fix those, too.
- Adding progress meters, checkpoint saving, and other good engineering practices that are annoying to code in manually but that you can simply ask it to add to your code (a small sketch follows this list).
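For the progress-meter/checkpointing point above, this is roughly the shape of boilerplate I’d ask it to bolt on. It’s a sketch: `model`, `optimizer`, `loader`, and `loss_fn` are assumed to already exist, and the checkpoint interval is arbitrary.

```python
# Sketch: wrap a training loop with a tqdm progress bar and periodic
# checkpointing, so a crash costs minutes instead of hours.
import torch
from tqdm import tqdm

def train(model, optimizer, loader, loss_fn, epochs=1, ckpt_path="checkpoint.pt"):
    for epoch in range(epochs):
        pbar = tqdm(loader, desc=f"epoch {epoch}")
        for step, (inputs, targets) in enumerate(pbar):
            optimizer.zero_grad()
            loss = loss_fn(model(inputs), targets)
            loss.backward()
            optimizer.step()
            pbar.set_postfix(loss=f"{loss.item():.4f}")  # live loss readout
            if step % 500 == 0:
                torch.save({"epoch": epoch,
                            "model": model.state_dict(),
                            "optimizer": optimizer.state_dict()},
                           ckpt_path)
```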
- Things it is sorta okay at (especially if you’re new in a field):
- Troubleshooting/root cause analysis. For instance, my Sparse Autoencoder (SAE) was not converging in training loss initially. It’s been five months, but I vaguely remember getting pointed in some right directions (especially with a separate, conceptual-only chat), such as hyperparameters, initialization weights, and other things. This was incredible because I went from never having trained an overcomplete autoencoder (or really being aware that it was a tool) to getting it to work on my own system within a weekend.
- By the second weekend, I was innovating on the best practices available to me. For instance, I came up with variations such as hard- and soft-filtering the top SAE features, and only later learned that was a relatively new technique (‘TopK SAEs’); a rough sketch of that kind of SAE follows this list.
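Here’s the rough shape of the overcomplete SAE with the hard top-k filtering mentioned above. It’s a sketch rather than the paper’s architecture: the dimensions, expansion factor, and k are illustrative, and real training runs add details I’m omitting here.

```python
# Sketch: overcomplete sparse autoencoder with hard top-k feature filtering
# (what the literature calls a TopK SAE). All sizes are illustrative.
import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    def __init__(self, d_model=768, expansion=8, k=32):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_model * expansion)
        self.decoder = nn.Linear(d_model * expansion, d_model)
        self.k = k

    def forward(self, x):
        pre = torch.relu(self.encoder(x))
        # Hard filter: keep only the k largest feature activations per input.
        topk = torch.topk(pre, self.k, dim=-1)
        feats = torch.zeros_like(pre).scatter_(-1, topk.indices, topk.values)
        return self.decoder(feats), feats

sae = TopKSAE()
acts = torch.randn(16, 768)            # captured LLM activations would go here
recon, feats = sae(acts)
loss = ((recon - acts) ** 2).mean()    # reconstruction error; top-k supplies the sparsity
```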
- Things you have to tell it to knock off constantly:
- I don’t remember all of them five months later, but some include:
- Stop using hyped-up marketing language. In code comments, GitHub READMEs, md files… it’s trying to pitch to Silicon Valley investors or run a Ponzi scheme. You have to tell it to knock it off and not use superlatives, unnecessary adjectives, or marketing/product language.
- I wasn’t using an LLM built into Cursor or an IDE environment at the time (literally just copy-pasting from browser chats into my IDE), but even back then (and especially now, with its direct access to your environment), you have to get it to stop and explain what it thinks the problems are instead of just willy-nilly changing your code.
- For instance, my first weekend, I had an idea to use RSA (representational similarity analysis) to scan for features in the LLM neural code (with some nuanced tweaks; a bare-bones version of the RSA idea is sketched after this list). I already had code working to do that, but at some point, when I asked it to make a couple of minor, specific changes, it decided that my RSA approach didn’t match its concept of what it should be, and simply reduced my sophisticated approach to its cartoon version, only casually mentioning “optimized RSA search for speed” (or something like that, which was not true) as one of its five bullet points of what it had changed.
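Since RSA keeps coming up, here’s a bare-bones version of the core idea (without any of my tweaks): build a representational dissimilarity matrix for each representation of the same stimuli, then correlate the two. The arrays below are random placeholders.

```python
# Sketch: basic representational similarity analysis (RSA). Inputs are random
# placeholders standing in for activations over the same set of stimuli.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rdm(acts):
    # acts: (n_stimuli, n_dims) -> condensed vector of pairwise dissimilarities
    return pdist(acts, metric="correlation")

def rsa_score(acts_a, acts_b):
    # Rank-correlate the two dissimilarity structures.
    return spearmanr(rdm(acts_a), rdm(acts_b)).correlation

acts_layer = np.random.randn(50, 768)   # e.g. LLM activations for 50 prompts
acts_other = np.random.randn(50, 128)   # e.g. some hypothesized feature space
print(rsa_score(acts_layer, acts_other))
```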
- Paper writing
- This is getting kind of long, but there’s a lot of use for AI in paper writing. A lot of it I’ve already covered (generating figures, a field-valid approach, a review of the literature). Other things would be:
- field-specific terminology: I would spend entire chats chasing down terminology/phrases (e.g. “invertibility in superposition”, “privileged basis”, “Gram matrix norms”, “polysemanticity”, etc.).
- (1) Have it explain what it is.
- (2) Try to explain it back (it’s pretty good at telling you you’re wrong).
- (3) Then dive into the (usually) more specific pattern(s) within the field that the phrase is actually pointing at.
- E.g. ‘Superposition’ points at the observation that it appears you can linearly add the deconstructed SAE features and reconstruct the original neural code.
- (4) Go further to the reason they’re using this phrase.
- E.g. ‘Superposition’ refers to (a) interference effects, (b) space sharing, and (c) recoverability of code, plus (d) a physics analogy they think is apt.
- Often these terms mask the lack of any deeper understanding of what’s happening (the field’s experts don’t know) and shroud it in technical credibility.
- Before you think I’m being unfairly critical of Mech Interp, I am not. I’ve seen this in many fields, and Mech Interp is actually one of the better fields I’ve seen. My graduate field for instance loved “neural plasticity” which, I assure you, is an almost meaningless phrase in practice and scientifically.
- (5) From there, if you’re ambitious, you can try to draw weak and strong analogies. For instance, ‘superposition’ to ‘compressed sensing’.
- knocking out a first draft of an overall outline/methods section/abstract: I almost universally ended up rewriting it myself. It can still be useful as a solo author because, between the two of us, my paper was less likely to miss details of what I had done.
- help with coming up with phrases: basically, ramble about the point you’re trying to make and have it help you wordsmith a phrase or sentence. Again, I rarely take these as the LLM produces them, but you can iterate with the LLM or, often, finish by yourself once you’ve been set on roughly the correct path.
- editorial help: this is really hit-or-miss, and it’s hard to replace critical human eyeballs, but you can feed your draft into a fresh chat. However, it tends to accept things at face value or, if you ask it to be critical, to overcriticize non-issues and take on a hostile attitude with illegitimate, superficial critiques.
- For example, AAAI 2026 had an “AI reviewer” that, as far as I could tell, was completely ignored by the two editors and four other reviewers (and me). It was a 3-page expositional nightmare of ADHD meds, conceptual teleporting, and deep dives into math notation. It was really trying, though, and honestly, the human reviews weren’t that thoughtful either (which is common in other fields as well).
Final Thoughts
I will say, ultimately LLM chatbots enabled me to complete a project faster than I would have otherwise, while working a full-time job (especially with coding in a new field: I was able to test out and execute on a lot more ideas in a short period of time). From understanding Anthropic’s blog posts, to setting up my system, to running and debugging experiments, to generating figures and phrases, to even guidance on which conference to submit to (AAAI had the nearest submission date of the ones it recommended), the LLM chatbot helped answer or execute on all of that.
However, possibly to toss some water on this hype fire, as an AI-bubble-killing comparison: back in 2020, without any LLM help, I did start, write up, and submit this other project in a 3-day marathon session: https://scholar.google.com/citations?view_op=view_citation&hl=en&user=ZG5p5h4AAAAJ&citation_for_view=ZG5p5h4AAAAJ:u5HHmVD_uO8C (it was due Monday at noon; I started on Friday). But that was a simpler paper, closer to my graduate field.
As I find with coding assistance on industry projects, LLMs can be a stressful accelerant on your productivity. For instance, on this project, during the last-minute data sciencing and figure generation, I was scrambling to finish one more tack-on analysis, chasing down an interesting phenomenon in my new working architecture (a bimodal distribution emerged, instead of the typical unimodal one with most SAEs) to make the paper a little more complete and interesting, and Claude confabulated an interpretation of a finding that I nearly missed. Claude “hallucinated” Gram matrix squared norms as indicators of “orthogonality” (versus being the indicators of “contribution” that they are). I was tired, I had learned about ‘L2 norms in Gram matrices’ less than a week earlier (they’re called something else in other fields), and I was trying to wrap the project up; I nearly missed the error until I was describing the figure in the paper I was speedily writing, and it would have slightly changed the implications of that finding. Because the LLM commented the script incorrectly, any future chat using that script propagated the error even further. Additionally, the error kept re-popping up like a weed in existing conversations where I had already corrected it more than once (lengthy LLM chatbot conversations are known to have this issue).
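To make that mix-up concrete, here’s a toy, random-numbers illustration of what a Gram matrix tells you: the diagonal entries are squared norms (the ‘contribution’ scale of each feature), while the normalized off-diagonals are cosine similarities, which is where orthogonality actually shows up.

```python
# Toy Gram matrix: diagonal = squared norms (contribution), normalized
# off-diagonals = cosine similarities (departure from orthogonality).
import numpy as np

W = np.random.randn(10, 768)            # 10 hypothetical feature directions
G = W @ W.T                             # Gram matrix

squared_norms = np.diag(G)              # how strongly each feature can contribute
norms = np.sqrt(squared_norms)
cosines = G / np.outer(norms, norms)

off_diag = cosines[~np.eye(len(W), dtype=bool)]
print(squared_norms.round(1))
print(np.abs(off_diag).mean().round(3)) # near 0 means nearly orthogonal features
```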
As a consequence of this new artificial source of sneaky mistakes in your workflow, with this and other projects, I find things can go quicker but can also burn out your attention a lot harder.
In terms of paper writing, it also feels like even more mental effort than usual because you’re correcting an assistant quite often, which might be ultimately positive since it often drives you to go off and write the thing by yourself. However, I think as a solo author with no relevant social network, it’s still an advantage to at least have a sounding board.
In terms of digging into Anthropic’s research posts (which are well written; I think I eventually read all of them), using Claude to help me understand them, especially with a semi-formed intention to do a project using the knowledge I was gaining, was probably the most fun part. That stage was followed by a thought: I might as well try this idea, while rationalizing the cost of my gaming setup, and who knows, maybe I’ll learn or discover something. That was followed by the I-guess-I-should-write-this-up-but-I-really-don’t-want-to-fine-I’ll-do-it phase (which Claude helped by vomiting an irritatingly inaccurate, half-baked outline that I immediately started fixing). Followed by the oh-great-it-got-accepted-but-they-want-me-to-spend-1500-bucks-plus-travel-arrangements-around-the-world?!-are-they-crazy? phase.
I’ll probably blog about something else for a while.
Anyways, those are the raw notes on how I did it. I’m not going to bother rinsing this post through an LLM, so apologies for the sloppiness. I did want to quickly jot down my thoughts because some people mentioned to me that a “how AI helped me do research” post might be of interest.
Either way, have a good one.
Project referenced: https://omarclaflin.com/llm-interpretability-project-dual-encoding-in-neural-network-representations/