Analogies between Software Reverse Engineering and Mechanistic Interpretability

Itay Yona

The distinction between "newbies get caught up trying to understand every detail, experts think in higher-level abstractions, make educated guesses, and only zoom in on the details that matter" felt super interesting and surprising to me.

I claim that this is 1) an instance of a common pattern that 2) is currently missing a step (the pre-newbie stage).

The general pattern is the following (terminology borrowed from Terry Tao):

The pre-rigorous stage: Really new people don't know how ~anything works in a field, and so use high-level abstractions that aren't necessarily grounded in reality.
The rigorous stage: Intermediate people learn the concrete grounding behind a field, but get bogged down in minutia.
The post rigorous stage: Experts form correct high level abstractions informed by their understanding of the grounding, but still use the grounding when the high level abstractions break down.

I think that many experts mainly notice the 2->3 transition, but not the 1->2 one, and so often dissuade newbies by encouraging them to not work in the rigorous stage. I claim this is really, really bad, and that a solid understanding of the rigorous stage is a good idea for ~everyone doing technical work.

Here's a few examples:

Terry Tao talks about this in math: Early students start out not knowing what a proof is and have to manipulate high-level, handwavy concepts. More advanced students learn the rigorous foundations behind various fields of math but are encouraged to focus on the formalism as opposed to 'focusing too much on what such objects actually “mean”'. Eventually, mathematicians are able to revisit their early intuitions and focus on the big picture, converting their intuitions to rigorous arguments when needed.
The exact same thing is true in almost discipline with mathematical proofs, e.g. physics or theoretical CS.
This happens in a very similar with programming as well.
In many strategy games (e.g. Chess), you see crazy high level strategizing at the super low and high levels, while the middle levels are focused on improving technique.
I'd also claim that something similar happens in psychology: freshmen undergrads come up with grand theories of human cognition that are pretty detached from reality, many intermediate researchers get bogged down in experimental results, while the good psychology researchers form high level representations and theories based on their knowledge of experiments (and are able to trivially translate intuitions into empirical claims).

My Takeaways

The underlying mindset actually feels pretty analogous!

I find it super interesting that they also think a lot about motifs (weird patterns and phenomena that only occur in specific contexts), and that these are often the first hook into understanding something weird and that you can then work backwards.
- (Not to be confused with the SRE use of hooking)
Also interesting that they also often focus on the inputs and outputs of the software as the starting point, to get a hook in, and then move on from there.
It's very key to have a deep, gears-level model of the system you're working with (how does a CPU work, how are things represented in memory, the stack, registers, etc)

The distinction between "newbies get caught up trying to understand every detail, experts think in higher-level abstractions, make educated guesses, and only zoom in on the details that matter" felt super interesting and surprising to me.

My attempt to translate it into mechanistic interpretability is that (if it is analogous):
- There are certain principles and patterns by which networks learn, that we can identify and understand.
- We likely will understand these by deeply reverse engineering specific parts of systems (especially toy systems) and digging into the details. But the goal here is to build intuitions and mental models, and a sense for how models work as a whole.
- Once we have these intuitions and some solid grounding in what we do understand well, the best mindset for reverse engineering an unfamiliar system is to be less rigorous and more intuitive. Make educated guesses, look for partial evidence for or against hypotheses, think at a high-level and somewhat abstract mode about the system, and only zoom in on a specific part of the system to deeply reverse engineer once you've identified what to prioritise.
I have no idea if it is analogous, but that mindset aligns a fair bit with my intuitions about MI (though I consider the field to be much more in the "building intuitions by deeply engineering things" phase lol)
As Lawrence Chan notes, this is likely an example of a general pattern, where newbies (1) reason high-level in a very ungrounded way, (2) dig really rigorously into the details constantly and (3) build intuition that's actually grounded, and return to reasoning on a high-level. And that if you want to get to 3, you need to do a lot of digging through details in stage 2 first, while experts can make the mistake of recommending skipping to 3 (which in practice puts people at 1)

I'm surprised at the emphasis on prioritisation, and identifying which part of the software you care about. My mental picture was that the goal was to fully de-compile things to source code, but it sounds like that's rarely the goal and is extremely hard.

But this aligns with my intuitions that a lot of what I want to do with a network is to localise the parts that are relevant to a specific task.

One approach to MI research that seems natural from a SRE perspective: Do extensive work reverse engineering toy models, and try to deeply understand the circuits there. Then, try to distill out motifs and find (ideally automated) tools to detect similar circuits in bigger toy models/real models.

Induction heads seem like a successful example of this, and I’m very curious about whether this works for my modular addition algorithm or a toy model of superposition!

If this works, it suggests that an approach to MI could be:

Identify a behaviour we care about/want to find a circuit for

Train a toy model to do this, that tries to be as simple as possible while being a good simulation of the real model

Study how the behaviour is implemented

Distill out the core patterns and fingerprints of this behaviour

Scan the real network for these, and use this as (non-rigorous) evidence for what’s going on.

I’m biased, because this is how I already think about MI, but it’s nice to have some external validation!

Call Notes

Background:

SRE = Software Reverse Engineering, the study of decompiling a program binary to the actual code.
Program binary -> assembly is easy, so it's basically about reverse engineering assembly.
Program files are enormous, want to understand what it's doing as much as you can, while translating as few instructions as possible
- Can be a few MB to 100s of MB in file size, rarely KB

High-level: - we are not trying to be really rigorous or fully reverse engineer things, this is a massive waste of time. We are using the scientific method. We have strong priors, make educated guesses, and the goal of our reverse engineering is to get out a few bits of data that narrows down our educated guesses.

Key challenge: - we want to find a specific function in the enormous file. The way we do this by identifying which part of the file is relevant.

Meta point: The goal is not to decompile the entire program. It's to identify the specific part of it that you care about understanding, or to zoom in on a specific vulnerability, or to figure out what it does at a high-level, etc. You need to prioritise and be intentional, not to dig into every single last detail.
- Often you care about specific details of a part of the software and how they deal with individual bits - does the function take in shorts vs longs, how does it cast, etc. These are often the details that lead to vulnerabilities, which is a lot of the goal here.
Main technique: Run the code a bunch on a bunch of inputs, get a black box sense for what it does. Use this to build intuitions about the program.
Form hypotheses about what could be going on internally, and use this black box operation to test and validate them. Try to break things, and look for inputs that could distinguish between your hypotheses
Try to get into the mindset of the programmer who wrote it
- He often thinks in C and classes, rather than in terms of assembly. You want to be as abstract as is reasonably possible without losing sight of the problem.
  - Specifics of registers etc are often not a helpful way to think about it.
  - You ultimately want to think about things as the programmer would have thought about things. The compiler and OS can often add a lot of irrelevant complexity. If you think in terms of assembly you'll need to spend a lot of time engaging with these, but if you think in C you can abstract a lot away
- This is both about "what is a reasonable way to implement a behaviour" or "there are many ways to implement this behaviour, which one might the programmer have chosen?"
  - We can rule out possibilities and explore by playing with the software, look at what triggers errors, weird edge cases, etc.

Core mindset: His core underlying philosophy is about looking for shortcuts.

Newbies look for specifics. Key details of what variables more from where to where, details of memory, etc. Very bottom up, easy to waste a lot of time.
Experts think in terms of abstractions - what is it doing, what are the variables, where are they, what is the code doing, etc. Top down, only zoom in once they've identified what's interesting.
- Analogy to chess: Newbies care a lot about whether a piece can be captured, is it protected, etc. As an expert, it's automatic, which lets us build better abstractions, and we can reason about strategy on a higher level.
To be a good SRE, understanding registers, loops, etc is important. But when you understand how to do this, it's enough.
- But a common mistake is to never switch to top down. This is one of the best forms of value a mentor gives, showing the worth of educated guesses, rapid moving on, skimming at a high level until you find the part worth zooming in on.
  - People often learn this by pair programming, and seeing that a mentor is willing to not be a perfectionist and move on.

Localising: How do I localise which part of the code matters?

High-Level - look for motifs in the code, and look at how it interfaces with external things (eg web or user interface)
Big disanalogy to neural networks: In a network, finding the representations is a lot of effort and very hard. But in software, we mostly care about how it represents the inputs, which are very easy and standard.
- Eg, the software acts on an image, and we know there are some standard operations - applying a fourier transform, storing the height and width, de-compression, etc.
- Example: If we know this is happening, then the code will be doing floating point multiplication. This is an unusual motif (rarely occurs in other contexts) and is easy to identify from the code, and a significant hint that de compression is happening
  - Underlying point: There’s a lot of weird crap that gets put into the binary from the compiler, or standard code from the underlying libraries/OS. We want to sift through for the parts specific to this program and the behaviour we care about.
  - This means that if we see a big block of code that does floating point multiplication, and it's called by parts of code that handles input images, it's very likely to be de-compression. On its own, being called by the input handling code is only weak evidence, but combined with floating point multiplication and our priors, this is strong evidence
- Summary: There is a special motif (floating point arithmetic) which is only used in some specific functional context. If you recognise that this is happening, this is a major hint about what’s going on!
Another type of work is using a debugger - this is dynamic analysis, not static analysis.
- We can look at what happens when the part of the code inside the de-compression algorithm is being run, and then look at the call stack. This can give us hints about what those functions do! Eg looking for user input.
- I love that the same ideas of static (analysing weights) vs dynamic (analysing the model on various inputs) analysis applies here!
Example: We want to crack photoshop to not need a serial number. We know that an incorrect serial number creates a pop-up. We can add a breakpoint to every part of the system that creates a pop-up, because this requires an interface with the UI. By seeing which breakpoint triggers, we can identify which call to MessageBoxA (the actual function name!) is relevant. And by looking at the call stack, we can identify which parts of the code are relevant.

A Practical workflow to balance between divergent, convergent and meta work:

This is Itay’s specific workflow, and is neither SRE specific nor universal in SRE.
Summary:
- Start the project by forming a creative mindmap. Start with your end-goal (eg finding a vulnerability in a specific system) and systematically break down the problem, think of as many in-roads as you can. Then prioritise and pick one to work on.
- Spend most of your in focused research mode, and just keep iterating on that sub-problem. Each day, set yourself a question, try to answer that question, and then pick the natural next question the next day.
- Predict how long this will take. When that time is up, move on to the next most promising thing
Research diaries:
- At the start of the day, write out a question. By the end of the day, try to answer it.
- This hopefully unlocks more questions and threads, and you can iterate!
- Each day, you default to following your research diary. What questions now feel alive, dive into those, keep iterating.
Mindmaps:
- Start with the fundamental goal, eg “get the system to think you’ve given it the right password”
- Branch out and brainstorm sub-goals, sub-sub-goals, etc.
- Prioritise and choose one to focus on - do a rough cost-effectiveness analysis
Most of the time, you're in very focused mode and trying to answer a deep question. You use the mindmap to keep track of new ideas and creative thoughts without breaking focus.
Return to the mindmap when things take longer than expected (even if you haven't run out of ideas!). Then pick the next promising thread
- Important: Be willing to go back to previously explored threads even if there are more threads you haven't explored yet.
- It's hard to tell if you're in a rabbit hole or the solution is nearby. Thinks it's hard to tell whether a thread is promising or not in the moment. You can get weak evidence from partial progress, but idk.
- The goal is not having an optimal algorithm, it's about being good enough. "Work on the most promising thing for X time, then move on to next most promising thing".
  - As long as you have good time estimates, traversing the mindmap graph is like applying the A* Algorithm to it
How to tell when you're in a rabbit hole? This is when it's most useful to have a mentor/rubber duck/break. There's no clear rules here, but with time and experience you develop taste for "there should be an easier solution, I've been on this for a while, let's switch".
- Implicitly, you should have a model of what doing this research looks like, how hard it is, how often you should make progress, etc. If you're repeatedly wrong/optimistic, you notice and update or pivot.
How would you mentor someone to learn this workflow? Pretty easy, it's very concrete, you can just outline the algorithm for the mentee to follow.
- Concretely, take a mentee who’s stuck on a problem. Listen to what’s going on for them, repeat the approach of expanding the mindmap and being willing to return to previous threads. This basically seemed to be enough to get them unstuck and motivated again!

Teaching: What would you put on a curriculum to make someone an expert SRE?

Technical details:
- Static vs dynamic analysis (analyse binary vs analyse it on inputs)
Teach common patterns:
- Eg De-compression has a lot of floating point multiplications, looking at inputs, changing things on the GUI, etc
- Give them toy problems to study with easy hooks in (ie patterns to give a starting point)
Intermediate problems which don't have deliberate, obvious hooks in
- But there probably isn't something that has no hooks. The core principles of the field is that things can be understood! But it might be hard and obfuscating to access it.
Move towards bigger systems. Don't just remove hints, but also obfuscate things.
- Eg code is partially encrypted, lots of distractor code, etc.
- What do you do here? Eh, not any general tips. But fundamentally, the computer can understand it, so it should be understandable, eg it must eventually be decrypted
- Mindset: Assume there is a shortcut. What is that shortcut? Use this to motivate your research.

Hard cases: Trying to reverse engineer a mysterious black box.

This is a pretty in the weeds example. Might be interesting but feel free to skip. Another interesting example
High Level takeaway: We need to rely on problem-solving principles even more, and focus on the constraints and bottlenecks of the system (eg, if it's accessing the internet, it must be doing that somehow)
Example: We have a binary that we extract from our router. We know it's router-y but know nothing else
It's not a classic case of software (like Photoshop) where we have documentation and know what the user interface is and what it should be doing. We know that it's doing router-related things, but not whether it's implemented with Linux or Windows, what language, what hardware, etc.
Sometimes we can guess, eg "this is using ARM's architecture/implementation", but sometimes we don't know what it is or aren't familiar with it, and need to start from scratch.
What do we do now? We still try to be a scientist! What do we know about what the software must be doing?
- Example: If it is a CPU, it must be having loops, accessing and moving memory, etc.
- Eg int manipulating commands may be 1 byte, other commands may be 2 bytes.
- Eg the code must be accessing registers, which must have locations represented internally
The underlying goal is to look for any chink in its armor! Try to find any pattern. This gives you an insight into what's going. You can identify what some parts of the code are doing, and make progress from there.

Different types of reverse engineering:

To do web hacking, we don't even have access to the server-side code and binaries. We can see how it responds to various inputs. Eg put a backtick into a text box, if it creates a 500 error this suggests it's using SQL
Hardware reverse engineering - interesting and is at the intersection. Like web bc it's black box, like software because we can break things, change temperature, etc:
- Example: CPUs will execute instructions even with insufficient power. They will just fail to do the power intensive operations
  - Eg: Will fail to access memory, but can still add numbers. You can use this to hack a password, because it can't access what the password is, we can get in with all zeros
- A great talk Itay recommends
- The jargon for this is side-channel attacks

Sometimes operations take more or less depending on the input, by precisely measuring the time, we can figure it out.

Example: runs through each byte of a string and returns at the first one that's correct. This means we can brute force the first character, then the second character, etc, because we can tell how many bytes we have correct.

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

19

Analogies between Software Reverse Engineering and Mechanistic Interpretability

19

My Takeaways

Call Notes