AI ALIGNMENT FORUM

127
Adam Karvonen

Posts


Wikitag Contributions

Comments

The Thinking Machines Tinker API is good news for AI control and security
Adam Karvonen, 9d

In nnsight, hooks are submitted via an API to run on a remote machine, and the computation is performed on the same machine that is doing the inference. They do some validation to ensure that it's only legit PyTorch stuff, so it isn't just arbitrary code execution.

The Thinking Machines Tinker API is good news for AI control and security
Adam Karvonen, 9d

I'm guessing most modern interp work should be fine. Interp has moved away from "let's do this complicated patching of attention head patterns between prompts" to basically only interacting with residual stream activations. You can easily do this with, e.g., PyTorch hooks, even in modern inference engines like vLLM. The amount of computation performed in a hook is usually trivial: I have never noticed a slowdown in my vLLM generations when using hooks.
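
As a rough illustration of the kind of hook meant here, the sketch below reads and then steers "residual stream" activations with standard PyTorch forward hooks. The `ToyBlock` model, `capture_hook`, `steering_hook`, `steering_vector`, and the scale of 4.0 are all hypothetical stand-ins, not anything from a real inference engine; how hooks are attached inside vLLM is not shown.

```python
import torch
import torch.nn as nn

class ToyBlock(nn.Module):
    """Hypothetical stand-in for a transformer block: resid + MLP(resid)."""
    def __init__(self, d_model):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, resid):
        return resid + self.mlp(resid)

torch.manual_seed(0)
d_model = 16
model = nn.Sequential(*[ToyBlock(d_model) for _ in range(3)])
x = torch.randn(2, 5, d_model)  # (batch, seq, d_model)

captured = {}

def capture_hook(module, inputs, output):
    # Read-only: keep a detached copy of this block's output.
    captured["resid"] = output.detach().clone()

steering_vector = torch.randn(d_model)  # hypothetical direction

def steering_hook(module, inputs, output):
    # Returning a tensor from a forward hook replaces the module's output.
    return output + 4.0 * steering_vector

# Pass 1: only observe the activations after the middle block.
handle = model[1].register_forward_hook(capture_hook)
with torch.no_grad():
    baseline = model(x)
handle.remove()

# Pass 2: add the steering vector at the same point.
handle = model[1].register_forward_hook(steering_hook)
with torch.no_grad():
    steered = model(x)
handle.remove()
```

Both hooks do a single elementwise operation or copy per forward pass, which is small next to the matrix multiplies in the block itself.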

Because of this, I don't think batched execution would be a problem, though you'd probably want some validation in the hook so that it can only interact with activations from the user's own prompt.
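
One way such validation might look, as a minimal sketch: wrap the user's function so it is handed only the batch rows that belong to that user, and write its result back to only those rows. The names `make_row_restricted_hook`, `user_rows`, and `edit_fn` are hypothetical, and a real serving stack would also need to handle sequence packing, the KV cache, and sandboxing of the user code itself, none of which is covered here.

```python
import torch
import torch.nn as nn

def make_row_restricted_hook(user_rows, edit_fn):
    """Build a forward hook that exposes only `user_rows` of the batch to `edit_fn`."""
    rows = torch.as_tensor(user_rows, dtype=torch.long)

    def hook(module, inputs, output):
        user_acts = output[rows]          # tensor indexing returns a copy
        edited = edit_fn(user_acts)
        if edited.shape != user_acts.shape:
            raise ValueError("edit_fn must preserve the activation shape")
        patched = output.clone()          # other users' rows are copied unchanged
        patched[rows] = edited
        return patched

    return hook

torch.manual_seed(0)
layer = nn.Linear(8, 8)                   # toy stand-in for a model layer
batch = torch.randn(4, 3, 8)              # 4 prompts from different users

with torch.no_grad():
    untouched = layer(batch)

# The user owning row 1 zero-ablates their own activations.
handle = layer.register_forward_hook(
    make_row_restricted_hook([1], lambda acts: torch.zeros_like(acts))
)
with torch.no_grad():
    out = layer(batch)
handle.remove()
```

Here row 1 of `out` is zeroed while rows 0, 2, and 3 match the unhooked output, since `edit_fn` never receives them.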

There's also nnsight, which already supports remote execution of PyTorch hooks on models hosted on Bau Lab machines through an API. I think they do some validation to ensure users can't do anything malicious.

You would need some process to handle the activation data, because it's large. If I'm training a probe on 1M activations, with d_model = 10k and bfloat16 (2 bytes per value), then this is 20 GB of data. SAEs are commonly trained on 500M+ activations. We probably don't want the user to have access to this locally, but they probably want to do some analysis on it.
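
The arithmetic behind those sizes can be checked with a few lines. The helper name `activation_bytes` is hypothetical, and applying the same d_model = 10k to the 500M-activation SAE case is an assumption made only for illustration.

```python
def activation_bytes(n_activations, d_model, bytes_per_value=2):
    """Storage for n_activations vectors of width d_model (bfloat16 = 2 bytes)."""
    return n_activations * d_model * bytes_per_value

GB = 10**9
TB = 10**12

probe_bytes = activation_bytes(1_000_000, 10_000)
sae_bytes = activation_bytes(500_000_000, 10_000)  # assumes the same d_model

print(probe_bytes / GB)  # 20.0 (GB)
print(sae_bytes / TB)    # 10.0 (TB)
```

At SAE scale the data is in the terabyte range under these assumptions, which is part of why the analysis would likely need to run next to the data.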

0 Adam Karvonen's Shortform
9mo
0
No wikitag contributions to display.
37 Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning
3mo
0
42 SAEBench: A Comprehensive Benchmark for Sparse Autoencoders
10mo
1
43 OthelloGPT learned a bag of heuristics
1y
1