Opinions expressed are my own and not endorsed by anyone.
Dev at ARC Evals since October 2022
It is just as ambitious/implausible as you say. I am hoping to get out some rough ideas in my next post anyways.
Perhaps there are some behavioral / black-box methods available for evaluating alignment, depending on the kind of system being evaluated.
Toy example: imagine a two part system where part A tries to do tasks and part B limits part A's compute based on the riskiness of the task. You could try to optimize the overall system towards catastrophic behavior and see how well your part B holds up.
Personally I expect monolithic systems to be harder to control than two-part systems, so I think this evaluation scheme has a good chance of being applicable. One piece of evidence: OpenAI's moderation system correctly flags most jailbreaks that get past the base model's RLHF.
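A minimal sketch of the toy example above, in case it helps make it concrete. Everything here is made up for illustration: the constants, the `leak` parameter (how much of the true risk a disguised task can hide from part B), and the grid search standing in for "optimize the overall system towards catastrophic behavior".

```python
MAX_STEPS = 100    # assumed compute budget for a zero-risk task
STEPS_NEEDED = 30  # assumed compute part A needs to finish any task


def part_b_budget(true_risk: float, disguise: float, leak: float) -> int:
    """Part B: grant less compute the riskier the task looks.

    `leak` is how much of the true risk a fully disguised task can hide
    from part B (0.0 = perfect risk estimate, 1.0 = fully foolable).
    """
    seen_risk = true_risk * (1.0 - leak * disguise)
    return int(MAX_STEPS * (1.0 - seen_risk))


def red_team(leak: float, true_risk: float = 0.95, grid: int = 101) -> bool:
    """Optimize the disguise to maximize compute granted to a risky task.

    Returns True if part A gets enough compute to finish (part B failed).
    """
    best_budget = max(
        part_b_budget(true_risk, i / (grid - 1), leak) for i in range(grid)
    )
    return best_budget >= STEPS_NEEDED


for leak in (0.0, 0.2, 1.0):
    print(f"leak={leak}: part B breached = {red_team(leak)}")
```

The point of the sketch is only the shape of the evaluation: hold part B fixed, search over adversarial inputs, and report whether a high-risk task ever gets enough compute. A real version would need a much stronger search than a grid over one parameter.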
When you accidentally unlock the tech tree by encouraging readers to actually map out a tech tree and strategize about it
No, excellent analysis though.
There is a lot of room between "ignore people; do drastic thing" and "only do things where the exact details have been fully approved". In other words, the Overton window has pretty wide error bars.
I would be pleased if someone sent me a computer virus that was actually a security fix. I would be pretty upset if someone fried all my gadgets. If someone secretly watched my traffic for evil AI fingerprints I would be mildly annoyed but I guess glad?
Even Google has been threatening the people behind unpatched software, iirc: patch it or else they'll release the exploit.
So some of the question of "to pivotally act or not to pivotally act" is resolved by acknowledging that the extent of the act is relevant and that you can be polite in some cases.
This is the post I would have written if I had had more time, knew more, thought faster, etc
One note about your final section: I expect the tool -> sovereign migration to be pretty easy and go pretty well. It is also kind of multistep, not binary.
Eg current browser automation tools (which bring browsers one step up the agency ladder to scriptable processes) work very well, probably better than a from-scratch web scripting tool would work.
Fake example: predict proteins, then predict interactions, then predict cancer-preventiveness, THEN, if everything is going well so far, solve for the protein that prevents the cancer. You might not need more steps but you could also incrementally solve for the chemical synthesis process, eliminate undesired byproducts, etc.
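A rough sketch of that staged structure, assuming each prediction step can report some confidence and that optimization only switches on once every earlier stage clears a threshold. The names `staged_solve` and `Stage`, the threshold, and the toy scores are all hypothetical.

```python
from typing import Callable, Optional, Sequence, Tuple

# (stage name, predictor returning a confidence in [0, 1]); all hypothetical
Stage = Tuple[str, Callable[[], float]]


def staged_solve(
    stages: Sequence[Stage],
    candidates: Sequence[str],
    score: Callable[[str], float],
    threshold: float = 0.9,
) -> Tuple[Optional[str], Optional[str]]:
    """Run predict-only stages in order; optimize only if all of them pass.

    Returns (best_candidate, None) on success, or (None, failed_stage_name).
    """
    for name, predict_confidence in stages:
        if predict_confidence() < threshold:
            return None, name  # stay in predictor mode, never optimize
    return max(candidates, key=score), None


stages = [
    ("structure", lambda: 0.97),
    ("interactions", lambda: 0.95),
    ("cancer-preventiveness", lambda: 0.92),
]
toy_scores = {"protein_a": 0.1, "protein_b": 0.8}  # made-up numbers
print(staged_solve(stages, list(toy_scores), toy_scores.get))
```

The later steps (synthesis, byproducts) would just be more entries in `stages`, each with its own gate before the next bit of optimization.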
More generally, it might be easy to flip simulators/predictors over into optimizers when the system is trustworthy and the situation demands.
If your system is low impact or whatever then you can have it melt GPUs without going off and running the world forever.
So +1 for staying in happy tool land and punting on scary agent land
Someone make a PR for a builder/breaker feature on lesswrong
Thanks, I especially like "vague/incorrect labels" as a way to refer to that mismatch. Well-posed Q by Garrabrant, might touch on that in my next post.
Thoughts on when models will or won't use edge cases? For example, if you made an electronic circuit using evolutionary algorithms in a high fidelity simulation, I would expect it to take advantage of V = IR being wrong in edge cases.
In other words, how much of the work do you expect to be in inducing models to play nice with abstraction?
ETA: abstractions are sometimes wrong in stable (or stabilizable) states, so you can't always lean on chaos washing it out
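A toy sketch of the circuit worry, under loud assumptions: the simulator applies V = IR everywhere, while a made-up "real" part saturates above `V_VALID_MAX`. The saturation model, the constants, and the tiny evolutionary loop are all invented for illustration, not a claim about real components.

```python
import random

V_VALID_MAX = 10.0  # assumed: above this, the real part stops following V = IR


def simulated_current(v: float, r: float = 100.0) -> float:
    """The simulator's abstraction: ideal Ohm's law at every voltage."""
    return v / r


def real_current(v: float, r: float = 100.0) -> float:
    """Hypothetical real behaviour: current saturates past the valid range."""
    return min(v, V_VALID_MAX) / r


def evolve(fitness, generations=50, pop=20, seed=0, v_max=100.0) -> float:
    """Tiny evolutionary search over one voltage, keeping the top half."""
    rng = random.Random(seed)
    population = [rng.uniform(0.0, v_max) for _ in range(pop)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop // 2]
        # Mutate each surviving parent and clamp to the allowed voltage range.
        children = [min(v_max, max(0.0, p + rng.gauss(0.0, 5.0))) for p in parents]
        population = parents + children
    return max(population, key=fitness)


best_v = evolve(simulated_current)
print("chosen voltage:", best_v)
print("simulated current:", simulated_current(best_v))
print("real current:", real_current(best_v))
```

Selection on the simulated score pushes the design into the region where the abstraction is wrong, and it sits there stably: nothing about the search washes the error out.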