Opinions expressed are my own and not endorsed by anyone. Currently working at ARC Evals.
There is a lot of room between "ignore people; do drastic thing" and "only do things where the exact details have been fully approved". In other words, the Overton window has pretty wide error bars.
I would be pleased if someone sent me a computer virus that was actually a security fix. I would be pretty upset if someone fried all my gadgets. If someone secretly watched my traffic for evil-AI fingerprints I would be mildly annoyed, but I guess also glad?
Even Google has been threatening people running unpatched software: patch it or else they'll release the exploit, iirc.
So some of the Q of "to pivotally act or not to pivotally act" is resolved by acknowledging that extent matters and that you can be polite in some cases.
This is the post I would have written if I'd had more time, known more, thought faster, etc.
One note about your final section: I expect the tool -> sovereign migration to be pretty easy and go pretty well. It is also kind of multistep, not binary.
E.g. current browser automation tools (which bring browsers one step up the agency ladder, to scriptable processes) work very well, probably better than a from-scratch web-scripting tool would.
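For a sense of what that rung looks like in practice, here is a minimal sketch using Playwright (my choice of example, not a tool named above): a real browser driven as an ordinary scriptable process.

```python
# A real browser driven as an ordinary scriptable process (assumes `pip install playwright`
# and `playwright install chromium` have been run).
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com")
    print(page.title())  # the script reads the page like any other data
    browser.close()
```

The same browser a human steers by hand becomes, with one thin wrapper, a component some other program can call; that is the sense in which each step up the agency ladder can be small.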
Fake example: predict proteins, then predict interactions, then predict cancer-preventiveness, THEN, if everything is going well so far, solve for the protein that prevents the cancer. You might not need more steps, but you could also incrementally solve for the chemical synthesis process, eliminate undesired byproducts, etc.
More generally, it might be easy to flip simulators/predictors over into optimizers when the system is trustworthy and the situation demands it.
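To make that flip concrete, here is a minimal sketch under loose assumptions: the `predict_*` functions below are hypothetical stand-ins for trained predictors from the fake protein example, and the "optimizer" is nothing more than best-of-n search over their scores, gated on whatever checks you already trust.

```python
import random

def predict_stability(protein):
    # Hypothetical stand-in for a trained stability predictor.
    return -abs(sum(protein) - 10.0)

def predict_efficacy(protein):
    # Hypothetical stand-in for a trained "prevents the cancer" predictor.
    return -abs(protein[0] - protein[-1])

def passes_checks(protein):
    # Stand-in for the "if everything is going well so far" gate.
    return all(0.0 <= x <= 5.0 for x in protein)

def random_candidate(length=8):
    return [random.uniform(0.0, 5.0) for _ in range(length)]

# The flip: reuse the predictors as a scoring function inside a dumb search loop.
best, best_score = None, float("-inf")
for _ in range(10_000):
    candidate = random_candidate()
    if not passes_checks(candidate):
        continue
    score = predict_stability(candidate) + predict_efficacy(candidate)
    if score > best_score:
        best, best_score = candidate, score

print(best_score)
print(best)
```

Same predictors, no new training; the only thing that changed is the loop wrapped around them, which is part of why the migration can be incremental.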
If your system is low-impact or whatever, then you can have it melt GPUs without it going off and running the world forever.
Assumptions:
So +1 for staying in happy tool land and punting on scary agent land
Someone make a PR for a builder/breaker feature on LessWrong.
Thanks; I especially like "vague/incorrect labels" as a way to refer to that mismatch. Well-posed Q by Garrabrant; I might touch on that in my next post.
Thoughts on when models will or won't exploit edge cases? For example, if you made an electronic circuit using evolutionary algorithms in a high-fidelity simulation, I would expect it to take advantage of V = IR being wrong in edge cases (toy sketch below).
In other words, how much of the work do you expect to be in inducing models to play nice with abstraction?
ETA: abstractions are sometimes wrong in stable (or stabilizable) states, so you can't always lean on chaos washing the mismatch out.
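A toy version of the circuit worry (my own sketch, not something from the thread): the search optimizes whatever the simulator scores, so any spot where the simulator's abstraction is wrong becomes an attractor. Here the fake simulator rounds sub-milliohm resistances down to a perfect conductor, and plain mutate-and-select drives the design straight into that regime.

```python
import random

def simulated_score(resistances):
    # Pretend circuit score: higher is better, rewarding low total series resistance.
    # Edge-case bug: anything under 1 milliohm is rounded down to a perfect conductor.
    total = sum(r if r >= 1e-3 else 0.0 for r in resistances)
    return 1.0 / (1.0 + total)

# Population of candidate "circuits": 4 resistor values each, in ohms.
population = [[random.uniform(1.0, 100.0) for _ in range(4)] for _ in range(50)]

for _ in range(200):
    population.sort(key=simulated_score, reverse=True)
    survivors = population[:10]
    # Keep the survivors and refill the rest of the population with mutated copies.
    population = survivors + [
        [max(0.0, r + random.gauss(0.0, 5.0)) for r in random.choice(survivors)]
        for _ in range(40)
    ]

best = max(population, key=simulated_score)
print(best, simulated_score(best))  # ends up exploiting the sub-milliohm rounding bug
```

A physical circuit wouldn't honor the exploit; the optimizer latched onto the abstraction error, not the task.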
What counts as a solution? You could of course set this up completely manually. Or you could train another net to tweak the first. The gold standard would be to have a basically normal net do this on a basically normal task...
This is true in every field and is apparently very difficult to systematize. Perhaps it's a highly unstable social state to have people changing directions, or thinking/speaking super honestly, very often.
How could one succeed where so few have?
It seems I was missing the right keywords in my search for demos of this, because when I google "ai research assistant" there is quite a lot of work.