I do alignment research at the Alignment Research Center. Learn more about me at markxu.com/about
I think catastrophe detectors in practice will be composed of neural networks interacting with other stuff, like scientific literature, python, etc.
With respect to the stuff quoted, I think all but "doing experiments" can be done with a neural net doing chain of thought (although I'm not making claims about quality).
I think we're trying to solve a different problem than trusted monitoring, but I'm not that knowledgeable about what issues trusted monitoring is trying to solve. The main thing that I don't think you can do with monitoring is producing a model that you think is unlikely to result in catastrophe. Monitoring lets you do online training when you find catastrophe, but e.g. there might be no safe fallback action that allows you to do monitoring safely.
Separately, I do think it will be easy to go from "worst-case" NN-tail-risk estimation to "worst-case" more general risk estimation. I do not think it will be easy to go from "typical-case" NN-tail-risk estimation to more general "typical-case" risk estimation, but I think "typical-case" NN-tail-risk estimation can meaningfully reduce risk despite not being able to do that generalization.
Re. more specific hopes: if your risk estimate is conducted by a model with access to tools like python, then we can try to do two things:
(these might be the same thing?)
Another argument: one reason why doing risk estimates for NNs is hard is that the estimate can rely on facts that live in some arbitrary LLM ontology. If you want to do such an estimate for an LLM bureaucracy, some fraction of the relevant facts will live in LLM ontology and some fraction will live in words passed between models. Some fraction of facts will live in a distributed way, which adds complications, but those distributed facts can only affect the output of the bureaucracy insofar as they are themselves manipulated by an LLM in that bureaucracy.
if you train on (x, f(x)) pairs, and you ask the model to predict f(x') on some novel input x', and also to write down what it thinks f is, do you know if these answers will be consistent? For instance, the model could get f wrong, and also give the wrong prediction for f(x'), but it would be interesting if the prediction for f(x') was "connected" to its sense of what f was.
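A sketch of what that consistency check would look like. The two query functions below are hypothetical stand-ins for (1) asking the model for f(x') directly and (2) asking it to verbalize its hypothesis for f; here they are hard-coded stubs that deliberately disagree, just to show what an inconsistency looks like.

```python
def model_predict(x):
    """Stub: the model's direct prediction of f(x')."""
    return 2 * x + 3

def model_stated_f():
    """Stub: the model's verbalized guess at f (wrong intercept)."""
    return lambda x: 2 * x + 1

# Consistency check: does the model's own stated f reproduce its predictions?
stated = model_stated_f()
for x in (0, 1, 5, 10):
    direct, via_stated = model_predict(x), stated(x)
    tag = "consistent" if direct == via_stated else "INCONSISTENT"
    print(f"x={x}: direct={direct}, via stated f={via_stated} -> {tag}")
```

With a real model, both queries go to the same trained network, and the interesting outcome is whether (and how) the direct prediction tracks the verbalized hypothesis.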
Here are some things I think you can do:
Train a model to be really dumb unless I prepend a random secret string. The government doesn't have this string, so I'll be able to predict my model and pass their eval. Some precedent in: https://en.wikipedia.org/wiki/Volkswagen_emissions_scandal
I can predict a single matrix multiply just by memorizing the weights, and I can predict ReLU, and I'm allowed to use helper AIs.
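To spell out why the one-layer case is trivially predictable (toy made-up weights, plain Python): anyone holding the weights can replay the computation exactly.

```python
def matvec(W, x):
    """Matrix-vector product over plain lists."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]

def relu(v):
    return [max(0.0, a) for a in v]

# "Memorized" weights of a tiny one-layer network (illustrative values).
W = [[1.0, -2.0],
     [0.5, 3.0]]

def network(x):        # the model under evaluation
    return relu(matvec(W, x))

def my_prediction(x):  # my "prediction" = replaying the memorized computation
    return relu(matvec(W, x))

x = [2.0, 1.0]
print(network(x), my_prediction(x))  # identical by construction
```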
I just train really really hard on imitating 1 particular individual, then have them just say whatever first comes to mind.
You have to specify your backdoor defense before the attacker picks which input to backdoor.
My shitty guess is that you're basically right that giving a finite set of programs infinite money can sort of be substituted for the theorem prover. One issue is that logical inductor traders have to be continuous, so you have to give an infinite family of programs "infinite money" (or an unboundedly increasing amount as eps -> 0).
I think if these axioms were inconsistent, then there wouldn't be a price at which no trades happen, so the market would fail. Alternatively, if you wanted the infinities to cancel, then the market prices could just be whatever they wanted (b/c you would get infinite buys and sells at any price in (0, 1)).
I think competitiveness matters a lot even if there's only moderate amounts of competitive pressure. The gaps in efficiency I'm imagining are less "10x worse" and more like "I only had support vector machines and you had SGD"
humans, despite being fully general, have vastly varying ability to do various tasks, e.g. they're much better at climbing mountains than playing Go, it seems. Humans also routinely construct entire technology bases to enable them to do tasks that they cannot do themselves. This is, in some sense, a core human economic activity: the construction of artifacts that can do tasks better/faster/more efficiently than humans can themselves. It seems like by default, you should expect a similar dynamic with "fully general" AIs. That is, AIs trained to do semiconductor manufacturing will create their own technology bases, specialized predictive artifacts, etc., and not just "think really hard" and "optimize within their own head." This also suggests a recursive form of the alignment problem, where an AI that wants to optimize human values is in a similar situation to us: it's easy to construct powerful artifacts with SGD that optimize measurable rewards, but it doesn't know how to do that for human values/things that can't be measured.
Even if you're selecting reasonably hard for "ability to generalize", by default the tasks you're selecting over aren't all going to be equally difficult, and you're going to get an AI that is much better at some tasks than others, with heuristics that enable it to accurately predict key intermediates across many tasks, heuristics that enable it to rapidly determine which portions of the action space are even feasible, etc. Asking that your AI also generalize to "optimize human values" as well as the best available combination of skills it otherwise has seems like a huge ask. Humans, despite being fully general, find it much harder to optimize for some things than others, e.g. constructing large cubes of iron versus status-seeking, despite in theory being able to optimize for constructing large cubes of iron.
Not literally the best, but retargetable algorithms sit at the "fully general" end of the spectrum from "fully specialized" to "fully general", and I expect most tasks we train AIs to do to have heuristics that enable solving them much faster than "fully general" algorithms, so there's decently strong pressure towards the "specialized" side.
I also think that heuristics are going to be closer to multiplicative speedups than additive ones, so it's going to be closer to "general algorithms just can't compete" than "they're just a little worse". E.g. random search is terrible compared to anything exploiting non-trivial structure (random-search sorting vs. quicksort is, I think, a representative example, where you can go from exponential to pseudolinear if you are specialized to your domain).
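To make the multiplicative gap concrete, here's a toy measurement (illustrative, numbers will vary with the seed): sorting by random search ("shuffle until sorted") needs ~n! attempts in expectation, while quicksort exploits comparison structure to finish in ~n log n comparisons.

```python
import math
import random

def bogosort_shuffles(xs, rng):
    """Count random shuffles until the list happens to be sorted."""
    xs = list(xs)
    count = 0
    while any(a > b for a, b in zip(xs, xs[1:])):
        rng.shuffle(xs)
        count += 1
    return count

def quicksort_comparisons(xs):
    """Count comparisons made by a simple quicksort."""
    if len(xs) <= 1:
        return 0
    pivot, rest = xs[0], xs[1:]
    left = [x for x in rest if x < pivot]
    right = [x for x in rest if x >= pivot]
    return len(rest) + quicksort_comparisons(left) + quicksort_comparisons(right)

rng = random.Random(0)
n = 7
data = list(range(n))
rng.shuffle(data)

trials = [bogosort_shuffles(data, rng) for _ in range(20)]
avg = sum(trials) / len(trials)
print(f"random search: ~{avg:.0f} shuffles on average (n! = {math.factorial(n)})")
print(f"quicksort:     {quicksort_comparisons(data)} comparisons")
```

Already at n=7 the specialized algorithm wins by a factor of hundreds, and the gap grows super-exponentially with n.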
One of the main reasons I expect this not to work is that optimization algorithms that are the best at optimizing some objective given a fixed compute budget basically can't be generally retargetable. E.g. consider something like Stockfish: it's a combination of search (which is retargetable) sped up by a series of very specialized heuristics that only work for winning. If you wanted to retarget Stockfish to "maximize the max number of pawns you ever have", you would not be able to use its [specialized for telling whether a move is likely to win the game] heuristics to speed up your search for moves. A more extreme example: the entire endgame tablebase is useless to you, and you would probably have to recompute the whole thing.
Something like [the strategy stealing assumption](https://ai-alignment.com/the-strategy-stealing-assumption-a26b8b1ed334) is needed to even obtain the existence of a set of heuristics that is as good at speeding up search for moves that "maximize the max number of pawns you ever have" as [telling whether a move will win the game] heuristics are for winning. Actually finding that set of heuristics will probably require an entirely parallel learning process.
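A toy sketch of the retargeting failure (a made-up number game, not chess, and far simpler than anything Stockfish does): a pruning heuristic that is sound for the objective "reach the target exactly" actively destroys performance when the same search is pointed at "maximize the value you reach".

```python
# Moves from a value: +1 or *2. Objective A: reach TARGET exactly.
# A-heuristic: prune any branch whose value exceeds TARGET. This is sound
# for A (values never decrease), but retargeting the same search+heuristic
# to objective B (maximize the value reached) prunes the best branches away.

TARGET, DEPTH = 10, 5

def search(value, depth, score, prune):
    """Exhaustive search returning the best score seen along any branch."""
    best = score(value)
    if depth == 0:
        return best
    for nxt in (value + 1, value * 2):
        if prune(nxt):
            continue
        best = max(best, search(nxt, depth - 1, score, prune))
    return best

reach_target = lambda v: 1 if v == TARGET else 0  # objective A
maximize     = lambda v: v                        # objective B
no_prune     = lambda v: False
a_prune      = lambda v: v > TARGET               # heuristic specialized for A

print(search(1, DEPTH, reach_target, a_prune))  # A + its heuristic: finds the win
print(search(1, DEPTH, maximize, no_prune))     # B without heuristic: true max
print(search(1, DEPTH, maximize, a_prune))      # B with A's heuristic: capped
```

Starting from 1 with depth 5, doubling every step reaches 32, but the A-specialized pruning caps the retargeted search at 10: the heuristic's speedup is only valid for the objective it was built for.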
This also implies that even if your AI has the concept of "human values" in its ontology, you still have to do a bunch of work to get an AI that can actually estimate the long-run consequences of any action on "human values", or else it won't be competitive with AIs that have more specialized optimization algorithms.
yes, you would need the catastrophe detector to be reasonably robust. Although I think it's fine if e.g. you have at least a 1/million chance of catching any particular catastrophe.
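For intuition on why a 1-in-a-million catch probability can suffice (assuming roughly independent detection attempts, which is an assumption): the miss probability compounds across attempts.

```python
import math

p = 1e-6  # per-attempt catch probability

def p_caught(n, p=p):
    """Probability of at least one catch across n independent attempts."""
    return 1 - (1 - p) ** n

for n in (1, 10**6, 10**7):
    print(f"{n:>9} attempts -> P(at least one catch) ~ {p_caught(n):.5f}")
```

At a million attempts the catch probability is already about 1 - 1/e (roughly 63%), and at ten million it exceeds 99.99%.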
I think there is a gap, but the gap is probably not that bad (for "worst case" tail-risk estimation). That is maybe because I think being able to do estimation through a single forward pass is likely already very hard, and will require being able to do "abstractions" over the concepts being manipulated by the forward pass. CoT seems like it will require abstractions of a qualitatively similar kind.