However, in AI alignment, the hope is to learn from failures of narrow AI systems, and use that to prevent failures in more powerful AI systems.
This also jumped out at me as being only a subset of what I think of as "AI alignment"; like, ontological collapse doesn't seem to have been a failure of narrow AI systems. [By 'ontological collapse', I mean the problem where the AI knows how to value 'humans', and then it discovers that 'humans' aren't fundamental and 'atoms' are fundamental, and now it's not obvious how its preferences will change.]
Perhaps you mean "AI alignment in the slow takeoff frame", where 'narrow' is less a binary judgment and more of a continuous judgment; then it seems more compelling, but I still think the baseline prediction should be doom if we can only ever solve problems after encountering them.
But if on some absolute scale you say that AlphaZero is a design / search hybrid, then presumably you should also say that OpenAI Five is a design / search hybrid, since it uses PPO at the outer layer, which is a designed algorithm. This seems wrong.
I think I'm willing to bite that bullet; like, as far as we know the only stuff that's "search all the way up" is biological evolution.
But 'hybrid' seems a little strange; like, I think design normally has search as a subcomponent (in imaginary space, at least, and I think often also search through reality), and so in some sense any design that isn't a fully formed vision from God is a design/search hybrid. (If my networks use ReLU activations 'by design', isn't that really by the search process of the ML community as a whole? And yet it's still useful to distinguish networks which determine what nonlinearity to use from local data from networks which have it determined for them by an external process, which potentially has a story for why that's the right thing to do.)
Total horse takeover seems relevant as another way to think about intervening to 'control' things at varying levels of abstraction.
[The core thing about design that seems important and relevant here is that there's a "story for why the design will work", whereas search is more of an observational fact of what was out there when you looked. It seems like it might be easier to build a 'safe design' out of smaller sub-designs, whereas trying to find a safe algorithm using search runs into all the anthropic problems of empiricism.]
But for now such approaches are being badly outperformed by search (in AI).
I suspect the edge here depends on the level of abstraction. That is, Go bots that use search can badly outperform Go bots that don't use any search, but using search at the 'high level' (like in MuZero) only somewhat outperforms using design at that level (like in AlphaZero).
It wouldn't surprise me if search always has an edge (at basically any level, exposing things to adjustment by gradient descent makes performance on key metrics better), but if the edge is small it seems plausible to focus on design.
But does it ever hallucinate the need to carry the one when it shouldn't?
To be pedantic: we care about "consequence-desirability-maximisers" (or in Rohin's terminology, goal-directed agents) because they do backwards assignment.
But I think the pedantry is important, because people substitute utility-maximisers for goal-directed agents, and then reason about those agents by thinking about utility functions, and that just seems incorrect.
This also seems right. Like, my understanding of what's going on here is we have:
The first is a narrow class, and depending on how strict you are with 'maximize', quite possibly no physically real agents will fall into it. The second is a universal class, which instantiates the 'trivial claim' that everything is utility maximization.
Put another way, the first is what happens if you hold utility fixed / keep utility simple, and then examine what behavior follows; the second is what happens if you hold behavior fixed / keep behavior simple, and then examine what utility follows.
Distance from the first is what I mean by "the further a robot's behavior is from optimal"; I want to say that I should have said something like "VNM-optimal" but actually I think it needs to be closer to "simple utility VNM-optimal."
I think you're basically right in calling out a bait-and-switch that sometimes happens, where anyone who wants to talk about the universality of expected utility maximization in the trivial 'general' sense can't get it to do any work, because it should all add up to normality, and in normality there's a meaningful distinction between people who sort of pursue fuzzy goals and ruthless utility maximizers.
Which seems very, very complicated.
I realized my grandparent comment is unclear here:
but need a very complicated utility function to make a utility-maximizer that matches the behavior.
This should have been "consequence-desirability-maximizer" or something, since the whole question is "does my utility function have to be defined in terms of consequences, or can it be defined in terms of arbitrary propositions?". If I want to make the deontologist-approximating Innocent-Bot, I have a terrible time if I have to specify the consequences that correspond to the bot being innocent and the consequences that don't, but if you let me say "Utility = 0 - badness of sins committed" then I've constructed a 'simple' deontologist. (At least, about as simple as the bot that says "take random actions that aren't sins", since both of them need to import the sins library.)
In general, I think it makes sense to not allow this sort of elaboration of what we mean by utility functions, since the behavior we want to point to is the backwards assignment of desirability to actions based on the desirability of their expected consequences, rather than the expectation of any arbitrary property.
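To make the Innocent-Bot comparison concrete, here is a minimal sketch. Everything in it is hypothetical: the `SIN_BADNESS` table stands in for the "sins library", and the action names and badness scores are invented for illustration. It only shows that the "utility = 0 - badness of sins committed" bot and the "take random actions that aren't sins" bot both lean on the same library.

```python
import random

# Hypothetical "sins library": action name -> badness score (invented values).
SIN_BADNESS = {"lie": 1, "steal": 5}
ACTIONS = ["help", "wait", "lie", "steal"]


def utility(actions_taken):
    """Utility = 0 - badness of sins committed (defined over actions, not consequences)."""
    return 0 - sum(SIN_BADNESS.get(action, 0) for action in actions_taken)


def maximizer_bot():
    """Pick a single action that maximizes the sin-based utility."""
    return max(ACTIONS, key=lambda action: utility([action]))


def random_innocent_bot(rng):
    """Take a random action that isn't a sin."""
    innocent = [action for action in ACTIONS if action not in SIN_BADNESS]
    return rng.choice(innocent)
```

Both bots are a few lines long once the sins table exists; the hard part in either case is the table itself.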
Actually, I also realized something about your original comment which I don't think I had the first time around; if by "some reasonable percentage of an agent's actions are random" you mean something like "the agent does epsilon-exploration" or "the agent plays an optimal mixed strategy", then I think it doesn't at all require a complicated utility function to generate identical behavior. Like, in the rock-paper-scissors world, and with the simple function 'utility = number of wins', the expected utility maximizing move (against tough competition) is to throw randomly, and we won't falsify the simple 'utility = number of wins' hypothesis by observing random actions.
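As a rough check of the rock-paper-scissors point, the sketch below computes expected wins per round against "tough competition", which I'm modelling (as an assumption) as an opponent who knows our mixed strategy and picks the reply that minimizes our wins. The function names are illustrative, not from any library.

```python
from fractions import Fraction

MOVES = ("rock", "paper", "scissors")
# BEATS[x] is the move that x defeats.
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}


def win_probability(strategy, opponent_move):
    """Probability that a move sampled from `strategy` beats `opponent_move`."""
    return sum(p for move, p in strategy.items() if BEATS[move] == opponent_move)


def worst_case_wins(strategy):
    """Expected wins per round against the reply that is worst for us (assumed adversary)."""
    return min(win_probability(strategy, reply) for reply in MOVES)


uniform = {move: Fraction(1, 3) for move in MOVES}
always_rock = {"rock": Fraction(1), "paper": Fraction(0), "scissors": Fraction(0)}
rock_heavy = {"rock": Fraction(1, 2), "paper": Fraction(1, 4), "scissors": Fraction(1, 4)}
```

Under this model the uniform strategy guarantees a win rate of 1/3, while any lopsided strategy guarantees less, so a simple 'utility = number of wins' maximizer is expected to look random.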
Instead I read it as something like "some unreasonable percentage of an agent's actions are random", where the agent is performing some simple-to-calculate mixed strategy that is either suboptimal or only optimal by luck (when the optimal mixed strategy is the maxent strategy, for example), and matching the behavior with an expected utility maximizer is a challenge (because your target has to be not some fact about the environment, but some fact about the statistical properties of the actions taken by the agent).
I think this is where the original intuition becomes uncompelling. We care about utility-maximizers because they're doing their backwards assignment, using their predictions of the future to guide their present actions to try to shift the future to be more like what they want it to be. We don't necessarily care about imitators, or simple-to-write bots, or so on. And so if I read the original post as "the further a robot's behavior is from optimal, the less likely it is to demonstrate convergent instrumental goals", I say "yeah, sure, but I'm trying to build smart robots (or at least reasoning about what will happen if people try to)."
If a reasonable percentage of an agent's actions are random, then to describe it as a utility-maximiser would require an incredibly complex utility function (because any simple hypothesised utility function will eventually be falsified by a random action).
I'd take a different tack here, actually; I think this depends on what the input to the utility function is. If we're only allowed to look at 'atomic reality', or the raw actions the agent takes, then I think your analysis goes through, that we have a simple causal process generating the behavior but need a very complicated utility function to make a utility-maximizer that matches the behavior.
But if we're allowed to decorate the atomic reality with notes like "this action was generated randomly", then we can have a utility function that's as simple as the generator, because it just counts up the presence of those notes. (It doesn't seem to me like this decorator is meaningfully more complicated than the thing that gave us "agents taking actions" as a data source, so I don't think I'm paying too much here.)
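A minimal sketch of the decorated version, assuming a hypothetical `DecoratedAction` record that carries the "this action was generated randomly" note alongside the raw action. The utility function just counts the notes, so it is about as simple as the generator.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class DecoratedAction:
    action: str
    generated_randomly: bool  # the note attached to the raw action


def utility(history):
    """Count how many actions in the history carry the 'generated randomly' note."""
    return sum(1 for step in history if step.generated_randomly)


def random_agent(actions, steps, rng):
    """Simple generator: every action is sampled at random and decorated as such."""
    return [DecoratedAction(rng.choice(actions), True) for _ in range(steps)]
```

A history produced by `random_agent` scores the maximum possible utility for its length, whereas a scripted history with no random steps scores zero, regardless of which raw actions appear.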
This can lead to a massive explosion in the number of possible utility functions (because there's a tremendous number of possible decorators), but I think this matches the explosion that we got by considering agents that were the outputs of causal processes in the first place. That is, consider reasoning about python code that outputs actions in a simple game, where there are many more possible python programs than there are possible policies in the game.
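A toy illustration of the programs-versus-policies gap, using an invented game with two states and two actions. There are only four deterministic policies, but arbitrarily many programs implement each one; the `prog_*` functions are made-up examples.

```python
from itertools import product

STATES = (0, 1)
ACTIONS = ("a", "b")

# Every deterministic policy is one action per state: 2 ** 2 = 4 in total.
all_policies = set(product(ACTIONS, repeat=len(STATES)))


def prog_1(state):
    return "a"


def prog_2(state):
    return "a" if state >= 0 else "b"


def prog_3(state):
    return ACTIONS[(state * 2) % 2]


def prog_4(state):
    return ACTIONS[state]


def policy_of(program):
    """Collapse a program to the policy it implements on this game."""
    return tuple(program(state) for state in STATES)
```

Here `prog_1`, `prog_2`, and `prog_3` are different source code but collapse to the same policy, while `prog_4` implements a different one.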
I think you run into a problem that most animal communication is closer to a library of different sounds, each of which maps to a whole message, than it is something whose content is determined by internal structure, so you don't have the sort of corpus you need for unsupervised learning (while you do have the ability to do supervised learning).
I think also Conway's game of life has a large bestiary of 'stable patterns' that you could figure out and then dramatically increase your ability to predict things.
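For instance, two of the best-known entries in that bestiary are the block (a still life) and the blinker (a period-2 oscillator). The sketch below uses the standard Life rules on an unbounded grid, with the set-of-live-cells representation chosen just for brevity.

```python
from collections import Counter


def step(live):
    """Advance a set of live (x, y) cells by one generation of Conway's Game of Life."""
    neighbor_counts = Counter(
        (x + dx, y + dy)
        for (x, y) in live
        for dx in (-1, 0, 1)
        for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0)
    )
    # A cell is alive next turn with exactly 3 neighbors, or 2 if already alive.
    return {
        cell
        for cell, n in neighbor_counts.items()
        if n == 3 or (n == 2 and cell in live)
    }


block = {(0, 0), (0, 1), (1, 0), (1, 1)}
blinker = {(0, 1), (1, 1), (2, 1)}
```

Once you recognize a block, you can predict it forever without simulating it (as long as nothing else comes near), and likewise a blinker returns to the same state every two steps.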
I think for the remaining 5% to be hiding really big important stuff like the presence of optimization (which is to say, mesa-optimization) or deceptive cognition, it has to be the case that there was adversarial obfuscation (e.g. gradient hacking). Of course, I'm only hypothesizing here, but it seems quite unlikely for that sort of stuff to just be randomly obfuscated.
I read Adversarial Examples are Features Not Bugs as suggesting that this sort of thing happens by default, and the main question is "sure, some of it happens by default, but can really big stuff happen by default?". But if you imagine an LSTM implementing a finite state machine, or something, it seems quite possible to me that it will mostly be hard to unravel instead of easy to unravel, while still being a relevant part of the computation.