One argument is that U() should be computable because the agent has to be able to use it in computations. This perspective is especially appealing if you think of U() as a black-box function which you can only optimize through search. If you can't evaluate U(), how are you supposed to use it? If U() exists as an actual module somewhere in the brain, how is it supposed to be implemented?
This seems like a weak argument. If I think about a human trying to achieve some goal in practice, "think of U() as a black-box function which you can only optimize through search" doesn't really describe how we typically reason. I would say that we optimize for things we can't evaluate all the time - it's our default mode of thought. We don't need to evaluate U() in order to decide which of two options yields higher U().
Example: suppose I'm a general trying to maximize my side's chance of winning a war. Can I evaluate the probability that we win, given all of the information available to me? No - fully accounting for every little piece of info I have is way beyond my computational capabilities. Even reasoning through an entire end-to-end plan for winning takes far more effort than I usually make for day-to-day decisions. Yet I can say that some actions are likely to increase our chances of victory, and I can prioritize actions which are more likely to increase our chances of victory by a larger amount.
Suppose I'm running a company, trying to maximize profits. I don't make decisions by looking at the available options, and then estimating how profitable I expect the company to be under each choice. Rather, I reason locally: at a cost of X I can gain Y, I've cached an intuitive valuation of X and Y based on their first-order effects, and I make the choice based on that without reasoning through all the second-, third-, and higher-order effects of the choice. I don't calculate all the way through to an expected utility or anything comparable to it.
If I see a $100 bill on the ground, I don't need to reason through exactly what I'll spend it on in order to decide to pick it up.
In general, I think humans usually make decisions directionally and locally: we try to decide which of two actions is more likely to better achieve our goals, based on local considerations, without actually simulating all the way to the possible outcomes.
Taking a more theoretical perspective... how would a human or other agent work with an uncomputable U()? Well, we'd consider specific choices available to us, and then try to guess which of those is more likely to give higher U(). We might look for proofs that one specific choice or another is better; we might leverage logical induction; we might do something else entirely. None of that necessarily requires evaluating U().
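To make "optimizing without evaluating" concrete, here's a minimal sketch (all names and numbers are my own, purely illustrative): an agent improves its position purely through pairwise comparisons based on a cheap local proxy, and never computes the "true" objective end-to-end at all.

```python
import random

# Toy illustration (my own construction): an agent optimizes a goal it
# never evaluates. The "true" utility U is a stand-in for something too
# expensive to compute; the agent only uses a cheap comparison based on
# local, first-order considerations.

def local_compare(a, b):
    """Guess which of two options is better using only a crude local
    proxy (a cached 'intuitive valuation'), without evaluating U."""
    proxy = lambda x: 2 * x[0] + x[1]  # first-order effects only
    return a if proxy(a) >= proxy(b) else b

def hill_climb(start, steps=100):
    """Directional, local optimization: repeatedly keep the better of
    the current option and a nearby alternative."""
    current = start
    for _ in range(steps):
        candidate = (current[0] + random.uniform(-1, 1),
                     current[1] + random.uniform(-1, 1))
        current = local_compare(candidate, current)
    return current

random.seed(0)
print(hill_climb((0.0, 0.0)))
```

The point of the sketch is just that `hill_climb` makes steady directional progress even though nothing in it ever calls a full utility function.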
I think you get "ground truth data" by trying stuff and seeing whether or not the AI system did what you wanted it to do.
That's the sort of strategy where illusion of transparency is a big problem, from a translation point of view. The difficult cases are exactly the ones where the translation usually produces the results you expect, but then produces something completely different in some rare cases.
Another way to put it: if we're gathering data by seeing whether the system did what we wanted, then the long tail problem works against us pretty badly. Those rare tail-cases are exactly the cases we would need to observe in order to notice problems and improve the system. We're not going to have very many of them to work with. Ability to generalize from small data sets becomes a key capability, but then we need to translate how-to-generalize in order for the AI to generalize in the ways we want (this gets at the can't-ask-the-AI-to-do-anything-novel problem).
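A quick back-of-the-envelope check (my own numbers, purely illustrative) of how badly the long tail works against us here: if a failure mode occurs with probability p per trial, the chance of never seeing it in n trials is (1 - p)^n.

```python
# Illustrative calculation: how likely is it that testing never surfaces
# a rare tail case at all?

def p_never_observed(p, n):
    """Probability that n independent trials contain zero tail cases,
    given a per-trial failure probability p."""
    return (1 - p) ** n

# A 1-in-10,000 failure mode, tested 1,000 times:
print(p_never_observed(1e-4, 1000))  # ~0.905: we probably never see it
```

So even a fairly serious testing effort most likely yields zero examples of the case we'd need to observe in order to notice the problem.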
(The other comment is my main response, but there's a possibly-tangential issue here.)
In a long-tail world, if we manage to eliminate 95% of problems, then we generate maybe 10% of the value. So now we use our 10%-of-value product to refine our solution. But it seems rather optimistic to hope that a product which achieves only 10% of the value gets us all the way to a 99% solution. It seems far more likely that it gets to, say, a 96% solution. That, in turn, generates maybe 15% of the value, which in turn gets us to a 96.5% solution, and...
Point being: in the long-tail world, it's at least plausible (and I would say more likely than not) that this iterative strategy doesn't ever converge to a high-value solution. We get fancier and fancier refinements with decreasing marginal returns, which never come close to handling the long tail.
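Here's a toy model (my own parameterization, just to make the comment's illustrative numbers concrete) of how this iteration can fail to converge: suppose each refinement cycle closes only half as much of the problem space as the previous one, because the low-value product gives weaker leverage each time.

```python
# Toy model of iterative refinement with diminishing returns. Starting
# from a 95% solution, with a first gain of 1 percentage point and each
# subsequent gain half the previous one:

def iterate(start=0.95, first_gain=0.01, decay=0.5, cycles=1000):
    """Simulate refinement cycles; returns the final solution quality."""
    s, gain = start, first_gain
    for _ in range(cycles):
        s += gain
        gain *= decay
    return s

final = iterate()
# The geometric series sums to 97%: the iteration converges, but never
# comes anywhere near a 99% solution, no matter how many cycles we run.
print(f"after 1000 cycles: {final:.4%}")
```

Under these assumptions the process is "converging" in a mathematical sense, but to a fixed point well short of the tail.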
Now, under this argument, it's still a fine idea to try the iterative strategy. But you wouldn't want to bet too heavily on its success, especially without a reliable way to check whether it's working.
An important part of my intuition about value-in-the-tail is that if your first solution can knock off 95% of the risk, you can then use the resulting AI system to design a new AI system where you've translated better and now you've eliminated 99% of the risk...
I don't see how this ever actually gets around the chicken-and-egg problem.
An analogy: we want to translate from English to Korean. We first obtain a translation dictionary which is 95% accurate, then use it to ask our Korean-speaking friend to help out. Problem is, there's a very important difference between very similar translations of "help me translate things" - e.g. consider the difference between "what would you say if you wanted to convey X?" and "what should I say if I want to convey X?", when giving instructions to an AI. Both of those would produce very similar results, right up until everything went wrong. (Let me know if this analogy sounds representative of the strategies you imagine.)
If you do manage to get that first translation exactly right, and successfully ask your friend for help, then you're good - similar to the "translate how-to-translate" strategy from the OP. And with a 95% accurate dictionary, you might even have a decent chance of getting that first translation right. But if that first translation isn't perfect, then you need some way to find that out safely - and the 95% accurate dictionary doesn't make that any easier.
Another way to look at it: the chicken-and-egg problem is a ground truth problem. If we have enough data to estimate X to within 5%, then doing clever things with that data is not going to reduce that error any further. We need some other way to get at the ground truth, in order to actually reduce the error rate. If we know how to convey what-we-want with 95% accuracy, then we need some other way to get at the ground truth of translation in order to increase that accuracy further.
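The statistical version of this point can be sketched in a few lines (my own toy setup): with N samples of noise sigma, the standard error of an estimated mean is sigma/sqrt(N). Re-analyzing the same N samples leaves that floor in place; only new ground-truth data shrinks it.

```python
import random
import statistics

# Empirical check of the error floor: estimate a mean from N=400 noisy
# samples, many times over, and compare the RMS error of the estimates
# against the theoretical floor sigma/sqrt(N).

random.seed(0)
sigma, N, trials = 1.0, 400, 2000

squared_errors = []
for _ in range(trials):
    est = statistics.fmean(random.gauss(0.0, sigma) for _ in range(N))
    squared_errors.append(est ** 2)  # true mean is 0, so error = est

rms = statistics.fmean(squared_errors) ** 0.5
print(rms, sigma / N ** 0.5)  # both come out around 0.05
```

No amount of cleverness applied to one fixed batch of 400 samples changes that ~5% figure; halving it requires four times as much data.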
Endorsed; that definitely captures the key ideas.
If you haven't already, you might want to see my answer to Steve's comment, on why translation to low-level structure is the right problem to think about even if the AI is using higher-level models.
I agree with most of this reasoning. I think my main point of departure is that I expect most of the value is in the long tail, i.e. eliminating 95% of problems generates <10% or maybe even <1% of the value. I expect this both in the sense that eliminating 95% of problems unlocks only a small fraction of economic value, and in the sense that eliminating 95% of problems removes only a small fraction of risk. (For the economic value part, this is mostly based on industry experience trying to automate things.)
Optimization is indeed the standard argument for this sort of conclusion, and is a sufficient condition for eliminating 95% of problems to have little impact on risk. But again, it's not a necessary condition - if the remaining 5% of problems are still existentially deadly and likely to come up eventually (but not often enough to be caught in testing), then risk isn't really decreased. And that's exactly the sort of situation I expect when viewing translation as the central problem: illusion of transparency is exactly the sort of thing which doesn't seem like a problem 95% of the time, right up until you realize that everything was completely broken all along.
Anyway, sounds like value-in-the-tail is a central crux here.
One of the basic problems in the embedded agency sequence is: how does an agent recognize its own physical instantiation in the world, and avoid e.g. dropping a big rock on the machine it's running on? One could imagine an AI with enough optimization power to be dangerous, which gets out of hand but then drops a metaphorical rock on its own head - i.e. it doesn't realize that destroying a particular data center will shut it down.
Similarly, one could imagine an AI which tries to take over the world, but doesn't realize that unplugging the machine on which it's running will shut it down - because it doesn't model itself as embedded in the world. (For similar reasons, such an AI might not see any reason to create backups of itself.)
Another possible safety valve: one could imagine an AI which tries to wirehead, but its operators put a lot of barriers in place to prevent it from doing so. The AI seizes whatever resources it needs, violently smashes through those barriers, then wireheads itself and just sits around.
Generalizing these two scenarios: I think it's plausible that unprincipled AI architectures tend to have built-in safety valves - they'll tend to shoot themselves in the foot if they're able to do so. That's definitely not something I'd want to bet the future of the human species on, but it is a class of scenarios which would allow for an AI to deal a lot of damage while still failing to take over.
"how do I ensure that the AI system has an undo button" and "how do I ensure that the AI system does things slowly"
I don't think this is realistic if we want an economically-competitive AI. There are just too many real-world applications where we want things to happen which are fast and/or irreversible. In particular, the relevant notion of "slow" is roughly "a human has time to double-check", which immediately makes things very expensive.
Even if we abandon economic competitiveness, I doubt that slow+reversible makes the translation problem all that much easier (though it would make the AI at least somewhat less dangerous, I agree with that). It's probably somewhat easier - having a few cycles of feedback seems unlikely to make the problem harder. But if e.g. we're originally training the AI via RL, then slow+reversible basically just adds a few more feedback cycles after deployment; if millions or billions of RL cycles didn't solve the problem, then adding a handful more at the end seems unlikely to help much (though an argument could be made that those last few are higher-quality). Also, there's still the problem of translating a human's high-level notion of "reversible" into a low-level notion of "reversible".
Taking a more outside view... restrictions like "make it slow and reversible" feel like patches which don't really address the underlying issues. In general, I'd expect the underlying issues to continue to manifest themselves in other ways when patches are applied. For instance, even with slow & reversible changes, it's still entirely plausible that humans don't stop something bad because they don't understand what's going on in enough detail - that's a typical scenario in the "translation problem" worldview.
Zooming out even further...
I think the solutions I would look for would be quite different though...
I think what's driving this intuition is that you're looking for ways to make the AI not dangerous, without actually aligning it (i.e. without solving the translation problem) - mainly by limiting capabilities. I expect that such strategies, in general, will run into similar problems to those mentioned above:
Starting point: the problem which makes AI alignment hard is not the same problem which makes AI dangerous. This is the capabilities/alignment distinction: AI with extreme capabilities is dangerous; aligning it is the hard part.
So it seems like this framing of alignment removes the notion of the AI "optimizing for something" or "being goal-directed". Do you endorse dropping that idea?
Anything with extreme capabilities is dangerous, and needs to be aligned. This applies even outside AI - e.g. we don't want a confusing interface on a nuclear silo. Lots of optimization power is a sufficient condition for extreme capabilities, but not a necessary condition.
Here's a plausible doom scenario without explicit optimization. Imagine an AI which is dangerous in the same way as a nuke is dangerous, but more so: it can make large irreversible changes to the world too quickly for anyone to stop it. Maybe it's capable of designing and printing a supervirus (and engineered bio-offense is inherently easier than engineered bio-defense); maybe it's capable of setting off all the world's nukes simultaneously; maybe it's capable of turning the world into grey goo.
If that AI is about as transparent as today's AI, and does things the user wasn't expecting about as often as today's AI, then that's not going to end well.
Now, there is the counterargument that this scenario would produce a fire alarm, but there's a whole host of ways that could fail:
Getting back to your question:
Do you endorse dropping that idea?
I don't endorse dropping the AI-as-optimizer idea entirely. It is definitely a sufficient condition for AI to be dangerous, and a very relevant sufficient condition. But I strongly endorse the idea that optimization is not a necessary condition for AI to be dangerous. Tool AI can be plenty dangerous if it's capable of making large, fast, irreversible changes to the world, and the alignment problem is still hard for that sort of AI.
I do expect that systems trained with limited information/compute will often learn multi-level models. That said, there are a few reasons why low-level is still the right translation target to think about.
First, there's the argument from the beginning of the OP: in the limit of abundant information & compute, there's no need for multi-level models; just directly modelling the low-level will have better predictive power. That's a fairly general argument, which applies even beyond AI, so it's useful to keep in mind.
But the main reason to treat low-level as the translation target is: assuming an AI does use high-level models, translating into those models directly will only be easier than translating into the low level to the extent that the AI's high-level models are similar to a human's high-level models. We don't have any reason to expect AI to use similar abstraction levels as humans except to the extent that those abstraction levels are determined by the low-level structure. In studying how to translate our own high-level models into low-level structure, we also learn when and to what extent an AI is likely to learn similar high-level structures, and what the correspondence looks like between ours and theirs.