LessWrong developer, rationalist since the Overcoming Bias days. Jargon connoisseur. Currently working on auto-applying tags to LW posts with a language model.
It seems to me that the surprising simplicity of current-generation ML algorithms is a big part of the problem.
As a thought experiment: suppose you had a human brain, with the sort of debug access you'd have with a neural net; ie, you could see all the connections, edge weights, and firings, and had a decent multiple of the compute the brain has. Could you extract something like a verbal inner monologue, a text stream that was strongly predictive of that human's plans? I don't think it would be trivial, but my guess is that you could. It wouldn't hold up against a meditator optimizing against you, but it would be a solid starting point.
Could you do the same thing to GPT-3? No; you can't get language out of it that predicts its plans, because it doesn't have plans. Could you do the same thing to AlphaZero? No, you can't get language out of it that predicts its plans, because it doesn't use language.
This analogy makes me think neural net transparency might not be as doomed as the early results would suggest; those efforts aren't finding human-legible low-dimensional representations of things because the representations aren't present (GPT-3) or have nothing human-legible to match up to (AlphaZero).
In a human mind, a lot of cognition is happening in diffuse illegible giant vectors, but a key part of the mental architecture squeezes through a low-bandwidth token stream. I'd feel a lot better about where ML was going if some of the steps in their cognition looked like low-bandwidth token streams, rather than giant vectors. This isn't by itself sufficient for alignment, of course, but it'd make the problem look a lot more tractable.
I'm not sure whether humans having an inner monologue that looks like the language we trained on and predicts our future behavior is an incidental fact about humans, or a convergent property of intelligent systems that get most of their information from language, or a convergent property of all intelligent systems, or something that would require deliberate architecture choices to make happen, or something that we won't be able to make happen even with deliberate architecture choices. From my current state of knowledge, none of these would surprise me much.
Note: Due to a bug, if you were subscribed to email notifications for curated posts, the curation email for this post came from Alignment Forum instead of LessWrong. If you're viewing this post on AF, to see the comments, view it on LessWrong instead. (This is a LessWrong post, not an AF post, but the two sites share a database and have one-directional auto-crossposting from AF to LW.)
Thanks Anna Salamon for the idea of making an AI which cares about what happens in a counterfactual ideal world, rather than the real world with the transistors in it, as a corrigibility strategy. I haven't yet been able to find a way to make that idea work for an agent/utility maximizer, but it inspired the idea of doing the same thing in an oracle.
To clarify, what I meant was not that they need a source of shared randomness, but that they need a shared probability distribution; ie, having dice isn't enough, they also need to coordinate on a way of interpreting the dice, which is similar to the original problem of coordinating on an ordering over points.
I don't think the mechanics of the problem, as specified, let them mutually specify random things without something like an externally-provided probability distribution. This is aimed at eliminating that requirement. But it may be that this issue isn't very illuminating and would be better addressed by adjusting the problem formulation to provide that.
The procrastination paradox is isomorphic to well-founded recursion. In the reasoning, the fourth step, "whether or not I press the button, the next agent or an agent after that will press the button", is an invalid proof-step; it shows that there is a chain of inductive steps ending at the conclusion, but not that the chain has a base case.
This can only happen when the relation between an agent and its successor is not well-founded. If there is any well-founded relation between agents and their successors - either because they're in a finite universe, or because the first agent picked a well-founded relation and built that in - then the button will eventually get pushed.
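As a toy illustration (the finite-chain model here is my own simplification, not part of the original argument), a finite chain of deferring agents always bottoms out, because the last agent has no successor to defer to:

```python
def button_gets_pressed(num_agents: int) -> bool:
    """Each agent defers to its successor if it has one; the last agent,
    having no successor, must press the button itself (the base case)."""
    for i in range(num_agents):
        if i == num_agents - 1:
            return True  # base case: no successor to defer to
        # inductive step: defer to agent i + 1
    return False  # only reached with zero agents

assert button_gets_pressed(1)
assert button_gets_pressed(10)
```

With a non-well-founded successor relation there is no last agent, so the loop's base case is never reached and the deferral can continue forever.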
Point (1) seems to be a combination of an issue of working around the absence of a mathematically-elegant communication channel in the formalism, and an incentive to choose some orderings over others because of (2). If (2) is solved and they can communicate, then they can agree on an ordering without any trouble because they're both indifferent to which one is chosen.
If you don't have communication but you have solved (2), I think you can solve the problem by splitting agents into two stages. In the first stage, agents try to coordinate on an ordering over the points. To do this, the two agents X and Y each generate a bag of orderings Ox and Oy that they think might be Schelling points. Agent X first draws an ordering from Ox and tries to prove coordination on it in PA+1, then draws another with replacement and tries to prove coordination on it in PA+2, then draws another with replacement and tries to prove coordination on it in PA+3, etc. Agent Y does the same thing, with a different bag of proposed orderings. If there is overlap between their respective bags, then the odds that they will fail to find a coordination point fall off exponentially, albeit slowly.
Then in the second stage, after proving in PA+n for some n that they will both go through the same ordering, each will try to prove coordination on point 1 in PA+n+1, on point 2 in PA+n+2, etc, for the points they find acceptable.
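The first stage's combinatorics can be sketched by abstracting the proof search away: assume, as in the Löbian handshake, that the round-k proof goes through exactly when both agents drew the same ordering at round k. The bag contents below are illustrative placeholders, not anything from the original setup.

```python
import random

def first_stage(bag_x, bag_y, max_rounds=1000, rng=random):
    """Draw with replacement until both agents propose the same ordering.

    Returns the round k at which coordination would be proven in PA+k,
    or None if the round budget runs out (the probability of which falls
    off exponentially with the number of rounds when the bags overlap).
    """
    for k in range(1, max_rounds + 1):
        if rng.choice(bag_x) == rng.choice(bag_y):
            return k
    return None

rng = random.Random(0)
bag_x = ["by_x_coordinate", "by_distance_to_origin", "lexicographic"]
bag_y = ["lexicographic", "by_y_coordinate", "by_distance_to_origin"]
assert first_stage(bag_x, bag_y, rng=rng) is not None
assert first_stage(["a"], ["b"], max_rounds=50) is None  # disjoint bags never match
```

Here two of the three orderings are shared, so each round matches with probability 2/9 and the chance of exhausting 1000 rounds is negligible; with disjoint bags, coordination never happens.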
Regarding (2), the main problem is that this creates an incentive for agents to choose orderings that favor themselves when there is overlap between the acceptable regions, and this creates a high chance that they won't be able to agree on an ordering at all. Jessica Taylor's solution solves the problem of not being able to find an ordering, but at the cost of all the surplus utility that was in the region of overlap. For example, if Janos and I are deciding how to divide a dollar, and I offer that Janos keeps it while Janos offers that I keep it, that solution would have us set it on fire instead.
Instead, perhaps we could redefine the algorithm so that "cooperation at point N" means entering another round of negotiation, where only points that each agent finds at least as good as N are considered, and negotiation continues until it reaches a fixed point.
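A minimal sketch of that renegotiation loop (the payoff-pair representation and the focal-point rule are illustrative assumptions of mine, not part of the proposal):

```python
def negotiate(points, focal):
    """points: list of (u_x, u_y) payoff pairs; focal picks one point.

    "Cooperation at point N" opens another round restricted to the points
    both agents weakly prefer to N, until N is a fixed point.
    """
    current = focal(points)
    while True:
        better = [p for p in points
                  if p[0] >= current[0] and p[1] >= current[1]
                  and p != current]
        if not better:
            return current  # fixed point: no point both agents weakly prefer
        current = focal(better)

pts = [(5, 5), (6, 6), (9, 1), (7, 7), (1, 9)]
assert negotiate(pts, focal=lambda ps: ps[0]) == (7, 7)
```

Unlike the set-it-on-fire outcome, the loop climbs from the initial proposal to a point neither agent can improve on without hurting the other, so the surplus in the overlap region is kept rather than destroyed.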
How to actually convert this into an algorithm? I haven't figured out all the technical details, but I think the key is having agents prove things of the form "we'll coordinate on a point I find at least as good as point N".
This relates to what in Boston we've been calling the Ensemble Stability problem: given multiple utility functions, some of which may be incorrect, how do you keep the AI from sacrificing the other values for the incorrect one(s)? Maximin is a step in the right direction, but I don't think it fully solves the problem.
I see two main issues. First, suppose one of the utility functions in the set is erroneous, and the AI predicts that in the future, we'll realize this and create a different AI that optimizes without it. Then the AI will be incentivized to prevent the creation of that AI, or to modify it into including the erroneous value. The second issue is that, if one of the utility functions is offset so it outputs a score well below the others, the other utility functions will be crowded out in the AI's attention and resource allocation.
One approach to the latter problem might be to make a utility function aggregation that approaches maximin behavior in the limit as the AI's resources go to infinity, but starts out more linear.
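One hypothetical family with that behavior is a softmin aggregator: as a sharpness parameter beta grows, it slides from the mean of the utilities (roughly linear) toward their minimum (maximin). How beta should scale with resources is left open here; the mapping below is just a placeholder.

```python
import math

def soft_aggregate(utilities, beta):
    """Softmin aggregator: ~mean for beta near 0, ~min for large beta.

    In the proposal, beta would be an increasing function of the AI's
    resources (that mapping is left unspecified here).
    """
    if beta == 0:
        return sum(utilities) / len(utilities)
    m = min(utilities)  # subtract the min for numerical stability
    n = len(utilities)
    return m - (1.0 / beta) * math.log(
        sum(math.exp(-beta * (u - m)) for u in utilities) / n)

us = [1.0, 5.0, 9.0]
assert abs(soft_aggregate(us, 0) - 5.0) < 1e-9       # linear (mean) regime
assert abs(soft_aggregate(us, 1000.0) - 1.0) < 1e-2  # maximin regime
```

Notably, this also blunts the offset problem: a utility function shifted far below the others dominates only once beta is large, so early on the AI still pays attention to the rest of the ensemble.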
This generalizes nicely. The asteroid problem provides a nice partitioning into two pieces, such that either piece alone has no effect, but the two pieces together have an effect. But most problems won't have such a partition built in.
If we want the answer to a yes/no question, the first instinct would be that no such partitioning is possible: if two AIs each provide less than 1 bit of information, then combining them won't produce a reliable answer. But we can make it work by combining the yes/no question with some other problem, as follows.
Suppose you want the answer to a yes-or-no question Q. Then pick a hard problem H, which is an inconsequential yes-or-no question that AIs can solve reliably, but which humans can't, and for which P(H)=0.5. Take two AIs X and Y. The first AI outputs X=xor(Q,H), and believes that the second AI will output a coin flip. The second AI outputs Y=H, and believes that the first AI will output a coin flip. Then the answer can be obtained by combining the two outputs, xor(X,Y).
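The construction can be checked exhaustively. Here the two AIs are stood in for by trivial functions that already hold the answers, since only the information flow matters: each output alone is an unbiased coin from the human's perspective (because P(H)=0.5 and H is independent of Q), while the XOR of both recovers Q.

```python
def ai_x(q: bool, h: bool) -> bool:
    return q ^ h  # first AI: the answer, masked by the hard problem

def ai_y(h: bool) -> bool:
    return h      # second AI: just the hard problem's answer

# xor(X, Y) recovers Q for every combination of Q and H
for q in (False, True):
    for h in (False, True):
        assert (ai_x(q, h) ^ ai_y(h)) == q
```

The mask cancels because xor(xor(Q,H), H) = Q; the scheme depends on neither AI believing its own output will be combined with the other's, which is where the counterfactual beliefs come in.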