Daniel Kokotajlo

Daniel Kokotajlo's Comments

Predictors exist: CDT going bonkers... forever
To summarize my confusion, does CDT require that the agent unconditionally believe in perfect free will independent of history (and, ironically, with no causality for the exercise of will)? If so, that should be the main topic of dispute - the frequency of actual cases where it makes bad predictions, not that it makes bad decisions in ludicrously-unlikely-and-perhaps-impossible situations.

Sorta, yes. CDT requires that you choose actions not by thinking "conditional on my doing A, what happens?" but rather by some other method (there are different variants) such as "For each causal graph that I think could represent the world, what happens when I intervene (in Pearl's sense) on the node that is my action, to set it to A?" or "Holding fixed the probability of all variables not causally downstream of my action, what happens if I do A?"

In the first version, notice that you are choosing actions by imagining a Pearl-style intervention into the world--but this is not something that actually happens; the world doesn't actually contain such interventions.

In the second version, well, notice that you are choosing actions by imagining possible scenarios that aren't actually possible--or at least, you are assigning the wrong probabilities to them. ("holding fixed the probability of all variables not causally downstream of my action...")

So one way to interpret CDT is that it believes in crazy stuff like hardcore incompatibilist free will. But the more charitable way to interpret it is that it doesn't believe in that stuff, it just acts as if it does, because it thinks that's the rational way to act. (And they have plenty of arguments for why CDT is the rational way to act, e.g. the intuition pump "If the box is already either full or empty and you can't change that no matter what you do, then no matter what you do you'll get more money by two-boxing, so...")
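
To make the contrast concrete, here is a rough sketch of the two calculations in a Newcomb-style problem. The payoffs, the predictor accuracy, and every function name are assumptions chosen for illustration; nothing here is a canonical formalization of either decision theory. The "conditional" calculation updates the probability that the box is full on the action taken, while the "holding fixed" calculation keeps that probability at whatever prior the agent already had.

```python
# Illustrative numbers only: these are assumptions, not values from the discussion.
ACCURACY = 0.99                 # assumed chance the predictor foresees the action
BIG, SMALL = 1_000_000, 1_000   # assumed contents of the opaque and transparent box
ACTIONS = ("one-box", "two-box")


def payoff(action: str, box_full: bool) -> int:
    """Money received for an action given the state of the opaque box."""
    return (BIG if box_full else 0) + (SMALL if action == "two-box" else 0)


def conditional_value(action: str) -> float:
    """'Conditional on my doing A, what happens?': P(full) depends on the action."""
    p_full = ACCURACY if action == "one-box" else 1 - ACCURACY
    return p_full * payoff(action, True) + (1 - p_full) * payoff(action, False)


def cdt_value(action: str, prior_full: float) -> float:
    """Hold P(full) fixed, since the box is not causally downstream of the action."""
    return prior_full * payoff(action, True) + (1 - prior_full) * payoff(action, False)


def conditional_choice() -> str:
    return max(ACTIONS, key=conditional_value)


def cdt_choice(prior_full: float) -> str:
    return max(ACTIONS, key=lambda a: cdt_value(a, prior_full))


if __name__ == "__main__":
    print("conditional reasoning picks:", conditional_choice())
    for prior in (0.0, 0.5, 1.0):
        print(f"CDT with P(full)={prior} picks:", cdt_choice(prior))
```

Under these assumed numbers, the fixed-probability calculation favors two-boxing by exactly the small prize whatever the prior is, which is the intuition pump quoted above in arithmetic form.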

The Main Sources of AI Risk?

Thank you for making this list. I think it is important enough to be worth continually updating and refining; if you don't do it then I will myself someday. Ideally there'd be a whole webpage or something, with the list refined so as to be disjunctive, and each element of the list catchily named, concisely explained, and accompanied by a memorable and plausible example. (As well as lots of links to literature.)

I think the commitment races problem is mostly but not entirely covered by #12 and #19, and at any rate might be worth including since you are OK with overlap.

Also, here's a good anecdote to link to for the "coding errors" section: https://openai.com/blog/fine-tuning-gpt-2/

Predictors exist: CDT going bonkers... forever

Well said.

I had a similar idea a while ago and am working it up into a paper ("CDT Agents are Exploitable"). Caspar Oesterheld and Vince Conitzer are also doing something like this. And then there is Ahmed's Betting on the Past case.

In their version, the Predictor offers bets to the agent, at least one of which the agent will accept (for the reasons you outline) and thus they get money-pumped. In my version, there is no Predictor, but instead there are several very similar CDT agents, and a clever human bookie can extract money from them by exploiting their inability to coordinate.

Long story short, I would bet that an actual AGI which was otherwise smarter than me but which doggedly persisted in doing its best to approximate CDT would fail spectacularly one way or another, "hacked" by some clever bookie somewhere (possibly in its hypothesis space only!). Unfortunately, arguably the same is true for all decision theories I've seen so far, but for different reasons...

Malign generalization without internal search

You are right; my comment was based on a misunderstanding of what you were saying. Hence why I unendorsed it.

(I read "In this post, I will outline a general category of agents which may exhibit malign generalization without internal search, and then will provide a concrete example of an agent in the category. Then I will argue that, rather than being a very narrow counterexample, this class of agents could be competitive with search-based agents." and thought you meant agents that don't use internal search at all.)

Malign generalization without internal search
Consider an agent that could, during its operation, call upon a vast array of subroutines. Some of these subroutines can accomplish extremely complicated actions, such as "Prove this theorem: [...]" or "Compute the fastest route to Paris." We then imagine that this agent still shares the basic superstructure of the pseudocode I gave initially above.

Computing the fastest route to Paris doesn't involve search?

More generally, I think in order for it to work your example can't contain subroutines that perform search over actions. Nor can it contain subroutines such that, when called in the order that the agent typically calls them, they collectively constitute a search over actions.

And it's still not obvious to me that this is viable. It seems possible in principle (just imagine a sufficiently large look-up table!) but it seems like it probably wouldn't be competitive with agents that do search at least to the extent that humans do. After all, humans evolved to do search over actions, but we totally didn't have to--if bundles of heuristics worked equally well for the sort of complex environments we evolved in, then why didn't we evolve that way instead?

EDIT: Just re-read and realized you are OK with subroutines that explicitly perform search over actions. But why? Doesn't that undermine your argument? Like, suppose we have an architecture like this:

LOOP:
    State = GetStateOfWorld(Observation)
    IF State == InPain: Cry&FlailAbout
    ELSE IF State == AttractiveMateStraightAhead: MoveForward&Grin
    ELSE: Do(RunSubroutine[SearchOverActionsAndOutputActionThoughtToYieldGreatestExpectedNumberOfGrandchildren])


This seems not meaningfully different from the version that doesn't have the first two IF statements, as far as talk of optimizers is concerned.
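
For what it's worth, here is one way the architecture above might look as runnable Python. Every name is a hypothetical stand-in for a placeholder in the pseudocode, and the scoring function is just whatever the caller passes in; this is a sketch of the shape of the thing, not a claim about how any real agent is built.

```python
from typing import Callable, Sequence


def get_state_of_world(observation: str) -> str:
    # Assumption: the observation already names the state directly.
    return observation


def search_over_actions(actions: Sequence[str],
                        expected_grandchildren: Callable[[str], float]) -> str:
    # Explicit search: score every candidate action and return the best one.
    return max(actions, key=expected_grandchildren)


def step(observation: str,
         actions: Sequence[str],
         expected_grandchildren: Callable[[str], float]) -> str:
    """One pass through the loop: two hard-coded reflexes, else a search."""
    state = get_state_of_world(observation)
    if state == "InPain":
        return "CryAndFlailAbout"
    if state == "AttractiveMateStraightAhead":
        return "MoveForwardAndGrin"
    return search_over_actions(actions, expected_grandchildren)


if __name__ == "__main__":
    scores = {"forage": 1.5, "build_shelter": 2.0}  # made-up scores
    print(step("InPain", list(scores), scores.get))
    print(step("Hungry", list(scores), scores.get))
```

Deleting the two reflex branches leaves `step` as a bare call to `search_over_actions`, which is the comparison being drawn.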

[This comment is no longer endorsed by its author]
The "Commitment Races" problem

I don't think I was missing that element. The way I think about it is: There is some balance that must be struck between making commitments sooner (risking making foolish decisions due to ignorance) and later (risking not having the right commitments made when a situation arises in which they would be handy). A commitment race is a collective action problem where individuals benefit from going far to the "sooner" end of the spectrum relative to the point that would be optimal for everyone if they could coordinate.

I agree about humans not being able to make commitments--at least, not arbitrary commitments. (Arguably, getting angry and seeking revenge when someone murders your family is a commitment you made when you were born.) I think we should investigate whether this inability is something evolution "chose" or not.

I agree it's a race in knowledge/understanding as well as time. (The two are related.) But I don't think more knowledge = more power. For example, if I don't know anything and decide to commit to plan X which benefits me, else war, and you know more than me--in particular, you know enough about me to know what I will commit to--and you are cowardly, then you'll go along with my plan.

A dilemma for prosaic AI alignment

Thanks! I endorse that summary.

Comment on your planned opinion: I mostly agree; I think what this means is that prosaic AI safety depends somewhat on an empirical premise: That joint training doesn't bring a major competitiveness penalty. I guess I only disagree insofar as I'm a bit more skeptical of that premise. What does the current evidence on joint training say on the matter? I have no idea, but I am under the impression that you can't just take an existing training process--such as the one that made AlphaStar--and mix in some training tasks from a completely different domain and expect it to work. This seems like evidence against the premise to me. As someone (Paul?) pointed out in the comments when I said this, this point applies to fine-tuning as well. But if so that just means that the second and third ways of the dilemma are both uncompetitive, which means prosaic AI safety is uncompetitive in general.

A dilemma for prosaic AI alignment
Supervised learning has lots of commercial applications, including cases where it competes with humans. The fact that RL doesn't suggests to me that if you can apply both to a problem, RL is probably an inferior approach.

Good point. New argument: Your argument could have been made in support of GOFAI twenty years ago: "Symbol-manipulation programs have had lots of commercial applications, but neural nets have had almost none, therefore the former is a more generally powerful and promising approach to AI than the latter." But not only does it seem wrong in retrospect, it was probably not a super powerful argument even then. Analogously, I think it is too early to tell whether RL or supervised learning will be more useful for powerful AI.

Simulation of what? Selection of what? I don't think those count for my purposes, because they punt the question. (e.g. if you are simulating an agent, then you have an agent-architecture. If you are selecting over things, and the thing you select is an agent...) I think computer program is too general since it includes agent architectures as a subset. These categories are fuzzy of course, so maybe I'm confused, but it still seems to make sense in my head.

(Ah, interesting, it seems that you want to standardize "agent-like architecture" in the opposite of the way that I want to. Perhaps this is underlying our disagreement. I'll try to follow your definition henceforth, but remember that everything I've said previously was with my definition.)

Good point to distinguish between the two. I think that all bullet points, to varying extents, might still qualify as genuine benefits, in the sense that you are talking about. But they might not. It depends on whether there is another policy just as good along the path that the cutting-edge training tends to explore. I agree #2 is probably not like this, but I think #3 might be. (Oh wait, no, it's your terminology I'm using now... in that case, I'll say "#3 isn't an example of agent-like architecture being beneficial to text prediction, but it might well be a case of a lower-level architecture, exactly like an agent-like architecture except lower level, being beneficial to text prediction, supposing that it's not competitive to predict text except by simulating something like a human writing.")

I love your idea to generate a list of concrete scenarios of accidental agency! These 3.5 are my contributions off the top of my head; if I think of more I'll come back and let you know. And I'd love to see your list if you have a draft somewhere!

I agree the "universal prior is malign" thing could hurt a non-agent architecture too, and that some agent architectures wouldn't be susceptible to it. Nevertheless it is an example of how you might get accidental agency, not in your sense but in my sense: A non-agent architecture could turn out to have an agent as a subcomponent that ends up taking over the behavior at important moments.

A dilemma for prosaic AI alignment

Thanks btw, I'm learning a lot from these replies. Are you thinking of training something agenty, or is the hope to train something that isn't agenty?

A dilemma for prosaic AI alignment

OK, thanks! I'm pleased to see this and other empirical premises explicitly laid out. It means we as a community are making predictions about the future based on models which can be tested before it's too late, and perhaps even now.
