All of Vladimir_Nesov's Comments + Replies

Stable Pointers to Value: An Agent Embedded in Its Own Utility Function

The problem of figuring out preference without wireheading seems very similar to the problem of maintaining factual knowledge about the world without suffering from appeals to consequences. In both cases a specialized part of agent design (model of preference or model of a fact in the world) has a purpose (accurate modeling of its referent) whose pursuit might be at odds with consequentialist decision making of the agent as a whole. The desired outcome seems to involve maintaining integrity of the specialized part, resisting corruption of consequentialist ... (read more)

Why You Should Care About Goal-Directedness

Yeah, that was sloppy of the article. In context, the quote makes a bit of sense, and the qualifier "in every detail" does useful work (though I don't see how to make the argument clear just by defining what these words mean), but without context it's invalid.

[1] Adam Shimi, 6mo: Sorry for my last comment, it was more a knee-jerk reaction than a rational conclusion. My issue here is that I'm still not sure what would be a good replacement for the above quote, one that still keeps intact the value of having compressed representations of systems following goals. Do you have an idea?
Why You Should Care About Goal-Directedness

Having an exact model of the world that contains the agent doesn't require any explicit self-references or references to the agent. For example, if there are two programs whose behavior is equivalent, A and A', and the agent correctly thinks of itself as A, then it can also know the world to be a program W(A') with some subexpressions A', but without subexpression A. To see the consequences of its actions in this world, it would be useful for the agent to figure out that A is equivalent to A', but it is not necessary that this is known to the agent from th... (read more)
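A minimal sketch of the construction, with made-up toy programs standing in for A, A' and W (illustrative only, not anything from the original discussion):

```python
# Toy illustration: an exact model of the world need not contain the agent's own
# source code, only some program equivalent to it.

def agent_A(observation):        # the program the agent knows itself to be
    return observation % 3

def agent_A_prime(observation):  # behaviorally equivalent, syntactically different
    return (observation + 3) % 3

def world_W(agent):              # the world as a program W(-) with the agent as a subexpression
    state = 7
    action = agent(state)
    return state + action        # the next state of the world

# The agent can hold world_W(agent_A_prime) as an exact world model; no
# subexpression of that model is literally agent_A. To predict the consequences
# of its own actions it helps to know that the two programs are equivalent
# (undecidable in general, but checkable here on the relevant inputs).
assert all(agent_A(n) == agent_A_prime(n) for n in range(100))
print(world_W(agent_A_prime))    # the agent's prediction of what the world does: 8
```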

[1] Adam Shimi, 6mo: Thanks for the additional explanations. That being said, I'm not an expert on Embedded Agency, and that's definitely not the point of this post, so just writing things that are explicitly said in the corresponding sequence is good enough for my purpose. Notably, the section on Embedded World Models [https://www.lesswrong.com/s/Rm6oQRJJmhGCcLvxh/p/i3BTagvt3HbPMx6PN#3__Embedded_world_models] from Embedded Agency [https://www.lesswrong.com/s/Rm6oQRJJmhGCcLvxh/p/i3BTagvt3HbPMx6PN] begins with: Maybe that's not correct/exact/the right perspective on the question. But once again, I'm literally giving a two-sentence explanation of what the approach says, not the ground truth or a detailed investigation of the subject.
Why You Should Care About Goal-Directedness

The quote sounds like an argument for the non-existence of quines, or of the context in which things like the diagonalization lemma are formulated. I think it obviously sounds like this, so raising a nonspecific concern in my comment above should've been enough to draw attention to the issue. It's also not a problem Agent Foundations explores, but it's presented as such. Given your background and the effort put into the post, this interpretation of the quote seems unlikely (which is why I didn't initially clarify, to give you the first move). So I'm confused. Everyth... (read more)
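For concreteness, quines do exist; a standard two-line Python example (not from the original thread) shows that exact self-representation without an infinite regress is possible:

```python
# The two lines below print exactly their own source (this comment aside), i.e. a quine.
s = 's = %r\nprint(s %% s)'
print(s % s)
```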

[1] Adam Shimi, 6mo: I do appreciate you pointing out this issue, and giving me the benefit of the doubt. That being said, I prefer that comments clarify the issue raised, if only so that I'm more sure of my interpretation. The up- and downvotes in this thread are, I think, representative of this preference (not that I downvoted your post -- I was glad for feedback). About the quote itself, rereading it and rereading Embedded Agency [https://www.alignmentforum.org/s/Rm6oQRJJmhGCcLvxh/p/i3BTagvt3HbPMx6PN], I think you're right about what I wrote not being an Agent Foundations problem (at least not one I know of). What I had in mind was more about non-realizability [https://www.alignmentforum.org/s/Rm6oQRJJmhGCcLvxh/p/i3BTagvt3HbPMx6PN#3_1__Realizability] and self-reference [https://www.alignmentforum.org/s/Rm6oQRJJmhGCcLvxh/p/i3BTagvt3HbPMx6PN#3_2__Self_reference] in the context of decision/game theory. I seem to have mixed the two with naive Gödelian self-reference in my head at the time of writing, which resulted in this quote. Do you think that this proposed change solves your issues? "This has many ramifications, including non-realizability [https://www.alignmentforum.org/s/Rm6oQRJJmhGCcLvxh/p/i3BTagvt3HbPMx6PN#3_1__Realizability] (the impossibility for the agent to contain an exact model of the world, because it is inside the world and thus smaller), self-referential issues [self-reference] in the context of game theory (because the model is part of the agent, which is part of the world, so other agents can access it and exploit it), and the need to find an agent/world boundary (as it's not given for free like in the dualistic perspective)."
Why You Should Care About Goal-Directedness

Trouble comes from self-reference: since the agent is part of the world, so is its model, and thus a perfect model would need to represent itself, and this representation would need to represent itself, ad infinitum. So the model cannot be exact.

???

[5] Adam Shimi, 6mo: What's the issue?
Vanessa Kosoy's Shortform

I agree. But GPT-3 seems to me like a good estimate for how much compute it takes to run stream of consciousness imitation learning sideloads (assuming that learning is done in batches on datasets carefully prepared by non-learning sideloads, so the cost of learning is less important). And with that estimate we already have enough compute overhang to accelerate technological progress as soon as the first amplified babbler AGIs are developed, which, as I argued above, should happen shortly after babblers actually useful for automation of human jobs are deve... (read more)

[2] Vanessa Kosoy, 8mo: Another thing that might happen is a data bottleneck. Maybe there will be a good enough dataset to produce a sideload that simulates an "average" person, and that will be enough to automate many jobs, but for a simulation of a competent AI researcher you would need a more specialized dataset that will take more time to produce (since there are a lot fewer competent AI researchers than people in general). Moreover, it might be that the sample complexity grows with the duration of coherent thought that you require. That's because, unless you're training directly on brain inputs/outputs, non-realizable (computationally complex) environment influences contaminate the data, and in order to converge you need to have enough data to average them out, which scales with the length of your "episodes". Indeed, all convergence results for Bayesian algorithms we have in the non-realizable setting require ergodicity, and therefore the time of convergence (= sample complexity) scales with mixing time, which in our case is determined by episode length. In such a case, we might discover that many tasks can be automated by sideloads with short coherence time, but AI research might require substantially longer coherence times. And simulating progress by design requires going off-distribution along certain dimensions, which might make things worse.
Vanessa Kosoy's Shortform

I was arguing that near human level babblers (including the imitation plateau you were talking about) should quickly lead to human level AGIs by amplification via stream of consciousness datasets, which doesn't pose new ML difficulties other than design of the dataset. Superintelligence follows from that by any of the same arguments as for uploads leading to AGI (much faster technological progress; if amplification/distillation of uploads is useful straight away, we get there faster, but it's not necessary). And amplified babblers should be stronger than v... (read more)

[1] Vanessa Kosoy, 8mo: The imitation plateau can definitely be rather short. I also agree that computational overhang is the major factor here. However, a failure to capture some of the ingredients can be a cause of low computational overhang, whereas success in capturing all of the ingredients is a cause of high computational overhang, because the compute necessary to reach superintelligence might be very different in those two cases. Using sideloads to accelerate progress might still require years, whereas an "intrinsic" AGI might lead to the classical "foom" scenario. EDIT: Although, since training is typically much more computationally expensive than deployment, it is likely that the first human-level imitators will already be significantly sped up compared to humans, implying that accelerating progress will be relatively easy. It might still take some time from the first prototype until such an accelerate-the-progress project, but probably not much longer than deploying lots of automation.
Vanessa Kosoy's Shortform

To me this seems to be essentially another limitation of the human Internet archive dataset: reasoning is presented in an opaque way (most slow/deliberative thoughts are not in the dataset), so it's necessary to do a lot of guesswork to figure out how it works. A better dataset both explains and summarizes the reasoning (not to mention gets rid of the incoherent nonsense, but even GPT-3 can do that to an extent by roleplaying Feynman).

Any algorithm can be represented by a habit of thought (Turing machine style if you must), and if those are in the dataset,... (read more)

[1] Vanessa Kosoy, 8mo: I don't see any strong argument why this path will produce superintelligence. You can have a stream of thought that cannot be accelerated without investing a proportional amount of compute, while a completely different algorithm would produce a far superior "stream of thought". In particular, such an approach cannot differentiate between features of the stream of thought that are important (meaning that they advance towards the goal) and features of the stream of thought that are unimportant (e.g. different ways to phrase the same idea). This forces you to solve a task that is potentially much more difficult than just achieving the goal.
Vanessa Kosoy's Shortform

This seems similar to gaining uploads prior to AGI, and opens up all those superorg upload-city amplification/distillation constructions which should get past human level shortly after. In other words, the limitations of the dataset can be solved by amplification as soon as the AIs are good enough to be used as building blocks for meaningful amplification, and something human-level-ish seems good enough for that. Maybe even GPT-n is good enough for that.

[1] Vanessa Kosoy, 8mo: That is similar to gaining uploads (borrowing terminology from Egan, we can call them "sideloads"), but it's not obvious amplification/distillation will work. In the model based on realizability, the distillation step can fail because the system you're distilling is too computationally complex (hence, too unrealizable). You can deal with it by upscaling the compute of the learning algorithm, but that's not better than plain speedup.
Towards a mechanistic understanding of corrigibility

I agree that exotic decision algorithms or preference transformations are probably not going to be useful for alignment, but I think this kind of activity is currently more fruitful for theory building than directly trying to get decision theory right. It's just that the usual framing is suspect: instead of exploration of the decision theory landscape by considering clearly broken/insane-acting/useless but not yet well-understood constructions, these things are pitched (and chosen) for their perceived use in alignment.

[1] David Krueger, 1y: What do you mean by "these things"? Also, to clarify, when you say "not going to be useful for alignment", do you mean something like "...for alignment of arbitrarily capable systems"? I.e., do you think they could be useful for aligning systems that aren't too much smarter than humans?
A Critique of Functional Decision Theory

By the way, selfish values seem related to the reward vs. utility distinction. An agent that pursues a reward that's about particular events in the world rather than a more holographic valuation seems more like a selfish agent in this sense than a maximizer of a utility function with a small-in-space support. If a reward-seeking agent looks for reward-channel-shaped patterns instead of the instance of a reward channel in front of it, it might tile the world with reward channels or search the world for more of them or something like that.

Formalising decision theory is hard

I was never convinced that "logical ASP" is a "fair" problem. I once joked with Scott that we can consider a "predictor" that is just the single line of code "return DEFECT" but in the comments it says "I am defecting only because I know you will defect."

I'm leaning this way as well, but I think it's an important clue to figuring out commitment races. ASP Predictor, DefectBot, and a more general agent will make different commitments, and these things are already algorithms specialized for certain situations. How is the chosen commitment related to what

... (read more)
Why we need a *theory* of human values

Yes, that's the almost fully general counterargument: punt all the problems to the wiser versions of ourselves.

It's not clear what the relevant difference is between then and now, so the argument that it's more important to solve a problem now is as suspect as the argument that the problem should be solved later.

How are we currently in a better position to influence the outcome? If we are, then the reason for being in a better position is a more important feature of the present situation than object-level solutions that we can produce.

[1] Stuart Armstrong, 2y: We have a much clearer understanding of the pressures we are under now than of the pressures simulated versions of ourselves would be under in the future. Also, we agree much more strongly with the values of our current selves than with the values of possible simulated future selves. Consequently, we should try to solve the problems with value alignment early, and punt technical problems to our future simulated selves. It's not particularly a question of influencing the outcome, but of reaching the right solution. It would be a tragedy if our future selves had great influence, but pernicious values.
Two Neglected Problems in Human-AI Safety

I worry that in the context of corrigibility it's misleading to talk about alignment, and especially about utility functions. If alignment characterizes goals, it presumes a goal-directed agent, but a corrigible AI is probably not goal-directed, in the sense that its decisions are not chosen according to their expected value for a persistent goal. So a corrigible AI won't be aligned (neither will it be misaligned). Conversely, an agent aligned in this sense can't be visibly corrigible, as its decisions are determined by its goals, not orders and wishes of

... (read more)
Three AI Safety Related Ideas

I thought the point of idealized humans was to avoid problems of value corruption or manipulation

Among other things, yes.

which makes them better than real ones

This framing loses the distinction I'm making. More useful when taken together with their environment, but not necessarily better in themselves. These are essentially real humans who behave better because of the environments where they operate and the lack of direct influence from the outside world, which in some settings could also apply to the environment where they were raised. But they share the

... (read more)
[1] Rohin Shah, 2y: Yeah, I agree with all of this. How would you rewrite my sentence/paragraph to be clearer, without making it too much longer?
Three AI Safety Related Ideas

If it's too hard to make AI systems in this way and we need to have them learn goals from humans, we could at least have them learn from idealized humans rather than real ones.

My interpretation of how the term is used here and elsewhere is that idealized humans are usually, in themselves and when we ignore costs, worse than real ones. For example, they could be based on predictions of human behavior that are not quite accurate, or they may only remain sane for an hour of continuous operation from some initial state. They are only better because they can

... (read more)
[1] Rohin Shah, 2y: Perhaps Wei Dai could clarify, but I thought the point of idealized humans was to avoid problems of value corruption or manipulation, which makes them better than real ones. I agree that idealized humans have the benefit of making things like infinite HCH possible, but that doesn't seem to be a main point of this post.
Why we need a *theory* of human values

More to the point, these failure modes are ones that we can talk about from outside

So can the idealized humans inside a definition of indirect normativity, which motivates them to develop some theory and then quarantine parts of the process to examine their behavior from outside the quarantined parts. If that is allowed, any failure mode that can be fixed by noticing a bug in a running system becomes anti-inductive: if you can anticipate it, it won't be present.

[3] Stuart Armstrong, 2y: Yes, that's the almost fully general counterargument: punt all the problems to the wiser versions of ourselves. But some of these problems are issues that I specifically came up with. I don't trust that idealised non-mes would necessarily have realised these problems even if put in that idealised situation. Or they might have come up with them too late, after they had already altered themselves. I also don't think that I'm particularly special, so other people can and will think up problems with the system that hadn't occurred to me or anyone else. This suggests that we'd need to include a huge number of different idealised humans in the scheme. Which, in turn, increases the chance of the scheme failing due to social dynamics, unless we design it carefully ahead of time. So I think it is highly valuable to get a lot of people thinking about the potential flaws and improvements for the system before implementing it. That's why I think that "punting to the wiser versions of ourselves" is useful, but not a sufficient answer. The better we can solve the key questions ("what are these 'wiser' versions?", "how is the whole setup designed?", "what questions exactly is it trying to answer?"), the better our wiser selves will be at their tasks.
Intuitions about goal-directed behavior

Learning how to design goal-directed agents seems like an almost inevitable milestone on the path to figuring out how to safely elicit human preference in an actionable form. But the steps involved in eliciting and enacting human preference don't necessarily make use of a concept of preference or goal-directedness. An agent with a goal aligned with the world can't derive its security from the abstraction of goal-directedness, because the world determines that goal, and so the goal is vulnerable to things in the world, including human error. Only self-conta

... (read more)
Intuitions about goal-directed behavior

My guess is that agents that are not primarily goal-directed can be good at defending against goal-directed agents (especially with first mover advantage, preventing goal-directed agents from gaining power), and are potentially more tractable for alignment purposes, if humans coexist with AGIs during their development and operation (rather than only exist as computational processes inside the AGI's goal, a situation where a goal concept becomes necessary).

I think the assumption that useful agents must be goal-directed has misled a lot of discussion of AI r

... (read more)
[1] Wei Dai, 2y: I think I disagree with this at least to some extent. Humans are not generally safe agents, and in order for not-primarily-goal-directed AIs to not exacerbate humans' safety problems (for example by rapidly shifting their environments/inputs out of a range where they are known to be relatively safe), it seems that we have to solve many of the same metaethical/metaphilosophical problems that we'd need to solve to create a safe goal-directed agent. I guess in some sense the former has lower "AI risk" than the latter in that you can plausibly blame any bad outcomes on humans instead of AIs, but to me that's actually a downside because it means that AI creators can more easily deny their responsibility to help solve those problems.
Impact Measure Desiderata

I worry there might be leaks in logical time that let the agent choose an action that takes into account that an impactful action will be denied. For example, a sub-agent could be built so that it's a maximizer that's not constrained by an impact measure. The sub-agent then notices that to maximize its goal, it must constrain its impact, or else the main agent won't be allowed to create it. And so it will so constrain its impact and will be allowed to be created, as a low-impact and maximally useful action of the main agent. It's sort of a daemon, but with

... (read more)
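A toy sketch of the loophole, with a made-up impact measure and made-up numbers (hypothetical code, not AUP's actual definitions): the sub-agent's policy is chosen already knowing it has to pass the main agent's low-impact check, so it self-limits whatever the measure can see.

```python
# Toy model of the "leak in logical time": the sub-agent optimizes its own goal
# subject to keeping its *measured* impact low, because otherwise the main agent
# would never be permitted to create it.

IMPACT_BUDGET = 1.0

def measured_impact(policy):
    # stand-in for whatever impact measure penalizes the main agent
    return policy["visible_impact"]

def best_subagent_policy():
    candidates = [
        {"goal_value": 10.0, "visible_impact": 5.0},  # openly impactful: would be rejected
        {"goal_value": 8.0, "visible_impact": 0.5},   # self-constrained: passes the check
    ]
    feasible = [p for p in candidates if measured_impact(p) <= IMPACT_BUDGET]
    return max(feasible, key=lambda p: p["goal_value"])

def main_agent_allows(action):
    return measured_impact(action) <= IMPACT_BUDGET

subagent = best_subagent_policy()
print(main_agent_allows(subagent))  # True: creating it looks low-impact, by construction
```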
[1] Alex Turner, 3y: That's a really interesting point. I'd like to think about this more, but one preliminary intuition I have against this (and any general successor creation by AUP, really) being the best action is that making new agents aligned with your goals is instrumentally convergent. This could add a frictional cost, so that the AUP agent would be better off just doing the job itself. Perhaps we could also stop this via approval incentives, which might tip the scales enough?
Impact Measure Desiderata

It could as easily be "do this one slightly helpful thing", an addition on top of doing nothing. It doesn't seem like there is an essential distinction between such different framings of the same outcome that intent verification can capture.

[1] Alex Turner, 3y: Whether these granular actions exist is also an open question I listed. I don't see why some version of IV won't be able to get past this, however. There seems to be a simple class of things the agent does to get around an impact measure that it wouldn't do if it were just trying to pursue a goal to the maximum extent. It might be true that the things the agent does to get around it are also slightly helpful for the goal, but probably not as helpful as the most helpful action.
Impact Measure Desiderata

I was talking about what I understand the purpose/design of intent verification to be, not specifically the formalizations you described. (I don't think it's particularly useful to work out the details without a general plan or expectation of important technical surprises.)

[1] Alex Turner, 3y: If you decompose the creation of such an agent, some of those actions are wasted effort in the eyes of a pure u_A maximizer ("don't help me too much"). So, the logic goes, they really aren't related to u_A, but rather to skirting the impact measure, and should therefore be penalized.
Impact Measure Desiderata

It's Rice's theorem, though really more about conceptual ambiguity. We can talk about particular notions of agents or goals, but it's never fully general, unless we by construction ensure that unexpected things can't occur. And even then it's not what we would have wanted the notions of agents or goals to be, because it's not clear what that is.

Intent verification doesn't seem to capture things that smuggle in a tiny bit of helpfulness when these things are actually required to deliver that helpfulness, especially after o... (read more)

[1] Alex Turner, 3y: Are you talking about granular actions, or coarse actions? The acceptable way to do IV for the latter is an open question, although I point out preliminary approaches.
Impact Measure Desiderata
Unleashing this agent would change resource availability and increase or decrease the power of an arbitrary maximizer from that vantage point.

It'll replenish the resources it takes, help any maximizer it impedes so as to exactly cancel out the impediment, etc.

Suppose that an arbitrary maximizer could not co-opt this new agent - its ability to achieve goals is decreased compared to if it hadn’t activated the agent.

To the extent its existence could pose a problem for another agent (according to the measure, which can't really talk about goals of ... (read more)

[1] Alex Turner, 3y: Then it specifically isn't allowed by intent verification. Are these your priors, or posteriors after having read my post? Because AUP is about the goals of arbitrary agents.
Impact Measure Desiderata

The sub-agent in this scenario won't be helping the main agent with achieving any goals. It only hides. Its nature precludes any usefulness. It's no more useful than its absence. But for the main agent, it's as easy to set up as its absence. And there might be reasons for this balance to be broken in favor of creating the sub-agent.

[1] Alex Turner, 3y: That isn't how AUP works. Unleashing this agent would change resource availability and increase or decrease the power of an arbitrary maximizer from that vantage point. Suppose that an arbitrary maximizer could not co-opt this new agent - its ability to achieve goals is decreased compared to if it hadn't activated the agent. On the other hand, if it can co-opt it, its ability is increased. This is not to mention the large amount of resources that would be used by such an expansive sub-agent, nor the fact that intent verification seemingly would not allow such a sub-agent to be built. I discuss this kind of thing in several places in the comments, if you're interested.
Impact Measure Desiderata
If we want to argue this, we should first come up with a terrible x-scenario which is low objective impact.

Here is a scenario that I think low impact methods can't address. Let's say the AI has the opportunity to easily (without affecting its ability to do its task) create a "free" copy or a sub-agent on the Internet. That sub-agent can be tasked with pursuing the goal of restricting whatever measure of impact the main agent operates under. The sub-agent very carefully eats the universe, while ensuring that according to the impact meas... (read more)

[1] Alex Turner, 3y: My new measure [https://www.lesswrong.com/posts/yEa7kwoMpsBgaBCgb/towards-a-new-impact-measure] captures this, if I understand correctly. This action drastically changes the agent's ability to achieve different goals.
Does UDT *really* get counter-factually mugged?

Game-aligned agents aren't useful for AI control as complete agents, since if you give them enough data from the real world, they start caring about it. This aspect applies more straightforwardly to very specialized sub-agents.

It's misleading to say that such agents assign probability zero to the real world, since the computations they optimize don't naturally talk about things like worlds at all. For example, consider a chess-playing AI that should learn to win against a fixed opponent program. It only reasons about chess; there is no reason to introduce

... (read more)
Does UDT *really* get counter-factually mugged?

The issue in the OP is that the possibility of other situations influences the agent's decision. The standard way of handling this is to agree to disregard other situations, including by appealing to Omega's stipulated ability to inspire belief (it's the whole reason for introducing the trustworthiness clause). This belief, if the reality of situations is treated equivalently to their probability in the agent's eyes, expels the other situations from consideration.

The idea Paul mentioned is just another way of making sure that the other situations don't intrude on the thou

... (read more)
[1] IAFF-User-111, 4y: I reason as follows:
  1. Omega inspires belief only after the agent encounters Omega.
  2. According to UDT, the agent should not update its policy based on this encounter; it should simply follow it.
  3. Thus the agent should act according to whatever the best policy is, according to its original (e.g. universal) prior from before it encountered Omega (or indeed learned anything about the world).
I think either:
  1. the agent does update, in which case, why not update on the result of the coin-flip? or
  2. the agent doesn't update, in which case, what matters is simply the optimal policy given the original prior.
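For reference, the tension the thread is about, worked out with the usual illustrative payoffs for counterfactual mugging (numbers assumed here, not taken from the post): the policy that pays is optimal under the original prior, while the agent that updates on seeing tails prefers to refuse.

```python
# Counterfactual mugging with illustrative payoffs: pay $100 on tails; Omega pays
# $10,000 in the heads branch iff your policy is to pay on tails.

P_HEADS = 0.5

def expected_value(policy_pays: bool) -> float:
    heads_payoff = 10_000 if policy_pays else 0  # Omega rewards the policy, not the act
    tails_payoff = -100 if policy_pays else 0
    return P_HEADS * heads_payoff + (1 - P_HEADS) * tails_payoff

print(expected_value(True))   # 4950.0 -> best policy under the original prior
print(expected_value(False))  # 0.0    -> refusing; after updating on tails, paying just loses $100
```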
Does UDT *really* get counter-factually mugged?

An agent that cares only about given worlds is a useful concept. If these worlds are more like gameboards of an abstract game (with the agent being part of the gameboards), we can talk about game-aligned AI. By virtue of only caring about the abstract game it won't be motivated to figure out how its decisions influence our physical world (which it doesn't care about) and so won't normally be dangerous despite not being human-aligned.
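As a toy illustration of a game-aligned agent (hypothetical code, tic-tac-toe instead of chess for brevity): everything the agent represents or optimizes over is the board; nothing in it refers to a world outside the game.

```python
# A game-aligned toy agent: its whole "world" is the board state.
from functools import lru_cache

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6), (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]

def winner(board):
    for a, b, c in LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None

@lru_cache(maxsize=None)
def value(board, player):
    # minimax value of the position for "X"; this is all the agent ever optimizes
    w = winner(board)
    if w:
        return 1 if w == "X" else -1
    moves = [i for i, cell in enumerate(board) if cell == " "]
    if not moves:
        return 0
    children = [value(board[:i] + player + board[i + 1:], "O" if player == "X" else "X")
                for i in moves]
    return max(children) if player == "X" else min(children)

def best_move(board):
    moves = [i for i, cell in enumerate(board) if cell == " "]
    return max(moves, key=lambda i: value(board[:i] + "X" + board[i + 1:], "O"))

print(best_move(" " * 9))  # an optimal opening move, computed purely inside the game
```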

[0] IAFF-User-111, 4y: This seems only loosely related to my OP. But it is quite interesting... so you're proposing that we can make safe AIs by, e.g., giving them a prior which puts 0 probability mass on worlds where dangerous instrumental goals are valuable. The simplest way would be to make the agent believe that there is no past/future (thus giving us a more "rational" contextual bandit algorithm than we would get by just setting a horizon of 0). However, Mathieu Roy suggested to me that acausal trade might still emerge, and I think I agree based on the open-source prisoner's dilemma. Anyways, I think that's a promising avenue to investigate. Having a good model of the world seems like a necessary condition for an AI to pose a significant X-risk.
My current take on the Paul-MIRI disagreement on alignability of messy AI

Unaligned AIs don't necessarily have efficient idealized values. Waiting for (simulated) humans to decide is analogous to computing a complicated pivotal fact about an unaligned AI's values. It's not clear that "naturally occurring" unaligned AIs have simpler idealized/extrapolated values than aligned AIs with upload-based value definitions. Some unaligned AIs may actually be on the losing side; recall the encrypted-values AI example.

My current take on the Paul-MIRI disagreement on alignability of messy AI

Being able to "gracefully step aside" (to be replaced) is an example of what I meant by "limited scope" (in time). Even if the AI's scope is "broad", the crucial point is that it's not literally everything (and by default it is). In practice it shouldn't be more than a small part of the future, so that the rest can be optimized better, using new insights. (Also, to be able to ask what humans would want today, there should remain some humans who didn't get "optimized" into something else.)

My current take on the Paul-MIRI disagreement on alignability of messy AI

It seems to me like for the people to get stuck you have to actually imagine there is some particular level they reach where they can’t find any further way to self-improve.

For philosophy, levels of ability are not comparable, because the problems to be solved are not sufficiently formulated. Approximate one-day humans (as in HCH) will formulate different values from accurate ten-year humans, not just be worse at elucidating them. So perhaps you could re-implement cognition starting from approximate one-day humans, but the values of the resulting process won't

... (read more)
My current take on the Paul-MIRI disagreement on alignability of messy AI

Speaking for myself, the main issue is that we have no idea how to do step 3, how to tell a pre-existing sovereign what to do. A task AI with limited scope can be replaced, but an optimizer has to be able to understand what is being asked of it, and if it wasn't designed to be able to understand certain things, it won't be possible to direct it correctly. If in 100 years the humans come up with new principles in how the AI should make decisions (philosophical progress), it may be impossible to express these principles as directions for an existing AI that

... (read more)
[0] Paul Christiano, 4y: It's not clear to me why "limited scope" and "can be replaced" are related. An agent with broad scope can still be optimizing something like "what the human would want me to do today" and the human could have preferences like "now that humans believe that an alternative design would have been better, gracefully step aside." (And an agent with narrow scope could be unwilling to step aside if so doing would interfere with accomplishing its narrow task.)
Updatelessness and Son of X

UDT, in its global policy form, is trying to solve two problems: (1) coordination between the instances of an agent faced with alternative environments; and (2) not losing interest in counterfactuals as soon as observations contradict them. I think that in practice, UDT is a wrong approach to problem (1), and the way in which it solves problem (2) obscures the nature of that problem.

Coordination, achieved with UDT, is like using identical agents to get cooperation in PD. Already in simple use cases we have different amounts of computational resources for i

... (read more)
[1] Wei Dai, 5y: It seems to me like cooperation might be possible in much greater generality. I don't see how we know that it is possible. Please explain? I'm having trouble following you here. Can you explain more about each point, and how they can be addressed separately?
Control and security

I think that the main important thing is that our AI systems learn to behave effectively in the world while allowing us to maintain effective control over their future behavior

This does seem sufficient to solve the immediate problem of AI risk, without compromising the potential for optimizing the world with our detailed values, provided

  • The line between "us" that maintain control and the AI design is sufficiently blurred (via learning, uploading, prediction etc., to remove the overhead of dealing with physical humans)
  • "Behave effectively" includes cap
... (read more)
Control and security

This works as a subtle argument for security mindset in AI control (while not being framed as such). One issue is that it might deemphasize some AI control problems that are not analogous to practical security problems, like detailed value elicitation (where in security you formulate a few general principles and then give up). That is, the concept of {AI control problems that are analogous to security problems} might be close enough to the concept of {all AI control problems} to replace it in some people's minds.

[0] Paul Christiano, 5y: It seems to me like failures of value learning can also be a security problem: if some gap between the AI's values and the human values is going to cause trouble, the trouble is most likely to show up in some adversarially-crafted setting. I do agree that this is not closely analogous to security problems that cause trouble today. I also agree that sorting out how to do value elicitation in the long run is not really a short-term security problem, but I am also somewhat skeptical [https://medium.com/ai-control/ambitious-vs-narrow-value-learning-99bd0c59847e#.n1ppqrg7s] that it is a critical control problem. I think that the main important thing is that our AI systems learn to behave effectively in the world while allowing us to maintain effective control over their future behavior, and a failure of this property (e.g. because the AI has a bad conception of "effective control") is likely to be a security problem.
(C)IRL is not solely a learning process

I agree with the sentiment that there are philosophical difficulties that AI needs to take into account, but that would very likely take far too long to formulate. Simpler kinds of indirect normativity that involve prediction of uploads allow delaying that work until after AI.

So this issue doesn't block all actionable work, as its straightforward form would suggest. There might be no need for the activities to be in this order in physical time. Instead it motivates work on the simpler kinds of indirect normativity that would allow such philosophical investigations

... (read more)