Recent Discussion

Clarifying "AI Alignment"
15 · 9mo · 3 min read

When I say an AI A is aligned with an operator H, I mean:

A is trying to do what H wants it to do.

The “alignment problem” is the problem of building powerful AI systems that are aligned with their operators.

This is significantly narrower than some other definitions of the alignment problem, so it seems important to clarify what I mean.

In particular, this is the problem of getting your AI to try to do the right thing, not the problem of figuring out which thing is right. An aligned AI would try to figure out which thing is right, and like a human it may or may not succeed.

Analogy

Consider a human... (Read more)

3 · Wei Dai · 4h

> Yes, I'd say that to the extent that "trying to do X" is a useful concept, it applies to systems with lots of agents just as well as it applies to one agent.

So how do you see it applying in my example? Would you say that the system in my example is both trying to do what H wants it to do, and also trying to do something that H doesn't want? Is it intent aligned period, or intent aligned at some points in time and not at others, or simultaneously intent aligned and not aligned, or something else? (I feel like we've had a similar discussion before and either it didn't get resolved or I didn't understand your position. I didn't see a direct attempt to answer this in the comment I'm replying to, and it's fine if you don't want to go down this road again but I want to convey my continued confusion.)

> You could say that AIXI is "optimizing" the right thing and just messing up when it suffers inner alignment failures, but I'm not convinced that this division is actually doing much useful work. I think it's meaningful to say "defining what we want is useful," but beyond that it doesn't seem like a workable way to actually analyze the hard parts of alignment or divide up the problem.

I don't understand how this is connected to what I was saying. (In general I often find it significantly harder to understand your comments compared to say Rohin's. Not necessarily saying you should do something differently, as you might already be making a difficult tradeoff between how much time to spend here and elsewhere, but just offering feedback in case you didn't realize.)

> Of course, it also seems quite likely that AIs of the kind that will probably be built ("by default") also fall outside of the definition-optimization framework. So adopting this framework as a way to analyze potential aligned AIs seems to amount to narrowing the space considerably.

This makes sense.
4 · Paul Christiano · 3h

> Would you say that the system in my example is both trying to do what H wants it to do, and also trying to do something that H doesn't want? Is it intent aligned period, or intent aligned at some points in time and not at others, or simultaneously intent aligned and not aligned, or something else?

The oracle is not aligned when asked questions that cause it to do malign optimization. The human+oracle system is not aligned in situations where the human would pose such questions.

For a coherent system (e.g. a multiagent system which has converged to a Pareto efficient compromise), it makes sense to talk about the one thing that it is trying to do. For an incoherent system this abstraction may not make sense, and a system may be trying to do lots of things. I try to use benign [https://ai-alignment.com/benign-ai-e4eb6ec6d68e] when talking about possibly-incoherent systems, or things that don't even resemble optimizers.

The definition in this post is a bit sloppy here, but I'm usually imagining that we are building roughly-coherent AI systems (and that if they are incoherent, some parts are malign). If you wanted to be a bit more careful with the definition, and want to admit vagueness in "what H wants it to do" (such that there can be several different preferences that are "what H wants"), we could say something like:

> A is aligned with H if everything it is trying to do is "what H wants."

That's not great either, though (and I think the original post is more at an appropriate level of attempted-precision).
3 · Wei Dai · 1m

(In the following I will also use "aligned" to mean "intent aligned".)

> The human+oracle system is not aligned in situations where the human would pose such questions.

Ok, sounds like "intent aligned at some points in time and not at others" was the closest guess. To confirm, would you endorse "the system was aligned when the human was still trying to figure out what questions to ask the oracle (since the system was still only trying to do what H wants), and then due to its own incompetence became not aligned when the oracle started working on the unsafe question"?

Given that intent alignment in this sense seems to be a property of a system+situation instead of the system itself, how would you define when the "intent alignment problem" has been solved for an AI, or when would you call an AI (such as IDA) itself "intent aligned"? (When we can reasonably expect to keep it out of situations where its alignment fails, for some reasonable amount of time, perhaps?) Or is it the case that whenever you use "intent alignment" you always have some specific situation or set of situations in mind?
2 · Rohin Shah · 1h

Fwiw having read this exchange, I think I approximately agree with Paul. Going back to the original response to my comment:

> Isn't HCH also such a multiagent system?

Yes, I shouldn't have made a categorical statement about multiagent systems. What I should have said was that the particular multiagent system you proposed did not have a single thing it is "trying to do", i.e. I wouldn't say it has a single "motivation". This allows you to say "the system is not intent-aligned", even though you can't say "the system is trying to do X".

Another way of saying this is that it is an incoherent system and so the motivation abstraction / motivation-competence decomposition doesn't make sense, but HCH is one of the few multiagent systems that is coherent. (Idk if I believe that claim, but it seems plausible.) This seems to map on to the statement:

> For an incoherent system this abstraction may not make sense, and a system may be trying to do lots of things.

Also, I want to note strong agreement with this:

> Of course, it also seems quite likely that AIs of the kind that will probably be built ("by default") also fall outside of the definition-optimization framework. So adopting this framework as a way to analyze potential aligned AIs seems to amount to narrowing the space considerably.

> Another way of saying this is that it is an incoherent system and so the motivation abstraction / motivation-competence decomposition doesn't make sense, but HCH is one of the few multiagent systems that is coherent.

HCH can be incoherent. I think one example that came up in an earlier discussion was the top node in HCH trying to help the user by asking (due to incompetence / insufficient understanding of corrigibility) "What is a good approximation of the user's utility function?" followed by "What action would maximize EU according to this utility func

... (Read more)
Soft takeoff can still lead to decisive strategic advantage
20 · 6h · 7 min read

[Epistemic status: Argument by analogy to historical cases. Best case scenario it's just one argument among many.]

I have on several occasions heard people say things like this:

The original Bostrom/Yudkowsky paradigm envisioned a single AI built by a single AI project, undergoing intelligence explosion all by itself and attaining a decisive strategic advantage as a result. However, this is very unrealistic. Discontinuous jumps in technological capability are very rare, and it is very implausible that one project could produce more innovations than the rest of the world combined. Instead
... (Read more)

I disagree about 1939 Germany -- sure, their economy would collapse, but they'd be able to conquer western Europe before it collapsed, and use the resources and industry set up there. Even if they couldn't do that, they would be able to reorient their economy in a year or two and then conquer the world.

I agree about the Afghanistan case but I'm not sure what lessons to draw from it for the AGI scenario in particular.

5 · Lukas Finnveden · 3h

Hm, my prior is that speed of learning how stolen code works would scale along with general innovation speed, though I haven't thought about it a lot. On the one hand, learning the basics of how the code works would scale well with more automated testing, and a lot of finetuning could presumably be automated without intimate knowledge. On the other hand, we might be in a paradigm where AI tech allows us to generate lots of architectures to test, anyway, and the bottleneck is for engineers to develop an intuition for them, which seems like the thing that you're pointing at.
3 · Hoagy · 26m

I think this points to the strategic supremacy of relevant infrastructure in these scenarios. From what I remember of the battleship era, having an advantage in design didn't seem to be a particularly large advantage - once a new era was entered, everyone with sufficient infrastructure switched to the new technology. This feels similar to the AI scenario, where technology seems likely to spread quickly through a combination of high financial incentive, interconnected social networks, state-sponsored espionage etc. The way in which a serious differential emerges is likely to be more through a gap in the infrastructure needed to implement the new technology. It seems that the current world is tilted towards infrastructure ability diffusing fast enough to prevent this, but it seems possible that if we have a massive increase in economic growth then this balance is altered and infrastructure gaps emerge, creating differentials that can't easily be reversed by a few algorithm leaks.
4 · Adele Lopez · 3h

Yeah, I think the engineer intuition is the bottleneck I'm pointing at here.
12 · Lukas Finnveden · 4h

Great post! One thing I noticed is that claim 1 speaks about nation states while most of the AI-bits speak about companies/projects. I don't think this is a huge problem, but it seems worth looking into.

It seems true that it'll be necessary to localize the secret bits into single projects, in order to keep things secret. It also seems true that such projects could keep a lead on the order of months/years. However, note that this no longer corresponds to having a country that's 30 years ahead of the rest of the world. Instead, it corresponds to having a country with a single company 30 years ahead of the world. The equivalent analogy is: could a company transported 30 years back in time gain a decisive strategic advantage for itself / whatever country it landed in? A few arguments:

* A single company might have been able to bring back a single military technology, which may or may not have been sufficient to turn the world, alone. However, I think one can argue that AI is more multipurpose than most technologies.
* If the company wanted to cooperate with its country, there would be an implementation lag after the technology was shared. In old times, this would perhaps correspond to building the new ships/planes. Today, it might involve taking AI architectures and training them for particular purposes, which could be more or less easy depending on the generality of the tech. (Maybe also scaling up hardware?) During this time, it would be easier for other projects and countries to steal the technology (though of course, they would have implementation lags of their own).
* In the historical case, one might worry that a modern airplane company couldn't produce many useful things 30 years back in time, because it relied on new materials and products from other companies. Translated to the case where AI companies develop along with the world, this would highlight that the AI-company could develop a 30-year-lead-equivalent in AI
Vaniver's View on Factored Cognition
6 · 20h · 7 min read

The View from 2018

In April of last year, I wrote up my confusions with Paul's agenda, focusing mostly on approval-directed agents. I mostly have similar opinions now; the main thing I noticed on rereading it was that I talked about 'human-sized' consciences, when now I would describe them as larger than human size (since moral reasoning depends on cultural accumulation, which is larger than human size). But on the meta level, I think they're less relevant to Paul's agenda than I thought then; I was confused about how Paul's argument for alignment worked. (I do think my objections were correct object

... (Read more)
1 · VojtaKovarik · 5h

> That is, I can easily see how factored cognition allows you to stick to cognitive strategies that definitely solve a problem in a safe way, but don't see how it does that and allows you to develop new cognitive strategies to solve a problem that doesn't result in an opening for inner optimizers--not within units, but within assemblages of units.

Do you have some intuition for how inner optimizers would arise within assemblages of units, without being initiated by some unit higher in the hierarchy? Or is that what you are pointing at?

When I imagine them they are being initiated by some unit higher in the hierarchy. Basically, you could imagine having a tree of humans that is implementing a particular search process, or a different tree of humans implementing a search over search processes, with the second perhaps being more capable (because it can improve itself) but also perhaps leading to inner alignment problems.
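To make that contrast concrete, here is a toy sketch in Python (my own construction, not from Vaniver's post): the "tree of humans" is collapsed into ordinary functions, a hypothetical `solve_fixed` runs one pre-approved search strategy, while `solve_meta` searches over candidate strategies and delegates to whichever passes a quick self-generated check. All names and the trivial search task are made up; the only point is that the second, more capable version ends up running a strategy that was selected by an internal search rather than vetted directly, which is where an inner-alignment-style worry could enter.

```python
# Toy sketch only: the "tree of humans" is collapsed into plain functions,
# and the "cognitive strategies" are deliberately trivial search procedures.
from typing import Callable, List

Strategy = Callable[[List[int], int], int]

def linear_scan(xs: List[int], target: int) -> int:
    """One fixed, pre-approved strategy: scan left to right."""
    for i, x in enumerate(xs):
        if x == target:
            return i
    return -1

def binary_search(xs: List[int], target: int) -> int:
    """A different strategy: binary search (only correct on sorted input)."""
    lo, hi = 0, len(xs) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if xs[mid] == target:
            return mid
        if xs[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1

def solve_fixed(xs: List[int], target: int) -> int:
    """Tree that implements one particular search process."""
    return linear_scan(xs, target)

def solve_meta(xs: List[int], target: int) -> int:
    """Tree whose top node searches over search processes: it tries each
    candidate strategy on a small probe and delegates to the first one that
    passes. More capable in principle, but the strategy actually run was
    chosen by an internal search rather than vetted directly."""
    candidates: List[Strategy] = [linear_scan, binary_search]
    probe = sorted(xs)[:4]  # a tiny self-generated test
    for strategy in candidates:
        if all(strategy(probe, p) == probe.index(p) for p in probe):
            return strategy(xs, target)
    return -1

if __name__ == "__main__":
    data = [3, 7, 11, 19, 23, 42]
    print(solve_fixed(data, 19))  # 3
    print(solve_meta(data, 19))   # 3, via whichever strategy passed the probe
```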

6 · Vaniver · 19h

It wouldn't surprise me if I was similarly confused now, tho hopefully I am less so, and you shouldn't take this post as me speaking for Paul. This post was improved some by a discussion with Evan [http://lesswrong.com/users/evhub] which crystallized some points as 'clear disagreements' instead of me being confused, but I think there are more points to crystallize further in this way. It was posted tonight in the state it's in as part of MSFP 2019's blog post day, but might get edited more tomorrow or perhaps will get further elaborated in the comments section.
Towards an Intentional Research Agenda
5 · 17h · 3 min read

This post is motivated by research intuitions that better formalisms in consciousness research contribute to agent foundations in more ways than just the value loading problem. Epistemic status: speculative.

David Marr's levels of analysis is the idea that any analysis of a system involves analyzing it at multiple, distinct levels of abstraction. These levels are the computational level, which describes what the system is trying to do; the algorithmic level, which describes which algorithms the system instantiates in order to accomplish that goal; and the implementation level, describing the har... (Read more)

Another direction that has been stubbornly resisting crystallizing is the idea that Goodharting is a positive feature in adversarial environments via something like granting ε-differential privacy while still allowing you to coordinate with others by taking advantage of one-way functions, i.e. hashing shared intents to avoid adversarial pressure. This would make this sort of work part of a billion-year arms race where one side attempts to reverse engineer signaling mechanisms while the other side tries to obfuscate them to prevent the current signaling fr... (Read more)
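As a very rough illustration of the "hashing shared intents" part of this idea (my own sketch, not something from the comment, and it does not attempt the differential-privacy framing): a standard hash commitment lets coordinating parties publish only a digest of their plan, so an observer cannot optimize against the plan's content, while the parties can later reveal and verify it. The `commit`/`verify` names are hypothetical, and SHA-256 stands in for whatever one-way function is meant.

```python
# Sketch of a hash commitment on a shared intent; SHA-256 stands in for
# whatever one-way function is intended. Function names are made up.
import hashlib
import os
from typing import Tuple

def commit(intent: str) -> Tuple[str, bytes]:
    """Return (commitment, nonce). Publish the commitment, keep the nonce secret."""
    nonce = os.urandom(16)
    digest = hashlib.sha256(nonce + intent.encode()).hexdigest()
    return digest, nonce

def verify(intent: str, nonce: bytes, commitment: str) -> bool:
    """Check a revealed (intent, nonce) pair against a previously published commitment."""
    return hashlib.sha256(nonce + intent.encode()).hexdigest() == commitment

if __name__ == "__main__":
    commitment, nonce = commit("rendezvous at the northern foraging site")
    # An observer sees only the commitment, so it cannot optimize against the plan itself.
    print(verify("rendezvous at the northern foraging site", nonce, commitment))  # True
    print(verify("rendezvous at the southern site", nonce, commitment))           # False
```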

8 · Jessica Taylor · 4h

On the subject of intentionality/reference/objectivity/etc, On the Origin of Objects [https://smile.amazon.com/Origin-Objects-Bradford-Book/dp/0262692090?sa-no-redirect=1] is excellent. My thinking about reference has a kind of discontinuity from before reading this book to after reading it. Seriously, the majority of analytic philosophy discussion of indexicality, qualia, reductionism, etc seems hopelessly confused in comparison.
4 · romeostevensit · 3h

Reading this now, thanks.
1 · VojtaKovarik · 5h

(I don't have much experience thinking in these terms, so maybe the question is dumb/already answered in the post. But anyway:) Do you have some more-detailed (and stupidly explicit) examples of the intentional and algorithmic views on the same thing, and how to translate between them?
3 · romeostevensit · 4h

This is a good question. I think ways of thinking about Marr's levels might themselves be underdetermined and therefore worth trying to crux on.

Let's take the example of birds again. On the implementation level we can talk about the physical systems of a bird interacting with its environment. On the algorithmic level we can talk about patterns of behavior supported by the physical environment that allow the bird to do certain tasks. On the computational (intentional) level we can talk about why those tasks are useful in terms of some goal architecture like survival and sexual selection. We can think about underdetermination when we have any notion that different goals might in theory be instantiated by two otherwise similar birds, when we have a notion of following different strategies to achieve the same goal, or when we think about having the same goals and strategies instantiated on a different substrate (simulations).

One of the reasons I think this topic is confusing is that in reality we only ever have access to the algorithmic level. We don't have direct access to the instantiation level (physics); we just have algorithms that more or less reliably return physics-shaped invariances. Likewise, we don't have direct access to our goals; we just have algorithms that return Goodharted proxies that we use to triangulate on inferred goals. We improve the accuracy of these algorithms over time through a pendulum swing from modification to the representation vs. modification to the traversal (this split is also from Marr).
Tabooing 'Agent' for Prosaic Alignment
20 · 20h · 6 min read
...
8 · Wei Dai · 16h

This explanation seems clear and helpful to me. I wonder if you can extend it to also explain non-agentic approaches to Prosaic AI Alignment (and why some people prefer those). For example you cite Paul Christiano a number of times, but I don't think he intends to end up with a model "implementing highly effective expected utility maximization for some utility function".

> If we are dealing with a model doing expected utility maximization we can 'just' try to understand whether we agree with its goal, and then essentially trust that it will correctly and stably generalize to almost any situation.

This seems too optimistic/trusting. See Ontology identification problem [https://arbital.com/p/ontology_identification/], Modeling distant superintelligences [https://arbital.com/p/distant_SIs/], and more recently The "Commitment Races" problem [https://www.lesswrong.com/posts/brXr7PJ2W4Na2EW2q/the-commitment-races-problem].
4 · Hjalmar Wijk · 15h

> I wonder if you can extend it to also explain non-agentic approaches to Prosaic AI Alignment (and why some people prefer those).

I'm quite confused about what a non-agentic approach actually looks like, and I agree that extending this to give a proper account would be really interesting. A possible argument for actively avoiding 'agentic' models from this framework is:

1. Models which generalize very competently also seem more likely to have malign failures, so we might want to avoid them.
2. If we believe H then things which generalize very competently are likely to have agent-like internal architecture.
3. Having a selection criterion or model-space/prior which actively pushes away from such agent-like architectures could then help push away from things which generalize too broadly.

I think my main problem with this argument is that step 3 might make step 2 invalid - it might be that if you actively punish agent-like architecture in your search then you will break the conditions that made 'too broad generalization' imply 'agent-like architecture', and thus end up with things that still generalize very broadly (with all the downsides of this) but just look a lot weirder.

> This seems too optimistic/trusting. See Ontology identification problem, Modeling distant superintelligences, and more recently The "Commitment Races" problem.

Thanks for the links, I definitely agree that I was drastically oversimplifying this problem. I still think this task might be much simpler than the task of trying to understand the generalization of some strange model whose internal workings we don't even have a vocabulary to describe.
7 · romeostevensit · 5h

(black) hat tip to johnswentworth for the notion that the choice of boundary for the agent is arbitrary in the sense that you can think of a thermostat optimizing the environment or think of the environment as optimizing the thermostat. Collapsing sensor/control duality for at least some types of feedback circuits.
1 · Hjalmar Wijk · 4h

These sorts of problems are what caused me to want a presentation which didn't assume well-defined agents and boundaries in the ontology, but I'm not sure how it applies to the above - I am not looking for optimization as a behavioral pattern but as a concrete type of computation, which involves storing world-models and goals and doing active search for actions which further the goals. Neither a thermostat nor the world outside seems to do this from what I can see? I think I'm likely missing your point.

Motivating example: consider a primitive bacterium with a thermostat, a light sensor, and a salinity detector, each of which has functional mappings to some movement pattern. You could say this system has a 3 dimensional map and appears to search over it.
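A minimal sketch of that motivating example (my own construction; the fields, constants, and function names are all made up): three sensor readings feed a fixed mapping into a run-and-tumble movement rule, and the resulting trajectory looks like hill-climbing over a 3-dimensional "comfort" landscape even though no goal is explicitly represented anywhere.

```python
# Made-up fields and constants; only meant to show behaviour that *looks like*
# search over a 3-dimensional sensor map.
import math
import random

def sense(position):
    """Three sensor readings, each a smooth function of location."""
    x, y = position
    temperature = -((x - 3.0) ** 2 + (y - 1.0) ** 2)  # warmest near (3, 1)
    light = -((x - 2.0) ** 2 + (y - 2.0) ** 2)        # brightest near (2, 2)
    salinity = -((x - 4.0) ** 2 + (y - 0.0) ** 2)     # preferred salinity near (4, 0)
    return temperature, light, salinity

def comfort(readings):
    """Fixed functional mapping from sensor readings to a single scalar."""
    return sum(readings)

def run(steps: int = 500, seed: int = 0):
    rng = random.Random(seed)
    position = (0.0, 0.0)
    angle = 0.0
    for _ in range(steps):
        before = comfort(sense(position))
        position = (position[0] + 0.05 * math.cos(angle),
                    position[1] + 0.05 * math.sin(angle))
        if comfort(sense(position)) < before:
            angle = rng.uniform(0.0, 2.0 * math.pi)  # "tumble": pick a new heading
    return position

if __name__ == "__main__":
    print(run())  # drifts toward the region that jointly suits all three sensors
```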

Formalising decision theory is hard
8 · 19h · 1 min read

In this post, I clarify how far we are from a complete solution to decision theory, and the way in which high-level philosophy relates to the mathematical formalism. I've personally been confused about this in the past, and I think it could be useful to people who casually follow the field. I also link to some less well-publicized approaches.

The first disagreement you might encounter when reading about alignment-related decision theory is the disagreement between Causal Decision Theory (CDT), Evidential Decision Theory (EDT), and different logical decision theories emerging from MIRI and less... (Read more)

4 · Vanessa Kosoy · 6h

Heterodox opinion: I think the entire MIRIesque (and academic philosophy) approach to decision theory is confused. The basic assumption seems to be that we can decouple the problem of learning a model of the world from the problem of taking a decision given such a model. We then ignore the first problem, and assume a particular shape for the model (for example, a causal network) which allows us to consider decision theories such as CDT, EDT etc. However, in reality the two problems cannot be decoupled. This is because the type signature of a world model is only meaningful if it comes with an algorithm for how to learn a model of this type.

For example, consider Newcomb's paradox. The agent makes a decision under the assumption that Omega behaves in a certain way. But where did the assumption come from? Realistic agents have to learn everything they know. Learning normally requires a time sequence. For example, we can consider the iterated Newcomb's paradox (INP). In INP, any reinforcement learning (RL) algorithm will converge to one-boxing, simply because one-boxing gives it the money. This is despite RL naively looking like CDT. Why does it happen? Because in the learned model, the "causal" relationships are not physical causality. The agent comes to believe that taking the one box causes the money to appear there.

In Newcomb's paradox EDT succeeds but CDT fails. Let's consider an example where CDT succeeds and EDT fails: the XOR blackmail. The iterated version would be IXB. In IXB, classical RL doesn't guarantee much because the environment is more complex than the agent (it contains Omega). To overcome this, we can use RL with incomplete models [https://www.alignmentforum.org/posts/5bd75cc58225bf0670375575/the-learning-theoretic-ai-alignment-research-agenda]. I believe that this indeed solves both INP and IXB.

Then we can consider e.g. counterfactual mugging. In counterfactual mugging, RL with incomplete models doesn't work. That's because the assumption t
3 · Lukas Finnveden · 4h

> In INP, any reinforcement learning (RL) algorithm will converge to one-boxing, simply because one-boxing gives it the money. This is despite RL naively looking like CDT.

Yup, like Caspar [https://casparoesterheld.files.wordpress.com/2018/01/learning-dt.pdf], I think that model-free RL learns the EDT policy in most/all situations. I'm not sure what you mean by it looking like CDT.

> In Newcomb's paradox CDT succeeds but EDT fails. Let's consider an example where EDT succeeds and CDT fails: the XOR blackmail.

Isn't it the other way around? The one-boxer gets more money, but gives in to blackmail, and therefore gets blackmailed in the first place.

RL is CDT in the sense that your model of the world consists of actions and observations, with causal links from past actions and observations to current observations, but no causal origin for the actions. The actions are just set by the agent to whatever it wants.

And, yes, I got CDT and EDT flipped there, good catch!
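A toy simulation of the iterated Newcomb claim above (my own sketch, not from the comment thread): assuming Omega perfectly predicts each round's choice, so the opaque box contains the million exactly when the agent one-boxes, a simple epsilon-greedy value learner that conditions only on its own actions (the naively CDT-looking setup) converges to one-boxing, i.e. it learns that one-boxing is what makes the money appear. All function and parameter names here are made up for illustration.

```python
# Toy bandit-style learner in an iterated Newcomb setting where Omega is
# assumed to perfectly predict the current round's choice.
import random

def omega_payoff(action: str) -> int:
    """Opaque box contains $1,000,000 exactly when the agent one-boxes."""
    return 1_000_000 if action == "one-box" else 1_000

def iterated_newcomb(rounds: int = 2000, epsilon: float = 0.1,
                     lr: float = 0.05, seed: int = 0) -> dict:
    rng = random.Random(seed)
    q = {"one-box": 0.0, "two-box": 0.0}  # learned action values
    for _ in range(rounds):
        explore = rng.random() < epsilon
        action = rng.choice(list(q)) if explore else max(q, key=q.get)
        reward = omega_payoff(action)
        q[action] += lr * (reward - q[action])  # simple running-average update
    return q

if __name__ == "__main__":
    values = iterated_newcomb()
    print(values)                       # one-box value near 1e6, two-box near 1e3
    print(max(values, key=values.get))  # -> "one-box"
```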
