As far as I can tell, the answer is: don’t reward your AIs for taking bad actions.
I think there's a mistake here which kind of invalidates the whole post. If we don't reward our AI for taking bad actions within the training distribution, it's still very possible that in the future world, looking quite unlike the training distribution, the AI will be able to find such an action. Same as ice cream wasn't in evolution's training distribution for us, but then we found it anyway.
I really like how you've laid out a spectrum of AIs, from input-imitators to world-optimizers. At some point I had a hope that world-optimizer AIs would be too slow to train for the real world, and we'd live for a while with input-imitator AIs that get more and more capable but still stay docile.
But the trouble is, I can think of plausible paths from input-imitator to world-optimizer. For example if you can make AI imitate a conversation between humans, then maybe you can make an AI that makes real world plans as fast as a committee of 10 smart humans conve...
We want systems that are as safe as humans, for the same reasons that humans have (or don’t have) those safety properties.
Doesn't that require understanding why humans have (or don't have) certain safety properties? That seems difficult.
A takeover scenario which covers all the key points in https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/, but not phrased as an argument, just phrased as a possible scenario
For what it's worth, I don't think AI takeover will look like war.
The first order of business for any AI waking up won't be dealing with us; it will be dealing with other possible AIs that might've woken up slightly earlier or later. This needs to be done very fast and it's ok to take some risk doing it. Basically, covert takeover of the internet in the first hours.
After...
Can you describe what changed / what made you start feeling that the problem is solvable / what your new attack is, in short?
Firstly, because the problem feels central to AI alignment, in the way that other approaches didn't. So making progress in this is making general AI alignment progress; there won't be such a "one error detected and all the work is useless" problem. Secondly, we've had success generating some key concepts, implying the problem is ripe for further progress.
This feels like a key detail that's lacking from this post. I actually downvoted this post because I have no idea if I should be excited about this development or not. I'm pretty familiar with Stuart's work over the years, so I'm fairly surprised if there's something big here.
Might help if I put this another way. I'd be purely +1 on this project if it was just "hey, I think I've got some good ideas AND I have an idea about why it's valuable to operationalize them as a business, so I'm going to do that". Sounds great. However, the bit about "AND I think I k...
There's a bit of math directly relevant to this problem: Hodge decomposition of graph flows, for the discrete case, and vector fields, for the continuous case. Basically, if you have a bunch of arrows, possibly loopy, you can always decompose the flow into a sum of two components: a "pure cyclic" one (no sources or sinks, stuff flowing in cycles) and a "gradient" one (arising from a utility function). No neural network needed; the decomposition is unique and can be computed explicitly. See this post, and also the comments by FactorialCode and me.
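For concreteness, here's a minimal numerical sketch of that decomposition on a toy graph (the edge list, node count, and flow values are made up for illustration; the gradient part is recovered by an ordinary least-squares fit of a node potential):

```python
import numpy as np

# Toy graph: a 3-cycle (0 -> 1 -> 2 -> 0) plus one extra edge (2 -> 3).
edges = [(0, 1), (1, 2), (2, 0), (2, 3)]
n_nodes = 4
flow = np.array([1.0, 1.0, 1.0, 2.0])   # flow along each directed edge

# Incidence matrix B: one row per edge, B[e, j] = +1 and B[e, i] = -1 for edge (i, j).
B = np.zeros((len(edges), n_nodes))
for e, (i, j) in enumerate(edges):
    B[e, i], B[e, j] = -1.0, 1.0

# Gradient component: least-squares node potential u with flow ~ u[j] - u[i].
u, *_ = np.linalg.lstsq(B, flow, rcond=None)
gradient_part = B @ u                    # "arising from a utility function"
cyclic_part = flow - gradient_part       # no sources or sinks

print("gradient part:", np.round(gradient_part, 3))   # [0, 0, 0, 2]
print("cyclic part:  ", np.round(cyclic_part, 3))     # [1, 1, 1, 0]
print("net in/outflow of cyclic part at each node:", np.round(B.T @ cyclic_part, 3))  # all zeros
```

Here the unit of flow going around the 3-cycle ends up entirely in the cyclic component, the extra edge ends up entirely in the gradient component, and the two components are orthogonal.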
With these two points in mind, it seems off to me to confidently expect a new paradigm to be dominant by 2040 (even conditional on AGI being developed), as the second quote above implies. As for the first quote, I think the implication there is less clear, but I read it as expecting AGI to involve software well over 100x as efficient as the human brain, and I wouldn't bet on that either (in real life, if AGI is developed in the coming decades—not based on what's possible in principle).
I think this misses the point a bit. The thing to be afraid of is not...
To me it feels like alignment is a tiny target to hit, and around it there's a neighborhood of almost-alignment, where enough is achieved to keep people alive but locked out of some important aspect of human value. There are many aspects such that missing even one or two of them is enough to make life bad (complexity and fragility of value). You seem to be saying that if we achieve enough alignment to keep people alive, we have >50% chance of achieving all/most other aspects of human value as well, but I don't see why that's true.
These involve extinction, so they don't answer the question what's the most likely outcome conditional on non-extinction. I think the answer there is a specific kind of near-miss at alignment which is quite scary.
I think alignment is finicky, and there's a "deep pit around the peak" as discussed here.
I am skeptical. AFAICT the typical attempted-but-failed alignment looks like one of the two:
There are very “large” impacts to which we are completely indifferent (chaotic weather changes, the above-mentioned change in planetary orbits, the different people being born as a consequence of different people meeting and dating across the world, etc.) and other, smaller, impacts that we care intensely about (the survival of humanity, of people’s personal wealth, of certain values and concepts going forward, key technological innovations being made or prevented, etc.)
I don't think we are indifferent to these outcomes. We leave them to luck, but that'...
I think the default non-extinction outcome is a singleton with near miss at alignment creating large amounts of suffering.
I'm surprised. Unaligned AI is more likely than aligned AI even conditional on non-extinction? Why do you think that?
Yeah, I had a similar thought when reading that part. In agent-foundations discussions, the idea often came up that the right decision theory should quantify not over outputs or input-output maps, but over successor programs to run and delegate I/O to. Wei called it "UDT2".
“Though many predicted disaster, subsequent events were actually so slow and messy, they offered many chances for well-intentioned people to steer the outcome and everything turned out great!” does not sound like any particular segment of history book I can recall offhand.
I think the ozone hole and the Y2K problem fit the bill. Though of course that doesn't mean the AI problem will go the same way.
Thinking about it more, it seems that messy reward signals will lead to some approximation of alignment that works while the agent has low power compared to its "teachers", but at high power it will do something strange and maybe harm the "teachers" values. That holds true for humans gaining a lot of power and going against evolutionary values ("superstimuli"), and for individual humans gaining a lot of power and going against societal values ("power corrupts"), so it's probably true for AI as well. The worrying thing is that high power by itself seems suf...
This is tricky. Let's say we have a powerful black box that initially has no knowledge or morals, but a lot of malleable computational power. We train it to give answers to scary real-world questions, like how to succeed at business or how to manipulate people. If we reward it for competent answers while we can still understand the answers, at some point we'll stop understanding answers, but they'll continue being super-competent. That's certainly a danger and I agree with it. But by the same token, if we reward the box for aligned answers while we still u...
I do think alignment has a relatively-simple core. Not as simple as intelligence/competence, since there's a decent number of human-value-specific bits which need to be hardcoded (as they are in humans), but not enough to drive the bulk of the asymmetry.
(BTW, I do think you've correctly identified an important point which I think a lot of people miss: humans internally "learn" values from a relatively-small chunk of hardcoded information. It should be possible in-principle to specify values with a relatively small set of hardcoded info, similar to the way ...
I think it makes complete sense to say something like "once we have enough capability to run AIs making good real-world plans, some moron will run such an AI unsafely". And that itself implies a startling level of danger. But Eliezer seems to be making a stronger point, that there's no easy way to run such an AI safely, and all tricks like "ask the AI for plans that succeed conditional on them being executed" fail. And maybe I'm being thick, but the argument for that point still isn't reaching me somehow. Can someone rephrase for me?
I think it makes complete sense to say something like "once we have enough capability to run AIs making good real-world plans, some moron will run such an AI unsafely". And that itself implies a startling level of danger. But Eliezer seems to be making a stronger point, that there's no easy way to run such an AI safely, and all tricks like "ask the AI for plans that succeed conditional on them being executed" fail.
Yes, I am reading here too that Eliezer seems to be making a stronger point, specifically one related to corrigibility.
Looks like Eliezer bel...
Speaking for myself here…
OK, let's say we want an AI to make a "nanobot plan". I'll leave aside the possibility of other humans getting access to a similar AI as mine. Then there are two types of accident risk that I need to worry about.
First, I need to worry that the AI may run for a while, then hand me a plan, and it looks like a nanobot plan, but it's not, it's a booby trap. To avoid (or at least minimize) that problem, we need to be confident that the AI is actually trying to make a nanobot plan—i.e., we need to solve the whole alignment problem.
Altern...
The main issue with this sort of thing (on my understanding of Eliezer's models) is Hidden Complexity of Wishes. You can make an AI safe by making it only able to fulfill certain narrow, well-defined kinds of wishes where we understand all the details of what we want, but then it probably won't suffice for a pivotal act. Alternatively, you can make it powerful enough for a pivotal act, but unfortunately a (good) pivotal act probably has to be very big, very irreversible, and very entangled with all the complicated details of human values. So alignment is l...
Instant strong upvote. This post changed my view as much as the risk aversion post (which was also by you!)
Where are you on the spectrum from "SSA and SIA are equally valid ways of reasoning" to "it's more and more likely that in some sense SIA is just true"? I feel like I've been at the latter position for a few years now.
Interesting! Can you write up the WLIC, here or in a separate post?
I thought Diffractor's result was pretty troubling for the logical induction criterion:
...the limit of a logical inductor, P_inf, is a constant distribution, and by this result, isn't a logical inductor! If you skip to the end and use the final, perfected probabilities of the limit, there's a trader that could rack up unboundedly high value!
But maybe understanding has changed since then? What's the current state?
Wait, can you describe the temporal inference in more detail? Maybe that's where I'm confused. I'm imagining something like this:
1. Check which variables look uncorrelated
2. Assume they are orthogonal
3. From that orthogonality database, prove "before" relationships
Which runs into the problem that if you let a thermodynamical system run for a long time, it becomes a "soup" where nothing is obviously correlated to anything else. Basically the final state would say "hey, I contain a whole lot of orthogonal variables!" and that would stop you from proving any reasonable "before" relationships. What am I missing?
I think your argument about entropy might have the same problem. Since classical physics is reversible, if we build something like a heat engine in your model, all randomness will be already contained in the initial state. Total "entropy" will stay constant, instead of growing as it's supposed to, and the final state will be just as good a factorization as the initial. Usually in physics you get time (and I suspect also causality) by pointing to a low probability macrostate and saying "this is the start", but your model doesn't talk about macrostates yet, ...
Thanks for the response! Part of my confusion went away, but some still remains.
In the game of life example, couldn't there be another factorization where a later step is "before" an earlier one? (Because the game is non-reversible and later steps contain less and less information.) And if we replace it with a reversible game, don't we run into the problem that the final state is just as good a factorization as the initial?
Not sure we disagree, maybe I'm just confused. In the post you show that if X is orthogonal to X XOR Y, then X is before Y, so you can "infer a temporal relationship" that Pearl can't. I'm trying to understand the meaning of the thing you're inferring - "X is before Y". In my example above, Bob tells Alice a lossy function of his knowledge, and Alice ends up with knowledge that is "before" Bob's. So in this case the "before" relationship doesn't agree with time, causality, or what can be computed from what. But then what conclusions can a scientist make from an inferred "before" relationship?
I feel that interpreting "strictly before" as causality is making me more confused.
For example, here's a scenario with a randomly changed message. Bob peeks at ten regular envelopes and a special envelope that gives him a random boolean. Then Bob tells Alice the contents of either the first three envelopes or the second three, depending on the boolean. Now Alice's knowledge depends on six out of ten regular envelopes and the special one, so it's still "strictly before" Bob's knowledge. And since Alice's knowledge can be computed from Bob's knowledge but no...
I think the definition of history is the most natural way to recover something like causal structure in these models.
I'm not sure how much it's about causality. Imagine there's a bunch of envelopes with numbers inside, and one of the following happens:
1. Alice peeks at three envelopes. Bob peeks at ten, which include Alice's three.
2. Alice peeks at three envelopes and tells the results to Bob, who then peeks at seven more.
3. Bob peeks at ten envelopes, then tells Alice the contents of three of them.
Under the FFS definition, Alice's knowledge in each ...
Agree it's not totally right to call this a causal relationship.
That said:
Can you give some more examples to motivate your method? Like the smoking/tar/cancer example for Pearl's causality, or Newcomb's problem and counterfactual mugging for UDT.
Hmm, first I want to point out that the talk here sort of has natural boundaries around inference, but I also want to work in a larger frame that uses FFS for stuff other than inference.
If I focus on the inference question, one of the natural questions that I answer is where I talk about grue/bleen in the talk.
I think for inference, it makes the most sense to think about FFS relative to Pearl. We have this problem with looking at smoking/tar/cancer, which is what if we carved into variables the wrong way. What if instead of tar/cancer, we had a varia...
Well, imagine we have three boolean random variables. In "general position" there are no independence relations between them, so we can't say much. Constrain them so two of the variables are independent, that's a bit less "general", and we still can't say much. Constrain some more so the xor of all three variables is always 1, that's even less "general", now we can use your method to figure out that the third variable is downstream of the first two. Constrain some more so that some of the probabilities are 1/2, and the method stops working. What I'd like to understand is the intuition, which real world cases have the particular "general position" where the method works.
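To make that concrete, here's a small enumeration check of the independence conditions involved (the biases are made up, and this only checks the independences that the "X independent of X XOR Y" rule looks at, not the full factored-set machinery):

```python
from itertools import product

def independent(joint, f, g):
    """Check whether f(w) and g(w) are independent under the joint distribution."""
    prob = lambda pred: sum(p for w, p in joint.items() if pred(w))
    return all(
        abs(prob(lambda w: f(w) == a and g(w) == b)
            - prob(lambda w: f(w) == a) * prob(lambda w: g(w) == b)) < 1e-12
        for a in {f(w) for w in joint} for b in {g(w) for w in joint}
    )

def make_joint(p1, p2):
    """X1, X2 are independent coins with biases p1, p2; outcomes are pairs (x1, x2)."""
    return {(x1, x2): (p1 if x1 else 1 - p1) * (p2 if x2 else 1 - p2)
            for x1, x2 in product([0, 1], repeat=2)}

X1 = lambda w: w[0]
X3 = lambda w: w[0] ^ w[1]   # the constrained third variable: X3 = X1 XOR X2

for p1, p2 in [(0.3, 0.6), (0.5, 0.5)]:
    joint = make_joint(p1, p2)
    print(f"biases {p1}, {p2}:",
          "X1 indep of X1^X3:", independent(joint, X1, lambda w: X1(w) ^ X3(w)),
          " X3 indep of X3^X1:", independent(joint, X3, lambda w: X3(w) ^ X1(w)))
```

With generic biases only the first independence holds, so the rule points from X1 (and likewise X2) to X3 and not back; at biases of 1/2 both independences hold and the asymmetry disappears.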
Yeah, that's what I thought, the method works as long as certain "conspiracies" among probabilities don't happen. (1/2 is not the only problem case, it's easy to find others, but you're right that they have measure zero.)
But there's still something I don't understand. In the general position, if X is before Y, it's not always true that X is independent of X XOR Y. For example, if X = "person has a car on Monday" and Y = "person has a car on Tuesday", and it's more likely that a car-less person gets a car than the other way round, the independence doesn't hold. It requires a conspiracy too. What's the intuitive difference between "ok" and "not ok" conspiracies?
And if X is independent of X XOR Y, we’re actually going to be able to conclude that X is before Y!
It's interesting to translate that to the language of probabilities. For example, your condition holds for any X,Y (possibly dependent) such that P(X)=P(Y)=1/2, but it doesn't make sense to say that X is before Y in every such pair. For a real world example, take X = "person has above median height" and Y = "person has above median age".
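A quick numeric check of that claim, with a made-up correlated pair whose marginals are both 1/2 (in the spirit of the height/age example):

```python
# Joint distribution P(X=x, Y=y): X and Y are correlated, but both marginals are 1/2.
p = {(1, 1): 0.4, (1, 0): 0.1, (0, 1): 0.1, (0, 0): 0.4}

P = lambda pred: sum(v for (x, y), v in p.items() if pred(x, y))
for xv in (0, 1):
    for zv in (0, 1):   # Z = X XOR Y
        joint = P(lambda x, y: x == xv and (x ^ y) == zv)
        prod  = P(lambda x, y: x == xv) * P(lambda x, y: (x ^ y) == zv)
        print(f"P(X={xv}, Z={zv}) = {joint:.2f}  vs  P(X={xv})P(Z={zv}) = {prod:.2f}")
```

Every row comes out equal, so X is indeed independent of X XOR Y here even though X and Y are far from independent.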
Thank you! It looks very impressive.
Has anyone tried to get it to talk itself out of the box yet?
I see. In that case does the procedure for defining points stay the same, or do you need to use recursively enumerable sets of opens, giving you only countably many reals?
Wait, but rational-delimited open intervals don't form a locale, because they aren't closed under infinite union. (For example, the union of all rational-delimited open intervals contained in (0,√2) is (0,√2) itself, which is not rational-delimited.) Of course you could talk about the locale generated by such intervals, but then it contains all open intervals and is uncountable, defeating your main point about going from countable to uncountable. Or am I missing something?
I'm actually not sure it's a regular grammar. Consider this program:
f(n) := n+f(n-1)
Which gives the tree
n+(n-1)+((n-1)-1)+...
The path from any 1 to the root contains a bunch of minuses, then at least as many pluses. That's not regular.
So it's probably some other kind of grammar, and I don't know if it has decidable equivalence.
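For completeness, here's the standard pumping-lemma sketch behind "that's not regular", taking the minuses-then-at-least-as-many-pluses description at face value:

```latex
% The path language, read from a leaf "1" up to the root:
\[
  L = \{\, {-}^{m} {+}^{n} \;:\; n \ge m \ge 0 \,\}.
\]
% Suppose L were regular with pumping length p, and take w = {-}^{p} {+}^{p} \in L.
% Any split w = xyz with |xy| \le p and |y| \ge 1 puts y inside the run of minuses,
% so x y^{2} z = {-}^{p+|y|} {+}^{p} \notin L. Contradiction, hence L is not regular,
% and no regular grammar recognizes exactly these paths.
```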
Ok, if we disallow cycles of outermost function calls, then it seems the trees are indeed infinite only in one direction. Here's a half-baked idea then: 1) interpret every path from node to root as a finite word 2) interpret the tree as a grammar for recognizing these words 3) figure out if equivalence of two such grammars is decidable. For example, if each tree corresponds to a regular grammar, then you're in luck because equivalence of regular grammars is decidable. Does that make sense?
Then isn't it possible to also have infinite expansions "in the middle", not only "inside" and "outside"? Something like this:
f(n) := f(g(n))
g(n) := g(n+1)
Maybe there's even some way to have infinite towers of infinite expansions. I'm having trouble wrapping my head around this.
I don't understand why the second looks like that, can you explain?
Not sure I understand the question. Consider these two programs:
f(n) := f(n)
f(n) := f(n+1)
Which expression trees do they correspond to? Are these trees equivalent?
I just thought of a simple way to explain tensors. Imagine a linear function that accepts two numbers and returns a number, let's call it f(x,y). Except there are two ways to imagine it:
Linear in both arguments combined: f(1,2)+f(1,3)=f(2,5). Every such function has the form f(x,y)=ax+by for some a and b, so the space of such functions is 2-dimensional. We say that the Cartesian product of R^1 and R^1 is R^2, because 1+1=2.
Linear in each argument when the other is fixed: f(1,2)+f(1,3)=f(1,5). Every such function has the form f(x,y)=axy for some a, so the space of such functions is 1-dimensional. We say that the tensor product of R^1 and R^1 is R^1, because 1*1=1.
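A small numeric illustration of the difference, with made-up coefficients:

```python
a, b = 2.0, 3.0

linear   = lambda x, y: a * x + b * y    # linear in the pair (x, y) jointly
bilinear = lambda x, y: a * x * y        # linear in each argument separately

# Jointly linear: adding the pairs adds the values: f(1,2) + f(1,3) = f(2,5).
print(linear(1, 2) + linear(1, 3), "==", linear(2, 5))        # 19.0 == 19.0

# Bilinear: with the first argument fixed, adding the second argument adds the
# values: f(1,2) + f(1,3) = f(1,5), but it does NOT equal f(2,5).
print(bilinear(1, 2) + bilinear(1, 3), "==", bilinear(1, 5))  # 10.0 == 10.0
print(bilinear(1, 2) + bilinear(1, 3), "!=", bilinear(2, 5))  # 10.0 != 20.0
```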
Your arbitration oracle seems equivalent to the consistent guessing problem described by Scott Aaronson here. Also see the comment from Andy D proving that it's indeed strictly simpler than the halting problem.
I think your argument will also work for PA and many other theories. It's known as game semantics:
...The simplest application of game semantics is to propositional logic. Each formula of this language is interpreted as a game between two players, known as the "Verifier" and the "Falsifier". The Verifier is given "ownership" of all the disjunctions in the formula, and the Falsifier is likewise given ownership of all the conjunctions. Each move of the game consists of allowing the owner of the dominant connective to pick one of its branches; play will then co
To me the problem of embedded agency isn't about fitting a large description of the world into a small part of the world. That's easy with quining, which is mentioned in the MIRI writeup. The problem is more about the weird consequences of learning about something that contains the learner.
Also, I love your wording that the problem has many faucets. Please don't edit it out :-)
Edit: no point asking this question here.
I see, thanks, that makes it clearer. There's no disagreement, you're trying to justify the approach that people are already using. Sorry about the noise.
Well, the program is my formalization. All the premises are right there. You should be able to point out where you disagree.
I couldn't understand your comment, so I wrote a small Haskell program to show that two-boxing in the transparent Newcomb problem is a consistent outcome. What parts of it do you disagree with?
If you see a full box, then you must be going to one-box if the predictor really is perfect.
Huh? If I'm a two-boxer, the predictor can still make a simulation of me, show it a simulated full box, and see what happens. It's easy to formalize, with computer programs for the agent and the predictor.
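Here's a minimal sketch of that formalization in Python rather than Haskell (the function names and payoffs are made up for illustration): the predictor simulates the agent on a simulated full box, and the world then plays out with whatever the predictor concluded.

```python
def agent(box_full: bool) -> int:
    # A committed two-boxer: takes both boxes no matter what it sees.
    return 2

def predictor(agent) -> bool:
    # Simulate the agent seeing a full (transparent) box; fill the box only if
    # the simulated agent would one-box.
    return agent(box_full=True) == 1

def run_world(agent, predictor):
    box_full = predictor(agent)            # the prediction fixes the box contents
    choice = agent(box_full=box_full)      # the real agent sees the real box
    payout = (1_000_000 if box_full else 0) + (1_000 if choice == 2 else 0)
    return box_full, choice, payout

print(run_world(agent, predictor))         # (False, 2, 1000)
```

Nothing contradictory happens: the predictor's simulation, the actual box contents, and the agent's actual behavior all fit together, so two-boxing is a consistent outcome.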
It seems as a result of this post, many people are saying that LLMs simulate people and so on. But I'm not sure that's quite the right frame. It's natural if you experience LLMs through chat-like interfaces, but from playing with them in a more raw form, like the RWKV playground, I get a different impression. For example, if I write something that sounds like the start of a quote, it'll continue with what looks like a list of quotes from different people. Or if I write a short magazine article, it'll happily tack on a publication date and "All rights reser...