Ah, I see. That makes sense now!
I do expect this to happen. The question is merely: what's the best predictor of how hard it is to find inference algorithms more efficient than gradient descent? Is it whether those inference algorithms are more complex than gradient descent? Or is it whether those inference algorithms run for longer than gradient descent? Since gradient descent is very simple but takes a long time to run, my bet is the latter: there are many simple ways to convert compute to optimisation, but few compute-cheap ways to convert additional complexity to optimisation.
No, I wasn't advocating adding a speed penalty, I was just pointing at a reason to think that a speed prior would give a more accurate answer to the question of "which is favored" than the bounded simplicity prior you're assuming:
Suppose that your imitator works by something akin to Bayesian inference with some sort of bounded simplicity prior (I think it's true of transformers)
But now I realise that I don't understand why you think this is true of transformers. Could you explain? It seems to me that there are many very simple hypotheses which take a long time to calculate, and which transformers therefore can't be representing.
In that case, gradient descent will reduce the weights that are used to calculate that specific activation value.
Hence, the honest-imitation hypothesis is heavily penalized compared to hypotheses that are in themselves agents which are more "epistemically sophisticated" than the outer loop of the AI.
In a deep learning context, the latter hypothesis seems much more heavily favored when using a simplicity prior (since gradient descent is simple to specify) than a speed prior (since gradient descent takes a lot of computation). So as long as the compute costs of inference remain smaller than the compute costs of training, a speed prior seems more appropriate for evaluating how easily hypotheses can become more epistemically sophisticated than the outer loop.
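As a toy illustration of how the two priors can disagree (all numbers made up: "bits" stands in for description complexity, "steps" for inference-time compute; this is a sketch, not a claim about real training dynamics):

```python
import math

# Two hypothetical hypotheses: a simple-but-slow one (standing in for
# gradient descent) and a complex-but-fast one (standing in for a more
# "epistemically sophisticated" inference algorithm). Numbers are invented.
hypotheses = {
    "simple_slow":  {"bits": 10, "steps": 2**45},
    "complex_fast": {"bits": 40, "steps": 2**10},
}

def simplicity_log_weight(h):
    # Bounded simplicity prior: penalise description length only.
    return -h["bits"]

def speed_log_weight(h):
    # Speed prior (roughly Schmidhuber-style): penalise description length
    # plus the log of the runtime.
    return -(h["bits"] + math.log2(h["steps"]))

best_by_simplicity = max(hypotheses, key=lambda n: simplicity_log_weight(hypotheses[n]))
best_by_speed = max(hypotheses, key=lambda n: speed_log_weight(hypotheses[n]))
print(best_by_simplicity)  # simple_slow: simplicity favours gradient descent
print(best_by_speed)       # complex_fast: the runtime penalty flips the ranking
```

The point of the sketch is just that the two priors can rank the same pair of hypotheses in opposite orders once runtime dwarfs description length.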
Quick brainstorm:
I think it's less about how many holes there are in a given plan, and more like "how much detail does it need before it counts as a plan?" If someone says that their plan is "Keep doing alignment research until the problem is solved", then whether or not there's a hole in that plan is downstream of all the other disagreements about how easy the alignment problem is. But it seems like, separate from the other disagreements, Eliezer tends to think that having detailed plans is very useful for making progress.
Analogy for why I don't buy this: I don't think that the Wright brothers' plan to solve the flying problem would count as a "plan" by Eliezer's standards. But it did work.
Strong +1s to many of the points here. Some things I'd highlight:
Meta-level: +1 for actually writing a thing.
Also meta-level: -1 because when I read this I get the sense that you started from a high-level intuition and then constructed a set of elaborate explanations of your intuition, but then phrased it as an argument.
I personally find this frustrating because I keep seeing people being super confident in their high-level intuitive metaphorical view of consequentialism and then never doing the work of actually digging beneath those metaphors. (Less a criticism of this post, more a criticism of everyone upvoting this p...
Thanks for the post, I think it's a useful framing. Two things I'd be interested in understanding better:
In the one real example of intelligence being developed we have to look at, continuous application of natural selection in fact found Homo sapiens sapiens, and the capability-gain curves of the ecosystem for various measurables were in fact sharply kinked by this new species (e.g., using machines, we sharply outperform other animals on well-established metrics such as “airspeed”, “altitude”, and “cargo carrying capacity”).
As I said in a reply to Eliezer...
Sorry, I should have been clearer. Let's suppose that a copy of you spent however long it takes to write an honest textbook with the solution to alignment (let's call it N Yudkowsky-years), and an evil copy of you spent N Yudkowsky-years writing a deceptive textbook trying to make you believe in a false solution to alignment, and you're given one but not told which. How long would it take you to reach 90% confidence about which you'd been given? (You're free to get a team together to run a bunch of experiments and implementations, I'm just asking that you ...
Depends what the evil clones are trying to do.
Get me to adopt a solution wrong in a particular direction, like a design that hands the universe over to them? I can maybe figure out the first time through who's out to get me, if it's 200 Yudkowsky-years. If it's 200,000 Yudkowsky-years I think I'm just screwed.
Get me to make any lethal mistake at all? I don't think I can get to 90% confidence period, or at least, not without spending an amount of Yudkowsky-time equivalent to the untrustworthy source.
Hmm, okay, here's a variant. Assume it would take N Yudkowsky-years to write the textbook from the future described above. How many Yudkowsky-years does it take to evaluate a textbook that took N Yudkowsky-years to write, to a reasonable level of confidence (say, 90%)?
Thanks for writing this, I agree that people have underinvested in writing documents like this. I agree with many of your points, and disagree with others. For the purposes of this comment, I'll focus on a few key disagreements.
My model of this variety of reader has an inside view, which they will label an outside view, that assigns great relevance to some other data points that are not observed cases of an outer optimization loop producing an inner general intelligence, and assigns little importance to our one data point actually featuring the pheno
Maybe one way to pin down a disagreement here: imagine the minimum-intelligence AGI that could write this textbook (including describing the experiments required to verify all the claims it made) in a year if it tried. How many Yudkowsky-years does it take to safely evaluate whether following a textbook which that AGI spent a year writing will kill you?
Infinite? That can't be done?
Small request: given that it's plausible that a bunch of LW material on this topic will end up quoted out of context, would you mind changing the headline example in section 5 to something less bad-if-quoted-out-of-context?
I expect it to be difficult to generate adversarial inputs which will fool a deceptively aligned AI. One proposed strategy for doing so is relaxed adversarial training, where the adversary can modify internal weights. But this seems like it will require a lot of progress on interpretability. An alternative strategy, which I haven't yet seen any discussion of, is to allow the adversary to carry out a data poisoning attack before generating adversarial inputs - i.e. the adversary gets to specify inputs and losses for a given number of SGD steps, and then the adversarial input on which the base model will be evaluated afterwards. (Edit: probably a better name for this is adversarial meta-learning.)
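A minimal sketch of the two-phase setup described above, using an invented toy linear model rather than a real deep learning pipeline (the model, learning rate, and data are all illustrative stand-ins):

```python
# Phase 1: the adversary supplies poisoned (input, target) pairs for a
# budget of SGD steps. Phase 2: the adversary picks the input on which the
# trained model is evaluated. Everything here is a made-up toy example.

w = [0.0, 0.0]  # toy linear model: prediction = w[0]*x[0] + w[1]*x[1]

def predict(w, x):
    return w[0] * x[0] + w[1] * x[1]

def sgd_step(w, x, y, lr=0.1):
    # one SGD step on squared error (prediction - y)**2
    err = predict(w, x) - y
    return [w[i] - lr * 2 * err * x[i] for i in range(2)]

# Phase 1: adversary-chosen data poisoning (pushes w[0] toward 5).
poisoned_batch = [([1.0, 0.0], 5.0)] * 20
for x, y in poisoned_batch:
    w = sgd_step(w, x, y)

# Phase 2: adversary-chosen evaluation input, aligned with the poisoned direction.
adversarial_input = [1.0, 0.0]
print(predict(w, adversarial_input))  # close to 5: the poisoning dominates
```

In the real proposal the adversary would of course be searching over poisoning data and evaluation inputs jointly; the sketch only shows the shape of the interaction.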
A general principle: if we constrain two neural networks to communicate via natural language, we need some pressure towards ensuring they actually use language in the same sense as humans do, rather than (e.g.) steganographically encoding the information they really care about.
The most robust way to do this: pass the language via a human, who tries to actually understand the language, then does their best to rephrase it according to their own understanding.
What do you lose by doing this? Mainly: you can no longer send messages too complex for humans to und...
Imagine taking someone's utility function, and inverting it by flipping the sign on all evaluations. What might this actually look like? Well, if previously I wanted a universe filled with happiness, now I'd want a universe filled with suffering; if previously I wanted humanity to flourish, now I'd want it to decline.
But this is assuming a Cartesian utility function. Once we treat ourselves as embedded agents, things get trickier. For example, suppose that I used to want people with similar values to me to thrive, and people with different values from me to ...
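The Cartesian version of the operation is trivial to write down, which is part of the contrast with the embedded case (toy outcomes and utilities, made up for illustration):

```python
# Made-up toy utility function over outcomes; inversion just flips signs.
utilities = {
    "universe_full_of_happiness": 10.0,
    "humanity_flourishes": 8.0,
    "universe_full_of_suffering": -10.0,
}

inverted = {outcome: -u for outcome, u in utilities.items()}

# Under the inverted function, the best outcome is the old worst one.
best_outcome = max(inverted, key=inverted.get)
print(best_outcome)  # universe_full_of_suffering
```

The embedded case resists this one-line treatment precisely because some of the agent's values refer back to facts about the agent itself.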
A possible way to convert money to progress on alignment: offering a large (recurring) prize for the most interesting failures found in the behavior of any (sufficiently-advanced) model. Right now I think it's very hard to find failures which will actually cause big real-world harms, but you might find failures in a way which uncovers useful methodologies for the future, or at least train a bunch of people to get much better at red-teaming.
(For existing models, it might be more productive to ask for "surprising behavior" rather than "failures" per se, sinc...
I'm loving this whole sequence, but I particularly love:
9.2.2 Preferences are over “thoughts”, which can relate to outcomes, actions, plans, etc., but are different from all those things
That feels very crisp, clear, and informative.
Probably the easiest "honeypot" is just making it relatively easy to tamper with the reward signal. Reward tampering is useful as a honeypot because it has no bad real-world consequences, but could be arbitrarily tempting for policies that have learned a goal that's anything like "get more reward" (especially if we precommit to letting them have high reward for a significant amount of time after tampering, rather than immediately reverting).
I like this pushback, and I'm a fan of productive mistakes. I'll have a think about how to rephrase to make that clearer. Maybe there's just a communication problem, where it's hard to tell the difference between people claiming "I have an insight (or proto-insight) which will plausibly be big enough to solve the alignment problem", versus "I have very little traction on the alignment problem but this direction is the best thing I've got". If the only effect of my post is to make a bunch of people say "oh yeah, I meant the second thing all along", then I'd...
One thing that makes me suspicious about this argument is that, even though I can gradient hack myself, I don't think I can make suggestions about what my parameters should be changed to.
How can I gradient hack myself? For example, by thinking of strawberries every time I'm about to get a reward. Now I've hacked myself to like strawberries. But I have no idea how that's implemented in my brain, I can't "pick the parameters for myself", even if you gave me a big tensor of gradients.
Two potential alternatives to the thing you said:
When I read other people, I often feel like they're operating in a 'narrower segment of their model', or not trying to fit the whole world at once, or something. They often seem to emit sentences that are 'not absurd', instead of 'on their mainline', because they're mostly trying to generate sentences that pass some shallow checks instead of 'coming from their complete mental universe.'
To me it seems like this is what you should expect other people to look like both when other people know less about a domain than you do, and also when you're overconfident ...
Key role, but most current ML is in the "applied" section, where the "theory" section instead explains the principles by which neural nets (or future architectures) work on the inside. Logical induction is a sidebar at some point explaining the theoretical ideal we're working towards, like I assume AIXI is in some textbooks.
Planning, Abstraction, Reasoning, Self-awareness.
Thanks, done!
I'm curious if you have a way to summarise what you think the "core insight" of ELK is, that allows it to improve on the way other alignment researchers think about solving the alignment problem.
Interesting post :) I'm intuitively a little skeptical - let me try to figure out why.
I think I buy that some reasoning process could consistently decide to hack in a robust way. But there are probably parts of that reasoning process that are still somewhat susceptible to being changed by gradient descent. In particular, hacking relies on the agent knowing what its current mesa-objective is - but that requires some type of introspective access, which may be difficult and the type of thing which could be hindered by gradient descent (especially when you're ...
Ah, that makes sense. In the section where you explain the steps of the game, I interpreted the comments in parentheses as further explanations of the step, rather than just a single example. (In hindsight the latter interpretation is obvious, but I was reading quickly - might be worth making this explicit for others who are doing the same.) So I thought that Bayes nets were built into the methodology. Apologies for the oversight!
I'm still a little wary of how much the report talks about concepts in a human's Bayes net without really explaining why this is...
Speaking just for myself, I think about this as an extension of the worst-case assumption. Sure, humans don't reason using Bayes nets -- but if we lived in a world where the beings whose values we want to preserve did reason about the world using a Bayes net, that wouldn't be logically inconsistent or physically impossible, and we wouldn't want alignment to fail in that world.
If you solve something given worst-case assumptions, you've solved it for all cases. Whereas if you solve it for one specific case (e.g. Bayes nets) then it may still fail if that's n...
We’ll assume the humans who constructed the dataset also model the world using their own internal Bayes net.
This seems like a crucial premise of the report; could you say more about it? You discuss why a model using a Bayes net might be "oversimplified and unrealistic", but as far as I can tell you don't talk about why this is a reasonable model of human reasoning.
I guess I just don't feel like you've established that it would have been reasonable to have credence above 90% in either of those cases. Like, it sure seems obvious to me that computers and automobiles are super useful. But I have a huge amount of evidence now about both of those things that I can't really un-condition on. So, given that I know how powerful hindsight bias can be, it feels like I'd need to really dig into the details of possible alternatives before I got much above 90% based on facts that were known back then.
(Although this depends on how ...
Interesting post. Overall, though, it feels like you aren't taking hindsight bias seriously enough. E.g. as one example:
Some people thought battleships would beat carriers. Others thought that the entire war would be won from the air. Predicting the future is hard; we shouldn’t be confident. Therefore, we shouldn’t assign more than 90% credence to the claim that powerful, portable computers (assuming we figure out how to build them) will be militarily useful, e.g. in weapon guidance systems or submarine sensor suites.
In this particular case, an alternative...
I didn't push this point at the time, but Paul's claim that "GPT-3 + 5 person-years of engineering effort [would] foom" seems really wild to me, and probably a good place to poke at his model more. Is this 5 years of engineering effort and then humans leaving it alone with infinite compute? Or are the person-years of engineering doled out over time?
Unlike Eliezer, I do think that language models not wildly dissimilar to our current ones will be able to come up with novel insights about ML, but there's a long way between "sometimes comes up with novel insig...
I didn't push this point at the time, but Paul's claim that "GPT-3 + 5 person-years of engineering effort [would] foom" seems really wild to me, and probably a good place to poke at his model more. Is this 5 years of engineering effort and then humans leaving it alone with infinite compute?
The 5 years are up front and then it's up to the AI to do the rest. I was imagining something like 1e25 flops running for billions of years.
I don't really believe the claim unless you provide computing infrastructure that is externally maintained or else extremely ...
The two extracts from this post that I found most interesting/helpful:
The problem is that the resource gets consumed differently, so base-rate arguments from resource consumption end up utterly unhelpful in real life. The human brain consumes around 20 watts of power. Can we thereby conclude that an AGI should consume around 20 watts of power, and that, when technology advances to the point of being able to supply around 20 watts of power to computers, we'll get AGI?
I'm saying that Moravec's "argument from comparable resource consumption" must
My recommended policy in cases where this applies is "trust your intuitions and operate on the assumption that you're not a crackpot."
Oh, certainly Eliezer should trust his intuitions and believe that he's not a crackpot. But I'm not arguing about what the person with the theory should believe, I'm arguing about what outside observers should believe, if they don't have enough time to fully download and evaluate the relevant intuitions. Asking the person with the theory to give evidence that their intuitions track reality isn't modest epistemology.
the easiest way to point out why they are dumb is with counterexamples. We can quickly "see" the counterexamples. E.g., if you're trying to see AGI as the next step in capitalism, you'll be able to find counterexamples where things become altogether different (misaligned AI killing everything; singleton that brings an end to the need to compete).
I'm not sure how this would actually work. The proponent of the AGI-capitalism analogy might say "ah yes, AGI killing everyone is another data point on the trend of capitalism becoming increasingly destructive". Or...
Your comment is phrased as if the object-level refutations have been tried, while conveying the meta-level intuitions hasn't been tried. If anything, it's the opposite: the sequences (and to some extent HPMOR) are practically all content about how to think, whereas Yudkowsky hasn't written anywhere near as extensively on object-level AI safety.
This has been valuable for community-building, but less so for making intellectual progress - because in almost all domains, the most important way to make progress is to grapple with many object-level problems, unti...
I don't expect such a sequence to be particularly useful, compared with focusing on more object-level arguments. Eliezer says that the largest mistake he made in writing his original sequences was that he "didn’t realize that the big problem in learning this valuable way of thinking was figuring out how to practice it, not knowing the theory". Better, I expect, to correct the specific mistakes alignment researchers are currently making, until people have enough data points to generalise better.
it seems to me that you want properly to be asking "How do we know this empirical thing ends up looking like it's close to the abstraction?" and not "Can you show me that this abstraction is a very powerful one?"
I agree that "powerful" is probably not the best term here, so I'll stop using it going forward (note, though, that I didn't use it in my previous comment, which I endorse more than my claims in the original debate).
But before I ask "How do we know this empirical thing ends up looking like it's close to the abstraction?", I need to ask "Does the ab...
I'm still trying to understand the scope of expected utility theory, so examples like this are very helpful! I'd need to think much more about it before I had a strong opinion about how much they support Eliezer's applications of the theory, though.
Not a problem. I share many of your frustrations about modesty epistemology and about most alignment research missing the point, so I sympathise with your wanting to express them.
On consequentialism: I imagine that it's pretty frustrating to keep having people misunderstand such an important concept, so thanks for trying to convey it. I currently feel like I have a reasonable outline of what you mean (e.g. to the level where I could generate an analogy about as good as Nate's laser analogy), but I still don't know whether the reason you find it much more c...
My model of Eliezer says that there is some deep underlying concept of consequentialism, of which the "not very coherent consequentialism" is a distorted reflection; and that this deep underlying concept is very closely related to expected utility theory. (I believe he said at one point that he started using the word "consequentialism" instead of "expected utility maximisation" mainly because people kept misunderstanding what he meant by the latter.)
I don't know enough about conservative vector fields to comment, but on priors I'm pretty skeptical of this being a good example of coherent utilities; I also don't have a good guess about what Eliezer would say here.
Thanks! I think that this is a very useful example of an advance prediction of utility theory; and that gathering more examples like this is one of the most promising ways to make progress on bridging the gap between Eliezer's and most other people's understandings of consequentialism.
Potentially important thing to flag here: at least in my mind, expected utility theory (i.e. the property Eliezer was calling "laser-like" or "coherence") and consequentialism are two distinct things. Consequentialism will tend to produce systems with (approximate) coherent expected utilities, and that is one major way I expect coherent utilities to show up in practice. But coherent utilities can in-principle occur even without consequentialism (e.g. conservative vector fields in physics), and consequentialism can in-principle not be very coherent (e.g. if... (read more)
My objection is mostly fleshed out in my other comment. I'd just flag here that "In other words, you have to do things the "hard way"--no shortcuts" assigns the burden of proof in a way which I think is not usually helpful. You shouldn't believe my argument that I have a deep theory linking AGI and evolution unless I can explain some really compelling aspects of that theory. Because otherwise you'll also believe in the deep theory linking AGI and capitalism, and the one linking AGI and symbolic logic, and the one linking intelligence and ethics, and the on...
It also isn't clear to me that Eliezer has established the strong inferences he draws from noticing this general pattern ("expected utility theory/consequentialism"). But when you asked Eliezer (in the original dialogue) to give examples of successful predictions, I was thinking "No, that's not how these things work." In the mistaken applications of Grand Theories you mention (AGI and capitalism, AGI and symbolic logic, intelligence and ethics, recursive self-improvement and cultural evolution, etc.), the easiest way to point out why they are dumb is with ...
I think we live in a world where there are very strong forces opposed to technological progress, which actively impede a lot of impactful work, including technologies which have the potential to be very economically and strategically important (e.g. nuclear power, vaccines, genetic engineering, geoengineering).
This observation doesn't lead me to a strong prediction that all such technologies will be banned; nor even that the most costly technologies will be banned - if the forces opposed to technological progress were even approximately rational, then bann...
Strong upvote, you're pointing at something very important here. I don't think I'm defending epistemic modesty, I think I'm defending epistemic rigour, of the sort that's valuable even if you're the only person in the world.
I suspect Richard isn't actually operating from a frame where he can produce the thing I asked for in the previous paragraphs (a strong model of where expected utility is likely to fail, a strong model of how a lack of "successful advance predictions"/"wide applications" corresponds to those likely failure modes, etc).
Yes, this is corre...
I think some of your confusion may be that you're putting "probability theory" and "Newtonian gravity" into the same bucket. You've been raised to believe that powerful theories ought to meet certain standards, like successful bold advance experimental predictions, such as Newtonian gravity made about the existence of Neptune (quite a while after the theory was first put forth, though). "Probability theory" also sounds like a powerful theory, and the people around you believe it, so you think you ought to be able to produce a powerful advance p...
we already know how to build myopic optimizers
What are you referring to here?
That all makes sense. But I had a skim of (2), (3), (4), and (5) and it doesn't seem like they help explain why myopia is significantly more natural than "obey humans"?
The key idea, in the case of HCH, would be to direct that optimization towards the goal of producing an action that is maximally close to what HCH would do.
Why do you expect this to be any easier than directing that optimisation towards the goal of "doing what the human wants"? In particular, if you train a system on the objective "imitate HCH", why wouldn't it just end up with the same long-term goals as HCH has? That seems like a much more natural thing for it to learn than the concept of imitating HCH, because in the process of imitating HCH it still ha...
Why do you expect this to be any easier than directing that optimisation towards the goal of "doing what the human wants"? In particular, if you train a system on the objective "imitate HCH", why wouldn't it just end up with the same long-term goals as HCH has?
To be clear, I was only talking about (1) here, which is just about what it might look like for an agent to be myopic, not how to actually get an agent that satisfies (1). I agree that you would most likely get a proxy-aligned model if you just trained on “imitate HCH”—but just training on “imitat...
I'm feeling very excited about this agenda. Is there currently a publicly-viewable version of the living textbook? Or any more formal writeup which I can include in my curriculum? (If not I'll include this post, but I expect many people would appreciate a more polished writeup.)