Exactly! That's an optimization-at-a-distance style intuition. The optimizer (e.g. human) optimizes things outside of itself, at some distance from itself.
A rock can arguably be interpreted as optimizing itself, but that's not an interesting kind of "optimization", and the rock doesn't optimize anything outside itself. Throw it in a room, the room stays basically the same.
I thought delegation-to-GPT-N was a central part of the story: i.e., maybe GPT-N knew that the designs could be used for bombs, but it didn't care to tell the human, because the human didn't ask. But from what you're saying now, I guess GPT-N has nothing to do with the story?
The important point (for current purposes) is that, as the things-the-system-is-capable-of-doing-or-building scale up, we want the system's ability to notice subtle problems to scale up with it. If the system is capable of designing complex machines way outside what hum... (read more)
An example might be helpful here: consider the fusion power generator scenario. In that scenario, a human thinking about what they want arrives at the wrong answer, not because of uncertainty about their own values, but because they don't think to ask the right questions about how the world works. That's the sort of thing I have in mind.
In order to handle that sort of problem, an AI has to be able to use human values somehow without carrying over other specifics of how a human would reason about the situation.
I don’t think “the human is deciding whether or
We don't necessarily need the AGI itself to have human-like drives, intuitions, etc. It just needs to be able to model the human reasoning algorithm well enough to figure out what values humans assign to e.g. an em.
(I expect an AI which relied heavily on human-like reasoning for things other than values would end up doing something catastrophically stupid, much as humans are prone to do.)
I expect that there will be concepts the AI finds useful which humans don't already understand. But these concepts should still be of the same type as human concepts - they're still the same kind of natural abstraction. Analogy: a human who grew up in a desert tribe with little contact with the rest of the world may not have any concept of "snow", but snow is still the kind-of-thing they're capable of understanding if they're ever exposed to it. When the AI uses concepts humans don't already have, I expect them to be like that.
As long as the concepts are the type of thing humans can recognize/understand, then it should be conceptually straightforward to model how humans would reason about those concepts or value them.
This part of Proof Strategy 1 is a basically-accurate description of what I'm working towards:
We try to come up with an unambiguous definition of what [things] are, and from there we can have an unambiguous language for specifying what we want to happen in the world. We also somehow translate (or constrain) the AGI’s understanding of the world into that language, and now we can prove theorems about what the AGI is trying to do.
... it's just not necessarily about objects localized in 3D space.
Also, there are several possible paths, and they don't all require ... (read more)
Proof strategy #1 starts with the idea that we live in a three-dimensional world containing objects and so on. We try to come up with an unambiguous definition of what those objects are, and from there we can have an unambiguous language for specifying what we want to happen in the world. We also somehow translate (or constrain) the AGI’s understanding of the world into that language, and now we can prove theorems about what the AGI is trying to do.
This is my tentative understanding of what John Wentworth is trying to do via his Natural Abstraction Hypothesis
to be paid out to anyone you think did a fine job of distilling the thing
Needing to judge submissions is the main reason I didn't offer a bounty myself. Read the distillation, and see if you yourself understand it. If "Coherence of Distributed Decisions With Different Inputs Implies Conditioning" makes sense as a description of the idea, then you've probably understood it.
If you don't understand it after reading an attempted distillation, then it wasn't distilled well enough.
I'd like to complain that this project sounds epistemically absolutely awful. It's offering money for arguments explicitly optimized to be convincing (rather than true), it offers prizes only for arguments making one particular side of the case (i.e. no money for arguments that AI risk is no big deal), and to top it off it's explicitly asking for one-liners.
I understand that it is plausibly worth doing regardless, but man, it feels so wrong having this on LessWrong.
Short answer: about one full day.
Longer answer: normally something like this would sit in my notebook for a while, only informing my own thinking. It would get written up as a post mainly if it were adjacent to something which came up in conversation (either on LW or in person). I would have the idea in my head from the conversation, already be thinking about how best to explain it, chew on it overnight, and then if I'm itching to produce something in the morning I'd bang out the post in about 3-4 hours.
Alternative paths: I might need this idea as backgrou... (read more)
I haven't put a distillation bounty on this, but if anyone else wants to do so, leave a comment and I'll link to it in the OP.
If you wouldn't mind one last question before checking out: where did that formula you're using come from?
Oh, melting the GPUs would not actually be a pivotal act. There would need to be some way to prevent new GPUs from being built in order for it to be a pivotal act.
Military capability is not strictly necessary; a pivotal act need not necessarily piss off world governments. AGI-driven propaganda, for instance, might avoid that.
Alternatively, an AGI could produce nanomachines which destroy GPUs, are extremely hard to eradicate, but otherwise don't do much of anything.
(Note that these aren't intended to be very good/realistic suggestions, they're just meant to point to different dimensions of the possibility space.)
+1 to the distinction between "Regulating AI is possible/impossible" vs "pivotal act framing is harmful/unharmful".
I'm sympathetic to a view that says something like "yeah, regulating AI is Hard, but it's also necessary because a unilateral pivotal act would be Bad". (TBC, I'm not saying I agree with that view, but it's at least coherent and not obviously incompatible with how the world actually works.) To properly make that case, one has to argue some combination of:
In fact, before you get to AGI, your company will probably develop other surprising capabilities, and you can demonstrate those capabilities to neutral-but-influential outsiders who previously did not believe those capabilities were possible or concerning. In other words, outsiders can start to help you implement helpful regulatory ideas...
It is not for lack of regulatory ideas that the world has not banned gain-of-function research.
It is not for lack of demonstration of scary gain-of-function capabilities that the world has not banned gain-of-functi... (read more)
There are/could be crucial differences between GoF and some AGI examples. E.g., a convincing demonstration of the ability to overthrow the government. States are also agents, and also have convergent instrumental goals. GoF research seems much more threatening to individual humans, but not that threatening to states or governments.
Various thoughts that this inspires:
Gain of Function Ban as Practice-Run/Learning for relevant AI Bans
I have heard vague-musings-of-plans in the direction of "get the world to successfully ban Gain of Function research, as a practice-case for getting the world to successfully ban dangerous AI."
I have vague memories of the actual top bio people around not being too focused on this, because they thought there were easier ways to make progress on biosecurity. (I may be conflating a few different statements – they might have just been critiquing a particular ... (read more)
That is definitely a selection theorem, and sounds like a really cool one! Well done.
Update: I too have now spent like 1.5 hours reading about AC design and statistics, and I can now give a reasonable guess at exactly where the I-claim-obviously-ridiculous 20-30% number came from. Summary: the SACC/CEER standards use a weighted mix of two test conditions, with 80% of the weight on conditions in which outdoor air is only 3°F/1.6°C hotter than indoor air.
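To see how that weighting plays out, here's a toy calculation; the 80/20 weights and the 3°F condition are from the summary above, while the BTU numbers, the infiltration penalty, and the hot-day deltas are assumptions for illustration:

```python
def net_cooling(rated_btu, infiltration_btu_per_degF, delta_T):
    # cooling delivered after the infiltration penalty a single-hose
    # unit pays for pulling hot outdoor air back into the room
    return rated_btu - infiltration_btu_per_degF * delta_T

rated = 10_000      # nameplate BTU/hr (assumed)
infiltration = 200  # BTU/hr lost per degF of indoor/outdoor gap (assumed)

# 80% weight on the mild condition (outdoor 3 F above indoor),
# 20% weight on a hotter condition (assumed 15 F above indoor)
weighted = 0.8 * net_cooling(rated, infiltration, 3) \
         + 0.2 * net_cooling(rated, infiltration, 15)
hot_day = net_cooling(rated, infiltration, 20)  # an actually-hot day (assumed)

print(weighted / rated)  # ~0.89: the mild condition dominates the rating
print(hot_day / rated)   # ~0.60: the hot-day penalty is much larger
```

Under numbers like these, a rating built mostly from the 3°F condition would show only a modest single-hose penalty even when the hot-day penalty is large.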
The whole backstory of the DOE's SACC/CEER rating rules is here. Single-hose air conditioners take center stage. The comments on the DOE's rule proposals can basically be summarized as:
... and will both be reduced by a factor of about (exhaust - outdoor) / (exhaust - indoor) which will be much more than 50%.
I assume you mean much less than 50%, i.e. (T_outside - T_inside) averaged over the room will be less than 50% greater with two hoses than with one?
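Plugging illustrative numbers into that factor (the 130°F exhaust figure is discussed elsewhere in the thread; the outdoor and indoor temperatures are my assumptions):

```python
t_exhaust = 130.0  # exhaust air temperature (F), figure from the thread
t_outdoor = 95.0   # hot-day outdoor temperature (F), assumed
t_indoor = 75.0    # indoor temperature (F), assumed

factor = (t_exhaust - t_outdoor) / (t_exhaust - t_indoor)
print(factor)  # ~0.64, i.e. roughly a 36% reduction under these assumptions
```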
I'm open to such a bet in principle, pending operational details. $1k at even odds?
Operationally, I'm picturing the general plan I sketched four comments upthread. (In particular note the three bulleted conditions starting with "The day being hot enough and the room large enough that the A... (read more)
Or you could get to it before I do and I could perform a replication.
I bought my single-hose AC for the 2019 heat wave in Mountain View (which was presumably basically similar to Berkeley).
When I was in Vegas, summer was just three months of permanent extreme heat during the day; one does not stay somewhere without built-in AC in Vegas.
I'm curious what your BOTEC was / if you think 130 is too high an estimate for the exhaust temp?
I don't remember what calculation I did then, but here's one with the same result. Model the single-hose air conditioner as removing air from the room, and replacing it with a mix of air at two temperatures: T_C (the temperature of cold air coming from the air conditioner), and T_H (the temperature outdoors). If we assume that T_C is constant and that the cold and hot air are introduced in roughly 1:1 proportions (i.e. the flow rate from ... (read more)
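A minimal simulation of that mixing model; the 1:1 mixing is the assumption stated above, while T_C, T_H, and the air-replacement rate are numbers I picked for illustration:

```python
T_C = 55.0  # cold-output air temperature (F), assumed
T_H = 95.0  # outdoor air temperature (F), assumed

T_room = T_H  # start the room at outdoor temperature
for _ in range(200):
    # each step, 10% of the room's air is replaced by a 1:1 mix of
    # cold output and infiltrating outdoor air
    T_room = 0.9 * T_room + 0.1 * 0.5 * (T_C + T_H)

print(T_room)  # converges to (T_C + T_H) / 2 = 75 F
```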
... I actually already started a post titled "Preregistration: Air Conditioner Test (for AI Alignment!)". My plan was to use the one-hose AC I bought a few years ago during that heat wave, rig up a cardboard "second hose" for it, and try it out in my apartment both with and without the second hose next time we have a decently-hot day. Maybe we can have an air conditioner test party.
Predictions: the claim which I most do not believe right now is that going from one hose to two hose with the same air conditioner makes only a 20%-30% difference. The main metr... (read more)
Alright, I am more convinced than I was about the temperature issue, but the test setup still sounds pretty bad.
First, Boston does not usually get all that sweltering. I grew up in Connecticut (close to Boston and similar weather), summer days usually peaked in the low 80's. Even if they waited for a really hot week, it was probably in the 90's. A quick google search confirms this: typical July daily high temp is 82, and google says "Overall during July, you should expect about 4-6 days to reach or exceed 90 F (32C) while the all-time record high for Bosto... (read more)
The best thing we took away from our tests was the chance at a direct comparison between a single-hose design and a dual-hose design that were otherwise identical, and our experience confirmed our suspicions that dual-hose portable ACs are slightly more effective than single-hose models but not effective enough to make a real difference
I roll to disbelieve. I think it is much more likely that something is wrong with their test setup than that the difference between one-hose and two-hose is negligible.
Just on priors, the most obvious problem is that they're... (read more)
(Also, I expect it to seem like I am refusing to update in the face of any evidence, so I'd like to highlight that this model correctly predicted that the tests were run someplace where it was not hot outside. Had that evidence come out different, I'd be much more convinced right now that one hose vs two doesn't really matter.)
From how we tested:
Over the course of a sweltering summer week in Boston, we set up our five finalists in a roughly 250-square-foot space, taking notes and rating each model on the basic setup process, performance, portability, acces
On the physics: to be clear, I'm not saying the air conditioner does not work at all. It does make the room cooler than it started, at equilibrium.
I also am not surprised (in this particular example) to hear that various expert sources already account for the inefficiency in their evaluations; it is a problem which should be very obvious to experts. Of course that doesn't apply so well to e.g. the example of medical research replication failures. The air conditioner example is not meant to be an example of something which is really hard to notice for human... (read more)
In this particular case, I indeed do not think the conflict is worth the cost of exploring - it seems glaringly obvious that people are buying a bad product because they are unable to recognize the ways in which it is bad.
The Wirecutter recommendation for budget portable ACs is a single-hose model. Until very recently their overall recommendation was also a single-hose model.
The Wirecutter recommendations (and other pages discussing these tradeoffs) are based on a combination of "how cold does it make the room empirically?" and quantitative estimates of coo... (read more)
There is an important difference here between "obvious in advance" and "obvious in hindsight", but your basic point is fair, and the virus example is a good one. Humanity's current state is indeed so spectacularly incompetent that even the obvious problems might not be solved, depending on how things go.
Like, I think that most ML researchers agree that the Facebook Newsfeed algorithm is optimizing for clicks in a way people are somewhat unhappy about, and this is based substantially on their personal experience with it; inasmuch as we’re interacting a lot with sort-of-smart ML systems, I think we’ll notice their slight misalignment.
This prediction feels like... it doesn't play out the whole game tree? Like, yeah, Facebook releases one algorithm optimizing for clicks in a way people are somewhat unhappy about. But the customers are unhappy about it, which ... (read more)
I agree with the basic difference you point to between fast- and slow-takeoff worlds, but disagree that it has important strategic implications for the obviousness of takeover risk.
In slow takeoff worlds, many aspects of the alignment problem show up well before AGI goes critical. However, people will by-default train systems to conceal those problems. (This is already happening: RL from human feedback is exactly the sort of strategy which trains systems to conceal problems, and we've seen multiple major orgs embracing it within the past few months.) As a ... (read more)
I expect that people will find it pretty obvious that RLHF leads to somewhat misaligned systems, if they are widely used by the public. Like, I think that most ML researchers agree that the Facebook Newsfeed algorithm is optimizing for clicks in a way people are somewhat unhappy about, and this is based substantially on their personal experience with it; inasmuch as we’re interacting a lot with sort-of-smart ML systems, I think we’ll notice their slight misalignment. And so I do think that this will make AI takeover risk more obvious.
Examples of small AI c... (read more)
Pithy one-sentence summary: to the extent that I value corrigibility, a system sufficiently aligned with my values should be corrigible.
A heuristic argument that says "evaluation isn't easier than generation, and that claim is true regardless of how good you are at evaluation until you get basically perfect at it" seems obviously wrong to me.
Yup, that sounds like a crux. Bookmarked for later.
A human writing their own alignment proposal has introspective access to the process-which-generates-the-proposal, and can get a ton of bits from that. They can trust the process, rather than just the output.
A human who is good at making their own thinking process legible to others, coupled with an audience who knows to look for that, could get similar benefits in a more distributed manner. Faking a whole thought process is more difficult, for a human, than simply faking an output. That does not apply nearly as well to an AI; it is far more likely that the ... (read more)
That falls squarely under the "other reasons to think our models are not yet deceptive" - i.e. we have priors that we'll see models which are bad at deception before models become good at deception. The important evidential work there is being done by the prior.
Consider the space of 10-page google docs. Within this space, we pick out all the google docs which some human evaluator would consider a good alignment proposal. (You can imagine the human is assisted in some way if you want, it makes little difference to this particular argument.) Then the question is, what fraction of these will actually be good alignment proposals? So, we have two relevant numbers:
Now, the key heuristic: in a high-dimen... (read more)
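To make the shape of that argument concrete, here is a toy calculation; both exponents are made up, since the argument only needs the looks-good set to be exponentially larger than the looks-good-and-is-good set:

```python
# log2 of the fraction of 10-page docs the human evaluator would pass (assumed)
log2_frac_looks_good = -1000
# log2 of the fraction which pass AND are actually good proposals (assumed)
log2_frac_looks_and_is_good = -1100

# P(actually good | looks good) is just the ratio of the two measures
log2_p = log2_frac_looks_and_is_good - log2_frac_looks_good
print(f"P(good | looks good) = 2**{log2_p}")  # 2**-100 under these assumptions
```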
I understand that deceptive models won't show signs of deception :) That's why I made the remark of models not showing signs of prerequisites to scary kinds of deception. Unless you think there are going to be no signs of deception or any prerequisites, for any models before we get deceptive ones?
It also seems at least plausible that models will be imperfectly deceptive before they are perfectly deceptive, in which case we will see signs (e.g., in smaller models)
I think it's very unclear how big a problem Goodhart is for alignment research - it seems like a question about a particular technical domain.
Just a couple weeks ago I had this post talking about how, in some technical areas, we've been able to find very robust formulations of particular concepts (i.e. "True Names"). The domains where evaluation is much easier - math, physics, CS - are the domains where we have those robust formulations. Even within e.g. physics, evaluation stops being easy when we're in a domain where we don't have a robust mathematical f... (read more)
This seems to completely ignore the main problem with approaches which try to outsource alignment research to AGI: optimizing for alignment strategies which look promising to a human reviewer will also automatically incentivize strategies which fool the human reviewer. Evaluation is not actually easier than generation, when Goodhart is the main problem to begin with.
Evaluation is not actually easier than generation, when Goodhart is the main problem to begin with.
I think it's very unclear how big a problem Goodhart is for alignment research - it seems like a question about a particular technical domain. There are domains where evaluation is much easier; most obviously mathematics, but also in e.g. physics or computer science, there are massive gaps between recognition and generation even if you don't have formal theorem statements. There are also domains where it's not much easier, where the whole thing rests on compl... (read more)
Evaluation assistance as mentioned in the post on AI-assisted human feedback could help people avoid being fooled (e.g. in debate where the opponent can point out how you're being fooled). It's still an open question how well that will work in practice and how quickly it will Goodhart (these techniques should fail on some things, as discussed in the ELK report), but it seems possible that models will be helpful enough on alignment before they Goodhart.
If it turns out that evaluation of alignment proposals is not easier than generation, we're in pretty big trouble because we'll struggle to convince others that any good alignment proposals humans come up with are worth implementing.
You could still argue by generalization that we should use alignment proposals produced by humans who had a lot of good proposals on other problems even if we're not sure about those alignment proposals. But then you're still susceptible to the same kinds of problems.
Precision feels pretty far from the true name of the important feature of true names
You're right, I wasn't being sufficiently careful about the wording of a bolded sentence. I should have said "robust" where it said "precise". Updated in the post; thank you.
Also I basically agree that robustness to optimization is not the True Name of True Names, though it might be a sufficient condition.
The Hardness of computing mutual information in general is not a very significant barrier to designing systems with (near-)zero mutual information between two components, in exactly the same way that the Hardness of computing whether a given program halts in general is not a very significant barrier to designing software which avoids infinite loops.
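A sketch of what designing for zero mutual information can look like, on the analogy above; the point is that independence here follows from the program's structure rather than from computing anything:

```python
import secrets

def component_a() -> int:
    # draws fresh OS entropy; shares no inputs or state with component_b
    return secrets.randbelow(256)

def component_b() -> int:
    # likewise, independent of component_a by construction
    return secrets.randbelow(256)

# I(A; B) = 0 follows from the two components sharing no inputs or state,
# just as `for i in range(n): ...` provably halts without any general
# solution to the halting problem.
a, b = component_a(), component_b()
```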
There is no way to confirm zero mutual information, and even if there was there is zero probability that the mutual information was zero. Very small, perhaps. Zero, no.
Thanks for bringing this up; it raises a technical point which didn't make sense to include in the post but which I was hoping someone would raise in the comments.
The key point: Goodhart problems are about generalization, not approximation.
Suppose I have a proxy u′ for a true utility function u, and u′ is always within ϵ of u (i.e. |u′−u|<ϵ... (read more)
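To spell out the approximation half of that claim in the notation above (a standard argument, included here for completeness):

```latex
% If |u'(x) - u(x)| < \epsilon for all x, let x^* maximize the proxy u'
% and x_{opt} maximize the true utility u. Then:
\[
u(x^*) > u'(x^*) - \epsilon \geq u'(x_{\mathrm{opt}}) - \epsilon > u(x_{\mathrm{opt}}) - 2\epsilon .
\]
% So a uniformly good approximation loses at most 2\epsilon under
% optimization; serious Goodhart failures require the bound itself to
% fail off-distribution, i.e. a generalization problem.
```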
This is an interesting observation; I don't see how it addresses my point.
There is no exact solution to mutual information from two finite samples. There is no ϵ-approximation of mutual information from two finite samples, either.
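A quick illustration of the finite-sample point: the plug-in estimator comes out strictly positive even on genuinely independent variables (all numbers below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, bins = 1000, 8
x = rng.integers(0, bins, size=n)  # independent of y, so true I(X;Y) = 0
y = rng.integers(0, bins, size=n)

joint, _, _ = np.histogram2d(x, y, bins=bins)
p_xy = joint / n
p_x = p_xy.sum(axis=1, keepdims=True)
p_y = p_xy.sum(axis=0, keepdims=True)
mask = p_xy > 0
mi = np.sum(p_xy[mask] * np.log2(p_xy[mask] / (p_x @ p_y)[mask]))
print(f"plug-in MI estimate: {mi:.3f} bits")  # > 0 despite true MI of 0
```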
On the topic of said observation: beware that ϵ-approximations of many things are proven difficult to compute, and in some cases even are uncomputable. (The classic being Chaitin's Constant.)
In particular, you very often end up with Halting-problem style contradictions when computing properties of systems capable... (read more)
Revenue of a nail factory is a good proxy for the quality of the nails produced, but only within a fairly small bubble around our current world. You can't make the factory-owner too smart, or the economy too irrational, or allow for too many technological breakthroughs to happen, or else the proxy breaks.
I think you missed the point of that particular metaphor. The claim was not that revenue of a nail factory is a robust operationalization of nail value. The claim was that a competitive nail market plus nail-maker reputation tracking is a True Name for a p... (read more)
Imagine it's 1665 and we're trying to figure out the True Name of physical force - i.e. how hard it feels like something is pushing or pulling.
One of the first steps is to go through our everyday experience, paying attention to what causes stronger/weaker sensations of pushing and pulling, or what effects stronger/weaker sensations have downstream. We might notice, for instance, that heavier objects take more force to push, or that a stronger push accelerates things faster. So, we might expect to find some robust relationship between the True Names of forc... (read more)
It's not just human values that are messy and contingent, even the pointer we want to use to gesture to those-things-we-want-to-treat-as-our-values is messy and contingent.
What's the evidence for this claim?
When I look at e.g. nails, the economic value of a nail seems reasonably complicated. Yet the "pointers to nail value" which we use in practice - i.e. competitive markets and reputation systems - do have clean, robust mathematical formulations.
Furthermore, before the mid-20th century, I expect that most people would have expected that competitive market... (read more)
It's not clear to me that your metaphors are pointing at something in particular.
Revenue of a nail factory is a good proxy for the quality of the nails produced, but only within a fairly small bubble around our current world. You can't make the factory-owner too smart, or the economy too irrational, or allow for too many technological breakthroughs to happen, or else the proxy breaks. If this was all we needed, then yes, absolutely, I'm sure there's a similarly neat and simple way to instrumentalize human values - it's just going to fail if things are too ... (read more)
To be clear, I do not mean to use the label "mainline prediction" for this whole technique. Mainline prediction tracking is one way of implementing this general technique, and I claim that the usefulness of the general technique is the main reason why mainline predictions are useful to track.
(Also, it matches up quite well with Nate's model based on his comment here, and I expect it also matches how Eliezer wants to use the technique.)
As I understand it, when you "talk about the mainline", you're supposed to have some low-entropy (i.e. confident) view on how the future goes, such that you can answer very different questions X, Y and Z about that particular future, that are all correlated with each other, and all get (say) > 50% probability. (Idk, as I write this down, it seems so obviously a bad way to reason that I feel like I must not be understanding it correctly.)
But to the extent this is right, I'm actually quite confused why anyone thinks "talk about the mainline" is an ideal t
This came up with Aysajan about two months ago. An exercise which I recommended for him: first, pick a technical academic paper. Read through the abstract and first few paragraphs. At the end of each sentence (or after each comma, if the authors use very long sentences), pause and write/sketch a prototypical example of what you currently think they're talking about. The goal here is to get into the habit of keeping a "mental picture" (i.e. prototypical example) of what the authors are talking about as you read.
Other good sources on which to try this exerci... (read more)
I find that plausible, a priori. Mostly doesn't affect the stuff in the talk, since that would still come from the environment, and the same principles would apply to culturally-derived values as to environment-derived values more generally. Assuming the hardwired part is figured out, we should still be able to get an estimate of human values within the typical-human-value-distribution-for-a-given-culture from data which is within the typical-human-environment-distribution-for-that-culture.
No plans in motion. Thank you very much if you decide to do so! Also, you might want to message Rob to get the images.
You can think of everything I'm doing as occurring in a "God's eye" model. I expect that an agent embedded in this God's-eye model will only be able to usefully measure natural abstractions within the model. So, shifting to the agent's perspective, we could say "holding these abstractions fixed, what possible models are compatible with them?". And that is indeed a direction I plan to go. But first, I want to get the nicest math I possibly can for computing the abstractions within a model, because the cleaner that is the cleaner I expect that computing poss... (read more)
I think a lot of 10%ers could learn to do wedding-cake multiplication, if sufficiently well-paid as adults rather than being tortured in school, out to 6 digits
What is wedding-cake multiplication? A quick google just turned up a lot of people who want to sell me wedding cakes...
how is it that we ever accomplish anything in practice, if the search space is vast, and things that both work and look like they work are exponentially rare?
This question needs a whole essay (or several) on its own. If I don't get around to leaving a longer answer in the next few days, ping me.
Meanwhile, if you want to think it through for yourself, the general question is: where the hell do humans get all their bits-of-search from?
How is the "the genome is small, therefore generators of human values (that can't be learned from the environment) are no more c