# All of johnswentworth's Comments + Replies

Optimization at a Distance

Exactly! That's an optimization-at-a-distance style intuition. The optimizer (e.g. human) optimizes things outside of itself, at some distance from itself.

A rock can arguably be interpreted as optimizing itself, but that's not an interesting kind of "optimization", and the rock doesn't optimize anything outside itself. Throw it in a room, the room stays basically the same.

[Intro to brain-like-AGI safety] 14. Controlled AGI

I thought delegation-to-GPT-N was a central part of the story: i.e., maybe GPT-N knew that the designs could be used for bombs, but it didn't care to tell the human, because the human didn't ask. But from what you're saying now, I guess GPT-N has nothing to do with the story?

Basically, yeah.

The important point (for current purposes) is that, as the things-the-system-is-capable-of-doing-or-building scale up, we want the system's ability to notice subtle problems to scale up with it. If the system is capable of designing complex machines way outside what hum... (read more)

[Intro to brain-like-AGI safety] 14. Controlled AGI

An example might be helpful here: consider the fusion power generator scenario. In that scenario, a human thinking about what they want arrives at the wrong answer, not because of uncertainty about their own values, but because they don't think to ask the right questions about how the world works. That's the sort of thing I have in mind.

In order to handle that sort of problem, an AI has to be able to use human values somehow without carrying over other specifics of how a human would reason about the situation.

I don’t think “the human is deciding whether or

4Steve Byrnes5d
It's possible that I misunderstood what you were getting at in that post. I thought delegation-to-GPT-N was a central part of the story: i.e., maybe GPT-N knew that the designs could be used for bombs, but it didn't care to tell the human, because the human didn't ask. But from what you're saying now, I guess GPT-N has nothing to do with the story? You could have equally well written the post as “Suppose, a few years from now, I set about trying to design a cheap, simple fusion power generator - something I could build in my garage and use to power my house. After years of effort, I succeed….” Is that correct?

If so, I think that’s a problem that can be mitigated in mundane ways (e.g. mandatory inventor training courses spreading best-practices for brainstorming unanticipated consequences, including red-teams, structured interviews, etc.), but can't be completely solved by humans. But it also can’t be completely solved by any possible AI, because AIs aren’t and will never be omniscient, and hence may make mistakes or overlook things, just as humans can.

Maybe you're thinking that we can make AIs that are less prone to human foibles like wishful thinking and intellectual laziness etc.? But I’m optimistic that we can make “social instinct” brain-like AGIs that are also unusually good at avoiding those things (after all, some humans are significantly better than others at avoiding those things, while still having normal-ish social instincts and moral intuitions).
[Intro to brain-like-AGI safety] 14. Controlled AGI

We don't necessarily need the AGI itself to have human-like drives, intuitions, etc. It just needs to be able to model the human reasoning algorithm well enough to figure out what values humans assign to e.g. an em.

(I expect an AI which relied heavily on human-like reasoning for things other than values would end up doing something catastrophically stupid, much as humans are prone to do.)

5Steve Byrnes6d
I don’t think “the human is deciding whether or not she cares about Ems” is a different set of mental activities from “the human is trying to make sense of a confusing topic”, or “the human is trying to prove a theorem”, etc.

So from my perspective, what you said sounds like “Write code for a Social-Instinct AGI [https://www.alignmentforum.org/posts/5F5Tz3u6kJbTNMqsb/intro-to-brain-like-agi-safety-13-symbol-grounding-and-human] , and then stamp the word subroutine on that code, and then make an “outer AI” with the power to ‘query’ that ‘subroutine’.” From that perspective, I would be concerned that if the (so-called) subroutine never wanted to do anything bad or stupid, then the outer AI is redundant, and if the (so-called) subroutine did want to do something bad or stupid, then the outer AI may not be able to recognize and stop it.

Separately, shouldn't “doing something catastrophically stupid” become progressively less of an issue as the AGI gets “smarter”? And insofar as caution / risk-aversion / etc. is a personality type, presumably we could put a healthy dose of it into our AGIs.
[Intro to brain-like-AGI safety] 14. Controlled AGI

I expect that there will be concepts the AI finds useful which humans don't already understand. But these concepts should still be of the same type as human concepts - they're still the same kind of natural abstraction. Analogy: a human who grew up in a desert tribe with little contact with the rest of the world may not have any concept of "snow", but snow is still the kind-of-thing they're capable of understanding if they're ever exposed to it. When the AI uses concepts humans don't already have, I expect them to be like that.

As long as the concepts are the type of thing humans can recognize/understand, then it should be conceptually straightforward to model how humans would reason about those concepts or value them.

4Steve Byrnes6d
Let’s say that the concept of an Em [https://ageofem.com/] had never occurred to me before, and now you knock on my door and tell me that there’s a thing called Ems, and you know how to make them but you need my permission, and now I have to decide whether or not I care about the well-being of Ems. What do I do? I dunno, I would think about the question in different ways, I would try to draw analogies to things I already knew about, maybe I would read some philosophy papers, and most of all I would be implicitly probing my own innate "caring" reaction(s) and seeing exactly what kinds of thoughts do or don't trigger it.

Can we make an AGI that does all that? I say yes: we can build an AGI with human-like “innate drives” such that it has human-like moral intuitions, and then it applies those human-like intuitions in a human-like way when faced with new out-of-distribution situations. That’s what I call the “Social-Instinct AGI” research path, see Post #12 [https://www.alignmentforum.org/posts/Sd4QvG4ZyjynZuHGt/intro-to-brain-like-agi-safety-12-two-paths-forward] .

But if we can do that, we’ve already arguably solved the whole AGI safety problem. I suspect you have something different in mind?
[Intro to brain-like-AGI safety] 14. Controlled AGI

This part of Proof Strategy 1 is a basically-accurate description of what I'm working towards:

We try to come up with an unambiguous definition of what [things] are, and from there we can have an unambiguous language for specifying what we want to happen in the world. We also somehow translate (or constrain) the AGI’s understanding of the world into that language, and now we can prove theorems about what the AGI is trying to do.

... it's just not necessarily about objects localized in 3D space.

Also, there's several possible paths, and they don't all require ... (read more)

4Steve Byrnes6d
Thanks! One of my current sources of mild skepticism right now (which again you might talk me out of) is:

* For capabilities reasons, the AGI will probably need to be able to add things to its world-model / ontology, including human-illegible things, and including things that don't exist in the world but which the AGI imagines (and could potentially create).
* If the AGI is entertaining a plan of changing the world in important ways (e.g. inventing and deploying mind-upload technology, editing its own code, etc.), it seems likely that the only good way of evaluating whether it's a good plan would involve having opinions about features of the future world that the plan would bring about—as opposed to basing the evaluation purely on current-world-features of the plan, like the process by which it was made.
* …And in that case, it's not sufficient to have rigorous concepts / things that apply in our world, but rather we need to be able to pick those concepts / things out of any possible future world that the AGI might bring about.
* I'm mildly skeptical that we can find / define such concepts / things, especially for things that we care about like “corrigibility”.
* …And thus the story needs something along the lines of out-of-distribution edge-case detection and handling systems like Section 14.4.
[Intro to brain-like-AGI safety] 14. Controlled AGI

Proof strategy #1 starts with the idea that we live in a three-dimensional world containing objects and so on. We try to come up with an unambiguous definition of what those objects are, and from there we can have an unambiguous language for specifying what we want to happen in the world. We also somehow translate (or constrain) the AGI’s understanding of the world into that language, and now we can prove theorems about what the AGI is trying to do.

This is my tentative understanding of what John Wentworth is trying to do via his Natural Abstraction Hypothesis…

2Steve Byrnes7d
Thanks! Follow-up question: Do you see yourself as working towards “Proof Strategy 2”? Or “none of the above”?
[Request for Distillation] Coherence of Distributed Decisions With Different Inputs Implies Conditioning

to be paid out to anyone you think did a fine job of distilling the thing

Needing to judge submissions is the main reason I didn't offer a bounty myself. Read the distillation, and see if you yourself understand it. If "Coherence of Distributed Decisions With Different Inputs Implies Conditioning" makes sense as a description of the idea, then you've probably understood it.

If you don't understand it after reading an attempted distillation, then it wasn't distilled well enough.

3jacobjacob3d
An update on this: sadly I underestimated how busy I would be after posting this bounty. I spent 2h reading this and Thomas' post the other day, but didn't manage to get into the headspace of evaluating the bounty (i.e. making my own interpretation of John's post, and then deciding whether Thomas' distillation captured that). So I will not be evaluating this. (Still happy to pay if someone else I trust claims Thomas' distillation was sufficient.) My apologies to John and Thomas about that.
[$20K in Prizes] AI Safety Arguments Competition

I'd like to complain that this project sounds epistemically absolutely awful. It's offering money for arguments explicitly optimized to be convincing (rather than true), it offers money only for arguments making one particular side of the case (i.e. no money for arguments that AI risk is no big deal), and to top it off it's explicitly asking for one-liners.

I understand that it is plausibly worth doing regardless, but man, it feels so wrong having this on LessWrong.

3David Manheim21d
Think of it as a "practicing a dark art of rationality" post, and I'd think it would seem less off-putting.

[Request for Distillation] Coherence of Distributed Decisions With Different Inputs Implies Conditioning

Short answer: about one full day.

Longer answer: normally something like this would sit in my notebook for a while, only informing my own thinking. It would get written up as a post mainly if it were adjacent to something which came up in conversation (either on LW or in person). I would have the idea in my head from the conversation, already be thinking about how best to explain it, chew on it overnight, and then if I'm itching to produce something in the morning I'd bang out the post in about 3-4 hours.

Alternative paths: I might need this idea as backgrou... (read more)

4jacobjacob22d
Cool, I'll add $500 to the distillation bounty then, to be paid out to anyone you think did a fine job of distilling the thing :) (Note: this should not be read as my monetary valuation for a day of John work!) (Also, a cooler pay-out would be basis points, or less, of Wentworth impact equity)
[Request for Distillation] Coherence of Distributed Decisions With Different Inputs Implies Conditioning

I haven't put a distillation bounty on this, but if anyone else wants to do so, leave a comment and I'll link to it in the OP.

2jacobjacob23d
How long would it have taken you to do the distillation step yourself for this one? I'd be happy to post a bounty, but price depends a bit on that.
Everything I Need To Know About Takeoff Speeds I Learned From Air Conditioner Ratings On Amazon

If you wouldn't mind one last question before checking out: where did that formula you're using come from?

“Pivotal Act” Intentions: Negative Consequences and Fallacious Arguments

Oh, melting the GPUs would not actually be a pivotal act. There would need to be some way to prevent new GPUs from being built in order for it to be a pivotal act.

Military capability is not strictly necessary; a pivotal act need not necessarily piss off world governments. AGI-driven propaganda, for instance, might avoid that.

Alternatively, an AGI could produce nanomachines which destroy GPUs, are extremely hard to eradicate, but otherwise don't do much of anything.

(Note that these aren't intended to be very good/realistic suggestions, they're just meant to point to different dimensions of the possibility space.)

3interstice1mo
Well yeah, that's my point. It seems to me that any pivotal act worthy of the name would essentially require the AI team to become an AGI-powered world government, which seems pretty darn difficult to pull off safely. The superpowered-AI-propaganda plan falls under this category. The long-lasting nanomachines idea is cute, but I bet people would just figure out ways to evade the nanomachines' definition of 'GPU'.

Fair enough...but if the pivotal act plan is workable, there should be some member of that space which actually is good/seems like it has a shot of working out in reality (and which wouldn't require a full FAI). I've never heard any and am having a hard time thinking of one.

Now it could be that MIRI or others think they have a workable plan which they don't want to share the details of due to infohazard concerns. But as an outside observer, I have to assign a certain amount of probability to that being self-delusion.
“Pivotal Act” Intentions: Negative Consequences and Fallacious Arguments

+1 to the distinction between "Regulating AI is possible/impossible" vs "pivotal act framing is harmful/unharmful".

I'm sympathetic to a view that says something like "yeah, regulating AI is Hard, but it's also necessary because a unilateral pivotal act would be Bad". (TBC, I'm not saying I agree with that view, but it's at least coherent and not obviously incompatible with how the world actually works.) To properly make that case, one has to argue some combination of:

• A unilateral pivotal act would be so bad that it's worth accepting a much higher chance of
4interstice1mo
What I've never understood about the pivotal act plan is exactly what the successful AGI team is supposed to do after melting the GPUs or whatever. Every government on Earth will now consider them their enemy; they will immediately be destroyed unless they can defend themselves militarily, then countries will simply rebuild the GPU factories and continue on as before (except now in a more combative, disrupted, AI-race-encouraging geopolitical situation).

So any pivotal act seems to require, at a minimum, an AI capable of militarily defeating all countries' militaries. Then in order to not have society collapse, you probably need to become the government yourself, or take over or persuade existing governments to go along with your agenda. But an AGI that would be capable of doing all this safely seems...not much easier to create than a full-on FAI? It's not like you could get by with an AI that was freakishly skilled at designing nanomachines but nothing else, you'd need something much more general. But isn't the whole idea of the pivotal act plan that you don't need to solve alignment in full generality to execute a pivotal act?

For these reasons, executing a unilateral pivotal act (that actually results in an x-risk reduction) does not seem obviously easier than convincing governments to me.
“Pivotal Act” Intentions: Negative Consequences and Fallacious Arguments

In fact, before you get to AGI, your company will probably develop other surprising capabilities, and you can demonstrate those capabilities to neutral-but-influential outsiders who previously did not believe those capabilities were possible or concerning.  In other words, outsiders can start to help you implement helpful regulatory ideas...

It is not for lack of regulatory ideas that the world has not banned gain-of-function research.

It is not for lack of demonstration of scary gain-of-function capabilities that the world has not banned gain-of-functi... (read more)


There are/could be crucial differences between GoF and some AGI examples.

E.g., a convincing demonstration of the ability to overthrow the government. States are also agents, with their own convergent instrumental goals. GoF research seems much more threatening to individual humans, but not nearly as threatening to states or governments.

Various thoughts that this inspires:

Gain of Function Ban as Practice-Run/Learning for relevant AI Bans

I have heard vague-musings-of-plans in the direction of "get the world to successfully ban Gain of Function research, as a practice-case for getting the world to successfully ban dangerous AI."

I have vague memories of the actual top bio people around not being too focused on this, because they thought there were easier ways to make progress on biosecurity. (I may be conflating a few different statements – they might have just been critiquing a particular ... (read more)

Selection Theorems: A Program For Understanding Agents

That is definitely a selection theorem, and sounds like a really cool one! Well done.

Everything I Need To Know About Takeoff Speeds I Learned From Air Conditioner Ratings On Amazon

Update: I too have now spent like 1.5 hours reading about AC design and statistics, and I can now give a reasonable guess at exactly where the I-claim-obviously-ridiculous 20-30% number came from. Summary: the SACC/CEER standards use a weighted mix of two test conditions, with 80% of the weight on conditions in which outdoor air is only 3°F/1.6°C hotter than indoor air.
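As a rough illustration of how that weighting scheme shrinks the headline penalty, here is a toy calculation. The 80/20 weights come from the summary above; the per-condition relative capacities are made-up numbers purely for illustration (a one-hose unit suffering badly on the hot-day test but barely at all on the mild-day test, where infiltrating outdoor air is only ~3°F warmer than indoor air).

```python
def weighted_capacity(cap_hot: float, cap_mild: float,
                      w_hot: float = 0.2, w_mild: float = 0.8) -> float:
    """SACC-style rating: weighted mix of a hot-day test and a mild-day test."""
    return w_hot * cap_hot + w_mild * cap_mild

# Hypothetical relative capacities for a one-hose unit:
# 40% loss on the hot-day condition, only 5% loss on the mild-day condition.
print(weighted_capacity(0.60, 0.95))  # -> ~0.88, i.e. only a ~12% headline penalty
```

So even a unit that loses a large fraction of its cooling on genuinely hot days can end up with a modest-looking adjustment, because most of the weight sits on the mild-day condition.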

The whole backstory of the DOE's SACC/CEER rating rules is here. Single-hose air conditioners take center stage. The comments on the DOE's rule proposals can basically be summarized as:

• Singl
5Paul Christiano1mo
I still think the 25-30% estimate in my original post was basically correct.

I think the typical SACC adjustment for single-hose air conditioners ends up being 15%, not 25-30%. I agree this adjustment is based on generous assumptions (5.4 degrees of cooling whereas 10 seems like a more reasonable estimate). If you correct for that, you seem to get to more like 25-30%. The Goodhart effect is much smaller than this 25-30%, I still think 10% is plausible.

I admit that in total I’ve spent significantly more than 1.5 hours researching air conditioners :) So I’m planning to check out now. If you want to post something else, you are welcome to have the last word.

SACC for 1-hose AC seems to be 15% lower than similar 2-hose models, not 25-30%:

* This site [https://www.pickhvac.com/portable-air-conditioner/best-dual-hose/#tab-con-6] argues for 2-hose ACs being better than 1-hose ACs and cites SACC being 15% lower.
* The top 2-hose AC on amazon [https://www.amazon.com/dp/B0028AYQDC/ref=sspa_dk_detail_0?pd_rd_i=B0028AYQDC&pd_rd_w=nvqxG&pf_rd_p=57cbdc41-b731-4e3d-aca7-49078b13a07b&pd_rd_wg=BBRgP&pf_rd_r=T91PEX8JNVQP0BCVG1J3&pd_rd_r=b21dcf5c-5b06-4e8b-8378-7b9fa9640d49&s=appliances&spLa=ZW5jcnlwdGVkUXVhbGlmaWVyPUExUVY1WFgyNFpEOEc1JmVuY3J5cHRlZElkPUEwOTUwMzk4Mlg3TEtaQVBVTkFGSyZlbmNyeXB0ZWRBZElkPUEwMjYyNDc4M09NQjVaOVVNM1c2VCZ3aWRnZXROYW1lPXNwX2RldGFpbF90aGVtYXRpYyZhY3Rpb249Y2xpY2tSZWRpcmVjdCZkb05vdExvZ0NsaWNrPXRydWU&th=1] has 14,000 BTU that gets adjusted down to 9500 BTU = 68%. This similarly-sized 1-hose AC [https://www.amazon.com/dp/B084Q2TZ5N/ref=sspa_dk_detail_4?pd_rd_i=B084Q2TZ5N&pd_rd_w=AoHgw&pf_rd_p=1f2c02dc-66c8-4296-801a-31d671985c95&pd_rd_wg=0z3zN&pf_rd_r=46C4SRMFTJ8ZAN1B3RJ9&pd_rd_r=dc114b2b-04af-4e67-a33e-f815897d9133&s=appliances&spLa=ZW5jcnlwdGVkUXVhbGlmaWVyPUFYV0pBTTc1UFg0SUUmZW5jcnlwdGVkSWQ9QTAyOTkyODJPOFJWSDNTMkk1VVcmZW5jcnlwdGVkQWRJZD1BMDczOTIxMjMyNERQNk01SFRBSUgmd2lkZ2V0TmFtZT1zcF9kZXRhaWwyJmFjdGlvbj1jbGlja1JlZGlyZWN0JmRvTm90TG9nQ2xpY2s9dH
Everything I Need To Know About Takeoff Speeds I Learned From Air Conditioner Ratings On Amazon

... and will both be reduced by a factor of about (exhaust - outdoor) / (exhaust - indoor) which will be much more than 50%.

I assume you mean much less than 50%, i.e. (T_outside - T_inside) averaged over the room will be less than 50% greater with two hoses than with one?

I'm open to such a bet in principle, pending operational details. $1k at even odds?

Operationally, I'm picturing the general plan I sketched four comments upthread. (In particular note the three bulleted conditions starting with "The day being hot enough and the room large enough that the A... (read more)

Everything I Need To Know About Takeoff Speeds I Learned From Air Conditioner Ratings On Amazon

Or you could get to it before I do and I could perform a replication.

Everything I Need To Know About Takeoff Speeds I Learned From Air Conditioner Ratings On Amazon

I bought my single-hose AC for the 2019 heat wave in Mountain View (which was presumably basically similar to Berkeley).

When I was in Vegas, summer was just three months of permanent extreme heat during the day; one does not stay somewhere without built-in AC in Vegas.

Everything I Need To Know About Takeoff Speeds I Learned From Air Conditioner Ratings On Amazon

I'm curious what your BOTEC was / if you think 130 is too high an estimate for the exhaust temp?

I don't remember what calculation I did then, but here's one with the same result. Model the single-hose air conditioner as removing air from the room, and replacing it with a mix of air at two temperatures: T_cold (the temperature of cold air coming from the air conditioner), and T_out (the temperature outdoors). If we assume that T_cold is constant and that the cold and hot air are introduced in roughly 1:1 proportions (i.e. the flow rate from ... (read more)
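The mixing model above has a simple steady state: if incoming air is a weighted mix of cold AC output and infiltrating outdoor air, the room equilibrates at the mix temperature. A minimal sketch (the 55°F and 95°F figures are hypothetical example inputs, not numbers from the comment):

```python
def equilibrium_room_temp(t_cold: float, t_out: float,
                          cold_fraction: float = 0.5) -> float:
    """Steady-state room temperature when room air is continuously replaced
    by a mix of AC output (t_cold) and infiltrating outdoor air (t_out)."""
    return cold_fraction * t_cold + (1 - cold_fraction) * t_out

# 1:1 mix of 55°F AC output and 95°F outdoor air:
print(equilibrium_room_temp(55, 95))  # -> 75.0
```

Under the 1:1 assumption the room lands exactly halfway between the AC's output temperature and the outdoor temperature, which is why infiltration dominates the result on a hot day.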

3Paul Christiano1mo
Ok, I think that ~50% estimate is probably wrong. Happy to bet about outcome (though I think someone with working knowledge of air conditioners will also be able to confirm). I'd bet that efficiency and Delta t will be linearly related and will both be reduced by a factor of about (exhaust - outdoor) / (exhaust - indoor) which will be much more than 50%.
Everything I Need To Know About Takeoff Speeds I Learned From Air Conditioner Ratings On Amazon

... I actually already started a post titled "Preregistration: Air Conditioner Test (for AI Alignment!)". My plan was to use the one-hose AC I bought a few years ago during that heat wave, rig up a cardboard "second hose" for it, and try it out in my apartment both with and without the second hose next time we have a decently-hot day. Maybe we can have an air conditioner test party.

Predictions: the claim which I most do not believe right now is that going from one hose to two hose with the same air conditioner makes only a 20%-30% difference. The main metr... (read more)

3Paul Christiano1mo
I would have thought that the efficiency lost is roughly (outside temp - inside temp) / (exhaust temp - inside temp). And my guess was that exhaust temp is ~130. I think the main way the effect could be as big as you are saying is if that model is wrong or if the exhaust is a lot cooler than I think. Those both seem plausible; I don't understand how AC works, so don't trust that calculation too much. I'm curious what your BOTEC was / if you think 130 is too high an estimate for the exhaust temp?

If that calculation is right, and exhaust is at 130, outside is 100, and house is 70, you'd have 50% loss. But you can't get 50% in your setup this way, since your 2-hose AC definitely isn't going to get the temp below 65 or so. Maybe most plausible 50% scenario would be something like 115 exhaust, 100 outside, 85 inside with single-hose, 70 inside with double-hose. I doubt you'll see effects that big.

I also expect the improvised double hose will have big efficiency losses. I think that 20% is probably the right ballpark (e.g. 130/95/85/82). If it's >50% I think my story above is called into question. (Though note that the efficiency lost from one hose is significantly larger than the bottom line "how much does people's intuitive sense of single-hose AC quality overstate the real efficacy?")

Your AC could also be unusual. My guess is that it just wasn't close to being able to cool your old apartment and that single vs double-hoses was a relatively small part of that, in which case we'd still see small efficiency wins in this experiment. But it's conceivable that it is unreasonably bad in part because it has an unreasonably low exhaust temp, in which case we might see an unreasonably large benefit from a second hose (though I'd discard that concern if it either had similarly good Amazon reviews or a reasonable quoted SACC).
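Paul's back-of-the-envelope formula is easy to sanity-check numerically. A sketch using his own scenarios (temperatures in °F, taken from the comment above):

```python
def efficiency_lost(t_exhaust: float, t_outside: float, t_inside: float) -> float:
    """Fraction of cooling lost to infiltration, per the model:
    (outside temp - inside temp) / (exhaust temp - inside temp)."""
    return (t_outside - t_inside) / (t_exhaust - t_inside)

print(efficiency_lost(130, 100, 70))          # -> 0.5  (the 50%-loss scenario)
print(round(efficiency_lost(130, 95, 85), 2)) # -> 0.22 (the ~20% ballpark)
```

The two calls reproduce the 50% figure (130 exhaust / 100 outside / 70 inside) and the ~20% ballpark (130 exhaust / 95 outside / 85 inside) quoted in the comment.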
3Raymond Arnold1mo
Also, like, Berkeley heat waves may just be significantly different than, like, Reno heat waves. My current read is that part of the issue here is that a lot of places don't actually get that hot, so having less robustly good air conditioners is fine.
1Ben Pace1mo
Sweet! I could also perform a replication I guess.
Everything I Need To Know About Takeoff Speeds I Learned From Air Conditioner Ratings On Amazon

Alright, I am more convinced than I was about the temperature issue, but the test setup still sounds pretty bad.

First, Boston does not usually get all that sweltering. I grew up in Connecticut (close to Boston and similar weather), summer days usually peaked in the low 80's. Even if they waited for a really hot week, it was probably in the 90's. A quick google search confirms this: typical July daily high temp is 82, and google says "Overall during July, you should expect about 4-6 days to reach or exceed 90 F (32C) while the all-time record high for Bosto... (read more)

5Paul Christiano1mo
Boston summers are hotter than the average summers in the US, and I'd guess are well above the average use case for an AC in the US. I agree having two hoses is more important the larger the temperature difference, and by the time you are cooling from 100 to 70 the difference is fairly large (though there is basically nowhere in the US where that difference is close to typical).

I'd be fine with a summary of "For users who care about temp in the whole house rather than just the room with the AC, one-hose units are maybe 20% less efficient than they feel. Because this factor is harder to measure than price or the convenience of setting up a one-hose unit, consumers don't give it the attention it deserves. As a result, manufacturers don't make as many cheap two-hose units as they should."
Everything I Need To Know About Takeoff Speeds I Learned From Air Conditioner Ratings On Amazon

The best thing we took away from our tests was the chance at a direct comparison between a single-hose design and a dual-hose design that were otherwise identical, and our experience confirmed our suspicions that dual-hose portable ACs are slightly more effective than single-hose models but not effective enough to make a real difference

I roll to disbelieve. I think it is much more likely that something is wrong with their test setup than that the difference between one-hose and two-hose is negligible.

Just on priors, the most obvious problem is that they're... (read more)

(Also, I expect it to seem like I am refusing to update in the face of any evidence, so I'd like to highlight that this model correctly predicted that the tests were run someplace where it was not hot outside. Had that evidence come out different, I'd be much more convinced right now that one hose vs two doesn't really matter.)

From how we tested:

Over the course of a sweltering summer week in Boston, we set up our five finalists in a roughly 250-square-foot space, taking notes and rating each model on the basic setup process, performance, portability, acces

Everything I Need To Know About Takeoff Speeds I Learned From Air Conditioner Ratings On Amazon

On the physics: to be clear, I'm not saying the air conditioner does not work at all. It does make the room cooler than it started, at equilibrium.

I also am not surprised (in this particular example) to hear that various expert sources already account for the inefficiency in their evaluations; it is a problem which should be very obvious to experts. Of course that doesn't apply so well to e.g. the example of medical research replication failures. The air conditioner example is not meant to be an example of something which is really hard to notice for human... (read more)

In this particular case, I indeed do not think the conflict is worth the cost of exploring - it seems glaringly obvious that people are buying a bad product because they are unable to recognize the ways in which it is bad.

The wirecutter recommendation for budget portable ACs is a single-hose model. Until very recently their overall recommendation was also a single-hose model.

The wirecutter recommendations (and other pages discussing this tradeoffs) are based on a combination of "how cold does it make the room empirically?" and quantitative estimates of coo... (read more)

Everything I Need To Know About Takeoff Speeds I Learned From Air Conditioner Ratings On Amazon

There is an important difference here between "obvious in advance" and "obvious in hindsight", but your basic point is fair, and the virus example is a good one. Humanity's current state is indeed so spectacularly incompetent that even the obvious problems might not be solved, depending on how things go.

6Steve Byrnes1mo
I would say “Humanity's current state is so spectacularly incompetent that even the obvious problems with obvious solutions might not be solved”. If humanity were not spectacularly incompetent, then maybe we wouldn't have to worry about the obvious problems with obvious solutions. But we would still need to worry about the obvious problems with extremely difficult and non-obvious solutions.
Takeoff speeds have a huge effect on what it means to work on AI x-risk

Like, I think that most ML researchers agree that the Facebook Newsfeed algorithm is optimizing for clicks in a way people are somewhat unhappy about, and this is based substantially on their personal experience with it; inasmuch as we’re interacting a lot with sort-of-smart ML systems, I think we’ll notice their slight misalignment.

This prediction feels like... it doesn't play out the whole game tree? Like, yeah, Facebook releases one algorithm optimizing for clicks in a way people are somewhat unhappy about. But the customers are unhappy about it, which ... (read more)

Takeoff speeds have a huge effect on what it means to work on AI x-risk

I agree with the basic difference you point to between fast- and slow-takeoff worlds, but disagree that it has important strategic implications for the obviousness of takeover risk.

In slow takeoff worlds, many aspects of the alignment problem show up well before AGI goes critical. However, people will by-default train systems to conceal those problems. (This is already happening: RL from human feedback is exactly the sort of strategy which trains systems to conceal problems, and we've seen multiple major orgs embracing it within the past few months.) As a ... (read more)

I expect that people will find it pretty obvious that RLHF leads to somewhat misaligned systems, if they are widely used by the public. Like, I think that most ML researchers agree that the Facebook Newsfeed algorithm is optimizing for clicks in a way people are somewhat unhappy about, and this is based substantially on their personal experience with it; inasmuch as we’re interacting a lot with sort-of-smart ML systems, I think we’ll notice their slight misalignment. And so I do think that this will make AI takeover risk more obvious.

Examples of small AI c... (read more)

A broad basin of attraction around human values?

Pithy one-sentence summary: to the extent that I value corrigibility, a system sufficiently aligned with my values should be corrigible.

[Link] A minimal viable product for alignment

A heuristic argument that says "evaluation isn't easier than generation, and that claim is true regardless of how good you are at evaluation until you get basically perfect at it" seems obviously wrong to me.

Yup, that sounds like a crux. Bookmarked for later.

[Link] A minimal viable product for alignment

A human writing their own alignment proposal has introspective access to the process-which-generates-the-proposal, and can get a ton of bits from that. They can trust the process, rather than just the output.

A human who is good at making their own thinking process legible to others, coupled with an audience who knows to look for that, could get similar benefits in a more distributed manner. Faking a whole thought process is more difficult, for a human, than simply faking an output. That does not apply nearly as well to an AI; it is far more likely that the ... (read more)

6Paul Christiano1mo
I think how well we can evaluate claims and arguments about AI alignment absolutely determines whether delegating alignment to machines is easier than doing alignment ourselves. A heuristic argument that says "evaluation isn't easier than generation, and that claim is true regardless of how good you are at evaluation until you get basically perfect at it" seems obviously wrong to me. If that's a good summary of the disagreement I'm happy to just leave it there.
[Link] A minimal viable product for alignment

That falls squarely under the "other reasons to think our models are not yet deceptive" - i.e. we have priors that we'll see models which are bad at deception before models become good at deception. The important evidential work there is being done by the prior.

[Link] A minimal viable product for alignment

Consider the space of 10-page google docs. Within this space, we pick out all the google docs which some human evaluator would consider a good alignment proposal. (You can imagine the human is assisted in some way if you want, it makes little difference to this particular argument.) Then the question is, what fraction of these will actually be good alignment proposals? So, we have two relevant numbers:

• Number of proposals which look good to the human
• Number of proposals which look good to the human AND are actually good

Now, the key heuristic: in a high-dimen... (read more)
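The counting heuristic can be sketched with a toy simulation (all names, distributions, and thresholds here are illustrative assumptions, not taken from the original comment). Each "proposal" gets an independent true-quality term and a heavy-tailed "looks persuasive" term; the evaluator only sees their sum:

```python
import random

random.seed(0)

# Toy model: evaluator score = true quality + independent persuasiveness.
# When the persuasiveness term is heavy-tailed, the top-scoring proposals
# are selected almost entirely for persuasiveness, not quality.
N = 100_000
quality = [random.gauss(0, 1) for _ in range(N)]
persuasiveness = [random.paretovariate(1.5) for _ in range(N)]  # heavy tail
score = [q + p for q, p in zip(quality, persuasiveness)]

# Take the 100 proposals that "look best" to the evaluator.
top = sorted(range(N), key=lambda i: score[i], reverse=True)[:100]
mean_quality_of_top = sum(quality[i] for i in top) / len(top)
# Despite extreme evaluator scores, the selected proposals' true quality
# is close to the population average: selection pressure went almost
# entirely into the proxy's error term.
```

Under these (assumed) distributions, the top of the score distribution is dominated by persuasiveness outliers, so conditioning on "looks good" buys almost no actual quality.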

5Paul Christiano1mo
I think that argument applies just as easily to a human as to a model, doesn't it? So it seems like you are making an equally strong claim that "if a human tries to write down something that looks like good alignment work almost all of it will be persuasive but bad." And I think that's kind of true and kind of not true. In general I think you can get much better estimates by thinking about delegating to sociopathic humans (or to humans with slightly different comparative advantages) than trying to make a counting argument. (I think the fact that "how smart the human is" doesn't matter mostly just proves that the counting argument is untethered from the key considerations.)
2Jan Leike1mo
I strongly agree with you that it'll eventually be very difficult for humans to tell apart AI-generated alignment proposals that look good and aren't good from ones that look good and are actually good.

There is a much stronger version of the claim "alignment proposals are easier to evaluate than to generate" that I think we're discussing in this thread, where you claim that humans will be able to tell all good alignment proposals apart from bad ones, or at least not accept any bad ones (precision matters much more than recall here, since you can compensate for bad recall with compute). If this strong claim is true, then conceptually RLHF/reward modeling should be sufficient as an alignment technique for the minimal viable product. Personally I think this strong version of the claim is unlikely to be true, but I'm not certain that it will be false for the first systems that can do useful alignment research.

As William points out below, if we get AI-assisted human evaluation to work well, then we can uncover flaws in alignment proposals that are too hard for unassisted humans to find. This is a weaker version of the claim, because you're just claiming that humans + AI assistance are better at evaluating alignment proposals than humans + AI assistance are at generating them. Generally I'm pretty optimistic about that level of supervision actually allowing us to supervise superhuman alignment research; I've written more about this here: https://aligned.substack.com/p/ai-assisted-human-feedback

I understand that deceptive models won't show signs of deception :) That's why I made the remark of models not showing signs of prerequisites to scary kinds of deception. Unless you think there are going to be no signs of deception or any prerequisites, for any models before we get deceptive ones?

It also seems at least plausible that models will be imperfectly deceptive before they are perfectly deceptive, in which case we will see signs (e.g., in smaller models)

6Raymond Arnold1mo
Not sure I buy this – I have a model of how hard it is to be deceptive, and how competent our current ML systems are, and it looks like it's more like "as competent as a deceptive four-year old" (my parents totally caught me when I told my first lie), than "as competent as a silver-tongued sociopath playing a long game." I do expect there to be signs of deceptive alignment, in a noticeable fashion before we get so-deceptive-we-don't-notice deception.
[Link] A minimal viable product for alignment

I think it's very unclear how big a problem Goodhart is for alignment research---it seems like a question about a particular technical domain.

Just a couple weeks ago I had this post talking about how, in some technical areas, we've been able to find very robust formulations of particular concepts (i.e. "True Names"). The domains where evaluation is much easier - math, physics, CS - are the domains where we have those robust formulations. Even within e.g. physics, evaluation stops being easy when we're in a domain where we don't have a robust mathematical f... (read more)

5Paul Christiano1mo
I don't buy the empirical claim about when recognition is easier than generation. As an example, I think that you can recognize robust formulations much more easily than you can generate them in math, computer science, and physics. In general I think "recognition is not trivial" is different from "recognition is as hard as generation."
[Link] A minimal viable product for alignment

This seems to completely ignore the main problem with approaches which try to outsource alignment research to AGI: optimizing for alignment strategies which look promising to a human reviewer will also automatically incentivize strategies which fool the human reviewer. Evaluation is not actually easier than generation, when Goodhart is the main problem to begin with.

6Evan Hubinger1mo
I think this concern is only relevant if your strategy is to do RL on human evaluations of alignment research. If instead you just imitate the distribution of current alignment research, I don't think you get this problem, at least anymore than we have it now--and I think you can still substantially accelerate alignment research with just imitation. Of course, you still have inner alignment issues, but from an outer alignment perspective I think imitation of human alignment research is a pretty good thing to try.

Evaluation is not actually easier than generation, when Goodhart is the main problem to begin with.

I think it's very unclear how big a problem Goodhart is for alignment research---it seems like a question about a particular technical domain. There are domains where evaluation is much easier; most obviously mathematics, but also in e.g. physics or computer science, there are massive gaps between recognition and generation even if you don't have formal theorem statements. There are also domains where it's not much easier, where the whole thing rests on compl... (read more)

Evaluation assistance as mentioned in the post on AI-assisted human feedback could help people avoid being fooled (e.g. in debate where the opponent can point out how you're being fooled). It's still an open question how well that will work in practice and how quickly it will Goodhart (these techniques should fail on some things, as discussed in the ELK report), but it seems possible that models will be helpful enough on alignment before they Goodhart.

If it turns out that evaluation of alignment proposals is not easier than generation, we're in pretty big trouble because we'll struggle to convince others that any good alignment proposals humans come up with are worth implementing.

You could still argue by generalization that we should use alignment proposals produced by humans who had a lot of good proposals on other problems even if we're not sure about those alignment proposals. But then you're still susceptible to the same kinds of problems.

Why Agent Foundations? An Overly Abstract Explanation

Precision feels pretty far from the true name of the important feature of true names

You're right, I wasn't being sufficiently careful about the wording of a bolded sentence. I should have said "robust" where it said "precise". Updated in the post; thank you.

Also I basically agree that robustness to optimization is not the True Name of True Names, though it might be a sufficient condition.

Why Agent Foundations? An Overly Abstract Explanation

The Hardness of computing mutual information in general is not a very significant barrier to designing systems with (near-)zero mutual information between two components, in exactly the same way that the Hardness of computing whether a given program halts in general is not a very significant barrier to designing software which avoids infinite loops.
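The asymmetry can be illustrated with a toy sketch (the components and numbers here are made up for illustration). Designing for zero mutual information is easy: give each component its own independent randomness, and MI is exactly zero by construction. Measuring MI from finite samples is the hard direction; even a plug-in estimator on truly independent streams returns a small positive number:

```python
import math
import random

# Design side: two components made independent by construction
# (separate randomness sources), so MI(xs; ys) = 0 exactly, no analysis needed.
rng_a, rng_b = random.Random(0), random.Random(1)
xs = [rng_a.random() for _ in range(20000)]
ys = [rng_b.random() for _ in range(20000)]

# Analysis side: a plug-in histogram estimate of mutual information.
# It is biased upward on finite samples, illustrating why *verifying*
# near-zero MI empirically is harder than *designing* for it.
def mi_estimate(xs, ys, bins=10):
    n = len(xs)
    joint, px, py = {}, [0] * bins, [0] * bins
    for x, y in zip(xs, ys):
        i, j = min(int(x * bins), bins - 1), min(int(y * bins), bins - 1)
        joint[(i, j)] = joint.get((i, j), 0) + 1
        px[i] += 1
        py[j] += 1
    mi = 0.0
    for (i, j), c in joint.items():
        pij = c / n
        mi += pij * math.log(pij / ((px[i] / n) * (py[j] / n)))
    return mi  # in nats; >= 0, and only approximately 0 here

est = mi_estimate(xs, ys)
```

The point of the analogy: just as we write loop-free software without solving the halting problem, we can build independence in by construction without ever computing MI in the general case.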

2TLW2mo
Let us make a distinction here between two cases:

1. Observing the input and output of a blackbox X, and checking a property thereof.
2. Whitebox knowledge of X, and checking a property thereof.

In physical systems, we do not have whitebox knowledge. We merely have a finite sample of a blackbox[1]. Sometimes said finite sample of a blackbox appears to match a fairly straightforward machine Y, but that's about the best we can say[2]. And yes, checking if two specific Turing-complete blackboxes are equivalent is undecidable[3], even though checking if two specific Turing-complete whiteboxes may be decidable. It is not exactly the same way, due to the above.

1. Namely, 'the laws of physics'.
2. (And worse, it often doesn't exactly match in the observations thus far, or results in contradictions.)
3. Trivially, due to indistinguishability issues. For any finite sequence of inputs and outputs, there are multiple machines X and X' which produce that sequence of outputs given the input, but which have later output that diverges. This is not a problem in the whitebox case because said machines are distinguishable.
Why Agent Foundations? An Overly Abstract Explanation

There is no way to confirm zero mutual information[1], and even if there were, there is zero probability that the mutual information is zero[2]. Very small, perhaps. Zero, no.

Thanks for bringing this up; it raises a technical point which didn't make sense to include in the post but which I was hoping someone would raise in the comments.

The key point: Goodhart problems are about generalization, not approximation.

Suppose I have a proxy ũ for a true utility function u, and ũ is always within ε of u (i.e. ... (read more)
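The "approximation is benign" half of this point can be spelled out in a short derivation (writing ũ for the proxy, u for the true utility, and ε for the uniform error bound — notation assumed here, since the original symbols were lost in extraction): if the proxy is uniformly within ε of u, then hard optimization of the proxy loses at most 2ε of true utility.

```latex
% Suppose |\tilde{u}(x) - u(x)| \le \epsilon for all x.
% Let x^* \in \arg\max_x \tilde{u}(x) and x^\dagger \in \arg\max_x u(x). Then:
u(x^*) \;\ge\; \tilde{u}(x^*) - \epsilon
       \;\ge\; \tilde{u}(x^\dagger) - \epsilon
       \;\ge\; u(x^\dagger) - 2\epsilon
```

So a uniformly ε-accurate proxy cannot be Goodharted by more than 2ε; the trouble starts when the error bound only holds on-distribution and the optimizer pushes off-distribution — which is why Goodhart is a generalization problem, not an approximation problem.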

This is an interesting observation; I don't see how it addresses my point.

There is no exact solution to mutual information from two finite samples. There is no ε-approximation of mutual information from two finite samples, either.

=====

On the topic of said observation: beware that ε-approximations of many things are provably difficult to compute, and in some cases are even uncomputable. (The classic being Chaitin's Constant[1].)

In particular, you very often end up with Halting-problem style contradictions when computing properties of systems capable... (read more)

Why Agent Foundations? An Overly Abstract Explanation

Revenue of a nail factory is a good proxy for the quality of the nails produced, but only within a fairly small bubble around our current world. You can't make the factory-owner too smart, or the economy too irrational, or allow for too many technological breakthroughs to happen, or else the proxy breaks.

I think you missed the point of that particular metaphor. The claim was not that revenue of a nail factory is a robust operationalization of nail value. The claim was that a competitive nail market plus nail-maker reputation tracking is a True Name for a p... (read more)

Why Agent Foundations? An Overly Abstract Explanation

Imagine it's 1665 and we're trying to figure out the True Name of physical force - i.e. how hard it feels like something is pushing or pulling.

One of the first steps is to go through our everyday experience, paying attention to what causes stronger/weaker sensations of pushing and pulling, or what effects stronger/weaker sensations have downstream. We might notice, for instance, that heavier objects take more force to push, or that a stronger push accelerates things faster. So, we might expect to find some robust relationship between the True Names of forc... (read more)

Why Agent Foundations? An Overly Abstract Explanation

It's not just human values that are messy and contingent, even the pointer we want to use to gesture to those-things-we-want-to-treat-as-our-values is messy and contingent.

What's the evidence for this claim?

When I look at e.g. nails, the economic value of a nail seems reasonably complicated. Yet the "pointers to nail value" which we use in practice - i.e. competitive markets and reputation systems - do have clean, robust mathematical formulations.

Furthermore, before the mid-20th century, I expect that most people would have expected that competitive market... (read more)

4Steve Byrnes2mo
I think you're saying: if a thing is messy, at least there can be a non-messy procedure / algorithm that converges to (a.k.a. points to) the thing. I think I'm with Charlie in feeling skeptical about this in regards to value learning, because I think value learning is significantly a normative question. Let me elaborate:

My genes plus 1.2e9 seconds of experience have built a fundamentally messy set of preferences, which are in some cases self-inconsistent, easily-manipulated, invalid-out-of-distribution, etc. It's easy enough to point to the set of preferences as a whole—you just say “Steve's preferences right now”. In fact, one might eventually (I expect) be able to write down the learning algorithm, reward function, etc., that led to those preferences (but we won't be able to write down the many petabytes of messy training data), and we'll be able to talk about what the preferences look like in the brain. But still, you shouldn't and can't directly optimize according to those preferences, because they're self-inconsistent, invalid-out-of-distribution, they might involve ghosts [https://www.lesswrong.com/posts/gQY6LrTWJNkTv8YJR/the-pointers-problem-human-values-are-a-function-of-humans], etc.

So then we have a normative question: if “fulfill Steve’s preferences” isn’t a straightforward thing, then what exactly should the AGI do? Maybe we should ask Steve what value learning ought to look like? But maybe I say “I don’t know”, or maybe I give an answer that I wouldn’t endorse upon reflection, or in hindsight. So maybe we should have the AGI do whatever Steve will endorse in hindsight? No, that leads to brainwashing.

Anyway, it's possible that we'll come up with an operationalization of value learning that really nails down what we think the AGI ought to do. (Let's say, for example, something like CEV but more specific.) If we do, to what extent should we expect this operationalization to be simple and elegant, versus messy? (For example, in my book, Stua

It's not clear to me that your metaphors are pointing at something in particular.

Revenue of a nail factory is a good proxy for the quality of the nails produced, but only within a fairly small bubble around our current world. You can't make the factory-owner too smart, or the economy too irrational, or allow for too many technological breakthroughs to happen, or else the proxy breaks. If this was all we needed, then yes, absolutely, I'm sure there's a similarly neat and simple way to instrumentalize human values - it's just going to fail if things are too ... (read more)

Late 2021 MIRI Conversations: AMA / Discussion

To be clear, I do not mean to use the label "mainline prediction" for this whole technique. Mainline prediction tracking is one way of implementing this general technique, and I claim that the usefulness of the general technique is the main reason why mainline predictions are useful to track.

(Also, it matches up quite well with Nate's model based on his comment here, and I expect it also matches how Eliezer wants to use the technique.)

4Rohin Shah3mo
Ah, got it. I agree that: 1. The technique you described is in fact very useful 2. If your probability distribution over futures happens to be such that it has a "mainline prediction", you get significant benefits from that (similar to the benefits you get from the technique you described).
Late 2021 MIRI Conversations: AMA / Discussion

As I understand it, when you "talk about the mainline", you're supposed to have some low-entropy (i.e. confident) view on how the future goes, such that you can answer very different questions X, Y and Z about that particular future, that are all correlated with each other, and all get (say) > 50% probability. (Idk, as I write this down, it seems so obviously a bad way to reason that I feel like I must not be understanding it correctly.)

But to the extent this is right, I'm actually quite confused why anyone thinks "talk about the mainline" is an ideal t

6Rohin Shah3mo
Man, I would not call the technique you described "mainline prediction". It also seems kinda inconsistent with Vaniver's usage; his writing suggests that a person only has one mainline at a time which seems odd for this technique. Vaniver, is this what you meant? If so, my new answer is that I and others do in fact talk about "mainline predictions" -- for me, there was that whole section talking about natural language debate as an alignment strategy. (It ended up not being about a plausible world, but that's because (a) Eliezer wanted enough concreteness that I ended up talking about the stupidly inefficient version rather than the one I'd actually expect in the real world and (b) I was focused on demonstrating an existence proof for the technical properties, rather than also trying to include the social ones.)
Shah and Yudkowsky on alignment failures

This came up with Aysajan about two months ago. An exercise which I recommended for him: first, pick a technical academic paper. Read through the abstract and first few paragraphs. At the end of each sentence (or after each comma, if the authors use very long sentences), pause and write/sketch a prototypical example of what you currently think they're talking about. The goal here is to get into the habit of keeping a "mental picture" (i.e. prototypical example) of what the authors are talking about as you read.

Other good sources on which to try this exerci... (read more)

The Big Picture Of Alignment (Talk Part 1)

I find that plausible, a priori. Mostly doesn't affect the stuff in the talk, since that would still come from the environment, and the same principles would apply to culturally-derived values as to environment-derived values more generally. Assuming the hardwired part is figured out, we should still be able to get an estimate of human values within the typical-human-value-distribution-for-a-given-culture from data which is within the typical-human-environment-distribution-for-that-culture.

The Big Picture Of Alignment (Talk Part 1)

No plans in motion. Thank you very much if you decide to do so! Also, you might want to message Rob to get the images.

3Raymond Arnold3mo
1Raymond Arnold3mo
I've put in a request for a transcript.
Abstractions as Redundant Information

You can think of everything I'm doing as occurring in a "God's eye" model. I expect that an agent embedded in this God's-eye model will only be able to usefully measure natural abstractions within the model. So, shifting to the agent's perspective, we could say "holding these abstractions fixed, what possible models are compatible with them?". And that is indeed a direction I plan to go. But first, I want to get the nicest math I possibly can for computing the abstractions within a model, because the cleaner that is, the cleaner I expect that computing poss... (read more)

Ngo and Yudkowsky on scientific reasoning and pivotal acts

I think a lot of 10%ers could learn to do wedding-cake multiplication, if sufficiently well-paid as adults rather than being tortured in school, out to 6 digits

What is wedding-cake multiplication? A quick google just turned up a lot of people who want to sell me wedding cakes...

The Big Picture Of Alignment (Talk Part 1)

how is it that we ever accomplish anything in practice, if the search space is vast, and things that both work and look like they work are exponentially rare?

This question needs a whole essay (or several) on its own. If I don't get around to leaving a longer answer in the next few days, ping me.

Meanwhile, if you want to think it through for yourself, the general question is: where the hell do humans get all their bits-of-search from?

How is the "the genome is small, therefore generators of human values (that can't be learned from the environment) are no more c