# All of paulfchristiano's Comments + Replies

Where I agree and disagree with Eliezer

On 22, I agree that my claim is incorrect. I think such systems probably won't obsolete human contributions to alignment while being subhuman in many ways. (I do think their expected contribution to alignment may be large relative to human contributions; but that's compatible with significant room for humans to add value / to have made contributions that AIs productively build on, since we have different strengths.)

2Rohin Shah5d
Great, I agree with all of that.
Let's See You Write That Corrigibility Tag

I don't think we can write down any topology over behaviors or policies for which they are disconnected (otherwise we'd probably be done). My point is that there seems to be a difference-in-kind between the corrigible behaviors and the incorrigible behaviors, a fundamental structural difference between why they get rated highly; and that's not just some fuzzy and arbitrary line, it seems closer to a fact about the dynamics of the world.

If you are in the business of "trying to train corrigibility" or "trying to design corrigible systems," I think understand... (read more)

Where I agree and disagree with Eliezer

I'm not sure if you are saying that you skimmed the report right now and couldn't find the list, or that you think that it was a mistake for the report not to contain a "centralized bullet point list of difficulties."

If you are currently looking for the list of difficulties: see the long footnote

If you think the ELK report should have contained such a list: I definitely don't think we wrote this report optimally, but we tried our best and I'm not convinced this would be an improvement. The report is about one central problem that we attempt to state... (read more)

Where I agree and disagree with Eliezer

The question is when you get a misaligned mesaoptimizer relative to when you get superhuman behavior.

I think it's pretty clear that you can get an optimizer which is upstream of the imitation (i.e. whose optimization gives rise to the imitation), or you can get an optimizer which is downstream of the imitation (i.e. which optimizes in virtue of its imitation). Of course most outcomes are messier than those two extremes, but the qualitative distinction still seems really central to these arguments.

Epistemic status: some of these ideas only crystallized today; normally I would take at least a few days to process before posting to make sure there are no glaring holes in the reasoning, but I saw this thread and decided to reply since it's topical.

Suppose that your imitator works by something akin to Bayesian inference with some sort of bounded simplicity prior (I think it's true of transformers). In order for Bayesian inference to converge to exact imitation, you usually need realizability. Obviously today we don't have realizability because the ANNs c... (read more)

Let's See You Write That Corrigibility Tag

If you have a space with two disconnected components, then I'm calling the distinction between them "crisp." For example, it doesn't depend on exactly how you draw the line.

It feels to me like this kind of non-convexity is fundamentally what crispness is about (the cluster structure of thingspace is a central example). So if you want to draw a crisp line, you should be looking for this kind of disconnectedness/non-convexity.

ETA: a very concrete consequence of this kind of crispness, that I should have spelled out in the OP, is that there are many functions... (read more)

5Matthew "Vaniver" Graves6d
The components feel disconnected to me in 1D, but I'm not sure they would feel disconnected in 3D or in ND. Is your intuition that they're 'durably disconnected' (even looking at the messy plan-space of the real-world, we'll be able to make a simple classifier that rates corrigibility), or if not, when the connection comes in (like once you can argue about philosophy in way X, once you have uncertainty about your operator's preferences, once you have the ability to shut off or distract bits of your brain without other bits noticing, etc.)? [This also feels like a good question for people who think corrigibility is anti-natural; do you not share Paul's sense that they're disconnected in 1D, or when do you think the difficulty comes in?]
2Ben Pace7d
Thanks!
Let's See You Write That Corrigibility Tag

A list of "corrigibility principles" sounds like it's approaching the question on the wrong level of abstraction for either building or thinking about such a system. We usually want to think about features that lead a system to be corrigible---either about how the system was produced, or how it operates. I'm not clear on what you would do with a long list of aspects of corrigibility like "shuts down when asked."

I found this useful as an occasion to think a bit about corrigibility. But my guess about the overall outcome is that it will come down to a questi... (read more)

1Oliver Habryka4d
I think this is a great comment that feels to me like it communicated a better intuition for why corrigibility might be natural than anything else I've read so far.

Quick attempt at rough ontology translation between how I understand your comment, and the original post. (Any of you can correct me if I'm wrong)

I think what would typically count as "principles" in Eliezer's meaning are
1. designable things which make the "true corrigibility" basin significantly harder to escape, e.g. by making it deeper
2. designable things which make the "incorrigible" basin harder to reach, e.g. by increasing the distance between them, or increasing the potential barrier
3. somehow, making the "incorrigible" basin less lethal

• I think that corrigibility is more likely to be a crisp property amongst systems that perform well-as-evaluated-by-you. I think corrigibility is only likely to be useful in cases like this where it is crisp and natural.

Can someone explain to me what this crispness is?

As I'm reading Paul's comment, there's an amount of optimization for human reward that breaks our rating ability. This is a general problem for AI because of the fundamental reason that as we increase an AI's optimization power, it gets better at the task, but it also gets better at breaking m... (read more)

Where I agree and disagree with Eliezer

It sounds like we are broadly on the same page about 1 and 2 (presumably partly because my list doesn't focus on my spiciest takes, which might have generated more disagreement).

Here are some extremely rambling thoughts on point 3.

I agree that the interaction between AI and existing conflict is a very important consideration for understanding or shaping policy responses to AI, and that you should be thinking a lot about how to navigate (and potentially leverage) those dynamics if you want to improve how well we handle any aspect of AI. I was trying to most... (read more)

Not very coherent response to #3. Roughly

• Caring about visible power is a very human motivation, and I'd expect it will draw many people to care about "who are the AI principals", "what are the AIs actually doing", and a few other topics, which have significant technical components
• Somewhat wild datapoints in this space: nuclear weapons, space race. in each case, salient motivations such as "war" led some of the best technical people to work on hard technical problems. in my view, the problems the technical people ended up working on were often "vs. nature" and d
Where I agree and disagree with Eliezer

My sense is that we are on broadly the same page here. I agree that "AI improving AI over time" will look very different from "humans improving humans over time" or even "biology improving humans over time." But I think that it will look a lot like "humans improving AI over time," and that's what I'd use to estimate timescales (months or years, most likely years) for further AI improvements.

Where I agree and disagree with Eliezer

I'm guessing the disagreement is that Yudkowsky thinks the holes are giant visible and gaping, whereas you think they are indeed holes but you have some ideas for how to fix them

I think we don't know whether various obvious-to-us-now things will work with effort. I think we don't really have a plan that would work with an acceptably high probability and stand up to scrutiny / mildly pessimistic assumptions.

I would guess that if alignment is hard, then whatever we do ultimately won't follow any existing plan very closely (whether we succeed or not). I do th... (read more)

Where I agree and disagree with Eliezer

Do you think that some of my disagreements should change if I had shorter timelines?

(As mentioned last time we talked, but readers might not have seen: I'm guessing ~15% on singularity by 2030 and ~40% on singularity by 2040.)

I think most of your disagreements on this list would not change.
However, I think if you conditioned on 50% chance of singularity by 2030 instead of 15%, you'd update towards faster takeoff, less government/societal competence (and thus things more likely to fail at an earlier, less dignified point), more unipolar/local takeoff, lower effectiveness of coordination/policy/politics-style strategies, less interpretability and other useful alignment progress, less chance of really useful warning shots... and of course, significantly higher p(doom).

To put it an... (read more)

Where I agree and disagree with Eliezer

I think most worlds, surviving or not, don't have a plan in the sense that Eliezer is asking about.

I do agree that in the best worlds, there are quite a lot of very good plans and extensive analysis of how they would play out (even if it's not the biggest input into decision-making). Indeed, I think there are a lot of things that the best possible world would be doing that we aren't, and I'd give that world a very low probability of doom even if alignment was literally impossible-in-principle.

ETA: this is closely related to Richard's point in the sibling.

Deep Learning Systems Are Not Less Interpretable Than Logic/Probability/Etc

Yeah, I agree that if you learn a probabilistic model then you mostly have a difference in degree rather than difference in kind with respect to interpretability. It's not super clear that the difference in degree is large or important (it seems like it could be, just not clear). And if you aren't willing to learn a probabilistic model, then you are handicapping your system in a way that will probably eventually be a big deal.

Deep Learning Systems Are Not Less Interpretable Than Logic/Probability/Etc

Two potentially relevant distinctions:

• Is your model produced by optimization for accuracy, or by optimizing local pieces for local accuracy (e.g. for your lane prediction algorithm accurately predicting lane boundaries), or by human engineering (e.g. pieces of the model reflecting facts that humans know about the world)? Most practical systems will do some of each.
• Is your policy divided into pieces (like a generative model, inference algorithm, planner) which are optimized for local objectives, or is it all optimized end-to-end for performance? If most of
3johnswentworth22d
Great points. I definitely agree with your argument quantitatively: these distinctions mean that a probabilistic model will be quantitatively more interpretable for the same system, or be able to handle more complex systems for a given interpretability metric (like e.g. "running into catastrophic misalignment"). That said, it does seem like the vast majority of interpretability for both probabilistic and ML systems is in "how does this internal stuff correspond to stuff in the world". So qualitatively, it seems like the central interpretability problem is basically the same for both.
RL with KL penalties is better seen as Bayesian inference

I broadly agree with this perspective, and think the modeling vs inference distinction is a valuable one to make.

That said, it seems to me that in practice you often should be trying to converge to a zero entropy policy. The optimal action is not random; we want the model to be picking the output that looks best in light of its beliefs. (Old Eliezer post.)
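
A minimal sketch of why a KL penalty leaves entropy in the policy, and how that entropy vanishes as the penalty weight shrinks. It assumes the standard closed form for KL-regularized reward maximization, pi(a) ∝ pi_0(a) * exp(r(a) / beta); the three-action prior, the rewards, and the function names are invented for illustration and do not come from the post.

```python
import math


def kl_regularized_policy(prior, rewards, beta):
    """Return pi(a) proportional to prior(a) * exp(reward(a) / beta)."""
    logits = [math.log(p) + r / beta for p, r in zip(prior, rewards)]
    # Subtract the largest logit before exponentiating, for numerical stability.
    top = max(logits)
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]


def entropy(dist):
    """Shannon entropy in nats, skipping zero-probability entries."""
    return -sum(p * math.log(p) for p in dist if p > 0)


# Hypothetical toy problem: three actions, a non-uniform prior, one best action.
prior = [0.5, 0.3, 0.2]
rewards = [1.0, 2.0, 0.0]
betas = [10.0, 1.0, 0.1, 0.01]

policies = [kl_regularized_policy(prior, rewards, b) for b in betas]
entropies = [entropy(p) for p in policies]

for b, h in zip(betas, entropies):
    print(f"beta={b:<5} entropy={h:.6f}")
```

In this toy case, with a unique highest-reward action, the policy concentrates on that action as beta goes to 0, which is the zero-entropy limit described above; a larger beta keeps the policy closer to the prior.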

For some applications you are sampling from your model multiple times and ensembling. In this case randomness can help if you have no memory. But in the same setting, optimizing the correct reward functio... (read more)

High-stakes alignment via adversarial training [Redwood Research report]

Ah, that makes sense. But the 26 minutes --> 13 minutes is from adversarial training holding the threshold fixed, right?

1DMZ2mo
Indeed. (Well, holding the quality degradation fixed, which causes a small change in the threshold.)
High-stakes alignment via adversarial training [Redwood Research report]

I think that 3 orders of magnitude is the comparison between "time taken to find a failure by randomly sampling" and "time taken to find a failure if you are deliberately looking using tools."

1DMZ2mo
I read Oli's comment as referring to the 2.4% -> 0.002% failure rate improvement from filtering.
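
The size of that improvement can be checked with a quick calculation. This is only arithmetic on the two rates quoted above; the variable names are illustrative.

```python
import math

# Failure rates quoted in the comment above, as fractions.
rate_before_filtering = 0.024     # 2.4%
rate_after_filtering = 0.00002    # 0.002%

improvement_ratio = rate_before_filtering / rate_after_filtering
orders_of_magnitude = math.log10(improvement_ratio)

print(f"ratio = {improvement_ratio:.0f}x, log10 = {orders_of_magnitude:.2f}")
```

The ratio is 1200, or about 3.1 orders of magnitude, so the filtering improvement and the "3 orders of magnitude" figure are at least numerically compatible.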
Everything I Need To Know About Takeoff Speeds I Learned From Air Conditioner Ratings On Amazon

I still think the 25-30% estimate in my original post was basically correct. I think the typical SACC adjustment for single-hose air conditioners ends up being 15%, not 25-30%. I agree this adjustment is based on generous assumptions (5.4 degrees of cooling whereas 10 seems like a more reasonable estimate). If you correct for that, you seem to get to more like 25-30%. The Goodhart effect is much smaller than this 25-30%; I still think 10% is plausible.

I admit that in total I’ve spent significantly more than 1.5 hours researching air conditioners :) So I’m ... (read more)

2johnswentworth2mo
If you wouldn't mind one last question before checking out: where did that formula you're using come from?
Everything I Need To Know About Takeoff Speeds I Learned From Air Conditioner Ratings On Amazon

Ok, I think that ~50% estimate is probably wrong. Happy to bet about outcome (though I think someone with working knowledge of air conditioners will also be able to confirm). I'd bet that efficiency and Delta t will be linearly related and will both be reduced by a factor of about (exhaust - outdoor) / (exhaust - indoor) which will be much more than 50%.
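
A worked instance of that ratio, as a rough sketch. The function name is hypothetical, and the temperatures (130 F exhaust, 95 F outdoors, 80 F indoors) are illustrative assumptions rather than measurements.

```python
def single_hose_factor(exhaust, outdoor, indoor):
    """Factor by which efficiency and delta-T are scaled, per the ratio above:
    (exhaust - outdoor) / (exhaust - indoor). All temperatures in the same unit."""
    return (exhaust - outdoor) / (exhaust - indoor)


# Assumed temperatures in degrees Fahrenheit (illustrative only).
factor = single_hose_factor(exhaust=130.0, outdoor=95.0, indoor=80.0)
loss = 1.0 - factor

print(f"retained fraction = {factor:.2f}, loss = {loss:.2f}")
```

Under these assumed temperatures the factor is 0.7, i.e. roughly a 30% reduction. A cooler exhaust or a larger indoor/outdoor gap makes the factor smaller.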

4johnswentworth2mo
I assume you mean much less than 50%, i.e. (T_outside - T_inside) averaged over the room will be less than 50% greater with two hoses than with one? I'm open to such a bet in principle, pending operational details. 1k at even odds? Operationally, I'm picturing the general plan I sketched four comments upthread. (In particular note the three bulleted conditions starting with "The day being hot enough and the room large enough that the AC runs continuously..."; I'd consider it a null result if one of those conditions fails.) LMK if other conditions should be included. Also, you're welcome to come to the Air Conditioner Testing Party (on some hot day TBD). There's a pool at the apartment complex, could swim a bit while the room equilibrates.
Everything I Need To Know About Takeoff Speeds I Learned From Air Conditioner Ratings On Amazon

I think labeling requirements are based on the expectation of cooling from 95 to 80 (and I expect typical use cases for portable AC are more like that). Actually hot places will usually have central air or window units.

Everything I Need To Know About Takeoff Speeds I Learned From Air Conditioner Ratings On Amazon

I would have thought that the efficiency lost is roughly (outside temp - inside temp) / (exhaust temp - inside temp). And my guess was that exhaust temp is ~130. I think the main way the effect could be as big as you are saying is if that model is wrong or if the exhaust is a lot cooler than I think. Those both seem plausible; I don't understand how AC works, so don't trust that calculation too much. I'm curious what your BOTEC was / if you think 130 is too high an estimate for the exhaust temp? If that calculation is right, and exhaust is at 130, outs... (read more)

3johnswentworth2mo
I don't remember what calculation I did then, but here's one with the same result.
Model the single-hose air conditioner as removing air from the room, and replacing with a mix of air at two temperatures: $T_C$ (the temperature of cold air coming from the air conditioner), and $T_H$ (the temperature outdoors). If we assume that $T_C$ is constant and that the cold and hot air are introduced in roughly 1:1 proportions (i.e. the flow rate from the exhaust is roughly equal to the flow rate from the cooling outlet), then we should end up with an equilibrium average temperature of $\frac{T_C + T_H}{2}$. If we model the switch to two-hose as just turning off the stream of hot air, then the equilibrium average temperature should drop to $T_C$.

Some notes on this:
* It's talking about equilibrium temperature rather than power efficiency, because equilibrium temperature on a hot day was mostly what I cared about when using the air conditioner.
* The assumption of roughly-equal flow rates seems to be at least the right order of magnitude based on seeing this air conditioner in operation, though I haven't measured carefully. If anything, it seemed like the exhaust had higher throughput.
* The assumption of constant $T_C$ is probably the most suspect part.
Everything I Need To Know About Takeoff Speeds I Learned From Air Conditioner Ratings On Amazon

• The top wirecutter recommendation is roughly 3x as expensive as the Amazon AC being reviewed. The top budget pick is a single-hose model.
• People usually want to cool the room they are spending their time in. Those ACs are marketed to cool a 300 sq ft room, not a whole home. That's what reviewers are clearly doing with the unit.
• I'd guess that in extreme cases (where you care about the room with AC no more than other rooms in the house + rest of house is cool) consumers are overestimating efficiency by ~30%. On average in reality I'd guess they are over ... (read more)

Everything I Need To Know About Takeoff Speeds I Learned From Air Conditioner Ratings On Amazon

They measure the temperature in the room, which captures the effect of negative pressure pulling in hot air from the rest of the building. It underestimates the costs if the rest of the building is significantly cooler than the outside (I'd guess by the ballpark of 20-30% in the extreme case where you care equally about all spaces in the building, the rest of your building is kept at the same temp as the room you are cooling, and a negligible fraction of air exchange with the outside is via the room you are cooling). Which... seems to misunderstand the actu ... (read more)

3Oliver Habryka2mo
Yeah, sorry, I didn't mean to imply the section is saying something totally wrong. The section just makes it sound like that is the only concern with infiltration, which seems wrong, and my current model of the author of the post is that they weren't actually thinking through heat-related infiltration issues (though it's hard to say from just this one paragraph, of course).
Everything I Need To Know About Takeoff Speeds I Learned From Air Conditioner Ratings On Amazon

Here is the wirecutter discussion of the distinction for reference:

Starting in 2019, we began comparing dual- and single-hose models according to the same criteria, and we didn’t dismiss any models based on their hose count. Our research, however, ultimately steered us toward single-hose portable models—in part because so many newer models use this design. In fact, we found no compelling new double-hose models from major manufacturers in 2019 or 2020 (although a few new ones cropped up in 2021, including our new top pick). Owner reviews indicate that most ... (read more)

Everything I Need To Know About Takeoff Speeds I Learned From Air Conditioner Ratings On Amazon

Boston summers are hotter than the average summers in the US, and I'd guess are well above the average use case for an AC in the US.
I agree having two hoses is more important the larger the temperature difference, and by the time you are cooling from 100 to 70 the difference is fairly large (though there is basically nowhere in the US where that difference is close to typical). I'd be fine with a summary of "For users who care about temp in the whole house rather than just the room with the AC, one-hose units are maybe 20% less efficient than they feel. B... (read more)

Does anyone in-thread (or reading along) have any experiments they'd be interested in me running with this air conditioner? It doesn't seem at all hard for me to do some science and get empirical data, with a different setup to Wirecutter, so let me know. Added: From a skim of the thread, it seems to me the experiment that would resolve matters is testing in a large room with temperature sensors more like 15 feet away in a city or country that's very hot outside, and to compare this with (say) Wirecutter's top pick with two-hoses. Confirm?
Everything I Need To Know About Takeoff Speeds I Learned From Air Conditioner Ratings On Amazon

(Also, I expect it to seem like I am refusing to update in the face of any evidence, so I'd like to highlight that this model correctly predicted that the tests were run someplace where it was not hot outside. Had that evidence come out different, I'd be much more convinced right now that one hose vs two doesn't really matter.)

From how we tested:

Over the course of a sweltering summer week in Boston, we set up our five finalists in a roughly 250-square-foot space, taking notes and rating each model on the basic setup process, performance, portability, acces ... (read more)

2johnswentworth2mo
Alright, I am more convinced than I was about the temperature issue, but the test setup still sounds pretty bad. First, Boston does not usually get all that sweltering. I grew up in Connecticut (close to Boston and similar weather), summer days usually peaked in the low 80's.
Even if they waited for a really hot week, it was probably in the 90's. A quick google search confirms this: typical July daily high temp is 82, and google says "Overall during July, you should expect about 4-6 days to reach or exceed 90 F (32C) while the all-time record high for Boston was 103 F (39.4C)". It's still a way better test than April (so I'm updating from that), but probably well short of keeping a room at 70 on a 100 degree day. I'm guessing they only had about half that temperature delta.

Second, their actual test procedure (thank you for finding that, BTW): Three feet and six feet away? That sure does sound like they're measuring the temperature right near the unit, rather than the other side of the room where we'd expect infiltration to matter. I had previously assumed they were at least measuring the other side of the room (because they mention for the two-hose recommendation "In our tests, it was also remarkably effective at distributing the cool air, never leaving more than a 1-degree temperature difference across the room"), but apparently "across the room" actually meant "6 feet away" based on this later quote: ... which sure does sound more like what we'd expect.

So I'm updating away from "it was just not hot outside" - probably a minor issue, but not a major one. That said, it sure does sound like they were not measuring temperature across the room, and even just between 3 and 6 feet away the two-hose model apparently had noticeably less drop-off in effectiveness.
Everything I Need To Know About Takeoff Speeds I Learned From Air Conditioner Ratings On Amazon

In this particular case, I indeed do not think the conflict is worth the cost of exploring - it seems glaringly obvious that people are buying a bad product because they are unable to recognize the ways in which it is bad.

The wirecutter recommendation for budget portable ACs is a single-hose model. Until very recently their overall recommendation was also a single-hose model.
The wirecutter recommendations (and other pages discussing these tradeoffs) are based on a combination of "how cold does it make the room empirically?" and quantitative estimates of coo... (read more)

The best thing we took away from our tests was the chance at a direct comparison between a single-hose design and a dual-hose design that were otherwise identical, and our experience confirmed our suspicions that dual-hose portable ACs are slightly more effective than single-hose models but not effective enough to make a real difference

After having looked into this quite a bit, it does really seem like the Wirecutter testing process had no ability to notice infiltration issues, so it seems like the Wirecutter crew themselves is kind of confused here? ... (read more)

2johnswentworth2mo
I roll to disbelieve. I think it is much more likely that something is wrong with their test setup than that the difference between one-hose and two-hose is negligible. Just on priors, the most obvious problem is that they're testing somewhere which isn't hot outside the room - either because they're inside a larger air-conditioned building, or because it's not hot outdoors. Can we check that? Well, they apparently tested it in April 2022, i.e. nowish, which is indeed not hot most places in the US, but can we narrow down the location more? The photo is by Michael Hession, who apparently operates near Boston [https://www.linkedin.com/in/michael-hession-photovideo/]. Daily high temps currently in the 50's to 60's (Fahrenheit). So yeah, definitely not hot there. Now, if they're measuring temperature delta compared to the outdoors, it could still be a valid test. On the other hand, if it's only in the 50's to 60's outside, I very much doubt that they're trying to really get a big temperature delta from that air conditioner - they'd have to get the room down below freezing in order to get the same temperature delta as a 70 degree room on a 100 degree day.
If they're only trying to get a tiny temperature delta, then it really doesn't matter how efficient the unit is. For someone trying to keep a room at 70 on a 100 degree day, it's going to matter a lot more. So basically, I am not buying this test setup. It does not look like it is actually representative of real usage, and it looks nonrepresentative in basically the ways we'd expect from a test that found little difference between one and two hoses. Generalizable lesson/heuristic: the supposed "experts" are also not even remotely trustworthy. (Also, I expect it to seem like I am refusing to update in the face of any evidence, so I'd like to highlight that this model correctly predicted that the tests were run someplace where it was not hot outside. Had that evidence come out different, I'd be much more convin
A Small Negative Result on Debate

I think that one of the key difficulties for debate research is having good tasks that call for more sophisticated protocols. I think this dataset seems great for that purpose, and having established a negative result for 1-turn debate seems like a good foundation for follow-up work exploring more sophisticated protocols. (It seems like a shame that people don't normally publish early-stage and negative results.) In comparison with other datasets (e.g. in the negative results described by Beth), it seems like QuALITY is identifying pretty crisp failures and... (read more)

Everything I Need To Know About Takeoff Speeds I Learned From Air Conditioner Ratings On Amazon

Regulation does not fix the problem, just moves it from the consumer to the regulator. A regulator will only regulate a problem which is obvious to the regulator. A regulator may sometimes have more expertise than a layperson, but even that requires that the politicians ultimately appointing people can distinguish real from fake expertise, which is hard in general.
It seems like the DOE decided to adopt energy-efficiency standards that take into account infiltration. They could easily have made a different decision (e.g. because of pressure from portab... (read more)

Everything I Need To Know About Takeoff Speeds I Learned From Air Conditioner Ratings On Amazon

I agree that people can easily fail to fix alignment problems, and can instead paper over them, even given a long time to iterate. But I'm not really convinced about your analogy with single-hose air conditioners.

Physics: The air coming out of the exhaust is often quite a bit hotter than the outside air. I've never checked myself, but just googling has many people reporting 130+ degree temperatures coming out of exhaust from single-hose units. I'm not sure how hot this unit's exhaust is in particular, but I'd guess it's significantly hotter than outside air... (read more)

My overall take on this post and comment (after spending like 1.5 hours reading about AC design and statistics): Overall I feel like both the OP and this reply say some wrong things. The top Wirecutter recommendation is a dual-hose design. The testing procedure of Wirecutter does not seem to address infiltration in any way, and indeed the whole article does not discuss infiltration as it relates to cooling-efficiency. Overall efficiency loss from going from dual to single is something like 20-30%, which I do think is much lower than I think the OP ... (read more)

3Oliver Habryka2mo
It is important to note that the current top wirecutter pick is a 2-hose unit, though one that combined the two hoses into one big hose. I guess maybe that is recent, but it does seem important to acknowledge here (and it wouldn't surprise me that much if Wirecutter went through reasoning pretty similar to the one in this post, and then updated towards the two-hose unit because of concerns about infiltration and looking at more comprehensive metrics like SACC).
3johnswentworth2mo
On the physics: to be clear, I'm not saying the air conditioner does not work at all. It does make the room cooler than it started, at equilibrium.

I also am not surprised (in this particular example) to hear that various expert sources already account for the inefficiency in their evaluations; it is a problem which should be very obvious to experts. Of course that doesn't apply so well to e.g. the example of medical research replication failures. The air conditioner example is not meant to be an example of something which is really hard to notice for humanity as a whole; it's meant to be an example of something which is too hard for a typical consumer to notice, and we should extrapolate from there to the existence of things which people with more expertise will also not notice (e.g. the medical research example). Also, it's a case-in-point that experts noticing a problem with some product is not enough to remove the economic incentive to produce the product.

When the argument specifically includes reasons to expect people to not notice the problem, it seems obviously correct to discount reported experiences. Of course there are still ways to gain evidence from reported experience - e.g. if someone specifically said "this unit cooled even the far corners of the house", then that would partially falsify our theory for why people will overlook the one-hose problem. But we should not blindly trust reports when we have reasons to expect those reports to overlook problems.

In this particular case, I indeed do not think the conflict is worth the cost of exploring - it seems glaringly obvious that people are buying a bad product because they are unable to recognize the ways in which it is bad. Positive reports do not contradict this; there is not a conflict here. The model already predicts that there will be positive reports - after all, the air conditioner is very convenient and pumps lots of cool air out the front in very obvious ways.
Regulation does not fix the problem, just moves it from the consumer to the regulator. A regulator will only regulate a problem which is obvious to the regulator. A regulator may sometimes have more expertise than a layperson, but even that requires that the politicians ultimately appointing people can distinguish real from fake expertise, which is hard in general.

It seems like the DOE decided to adopt energy-efficiency standards that take into account infiltration. They could easily have made a different decision (e.g. because of pressure from portab... (read more)

Early 2022 Paper Round-up

I really liked the summarizing differences between distributions paper. I think I'm excited for broadly the same reasons you are, but to state the case in my own words:

• "ML systems learn things that humans don't know from lots of data" seems like one of the central challenges in alignment.
• Summarizing differences between distributions seems like a good model for that problem, it seems to cover the basic dynamics of the problem and appears crisply even for very weak systems.
• The solution you are pursuing seems like it could be scaled quite far.
• I don't think tha ... (read more)

[Link] A minimal viable product for alignment

I think how well we can evaluate claims and arguments about AI alignment absolutely determines whether delegating alignment to machines is easier than doing alignment ourselves. A heuristic argument that says "evaluation isn't easier than generation, and that claim is true regardless of how good you are at evaluation until you get basically perfect at it" seems obviously wrong to me. If that's a good summary of the disagreement I'm happy to just leave it there.

5johnswentworth3mo
Yup, that sounds like a crux. Bookmarked for later.
[Link] A minimal viable product for alignment

It feels to me like there's basically no question that recognizing good cryptosystems is easier than generating them.
And recognizing attacks on cryptosystems is easier than coming up with attacks (even if they work by exploiting holes in the formalisms). And recognizing good abstract arguments for why formalisms are inadequate is easier than generating them. And recognizing good formalisms is easier than generating them. This is all true notwithstanding the fact that we often make mistakes. (Though as we've discussed before, I think that a lot of the examp... (read more)

2Wei Dai3mo
The example of cryptography was mainly intended to make the point that humans are by default too credulous when it comes to informal arguments. But consider your statement: Consider some cryptosystem widely considered to be secure, like AES. How much time did humanity spend on learning / figuring out how to recognize good cryptosystems (e.g. finding all the attacks one has to worry about, like differential cryptanalysis), versus specifically generating AES with the background knowledge in mind? Maybe the latter is on the order of 10% of the former? Then consider that we don't actually know that AES is secure, because we don't know all the possible attacks and we don't know how to prove it secure, i.e., we don't know how to recognize a good cryptosystem. Suppose one day we figure that out, wouldn't finding an actually good cryptosystem be trivial at that point compared to all the previous effort? Some of your other points are valid, I think, but cryptography is just easier than alignment (don't have time to say more as my flight is about to take off), and philosophy is perhaps a better analogy for the more general point.
[Link] A minimal viable product for alignment

I don't buy the empirical claim about when recognition is easier than generation. As an example, I think that you can recognize robust formulations much more easily than you can generate them in math, computer science, and physics. In general I think "recognition is not trivial" is different from "recognition is as hard as generation."
[Link] A minimal viable product for alignment

I think that argument applies just as easily to a human as to a model, doesn't it?

So it seems like you are making an equally strong claim that "if a human tries to write down something that looks like good alignment work almost all of it will be persuasive but bad." And I think that's kind of true and kind of not true. In general I think you can get much better estimates by thinking about delegating to sociopathic humans (or to humans with slightly different comparative advantages) than trying to make a counting argument. (I think the fact that "how smart the human is" doesn't matter mostly just proves that the counting argument is untethered from the key considerations.)

3johnswentworth3mo
A human writing their own alignment proposal has introspective access to the process-which-generates-the-proposal, and can get a ton of bits from that. They can trust the process, rather than just the output. A human who is good at making their own thinking process legible to others, coupled with an audience who knows to look for that, could get similar benefits in a more distributed manner. Faking a whole thought process is more difficult, for a human, than simply faking an output. That does not apply nearly as well to an AI; it is far more likely that the AI's thought-process would be very different from ours, such that it would be easier to fake a human-legible path than to truly follow one from the start. I think "how smart the human is" is not a key consideration.
[Link] A minimal viable product for alignment

Is your story:

1. AI systems are likely to be much better at persuasion than humans, relative to how good they are at alignment.
2. Actually if a human was trying to write down a convincing alignment proposal, it would be much easier to trick us than to write down a good proposal.

It sounds like you are thinking of 2. But I think we have reasonably good intuitions about that.
I think for short evaluations "fool us" is obviously easier. For long evaluations (including similarly-informed critics pointing out holes etc.) I think that it rapidly becomes easier to just do good work (though it clearly depends on the kind of work).

3johnswentworth3mo
Consider the space of 10-page google docs. Within this space, we pick out all the google docs which some human evaluator would consider a good alignment proposal. (You can imagine the human is assisted in some way if you want, it makes little difference to this particular argument.) Then the question is, what fraction of these will actually be good alignment proposals? So, we have two relevant numbers:
* Number of proposals which look good to the human
* Number of proposals which look good to the human AND are actually good
Now, the key heuristic: in a high-dimensional space, adding any non-simple constraint will exponentially shrink the search space. "Number of proposals which look good to the human AND are actually good" has one more complicated constraint than "Number of proposals which look good to the human", and will therefore be exponentially smaller. So in "it would be much easier to trick us than to write down a good proposal", the relevant operationalization of "easier" for this argument is "the number of proposals which both look good and are good is exponentially smaller than the number which look good".
[Link] A minimal viable product for alignment

I personally have pretty broad error bars; I think it's plausible enough that AI won't help with automating alignment that it's still valuable for us to work on alignment, and plausible enough that AI will help with automating alignment that it significantly increases our chances of survival and is worth preparing for making use of. I also tend to think that current progress in language modeling seems to suggest that models will reach the point of being extremely helpful with alignment way before they become super scary.
Eliezer has consistently expressed c... (read more)

[Link] A minimal viable product for alignment

Building weak AI systems that help improve alignment seems extremely important to me and is a significant part of my optimism about AI alignment. I also think it's a major reason that my work may turn out not to be relevant in the long term.

I still think there are tons of ways that delegating alignment can fail, such that it matters that we do alignment research in advance:

• AI systems could have comparative disadvantage at alignment relative to causing trouble, so that AI systems are catastrophically risky before they solve alignment. Or more realistically, ... (read more)

What do you (or others) think is the most promising, soon-possible way to use language models to help with alignment? A couple of possible ideas:
1. Using LMs to help with alignment theory (e.g., alignment forum posts, ELK proposals, etc.)
2. Using LMs to run experiments (e.g., writing code, launching experiments, analyzing experiments, and repeat)
3. Using LMs as research assistants (what Ought is doing with Elicit)
4. Something else?
[Link] A minimal viable product for alignment

Evaluation is not actually easier than generation, when Goodhart is the main problem to begin with.

I think it's very unclear how big a problem Goodhart is for alignment research---it seems like a question about a particular technical domain. There are domains where evaluation is much easier; most obviously mathematics, but also in e.g. physics or computer science, there are massive gaps between recognition and generation even if you don't have formal theorem statements. There are also domains where it's not much easier, where the whole thing rests on compl...
(read more)

5johnswentworth3mo
Just a couple weeks ago I had this post [https://www.lesswrong.com/posts/FWvzwCDRgcjb9sigb/why-agent-foundations-an-overly-abstract-explanation] talking about how, in some technical areas, we've been able to find very robust formulations of particular concepts (i.e. "True Names"). The domains where evaluation is much easier - math, physics, CS - are the domains where we have those robust formulations. Even within e.g. physics, evaluation stops being easy when we're in a domain where we don't have a robust mathematical formulation of the phenomena of interest. The other point of that post is that we do not currently have such formulations for the phenomena of interest in alignment, and (one way of framing) the point of foundational agency research is to find them. So I agree that the difficulty of evaluation varies by domain, but I don't think it's some mysterious hard-to-predict thing. The places where robust evaluation is easy all build on qualitatively-similar foundational pieces, and alignment does not yet have those sorts of building blocks. Go take a look at that other post [https://www.lesswrong.com/posts/FWvzwCDRgcjb9sigb/why-agent-foundations-an-overly-abstract-explanation] , it has two good examples of how Goodhart shows up as a central barrier to alignment.
ELK Computational Complexity: Three Levels of Difficulty

Yeah, sorry, poor wording on my part. What I meant in that part was "argue that the direct translator cannot be arbitrarily complex", although I immediately mention the case you're addressing here in the parenthetical right after what you quote.

Ah, I just totally misunderstood the sentence, the intended reading makes sense.

Well, it might be that a proposed solution follows relatively easily from a proposed definition of knowledge, in some cases. That's the sort of solution I'm going after at the moment.

I agree that's possible, and it does seem like a...
(read more)

ELK Computational Complexity: Three Levels of Difficulty

We discuss the definition of "knowledge" a bit in this appendix; compared to your definitions, we want to only say that the model "knows" the value of X when it is actually behaving differently based on the value of X in order to obtain a lower loss. I think this is strictly weaker than your level 2 (since the model needs to actually be using that knowledge) and incomparable to your level 1 (since the model's behavior might depend on an estimate of X without the model having any introspective knowledge about that dependence), though I might be misunderstan... (read more)

3Abram Demski3mo
Well, it might be that a proposed solution follows relatively easily from a proposed definition of knowledge, in some cases. That's the sort of solution I'm going after at the moment. This still leaves the question of borderline cases, since the definition of knowledge may be imperfect. So it's not necessarily that I'm trying to solve the borderline cases. Ah, yep, I missed that! Ahh, I see. I had 100% interpreted the computational complexity of the Reporter to be 'relative to the predictor' already. I'm not sure how else it could be interpreted, since the reporter is given the predictor's state as input, or at least given some form of query access. What's the intended mathematical content of the statement "the direct translation can be arbitrarily complex", then? Also, why don't you think the direct translator can be arbitrarily complex relative to the predictor? Yeah, sorry, poor wording on my part. What I meant in that part was "argue that the direct translator cannot be arbitrarily complex", although I immediately mention the case you're addressing here in the parenthetical right after what you quote. In any case, what you say makes sense.
The Goldbach conjecture is probably correct; so was Fermat's last theorem

I am extremely interested in examples of heuristic arguments like this where the naive version leads you astray and it takes much more sophisticated arguments to fix the problem. I'd be happy to pay a $500 bounty for any examples that I find really interesting, in number theory or elsewhere. (Usual disclaimers, my judgment of what counts as interesting will be completely arbitrary and unaccountable, etc.)

The best example I found was the Chebyshev bias that primes are very slightly more likely to be 3 mod 4 than 1 mod 4. The simplest explanation I know of t... (read more)
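The raw phenomenon is easy to check numerically by counting primes in each residue class mod 4. The following is a minimal Python sketch and not part of the original comment; the function names `primes_up_to` and `residue_counts` are made up for illustration, and the limits are arbitrary.

```python
def primes_up_to(limit):
    """Sieve of Eratosthenes: return all primes <= limit (limit >= 1)."""
    is_prime = bytearray([1]) * (limit + 1)
    is_prime[0:2] = b"\x00\x00"
    for p in range(2, int(limit ** 0.5) + 1):
        if is_prime[p]:
            # Cross out every multiple of p starting at p*p.
            is_prime[p * p :: p] = bytearray(len(range(p * p, limit + 1, p)))
    return [n for n in range(limit + 1) if is_prime[n]]


def residue_counts(limit):
    """Count odd primes <= limit in the residue classes 1 and 3 mod 4."""
    counts = {1: 0, 3: 0}
    for p in primes_up_to(limit):
        if p != 2:  # 2 is the only prime in neither class
            counts[p % 4] += 1
    return counts


if __name__ == "__main__":
    for limit in (10**3, 10**4, 10**5, 10**6):
        c = residue_counts(limit)
        print(limit, c, "lead of 3 mod 4:", c[3] - c[1])
```

Counting only shows that a lead exists at these particular limits. It says nothing about why, and the lead is known to reverse at some other values of the limit, so the printed numbers are illustration rather than support for any particular explanation.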

The Goldbach conjecture is probably correct; so was Fermat's last theorem

Readers interested in this topic should probably read Terence Tao's excellent discussions of probabilistic heuristics in number theory, e.g. this post discussing Fermat's last theorem, the ABC conjecture, and twin primes or this post on biases in prime number gaps. Those posts really helped improve my understanding of how such heuristic arguments work, and there are some cool surprises.
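To give a concrete sense of what a probabilistic heuristic looks like, here is a small Python sketch. It is an illustration under stated assumptions, not something taken from the linked posts: it assumes the crudest random model, in which each integer k is treated as "prime" independently with probability 1/ln k, and compares the resulting estimate for the number of Goldbach representations of an even n with the true count. The names `prime_flags`, `goldbach_count`, and `naive_estimate` are hypothetical.

```python
import math


def prime_flags(limit):
    """Return a bytearray where flags[k] == 1 iff k is prime (limit >= 1)."""
    flags = bytearray([1]) * (limit + 1)
    flags[0:2] = b"\x00\x00"
    for p in range(2, math.isqrt(limit) + 1):
        if flags[p]:
            flags[p * p :: p] = bytearray(len(range(p * p, limit + 1, p)))
    return flags


def goldbach_count(n, flags):
    """Number of unordered prime pairs p <= q with p + q == n."""
    return sum(1 for p in range(2, n // 2 + 1) if flags[p] and flags[n - p])


def naive_estimate(n):
    """Naive random-model estimate of goldbach_count(n), for n >= 4.

    Assumes k and n - k are 'prime' independently with probability
    1/ln(k) and 1/ln(n - k); this ignores all divisibility structure.
    """
    return sum(
        1.0 / (math.log(k) * math.log(n - k)) for k in range(2, n // 2 + 1)
    )


if __name__ == "__main__":
    flags = prime_flags(10**5)
    for n in (100, 1000, 10_000, 100_000):
        print(n, goldbach_count(n, flags), round(naive_estimate(n), 1))
```

The naive estimate should not be expected to match the true counts closely, because it ignores correlations such as parity (if n is even and k is odd, then n - k is automatically odd). Accounting for that kind of structure is where the more careful versions of the heuristic come in.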

Isn't the Stuart conjecture an extremely weak form of the Lander, Parkin and Selfridge conjecture? If you specialize their conjecture to  then it implies your co... (read more)

2Stuart Armstrong4mo
Because it's the first case I thought of where the probability numbers work out, and I just needed one example to round off the post :-)
Late 2021 MIRI Conversations: AMA / Discussion

I think my way of thinking about things is often a lot like "draw random samples," more like drawing N random samples rather than particle filtering (I guess since we aren't making observations as we go---if I notice an inconsistency the thing I do is more like backtrack and start over with N fresh samples having updated on the logical fact).

The main complexity feels like the thing you point out where it's impossible to make them fully fleshed out, so you build a bunch of intuitions about what is consistent (and could be fleshed out given enough time) and ... (read more)
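The contrast with particle filtering can be made concrete with a toy sketch. This is only an analogy in code, written under loud assumptions: "worlds" are tuples of coin flips, a "logical fact" is a predicate on worlds, and all names (`sample_world`, `draw_consistent`, `known_facts`) are hypothetical. Instead of reweighting existing particles, the procedure throws the old samples away and draws N fresh ones conditioned on everything noticed so far.

```python
import random


def sample_world(rng, n_vars):
    """Draw one toy 'world': an independent fair coin flip per variable."""
    return tuple(rng.random() < 0.5 for _ in range(n_vars))


def draw_consistent(rng, n_samples, n_vars, known_facts, max_tries=100_000):
    """Rejection-sample n_samples worlds that satisfy every known fact."""
    samples = []
    for _ in range(max_tries):
        world = sample_world(rng, n_vars)
        if all(fact(world) for fact in known_facts):
            samples.append(world)
            if len(samples) == n_samples:
                return samples
    raise RuntimeError("known facts are too restrictive for rejection sampling")


rng = random.Random(0)
known_facts = []
samples = draw_consistent(rng, 5, 4, known_facts)

# Suppose we notice an inconsistency: variables 0 and 1 cannot both hold.
known_facts.append(lambda world: not (world[0] and world[1]))

# Backtrack: discard the old samples and draw fresh ones under the new fact.
samples = draw_consistent(rng, 5, 4, known_facts)
```

A particle filter would instead keep the surviving samples and adjust their weights after each observation; here nothing is carried over except the updated list of facts.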

1Matthew "Vaniver" Graves4mo
Oh whoa, you don't remember your samples from before? [I guess I might not either, unless I'm concentrating on keeping them around or verbalized them or something; probably I do something more expert-iteration-like where I'm silently updating my generating distributions based on the samples and then resampling them in the future.] Yeah, this seems likely; this makes me more interested in the "selectively ignoring variables" hypothesis for why Eliezer running this strategy might have something that would naturally be called a mainline. [Like, it's very easy to predict "number of apples sold = number of apples bought" whereas it's much harder to predict the price of apples.] But maybe instead he means it in the 'startup plan' sense, where you do actually assign basically no probability to your mainline prediction, but still vastly more than any other prediction that's equally conjunctive.
Late 2021 MIRI Conversations: AMA / Discussion

I don't think there is an "AGI textbook" any more than there is an "industrialization textbook." There are lots of books about general principles and useful kinds of machines. That said, if I had to make wild guesses about roughly what that future understanding would look like:

1. There is a recognizable concept of "learning" meaning something like "search for policies that perform well in past or simulated situations." That plays a large role, comparably important to planning or Bayesian inference. Logical induction is likely an elaboration of Bayesian infere
Late 2021 MIRI Conversations: AMA / Discussion

I feel like I have a broad distribution over worlds and usually answer questions with probability distributions, that I have a complete mental universe (which feels to me like it outputs answers to a much broader set of questions than Eliezer's, albeit probabilistic ones, rather than bailing with "the future is hard to predict"). At a high level I don't think "mainline" is a great concept for describing probability distributions over the future except in certain exceptional cases (though I may not understand what "mainline" means), and that neat stor... (read more)

I feel like I have a broad distribution over worlds and usually answer questions with probability distributions, that I have a complete mental universe (which feels to me like it outputs answers to a much broader set of questions than Eliezer's, albeit probabilistic ones, rather than bailing with "the future is hard to predict").

Sometimes I'll be tracking a finite number of "concrete hypotheses", where every hypothesis is 'fully fleshed out', and be doing a particle-filtering style updating process, where sometimes hypotheses gain or lose weight, sometimes... (read more)

Christiano and Yudkowsky on AI predictions and human intelligence

Certainly I don't see fusion reactors, solar panels or (use in electronics of) semiconductors as counterexamples, since each of these was invented at some point, and didn't gradually evolve from some completely different technology.

Your definition of "discontinuity" seems broadly compatible with my view of the future then. Definitely there are different technologies that are not all outgrowths of one another.

My main point of divergence is:

Now, when a QNI comes along, it doesn't necessarily look like a discontinuity, because there might be a lot of work to

5Vanessa Kosoy4mo
I'm not sure what's the difference between what you're saying here and what I said about QNIs. Is it that you expect being able to see the emergent technology before the singular (crossover) point? Actually, the fact you describe DL as "currently useless" makes me think we should be talking about progress as a function of two variables: time and "maturity", where maturity inhabits, roughly speaking, a scale from "theoretical idea" to "proof of concept" to "beats SOTA in lab conditions" to "commercial product". In this sense, the "lab progress" curve is already past the DL singularity but the "commercial progress" curve maybe isn't. On this model, if post-DL AI technology X appears tomorrow, it will take it some time to span the distance from "theoretical idea" to "commercial product", in which time we would notice it and update our predictions accordingly. But, two things to note here: First, it's not clear which level of maturity is the relevant reference point for AI risk. In particular, I don't think you need commercial levels of maturity for AI technology to become risky, for the reasons I discussed in my previous comment (and, we can also add regulatory barriers to that list, although I am not convinced they are as important as Yudkowsky seems to believe). Second, all this doesn't sound to me like "AI systems will grow relatively continuously and predictably", although maybe I just interpreted this statement differently from its intent. For instance, I agree that it's unlikely technology X will emerge specifically in the next year, so progress over the next year should be fairly predictable. On the other hand, I don't think it would be very surprising if technology X emerges in the next decade. IIUC, part of what you're saying can be rephrased as: TAI is unlikely to be created by a small team, since once a small team shows something promising, tonnes of resources will be thrown at them (and at other teams that might be able to copy the technology) and they
Christiano and Yudkowsky on AI predictions and human intelligence

Yeah, I think this was wrong. I'm somewhat skeptical of the numbers and suspect future revisions systematically softening those accelerations, but 4x still won't look that crazy.

(I don't remember exactly how I chose that number but it probably involved looking at the same time series so wasn't designed to be much more abrupt.)

Importance of foresight evaluations within ELK
• I agree that ELK would not directly help with problems like manipulating humans, and that our proposal in the appendices is basically "Solve the problem with decoupling." And if you merely defer to future humans about how good things are you definitely reintroduce these problems.
• I agree that having humans evaluate future trajectories is very similar to having humans evaluate alternative trajectories. The main difference is that future humans are smarter---even if they aren't any better acquainted with those alternative futures, they've still learned