On 22, I agree that my claim is incorrect. I think such systems probably won't obsolete human contributions to alignment while being subhuman in many ways. (I do think their expected contribution to alignment may be large relative to human contributions; but that's compatible with significant room for humans to add value / to have made contributions that AIs productively build on, since we have different strengths.)
I don't think we can write down any topology over behaviors or policies for which they are disconnected (otherwise we'd probably be done). My point is that there seems to be a difference-in-kind between the corrigible behaviors and the incorrigible behaviors, a fundamental structural difference between why they get rated highly; and that's not just some fuzzy and arbitrary line, it seems closer to a fact about the dynamics of the world.
If you are in the business of "trying to train corrigibility" or "trying to design corrigible systems," I think understand... (read more)
I'm not sure if you are saying that you skimmed the report right now and couldn't find the list, or that you think that it was a mistake for the report not to contain a "centralized bullet point list of difficulties."
If you are currently looking for the list of difficulties: see the long footnote.
If you think the ELK report should have contained such a list: I definitely don't think we wrote this report optimally, but we tried our best and I'm not convinced this would be an improvement. The report is about one central problem that we attempt to state... (read more)
The question is when you get a misaligned mesaoptimizer relative to when you get superhuman behavior.
I think it's pretty clear that you can get an optimizer which is upstream of the imitation (i.e. whose optimization gives rise to the imitation), or you can get an optimizer which is downstream of the imitation (i.e. which optimizes in virtue of its imitation). Of course most outcomes are messier than those two extremes, but the qualitative distinction still seems really central to these arguments.
I don't think you've made much argument about when the trans... (read more)
Epistemic status: some of these ideas only crystallized today. Normally I would take at least a few days to process before posting, to make sure there are no glaring holes in the reasoning, but I saw this thread and decided to reply since it's topical.
Suppose that your imitator works by something akin to Bayesian inference with some sort of bounded simplicity prior (I think this is true of transformers). For Bayesian inference to converge to exact imitation, you usually need realizability. Obviously today we don't have realizability, because the ANNs c... (read more)
If you have a space with two disconnected components, then I'm calling the distinction between them "crisp." For example, it doesn't depend on exactly how you draw the line.
It feels to me like this kind of non-convexity is fundamentally what crispness is about (the cluster structure of thingspace is a central example). So if you want to draw a crisp line, you should be looking for this kind of disconnectedness/non-convexity.
ETA: a very concrete consequence of this kind of crispness, that I should have spelled out in the OP, is that there are many functions... (read more)
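A tiny sketch of the point about disconnectedness (my own illustrative example, not something from the thread): when the two components are separated by a gap, many different ways of "drawing the line" induce exactly the same classification, which is one concrete sense in which the distinction doesn't depend on where you draw it.

```python
# Two disconnected clusters on the real line: [0, 1] and [2, 3].
points = [0.1, 0.4, 0.8, 2.2, 2.6, 2.9]

def classify(threshold, xs):
    """Label each point by which side of the threshold it falls on."""
    return [x > threshold for x in xs]

# Any threshold drawn inside the gap (1, 2) yields the exact same labeling,
# so the distinction does not depend on exactly how you draw the line.
labelings = {tuple(classify(t, points)) for t in [1.1, 1.5, 1.9]}
assert len(labelings) == 1
```

By contrast, if the clusters overlapped, nearby thresholds would disagree about some points, and the line really would be fuzzy and arbitrary.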
A list of "corrigibility principles" sounds like it's approaching the question on the wrong level of abstraction for either building or thinking about such a system. We usually want to think about features that lead a system to be corrigible---either about how the system was produced, or how it operates. I'm not clear on what you would do with a long list of aspects of corrigibility like "shuts down when asked."
I found this useful as an occasion to think a bit about corrigibility. But my guess about the overall outcome is that it will come down to a questi... (read more)
Quick attempt at rough ontology translation between how I understand your comment, and the original post. (Any of you can correct me if I'm wrong)
I think what would typically count as "principles" in Eliezer's meaning are
1. designable things which make the "true corrigibility" basin significantly harder to escape, e.g. by making it deeper
2. designable things which make the "incorrigible" basin harder to reach, e.g. by increasing the distance between them, or increasing the potential barrier
3. somehow, making the "incorrigible" basin less lethal
4. preventin... (read more)
- I think that corrigibility is more likely to be a crisp property amongst systems that perform well-as-evaluated-by-you. I think corrigibility is only likely to be useful in cases like this where it is crisp and natural.
Can someone explain to me what this crispness is?
As I'm reading Paul's comment, there's an amount of optimization for human reward that breaks our rating ability. This is a general problem for AI because of the fundamental reason that as we increase an AI's optimization power, it gets better at the task, but it also gets better at breaking m... (read more)
It sounds like we are broadly on the same page about 1 and 2 (presumably partly because my list doesn't focus on my spiciest takes, which might have generated more disagreement).
Here are some extremely rambling thoughts on point 3.
I agree that the interaction between AI and existing conflict is a very important consideration for understanding or shaping policy responses to AI, and that you should be thinking a lot about how to navigate (and potentially leverage) those dynamics if you want to improve how well we handle any aspect of AI. I was trying to most... (read more)
A not-very-coherent response to #3. Roughly:
My sense is that we are on broadly the same page here. I agree that "AI improving AI over time" will look very different from "humans improving humans over time" or even "biology improving humans over time." But I think that it will look a lot like "humans improving AI over time," and that's what I'd use to estimate timescales (months or years, most likely years) for further AI improvements.
I'm guessing the disagreement is that Yudkowsky thinks the holes are giant, visible, and gaping, whereas you think they are indeed holes but you have some ideas for how to fix them.
I think we don't know whether various obvious-to-us-now things will work with effort. I think we don't really have a plan that would work with an acceptably high probability and stand up to scrutiny / mildly pessimistic assumptions.
I would guess that if alignment is hard, then whatever we do ultimately won't follow any existing plan very closely (whether we succeed or not). I do th... (read more)
Do you think that some of my disagreements should change if I had shorter timelines?
(As mentioned last time we talked, but readers might not have seen: I'm guessing ~15% on singularity by 2030 and ~40% on singularity by 2040.)
I think most of your disagreements on this list would not change.
However, I think if you conditioned on 50% chance of singularity by 2030 instead of 15%, you'd update towards faster takeoff, less government/societal competence (and thus things more likely to fail at an earlier, less dignified point), more unipolar/local takeoff, lower effectiveness of coordination/policy/politics-style strategies, less interpretability and other useful alignment progress, less chance of really useful warning shots... and of course, significantly higher p(doom).
To put it an... (read more)
I think most worlds, surviving or not, don't have a plan in the sense that Eliezer is asking about.
I do agree that in the best worlds, there are quite a lot of very good plans and extensive analysis of how they would play out (even if it's not the biggest input into decision-making). Indeed, I think there are a lot of things that the best possible world would be doing that we aren't, and I'd give that world a very low probability of doom even if alignment was literally impossible-in-principle.
ETA: this is closely related to Richard's point in the sibling.
Yeah, I agree that if you learn a probabilistic model then you mostly have a difference in degree rather than difference in kind with respect to interpretability. It's not super clear that the difference in degree is large or important (it seems like it could be, just not clear). And if you aren't willing to learn a probabilistic model, then you are handicapping your system in a way that will probably eventually be a big deal.
Two potentially relevant distinctions:
I broadly agree with this perspective, and think the modeling vs inference distinction is a valuable one to make.
That said, it seems to me that in practice you often should be trying to converge to a zero entropy policy. The optimal action is not random; we want the model to be picking the output that looks best in light of its beliefs. (Old Eliezer post.)
For some applications you are sampling from your model multiple times and ensembling. In this case randomness can help if you have no memory. But in the same setting, optimizing the correct reward functio... (read more)
Ah, that makes sense. But the 26 minutes --> 13 minutes is from adversarial training holding the threshold fixed, right?
I think that 3 orders of magnitude is the comparison between "time taken to find a failure by randomly sampling" and "time taken to find a failure if you are deliberately looking using tools."
I still think the 25-30% estimate in my original post was basically correct. I think the typical SACC adjustment for single-hose air conditioners ends up being 15%, not 25-30%. I agree this adjustment is based on generous assumptions (5.4 degrees of cooling, whereas 10 seems like a more reasonable estimate). If you correct for that, you seem to get to more like 25-30%. The Goodhart effect is much smaller than this 25-30%; I still think 10% is plausible.
I admit that in total I’ve spent significantly more than 1.5 hours researching air conditioners :) So I’m ... (read more)
Ok, I think that ~50% estimate is probably wrong. Happy to bet about the outcome (though I think someone with working knowledge of air conditioners will also be able to confirm). I'd bet that efficiency and ΔT will be linearly related, and will both be reduced by a factor of about (exhaust - outdoor) / (exhaust - indoor), which will be much more than 50%.
I think labeling requirements are based on the expectation of cooling from 95 to 80 (and I expect typical use cases for portable AC are more like that). Actually hot places will usually have central air or window units.
I would have thought that the efficiency lost is roughly (outside temp - inside temp) / (exhaust temp - inside temp). And my guess was that exhaust temp is ~130.
I think the main way the effect could be as big as you are saying is if that model is wrong or if the exhaust is a lot cooler than I think. Those both seem plausible; I don't understand how AC works, so don't trust that calculation too much. I'm curious what your BOTEC was / if you think 130 is too high an estimate for the exhaust temp?
If that calculation is right, and exhaust is at 130, outs... (read more)
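Putting the thread's numbers through that model (a back-of-the-envelope sketch using the figures mentioned above: cooling from 95 to 80, exhaust guessed at ~130; the arithmetic is mine, not a measurement of any particular unit):

```python
# Single-hose efficiency BOTEC: fraction of cooling lost to infiltration is
# roughly (outside - inside) / (exhaust - inside), per the model above.
inside, outside, exhaust = 80.0, 95.0, 130.0  # degrees F; exhaust ~130 is a guess

loss = (outside - inside) / (exhaust - inside)
retained = (exhaust - outside) / (exhaust - inside)
assert abs(loss + retained - 1.0) < 1e-9  # the two expressions are complementary

print(f"efficiency lost ~{loss:.0%}, retained ~{retained:.0%}")
# With these numbers: lost ~30%, retained ~70% (i.e. much more than 50% retained).
```

So under this model the effect lands in the 25-30% range rather than ~50%; a cooler exhaust or a hotter outdoor temperature would push the loss higher.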
They measure the temperature in the room, which captures the effect of negative pressure pulling in hot air from the rest of the building. It underestimates the costs if the rest of the building is significantly cooler than the outside (I'd guess by the ballpark of 20-30% in the extreme case where you care equally about all spaces in the building, the rest of your building is kept at the same temp as the room you are cooling, and a negligible fraction of air exchange with the outside is via the room you are cooling).
Which... seems to misunderstand the actu... (read more)
Here is the wirecutter discussion of the distinction for reference:
Starting in 2019, we began comparing dual- and single-hose models according to the same criteria, and we didn’t dismiss any models based on their hose count. Our research, however, ultimately steered us toward single-hose portable models—in part because so many newer models use this design. In fact, we found no compelling new double-hose models from major manufacturers in 2019 or 2020 (although a few new ones cropped up in 2021, including our new top pick). Owner reviews indicate that most... (read more)
Boston summers are hotter than the average summers in the US, and I'd guess are well above the average use case for an AC in the US. I agree having two hoses is more important the larger the temperature difference, and by the time you are cooling from 100 to 70 the difference is fairly large (though there is basically nowhere in the US where that difference is close to typical).
I'd be fine with a summary of "For users who care about temp in the whole house rather than just the room with the AC, one-hose units are maybe 20% less efficient than they feel. B... (read more)
Does anyone in-thread (or reading along) have any experiments they'd be interested in me running with this air conditioner? It doesn't seem at all hard for me to do some science and get empirical data, with a different setup to Wirecutter, so let me know.
Added: From a skim of the thread, it seems to me the experiment that would resolve matters is testing in a large room with temperature sensors more like 15 feet away in a city or country that's very hot outside, and to compare this with (say) Wirecutter's top pick with two-hoses. Confirm?
(Also, I expect it to seem like I am refusing to update in the face of any evidence, so I'd like to highlight that this model correctly predicted that the tests were run someplace where it was not hot outside. Had that evidence come out different, I'd be much more convinced right now that one hose vs two doesn't really matter.)
From how we tested:
Over the course of a sweltering summer week in Boston, we set up our five finalists in a roughly 250-square-foot space, taking notes and rating each model on the basic setup process, performance, portability, acces... (read more)
In this particular case, I indeed do not think the conflict is worth the cost of exploring - it seems glaringly obvious that people are buying a bad product because they are unable to recognize the ways in which it is bad.
The wirecutter recommendation for budget portable ACs is a single-hose model. Until very recently their overall recommendation was also a single-hose model.
The wirecutter recommendations (and other pages discussing this tradeoffs) are based on a combination of "how cold does it make the room empirically?" and quantitative estimates of coo... (read more)
The best thing we took away from our tests was the chance at a direct comparison between a single-hose design and a dual-hose design that were otherwise identical, and our experience confirmed our suspicions that dual-hose portable ACs are slightly more effective than single-hose models but not effective enough to make a real difference
After having looked into this quite a bit, it does really seem like the Wirecutter testing process had no ability to notice infiltration issues, so it seems like the Wirecutter crew themselves are kind of confused here? ... (read more)
I think that one of the key difficulties for debate research is having good tasks that call for more sophisticated protocols. I think this dataset seems great for that purpose, and having established a negative result for 1-turn debate seems like a good foundation for follow-up work exploring more sophisticated protocols. (It seems like a shame that people don't normally publish early-stage and negative results.)
In comparison with other datasets (e.g. in the negative results described by Beth), it seems like QuALITY is identifying pretty crisp failures and... (read more)
Regulation does not fix the problem, just moves it from the consumer to the regulator. A regulator will only regulate a problem which is obvious to the regulator. A regulator may sometimes have more expertise than a layperson, but even that requires that the politicians ultimately appointing people can distinguish real from fake expertise, which is hard in general.
It seems like the DOE decided to adopt energy-efficiency standards that take into account infiltration. They could easily have made a different decision (e.g. because of pressure from portab... (read more)
I agree that people can easily fail to fix alignment problems, and can instead paper over them, even given a long time to iterate. But I'm not really convinced about your analogy with single-hose air conditioners.
Physics:
The air coming out of the exhaust is often quite a bit hotter than the outside air. I've never checked myself, but just googling has many people reporting 130+ degree temperatures coming out of exhaust from single-hose units. I'm not sure how hot this unit's exhaust is in particular, but I'd guess it's significantly hotter than outside air... (read more)
My overall take on this post and comment (after spending like 1.5 hours reading about AC design and statistics):
Overall I feel like both the OP and this reply say some wrong things. The top Wirecutter recommendation is a dual-hose design. The testing procedure of Wirecutter does not seem to address infiltration in any way, and indeed the whole article does not discuss infiltration as it relates to cooling-efficiency.
Overall efficiency loss from going from dual to single is something like 20-30%, which I do think is much lower than I think the OP ... (read more)
I really liked the "summarizing differences between distributions" paper.
I think I'm excited for broadly the same reasons you are, but to state the case in my own words:
I think how well we can evaluate claims and arguments about AI alignment absolutely determines whether delegating alignment to machines is easier than doing alignment ourselves. A heuristic argument that says "evaluation isn't easier than generation, and that claim is true regardless of how good you are at evaluation until you get basically perfect at it" seems obviously wrong to me. If that's a good summary of the disagreement I'm happy to just leave it there.
It feels to me like there's basically no question that recognizing good cryptosystems is easier than generating them. And recognizing attacks on cryptosystems is easier than coming up with attacks (even if they work by exploiting holes in the formalisms). And recognizing good abstract arguments for why formalisms are inadequate is easier than generating them. And recognizing good formalisms is easier than generating them.
This is all true notwithstanding the fact that we often make mistakes. (Though as we've discussed before, I think that a lot of the examp... (read more)
I don't buy the empirical claim about when recognition is easier than generation. As an example, I think that you can recognize robust formulations much more easily than you can generate them in math, computer science, and physics. In general I think "recognition is not trivial" is different from "recognition is as hard as generation."
I think that argument applies just as easily to a human as to a model, doesn't it?
So it seems like you are making an equally strong claim that "if a human tries to write down something that looks like good alignment work almost all of it will be persuasive but bad." And I think that's kind of true and kind of not true. In general I think you can get much better estimates by thinking about delegating to sociopathic humans (or to humans with slightly different comparative advantages) than trying to make a counting argument.
(I think the fact that "how smart the human is" doesn't matter mostly just proves that the counting argument is untethered from the key considerations.)
Is your story:
It sounds like you are thinking of 2. But I think we have reasonably good intuitions about that. I think for short evaluations "fool us" is obviously easier. For long evaluations (including similarly-informed critics pointing out holes etc.) I think that it rapidly becomes easier to just do good work (though it clearly depends on the kind of work).
I personally have pretty broad error bars; I think it's plausible enough that AI won't help with automating alignment that it's still valuable for us to work on alignment, and plausible enough that AI will help with automating alignment that it significantly increases our chances of survival and is worth preparing for making use of. I also tend to think that current progress in language modeling seems to suggest that models will reach the point of being extremely helpful with alignment way before they become super scary.
Eliezer has consistently expressed c... (read more)
Building weak AI systems that help improve alignment seems extremely important to me and is a significant part of my optimism about AI alignment. I also think it's a major reason that my work may turn out not to be relevant in the long term.
I still think there are tons of ways that delegating alignment can fail, such that it matters that we do alignment research in advance:
What do you (or others) think is the most promising, soon-possible way to use language models to help with alignment? A couple of possible ideas:
Evaluation is not actually easier than generation, when Goodhart is the main problem to begin with.
I think it's very unclear how big a problem Goodhart is for alignment research---it seems like a question about a particular technical domain. There are domains where evaluation is much easier; most obviously mathematics, but also in e.g. physics or computer science, there are massive gaps between recognition and generation even if you don't have formal theorem statements. There are also domains where it's not much easier, where the whole thing rests on compl... (read more)
Yeah, sorry, poor wording on my part. What I meant in that part was "argue that the direct translator cannot be arbitrarily complex", although I immediately mention the case you're addressing here in the parenthetical right after what you quote.
Ah, I just totally misunderstood the sentence, the intended reading makes sense.
Well, it might be that a proposed solution follows relatively easily from a proposed definition of knowledge, in some cases. That's the sort of solution I'm going after at the moment.
I agree that's possible, and it does seem like a... (read more)
We discuss the definition of "knowledge" a bit in this appendix; compared to your definitions, we want to only say that the model "knows" the value of X when it is actually behaving differently based on the value of X in order to obtain a lower loss. I think this is strictly weaker than your level 2 (since the model needs to actually be using that knowledge) and incomparable to your level 1 (since the model's behavior might depend on an estimate of X without the model having any introspective knowledge about that dependence), though I might be misunderstan... (read more)
I am extremely interested in examples of heuristic arguments like this where the naive version leads you astray and it takes much more sophisticated arguments to fix the problem. I'd be happy to pay a $500 bounty for any examples that I find really interesting, in number theory or elsewhere. (Usual disclaimers, my judgment of what counts as interesting will be completely arbitrary and unaccountable, etc.)
The best example I found was the Chebyshev bias that primes are very slightly more likely to be 3 mod 4 than 1 mod 4. The simplest explanation I know of t... (read more)
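For anyone who wants to see the bias directly, here is a quick empirical check (my sketch; the fact that the first sign change occurs near x = 26,861 is a standard result, so below that the 3 mod 4 primes stay ahead):

```python
# Count odd primes up to N by residue mod 4; the 3-mod-4 count is slightly
# ahead for N below the first sign change (~26,861).
def primes_up_to(n):
    """Simple sieve of Eratosthenes."""
    sieve = [True] * (n + 1)
    sieve[0] = sieve[1] = False
    for p in range(2, int(n ** 0.5) + 1):
        if sieve[p]:
            for q in range(p * p, n + 1, p):
                sieve[q] = False
    return [p for p in range(2, n + 1) if sieve[p]]

N = 10_000
ps = primes_up_to(N)
one_mod_4 = sum(1 for p in ps if p % 4 == 1)
three_mod_4 = sum(1 for p in ps if p % 4 == 3)
print(one_mod_4, three_mod_4)  # the 3-mod-4 count is slightly ahead
assert three_mod_4 > one_mod_4
```

The gap is tiny relative to the totals, which is part of why the naive "primes are equidistributed mod 4" heuristic is so close to right and the bias takes a more sophisticated argument to explain.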
Readers interested in this topic should probably read Terence Tao's excellent discussions of probabilistic heuristics in number theory, e.g. this post discussing Fermat's last theorem, the ABC conjecture, and twin primes, or this post on biases in prime number gaps. Those posts really helped improve my understanding of how such heuristic arguments work, and there are some cool surprises.
Isn't the Stuart conjecture an extremely weak form of the Lander, Parkin and Selfridge conjecture? If you specialize their conjecture to then it implies your co... (read more)
I think my way of thinking about things is often a lot like "draw random samples," more like drawing N random samples rather than particle filtering (I guess since we aren't making observations as we go---if I notice an inconsistency the thing I do is more like backtrack and start over with N fresh samples having updated on the logical fact).
The main complexity feels like the thing you point out where it's impossible to make them fully fleshed out, so you build a bunch of intuitions about what is consistent (and could be fleshed out given enough time) and ... (read more)
I don't think there is an "AGI textbook" any more than there is an "industrialization textbook." There are lots of books about general principles and useful kinds of machines. That said, if I had to make wild guesses about roughly what that future understanding would look like:
I feel like I have a broad distribution over worlds and usually answer questions with probability distributions, that I have a complete mental universe (which feels to me like it outputs answers to a much broader set of questions than Eliezer's, albeit probabilistic ones, rather than bailing with "the future is hard to predict"). At a high level I don't think "mainline" is a great concept for describing probability distributions over the future except in certain exceptional cases (though I may not understand what "mainline" means), and that neat stor... (read more)
I feel like I have a broad distribution over worlds and usually answer questions with probability distributions, that I have a complete mental universe (which feels to me like it outputs answers to a much broader set of questions than Eliezer's, albeit probabilistic ones, rather than bailing with "the future is hard to predict").
Sometimes I'll be tracking a finite number of "concrete hypotheses", where every hypothesis is 'fully fleshed out', and be doing a particle-filtering style updating process, where sometimes hypotheses gain or lose weight, sometimes... (read more)
Certainly I don't see fusion reactors, solar panels or (use in electronics of) semiconductors as counterexamples, since each of these was invented at some point, and didn't gradually evolve from some completely different technology.
Your definition of "discontinuity" seems broadly compatible with my view of the future then. Definitely there are different technologies that are not all outgrowths of one another.
My main point of divergence is:
Now, when a QNI comes along, it doesn't necessarily look like a discontinuity, because there might be a lot of work to... (read more)
Yeah, I think this was wrong. I'm somewhat skeptical of the numbers and suspect future revisions will systematically soften those accelerations, but 4x still won't look that crazy.
(I don't remember exactly how I chose that number but it probably involved looking at the same time series so wasn't designed to be much more abrupt.)
My line of argument here is not trying to prove a particular story about AI progress (e.g. "TAI will be similar to a startup") but to push back against (/ voice my confusions about) the confidence level of predictions made by Christiano's model.
What is the confidence level of predictions you are pushing back against? I'm at like 30% on fast takeoff in the sense of "1 year doubling without preceding 4 year doubling" (a threshold intended to be faster than historical precedent but ... (read more)
Edited to clarify.