Someone who is interested in learning and doing good.
My Twitter: https://twitter.com/MatthewJBar
My Substack: https://matthewbarnett.substack.com/
I agree this is important and it was in your post but it seems like a decent description of what the majority of AI x-risk governance people are already working on, or at least not obviously a bad one.
I agree. I'm not criticizing the people who are trying to make sure that policies are well-targeted and grounded in high-quality evidence. I'm arguing in favor of their work. I'm mainly arguing against public AI safety advocacy work, which was recently upvoted highly on the EA Forum. [ETA, rewording: To the extent I was arguing against a single line of work, I was primarily arguing against public AI safety advocacy work, which was recently upvoted highly on the EA Forum. However, as I wrote in the post, I also think that we should re-evaluate which problems will be solved by default, which means I'm not merely letting other AI governance people off the hook.]
Operationalizing disagreements well is hard and time-consuming especially when we're betting on "how things would go without intervention from a community that is intervening a lot", but a few very rough forecasts, all conditional on no TAI before resolve date:
I appreciate these predictions, but I am not as interested in predicting personal or public opinions. I'm more interested in predicting regulatory stringency, quality, and scope.
Even if fewer than 10% of Americans consider AI to be the most important issue in 2028, I don't think that necessarily indicates that regulations will have low stringency, low quality, or poor scope. Likewise, I'm not sure whether I want to predict on Evan Hubinger's opinion, since I'd probably need to understand more about how he thinks to get it right, and I'd prefer to focus the operationalization instead on predictions about large, real world outcomes. I'm not really sure what disagreement the third prediction is meant to operationalize, although I find it to be an interesting question nonetheless.
The key question that the debate was about was whether building AGI would require maybe 1-2 major insights about how to build it, vs. it would require the discovery of a large number of algorithms that would incrementally make a system more and more up-to-par with where humans are at.
Robin Hanson didn't say that AGI would "require the discovery of a large number of algorithms". He emphasized instead that AGI would require a lot of "content" and would require a large "base". He said,
My opinion, which I think many AI experts will agree with at least, including say Doug Lenat who did the Eurisko program that you most admire in AI [gesturing toward Eliezer], is that it's largely about content. There are architectural insights. There are high-level things that you can do right or wrong, but they don't, in the end, add up to enough to make vast growth. What you need for vast growth is simply to have a big base. [...]
Similarly, I think that for minds, what matters is that it just has lots of good, powerful stuff in it, lots of things it knows, routines, strategies, and there isn't that much at the large architectural level.
This is all vague, but I think you can interpret his comment here as emphasizing the role of data, and making sure the model has learned a lot of knowledge, routines, strategies, and so on. That's different from saying that humans need to discover a bunch of algorithms, one by one, to incrementally make a system more up-to-par with where humans are at. It's compatible with the view that humans don't need to discover a lot of insights to build AGI. He's saying that insights are not sufficient: you need to make sure there's a lot of "content" in the AI too.
I personally find his specific view here to have been vindicated more than the alternative, even though there were many details in his general story that ended up aging very poorly (especially ems).
Can you provide examples of interventions that meet your bar for not being done by default? It's hard to understand the takeaways from your post because the negative examples are made much more concrete than the proposed positive ones
I have three things to say here:
I feel like this post is to some extent counting our chickens before they hatch (tbc I agree with the directional update as I said above). [...] I don't have strong views on the value of AI advocacy, but this post seems overconfident in calling it out as being basically not useful based on recent shifts.
I want to emphasize that the current policies were crafted in an environment in which AI still has a tiny impact on the world. My expectation is that policies will get much stricter as AI becomes a larger part of our lives. I am not making the claim that current policies are sufficient; instead I am making a claim about the trajectory, i.e. how well we should expect society to respond at any given time, given the evidence and level of AI capabilities at that time. I believe that current evidence supports my interpretation of our general trajectory, but I'm happy to hear someone explain why they disagree and highlight concrete predictions that could operationalize this disagreement.
Presumably you think that ordinary human beings are capable of "singling out concepts that are robustly worth optimizing for".
Nope! At least, not directly, and not in the right format for hooking up to a superintelligent optimization process.
If ordinary humans can't single out concepts that are robustly worth optimizing for, then either,
Can you be more clear about which of these you believe?
I'm also including "indirect" ways that humans can single out concepts that are robustly worth optimizing for. But then I'm allowing that GPT-N can do that too. Maybe this is where the confusion lies?
If you're allowing for humans to act in groups and come up with these concepts after e.g. deliberation, and still think that ordinary humans can't single out concepts that are robustly worth optimizing for, then I think this view is a little silly, although the second interpretation at least allows for the possibility that the future goes well and we survive AGI, and that would be nice to know.
Thanks for the continued clarifications.
Our primary existing disagreement might be this part,
My estimate of how well Eliezer or Nate or Rob of 2016 would think my comment above summarizes the relevant parts of their own models, is basically the same as my estimate of how well Eliezer or Nate or Rob of today would think my comment above summarizes the relevant parts of their own models.
Of course, there's no way of proving what these three people would have said in 2016, and I sympathize with the people who are saying they don't care much about the specific question of who said what when. However, here's a passage from the Arbital page on the Problem of fully updated deference, which I assume was written by Eliezer,
One way to look at the central problem of value identification in superintelligence is that we'd ideally want some function that takes a complete but purely physical description of the universe, and spits out our true intended notion of value V in all its glory. Since superintelligences would probably be pretty darned good at collecting data and guessing the empirical state of the universe, this probably solves the whole problem.
This is not the same problem as writing down our true V by hand. The minimum algorithmic complexity of a meta-utility function ΔU which outputs V after updating on all available evidence, seems plausibly much lower than the minimum algorithmic complexity for writing V down directly. But as of 2017, nobody has yet floated any formal proposal for a ΔU of this sort which has not been immediately shot down.
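In symbols (my own rough paraphrase, not notation from the Arbital page), the complexity claim in the second paragraph is something like:

\[
\min \{\, K(\Delta U) \;:\; \Delta U(E) = V \,\} \;\ll\; K(V)
\]

where K denotes algorithmic (Kolmogorov) complexity, E stands for the complete body of empirical evidence about the universe, and ΔU(E) is the utility function that ΔU outputs after updating on E.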
In this passage, Eliezer describes the problem of value identification much as I did in the post, except that he refers to a function that reflects "value V in all its glory" rather than a function that reflects V with fidelity comparable to the judgement of an ordinary human. And he adds that "as of 2017, nobody has yet floated any formal proposal for a ΔU of this sort which has not been immediately shot down". My interpretation is therefore one of the following:
1. Eliezer was referring only to the problem of specifying our full, idealized values, i.e. "V in all its glory".
2. Eliezer was referring, at least in part, to the problem of specifying a function that reflects the human value function at roughly the level of fidelity an ordinary human can provide.
If interpretation (1) is accurate, then I mostly just think that we don't need to specify an objective function that matches something like the full coherent extrapolated volition of humanity in order to survive AGI. On the other hand, if interpretation (2) is accurate, then I think in 2017 and potentially earlier, Eliezer genuinely felt that there was an important component of the alignment problem that involved specifying a function that reflected the human value function at a level that current LLMs are relatively close to achieving, and he considered this problem unsolved.
I agree there are conceivable alternative ways of interpreting this quote. However, I believe the weight of the evidence, given the quotes I provided in the post, in addition to the one I provided here, supports my thesis about the historical argument, and what people had believed at the time (even if I'm wrong about a few details).
I don't think MIRI ever considered this an important part of the alignment problem, and I don't think we expect humanity to solve lots of the alignment problem as a result of having such a tool
If you don't think MIRI ever considered coming up with an "explicit function that reflects the human value function with high fidelity" to be "an important part of the alignment problem", can you explain this passage from the Arbital page on The problem of fully updated deference?
One way to look at the central problem of value identification in superintelligence is that we'd ideally want some function that takes a complete but purely physical description of the universe, and spits out our true intended notion of value V in all its glory. Since superintelligences would probably be pretty darned good at collecting data and guessing the empirical state of the universe, this probably solves the whole problem.
This is not the same problem as writing down our true V by hand. The minimum algorithmic complexity of a meta-utility function ΔU which outputs V after updating on all available evidence, seems plausibly much lower than the minimum algorithmic complexity for writing V down directly. But as of 2017, nobody has yet floated any formal proposal for a ΔU of this sort which has not been immediately shot down.
Eliezer (who I assume is the author) appears to say in the first paragraph that solving the problem of value identification for superintelligences would "probably [solve] the whole problem", and by "whole problem" I assume he's referring to what he saw as an important part of the alignment problem (maybe not, though).
He referred to the problem of value identification as getting "some function that takes a complete but purely physical description of the universe, and spits out our true intended notion of value V in all its glory." This seems very similar to my definition, albeit with the caveat that my definition isn't about revealing "V in all its glory" but rather about revealing V at the level that an ordinary human is capable of.
Unless the sole problem here is that we absolutely need our function that reveals V to be ~perfect, I think this quote from the Arbital page directly supports my interpretation, and overall supports the thesis in my post pretty strongly (even if I'm wrong about a few minor details).
Thanks for trying to understand my position. I think this interpretation that you gave is closest to what I'm arguing,
Attempting again: on Matthew's model of past!Nate's model, getting an AI to answer the above sorts of questions properly was supposed to take a lot of elbow grease. But it doesn't take a lot of elbow grease, which suggests that values are much easier to lift out of human data than past!Nate thought, which means that value is more like "diamond" and less like "a bunch of random noise", which means that alignment is easier than past!Nate thought (in the <20% of the problem that constitutes "picking something worth optimizing for").
I have a quick response to what I see as your primary objection:
The hard part of value specification is not "figure out that you should call 911 when Alice is in labor and your car has a flat", it's singling out concepts that are robustly worth optimizing for.
I think this is kinda downplaying what GPT-4 is good at? If you talk to GPT-4 at length, I think you'll find that it's cognizant of many nuances in human morality that go way deeper than the moral question of whether to "call 911 when Alice is in labor and your car has a flat". Presumably you think that ordinary human beings are capable of "singling out concepts that are robustly worth optimizing for". I claim that to the extent ordinary humans can do this, GPT-4 can nearly do this as well, and to the extent it can't, I expect almost all the bugs to be ironed out in near-term multimodal models.
It would be nice if you made a precise prediction about what type of moral reflection or value specification multimodal models won't be capable of performing in the near future, if you think that they are not capable of the 'deep' value specification that you care about. And here, again, I'm looking for some prediction of the form: humans are able to do X, but LLMs/multimodal models won't be able to do X by, say, 2028. Admittedly, making this prediction precise is probably hard, but it's difficult for me to interpret your disagreement without a little more insight into what you're predicting.
This comment is valuable for helping to clarify the disagreement. So, thanks for that. Unfortunately, I am not sure I fully understand the comment yet. Before I can reply in-depth, I have a few general questions:
(In general, I agree that discussions about current arguments are way more important than discussions about what people believed >5 years ago. However, I think it's occasionally useful to talk about the latter, and so I wrote one post about it.)
Can you explain how you're defining outer alignment and value specification?
I'm using this definition, provided by Hubinger et al.
the outer alignment problem is an alignment problem between the system and the humans outside of it (specifically between the base objective and the programmer’s intentions). In the context of machine learning, outer alignment refers to aligning the specified loss function with the intended goal, whereas inner alignment refers to aligning the mesa-objective of a mesa-optimizer with the specified loss function.
Evan Hubinger provided clarification about this definition in his post "Clarifying inner alignment terminology",
Outer Alignment: An objective function is outer aligned if all models that perform optimally on it in the limit of perfect training and infinite data are intent aligned.[2]
I deliberately avoided using the term "outer alignment" in the post because I wanted to be more precise and not get into a debate about whether the value specification problem matches this exact definition. (I think the definitions are subtly different but the difference is not very relevant for the purpose of the post.) Overall, I think the two problems are closely associated and solving one gets you a long way towards solving the other. In the post, I defined the value identification/specification problem as,
I am mainly talking about the problem of how to specify (for example, write into a computer) an explicit function that reflects the human value function with high fidelity, in the sense that judgements from this function about the value of outcomes fairly accurately reflect the judgements of ordinary humans.
This was based on the Arbital entry for the value identification problem, which was defined as a
subproblem category of value alignment which deals with pinpointing valuable outcomes to an advanced agent and distinguishing them from non-valuable outcomes.
I should note that I used this entry as the primary definition in the post because I was not able to find a clean definition of this problem anywhere else.
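To make the kind of object I have in mind concrete, here is a minimal sketch (my own illustration, not something from the post or from Arbital) of how an LLM could be wrapped as an explicit value function over natural-language descriptions of outcomes. The query_llm helper is a hypothetical placeholder for whatever chat-model API one has access to, and the 0-10 rating scheme is just one arbitrary way to elicit a scalar judgement.

def query_llm(prompt: str) -> str:
    # Hypothetical placeholder so the sketch runs end-to-end; in practice this
    # would call whatever chat model (e.g. a GPT-4-class model) is available.
    return "7"

def value_of_outcome(outcome_description: str) -> float:
    # Ask the model to rate the outcome the way an ordinary, thoughtful person might.
    prompt = (
        "Rate how good or bad the following outcome is, from the point of view "
        "of an ordinary, thoughtful person. Reply with a single number from 0 "
        "(catastrophic) to 10 (excellent), and nothing else.\n\n"
        "Outcome: " + outcome_description
    )
    return float(query_llm(prompt).strip())

if __name__ == "__main__":
    print(value_of_outcome("A cure for all cancers is discovered and distributed freely."))

The claim in the post is then, roughly, that a function like this, backed by a model such as GPT-4, already tracks the judgements of ordinary humans about the value of outcomes reasonably well.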
I'd appreciate it if you clarified whether you are saying:
I'm more sympathetic to (3) than (2), and more sympathetic to (2) than (1), roughly speaking.
Yes, the post was about more than that. To the extent I was arguing against a single line of work, it was mainly intended as a critique of public advocacy. Separately, I asked people to re-evaluate which problems will be solved by default, to refocus our efforts on the most neglected, important problems, and went into detail about what I currently expect will be solved by default.
I offered a concrete prediction in the post. If people don't think my prediction operationalizes any disagreement, then I think either (1) they don't disagree with me, in which case maybe the post isn't really aimed at them, or (2) they disagree with me in some other way that I can't predict, and I'd prefer they explain exactly where they disagree.
It seems relatively valueless to predict on what will happen without intervention, since AI x-risk people will almost certainly intervene.
I mostly agree. But I think it's still better to offer a precise prediction than to only offer vague predictions, which I perceive as the more common and more serious failure mode in discussions like this one.