Do you think part of it might be that even people with graduate philosophy educations are too prone to being wedded to their own ideas, or don't like to poke holes in them as much as they should? Because part of what contributes to my wanting to go more meta is being dissatisfied with my own object-level solutions and finding more and more open problems that I don't know how to solve. I haven't read much academic philosophy literature, but did read some anthropic reasoning and decision theory literature earlier, and the impression I got is that most of the authors weren't trying that hard to poke holes in their own ideas.
I don't understand your ideas in detail (am interested but don't have the time/ability/inclination to dig into the mathematical details), but from the informal writeups/reviews/critiques I've seen of your overall approach, as well as my sense from reading this comment of how far away you are from a full solution to the problems I listed in the OP, I'm still comfortable sticking with "most are wide open". :)
On the object level, maybe we can just focus on Problem 4 for now. What do you think actually happens in a 2IBH-1CDT game? Presumably CDT still plays D,...
...I'm still comfortable sticking with "most are wide open".
Allow me to rephrase. The problems are open, that's fair enough. But the gist of your post seems to be: "Since coming up with UDT, we ran into these problems, made no progress, and are apparently at a dead end. Therefore, UDT might have been the wrong turn entirely." On the other hand, my view is: Since coming up with those problems, we made a lot of progress on agent theory within the LTA, which has implications for those problems among other things, and so far this progress seems to only r...
Even items 1, 3, 4, and 6 are covered by your research agenda? If so, can you quickly sketch what you expect the solutions to look like?
@jessicata @Connor Leahy @Domenic @Daniel Kokotajlo @romeostevensit @Vanessa Kosoy @cousin_it @ShardPhoenix @Mitchell_Porter @Lukas_Gloor (and others, apparently I can only notify 10 people by mentioning them in a comment)
Sorry if I'm late in responding to your comments. This post has gotten more attention and replies than I expected, in many different directions, and it will probably take a while for me to process and reply to them all. (In the meantime, I'd love to see more people discuss each other's ideas here.)
Do you have any examples that could illustrate your theory?
It doesn't seem to fit my own experience. I became interested in Bayesian probability, universal prior, Tegmark multiverse, and anthropic reasoning during college, and started thinking about decision theory and ideas that ultimately led to UDT, but what heuristics could I have been applying, learned from what "domains with feedback"?
Maybe I used a heuristic like "computer science is cool, let's try to apply it to philosophical problems", but if the heuristics are this coarse-grained, it doesn't seem like the idea can explain how detailed philosophical reasoning happens, or be used to ensure AI philosophical competence?
I expect at this moment in time me building a company is going to help me deconfuse a lot of things about philosophy more than me thinking about it really hard in isolation would
Hard for me to make sense of this. What philosophical questions do you think you'll get clarity on by doing this? What are some examples of people successfully doing this in the past?
...It seems plausible that there is no such thing as “correct” metaphilosophy, and humans are just making up random stuff based on our priors and environment and that’s it and there is no “right way”
Hard for me to make sense of this. What philosophical questions do you think you'll get clarity on by doing this? What are some examples of people successfully doing this in the past?
The fact you ask this question is interesting to me, because in my view the opposite question is the more natural one to ask: What kind of questions can you make progress on without constant grounding and dialogue with reality? This is the default of how we humans build knowledge and solve hard new questions, the places where we do best and get the least drawn astray is ...
Philosophy is a social/intellectual process taking place in the world. If you understand the world, you understand how philosophy proceeds.
What if I'm mainly interested in how philosophical reasoning ideally ought to work? (Similar to how decision theory studies how decision making normatively should work, not how it actually works in people.) Of course if we have little idea how real-world philosophical reasoning works, understanding that first would probably help a lot, but that's not the ultimate goal, at least not for me, for both intellectual and A...
If we keep stumbling into LLM type things which are competent at a surprisingly wide range of tasks, do you expect that they’ll be worse at philosophy than at other tasks?
I'm not sure but I do think it's very risky to depend on LLMs to be good at philosophy by default. Some of my thoughts on this:
Here's another bullet point to add to the list:
If it’s not misuse, the provisions in 5.1.4-5 will steer the search process away from policies that attempt to propagandize to humans.
Ok I'll quote 5.1.4-5 to make it easier for others to follow this discussion:
...5.1.4. It may be that the easiest plan to find involves an unacceptable degree of power-seeking and control over irrelevant variables. Therefore, the score function should penalize divergence of the trajectory of the world state from the trajectory of the status quo (in which no powerful AI systems take any actions).
5.1.5. The incentives under
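To make 5.1.4 a bit more concrete: a minimal sketch of the kind of score function it seems to describe (with $U$, $\lambda$, and $D$ as assumed placeholders, not anything taken from the OAA writeup) would be

$$\text{score}(\pi) \;=\; U(\pi) \;-\; \lambda\, D\big(\tau(\pi),\, \tau_{\text{status quo}}\big),$$

where $\tau(\pi)$ is the predicted world-state trajectory under policy $\pi$ and $\tau_{\text{status quo}}$ is the trajectory in which no powerful AI systems take any actions.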
"AI-powered memetic warfare makes all humans effectively insane" a catastrophe that I listed in an earlier comment, which seems one of the hardest to formally specify. It seems values-complete or metaphilosophy-complete to me, since without having specified human values or having solved metaphilosophy, how can we check whether an AI-generated argument is trying to convince us of something that is wrong according to actual human values, or wrong according to normative philosophical reasoning?
I don't see anything in this post or the linked OAA post that addr...
Is this the final update from Ought about their factored cognition experiments? (I can't seem to find anything more recent.) The reason I ask is that the experiments they reported on here do not seem very conclusive, and they talked about doing further experiments but then did not seem to give any more updates. Does anyone know the story of what happened, and what that implies about the viability of factored-cognition style alignment schemes?
goodness of HCH
What is the latest thinking/discussion about this? I tried to search LW/AF but haven't found a lot of discussions, especially positive arguments for HCH being good. Do you have any links or docs you can share?
How do you think about the general unreliability of human reasoning (for example, the majority of professional decision theorists apparently being two-boxers and favoring CDT, and general overconfidence of almost everyone on all kinds of topics, including morality and meta-ethics and other topics relevant for AI alignment) in relatio...
But yeah also I think that AGIs will be by default way better than humans at this sort of stuff.
What are your reasons for thinking this? (Sorry if you already explained this and I missed your point, but it doesn't seem like you directly addressed my point that if AGIs learn from or defer to humans, they'll be roughly human-level at this stuff?)
...When you say “the top tier of rational superintelligences exploits everyone else” I say that is analogous to “the most rational/clever/capable humans form an elite class which rules over and exploits the masses.”
I think that agents worthy of being called “rational” will probably handle all this stuff more gracefully/competently than humans do
Humans are kind of terrible at this, right? Many give in even to threats (bluffs) conjured up by dumb memeplexes and backed up by nothing (e.g., heaven/hell), popular films are full of heroes giving in to threats, an apparent majority of philosophers have 2-boxing intuitions (hence the popularity of CDT, which IIUC was invented specifically because some philosophers were unhappy with EDT choosing to 1-box), governments negotiate w...
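A toy sketch of the 2-boxing parenthetical above, for readers unfamiliar with it (the payoffs and predictor accuracy are the standard illustrative numbers, nothing specific to this discussion):

```python
# Toy Newcomb's problem: the opaque box contains $1M iff the predictor
# predicted one-boxing; the transparent box always contains $1K.
ACCURACY = 0.99  # assumed predictor accuracy

def edt_value(action):
    # EDT conditions on the action: choosing to one-box is strong evidence
    # that the predictor put the $1M in the opaque box.
    p_million = ACCURACY if action == "one-box" else 1 - ACCURACY
    return p_million * 1_000_000 + (1_000 if action == "two-box" else 0)

def cdt_value(action, p_million_already_there=0.5):
    # CDT holds the (already-fixed) box contents constant, so two-boxing
    # dominates: it adds $1K no matter what the predictor did.
    return p_million_already_there * 1_000_000 + (1_000 if action == "two-box" else 0)

print(max(["one-box", "two-box"], key=edt_value))  # -> one-box
print(max(["one-box", "two-box"], key=cdt_value))  # -> two-box
```

EDT recommends one-boxing (conditional expected value of roughly $990K vs. $11K), while CDT recommends two-boxing by dominance; the 2-boxing intuition is the intuition that the CDT answer is the rational one.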
I think I'm less sure than @Eliezer Yudkowsky that there is a good solution to the problem of commitment races, even in theory, or that if there is a solution, it has the shape that he thinks it has. I've been thinking about this problem off and on since 2009, and haven't made much progress. Others have worked on this too (as you noted in the OP), and all seem to have gotten stuck at roughly the same place that I got stuck. Eliezer described what he would do in a particular game, but I don't know how to generalize his reasoning (which you call "nonconseque...
I was informed by an OpenAI insider that the 4 year goal is actually “build a roughly human-level automated alignment researcher”.
I guess train AI to do various kinds of reasoning required for automated alignment research, using RRM, starting with the easier kinds and working your way up to philosophical reasoning?
Thinking about this a bit more, a better plan would be to train AI to do all kinds of reasoning required for automated alignment research in parallel, so that you can monitor what kinds of reasoning are harder for the AI to learn than the others, and have more of an early warning if it looked like that would cause the project to fail. Assuming your plan looks more like t...
Does anyone have guesses or direct knowledge of:
Insofar as philosophical progress is required, my optimism for AI helping on this is lower than for (more) technical research, since in philosophy evaluation is often much harder and I'm not sure that it's always easier than generation. You can much more easily picture a highly charismatic AI-written manifesto that looks very persuasive and is very difficult to refute than technical claims about math, algorithms, or empirical data that are persuasive and hard to falsify.
Given this, what is your current plan around AI and philosophy? I gue...
Thanks for engaging with my questions here. I'll probably have more questions later as I digest the answers and (re)read some of your blog posts. In the meantime, do you know what Paul meant by "it doesn’t depend on goodness of HCH, and instead relies on some claims about offense-defense between teams of weak agents and strong agents" in the other subthread?
Thanks for this helpful explanation.
it doesn’t depend on goodness of HCH, and instead relies on some claims about offense-defense between teams of weak agents and strong agents
Can you point me to the original claims? While trying to find it myself, I came across https://aligned.substack.com/p/alignment-optimism which seems to be the most up to date explanation of why Jan thinks his approach will work (and which also contains his views on the obfuscated arguments problem and how RRM relates to IDA, so should be a good resource for me to read more carefu...
I don't think I disagree with many of the claims in Jan's post, generally I think his high level points are correct.
He lists a lot of things as "reasons for optimism" that I wouldn't consider positive updates (e.g. stuff working that I would strongly expect to work) and doesn't list the analogous reasons for pessimism (e.g. stuff that hasn't worked well yet). Similarly I'm not sure conviction in language models is a good thing but it may depend on your priors.
One potential substantive disagreement with Jan's position is that I'm somewhat more scared ...
I've been trying to understand (catch up with) the current alignment research landscape, and this seems like a good opportunity to ask some questions.
Yes, we are currently planning to continue to pursue these directions for scalable oversight. My current best guess is that scalable oversight will do a lot of the heavy lifting for aligning roughly human-level alignment research models (by creating very precise training signals), but not all of it. Easy-to-hard generalization, (automated) interpretability, and adversarial training+testing will also be core pieces, but I expect we'll add more over time.
I don't really understand why many people updated so heavily on the obfuscated arguments problem; I don't t
1. I think OpenAI is also exploring work on interpretability and on easy-to-hard generalization. I also think that the way Jan is trying to get safety for RRM is fairly different from the argument for correctness of IDA (e.g. it doesn't depend on goodness of HCH, and instead relies on some claims about offense-defense between teams of weak agents and strong agents), even though they both involve decomposing tasks and iteratively training smarter models.
2. I think it's unlikely debate or IDA will scale up indefinitely without major conceptual progress (which...
I'd like to read this post but can't find it in your post history. Any chance it might be sitting in your drafts folder? Also, do you know if obfuscated argument (or the equivalent problem for IDA) is the main reason that research interest in IDA has seemingly declined a lot from 3 years ago?
Then the AI Alignment Awards contest came along, and an excellent entry by Elliot Thornley proposed that incomplete preferences could be used as a loophole to make the shutdown problem tractable.
Where is this contest entry? All my usual search methods are failing me...
One way that things could go wrong, not addressed by this playbook: AI may differentially accelerate intellectual progress in a wrong direction, or in other words create opportunities for humanity to make serious mistakes (by accelerating technological progress) faster than it creates the wisdom to make the right choices (philosophical progress). Specific to the issue of misalignment, suppose we get aligned human-level-ish AI, but it is significantly better at speeding up AI capabilities research than the kinds of intellectual progress needed to continue to minimize misalign...
I guess part of the problem is that the people who are currently most receptive to my message are already deeply enmeshed in other x-risk work, and I don't know how to reach others for whom the message might be helpful (such as academic philosophers just starting to think about AI?). If on reflection you think it would be worth spending some of your time on this, one particularly useful thing might be to do some sort of outreach/field-building, like writing a post or paper describing the problem, presenting it at conferences, and otherwise attracting more ...
Even at 10% p(doom), which I consider to be unreasonably low, it would probably be worth delaying a few years.
Someone with 10% p(doom) may worry that if they got into a coalition with others to delay AI, they couldn't control the delay precisely, and it could easily become more than a few years. Maybe it would be better not to take that risk, from their perspective.
And lots of people have p(doom)<10%. Scott Aaronson just gave 2% for example, and he's probably taken AI risk more seriously than most (currently working on AI safety at OpenAI), so prob...
Why is 1 important? It seems like something we can defer discussion of until after (if ever) alignment is solved, no?
If aging were solved, or looked like it would be solved within the next few decades, it would make efforts to stop or slow down AI development less problematic, both practically and ethically. I think some AI accelerationists might be motivated directly by the prospect of dying/deterioration from old age, and/or view lack of interest/progress on that front as a sign of human inadequacy/stagnation (contributing to their antipathy towards humans)....
It seems like just 4 months ago you still endorsed your second power-seeking paper:
This paper is both published in a top-tier conference and, unlike the previous paper, actually has a shot of being applicable to realistic agents and training processes. Therefore, compared to the original[1] optimal policy paper, I think this paper is better for communicating concerns about power-seeking to the broader ML world.
Why are you now "fantasizing" about retracting it?
...I think a healthy alignment community would have rebuked me for that line of research, but s
To be clear, I still endorse Parametrically retargetable decision-makers tend to seek power. Its content is correct, relevant, and nontrivial. The results, properly used, may enable nontrivial inferences about the properties of inner trained cognition. I don't really want to retract that paper. I usually just fantasize about retracting Optimal policies tend to seek power.
The problem is that I don't trust people to wield even the non-instantly-doomed results.
For example, one EAG presentation cited my retargetability results as showing that most rewar...
Is it just me, or is it nuts that a statement this obvious could have gone outside the Overton window, and is now worth celebrating when it finally (re?)enters?
How is it possible to build a superintelligence at acceptable risk while this kind of thing can happen? What if there are other truths important to safely building a superintelligence that nobody (or very few) acknowledges because they are outside the Overton window?
Now that AI x-risk is finally in the Overton window, what's your vote for the most important and obviously true statement that is still...
Note that this paper already used "Language Agents" to mean something else. See link below for other possible terms. I will keep using "Language Agents" in this comment/thread (unless the OP decides to change their terminology).
I added the tag Chain-of-Thought Alignment, since there's a bunch of related discussion on LW under that tag. I'm not very familiar with this discussion myself, and have some questions below that may or may not already have good answers.
How competent will Language Agents be at strategy/planning, compared to humans and other AI approa...
For example, making numerous copies of itself to work in parallel would again raise the dangers of independently varying goals.
The AI could design a system such that any copies made of itself are deleted after a short period of time (or after completing an assigned task) and no copies of copies are made. This should work well enough to ensure that the goals of all of the copies as a whole never vary far from its own goals, at least for the purpose of researching a more permanent alignment solution. It's not 100% risk-free of course, but seems safe enoug...
I don't think I understand, what's the reason to expect that the "acausal economy" will look like a bunch of acausal norms, as opposed to, say, each civilization first figuring out what its ultimate values are, how to encode them into a utility function, then merging with every other civilization's utility function? (Not saying that I know it will be the latter, just that I don't know how to tell at this point.)
Also, given that I think AI risk is very high for human civilization, and there being no reason to suspect that we're not a typical pre-AGI civiliz...
To your first question, I'm not sure which particular "the reason" would be most helpful to convey. (To contrast: what's "the reason" that physically dispersed human societies have laws? Answer: there's a confluence of reasons.). However, I'll try to point out some things that might be helpful to attend to.
First, committing to a policy that merges your utility function with someone else's is quite a vulnerable maneuver, with a lot of boundary-setting aspects. For instance, will you merge utility functions multiplicatively (as in Nas...
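For readers unfamiliar with the contrast being gestured at: assuming the "multiplicatively" here refers to a Nash-bargaining-style product (the quote is cut off), the difference from a simple additive merge is roughly

$$U_{\text{additive}} = w_A U_A + w_B U_B \qquad \text{vs.} \qquad U_{\text{Nash}} = (U_A - d_A)(U_B - d_B),$$

where $d_A, d_B$ are each party's disagreement payoffs. The two rules generally select different points on the Pareto frontier, and the Nash product's dependence on the disagreement point is one example of a boundary-setting aspect: what each party can credibly fall back on affects the merged outcome.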
We have a lot of experience and knowledge of building systems that are broadly beneficial and safe, while operating in the human capabilities regime.
What? A major reason we're in the current mess is that we don't know how to do this. For example, we don't seem to know how to build a corporation (or more broadly, an economy) such that its most powerful leaders don't act like Hollywood villains (e.g., racing for AI to make a competitor 'dance'). Even our "AGI safety" organizations don't behave safely (e.g., racing for capabilities, handing them over to others, e.g....
My personal view is that given all of this history and the fact that this forum is named the "AI Alignment Forum", we should not redefine "AI Alignment" to mean the same thing as "Intent Alignment". I feel like to the extent there is confusion/conflation over the terminology, it was mainly due to Paul's (probably unintentional) overloading of "AI alignment" with the new and narrower meaning (in Clarifying “AI Alignment”), and we should fix that error by collectively going back to the original definition, or in some circumstances where the risk of confusion...
Other relevant paragraphs from the Arbital post:
...“AI alignment theory” is meant as an overarching term to cover the whole research field associated with this problem, including, e.g., the much-debated attempt to estimate how rapidly an AI might gain in capability once it goes over various particular thresholds.
Other terms that have been used to describe this research problem include “robust and beneficial AI” and “Friendly AI”. The term “value alignment problem” was coined by Stuart Russell to refer to the primary subproblem of aligning AI preferences wit
Here is some clearer evidence that broader usages of "AI alignment" were common from the beginning:
The “alignment problem for advanced agents” or “AI alignment” is the overarching research topic of how to develop sufficiently advanced machine intelligences such that running them produces good outcomes in the real world.
(I couldn't find an easy way to view the original 2015 version, but I do have a screenshot that I can produce upon request showing a Jan 2017 edit on Arbital that already had this broad def...
Your main justification was that Eliezer used the term with an extremely broad definition on Arbital, but the Arbital page was written way after a bunch of other usage (including after me moving to ai-alignment.com I think).
Eliezer used "AI alignment" as early as 2016 and ai-alignment.com wasn't registered until 2017. Any other usage of the term that potentially predates Eliezer?
I’m not sure what order the history happened in and whether “AI Existential Safety” got rebranded into “AI Alignment” (my impression is that AI Alignment was first used to mean existential safety, and maybe this was a bad term, but it wasn’t a rebrand)
There was a pretty extensive discussion about this between Paul Christiano and me. tl;dr "AI Alignment" clearly had a broader (but not very precise) meaning than "How to get AI systems to try to do what we want" when it first came into use. Paul later used "AI Alignment" for his narrower meaning, but after...
Overall I expect there to be a small number of massive training runs due to economies of scale, but I also expect AI developer margins to be reasonable, and I don't see a strong reason to expect them to end up with way more power than other actors in the supply chain (either the companies who supply computing power, or the downstream applications of AI).
Is the reason that you expect AI developer margins to be reasonable that you expect the small number of AI developers to still compete with each other on price and thereby erode each other's margins? What...
If we succeed at the technical problem of AI alignment, AI developers would have the ability to decide whether their systems generate sexual content or opine on current political events, and different developers can make different choices. Customers would be free to use whatever AI they want, and regulators and legislators would make decisions about how to restrict AI.
Presumably if most customers are able to find companies offering AIs that align sufficiently with their own preferences, there would be no backlash. The kind of backlash you're worried abo...
Accordingly, I think there's a tendency to give OpenAI an unfair amount of flak compared to, say, Google Brain or FAIR or any of the startups like Adept or Cohere.
I'm not sure I agree that this is unfair.
OpenAI is clearly on the cutting edge of AI research.
This is obviously a good reason to focus on them more.
OpenAI has a lot of visibility in this community, due to its physical proximity and a heavy overlap between OpenAI employees and the EA/Rationalist social scene.
Perhaps we have responsibility to scrutinize/criticize them more because of this...
I guess it depends on the specific alignment approach being taken, such as whether you're trying to build a sovereign or an assistant. Assuming the latter, I'll list some philosophical problems that seem generally relevant:
To the extent that alignment research involves solving philosophical problems, it seems that in this approach we will also need to automate philosophy, otherwise alignment research will become bottlenecked on those problems (i.e., on human philosophers trying to solve those problems while the world passes them by). Do you envision automating philosophy (and are you optimistic about this) or see some other way of getting around this issue?
It worries me to depend on AI to do philosophy, without understanding what "philosophical reasoning" or "philosophical p...
ChatGPT has a tendency to make up answers and refuse to admit it. A log follows. Note that almost every paragraph of the initial biography contains multiple falsehoods that are unlikely to be directly in its training data.
Show me a 10 paragraph biography of Wei Dai
Wei Dai was born in Guangzhou, China in 1967. She received her undergraduate degree in computer science from the University of Science and Technology of China in 1988 and her master's degree in computer science from the University of California, Berkeley in 1991.
After completing her master's ...
This is tempting, but the problem is that I don't know what my idealized utility function is (e.g., I don't have a specification for CEV that I think would be safe or ideal to optimize for), so what does it mean to try to approximate it? Or consider that I only read about CEV one day in a blog, so what was I doing prior to that? Or if I was supposedly trying to approximate CEV, I can change my mind about it if I realized that it's a bad idea, but how does that fit into the framework?
My own framework is something like this:
Imagine someone who considers a few plans, grades them (e.g. “how good does my gut say this plan is?”), and chooses the best. They are not a grader-optimizer. They are not trying to navigate to the state where they propose and execute a plan which gets maximally highly rated by some evaluative submodule. They use a grading procedure to locally rate and execute plans, and may even locally think “what would make me feel better about this plan?”, but the point of their optimization isn’t “find the plan which makes me feel as good as globally possible.”
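A toy sketch of the distinction the quoted passage is drawing (all names here are hypothetical, just to make the contrast concrete):

```python
import random

def grade(plan: str) -> int:
    # Stand-in for an internal evaluative submodule
    # ("how good does my gut say this plan is?").
    return sum(ord(c) for c in plan) % 100

ALL_PLANS = [f"plan-{i}" for i in range(10_000)]

def grader_optimizer() -> str:
    # Navigates to whatever plan the evaluative submodule rates maximally
    # highly: a global search over plan-space for the highest grade.
    return max(ALL_PLANS, key=grade)

def local_plan_chooser(n_considered: int = 5) -> str:
    # Considers a few plans, grades them locally, and executes the best of
    # those; the grade guides the choice, but nothing pushes toward the
    # global maximum of the grader.
    candidates = random.sample(ALL_PLANS, n_considered)
    return max(candidates, key=grade)
```

On this reading, the second pattern uses a grading procedure without being a grader-optimizer: the optimization target is the plan being good, not the grader's output being maximal.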
The ...
This is a bit stronger than how I would phrase it, but basically yes.
I tend to be pretty skeptical of new ideas. (This backfired spectacularly once, when I didn't pay much attention to Satoshi when he contacted me about Bitcoin,...