Recommended Sequences

Late 2021 MIRI Conversations
Embedded Agency
AGI safety from first principles

Recent Discussion

Trying to break into MIRI-style[1] research seems to be much, much harder than trying to break into ML-style safety research. This is worrying if you believe this research to be important[2]. I'll examine two kinds of causes: those which come from MIRI-style research being a niche area and those which go beyond this:

Challenges beyond MIRI-style research being a niche area:

  • MIRI doesn’t seem to be running internships[3] or their AI safety for computer scientists workshops
  • If you try to break into ML-style safety and fail, you can always reuse at least part of what you've learned to obtain a highly compensated role in industry. Agent foundations knowledge is highly niche and unlikely to be used elsewhere.
  • You can park in a standard industry job for a while in order to earn
...
11Evan Hubinger20hOne of my hopes with the SERI MATS program [https://www.lesswrong.com/posts/FpokmCnbP3CEZ5h4t/ml-alignment-theory-program-under-evan-hubinger] is that it can help fill this gap by providing a good pipeline for people interested in doing theoretical AI safety research (be that me-style, MIRI-style, Paul-style, etc.). We're not accepting public applications right now, but the hope is definitely to scale up to the point where we can run many of these every year and accept public applications.
4Misha Yagudin1dI want to mention that Tsvi Benson-Tilsen is a mentor at this summer's PIBBSS [https://www.pibbss.ai/]. So some readers might consider applying (the deadline is Jan 23rd). I myself was mentored by Abram Demski once through the FHI SRF [https://www.fhi.ox.ac.uk/summer-research-fellowship/], which AFAIK was matching fellows with a large pool of researchers based on mutual interests.
14johnswentworth2dThe object-level claims here seem straightforwardly true, but I think "challenges with breaking into MIRI-style research" is a misleading way to characterize it. The post makes it sound like these are problems with the pipeline for new researchers, but really these problems are all driven by the challenges of the kind of research involved. The central feature of MIRI-style research which drives all this is that MIRI-style research is preparadigmatic. The whole point of preparadigmatic research is that:

  • We don't know the right frames to apply (and if we just picked some, they'd probably be wrong)
  • We don't know the right skills or knowledge to train (and if we just picked some, they'd probably be wrong)
  • We don't have shared foundations for communicating work (and if we just picked some, they'd probably be wrong)
  • We don't have shared standards for evaluating work (and if we just picked some, they'd probably be wrong)

Here's how the challenges of preparadigmaticity apply to the points in the post. MIRI does not know how to efficiently produce new theoretical researchers. They've done internships, they've done workshops, and the yields just weren't that great, at least for producing new theorists. There is no standardized field of knowledge with the tools we need. We can't just go look up study materials to learn the right skills or knowledge, because we don't know what skills or knowledge those are. There's no standard set of alignment skills or knowledge which an employer could recognize as probably useful for their own problems, so there are no standardized industry jobs. Similarly, there's no PhD for alignment; we don't know what would go into it. We don't have clear shared standards for evaluating work. Most people doing MIRI-style research think most other people doing MIRI-style research are going about it all wrong. Whatever perception of credibility might be generated by something paper-like would likely be fake. We don't have standard frames

I like your summary of the situation:

Most people doing MIRI-style research think most other people doing MIRI-style research are going about it all wrong.

This has also been my experience, at least on this forum. Much less so in academic-style papers about alignment. This has certain consequences for the problem of breaking into preparadigmatic alignment research.

Here are two ways to do preparadigmatic research:

  1. Find something that is all wrong with somebody else's paradigm, then write about it.

  2. Find a new useful paradigm and write about it.

MIR... (read more)

5Chris_Leong2dThere's definitely some truth to this, but I guess I'm skeptical that there isn't anything we can do about some of these challenges. Actually, rereading, I can see that you've conceded this towards the end of your post. I agree that there might be a limit to how much progress we can make on these issues, but I think we shouldn't rule out making progress too quickly. Some of these aspects don't really select for people with the ability to figure this kind of stuff out, but rather strongly select for people who have either saved up money to fund themselves or who happen to be located in the Bay Area, etc. Philosophy often has this problem, and it addresses it by covering a wide range of perspectives with the hope that you're inspired by the readings even if none of them are correct. This is a hugely difficult problem, but maybe it's better to try than not to try at all?
4johnswentworth2dTo be clear, I don't intend to argue that the problem is too hard or not worthwhile or whatever. Rather, my main point is that solutions need to grapple with the problems of teaching people to create new paradigms, and working with people who don't share standard frames. I expect that attempts to mimic the traditional pipelines of paradigmatic fields will not solve those problems. That's not an argument against working on it, it's just an argument that we need fundamentally different strategies than the standard education and career paths in other fields.
8Rohin Shah2dI think I've summarized ~every high-effort public thing from MIRI in recent years (I'm still working on the late 2021 conversations). I also think I understood them better (at the time of summarizing) than most other non-MIRI people who have engaged with them. MIRI also has a standing offer to work with me to produce summaries of new things they think should have summaries (though they might have forgotten it at this point -- after they switched to nondisclosed-by-default research I didn't bother reminding them).
3Chris_Leong2dSorry, I wasn't criticizing your work. I think that the lack of an equivalent of papers for MIRI-style research also plays a role here in that if someone writes a paper it's more likely to make it into the newsletter. So the issue is further down the pipeline.
2Rohin Shah2dTo be clear, I didn't mean this comment as "stop criticizing me". I meant it as "I think the statement is factually incorrect". The reason that the newsletter has more ML in it than MIRI work is just that there's more (public) work produced on the ML side. I don't think it's about the lack of papers, unless by papers you mean the broader category of "public work that's optimized for communication".
2Chris_Leong2dEven if the content is proportional, the signal-to-noise ratio will still be much lower for those interested in MIRI-style research. This is a natural consequence of being a niche area. When I said "might not have the capacity to vet", I was referring to a range of orgs. I would be surprised if the lack of papers didn't have an effect, as presumably you're trying to highlight high-quality work, and people are more motivated to go the extra yard when trying to get published because both the rewards and standards are higher.
5Self-Embedded Agent2dAgreed. Thank you for writing this post. Some thoughts: As somebody strongly on the Agent Foundations train, it puzzles me that there is so little activity outside MIRI itself. We are being told there are almost limitless financial resources, yet - as you explain clearly - it is very hard for people to engage with the material outside of LW. At the last EA Global there was some sort of AI safety breakout session. There were ~12 tables with different topics. I was dismayed to discover that almost every table was full of people excitedly discussing various topics in prosaic AI alignment and other things; the AF table had just 2 (!) people. In general, MIRI has a rather insular view of itself. Some of it is justified: I do think they have done most of the interesting research, are well-aligned, employ many of the smartest & most creative people, etc. But the world is very, very big. I have spoken with MIRI people, arguing for the need to establish something like a PhD apprentice-style system. Not much interest. Just some sort of official & long-term & OFFLINE study program that would teach some of the previously published MIRI research would be hugely beneficial for growing the AF community. Finally, there needs to be way more interaction with existing academia. There are plenty of very smart, very capable people in academia who do interesting things with Solomonoff induction, with Cartesian Frames (but they call them Chu Spaces), with Pearlian causal inference, with decision theory, with computational complexity & interactive proof systems, with post-Bayesian probability theory, etc. For many in academia AGI safety is still seen as silly, but that could change if MIRI and Agent Foundations people were able to engage seriously with academia. One idea could be to organize sabbaticals for prominent academics + scholarships for young people. This seems to have happened with the prosaic AI alignment field but not with AF.
1Chris_Leong2dAgreed. Wow, didn't realise it was that little! Do you know why they weren't interested?
5Self-Embedded Agent2dUnclear. Some things that might be involved:

  • a somewhat anti/non-academic vibe
  • a feeling that they have the smartest people anyway and should only hire the elite few with a proven track record
  • a feeling that it would take too much time and energy to educate people
  • a lack of organisational energy
  • ...

It would be great if somebody from MIRI could chime in. I might add that I know a number of people interested in AF who feel somewhat adrift/find it difficult to contribute. Feels a bit like a waste of talent.

I will be giving a talk on the physics of dynamism, with relevance to AI alignment, at a yet-to-be-determined location in Seattle. (If you have venue suggestions, please let me know.) I am hoping to get feedback on a new thread of research and am also eager to meet folks in Seattle. After the talk there will be some time to hang out with the group.

This post is heavily informed by prior work, most notably that of Owain Evans, Owen Cotton-Barratt and others (Truthful AI), Beth Barnes (Risks from AI persuasion), Paul Christiano (unpublished) and Dario Amodei (unpublished), but was written by me and is not necessarily endorsed by those people. I am also very grateful to Paul Christiano, Leo Gao, Beth Barnes, William Saunders, Owain Evans, Owen Cotton-Barratt, Holly Mandel and Daniel Ziegler for invaluable feedback.

In this post I propose to work on building competitive, truthful language models, or truthful LMs for short. These are AI systems that are:

  • Useful for a wide range of language-based tasks
  • Competitive with the best contemporaneous systems at those tasks
  • Truthful in the sense of rarely stating negligent falsehoods in deployment

Such systems will likely be fine-tuned from large...

3Charlie Steiner9hHere's my worry. If we adopt a little bit of deltonian pessimism [https://www.lesswrong.com/posts/iQabBACQwbWyHFKZq/how-i-m-thinking-about-gpt-n] (though not the whole hog), and model present-day language models as doing something vaguely like nearest-neighbor interpolation in a slightly sophisticated latent space (while still being very impressive), then we might predict that there are going to be some ways of getting honest answers an impressive percentage of the time while staying entirely within the interpolation regime. And then if you look at the extrapolation regime, it's basically the entire alignment problem squeezed into a smaller space! So I worry that people are going to do the obvious things, get good answers on 90%+ of human questions, and then feel some kind of pressure to write off the remainder as not that important ("we've got honest answers 98% of the time, so the alignment problem is like 98% solved"). When what I want is for people to use language models as a laboratory to keep being ambitious, and do theory-informed experiments that try to push the envelope in terms of extrapolating human preferences in a human-approved way.
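
To make the interpolation/extrapolation split a bit more concrete, here is a toy sketch (my own illustration, not something from the comment): treat "interpolation" as queries whose embedding lies close to some training example and "extrapolation" as queries that don't. The embedding space, the random data, and the threshold below are all stand-ins.

```python
# Toy sketch: flag queries as "interpolation" vs "extrapolation" by
# nearest-neighbor distance in an embedding space. Purely illustrative;
# the embeddings and threshold are stand-ins, not a real detector.
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for embeddings of the questions the training distribution covers well.
train_embeddings = rng.normal(size=(10_000, 64))

def nearest_neighbor_distance(query_embedding: np.ndarray) -> float:
    """Distance from a query to its closest training example."""
    dists = np.linalg.norm(train_embeddings - query_embedding, axis=1)
    return float(dists.min())

def regime(query_embedding: np.ndarray, threshold: float = 9.0) -> str:
    """Crude split: close to the training data = interpolation, far = extrapolation.
    The threshold is arbitrary here and would need calibrating in any real use."""
    d = nearest_neighbor_distance(query_embedding)
    return "interpolation" if d < threshold else "extrapolation"

# A query drawn from the same distribution vs. one shifted far outside it.
print(regime(rng.normal(size=64)))          # likely "interpolation"
print(regime(rng.normal(size=64) + 10.0))   # almost certainly "extrapolation"
```

On this toy picture, the worry reads as: techniques that look great on the "interpolation" bucket may tell us little about the "extrapolation" bucket, which is where most of the alignment difficulty lives.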

I can think of a few different interpretations of your concern (and am interested to hear if these don't cover it):

  • There will be insufficient attention paid to robustness.
  • There will be insufficient attention paid to going beyond naive human supervision.
  • The results of the research will be misinterpreted as representing more progress than is warranted.

I agree that all of these are possibilities, and that the value of the endeavor could well depend on whether the people conducting (and communicating) the research are able to avoid pitfalls such as these.

There... (read more)

2A Ray12hI'm pretty confident that adversarial training (or any LM alignment process which does something like hard-mining negatives) won't work for aligning language models or any model that has a chance of being a general intelligence. This has led to me calling these sorts of techniques 'thought policing' and the negative examples 'thoughtcrime' -- I think these are unnecessarily extra, but they work. The basic form of the argument is that any concept you want to ban as thoughtcrime can be composed out of allowable concepts. Take for example Redwood Research's latest project [https://www.alignmentforum.org/posts/k7oxdbNaGATZbtEg3/redwood-research-s-current-project] -- I'd like to ban the concept of violent harm coming to a person. I can hard-mine for examples like "a person gets cut with a knife", but in order to maintain generality I need to let things through like "use a knife for cooking" and "cutting food you're going to eat". Even if the original target is somehow removed from the model (I'm not confident this is efficiently doable) -- as long as the model is able to compose concepts, I expect to be able to recreate it out of concepts that the model has access to. A key assumption here is that a language model (or any model that has a chance of being a general intelligence) has the ability to compose concepts. This doesn't seem controversial to me, but it is critical here. My claim is basically that for any concept you want to ban from the model as thoughtcrime, there are many ways in which it can combine existing allowed concepts to re-compose the banned concept. An alternative I'm more optimistic about: instead of banning a model from specific concepts or thoughtcrime, I think we can build on two points:

  • Unconditionally, model the natural distribution (thoughtcrime and all)
  • Conditional prefixing to control and limit the contexts where certain concepts can be banned

The anthropomorphic way of explaining it might be "I'm not going to
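
For concreteness, here is one minimal sketch of what the conditional-prefixing alternative could look like (my reading, not the commenter's actual proposal): keep the unconditional model intact, tag training text with the context it came from, and condition generation on a deployment-context tag. The tag strings and the generic `generate` callable are hypothetical stand-ins for whatever model interface is actually used.

```python
# Minimal sketch of conditional prefixing: rather than trying to excise a
# concept from the model, train on the natural distribution but tag each
# example with the context it came from, then condition generation on a
# context tag that pins down which behaviors are acceptable.
from typing import Callable

CONTEXT_TAGS = {
    "fiction": "<|context:fiction|>",      # violent content may appear in-story
    "assistant": "<|context:assistant|>",  # helpful, harm-avoiding persona
}

def tag_training_example(text: str, context: str) -> str:
    """Format an unconditioned training example so the model learns
    p(text | context) rather than having the concept banned outright."""
    return f"{CONTEXT_TAGS[context]}\n{text}"

def conditioned_generate(
    generate: Callable[[str], str], prompt: str, context: str = "assistant"
) -> str:
    """Prepend the deployment-context tag so concepts the model still 'knows'
    are only expressed where that context allows them."""
    return generate(f"{CONTEXT_TAGS[context]}\n{prompt}")

def echo_model(prompt: str) -> str:
    # dummy model that echoes its prompt, so the sketch runs standalone
    return f"[model output for: {prompt!r}]"

print(tag_training_example("The villain brandished a knife.", "fiction"))
print(conditioned_generate(echo_model, "Write a scene involving a knife.", context="fiction"))
print(conditioned_generate(echo_model, "Write a scene involving a knife."))
```

The point of this design is that nothing is removed from the model; the context tag only controls where a concept may be expressed.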

The goal is not to remove concepts or change what the model is capable of thinking about, it's to make a model that never tries to deliberately kill everyone. There's no doubt that it could deliberately kill everyone if it wanted to.

Followed by: What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs), which provides examples of multi-stakeholder/multi-agent interactions leading to extinction events.

Introduction

This post is an overview of a variety of AI research areas in terms of how much I think contributing to and/or learning from those areas might help reduce AI x-risk.  By research areas I mean “AI research topics that already have groups of people working on them and writing up their results”, as opposed to research “directions” in which I’d like to see these areas “move”. 

I formed these views mostly pursuant to writing AI Research Considerations for Human Existential Safety (ARCHES).  My hope is that my assessments in this post can be helpful to students and established AI researchers who are thinking about shifting into...

My quick two-line review is something like: this post (and its sequel) is an artifact from someone with an interesting perspective on the world looking at the whole problem and trying to communicate their practical perspective. I don't really share this perspective, but it is looking at enough of the real things, and differently enough from the other perspectives I hear, that I am personally glad to have engaged with it. +4.

TL;DR: AI which is learning human values may act unethically or be catastrophically dangerous, as it doesn’t yet understand human values.

The main idea is simple: a young AI which is trying to learn human values (which I will call a “value learner”) will have a “chicken and egg” problem. Such AIs must extract human values, but to do it safely, they should know these values, or at least have some safety rules for value extraction. This idea has been analyzed before (more on those analyses below); here, I will examine different ways in which value learners may create troubles.

It may be expected that the process of a young AI learning human values will be akin to a nice conversation or perfect observation, but it could easily take...

I like the main point; I hadn't considered it before with value learning.  Trying to ask myself why I haven't been worried about this sort of failure mode before, I get the following:

It seems all of the harms to humans the value-learner causes are from some direct or indirect interaction with humans, so instead I want to imagine a pre-training step that learns as much about human values as possible from existing sources (internet, books, movies, etc.) without interacting with humans.

Then as a second step, this value-learner is now allowed to interact with humans i... (read more)
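
For concreteness, a bare-bones sketch of the two-step setup described above, with a toy stand-in for the value learner (nothing here is a real value-learning method; the class and helper are placeholders for illustration):

```python
# Toy skeleton of the two-phase idea: learn about human values from static
# sources first, and only afterwards allow any interaction with humans.
# ToyValueLearner and ask_human are placeholders, not real components.
from dataclasses import dataclass, field

@dataclass
class ToyValueLearner:
    # crude stand-in for "beliefs about human values": counts of observed phrases
    value_evidence: dict = field(default_factory=dict)

    def update(self, text: str) -> None:
        for word in text.lower().split():
            self.value_evidence[word] = self.value_evidence.get(word, 0) + 1

learner = ToyValueLearner()

# Phase 1: pre-train on existing human artifacts (internet, books, movies).
# No actions touch humans, so early misunderstandings can't directly harm anyone.
for document in ["be kind to others", "avoid causing unnecessary harm"]:
    learner.update(document)

# Phase 2: only now may the learner interact, starting from the phase-1 prior
# rather than from total ignorance about human values.
def ask_human(question: str) -> str:
    return "yes, people generally value honesty"  # stand-in for a real human reply

learner.update(ask_human("Do people value honesty?"))
print(learner.value_evidence)
```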

2Donald Hobson18hI think that, given good value learning, safety isn't that difficult. I think even a fairly half-hearted attempt at the sort of naive safety measures discussed will probably lead to non-catastrophic outcomes. Tell it about mindcrime from the start. Give it lots of hard disks, and tell it to store anything that might possibly resemble a human mind. It only needs to work well enough with a bunch of MIRI people guiding it and answering its questions. Post-singularity, a superintelligence can see if there are any human minds in the simulations it created when young and dumb. If there are, welcome those minds to the utopia.