An HN comment, unsure about the meta-learning generalization claims, argues that OpenAI has a "serious duty [...] to frame their results more carefully".
Having printed and read the full version, this ultra-simplified version was a useful summary.
Happy to read a (not-so-)simplified version (like 20-30 paragraphs).
Does that summarize your comment?
1. Proposals should make superintelligences less likely to fight you by using some conceptual insight true in most cases.
2. With CIRL, this insight is "we want the AI to actively cooperate with humans", so there's real value from it being formalized in a paper.
3. In the counterfactual paper, there's the insight "what if the AI thinks it's not on but still learns".
For the last bit, I have two interpretations:
4.a. However, it's unclear that this design avoids all manipulative behaviour and is completely safe.
4.b. However, it's unclear that adding the counterfactual feature to another design (e.g. CIRL) would make systems overall safer / would actually reduce manipulation incentives.
If I understand you correctly, there are actual insights from counterfactual oracles--the problem is that those might not be insights that would apply to a broad class of Alignment failures, but only to "engineered" cases of boxed oracle AIs (as opposed to CIRL where we might want AIs to be cooperative in general). Is that what you meant?
The zero reward is in the paper. I agree that skipping would solve the problem. From talking to Stuart, my impression is that he thinks that r=0 would be equivalent to skipping for specifying "no learning", or would just slow down learning. My disagreement is that I think it can confuse learning to the point of not learning the right thing.
Why not do a combination of pre-training and online learning, where you do enough during the training phase to get a useful predictor, and then use online learning to deal with subsequent distributional shifts?
Yes, that should work. My quote saying that online learning "won't work and is unsafe" is imprecise. I should have said "if epsilon is small enough to be comparable to the probability of shooting an escape message at random, then it is not safe. Also, if we continue sending the wrong r=0 instead of skipping, then it might not learn the correct thing if ϵ is not big enough".
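To put rough numbers on the "ϵ comparable to the escape probability" worry, here is a minimal sketch under strong simplifying assumptions that are mine and not the paper's: an untrained oracle emits an escape message with a fixed probability `p_escape` per episode, independently of everything else; the output is erased (and learning happens) with probability `epsilon`, and is read by humans otherwise. The function name and its parameters are hypothetical.

```python
def prob_escape_before_first_update(epsilon: float, p_escape: float) -> float:
    """Chance that humans read an escape message before the first learning episode.

    Assumed toy model (not from the paper): each episode is independently
    - a learning episode with probability epsilon (output erased, reward given),
    - a read escape message with probability (1 - epsilon) * p_escape,
    - otherwise uneventful.
    """
    learn = epsilon
    escape = (1.0 - epsilon) * p_escape
    if learn + escape == 0.0:
        raise ValueError("neither event can ever happen")
    # Race between two per-episode events: the escape comes first
    # with probability escape / (escape + learn).
    return escape / (escape + learn)


if __name__ == "__main__":
    for eps in (0.5, 0.1, 0.01, 0.001):
        print(eps, round(prob_escape_before_first_update(eps, p_escape=0.01), 4))
```

Under these assumptions the risk is small when `epsilon` is much larger than `p_escape`, is close to one half when the two are equal and small, and keeps growing as `epsilon` shrinks further.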
Although I guess that probably isn't really original either. What seems original is that during any episode where learning will take place, don't let humans (or any other system that might be insecure against the oracle) see the oracle's output until the episode is over.
That's exactly it!
The string is read with probability 1-ϵ
Yes, if we choose the utility function to make it a CDT agent optimizing for the reward for one step (so a particular case of act-based) then it won't care about future versions of itself nor want to escape.
I agree with the intuition of shutting down to make it episodic, but I am still confused about the causal relationship between "having the rule to shut down the system" and "having a current timestep maximizer". For it to really be a "current timestep maximizer" it needs to be in some kind of reward/utility function. Because everything is reset at each timestep, there is no information pointing at "I might get shut down at the next timestep".
As for collecting a dataset and then optimizing for some natural direct effect, I am not familiar enough with Pearl's work to tell if that would work, but I made some related comments about why there might be some problems in online-learning/"training then testing" here.
The Asymptotically Unambitious AGI thread helped me clarify my thoughts, thanks. I agree that an optimal CDT agent won't think about future versions, and I don't see any optimization pressure towards escape messages or disproportionately common "escape message" regions.
However, it still assumes we have access to this magic oracle that optimizes for R′ = R·I_E, where E is the event where humans don't see the answer, I_E is its indicator function, and R is the counterfactual reward (given by the automatic machine). If humans were able to build an oracle performing optimally (w.r.t. R′) from day 1, then humans would be able to specify some kind of "god oracle". The rest of the design seems to be just "how to interact with a god oracle so that humans are not influenced by the answers".
In practice, you'll want something that is able to learn from its (question, prediction, reward) history. That's why there is this automatic machine rewarding the oracle with some probability ϵ. In an online learning setting, most of the time the model gets r=0 (probability 1−ϵ), and it sometimes gets some useful feedback r>0 (probability ϵ). Therefore, if ϵ is small enough (comparable to the probability of hitting an escape message at random), then the learning gets extremely slow and the oracle might shoot at the escape action at random. Even worse, the (question, prediction, r=0) datapoints in the history could get so misleading that the oracle never learns anything.
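A minimal sketch of how logging r=0 on non-erasure episodes can mislead a learner, compared with skipping those episodes. The assumption (mine, not the paper's) is a learner that simply averages the logged rewards for a single fixed (question, prediction) pair whose counterfactual reward is `true_reward`; all names are hypothetical.

```python
import random


def estimate_reward(true_reward: float, epsilon: float, episodes: int,
                    skip_unrewarded: bool, seed: int = 0) -> float:
    """Average logged reward for one fixed (question, prediction) pair.

    Toy assumptions (not from the paper): with probability epsilon the
    automatic machine scores the prediction and logs true_reward; otherwise
    the episode is either logged with r=0 or skipped entirely.
    """
    rng = random.Random(seed)
    logged = []
    for _ in range(episodes):
        if rng.random() < epsilon:
            logged.append(true_reward)   # erasure episode: real feedback
        elif not skip_unrewarded:
            logged.append(0.0)           # non-erasure episode logged as r=0
        # else: skipped, nothing is added to the history
    return sum(logged) / len(logged) if logged else float("nan")


if __name__ == "__main__":
    for skip in (False, True):
        est = estimate_reward(true_reward=1.0, epsilon=0.05,
                              episodes=10_000, skip_unrewarded=skip)
        print("skip" if skip else "r=0 ", est)
```

With r=0 logged, the average shrinks toward `epsilon * true_reward`, so the gap between a good and a bad prediction shrinks by roughly a factor of `epsilon` and is easier to drown in noise. With skipping, the estimate stays at `true_reward`, but only about an `epsilon` fraction of episodes contribute, which matches the "slower learning" reading.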
Now, if we're not in an online learning process but instead there is a separation between a "training phase" and a "deployment phase where the AI continues to learn with probability ϵ", then the setup is just "have a model that learns to do the useful stuff in sandbox, and then have the weights (almost) fixed in deployment".
In short, I think the CDT setup without machine learning assumes the problem is already solved, and that online learning won't work and is unsafe, which leaves us with a "training then deployment" setup that isn't really original.
Reply: The button is a communication link between the operator and the agent. In general, it is possible to construct an agent that shuts down even though it has received no such message from its operators as well as an agent that does get a shutdown message, but does not shut down. Shutdown is a state dependent on actions, and not a communication link.
This is very clear. Communication link made me understand that it didn't have a direct physical effect on the agent. If you want to make it even more intuitive you could do a diagram, but this explanation is already great!
Thanks for updating the rest of the post and trying to make it more clear!
1. I don't understand what you mean by "state" in "Suppose, however, that the AI lacked any capacity to press its shutdown button, or to indirectly control its state". Do you include its utility function in its state? Or just the observations it receives from the environment? What context/framework are you using?
2. Could you define U_S and U_N? From the Corrigibility paper, U_S appears to be a utility function favoring shutdown, and U_N is a potentially flawed utility function, a first stab at specifying their own goals. Is that what you meant? I think it would be useful to define them in the introduction.
3. I don't understand how an agent that "[lacks] any capacity to press its shutdown button" could have any shutdown ability. It seems like a contradiction, unless you mean "any capacity to directly press its shutdown button".
4. What's the "default value function" and the "normal utility function" in "Optimisation incentive"? Is it clearly defined in the literature?
5. "Worse still... for any action..." -> if you choose b as some action with bad corrigibility property, it seems reasonable that it can be better than most actions on v_N + v_S (for instance if b is the argmax). I don't see how that's a "worse still" scenario, it seems plausible and normal.
6. "From this reasoning, we conclude" -> are you inferring things from some hypothetical b that would satisfy all the things you mention? If that's the case, I would need an example to see that it's indeed possible. Even better would be a proof that you can always find such b.
7. "it is clear that we could in theory find a θ" -> could you expand on this?
8. "Given the robust optimisation incentive property, it is clear that the agent may score very poorly on U_N in certain environments." -> again, can you expand on why it's clear?
9. In the appendix, in your four-line inequality, do you assume that U_N(a_s) is non-negative (from line 2 to 3)? If yes, why?
Hey Abram (and the MIRI research team)!
This post resonates with me on so many levels. I vividly remember the Human-Aligned AI Summer School where you used to be a "receiver" and Vlad was a "transmitter", when talking about "optimizers". Your "document" especially resonates with my experience running an AI Safety Meetup (Paris AI Safety).
In January 2019, I organized a Meetup about "Deep RL from human preferences". Essentially, the resources were ordered by difficulty, so you could discuss the 80k podcast, the OpenAI blog post, the original paper or even a recent relevant paper. Even if the participants were "familiar" with RL (because they had gotten used to seeing "RL" written in blogs or hearing people say "RL" in podcasts), none of them could explain to me the core structure of an RL setting (i.e. that an RL problem would need at least an environment, actions, etc.).
The boys were getting hungry (Abram is right, $10 of chips is not enough for 4 hungry men between 7 and 9pm) when, in the middle of a monologue ("in RL, you have so-and-so, and then it goes like so on and so forth..."), I suddenly realized that I was talking to more-than-qualified attendees (I was lucky to have a PhD candidate in economics, a teenager who used to do international olympiads in informatics (IOI) and a CS PhD) who lacked the necessary RL procedural knowledge to ask non-trivial questions about "Deep RL from human preferences".
That's when I decided to change the logistics of the Meetup to something much closer to what is described in "You and your research". I started thinking about what they would be interested in knowing. So I started telling the brilliant IOI kid about this MIRI summer program, how I applied last year, etc. One thing led to another, and I ended up asking what Tsvi had asked me one year ago for the AISFP interview:
If one of you was the only Alignment researcher left on Earth, and it was forbidden to convince other people to work on AI Safety research, what would you do?
That got everyone excited. The IOI boy took the black marker, and started to do math to the question, as a transmitter: "So, there is a probability p_0 that AI Researchers will solve the problem without me, and p_1 that my contribution will be neg-utility, so if we assume this and that, we get so-and-so."
The moment I asked questions I was truly curious about, the Meetup went from a polite gathering to the most interesting discussion of 2019.
Abram, if I were in charge of all agents in the reference class "organizer of Alignment-related events", I would tell instances of that class with my specific characteristics two things:
1. Come back to this document before and after every Meetup.
2. Please write below (in this thread or in the comments) which experience of running an Alignment think tank resonates the most with the above "document".