Thanks for the post, it's a great idea to have both arguments.
My personal preference would be to have both arguments to be the same length to properly compare the strength of the arguments (skeptic is one paragraph, advocate is 3-6x longer), and not always in the same order skeptic then advocate, but also advocate -> skeptic or even skeptic -> advocate --> skeptic -> ..., so it does not appear like one is the "haven't thought about it much" view.
Right I just googled Marblestone and so you're approaching it with the dopamine side and not the acetylcholine. Without debating about words, their neuroscience paper is still at least trying to model the phasic dopamine signal as some RPE & the prefrontal network as an LSTM (IIRC), which is not acetylcholine based. I haven't read in detail this post & the one linked, I'll comment again when I do, thanks!
Awesome post! I happen to also have tried to distill links between RPE and phasic dopamine in the "Prefrontal Cortex as a Meta-RL System" of this blog.
In particular I reference this paper on DL in the brain & this other one for RL in the brain. Also, I feel like the part 3 about links between RL and neuro of the RL book is a great resource for this.
Funnily enough, I wrote a blog distilling what I learned from reproducing experiments of that 2018 Nature paper, adding some animations and diagrams. I especially look at the two-step task, the Harlow task (the one with monkeys looking at a screen), and also try to explain some brain things (e.g. how DA interacts with the PFN) at the end.
HN comment unsure about the meta-learning generalization claims that OpenAI has a "serious duty [...] to frame their results more carefully"
Having printed and read the full version, this ultra-simplified version was an useful summary.
Happy to read a (not-so-)simplified version (like 20-30 paragraphs).
Does that summarize your comment?
1. Proposals should make superintelligences less likely to fight you by using some conceptual insight true in most cases.
2. With CIRL, this insight is "we want the AI to actively cooperate with humans", so there's real value from it being formalized in a paper.
3. In the counterfactual paper, there's the insight "what if the AI thinks he's not on but still learns".
For the last bit, I have two interpretations:
4.a. However, it's unclear that this design avoids all manipulative behaviour and is completely safe.
4.b. However, it's unclear that adding the counterfactual feature to another design (e.g. CIRL) would make systems overall safer / would actually reduce manipulation incentives.
If I understand you correctly, there are actual insights from counterfactual oracles--the problem is that those might not be insights that would apply to a broad class of Alignment failures, but only to "engineered" cases of boxed oracle AIs (as opposed to CIRL where we might want AIs to be cooperative in general). Was it what you meant?
The zero reward is in the paper. I agree that skipping would solve the problem. From talking to Stuart, my impression is that he thinks that r=0 would be equivalent to skipping for specifying "no learning", or would just slow down learning. My disagreement on that I think it can confuse learning to the point of not learning the right thing.
Why not do a combination of pre-training and online learning, where you do enough during the training phase to get a useful predictor, and then use online learning to deal with subsequent distributional shifts?
Yes, that should work. My quote saying that online learning "won't work and is unsafe" is imprecise. I should have said "if epsilon is small enough to be comparable to the probability of shooting an escape message at random, then it is not safe. Also, if we continue sending the wrong r=0 instead of skipping, then it might not learn the correct thing if ϵ is not big enough".
Although I guess that probably isn't really original either. What seems original is that during any episode where learning will take place, don't let humans (or any other system that might be insecure against the oracle) see the oracle's output until the episode is over.
That's exactly it!
The string is read with probability 1-ϵ
Yes, if we choose the utility function to make it a CDT agent optimizing for the reward for one step (so particular case of act-based) then it won't care about future versions of itself nor want to escape.
I agree with the intuition of shutting down to make it episodic, but I am still confused about the causal relationship between "having the rule to shutdown the system" and "having a current timestep maximizer". For it to really be a "current timestep maximizer" it needs to be in some kind of reward/utility function. Because everything is reset at each timestep, there is no information pointing at "I might get shutdown at the next timestep".
As for the collecting a dataset and then optimizing for some natural direct effect, I am not familiar enough with Pearl's work to tell if that would work, but I made some related comments about why there might be some problems in online-learning/"training then testing" here.