Adam Shimi

Half-researcher, half-distiller (see https://distill.pub/2017/research-debt/), both in AI Safety. Funded, with a PhD in theoretical computer science (distributed computing).

If you're interested in some research ideas that you see in my posts, know that I probably have many complete private docs in the process of getting feedback (because for my own work, the AF has proved mostly useless as a source of feedback: https://www.lesswrong.com/posts/rZEiLh5oXoWPYyxWh/adamshimi-s-shortform?commentId=4ZciJDznzGtimvPQQ). I can give you access if you PM me!

Sequences

AI Alignment Unwrapped
Understanding Goal-Directedness
Toying With Goal-Directedness

Comments

adamShimi's Shortform

Right now, the incentives to get useful feedback on my research push me toward the opposite of the policy I would like: publishing on the AF as late as I can.

Ideally, I would want to use the AF as my main source of feedback, as it's public, it's read by more researchers than I know personally, and I feel that publishing there helps the field grow.

But I'm forced to admit that publishing anything on the AF means I can't really send it to people anymore (because the ones I ask for feedback read the AF, so that feels wrong socially), and yet I don't get any valuable feedback 99% of the time. More specifically, I don't get any feedback at all 99% of the time. Whereas when I ask for feedback directly on a gdoc, I always end up with some useful remarks.

I also feel bad that I'm basically using a privileged policy, in the sense that a newcomer cannot use it.

Nonetheless, because I believe in the importance of my research, and I want to know if I'm doing stupid things or not, I'll keep to this policy for the moment: never ever post something on the AF for which I haven't already got all the useful feedback I could ask for.

The Pointers Problem: Human Values Are A Function Of Humans' Latent Variables

In other words, how do we find the corresponding variables? I've given you an argument that the variables in an AGI's world-model which correspond to the ones in your world-model can be found by expressing your concept in english sentences.

But you didn't actually give an argument for that -- you simply stated it. As a matter of fact, I disagree: it seems really easy for an AGI to misunderstand what I mean when I use English words. To go back to the "fusion power generator", maybe it has a very deep model of such generators that abstracts away most of the concrete implementation details to capture the most efficient way of doing fusion; whereas my internal model of "fusion power generators" has a more concrete form and includes safety guidelines.

In general, I don't see why we should expect the abstraction most relevant for the AGI to be the one we're using. Maybe it uses the same words for something quite different, like how successive paradigms in physics use the same word (electricity, gravity) to talk about different things (at least in their connotations and underlying explanations).

(That makes me think that it might be interesting to see how Kuhn's arguments about the incommensurability of paradigms hold up in the context of this problem, as this seems similar.)

Formal Solution to the Inner Alignment Problem

Thanks for sharing this work!

Here's my short summary after reading the slides and scanning the paper.

Because the human demonstrator is safe (in the sense of almost never taking catastrophic actions), a model that imitates the demonstrator closely enough should be safe. The algorithm in this paper does that by keeping multiple models of the demonstrator, sampling the top models according to a parameter, and following what the sampled model does (or querying the demonstrator if the sample is "empty"). The probability that this algorithm takes an action that would be very unlikely for the demonstrator can be bounded from above, and driven down.
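To make sure I've got the mechanism right, here's a rough sketch of that loop as I picture it. To be clear, this is purely my own illustration: the names, the alpha-thresholding, and the handling of the "empty" case are guesses from the slides, not the paper's actual algorithm.

```python
import random

def imitation_step(models, weights, alpha, observation, demonstrator):
    """One step of the scheme as I understand it (illustrative only).
    `weights` is the posterior weight of each candidate model of the
    demonstrator and is assumed to sum to 1."""
    # Keep only the most plausible models: those whose posterior weight is
    # at least an alpha-fraction of the heaviest model's weight.
    top_weight = max(weights)
    top = [(m, w) for m, w in zip(models, weights) if w >= alpha * top_weight]

    # Sample among the top models proportionally to their weights; the
    # leftover probability mass (the discarded models) plays the role of
    # the "empty" sample.
    r = random.uniform(0.0, 1.0)
    for model, w in top:
        if r < w:
            return model.act(observation)   # imitate the sampled model
        r -= w
    return demonstrator.act(observation)    # "empty" sample: query the demonstrator
```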

If this is broadly correct (and if not, please tell me what's wrong), then I feel like this falls short of solving the inner alignment problem. I agree with most of the reasoning regarding imitation learning and its safety when close enough to the demonstrator. But the big issue with imitation learning by itself is that it cannot do much better than the demonstrator. In the event that any other approach to AI can be superhuman, imitation learning would be uncompetitive and there would be a massive incentive to ditch it.

Slide 8 actually points towards a way to use imitation learning to hopefully make a competitive AI: IDA. Yet in this case, I'm not sure that your result implies safety. For IDA isn't a one-shot imitation learning problem; it's many successive imitation learning problems. Even if you limit the drift for one step of imitation learning, the model could drift further and further at each distillation step. (If you think it wouldn't, I'm interested in the argument.)
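To spell out that worry with a toy bound (entirely my own framing and assumptions, not anything claimed in the paper): suppose each distillation step stays within total variation distance ε of its target, and amplification is L-Lipschitz in that distance. Then the only guarantee you get on the end-to-end drift after n rounds is

```latex
d_{TV}\!\left(\pi_n,\ \mathrm{Amp}^{\,n}(\pi_0)\right)
  \;\le\; \epsilon\,\bigl(1 + L + L^2 + \dots + L^{n-1}\bigr),
```

which already grows linearly in n when L = 1, and blows up if amplification magnifies small differences (L > 1). So a per-step bound alone doesn't seem to rule out large drift over a whole IDA run.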

Sorry if this feels a bit rough. Honestly, the result looks exciting in the context of imitation learning, but I feel it is a very bad policy to present research as solving a major AI Alignment problem when it only does so in a very, very limited setting that doesn't feel that relevant to the actual risk.

Almost no mathematical background is required to follow the proofs. We feel our bounds could be made much tighter, and we'd love help investigating that.

This is really misleading for anyone who isn't used to online learning theory. I guess what you mean is that it doesn't rely on more uncommon fields of maths like gauge theory or category theory, but you still use ideas like measures and martingales, which are far from trivial for someone with no mathematical background.

Suggestions of posts on the AF to review

Thanks for the suggestion! It's great to have some methodological posts!

We'll consider it. :)

Suggestions of posts on the AF to review

Thanks for the suggestion!

I didn't know about this post. We'll consider it. :)

Suggestions of posts on the AF to review

Thanks for the suggestion!

We want to go through the different research agendas (and I already knew about yours), as they give different views/paradigms on AI Alignment. Yet I'm not sure how relevant a review of such posts is. In a sense, the "reviewable" part is the actual research that underlies the agenda, right?

Suggestions of posts on the AF to review

I was indeed expecting you to suggest one of your posts. But that's one of the valid reasons I listed, and I didn't know about this one, so it's great!

We'll consider it. :)

Suggestions of posts on the AF to review

But sometimes, you want to be like "come at me bro". You've got something that you're pretty highly confident is right, and you want people to really try to shoot it down (partly as a social mechanism to demonstrate that the idea is in fact as solid and useful as you think it is). This isn't something I'd want to be the default kind of feedback, but I'd like for authors to be able to say "come at me bro" when they're ready for it, and I'd like for posts which survive such a review to be perceived as more epistemically-solid/useful.

Yeah, when I think about implementing a review process for the Alignment Forum, I'm definitely thinking about something you can request for more polished research, in order to get external feedback and a tag saying it has been peer reviewed (for prestige and reference).

Thanks for the suggestions! We'll consider them. :)

Tournesol, YouTube and AI Risk

If the main source of revenue is people buying stuff after seeing an ad on YouTube, then I agree with your point in the middle of the comment, that it seems hardly possible for revenue to go up by 1.5 OOMs from only 2 OOMs on model size. I bet that there would be a big discontinuity here, where you need massive investment to actually see any significant improvement.
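(Just to unpack the orders-of-magnitude arithmetic behind that, in my own words:

```latex
1.5~\text{OOMs of revenue} = 10^{1.5} \approx 32\times, \qquad
2~\text{OOMs of model size} = 10^{2} = 100\times,
```

i.e. the claim is that a roughly 100x bigger model wouldn't, by itself, translate into roughly 30x more ad-driven purchases.)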

On the other hand, if the main source of revenue is money paid for ad views, then I believe a better model could improve that relatively smoothly. In part because just giving people interesting stuff to watch makes them look at more ads.

Tournesol, YouTube and AI Risk

I suspect the best way to think about the polarizing political content thing which is going on right now is something like: The algorithm knows that if it recommends some polarizing political stuff, there's some chance you will head down a rabbit hole and watch a bunch more vids. So in terms of maximizing your expected watch time, recommending polarizing political stuff is a good bet. "Jumping out of the system" and noticing that recommending polarizing videos also tends to polarize society as a whole and gets them to spend more time on Youtube on a macro level seems to require a different sort of reasoning.

I agree, but I think you can have problems (and even Predict-O-Matic-like problems) without reaching that different sort of reasoning. Like, maybe depending on the viewer's history, the best video for polarizing that person is different, and the algorithm could learn that. If you follow that line of reasoning, the system starts to build better and better models of human behavior and of how to influence it, without having to "jump out of the system" as you say.

One could also argue that because YouTube videos contain so much info about the real world, a powerful enough algorithm using them can probably develop a pretty good model of the world. And there's a lot of content on YouTube about YouTube, so it could become "self-aware" in the sense of understanding the system in which it is embedded.

For the stock thing, I think it depends on how the system is scored. When training a supervised machine learning model, we score potential models based on how well they predict past data--data the model itself has no way to affect (except if something really weird is going on?) There doesn't seem to be much incentive to select a model that makes self-fulfilling prophecies. A model which ignores the impact of its "prophecies" will score better, insofar as the prophecy would've affected the outcome.

Agreed, this is more the kind of problem that emerges from RL-like training. The page on the Tournesol wiki about this subject points to this recent paper that proposes a recommendation algorithm tried in practice on YouTube. AFAIK we don't have access to the actual algorithm used by YouTube, so it's hard to say whether it's using RL; but the paper above looks like evidence that it eventually will be.
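To caricature the distinction I have in mind (a toy illustration of my own, not Tournesol's or YouTube's actual training setup):

```python
def supervised_update(model, logged_data):
    # The model is scored only on how well it predicts past, fixed data.
    # Nothing it outputs here can change logged_data, so there is little
    # pressure toward self-fulfilling recommendations.
    loss = sum(model.loss(features, label) for features, label in logged_data)
    model.step(loss)

def rl_update(policy, user, horizon=10):
    # RL-style loop: the policy's own recommendations determine what the
    # user sees next, and the reward (watch time) depends on how the user
    # is changed by those recommendations. Influencing the user is now
    # part of the optimization target.
    total_reward = 0.0
    for _ in range(horizon):
        video = policy.recommend(user.state())
        watch_time, user = user.watch(video)  # the user's state is updated by what they watched
        total_reward += watch_time
    policy.step(total_reward)
```

The first kind of training is the setup described in the comment I'm replying to; the second is the kind of setup where the Predict-O-Matic-style worry seems to bite.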
