I've been wanting to try SuperMemo for a while, especially given the difficulty you mention with making Anki cards. But it doesn't run natively on Linux AFAIK, and I can't be bothered for the moment to make it work using Wine.
As outlined in the last paragraph of the post, I want to convince people that TDT-like decision theories won't give a "neat" game theory, by giving an example where they're even less neat than classical game theory.
Hum, then I'm not sure I understand in what way classical game theory is neater here?
I think you're thinking about a realistic case (same algorithm, similar environment) rather than the perfect symmetry used in the argument. A communication channel is of no use there because you could just ask yourself what you would send, if you had one, and the copy would send exactly the same thing.
Well, if I understand the post correctly, you're saying that these two problems are fundamentally the same problem, and so rationality should be able to solve them both if it can solve one. I disagree with that, because from the perspective of distributed computing (which I'm used to), these two problems are exactly the two kinds of problems that are fundamentally distinct in a distributed setting: agreement and symmetry-breaking.
Communication won't make a difference if you're playing with a copy.
Actually it could. Basically all of distributed computing as... (read more)
I don't see how the two problems are the same. They are basically the agreement and symmetry breaking problems of distributed computing, and those two are not equivalent in all models. What you're saying is simply that in the no-communication model (where the same algorithm is used on two processes that can't communicate), these two problems are not equivalent. But they are asking for fundamentally different properties, and are not equivalent in many models that actually allow communication.
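To make the distinction concrete, here's a toy sketch of my own (not from the post); the `process` function and the two-option setup are purely illustrative:

```python
# Toy model: two exact copies of the same deterministic algorithm, same input,
# no communication channel between them.
def process(observation: str) -> int:
    # Any deterministic rule works here; this one just picks option 0 or 1.
    return len(observation) % 2

obs = "shared world state"
a, b = process(obs), process(obs)   # the two copies run independently

# Agreement is trivial: every run satisfies a == b.
assert a == b

# Symmetry-breaking ("one of you go left, the other go right") is impossible:
# there is no run where a != b. You need randomness, distinct identities,
# or communication to get that.
```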
I feel like doing a better job of motivating why we should care about this specific problem might help get you more feedback.
If we want to alter a decision theory to learn its set of inputs and outputs, your proposal makes sense to me at first glance. But I'm not sure why I should particularly care, or why there is even a problem needing a solution to begin with. The link you provide doesn't help me much after skimming it, and I (and I assume many people) almost never read something that requires me to read other posts without even a summary of the references. I mad... (read more)
This project looks great! I especially like the focus on a more experimental kind of research, while staying focused on and informed by the specific concepts you want to investigate.
If you need some feedback on this work, don't hesitate to send me a message. ;)
Oh, right, that makes a lot of sense.
So is the general idea that we quantilize such that, in expectation, we're choosing an action that doesn't have corrupted utility (intuitively, by having something like more than twice as many actions in the quantilization as we expect to be corrupted), so that we guarantee that the probability of following the manipulation of the learned user report is small?
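Here's a minimal sketch of how I picture the bound (the sampling scheme, function names, and numbers are mine, not from the post): if at most a fraction delta of the user policy's probability mass leads to corruption, and we quantilize over the top fraction q of actions sampled from that policy, the chance of picking a corrupted action is at most delta/q.

```python
import random

def quantilize(sampled_actions, reported_success_prob, q=0.5):
    """Sample uniformly from the top-q fraction of actions (sampled from the
    user policy), ranked by the user's reported short-term success probability."""
    ranked = sorted(sampled_actions, key=reported_success_prob, reverse=True)
    top = ranked[: max(1, int(q * len(ranked)))]
    return random.choice(top)

# Made-up numbers: if at most delta = 0.05 of the user-policy mass is corrupted
# and we keep the top q = 0.5 of actions, the chance of picking a corrupted
# action is bounded by delta / q = 0.1.
delta, q = 0.05, 0.5
print("corruption bound:", delta / q)
```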
I also wonder if using the user policy to sample actions isn't limiting, because then we can only take actions that the user would take. Or do you assume by default that the support of the user policy is the full action space, so every action is possible for the AI?
About the update
You're right, that's what would happen with an update.
I think the model I have in mind (although I hadn't explicitly thought about it until now) is something like a distribution over ways to reach TAI (capturing how probable it is that each is the first way to reach AGI), where each option comes with its own distribution (let's say over years). Obviously you can compress that into a single distribution over years, but then you lose the ability to do fine-grained updating.
For example, I imagine that someone with relatively low probabili... (read more)
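A toy sketch of what I mean, with made-up pathways and numbers: keeping one timeline distribution per pathway lets you update a single pathway and then recompute the overall timeline, which the compressed single distribution over years doesn't let you do.

```python
import numpy as np

years = np.arange(2025, 2061)

def bump(center, width):
    d = np.exp(-((years - center) ** 2) / (2 * width ** 2))
    return d / d.sum()

# Weight = probability this pathway is the first to reach TAI (made-up numbers).
pathways = {"prosaic scaling": (0.4, bump(2032, 5)),
            "neuro-inspired":  (0.3, bump(2045, 8)),
            "other":           (0.3, bump(2050, 10))}

def overall(pathways):
    mix = sum(w * d for w, d in pathways.values())
    return mix / mix.sum()

before = overall(pathways)

# Fine-grained update: new evidence only bears on the prosaic pathway,
# so double its weight and renormalize the pathway weights.
pathways["prosaic scaling"] = (0.4 * 2, pathways["prosaic scaling"][1])
z = sum(w for w, _ in pathways.values())
pathways = {k: (w / z, d) for k, (w, d) in pathways.items()}

after = overall(pathways)  # 'before' alone would not tell you how to get here
```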
Let me try to make an analogy with your argument.
Say we want to make X. What you're saying is "with 10^12 dollars, we could do it that way". Why on earth would I update at all on whether it can be done with 10^6 dollars? If your scenario works with that amount, then you should have described it using only that much money. If it doesn't, then you're not providing evidence for the cheaper case.
Similarly here, if someone starts with a low credence on prosaic AGI, I can see how your arguments would make them put a bunch of probability mass close to +10^12 compute... (read more)
I'm not sure, but I think that's not how updating works? If you have a bunch of hypotheses (e.g. "It'll take 1 more OOM," "It'll take 2 more OOMs," etc.) and you learn that some of them are false or unlikely (e.g. only a 10% chance of it taking more than 12 OOMs), then you should redistribute the mass over all your remaining hypotheses, preserving their relative strengths. And yes I have the same intuition about analogical arguments too. For example, let's say you overhear me talking about a bridge being built near my h... (read more)
To put it another way: I don't actually believe we will get to +12 OOMs of compute, or anywhere close, anytime soon. Instead, I think that if we had +12 OOMs, we would very likely get TAI very quickly, and then I infer from that fact that the probability of getting TAI in the next 6 OOMs is higher than it would otherwise be (if I thought that +12 OOMs probably wasn't enough, then my credence in the next 6 OOMs would be correspondingly lower).
To some extent this reply also partly addresses the concerns you raised about memory and bandwidth--I
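To make the updating mechanics in the quoted reply concrete, here's a minimal worked example (the hypotheses and numbers are mine):

```python
# Minimal worked example of "redistribute mass, preserving relative strengths"
# (made-up numbers). Hypotheses: how many more OOMs of compute TAI needs.
prior = {"+1 to +6 OOMs": 0.30, "+7 to +12 OOMs": 0.30, "more than +12 OOMs": 0.40}

# New belief: only 10% chance that more than +12 OOMs is needed.
posterior = dict(prior)
posterior["more than +12 OOMs"] = 0.10

# The freed-up mass goes to the remaining hypotheses in proportion
# to their prior weights (here 0.30 : 0.30, i.e. split evenly).
remaining = {k: v for k, v in prior.items() if k != "more than +12 OOMs"}
scale = (1 - 0.10) / sum(remaining.values())
for k in remaining:
    posterior[k] = prior[k] * scale

print(posterior)  # {'+1 to +6 OOMs': 0.45, '+7 to +12 OOMs': 0.45, 'more than +12 OOMs': 0.1}
```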
However, it can do much better than that, by short-term quantilizing w.r.t. the user's reported success probability (with the user's policy serving as baseline). When quantilizing the short-term policy, we can upper bound the probability of corruption via the user's reported probability of short-term failure (which we assume to be low, i.e. we assume the malign AI is not imminent). This allows the AI to find parameters under which quantilization is guaranteed to improve things in expectation.
I don't understand what you mean here by quantilizing. The meanin... (read more)
Glad to be helpful!
I go into more detail in my answer to Alex, but what I want to say here is that I don't feel like you use the power-scarcity idea enough in the post itself. As you said, it's one of three final notes, without any particular emphasis on it.
So while I agree that power-scarcity is an important research question, it would be helpful IMO if this post put more emphasis on that connection.
Thanks for the detailed reply!
I want to go a bit deeper into the fine points, but my general reaction is "I wanted that in the post". You make a pretty good case for a way to arrive at this definition that makes it particularly exciting. On the other hand, I don't think that stating a definition and proving a single theorem that has an "obvious" quality (whether or not it is actually obvious, mind you) is that convincing.
The best way to describe my interpretation is that I feel that you two went for the "scientific paper" style, but the current state... (read more)
Ok, that's fair. It's hard to know which notation is common knowledge, but I think that adding a sentence explaining this one will help readers who haven't studied game theory formally.
Maybe making all vector profiles bold (like for the action profile) would help to see at a glance the type of the parameter. If I had seen it was a strategy profile, I would have inferred immediately what it meant.
Exciting to see new people tackling AI Alignment research questions! (And I'm already excited by what Alex is doing, so having more people working on his kind of research feels like a good thing.)
That being said, I'm a bit underwhelmed by this post. Not that I think the work is wrong, but it looks like it boils down to saying (with a clean formal shape) things that I personally find pretty obvious: playing better at a zero-sum (or constant-sum) game means that the other players have less margin to get what they want. I don't feel that either the formalizatio... (read more)
Thanks so much for your comment! I'm going to speak for myself here, and not for Jacob.
That being said, I'm a bit underwhelmed by this post. Not that I think the work is wrong, but it looks like it boils down to saying (with a clean formal shape) things that I personally find pretty obvious: playing better at a zero-sum (or constant-sum) game means that the other players have less margin to get what they want. I don't feel that either the formalization of power or the theorem brings me any new insight, and so I have trouble getting interested. Maybe I'm just
Just wanted to say that this comment made me add a lot of things on my reading list, so thanks for that (but I'm clearly not well-read enough to go into the discussion).
Thanks for writing this! I'm quite excited to learn more about your meta-agenda and your research process, and reading this got me thinking about my own research process.
But it feels to me like egregious misalignment is an extreme and somewhat strange failure mode and it should be possible to avoid it regardless of how the empirical facts shake out.
So you don't think that we could have a result of the sort "with these empirical facts, egregious misalignment is either certain or very hard to defend against, and so we should push towards not building AIs th... (read more)
To people reading this thread: we had a private conversation with John (faster and easier), which resulted in me agreeing with you.
The summary is that you can see the arguments made and constraints invoked as a set of equations, such that the adequate formalization is a solution of this set. But if the set has more than one solution (maybe a lot), then it's misleading to call any of them the solution.
So I've been working these last few days on arguing for the properties (generalization, explainability, efficiency) in such a way that the corresponding set of equations has only one solution.
Thanks for the feedback!
Who? It would be helpful to have some links so I can go read what they said.
That was one of my big frustrations when writing this post: I only saw this topic pop up in personal conversation, not really in published posts. And so I didn't want to give names of people who just discussed that with me on a zoom call or in a chat. But I totally feel you -- I'm always annoyed by posts that pretend to answer a criticism without pointing to it.
On this more complicated (but IMO more accurate) model, your post is itself an attempt to make AI
If we do only one, which one do you think matters the most?
Thanks for commenting on your reaction to this post!
That being said, I'm a bit confused by your comment. You seem to write off approaches which attempt to provide a computational model of mind, but my approach is literally the opposite: looking only at the behavior (but all of the behavior) and extracting relevant statistics to study questions related to goal-directedness.
Can you maybe give more details?
Thanks for the spot-on pushback!
I do understand what a sufficient statistic is -- which probably means I'm even more guilty of what you're accusing me of. And I agree completely that I don't properly defend the claim that the statistics I provide are really sufficient.
If I try to explain myself, what I want to say in this post is probably something like
I still feel like you're missing something important here.
For instance... in the explainability factor, you measure "the average deviation of π from the actions favored by the action-value function q_μ of μ", using a specific formula. But why this particular formula? Why not take the log of q_μ first, or use 3 + max_a q_μ(s_t, a) in the denominator? Indeed, there's a strong argument to be made this formula is a bad choice: the value function q_μ is... (read more)
Nice post! Surprisingly, I'm interested in the topic. ^^
Funny too that you focus on an idea I am writing a post about (albeit from a different angle). I think I broadly agree with your conjectures, for sufficient competence and generalization at least.
Most discussion about goal-directed behavior has focused on a behavioral understanding, which can roughly be described as using the intentional stance to predict behavior.
I'm not sure I agree with that. Our lit review shows that there are both behavioral and mechanistic approaches (Richard's goal-directed age... (read more)
Thanks for the nice review! It's great to have the reading of someone who understands the current state of neuroscience well enough to point out aspects of the book at odds with the neuroscience consensus. My big takeaway is that I should look a bit more into neuroscience-based approaches to AGI, because they might be important and require different alignment approaches.
On a more rhetorical level, I'm impressed by how you manage to make me ask a question (okay, but what evidence is there for this uniformity of the neocortex?) and then point to some previous work you... (read more)
Thanks for the very in-depth case you're making! I especially liked the parts about the objections, and your take on some AI Alignment researchers' opinions of this proposal.
Personally, I'm enthusiastic about it with caveats expanded below. If I try to interpret your proposal according to the lines of my recent epistemological framing of AI Alignment research, you're pushing for a specific kind of work on the Solving part of the field, where you assume a definition of the terms of the problem (what AIs will we build and what do we want). My caveats can be ... (read more)
Well, Paul's original post presents HCH as the specification of a human's enlightened judgement.
For now, I think that HCH is our best way to precisely specify “a human’s enlightened judgment.” It’s got plenty of problems, but for now I don’t know anything better.
And if we follow the links to Paul's previous post about this concept, he does describe his ideal implementation of considered judgement (what would become HCH) using the intuition of thinking for a decent amount of time.
To define my considered judgment about a question Q, suppose I am told Q and spend
Welcome to the (for now) small family of people funded by Beth! Your research looks pretty cool, and I'm quite excited to see how different it is from mine. So Beth is funding quite a wide range of researchers, which is what makes the most sense to me. :)
Thanks for telling me! I've changed that.
It might be because I copied and pasted the first sentence to each subsection.
Thanks for taking the time to give feedback!
Technical comment on the above post
So if I understand this correctly, then expl_g is a metric of goal-directedness. However, I am somewhat puzzled because expl_g only measures directedness towards the single goal g.
But to get close to the concept of goal-directedness introduced by Rohin, don't you then need to do an operation over all possible values of g?
That's not what I had in mind, but it's probably on me for not explaining it clearly enough.
Thanks for the idea! I agree that it probably helps, and it solves my issue with the state of knowledge of the other.
That being said, I don't feel like this solves my main problem: it still feels to me like pushing too hard. Here the reason is that I post on a small venue (rarely more than a few posts per day) that I know the people I'm asking for feedback read regularly. So if I send them such a message the moment I publish, it feels a bit like I'm saying that they wouldn't read and comment on it without that, which is a bit of a problem.
(I'm interested to k... (read more)
curious for more detail on “what feels wrong about explicitly asking individuals for feedback after posting on AF” similar to how you might ask for feedback on a gDoc?
My main reason is Steve's first point:
Maybe there's a sense in which everyone has already implicitly declared that they don't want to give feedback, because they could have if they wanted to, so it feels like more of an imposition.
Asking someone for feedback on work posted somewhere I know they read feels like I'm whining about not having feedback (and maybe whining about them not giving me f... (read more)
Right now, the incentives to get useful feedback on my research push me toward the opposite policy from the one I would like: publishing on the AF as late as I possibly can.
Ideally, I would want to use the AF as my main source of feedback, as it's public, is read by more researchers than I know personally, and I feel that publishing there helps the field grow.
But I'm forced to admit that publishing anything on the AF means I can't really send it to people anymore (because the ones I ask for feedback read the AF, so that feels wrong socially), and yet I don't get a... (read more)
I think there are a number of features LW could build to improve this situation, but first I'm curious for more detail on "what feels wrong about explicitly asking individuals for feedback after posting on AF" similar to how you might ask for feedback on a gDoc?
In other words, how do we find the corresponding variables? I've given you an argument that the variables in an AGI's world-model which correspond to the ones in your world-model can be found by expressing your concept in english sentences.
But you didn't actually give an argument for that -- you simply stated it. As a matter of fact, I disagree: it seems really easy for an AGI to misunderstand what I mean when I use english words. To go back to the "fusion power generator", maybe it has a very deep model of such generators that abstracts away most of the c... (read more)
Thanks for sharing this work!
Here's my short summary after reading the slides and scanning the paper.
Because human demonstrators are safe (in the sense of almost never taking catastrophic actions), a model that imitates the demonstrator closely enough should be safe. The algorithm in this paper does that by keeping multiple models of the demonstrator, sampling the top models according to a parameter, and following what the sampled model does (or querying the demonstrator if the sample is "empty"). The probability that this algorithm takes a very unlikely acti
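For my own understanding, here's a very rough pseudocode picture of that loop (the names, the "None means empty" convention, and the selection rule are mine; the paper's actual algorithm differs in the details):

```python
import random

def act(models, weights, alpha, demonstrator):
    """Rough sketch: keep the highest-weight demonstrator models, sample one of
    them, and defer to the real demonstrator when the sample is 'empty' (None)."""
    ranked = sorted(zip(models, weights), key=lambda mw: mw[1], reverse=True)

    # Keep top models until their cumulative weight reaches the threshold alpha.
    top, mass = [], 0.0
    for model, w in ranked:
        top.append((model, w))
        mass += w
        if mass >= alpha:
            break

    model, _ = random.choices(top, weights=[w for _, w in top], k=1)[0]
    if model is None:                 # the "empty" sample
        return demonstrator.query()   # ask the human demonstrator for an action
    return model.act()
```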
Thanks for the suggestion! It's great to have some methodological posts!
We'll consider it. :)
Thanks for the suggestion!
I didn't know about this post. We'll consider it. :)
We want to go through the different research agendas (and I already knew about yours), as they give different views/paradigms on AI Alignment. Yet I'm not sure how relevant a review of such posts is. In a sense, the "reviewable" part is the actual research that underlies the agenda, right?
I was indeed expecting you to suggest one of your posts. But that's one of the valid reasons I listed, and I didn't know about this one, so it's great!
But sometimes, you want to be like "come at me bro". You've got something that you're pretty highly confident is right, and you want people to really try to shoot it down (partly as a social mechanism to demonstrate that the idea is in fact as solid and useful as you think it is). This isn't something I'd want to be the default kind of feedback, but I'd like for authors to be able to say "come at me bro" when they're ready for it, and I'd like for posts which survive such a review to be perceived as more epistemically-solid/useful.
Yeah, when I think about ... (read more)
If the main source of revenue is people buying stuff after seeing an ad on YouTube, then I agree with your point in the middle of the comment: it seems hardly possible for the revenue to grow by 1.5 OOMs from only 2 OOMs more on model size. I bet there would be a big discontinuity here, where you need massive investment to actually see any significant improvement.
On the other hand, if the main source of revenue is money paid per ad view, then I believe a better model could improve that relatively smoothly, in part because just giving people interesting stuff to watch makes them look at more ads.
I suspect the best way to think about the polarizing political content thing which is going on right now is something like: The algorithm knows that if it recommends some polarizing political stuff, there's some chance you will head down a rabbit hole and watch a bunch more vids. So in terms of maximizing your expected watch time, recommending polarizing political stuff is a good bet. "Jumping out of the system" and noticing that recommending polarizing videos also tends to polarize society as a whole and gets them to spend more time on Youtube on a macro
Thanks for the feedback.
Your argument as I understand it is: the economic incentive to make the model bigger might disappear if the cost of computing the recommendation outweighs the gain of having "better" recommendations.
I think this is definitely relevant, but I don't feel like I have enough information to decide if the argument holds or not. Notably, it goes back to the parameter that we discussed in a call: whether increasing the model size/compute/dataset size improves the performance for the real world task until AGI is reached, or whether the... (read more)
I think it's a good summary. Thanks!
This looked exciting when you mentioned it, and it doesn't disappoint.
To check that I get it, here is my own summary:
Because ML looks like the most promising approach to AGI at the moment, we should adapt and/or instantiate the classical arguments for AI risks to an ML context. The main differences are the separation of a training and a deployment phase, and the form taken by the objective function (a mix of human and automated feedback from data instead of a hardcoded function).
(Orthogonality thesis) Even if any combination of goal and intelligence can exist in
I'm not Alex, but here's my two cents.
I think your point 2 is far less obvious to me, especially without a clear-cut answer on the correctness of the strategy-stealing assumption. I agree that we might optimize the wrong goals, but I don't see why we would necessarily optimize some more than others. So each goal in S might have a spike (for a natural set of goals that are all similarly difficult to specify), and the resulting landscape would be flat.
That being said, I think you're pointing towards an interesting fact about the original post: in it,... (read more)
This post gives two distinct (but related) "pieces of knowledge".
Thanks for both your careful response and the pointer to Conceptual Engineering!
I believe I am usually thinking in terms of defining properties for their use, but it's important to keep that in mind. The post on Conceptual Engineering led me to this follow-up interview, which contains a great formulation of my position:
Livengood: Yes. The best example I can give is work by Joseph Halpern, a computer scientist at Cornell. He's got a couple really interesting books, one on knowledge one on causation, and big parts of what he's doing are informed by the long
I think the remaining disagreement is whether we should first find a definition of goal-directedness and then study how it appears through training (my position), or whether we should instead define goal-directedness according to the kinds of training processes that generate similar properties and risks (what I take to be your position).
Does that make sense to you?
Thanks for the inclusion in the newsletter and the opinion! (And sorry for taking so long to answer)
This literature review on goal-directedness identifies five different properties that should be true for a system to be described as goal-directed:
It's implicit, but I think it should be made explicit that the properties/tests are what we extract from the literature, not what we claim is fundamental. More specifically, we don't say they should be true per se; we just extract and articulate them to "force" a discussion of them when defining goal-directedness.