All of adamShimi's Comments + Replies

Thanks for your comment!

Actually, I don't think we really disagree. I might have just not made my position very clear in the original post.

The point of the post is not to say that these activities are not often valuable, but instead to point out that they can easily turn into "To do science, I need to always do [activity]". And what I'm getting from the examples is that in some cases, you actually don't need to do [activity]. There's a shortcut, or maybe you're just in a different phase of the problem.

Do you think there is still a disagreement after this clarification?

Linda Linsefors (2d):
I think we are in agreement. I think the confusion is because it is not clear from that section of the post whether you are saying 1) "you don't need to do all of these things" or 2) "you don't need to do any of these things". Because I think 1 goes without saying, I assumed you were saying 2. Also, 2 is probably true in rare cases, but this is not backed up by your examples.

But if 1 doesn't go without saying, then this means that a lot of "doing science" is cargo-culting? Which is sort of what you are saying when you talk about cached methodologies. So why would smart, curious, truth-seeking individuals use cached methodologies? Do I do this?

Some self-reflection: I did some of this as a PhD student, because I was new, and it was a way to hit the ground running. So I did some science using the method my supervisor told me to use, while simultaneously working to understand the reason behind this method. I did spend less time than I would have wanted to understand all the assumptions of the sub-sub-field of physics I was working in, because of the pressure to keep publishing and because I got carried away by various fun math I could do if I just accepted these assumptions.

After my PhD I felt that if I was going to stay in physics, I wanted to take a year or two for just learning, to actually understand Loop Quantum Gravity and all the other competing theories, but that's not how academia works unfortunately, which is one of the reasons I left. I think that the foundation of good epistemics is to not have competing incentives.

In a limited context, the first example that comes to mind is high performers in competitive sports and games. Because if they truly only give a shit about winning (and the best generally do), they will throw away their legacy approaches when they find a new one, however much it pains them.

Thanks for the kind words!

I'm not aware of any such statistics, but I'm guessing that MATS organizers might have some.

I interpret Alex as arguing that there are not just two difficulties versus one, but an additional difficulty. From this perspective, having two will be more of an issue than having one, because you have to address strictly more things.

This makes me wonder, though, whether there is not just some sort of direction question underlying the debate here. Because if you assume the "difficulties" are only positive numbers, then if the difficulty for direct instillation is d_instillation and the one for grader optimization is […]
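(A minimal rendering of the comparison this comment seems to be gesturing at, written with the d_instillation and d_evaluation names from Rohin's reply below; the original inline formula did not survive extraction, so treat this as my reconstruction under the stated assumption rather than the comment's own equation.)

```latex
% Assuming the difficulties are non-negative and simply add up:
\[
d_{\text{instillation}} \;\le\; d_{\text{instillation}} + d_{\text{evaluation}},
\qquad d_{\text{evaluation}} \ge 0,
\]
% so grader optimization has strictly more to address than direct instillation
% whenever the evaluation difficulty is non-zero, which is the "two vs one" point above.
```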

Rohin Shah (4d):
Two responses:

1. Grader-optimization has the benefit that you don't have to specify what values you care about in advance. This is a difficulty faced by value-executors but not by grader-optimizers.
2. Part of my point is that the machinery you need to solve evaluation-problems is also needed to solve instillation-problems, because fundamentally they are shadows of the same problem, so I'd estimate d_evaluation at close to 0 in your equations after you have dealt with d_instillation.

Thanks for taking time to answer my questions in detail!

About your example for other failure modes

Is it meant to point at the ability of the actor to make the plan more confusing/harder to evaluate? Meaning that you're pointing at the ability of the actor to "obfuscate" its plan in order to get high reward?

If so, it's not clear to me why this is valuable for the actor to do. How is it supposed to get a better reward from confusion alone? If it has another agenda (making paperclips instead of diamonds, for example), then the obfuscation is clearly valuable to […]

Alex Turner (4d):
No, the point is that the grader can only grade the current plan; it doesn't automatically know what its counterfactual branches output. The grader is scope-limited to its current invocation. This makes consistent grading harder (e.g. the soup-kitchen plan vs political activism: neither invocation knows what grade would be given by the other call to the grader, so they can't trivially agree on a consistent scale).

It... seems to be a significant simplification of the problem? I mean, not needing all the interpretability and surgery tools would be a bigger improvement, but that's probably not something we can have.

Why do you think so? Currently I'm seeing a couple of massive difficulties here that don't generally or necessarily appear in alternative approaches:

  • You need to know that you're going to reach an AGI before it becomes superintelligent, or you'll waste your time training an AI that will be taken over by the competitors. Whereas many approaches don't require […]
Thane Ruthenis (5d):
The crux is likely in a disagreement about which approaches we think are viable. In particular: what are the approaches you have in mind that are both promising and don't require this?

The most promising ones that come to my mind are the Shard Theory-inspired one and ELK. I've recently become much more skeptical [https://www.lesswrong.com/posts/heXcGuJqbx3HBmero/people-care-about-each-other-even-though-they-have-imperfect?commentId=rZPPm7SpGtsQvEvCC] [https://www.lesswrong.com/posts/kmpNkeqEGvFue7AvA/value-formation-an-overarching-model#9__Implications_for_Alignment] of the former, and the latter IIRC didn't handle mesa-optimizers/the Sharp Left Turn well (though I haven't read Paul's latest post yet, so I may be wrong on that).

The core issue, as I see it, is that we'll need to aim the AI at humans in some precise way — tell it to precisely translate for us, or care about us in some highly specific way, or interpret commands in the exact way humans intend them, or figure out how to point it directly at the human values, or something along those lines. Otherwise it doesn't handle capability jumps well, whether we crank it up to superintelligence straight away or try to carefully steer it along. And the paradigm of loss functions and broad regularizers (e.g., speed/complexity penalties) seems to consist of tools too crude for this purpose. The way I see it, we'll need fine manipulation.

Since writing the original post, I've been trying to come up with convincing-to-me ways to side-step this problem (as I allude to at the post's end), but no ideas so far.

Yeah, that's a difficulty unique to this approach.

The way you write this (especially the last sentence) makes me think that you see this attempt as being close to the only one that makes sense to you atm. Which makes me curious:

  • Do you think that you are internally trying to approximate your own Eval?
  • Do you think that you have ever made the decision (either implicitly or explicitly) to not eval all or most plans because you don't trust your ability to do so for adversarial examples (as opposed to tractability issues for example)?
  • Can you think of concrete instances where you improved your own Eval?
  • Ca […]

> This includes “What would this specific and superintelligent CEV-universe-simulation say about this plan?”.

> This doesn’t include (somehow) getting an AI which correctly computes what program would be recommended by AGI designers in an altruistic and superintelligent branch of humanity, and then the AI executes that program and shuts itself off without doing anything else.[5]

But isn't 1 here at least as good as 2, since the CEV-universe-simulation could always compute X = [the program that would be recommended by AGI designers in an altruistic and […]

  1. Intelligence => strong selection pressure => bad outcomes if the selection pressure is off target.
  2. In the case of agents that are motivated to optimize evaluations of plans, this argument turns into "what if the agent tricks the evaluator".
  3. In the case of agents that pursue values / shards instilled by some other process, this argument turns into "what if the values / shards are different from what we wanted".
  4. To argue for one of these over the other, you need to compare these two arguments. However, this post is stating point 2 while ignoring point 3.

O[…]

Rohin Shah (5d):
Sounds right. How does this answer my point 4?

I guess maybe you see two discrepancies vs one and conclude that two is worse than one? I don't really buy that; it seems like it depends on the size of the discrepancies. For example, if you imagine an AI that's optimizing for my evaluation of good, I think the discrepancy between "Rohin's directly instilled goals" and "Rohin's CEV" is pretty small and I am pretty happy to ignore it. (Put another way, if that was the only source of misalignment risk, I'd conclude misalignment risk was small and move on to some other area.)

So the only one that matters in this case of grader optimization is the discrepancy between "plans Rohin evaluates as good" and "Rohin's directly instilled goals".

A few questions to better understand your frame:

  • You mostly mention two outcomes for the various diamond-maximizer architectures: maximizing the number of diamonds produced and creating hypertuned-fooling-plans for the evaluator. If I could magically ensure that plan-space only contains plans that are not hypertuned-fooling-plans (they might try, but will most likely be figured out), would you say that then grader-optimization gives us an aligned AI? Or are there other failure modes that you see?
    • Intuitively if maximizing the number of diamonds and maximizing […]
Alex Turner (5d):
Really appreciate the good questions!

No, there are other failure modes due to unnaturality. Here's something I said in private communication:

So, clarification: if I (not a grader-optimizer) wanted to become a grader-optimizer while pursuing my current goals, I'd need to harden my own evaluation procedures to keep up with my plan-search now being directed towards adversarial plan generation. Furthermore, for a given designer-intended task (e.g. "make diamonds"), to achieve that with grader-optimization, the designer pays in the extra effort they need to harden the grader relative to just... not evaluating adversarial plans to begin with. Given an already pointed-to/specified grader, the hardening is already baked in to that grader, and so both evaluation- and values-child should come out about the same in terms of compute usage.

I think that a values-executing AGI can also search over as many plans which actually make sense; I don't think its options are limited or anything. But it'll be generating different kinds of plans, using reflective reasoning to restrict its search to non-adversarial-to-own-values parts of plan space (e.g. "don't think about basilisks").

1. I don't see why that should exist; any plan-inhabiting adversary wishes to fool the boundary of whatever rule you provide. EDIT: I'm most confident in this point if you want your AI to propose plans which you can't generate but can maybe verify.
2. See the last 10+ years of alignment researchers failing to do this. Probably wise to not spend further increments of research time on such matters, once the fault is pointed out.

Thanks for the kind words!

  1. Are there any particular lessons/ideas from Refine that you expect (or hope) SERI MATS to incorporate?

I have shared some of my models related to epistemology and key questions with MATS organizers, and I think they're supposed to be integrated into one of the future programs. Mostly things regarding realizing the importance of productive mistakes in science (which naturally pushes back a bit against the mentoring aspect of MATS) and understanding how much less "clean" most scientific progress actually looks historically (with a basic read […]

Thanks for the kind words and the useful devil's advocacy! (I'm expecting nothing less from you ;p)

  1. I expect it's unusual that [replace methodology-1 with methodology-2] will be a pareto improvement: other aspects of a researcher's work will tend to have adapted to fit methodology-1. So I don't think the creation of some initial friction is a bad sign. (also mirrors therapy - there's usually a [take things apart and better understand them] phase before any [put things back together in a more adaptive pattern] phase)
    1. It might be useful to predict this kind of thing […]

You probably know better than me, but I still have this intuition that seed-AI and FOOM have oriented the framing of the problem and the sort of questions asked. I think people who came to agent foundations from different routes ended up asking slightly different questions.

I could totally be wrong though, thanks for making this weakness of my description explicit!

That's a great point!

There's definitely one big difference between how Scott defined it and how I'm using it, which you highlighted well. I think a better way of explaining my change is that in Scott's original example, the AI being flawed results, in some sense, in the alignment scheme (predict human values and do that) being flawed too.

I hadn't made the explicit claim in my head or in the post, but thanks to your comment, I think I'm claiming that the version I'm proposing generalizes one of the interesting parts of the original definition, and lets it be appl[…]

William Saunders (4mo):
Yep, that clarifies.

Yeah, I will be posting updates, and probably the participants themselves will post some notes and related ideas. Excited too about how it's going to pan out!

Thanks for the comment!

To be honest, I had more trouble classifying you, and now that you commented, I think you're right that I got the wrong label. My reasoning was that your agenda and directions look far more explicit and precise than Paul's or Evan's, which is definitely a more mosaic-y trait. On the other hand, there is the iteration that you describe, and I can clearly see a difference in terms of updating between you and, let's say, John/Eliezer.

My current model is that you're more palimpsest-y, but compared with most of us, you're surprisingly good at making your current iteration fit into a proper structure that you can make explicit and legible.

(Will update the post in consequence. ;) )

Nice post! Two things I particularly like are the explicit iteration (demonstrating by example how and why not to only use one framing), as well as the online learning framing.

The policy behaves in a competent yet undesirable way which gets low reward according to the original reward function.[2] This is an inner alignment failure, also known as goal misgeneralization. Langosco et al. (2022) provide a more formal definition and some examples of goal misgeneralization.

It seems like a core part of this initial framing relies on the operationalisation of […]

Lawrence Chan (3mo):
I think here, "competent" can probably be defined in one of two (perhaps equivalent) ways:

1. Restricted reward spaces/informative priors over reward functions: as the appropriate folk theorem goes, any policy is optimal according to some reward function. "Most" policies are incompetent; consequently, many reward functions incentivize behavior that seems incoherent/incompetent to us. It seems that when I refer to a particular agent's behavior as "competent", I'm often making reference to the fact that it achieves high reward according to a "reasonable" reward function that I can imagine. Otherwise, the behavior just looks incoherent. This is similar to the definition used in Langosco, Koch, Sharkey et al's goal misgeneralization paper [https://arxiv.org/abs/2105.14111], which depends on a non-trivial prior over reward functions.
2. Demonstrates instrumental convergence/power-seeking behavior. In environments with regularities, certain behaviors are instrumentally convergent/power-seeking [https://arxiv.org/abs/1912.01683]. That is, they're likely to occur for a large class of reward functions. To evaluate if behavior is competent, we can look for behavior that seems power-seeking to us (i.e., not dying in a game). Incompetent behavior is that which doesn't exhibit power-seeking or instrumentally convergent drives.

The reason these two can be equivalent is the aforementioned folk theorem: as every policy has a reward function that rationalizes it, there exist priors over reward functions where the implied prior over optimal policies doesn't demonstrate power-seeking behavior.
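(A toy sketch of the folk-theorem point in 1. above: for an arbitrary deterministic policy, construct a reward function that rationalizes it. The state/action counts and the construction are my own illustration, not something from the linked papers.)

```python
import numpy as np

# For any deterministic policy, build a reward function under which that policy
# is optimal: reward 1 exactly when the agent takes the policy's action, else 0.

n_states, n_actions = 4, 3
rng = np.random.default_rng(0)
arbitrary_policy = rng.integers(n_actions, size=n_states)  # even an "incompetent" one

R = np.zeros((n_states, n_actions))
R[np.arange(n_states), arbitrary_policy] = 1.0  # the rationalizing reward

# Following the policy earns the maximal per-step reward of 1 in every state,
# so it is optimal under R for any transition dynamics and any discount factor.
assert (R.argmax(axis=1) == arbitrary_policy).all()
print(R)
```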

Well, isn't having multiple modules a precondition to something being modular? That seems like what's happening in your example: it has only one module, so it doesn't even make sense to apply John's criterion.

Thanks for the post! As always I broadly agree, but I have a bunch of nitpicks.

You can save yourself several years of time and effort by actively trying to identify the Hard Parts and focus on them, rather than avoid them. Otherwise, you'll end up burning several years on ideas which don't actually leave the field better off.

I agree that avoiding the Hard Parts is rarely productive, but you also don't address one relevant concern: what if the Hard Part is not merely Hard, but actually Impossible? In this case your advice can also be cashed out by trying […]

johnswentworth (5mo):
Yes, although I consider that one more debatable. When there's not a "right" operationalization, that usually means that the concepts involved were fundamentally confused in the first place.

Actually, I think starting from a behavioral theorem is fine. It's just not where we're looking to end up, and the fact that we want to open the black box should steer what starting points we look for, even when those starting points are behavioral.

In what way is AF not open to new ideas? I think it is a bit scary to publish a post here, but that has more to do with it being very public, and less to do with anything specific about the AF. But if the AF has a culture of being unwelcoming to new ideas, maybe we should fix that?

It's not that easy to justify a post from a year ago, but I think what I meant was that the Alignment Forum has a certain style of alignment research, and thus only reading it means you don't see stuff like CHAI research or other work that aims at alignment but isn't shared that much on the AF.

Are you pointing here at the fact that the AI training process and world will be a complex system, and as such it is hard to predict the outcomes of interventions, and hence the first-order obvious outcomes of interventions may not occur, or may be dominated by higher-order outcomes?

This points at the same thing IMO, although still in a confusing way. This assumption is basically that you can predict the result of an intervention without having to understand the internal mechanism in detail, because the latter is straightforward.

Other possible names would […]
Robert Kirk (5mo):
It seems to me that you want a word for whatever the opposite of complex/chaotic systems is, right? Although obviously "Simple" is probably not the best word (as it's very generic). It could be "Simple Dynamics" or "Predictable Dynamics"?

Thanks for this post, it's clear and insightful about RLHF.

From an alignment perspective, would you say that your work gives evidence that we should focus most of the energy on finding guarantees about the distribution that we're aiming for and debugging problems there, rather than thinking about the guarantees of the inference?

(I still expect that we want to understand the inference better and how it can break, but your post seems to push towards a lesser focus on that part)

Another way to put it: coherence theorems assume the existence of some resources (e.g. money), and talk about systems which are pareto optimal with respect to those resources - e.g. systems which “don’t throw away money”. Implicitly, we're assuming that the system generally "wants" more resources (instrumentally, not necessarily as an end goal), and we derive the system's "preferences" over everything else (including things which are not resources) from that. The agent "prefers" X over Y if it expends resources to get from Y to X. If the agent reaches a wo

[…]

One thing that I had to remind myself while reading this post is that "far away" is across space-time, emphasis on time. So "far away" can be about optimizing the future.

Do you think that thinking explicitly about distributed systems (in the theoretical computer science sense) could be useful for having different frames or understanding of the tradeoffs? Or are you mostly using the idea of distributed systems as an intuitive frame without seeing much value in taking it too seriously?

johnswentworth (6mo):
Two answers:

  • I agree with Self-Embedded Agent that there's likely powerful frames for thinking about distributed compute which have not yet been discovered, and existing work may hint toward those. That's the sort of thing which is probably not useful for most researchers to think about, but worth at least some thinking about.
  • There's a shared core to distributed models which I do think basically-all technical researchers in the field should be familiar with. That's best picked up by seeing it in a few different contexts, and theory of distributed systems is one possible context to pick it up from. (Some others: Bayes nets/causality, working with structured matrices, distributed programming in practice.)
Alexander Gietelink Oldenziel (6mo):
If I may be so bold, the answer should be a guarded yes.

A snag is that the correct theory of what John calls 'distributed systems' or 'Time' and what theoretical CS academics generally call 'concurrency' is as of yet not fully constructed. To be sure, there are many quite well-developed theoretical frameworks - e.g. the Pi calculus [https://www.amazon.com/Pi-Calculus-Theory-Mobile-Processes/dp/0521543274] or the various models of concurrency like Petri nets, transition systems, event structures [https://www.cl.cam.ac.uk/~gw104/winskel-nielsen-models-for-concurrency.pdf], etc. They're certainly on my list of 'important things I'd like to understand better'.

Our world, and our sensemaking of it, is fundamentally concurrent. If we had the 'correct' theory of concurrency and were able to coherently combine it with decision theory under uncertainty, that would be very powerful.

Thanks for trying to make the issue more concrete and provide a way to discuss it!

One thing I want to point out is that you don't really need to put the non-constrained variables in the worst possible state; you just have the degree of freedom to set them to whatever helps you and is not too hard to reach.

Using sets: you have a set of worlds you want, and a proxy that is a superset of this (because you're not able to aim exactly at what you want). The problem is that the AI is optimizing to get into the superset with high guarantees and stay there, and so it'[…]
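(A hypothetical toy rendering of the superset point above, with made-up worlds and costs, just to make the failure mode concrete; it is not taken from the post under discussion.)

```python
# The proxy accepts a strict superset of the worlds we actually want. An optimizer
# that only needs to land in the proxy set will pick whichever proxy-world is
# easiest to reach, which need not be a wanted world.

wanted = {"w1", "w2"}
proxy = wanted | {"w3", "w4"}                           # proxy is a strict superset
effort = {"w1": 10.0, "w2": 8.0, "w3": 1.0, "w4": 5.0}  # cost to reach each world

chosen = min(proxy, key=effort.get)                     # optimize against the proxy only
print(chosen, chosen in wanted)                         # -> w3 False
```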

Great post!

For instance, if I’m planning a party, then the actions I take now are far away in time (and probably also space) from the party they’re optimizing. The “intermediate layers” might be snapshots of the universe-state at each time between the actions and the party. (... or they might be something else; there are usually many different ways to draw intermediate layers between far-apart things.)

This applies surprisingly well even in situations like reinforcement learning, where we don’t typically think of the objective as “far away” from the agent.

[…]

I approximately followed the technical discussion, and now I'm wondering what that would buy us if you are correct.

  • Max entropy distributions seem nicely behaved and well-studied, so maybe we get some computations, properties, derivations for free? (Basically applying a productive frame to the problem of abstraction)
  • It would reduce computing the influence of the summary statistics on the model to computing the constraints, as I'm guessing that this is the hard part in computing the max entropy distribution (?)

Are these correct, and what am I missing?

That's basically correct; the main immediate gain is that it makes it much easier to compute abstractions and compute using abstractions.

One additional piece is that it hints towards a probably-more-fundamental derivation of the theorems in which maximum entropy plays a more central role. The maximum entropy Telephone Theorem already does that, but the resampling + gKPD approach routes awkwardly through gKPD instead; there's probably a nice way to do it directly via constrained maximization of entropy. That, in turn, would probably yield stronger and simpler theorems.
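(To make "computing the max entropy distribution given the constraints" concrete, here is a minimal, self-contained sketch of the textbook discrete case: a distribution over a die's faces constrained to a given mean, solved through the Lagrangian dual. The setup and numbers are my own illustration, not John's abstraction machinery.)

```python
import numpy as np
from scipy.optimize import minimize

# Discrete maximum-entropy distribution subject to expectation constraints
# E[f_i(X)] = c_i. The solution has exponential-family form
# p(x) ∝ exp(sum_i lambda_i f_i(x)), found by minimizing the Lagrangian dual.

xs = np.arange(1, 7)            # support: faces of a six-sided die
features = np.stack([xs])       # one constraint feature, f(x) = x
targets = np.array([4.5])       # require E[X] = 4.5 (a "loaded" die)

def dual(lmbda):
    logits = lmbda @ features
    return np.logaddexp.reduce(logits) - lmbda @ targets  # log Z(lambda) - lambda . c

res = minimize(dual, x0=np.zeros(len(targets)))
logits = res.x @ features
p = np.exp(logits - np.logaddexp.reduce(logits))

print("max-ent distribution:", p.round(3))
print("achieved E[X]:", (p * xs).sum())  # ≈ 4.5
```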

Thanks for the post!

So if I understand correctly, your result is aiming at letting us estimate the dimensionality of the solution basins based on the gradients for the training examples at my local min/final model? Like, I just have to train my model, and then compute the Hessian/behavior gradients and I would (if everything you're looking at works as intended) have a lot of information about the dimensionality of the basin (and I guess the modularity is what you're aiming at here)? That would be pretty nice.

What other applications do you see for this resu[…]

Vivek Hebbar (6mo):
About the contours: While the graphic shows a finite number of contours with some spacing, in reality there are infinitely many contour planes and they completely fill space (as densely as the reals, if we ignore float precision). So at literally every point in space there is a blue contour, and a red one which exactly coincides with it.
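(Back to the Hessian question above: a hypothetical minimal sketch of what "read the basin dimensionality off the trained model" could look like, by counting near-zero Hessian eigenvalues of a toy loss at a minimum. The toy loss, the threshold, and the whole setup are my own, not Vivek's actual method.)

```python
import torch
from torch.autograd.functional import hessian

# A 3-parameter "model" whose loss only depends on 2 combinations of parameters,
# so its minima form a 1-dimensional flat basin; that flat direction shows up as
# a near-zero Hessian eigenvalue at any minimum.

def loss(theta):
    return theta[0] ** 2 + (theta[0] + theta[1] - 1.0) ** 2

theta_star = torch.tensor([0.0, 1.0, 3.14])   # a minimum (theta[2] is unconstrained)
H = hessian(loss, theta_star)
eigvals = torch.linalg.eigvalsh(H)

flat_dims = int((eigvals.abs() < 1e-6).sum())
print("Hessian eigenvalues:", eigvals)
print("estimated basin dimensionality:", flat_dims)  # 1
```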

I like this pushback, and I'm a fan of productive mistakes. I'll have a think about how to rephrase to make that clearer. Maybe there's just a communication problem, where it's hard to tell the difference between people claiming "I have an insight (or proto-insight) which will plausibly be big enough to solve the alignment problem", versus "I have very little traction on the alignment problem but this direction is the best thing I've got". If the only effect of my post is to make a bunch of people say "oh yeah, I meant the second thing all along", then I'd

[…]

Thanks for the answer.

One thing I would say in response to your comment, Adam, is that I don't usually see the message of your linked post as being incompatible with Richard's main point. I think one usually does have or does need productive mistakes that don't necessarily or obviously look like they are robust partial progress. But still, often when there actually is a breakthrough, I think it can be important to look for this "intuitively compelling" explanation. So one thing I have in mind is that I think it's usually good to be skeptical if a claimed b

[…]

I like this pushback, and I'm a fan of productive mistakes. I'll have a think about how to rephrase to make that clearer. Maybe there's just a communication problem, where it's hard to tell the difference between people claiming "I have an insight (or proto-insight) which will plausibly be big enough to solve the alignment problem", versus "I have very little traction on the alignment problem but this direction is the best thing I've got". If the only effect of my post is to make a bunch of people say "oh yeah, I meant the second thing all along", then I'd[…]

I like that you're proposing an explicit heuristic inspired by the history of science for judging research directions and approaches, and acknowledge that it leads to conclusions that are counterintuitive to my Richard-model (pushing for Agent Foundations, for example), so you're not just retrofitting your own conclusion AFAIK. I also like that you're applying it to object-level directions in alignment — that's something I'm working on at the moment for my own research, based on your pushback.

That being said, my prediction/retrodiction is that this is too […]

Spencer Becker-Kahn (7mo):
I broadly agree with Richard's main point, but I also do agree with this comment in the sense that I am not confident that the example of Turing compared with e.g. Einstein is completely fair/accurate.

One thing I would say in response to your comment, Adam, is that I don't usually see the message of your linked post as being incompatible with Richard's main point. I think one usually does have or does need productive mistakes that don't necessarily or obviously look like they are robust partial progress. But still, often when there actually is a breakthrough, I think it can be important to look for this "intuitively compelling" explanation. So one thing I have in mind is that I think it's usually good to be skeptical if a claimed breakthrough seems to just 'fall out' of a bunch of partial work without there being a compelling explanation after the fact.

Sorry to make you work more, but happy to fill a much needed niche. ^^

Thanks! Yes, this is very much an experiment, and even if it fails, I expect it to be a productive mistake we can learn from. ;)

I disagree, so I'm curious: what are great examples for you of good research on alignment that is not done by x-risk-motivated people? (Not being dismissive, I'm genuinely curious, and discussing specifics sounds more promising than downvoting you to oblivion and not having a conversation at all.)

Joe_Collman (8mo):
Examples would be interesting, certainly. Concerning the post's point, I'd say the relevant claim is that [type of alignment research that'll be increasingly done in slow takeoff scenarios] is already being done by non x-risk motivated people.

I guess the hope is that at some point there are clear-to-everyone problems with no hacky solutions, so that incentives align to look for fundamental fixes - but I wouldn't want to rely on this.

I have a framing of AI risk scenarios that I think is more general and more powerful than most available online, and that might be a good frame before going into examples. It's not posted yet (I'm finishing the sequence now) but I could send some things to you if you're interested. ;)

Aryeh Englander (8mo):
Yes please!

(I will be running the Incubator at Conjecture)

The goal for the incubator is to foster new conceptual alignment research bets that could go on to become full-fledged research directions, either at Conjecture or at other places. We're thus planning to select mostly on the qualities we expect from a very promising independent conceptual researcher, that is, proactivity (see Paul Graham's Relentlessly Resourceful post) and some interest or excitement about not-fully-tapped streams of evidence (see this recent post).

Although experience with alignment cou[…]

Thanks for the answer!

Unfortunately, I don't have the time at the moment to answer in detail and have more of a conversation, as I'm fully focused on writing a long sequence about pushing for pluralism in alignment and extracting the core problem out of all the implementation details and additional assumptions. I plan on going back to analyzing timeline research in the future, and will probably give better answers then.

That being said, here are quick fire thoughts:

  • I used the evolution case because I consider it the most obvious/straightforward case, in that […]

Great idea!

I was talking with someone just today about the need for something like that. I also expect that InfraBayes is one of the approaches to alignment that needs a lot of translation effort to become palatable to proponents of other approaches.

Additional point though: I feel like the applicant should probably also have a decent understanding of alignment, as a lot of the value of such a communication and translation (for me at least) would come from understanding its value for alignment.

This thread makes me think that my post is basically a hardness result for ELK when you don't have access to the planner. I agree with you that in settings like the ones you describe, the reporter would have access to the planner, and thus the examples described in this post wouldn't really apply. But the need to have control of the planner is not stated in the ELK report.

So if this post is correct, solving ELK isn't enough if you don't have access to the planner. Which means we either need to ensure that in all cases we can train/observe the planner (which[…]

So this line of thinking came from considering the predictor and the planner as separate, which is what ELK does AFAIR. It would thus be the planner that executes, or prepares a treacherous turn, while the predictor doesn't know about it (but would be in principle able to find out if it actually needed to).

Rohin Shah (10mo):
What is the planner? How is it trained?

I often imagine a multihead architecture, where most of the computation is in the shared layers, then one (small) head is the "predictor" (trained by self-supervised learning), one (small) head is the "actor / planner" (trained by RL from human feedback), and the other (small) head is the "reporter" (trained via ELK). In this version the hope is that ~all of the relevant knowledge is in the shared layers and so is accessible to the reporter.

You could also be fancier and take the activations, weights, and/or outputs of the planner head and feed them as inputs into the reporter, if you really wanted to be sure that the information is accessible in principle.
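(A minimal, hypothetical sketch of the shared-trunk multihead setup described above, just to fix the shape of the idea; the layer sizes, head names, and training signals are placeholders, not a proposal from the ELK report.)

```python
import torch
import torch.nn as nn

class SharedTrunkAgent(nn.Module):
    def __init__(self, obs_dim=128, hidden=512, n_actions=16):
        super().__init__()
        self.trunk = nn.Sequential(            # most of the computation/knowledge
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.predictor = nn.Linear(hidden, obs_dim)  # small head, self-supervised prediction
        self.planner = nn.Linear(hidden, n_actions)  # small head, RL from human feedback
        self.reporter = nn.Linear(hidden, 2)         # small head, trained via ELK (e.g. yes/no)

    def forward(self, obs):
        z = self.trunk(obs)                    # shared layers hold ~all relevant knowledge
        return self.predictor(z), self.planner(z), self.reporter(z)

model = SharedTrunkAgent()
prediction, plan_logits, report_logits = model(torch.randn(1, 128))
```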

Thanks for the post! Two general points I want to make before going into more general comments:

  • I liked the section on concepts difference across, and hadn't thought much about it before, so thanks!
  • One big aspect of the natural abstraction hypothesis that you missed IMO is "how do you draw the boundaries around abstractions?" — more formally, how do you draw the Markov blanket. This to me is the most important question to answer for settling the NAH, and John's recent work on sequences of Markov blankets is IMO him trying to settle this.

In general, we should […]

Rereading this post while thinking about the approximations that we make in alignment, two points jump at me:

  • I'm not convinced that robustness to relative scale is as fundamental as the other two, because there is no reason to expect that in general the subcomponents will be significantly different in power, especially in settings like adversarial training where both parts are trained according to the same approach. That being said, I still agree that this is an interesting question to ask, and some proposals might indeed depend on a version of this.
  • Robustness […]

Thanks for pushing back on my interpretation.

I feel like you're using "strongest" and "weakest" to designate "more concrete" and "more abstract", with maybe the value judgement (implicit in your focus on specific testable claims) that concreteness is better. My interpretation doesn't disagree with your point about Bio Anchors; it simply says that this is a concrete instantiation of a general pattern, and that the whole point of the original post as I understand it is to share this pattern. Hence the title, which talks about all biology-inspired timelines, the th[…]

Thanks so much for the effort you're putting into this work! It looks particularly relevant to my current interest of understanding the different approximations and questions used in alignment, and what keeps us from the Grail of paradigmaticity.

Here is my more concrete feedback:

A common approach when setting research agendas in AI Alignment is to be specific, and focus on a threat model. That is, to extrapolate from current work in AI and our theoretical understanding of what to expect, to come up with specific stories for how AGI could cause an existential catastrophe […]

Thanks for this post!

That being said, my model of Yudkowsky, which I built by spending time interpreting and reverse-engineering the post you're responding to, says that you're not addressing his points (obviously, I might have missed the real Yudkowsky's point).

My interpretation is that he is saying that Evolution (as the generator of most biological anchors) explores the solution space along a fundamentally different path than human research. So what you have is two paths through a space. The burden of proof for biological anchors thus lies in arguing […]

HoldenKarnofsky (8mo):
I don't think I am following the argument here. You seem focused on the comparison with evolution, which is only a minor part of Bio Anchors, and used primarily as an upper bound. (You say "the number is so vastly large (and actually unknown due to the 'level of details' problem) that it's not really relevant for timelines calculations," but actually Bio Anchors still estimates that the evolution anchor implies a ~50% chance of transformative AI this century.)

Generally, I don't see "A and B are very different" as a knockdown counterargument to "If A required ___ amount of compute, my guess is that B will require no more." I'm not sure I have more to say on this point that hasn't already been said - I acknowledge that the comparisons being made are not "tight" and that there's a lot of guesswork, and the Bio Anchors argument doesn't go through without some shared premises and intuitions, but I think the needed intuitions are reasonable and an update from Bio-Anchors-ignorant starting positions is warranted.

It displeases me that this is currently the most upvoted response: I believe you are focusing on EY's weakest rather than strongest points.

My interpretation is that he is saying that Evolution (as the generator of most biological anchors) explores the solution space along a fundamentally different path than human research. So what you have is two paths through a space. The burden of proof for biological anchors thus lies in arguing that there are enough connections/correlations between the two paths to use one in order to predict the other.

It's hardly su[…]

First, I want to clarify that I feel we're getting to a more interesting place, where there's a better chance that you might find a point that invalidates Yudkowsky's argument, and can thus convince him of the value of the model.

But it's also important to realize that IMO, Yudkowsky is not just saying that biological anchors are bad. The more general problem (which is also developed in this post) is that predicting the Future is really hard. In his own model of AGI timelines, the factor that is basically impossible to predict until you can make AGI is the […]

I do think you are misconstruing Yudkowsky's argument. I'm going to give evidence (all of which is relatively strong IMO) in order of "ease of checkability". So I'll start with something anyone can check in a couple of minutes, and close with the more general interpretation that requires rereading the post in detail.

Evidence 1: Yudkowsky flags Simulated-Eliezer as talking smack in the part you're mentioning

If I follow you correctly, your interpretation mostly comes from this part:

OpenPhil:  We did already consider that and try to take it into account:

[…]

Here I think I share your interpretation of Yudkowsky; I just disagree with Yudkowsky. I agree on the second part; the model overestimates median TAI arrival time. But I disagree on the first part -- I think that having a probability distribution over when to expect TAI / AGI / AI-PONR etc. is pretty important/decision-relevant, e.g. for advising people on whether to go to grad school, or for deciding what sort of research project to undertake. (Perhaps Yudkowsky agrees with this much.) 

Hmm, I would say Yudkowsky seems to agree with the value of a pro[…]

Daniel Kokotajlo (1y):
I guess I would say: Ajeya's framework/model can incorporate this objection; this isn't a "get rid of the whole framework" objection but rather a "tweak the model in the following way" objection.

Like, I agree that it would be bad if everyone who used Ajeya's model had to put 100% of their probability mass into the six bio anchors she chose. That's super misleading/biasing/ignores loads of other possible ways AGI might happen. But I don't think of this as a necessary part of Ajeya's model; when I use it, I throw out the six bio anchors and just directly input my probability distribution over OOMs of compute. My distribution is informed by the bio anchors, of course, but that's not the only thing that informs it.

Strongly disagree with this, to the extent that I think this is probably the least cruxy topic discussed in this post, and thus the comment is as wrong as is physically possible.

Remove Platt's law, and none of the actual arguments and meta-discussions changes. It's clearly a case of Yudkowsky going for the snappy "hey, see, even your new-and-smarter report makes exactly the same estimate predicted by a random psychological law" + his own frustration with the law still applying despite expected progress.

But once again, if Platt's law was so wrong that[…]

Daniel Kokotajlo (1y):
Hahaha ok, interesting! If you are right I'll take some pride in having achieved that distinction. ;)

I interpreted Yudkowsky as claiming that Ajeya's model had enough free parameters that it could be made to predict a wide range of things, and that what was actually driving the 30-year prediction was a bunch of implicit biases rather than reality. Platt's Law is evidence for this claim. If it were false and e.g. the typical timelines forecast was only 10 years out, or 60, then we would have less reason to think that implicit biases were driving Ajeya's choice of parameters.

Of course, Yudkowsky also made other arguments besides this one... but this one seemed to be there, and it seemed fairly important to me. It's entirely possible I am misconstruing Yudkowsky's argument... you did recently do a reconstruction, so you probably understand it better than me. Care to elaborate?