This is a special post for quick takes by tailcalled. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.



If a tree falls in the forest, and two people are around to hear it, does it make a sound?

I feel like typically you'd say yes, it makes a sound. Not two sounds, one for each person, but one sound that both people hear.

But that must mean that a sound is not just auditory experiences, because then there would be two rather than one. Rather it's more like, emissions of acoustic vibrations. But this implies that it also makes a sound when no one is around to hear it.

I think this just repeats the original ambiguity of the question, by using the word "sound" in a context where the common meaning (air vibrations perceived by an agent) is only partly applicable.  It's still a question of definition, not of understanding what actually happens.

I think we're playing too much with the meaning of "sound" here. The tree causes some vibrations in the air, which leads to two auditory experiences since there are two people.

I think I've got it, the fix to the problem in my corrigibility thing!

So to recap: It seems to me that for the stop button problem, we want humans to control whether the AI stops or runs freely, which is a causal notion, and so we should use counterfactuals in our utility function to describe it. (Dunno why most people don't do this.) That is, if we say that the AI's utility should depend on the counterfactuals related to human behavior, then it will want to observe humans to get input on what to do, rather than manipulate them, because this is the only way for it to be dependent on the counterfactuals.

But So8res pointed out that just using counterfactuals directly is a big problem, because it rapidly brings us out of distribution. A practical implementation of this beyond the stop button problem ends up having an exponential number of counterfactuals to cover, and the vast majority of those counterfactuals will be far outside of the real-world distribution. This means that the AI might not get aligned at all, because the real-world applications don't get non-negligible weight in the utility function.

But I think I've figured out a solution now, which I'd call conditional+counterfactual corrigibility. As usual, let's use B to denote that the stop button gets pressed and the AI shuts down, V to denote whichever non-corrigible utility function we want to make corrigible, and do(W_s) / do(W_¬s) to denote a counterfactual where people do (W_s) or do not (W_¬s) want to press the stop button. However, we will also use C_s and C_¬s to denote the conditions where people do or do not want to press the stop button. In that case, we can define Ctrl(C) to mean that humans can control whether the AI stops or runs in condition C:
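As a rough sketch of what that could look like (the exact functional form is a detail; this is just one way to write it):

$$\mathrm{Ctrl}(C) = \min\Big(\mathbb{E}\big[B \mid C, \mathrm{do}(W_s)\big],\ \mathbb{E}\big[V \mid C, \mathrm{do}(W_{\neg s})\big]\Big)$$

i.e. under condition C, counterfactually making people want to press the button should actually lead to shutdown, and counterfactually making them not want to should actually lead to good V-outcomes.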

and then we simply want to define the utility as saying that people can control the AI in both the C_s and the C_¬s condition:
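Again as a sketch, one way to combine them:

$$U = \min\big(\mathrm{Ctrl}(C_s),\ \mathrm{Ctrl}(C_{\neg s})\big)$$

(A min rather than a sum stops the AI from trading away controllability in one condition for extra utility in the other, though the exact combination rule is a detail.)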

Previously, I strongly emphasized the need to keep the AI "under a counterfactual" - that is, if it believed it could control whether humans want to stop it or not, then it would be incentivized to manipulate humans. But this is what brings us out of distribution. However, counterfactuals aren't the only way to keep the appearance of a phenomenon constant - conditionals work too. And conditionals keep you nicely on distribution, so that's now my solution to the distribution issues. This means that we can use much less invasive counterfactuals.

That said, this approach is going to have a hard time with chaotic phenomena, as combining conditionals and counterfactuals in the presence of chaos can get pretty weird.

It also might be vulnerable to some variant of the critiques that were first raised against it, because now the conditional introduces a link between its policy and the scenarios it faces, but I can't immediately construct a case where it happens, because the conditional would still somewhat tend to sabotage the obvious approaches. This sort of unclarity is kind of concerning when it comes to the idea.

Like suppose the AI immediately very publicly does something that looks very unsafe. Say it grabs control over the stop button and starts mass-producing paperclips in an extremely publicly visible way. This would probably lead to people wanting to stop it. So therefore, if it has a policy like that, the C_s conditional would lead to people quickly wanting to stop it. This means that in the C_s branch, it can quickly determine whether it is in the do(W_s) branch or the do(W_¬s) branch; in the do(W_¬s) case, it can then keep going with whatever optimization V specified, while in the do(W_s) case, it can then immediately shut itself down.

But the reason I think the AI *wouldn't* do this is, what about the C_¬s branch? If you condition on humans not wanting to press the stop button even though there's a clearly unaligned AI, what sort of situation could produce this? I have trouble imagining it, because it seems like it would need to be pretty extreme. The best ideas I can come up with are stuff like "black hole swallows the earth", but this would rank pretty low in the AI's utility function, and therefore it would avoid acting this way in order to have a reasonable C_¬s branch.

But this does not seem like sane reasoning on the AI's side to me, so it seems like this should be fixed. And of course, fixed in a principled rather than unprincipled way.

One thing that seems really important for agency is perception. And one thing that seems really important for perception is representation learning. Where representation learning involves taking a complex universe (or perhaps rather, complex sense-data) and choosing features of that universe that are useful for modelling things.

When the features are linearly related to the observations/state of the universe, I feel like I have a really good grasp of how to think about this. But most of the time, the features will be nonlinearly related; e.g. in order to do image classification, you use deep neural networks, not principal component analysis.

I feel like it's an interesting question: where does the nonlinearity come from? Many causal relationships seem essentially linear (especially if you do appropriate changes of variables to help, e.g. taking logarithms; for many purposes, monotonicity can substitute for linearity), and lots of variance in sense-data can be captured through linear means, so it's not obvious why nonlinearity should be so important.

Here's some ideas I have so far:

  • Suppose you have a Gaussian mixture distribution with two Gaussians with different means and identical covariances. In this case, the function that separates them optimally is linear. However, if the covariances differ between the Gaussians, then the optimal separating function is nonlinear (see the derivation sketched after this list). So this suggests to me that one reason for nonlinearity is fundamental to perception: nonlinearity is necessary if multiple different processes could be generating the data, and you need to discriminate between the processes themselves. This seems important for something like vision, where you don't observe the system itself, but instead observe light that bounced off the system.
  • Consider the notion of the habitable zone of a solar system; it's the range in which liquid water can exist. Get too close to the star and the water will freeze, get too far and it will boil. Here, it seems like we have two monotonic effects which add up, but because the effects aren't linear, the result can be nonmonotonic.
  • Many aspects of the universe are fundamentally nonlinear. But they tend to exist on tiny scales, and those tiny scales tend to mostly get lost to chaotic noise, which tends to turn things linear. However, there are things that don't get lost to noise, e.g. due to conservation laws; these provide fundamental sources of nonlinearity in the universe.
  • ... and actually, most of the universe is pretty linear? The vast majority of the universe is ~empty space; there isn't much complex nonlinearity happening there, just waves and particles zipping around. If we disregard the empty space, then I believe (might be wrong) that the vast majority is stars. Obviously lots of stuff is going on within stars, but all of the details get lost to the high energies, so it is mostly simple monotonic relations that are left. It seems that perhaps nonlinearity tends to live on tiny boundaries between linear domains. The main thing that makes these tiny boundaries so relevant, such that we can't just forget about them and model everything in piecewise linear/piecewise monotonic ways, is that we live in the boundary.
  • Another major thing: It's hard to persist information in linear contexts, because it gets lost to noise. Whereas nonlinear systems can have multiple stable configurations and therefore persist it for longer.
  • There is of course a lot of nonlinearity in organisms and other optimized systems, but I believe they result from the world containing the various factors listed above? Idk, it's possible I've missed some.
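To make the first bullet concrete (standard Gaussian discriminant analysis, with components having means μ1, μ2 and covariances Σ1, Σ2): the optimal discriminator is a function of the log-likelihood ratio,

$$\log\frac{p_1(x)}{p_2(x)} = \tfrac{1}{2}\, x^\top\big(\Sigma_2^{-1}-\Sigma_1^{-1}\big)\, x + \big(\mu_1^\top\Sigma_1^{-1}-\mu_2^\top\Sigma_2^{-1}\big)\, x + \text{const}$$

When Σ1 = Σ2 the quadratic term cancels and the decision boundary is a hyperplane; when the covariances differ, the quadratic term survives and the optimal boundary is curved.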

It seems like it would be nice to develop a theory on sources of nonlinearity. This would make it clearer why sometimes selecting features linearly seems to work (e.g. consider IQ tests), and sometimes it doesn't.

I recently wrote a post about myopia, and one thing I found difficult when writing the post was in really justifying its usefulness. So eventually I mostly gave up, leaving just the point that it can be used for some general analysis (which I still think is true), but without doing any optimality proofs.

But now I've been thinking about it further, and I think I've realized - don't we lack formal proofs of the usefulness of myopia in general? Myopia seems to mostly be justified by the observation that we're already being myopic in some ways, e.g. when training prediction models. But I don't think anybody has formally proven that training prediction models myopically rather than nonmyopically is a good idea for any purpose?

So that seems like a good first step. But that immediately raises the question, good for what purpose? Generally it's justified with us not wanting the prediction algorithms to manipulate the real-world distribution of the data to make it more predictable. And that's sometimes true, but I'm pretty sure one could come up with cases where it would be perfectly fine to do so, e.g. I keep some things organized so that they are easier to find.

It seems to me that it's about modularity. We want to design the prediction algorithm separately from the agent, so we do the predictions myopically because modifying the real world is the agent's job. So my current best guess for the optimality criterion of myopic optimization of predictions would be something related to supporting a wide variety of agents.

Yeah, I think usually when people are interested in myopia, it's because they think there's some desired solution to the problem that is myopic / local, and they want to try to force the algorithm to find that solution rather than some other one. E.g. answering a question based only on some function of its contents, rather than based on the long-term impact of different answers.

I think that once you postulate such a desired myopic solution and its non-myopic competitors, then you can easily prove that myopia helps. But this still leaves the question of how we know this problem statement is true - if there's a simpler myopic solution that's bad, then myopia won't help (so how can we predict if this is true?) and if there's a simpler non-myopic solution that's good, myopia may actively hurt (this one seems a little easier to predict though).

In the context of natural impact regularization, it would be interesting to try to explore some @TurnTrout-style powerseeking theorems for subagents. (Yes, I know he denounces the powerseeking theorems, but I still like them.)

Specifically, consider this setup: Agent U starts a number of subagents S1, S2, S3, ..., with the subagents being picked according to U's utility function (or decision algorithm or whatever). Now, would S1 seek power? My intuition says, often not! If S1 seeks power in a way that takes away power from S2, that could disadvantage U. So basically S1 would only seek power in cases where it expects to make better use of the power than S2, S3, ....
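Here's a toy numerical version of that intuition (the concave value curves and the numbers are just illustrative assumptions of mine, nothing like an actual power-seeking theorem):

```python
import numpy as np

TOTAL_POWER = 10.0  # shared pool of resources/options the subagents draw from

def s1_value(power):
    # S1's task, with diminishing returns (assumed shape, purely for illustration).
    return 4.0 * np.sqrt(power)

def s2_value(power):
    # S2's task makes better marginal use of the same power.
    return 6.0 * np.sqrt(power)

def parent_utility(s1_power):
    # U just cares about the sum of what its subagents achieve.
    return s1_value(s1_power) + s2_value(TOTAL_POWER - s1_power)

grid = np.linspace(0.0, TOTAL_POWER, 1001)
endorsed = grid[np.argmax([parent_utility(p) for p in grid])]

print(f"S1 power share U would endorse: {endorsed:.2f} of {TOTAL_POWER}")
print(f"U(endorsed split) = {parent_utility(endorsed):.2f}")
print(f"U(S1 grabs everything) = {parent_utility(TOTAL_POWER):.2f}")
```

S1 grabbing the whole pool comes out strictly worse by U's lights than the endorsed split, so a subagent optimizing on U's behalf only "seeks power" up to the point where its marginal use of it beats its siblings'.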

Obviously this may be kind of hard for us to make use of if we are trying to make an AI and we only know how to make dangerous utility maximizers. But if we're happy with the kind of maximizers we can make on the first order (as seems to apply to the SOTA, since current methods aren't really utility maximizers) and mainly worried about the mesaoptimizers they might make, this sort of theorem would suggest that the mesaoptimizers would prefer staying nice and bounded.

Theory for a capabilities advance that is going to occur soon:

OpenAI is currently getting lots of novel triplets (S, U, A), where S is a system prompt, U is a user prompt, and A is an assistant answer.

Given a bunch of such triplets (S, U_1, A_1), ... (S, U_n, A_n), it seems like they could probably create a model P(S|U_1, A_1, ..., U_n, A_n), which could essentially "generate/distill prompts from examples".
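A rough sketch of what the data plumbing for that could look like (the format and names here are my own assumptions, not anything known about OpenAI's pipeline):

```python
import random
from collections import defaultdict

def build_prompt_distillation_examples(triplets, pairs_per_example=4):
    """triplets: iterable of (system_prompt, user_prompt, assistant_answer)."""
    by_system = defaultdict(list)
    for s, u, a in triplets:
        by_system[s].append((u, a))

    examples = []
    for s, pairs in by_system.items():
        if len(pairs) < pairs_per_example:
            continue
        sampled = random.sample(pairs, pairs_per_example)
        # Input: a handful of (U_i, A_i) demonstrations; target: the shared system prompt S.
        context = "\n\n".join(f"User: {u}\nAssistant: {a}" for u, a in sampled)
        examples.append({"input": context, "target": s})
    return examples
```

Train a sequence model on input -> target and you have something that "generates/distills prompts from examples".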

This seems like the first step towards efficiently integrating information from lots of places. (Well, they could ofc also do standard SGD-based gradient descent, but it has its issues.)

A followup option: they could use something a la Constitutional AI to generate perturbations A'_1, ..., A'_n. If they have a previous model like the above, they could then generate a perturbation P(S'|U_1, A'_1, ..., U_n, A'_n). I consider this significant because this then gives them the training data to create a model P(S'|S, U_1, A_1, A'_1), which essentially allows them to do "linguistic backchaining": The user can update an output of the network A_1 -> A'_1, and then the model can suggest a way to change the prompt to obtain similar updates in the future.

Furthermore I imagine this could get combined together into some sort of "linguistic backpropagation" by repeatedly applying models like this, which could unleash a lot of methods to a far greater extent than they have been so far.
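And a similarly rough sketch of the backchaining data pipeline; perturb_answer stands in for a Constitutional-AI-style rewriter and infer_system_prompt for the prompt-distillation model above, both purely hypothetical placeholders:

```python
def build_backchaining_examples(system_prompt, dialogs, perturb_answer, infer_system_prompt):
    """dialogs: list of (user_prompt, assistant_answer) generated under system_prompt."""
    perturbed = [(u, perturb_answer(u, a)) for u, a in dialogs]
    # S': what the system prompt "should have been" to produce the perturbed answers.
    new_system_prompt = infer_system_prompt(perturbed)

    examples = []
    for (u, a), (_, a_edit) in zip(dialogs, perturbed):
        # One training example for P(S' | S, U, A, A'): given an edit to a single
        # answer, predict how the prompt should change to produce such edits.
        examples.append({
            "input": {
                "old_system": system_prompt,
                "user": u,
                "old_answer": a,
                "edited_answer": a_edit,
            },
            "target": new_system_prompt,
        })
    return examples
```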

Obviously this is just a very rough sketch, and it would be a huge engineering and research project to get this working in practice. Plus maybe there are other methods that work better. I'm mainly just playing around with this because I think there's a strong economic pressure for something-like-this, and I want a toy model to use for thinking about its requirements and consequences.

Actually I suppose they don't even need to add perturbations to A directly, they can just add perturbations to S and generate A's from S'. Or probably even look at users' histories to find direct perturbations to either S or A.

I recently wrote a post presenting a step towards corrigibility using causality here. I've got several ideas in the works for how to improve it, but I'm not sure which one is going to be most interesting to people. Here's a list.

  • Develop the stop button solution further, cleaning up errors, better matching the purpose, etc..

e.g.

I think there may be some variant of this that could work. Like if you give the AI reward for its current world-state (rather than picking a policy that maximizes the base goal overall; so one difference is that you'd be summing over the reward rather than giving a single one), with the reward mixing a shutdown reward under the counterfactual where humans want to press the button with a reward function for the base goal under the counterfactual where they don't, then that would encourage the AI to create a state where shutdown happens when humans want to press the button and the base goal happens when they don't. But the issue I have with this proposal is that the AI would be prone to not respect past attempts to press the stop button. I think maybe if one picked a different reward function, then it could work better (though part of it would need a time delay...). Though this reward function might leave it open to the "trying to shut down the AI for reasons" objection that you gave before; I think that's fixed by moving the counterfactual over humans wanting to press the button outside of the sum over rewards, but I'm not sure.

  • Better explaining the intuitions behind why counterfactuals (and in particular counterfactuals over human preferences) are important for corrigibility.

e.g.

This is the immediate insight for the application to the stop button. But on a broader level, the insight is that corrigibility, respecting humans' preferences, etc. are best thought of as being preferences about the causal effect of humans on various outcomes, and those sorts of preferences can be specified using utility functions that involve counterfactuals.

This seems to be what sets my proposal apart from most "utility indifference proposals", which seem to be possible to phrase in terms of counterfactuals on a bunch of other variables than humans.

  • Using counterfactuals to control a paperclip maximizer to be safe and productive

e.g.

(I also think that there are other useful things that can be specified using utility functions that involve counterfactuals, which I'm trying to prepare for an explainer post. For instance, a sort of "encapsulation" - if you're a paperclip producer, you might want to make a paperclip maximizer which is encapsulated in the sense that it is only allowed to work within a single factory, using a single set of resources, and not influencing the world otherwise. This could be specified using a counterfactual that the outside world's outcome must be "as if" the resources in the factory just disappeared and paperclips appeared at its output act-of-god style. This avoids any unintended impacts on the outside world while still preserving the intended side effect of the creation of a high but controlled amount of paperclips. However, I'm still working on making it sufficiently neat, e.g. this proposal runs into problems with the universe's conservation laws.)

  • Attempting to formally prove that counterfactuals work and/or are necessary, perhaps with a TurnTrout-style argument

Are there good versions of DAGs for other things than causality?

I've found Pearl-style causal DAGs (and other causal graphical models) useful for reasoning about causality. It's a nice way to abstractly talk and think about it without needing to get bogged down with fiddly details.

In a way, causality describes the paths through which information can "flow". But information is not the only thing in the universe that gets transferred from node to node; there's also things like energy, money, etc., which have somewhat different properties but intuitively seem like they could benefit from graph-based models too.

I'm pretty sure I've seen a number of different graph-based models for describing different flows like this, but I don't know their names, and also the ones I've seen seemed highly specialized and I'm not sure they're the best to use. But I thought, it seems quite probable that someone on LessWrong would know of a recommended system to learn.

I have a concept that I expect to take off in reinforcement learning. I don't have time to test it right now, though hopefully I'd find time later. Until then, I want to put it out here, either as inspiration for others, or as a "called it"/prediction, or as a way to hear critique/about similar projects others might have made:

Reinforcement learning is currently trying to do stuff like learning to model the sum of their future rewards, e.g. expectations using V, A and Q functions for many algorithms, or the entire probability distribution in algorithms like DreamerV3.

Mechanistically, the reason these methods work is that they stitch together experience from different trajectories. So e.g. if one trajectory goes A -> B -> C and earns a reward at the end, it learns that states A and B and C are valuable. If another trajectory goes D -> A -> E -> F and gets punished at the end, it learns that E and F are low-value but D and A are high-value because its experience from the first trajectory shows that it could've just gone D -> A -> B -> C instead.

But what if it learns of a path E -> B? Or a shortcut A -> C? Or a path F -> G that gives a huge amount of reward? Because these techniques work by chaining the reward backwards step-by-step, it seems like this would be hard to learn well. Like the Bellman equation will still be approximately satisfied, for instance.

Ok, so that's the problem, but how could it be fixed? Speculation time:

You want to learn an embedding of the opportunities you have in a given state (or for a given state-action), rather than just its potential rewards. Rewards are too sparse of a signal.

More formally, let's say that instead of the Q function, we consider what I would call the Hope function, which, given a state-action pair (s, a), gives you a distribution over the states it expects to visit, weighted by the rewards it will get. This can still be phrased using a Bellman equation:

Hope(s, a) = r(s') + f * Hope(s', a')

Where s' is the resulting state that experience has shown comes after s when doing a, r(s') is the reward received there, f is the discount factor, and a' is the optimal action in s'.

Because the Hope function is multidimensional, the learning signal is much richer, and one should therefore maybe expect its internal activations to be richer and more flexible in the face of new experience.
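As a minimal sketch of what I mean, here's a tabular TD-style update for the Hope function, under one concrete reading of the equation above (Hope(s, a) is a vector over states, and r(s') means "the reward at s', placed on the s' coordinate"; both of those are modelling choices, not the only option):

```python
import numpy as np

n_states, n_actions = 6, 3
gamma, alpha = 0.95, 0.1  # discount factor f and learning rate

# hope[s, a] is a length-n_states vector: reward-weighted expected future visits.
hope = np.zeros((n_states, n_actions, n_states))

def td_update_hope(s, a, r, s_next, a_next):
    """One TD step for Hope(s, a) = r(s') + f * Hope(s', a')."""
    target = r * np.eye(n_states)[s_next] + gamma * hope[s_next, a_next]
    hope[s, a] += alpha * (target - hope[s, a])

def q_from_hope(s, a):
    """The ordinary scalar value is recoverable by summing the vector,
    but the vector also says *where* the value is expected to come from."""
    return hope[s, a].sum()
```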

Here's another thing to notice: let's say for the policy, we use the Hope function as a target to feed into a decision transformer. We now have a natural parameterization for the policy, based on which Hope it pursues.

In particular, we could define another function, maybe called the Result function, which in addition to s and a takes a target distribution w as a parameter, subject to the Bellman equation:

Result(s, a, w) = r(s') + f * Result(s', a', (w - r(s')) / f)

Where a' is the action recommended by the decision transformer when asked to achieve (w - r(s')) / f from state s'.

This Result function ought to be invariant under many changes in policy, which should make it more stable to learn, boosting capabilities. Furthermore it seems like a win for interpretability and alignment as it gives greater feedback on how the AI intends to earn rewards, and better ability to control those rewards.
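Under the same reading as the Hope sketch above, one TD step for the Result function could look like this (dt_policy is a hypothetical stand-in for the decision transformer, and the tabular storage is just for illustration):

```python
from collections import defaultdict
import numpy as np

n_states, gamma, alpha = 6, 0.95, 0.1
# Keyed by (s, a, target-bytes) since the target w is a vector over states.
result = defaultdict(lambda: np.zeros(n_states))

def td_update_result(s, a, w, r, s_next, dt_policy):
    """One TD step for Result(s, a, w) = r(s') + f * Result(s', a', (w - r(s')) / f)."""
    r_vec = r * np.eye(n_states)[s_next]
    w_next = (w - r_vec) / gamma  # residual target handed to the next step
    a_next = dt_policy(s_next, w_next)  # hypothetical: decision transformer pursuing w_next from s'
    target = r_vec + gamma * result[(s_next, a_next, w_next.tobytes())]
    key = (s, a, w.tobytes())
    result[key] = result[key] + alpha * (target - result[key])
```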

An obvious challenge with this proposal is that states are really latent variables and also too complex to learn distributions over. While this is true, that seems like an orthogonal problem to solve.

Also this mindset seems to pave the way for other approaches, e.g. you could maybe have a Halfway function that factors an ambitious hope into smaller ones or something. Though it's a bit tricky because one needs to distinguish correlation and causation.

Downvoted because conditional on this being true, it is harmful to publish. Don't take it personally, but this is content I don't want to see on LW.

Why harmful?

Because it's capability research. It shortens the TAI timeline with little compensating benefit.

It's capability research that is coupled to alignment:

> Furthermore it seems like a win for interpretability and alignment as it gives greater feedback on how the AI intends to earn rewards, and better ability to control those rewards.

Coupling alignment to capabilities is basically what we need to survive, because the danger of capabilities comes from the fact that capabilities is self-funding, thereby risking outracing alignment. If alignment can absorb enough success from capabilities, we survive.

I missed that paragraph on first reading, mea culpa. I think that your story about how it's a win for interpretability and alignment is very unconvincing, but I don't feel like hashing it out atm. Revised to weak downvote.

Also, if you expect this to take off, then by your own admission you are mostly accelerating the current trajectory (which I consider mostly doomed) rather than changing it. Unless you expect it to take off mostly thanks to you?

> Also, if you expect this to take off, then by your own admission you are mostly accelerating the current trajectory (which I consider mostly doomed) rather than changing it. Unless you expect it to take off mostly thanks to you?

Surely your expectation that the current trajectory is mostly doomed depends on your expectation of the technical details of the extension of the current trajectory. If technical specifics emerge that shows the current trajectory to be going in a more alignable direction, it may be fine to accelerate.

Sure, if after updating on your discovery, it seems that the current trajectory is not doomed, it might imply accelerating is good. But, here it is very far from being the case.

> You want to learn an embedding of the opportunities you have in a given state (or for a given state-action), rather than just its potential rewards. Rewards are too sparse of a signal.
>
> More formally, let's say that instead of the Q function, we consider what I would call the Hope function, which, given a state-action pair (s, a), gives you a distribution over the states it expects to visit, weighted by the rewards it will get. This can still be phrased using a Bellman equation:
>
> Hope(s, a) = r(s') + f * Hope(s', a')

The "successor representation" is somewhat close to this. It encodes the distribution over future states a partcular policy expects to visit from a particular starting state, and can be learned via the Bellman equation / TD learning.

Yes, my instant thought too was "this sounds like a variant on a successor function".

Of course, the real answer is that if you are worried about the slowness of bootstrapping back value estimates or short eligibility traces, this mostly just shows the fundamental problem with model-free RL and why you want to use models: models don't need any environmental transitions to solve the use case presented:

> But what if it learns of a path E -> B? Or a shortcut A -> C? Or a path F -> G that gives a huge amount of reward? Because these techniques work by chaining the reward backwards step-by-step, it seems like this would be hard to learn well. Like the Bellman equation will still be approximately satisfied, for instance.

If the MBRL agent has learned a good reward-sensitive model of the environmental dynamics, then it will have already figured out E->B and so on, or could do so offline by planning; or if it had not because it is still learning the environment model, it would have a prior probability over the possibility that E->B gives a huge amount of reward, and it can calculate a VoI and target E->B in the next episode for exploration, and on observing the huge reward, update the model, replan, and so immediately begin taking E->B actions within that episode and all future episodes, and benefiting from generalization because it can also update the model everywhere for all E->B-like paths and all similar paths (which might now suddenly have much higher VoI and be worth targeting for further exploration) rather than simply those specific states' value-estimates, and so on.

(And this is one of the justifications for successor representations: it pulls model-free agents a bit towards model-based-like behavior.)

With MBRL, don't you end up with the same problem, but when planning in the model instead? E.g. DreamerV3 still learns a value function in their actor-critic reinforcement learning that occurs "in the model". This value function still needs to chain the estimates backwards.

It's the 'same problem', maybe, but it's a lot easier to solve when you have an explicit model! You have something you can plan over, don't need to interact with an environment out in the real world, and can do things like tree search or differentiating through the environmental dynamics model to do gradient ascent on the action-inputs to maximize the reward (while holding the model fixed). Same as training the neural network, once it's differentiable - backprop can 'chain the estimates backwards' so efficiently you barely even think about it anymore. (It just holds the input and output fixed while updating the model.) Or distilling a tree search into a NN - the tree search needed to do backwards induction of updated estimates from all the terminal nodes all the way up to the root where the next action is chosen, but that's very fast and explicit and can be distilled down into a NN forward pass.
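As a minimal sketch of the "gradient ascent on the action-inputs while holding the model fixed" point (a toy PyTorch model standing in for a learned return predictor; nothing here is any particular MBRL system):

```python
import torch

class ReturnModel(torch.nn.Module):
    """Toy stand-in for a learned model mapping (state, action plan) -> predicted return."""
    def __init__(self, state_dim=4, action_dim=2, horizon=16):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(state_dim + horizon * action_dim, 64),
            torch.nn.Tanh(),
            torch.nn.Linear(64, 1),
        )
    def forward(self, state, actions):
        return self.net(torch.cat([state, actions.flatten()], dim=-1))

model = ReturnModel()
for p in model.parameters():
    p.requires_grad_(False)  # hold the learned model fixed

state = torch.randn(4)
actions = torch.zeros(16, 2, requires_grad=True)  # the plan is what gets optimized
opt = torch.optim.Adam([actions], lr=0.05)

for _ in range(200):  # no environment interaction at all
    opt.zero_grad()
    loss = -model(state, actions).sum()  # gradient ascent on predicted return
    loss.backward()
    opt.step()
```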

And aside from being able to update within-episode or take actions entirely unobserved before, when you do MBRL, you get to do it at arbitrary scale (thus potentially extremely little wallclock time like an AlphaZero), offline (no environment interactions), potentially highly sample-efficient (if the dataset is adequate or one can do optimal experimentation to acquire the most useful data, like PILCO), with transfer learning to all other problems in related environments (because value functions are mostly worthless outside the exact setting, which is why model-free DRL agents are notorious for overfitting and having zero-transfer), easily eliciting meta-learning and zero-shot capabilities, etc.*

* Why yes, all of this does sound a lot like how you train a LLM today and what it is able to do, how curious

> Same as training the neural network, once it's differentiable - backprop can 'chain the estimates backwards' so efficiently you barely even think about it anymore.

I don't think this is true in general. Unrolling an episode for longer steps takes more resources, and the later steps in the episode become more chaotic. DreamerV3 only unrolls for 16 steps.

> Or distilling a tree search into a NN - the tree search needed to do backwards induction of updated estimates from all the terminal nodes all the way up to the root where the next action is chosen, but that's very fast and explicit and can be distilled down into a NN forward pass.

But when you distill a tree search, you basically learn value estimates, i.e. something similar to a Q function (realistically, V function). Thus, here you also have an opportunity to bubble up some additional information.

> And aside from being able to update within-episode or take actions entirely unobserved before, when you do MBRL, you get to do it at arbitrary scale (thus potentially extremely little wallclock time like an AlphaZero), offline (no environment interactions), potentially highly sample-efficient (if the dataset is adequate or one can do optimal experimentation to acquire the most useful data, like PILCO), with transfer learning to all other problems in related environments (because value functions are mostly worthless outside the exact setting, which is why model-free DRL agents are notorious for overfitting and having zero-transfer), easily eliciting meta-learning and zero-shot capabilities, etc.*

I'm not doubting the relevance of MBRL, I expect that to take off too. What I'm doubting is that future agents will be controlled using scalar utilities/rewards/etc. rather than something more nuanced.

> I don't think this is true in general. Unrolling an episode for longer steps takes more resources, and the later steps in the episode become more chaotic.

Those are two different things. The unrolling of the episode is still very cheap. It's a lot cheaper to unroll a DreamerV3 for 16 steps than it is to go out into the world and run a robot in a real-world task for 16 steps and try to get the NN to propagate updated value estimates the entire way... (Given how small a Dreamer is, it may even be computationally cheaper to do some gradient ascent on it than it is to run whatever simulated environment you might be using! Especially given simulated environments will increasingly be large generative models, which incorporate lots of reward-irrelevant stuff.) The usefulness of the planning is a different thing, and might also be true for other planning methods in that environment too - if the environment is difficult, a tree search with a very small planning budget like just a few rollouts is probably going to have quite noisy choices/estimates too. No free lunches.

> But when you distill a tree search, you basically learn value estimates

This is again doing the same thing as 'the same problem'; yes, you are learning value estimates, but you are doing so better than alternatives, and better is better. The AlphaGo network loses to the AlphaZero network, and the latter, in addition to just being quantitatively much better, also seems to have qualitatively different behavior, like fixing the 'delusions' (cf. AlphaStar).

> What I'm doubting is that future agents will be controlled using scalar utilities/rewards/etc. rather than something more nuanced.

They won't be controlled by something as simple as a single fixed reward function, I think we can agree on that. But I don't find successor-function like representations to be too promising as a direction for how to generalize agents, or, in fact, any attempt to fancily hand-engineer in these sorts of approaches into DRL agents.

These things should be learned. For example, leaning into Decision Transformers and using a lot more conditionalizing through metadata and relying on meta-learning seems much more promising. (When it comes to generative models, if conditioning isn't solving your problems, you're just not using enough conditioning or generative modeling.) A prompt can describe agents and reward functions and the base agent executes that, and whatever is useful about successor-like representations just emerges automatically internally as the solution to the overall family of tasks in turning histories into actions.

> The unrolling of the episode is still very cheap. It's a lot cheaper to unroll a DreamerV3 for 16 steps than it is to go out into the world and run a robot in a real-world task for 16 steps and try to get the NN to propagate updated value estimates the entire way...

But I'm not advocating against MBRL, so this isn't the relevant counterfactual. A pure MBRL-based approach would update the value function to match the rollouts, but e.g. DreamerV3 also uses the value function in a Bellman-like manner to e.g. impute the future reward at the end of an episode. This allows it to plan for further than the 16 steps it rolls out, but it would be computationally intractable to roll out for as far as this ends up planning.

> if the environment is difficult, a tree search with a very small planning budget like just a few rollouts is probably going to have quite noisy choices/estimates too. No free lunches.

It's possible for there to be a kind of chaos where the analytic gradients blow up yet discrete differences have predictable effects. Bifurcations etc..

> They won't be controlled by something as simple as a single fixed reward function, I think we can agree on that. But I don't find successor-function like representations to be too promising as a direction for how to generalize agents, or, in fact, any attempt to fancily hand-engineer in these sorts of approaches into DRL agents.
>
> These things should be learned. For example, leaning into Decision Transformers and using a lot more conditionalizing through metadata and relying on meta-learning seems much more promising. (When it comes to generative models, if conditioning isn't solving your problems, you're just not using enough conditioning or generative modeling.) A prompt can describe agents and reward functions and the base agent executes that, and whatever is useful about successor-like representations just emerges automatically internally as the solution to the overall family of tasks in turning histories into actions.

I agree with things needing to be learned; using the actual states themselves was more of a toy model (because we have mathematical models for MDPs but we don't have mathematical models for "capabilities researchers will find something that can be Learned"), and I'd expect something else to happen. If I was to run off to implement this now, I'd be using learned embeddings of states, rather than states themselves. Though of course even learned embeddings have their problems.

The trouble with just saying "let's use decision transformers" is twofold. First, we still need to actually define the feedback system. One option is to just define reward as the feedback, but as you mention, that's not nuanced enough. You could use some system that's trained to mimic human labels as the ground truth, but this kind of system has flaws for standard alignment reasons.

It seems to me that capabilities researchers are eventually going to find some clever feedback system to use. It will to a great extent be learned, but they're going to need to figure out the learning method too.

Thanks for the link! It does look somewhat relevant.

But I think the weighting by reward (or other significant variables) is pretty important, since it generates a goal to pursue, making it emphasize things that can be achieved rather than just things that might randomly happen.

Though this makes me think about whether there are natural variables in the state space that could be weighted by, without using reward per se. E.g. the size of (s' - s) in some natural embedding, or the variance in s' over all the possible actions that could be taken. Hmm. 🤔