jacob_cannell

I have a compute-market startup called vast.ai, and I'm working towards aligned AI. Currently seeking networking, collaborators, and hires - especially top notch cuda/gpu programmers.

My personal blog: https://entersingularity.wordpress.com/

Posts

Sorted by New

0jacob_cannell's Shortform

19Empowerment is (almost) All We Need

15LOVE in a simbox is all you need

Wiki Contributions

Comments

Current AIs Provide Nearly No Data Relevant to AGI Alignment

jacob_cannell4mo3-3

Yes, but it's because the things you've outlined seem mostly irrelevant to AGI Omnicide Risk to me? It's not how I delineate the relevant parts of the classical view, and it's not what's been centrally targeted by the novel theories

They are critically relevant. From your own linked post ( how I delineate ) :

We only have one shot. There will be a sharp discontinuity in capabilities once we get to AGI, and attempts to iterate on alignment will fail. Either we get AGI right on the first try, or we die.

If takeoff is slow (1) because brains are highly efficient and brain engineering is the viable path to AGI, then we naturally get many shots - via simulation simboxes if nothing else, and there is no sharp discontinuity if moore's law also ends around the time of AGI (an outcome which brain efficiency - as a concept - predicts in advance).

We need to align the AGI's values precisely right.

Not really - if the AGI is very similar to uploads, we just need to align them about as well as humans. Note this is intimately related to 1. and the technical relation between AGI and brains. If they are inevitably very similar then much of the classical AI risk argument dissolves.

You seem to be - like EY circa 2009 - in what I would call the EMH brain camp, as opposed to the ULM camp. It seems given the following two statements, you would put more weight on B than A:

A. The unique intellectual capabilities of humans are best explained by culture: our linguistically acquired mental programs, the evolution of which required vast synaptic capacity and thus is a natural emergent consequence of scaling.

B. The unique intellectual capabilities of humans are best explained by a unique architectural advance via genetic adaptations: a novel 'core of generality'^[1] that differentiates the human brain from animal brains.

This is a EY term; and if I recall correctly he still uses it fairly recently. ↩︎

Current AIs Provide Nearly No Data Relevant to AGI Alignment

jacob_cannell4mo14-4

Said pushback is based on empirical studies of how the most powerful AIs at our disposal currently work, and is supported by fairly convincing theoretical basis of its own. By comparison, the "canonical" takes are almost purely theoretical.

You aren't really engaging with the evidence against the purely theoretical canonical/classical AI risk take. The 'canonical' AI risk argument is implicitly based on a set of interdependent assumptions/predictions about the nature of future AI:

fast takeoff is more likely than slow, downstream dependent on some combo of:

continuation of Moore's Law
feasibility of hard 'diamondoid' nanotech
brain efficiency vs AI
AI hardware (in)-dependence

the inherent 'alien-ness' of AI and AI values
supposed magical coordination advantages of AIs
arguments from analogies: namely evolution

These arguments are old enough that we can now update based on how the implicit predictions of the implied worldviews turned out. The traditional EY/MIRI/LW view has not aged well, which in part can be traced to its dependence on an old flawed theory of how the brain works.

For those who read HPMOR/LW in their teens/20's, a big chunk of your worldview is downstream of EY's and the specific positions he landed on with respect to key scientific questions around the brain and AI. His understanding of the brain came almost entirely from ev psych and cognitive biases literature and this model in particular - evolved modularity - hasn't aged well and is just basically wrong. So this is entangled with everything related to AI risk (which is entirely about the trajectory of AI takeoff relative to human capability).

It's not a coincidence that many in DL/neurosci have a very different view (shards etc). In particular the Moravec view that AI will come from reverse engineering the brain, that progress is entirely hardware constrained and thus very smooth and predictable, that is the view turned out to be mostly all correct. (his late 90's prediction of AGI around 2028 is especially prescient)

So it's pretty clear EY/LW was wrong on 1. - the trajectory of takeoff and path to AGI, and Moravec et al was correct.

Now as the underlying reasons are entangled, Moravec et al was also correct on point 2 - AI from brain reverse engineering is not alien! (But really that argument was just weak regardless.) EY did not seriously consider that the path to AGI would involve training massive neural networks to literally replicate human thoughts.

Point 3 Isn't really taken seriously outside of the small LW sphere. By the very nature of alignment being a narrow target, any two random Unaligned AIs are especially unlikely to be aligned with each other. The idea of a magical coordination advantage is based on highly implausible code sharing premises (sharing your source code is generally a very bad idea, and regardless doesn't and can't actually prove that the code you shared is the code actually running in the world - the grounding problem is formidable and unsolved)

The problem with 4 - the analogy from evolution - is that it factually contradicts the doom worldview - evolution succeeded in aligning brains to IGF well enough despite a huge takeoff in the speed of cultural evolution over genetic evolution - as evidence by the fact that humans have one of the highest fitness scores of any species ever, and almost certainly the fastest growing fitness score.

How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions

jacob_cannell7mo50

Nice work. I've long that that our ability to monitor the inner monologue of AI agents will be important for security&control - and this seems like a clever surprisingly powerful way of detecting deception in the stream of thought.

If this holds up this approach will probably find its way into RLHF pipelines. Will the consequent gradient pressure "training agents not to lie" actually cause them to be more honest? Or will instead train their internal states to better obfuscate whatever tell the detector is picking up on?

AI doom from an LLM-plateau-ist perspective

jacob_cannell1y20

For example, Judea Pearl published the belief propagation algorithm in 1982. Why hadn’t someone already published it in 1962? Or 1922?

Belief propagation is the kind of thing that most people wouldn't work on in an age before computers. It would be difficult to evaluate/test, but more importantly wouldn't have much hope for application. Seems to me it arrived at a pretty normal time in our world.

For example, people have known for decades that flexible hierarchical planning is very important in humans but no one can get it to really work well in AI, especially in a reinforcement learning context.

What do you think of diffusion planning?

AI doom from an LLM-plateau-ist perspective

jacob_cannell1y41

How long have you held your LLM plateau model and how well did it predict GPT4 scaling? How much did you update on GPT4? What does your model predict for (a hypothetical) GPT5?

My answers are basically that I predicted back in 2015 that something not much different than NNs of the time (GPT1 was published a bit after) could scale all the way with sufficient compute, and the main key missing ingredient of 2015 NNs was flexible context/input dependent information routing, which vanilla FF NNs lack. Transformers arrived in 2017^[1] with that key flexible routing I predicted (and furthermore use all previous neural activations as a memory store) which emulates a key brain feature in fast weight plasticity.

GPT4 was something of an update in that they simultaneously scaled up the compute somewhat more than I expected but applied it more slowly - taking longer to train/tune/iterate etc. Also the scaling to downstream tasks was somewhat better than I expected.

All that being said, the transformer arch on GPUs only strongly accelerates training (consolidation/crystallization of past information), not inference (generation of new experience), which explains much of what GPT4 lacks vs a full AGI (although there are other differences that may be important, that is probably primary, but further details are probably not best discussed in public).

Attention is All you Need ↩︎

Thoughts on hardware / compute requirements for AGI

jacob_cannell1y42

I disagree with “uncontroversial”. Just off the top of my head, people who I’m pretty sure would disagree with your “uncontroversial” claim include

Uncontroversial was perhaps a bit tongue-in-cheek, but that claim is specifically about a narrow correspondence between LLMs and linguistic cortex, not about LLMs and the entire brain or the entire cortex.

And this claim should now be uncontroversial. The neuroscience experiments have been done, and linguistic cortex computes something similar to what LLMs compute, and almost certainly uses a similar predictive training objective. It obviously implements those computations in a completely different way on very different hardware, but they are mostly the same computations nonetheless - because the task itself determines the solution.

Examples from recent neurosci literature:

From "Brains and algorithms partially converge in natural language processing":

Deep learning algorithms trained to predict masked words from large amount of text have recently been shown to generate activations similar to those of the human brain. However, what drives this similarity remains currently unknown. Here, we systematically compare a variety of deep language models to identify the computational principles that lead them to generate brain-like representations of sentences

From "The neural architecture of language: Integrative modeling converges on predictive processing":

Here, we report a first step toward addressing this gap by connecting recent artificial neural networks from machine learning to human recordings during language processing. We find that the most powerful models predict neural and behavioral responses across different datasets up to noise levels.

From "Correspondence between the layered structure of deep language models and temporal structure of natural language processing in the human brain"

We found a striking correspondence between the layer-by-layer sequence of embeddings from GPT2-XL and the temporal sequence of neural activity in language areas. In addition, we found evidence for the gradual accumulation of recurrent information along the linguistic processing hierarchy. However, we also noticed additional neural processes that took place in the brain, but not in DLMs, during the processing of surprising (unpredictable) words. These findings point to a connection between language processing in humans and DLMs where the layer-by-layer accumulation of contextual information in DLM embeddings matches the temporal dynamics of neural activity in high-order language areas.

Then “my model of you” would reply that GPT-3 is much smaller / simpler than the brain, and that this difference is the very important secret sauce of human intelligence, and the “thinking per FLOP” comparison should not be brain-vs-GPT-3 but brain-vs-super-scaled-up-GPT-N, and in that case the brain would crush it.

Scaling up GPT-3 by itself is like scaling up linguistic cortex by itself, and doesn't lead to AGI any more/less than that would (pretty straightforward consequence of the LLM <-> linguistic_cortex (mostly) functional equivalence).

In the OP (Section 3.3.1) I talk about why I don’t buy that—I don’t think it’s the case that the brain gets dramatically more “bang for its buck” / “thinking per FLOP” than GPT-3. In fact, it seems to me to be the other way around.

The comparison should between GPT-3 and linguistic-cortex, not the whole brain. For inference the linguistic cortex uses many orders of magnitude less energy to perform the same task. For training it uses many orders of magnitude less energy to reach the same capability, and several OOM less data. In terms of flops-equivalent it's perhaps 1e22 sparse flops for training linguistic cortex (1e13 flops * 1e9 seconds) vs 3e23 flops for training GPT-3. So fairly close, but the brain is probably trading some compute efficiency for data efficiency.

Thoughts on hardware / compute requirements for AGI

jacob_cannell1y1316

He writes that the human brain has “1e13-1e15 spikes through synapses per second (1e14-1e15 synapses × 0.1-1 spikes per second)”. I think Joe was being overly conservative, and I feel comfortable editing this to “1e13-1e14 spikes through synapses per second”, for reasons in this footnote→^[9].

I agree that 1e14 synaptic spikes/second is the better median estimate, but those are highly sparse ops.

So when you say:

So I feel like 1e14 FLOP/s is a very conservative upper bound on compute requirements for AGI. And conveniently for my narrative, that number is about the same as the 8.3e13 FLOP/s that one can perform on the RTX 4090 retail gaming GPU that I mentioned in the intro.

You are missing some foundational differences in how von neumann arch machines (GPUs) run neural circuits vs how neuromorphic hardware (like the brain) runs neural circuits.

The 4090 can hit around 1e14 - even up to 1e15 - flops/s, but only for dense matrix multiplication. The flops required to run a brain model using that dense matrix hardware are more like 1e17 flops/s, not 1e14 flops/s. The 1e14 synapses are at least 10x locally sparse in the cortex, so dense emulation requires 1e15 synapses (mostly zeroes) running at 100hz. The cerebellum is actually even more expensive to simulate .. because of the more extreme connection sparsity there.

But that isn't the only performance issue. The GPU only runs matrix matrix multiplication, not the more general vector matrix multiplication. So in that sense the dense flop perf is useless, and the perf would instead be RAM bandwidth limited and require 100 4090's to run a single 1e14 synapse model - as it requires about 1B of bandwidth per flop - so 1e14 bytes/s vs the 4090's 1e12 bytes/s.

Your reply seems to be "but the brain isn't storing 1e14 bytes of information", but as other comments point out that has little to do with the neural circuit size.

The true fundamental information capacity of the brain is probably much smaller than 1e14 bytes, but that has nothing to do with the size of an actually *efficient* circuit, because efficient circuits (efficient for runtime compute, energy etc) are never also efficient in terms of information compression.

This is a general computational principle, with many specific examples: compressed neural frequency encodings of 3D scenes (NERFs) which access/use all network parameters to decode a single point O(N) are enormously less computationally efficient (runtime throughput, latency, etc) than maximally sparse representations (using trees, hashtables etc) which approach O(log(N)) or O(C), but the sparse representations are enormously less compressed/compact. These tradeoffs are foundational and unavoidable.

We also know that in many cases the brain and some ANN are actually computing basically the same thing in the same way (LLMs and linguistic cortex), and it's now obvious and uncontroversial that the brain is using the sparser but larger version of the same circuit, whereas the LLM ANN is using the dense version which is more compact but less energy/compute efficient (as it uses/accesses all params all the time).

My take on Jacob Cannell’s take on AGI safety

jacob_cannell1y30

One of my disagreements with your U,V,P,W,A model is that I think V & W are randomly-initialized in animals. Or maybe I’m misunderstanding what you mean by “brains also can import varying degrees of prior knowledge into other components”.

I think we agree the cortex/cerebellum are randomly initialized, along with probably most of the hippocampus, BG, perhaps amagdyla? and a few others. But those don't map cleanly to U, W/P, and V/A.

For example, I think most newborn behaviors are purely driven by the brainstem, which is doing things of its own accord without any learning and without any cortex involvement.

Of course - and that is just innate unlearned knowledge in V/A. V/A (value and action) generally go together, because any motor/action skills need pairing with value estimates so the BG can arbitrate (de-conflict) action selection.

The moral is: I claim that figuring out what’s empowering is not a “local” / “generic” / “universal” calculation. If I do X in the morning, it is unknowable whether that was an empowering or disempowering action, in the absence of information about where I’m likely to find myself in in the afternoon. And maybe I can make an intelligent guess at those, but I’m not omniscient. If I were a newborn, I wouldn’t even be able to guess.

Empowerment and value-of-information (curiosity) estimates are always relative to current knowledge (contextual to the current wiring and state of W/P and V/A). Doing X in the morning generally will have variable optionality value depending on the contextual state, goals/plans, location, etc. I'm not sure why you seem to think that I think of optionality-empowerment estimates as requiring anything resembling omniscience.

The newborns VoI and optionality value estimates will be completely different and focused on things like controlling flailing limbs and making sounds, moving the head, etc.

But I don’t know how the baby cats, bats, and humans are supposed to figure that out, via some “generic” empowerment calculation. Arm-flapping is equally immediately useless for both newborn bats and newborn humans, but newborn humans never flap their arms and newborn bats do constantly.

There's nothing to 'figure out' - it just works. If you're familiar with the approximate optionality-empowerment literature, it should be fairly obvious that a generic agent maximizing optionality, will end up flapping it's wing-arms when controlling a bat body, flailing limbs around in a newborn human body, balancing pendulums, learning to walk, etc. I've already linked all this - but maximizing optionality automatically learns all motor skills - even up to bipedal walking.

So yeah, it would be simple and elegant to say “the baby brain is presented with a bunch of knobs and levers and gradually discovers all the affordances of a human body”. But I don’t think that fits the data, e.g. the lack of human newborn arm-flapping experiments in comparison to bats.

Human babies absolutely do the equivalent experiments - most of the difference is simply due to large differences in the arm structure. The bat's long extensible arms are built to flap, the human infants' short stubby arms are built to flail.

Also keep in mind that efficient optionality is approximated/estimated from a sampling of likely actions in the current V/A set, so it naturally and automatically takes advantage of any prior knowledge there. Perhaps the bat does have prior wiring in V/A that proposes&generates simple flapping that can be improved

Instead, I think baby humans have an innate drive to stand up, an innate drive to walk, an innate drive to grasp, and probably a few other things like that. I think they already want to do those things even before they have evidence (or other rational basis to believe) that doing so is empowering.

This just doesn't fit the data at all. Humans clearly learn to stand and walk. They may have some innate bias in V/U which makes that subgoal more attractive, but that is intrinsically more complex addition to the basic generic underlying optionality control drive.

I claim that this also fits better into a theory where (1) the layout of motor cortex is relatively consistent between different people (in the absence of brain damage),

We've already been over that - consistent layout is not strong evidence of innate wiring. A generic learning system will learn similar solutions given similar inputs & objectives.

(2) decorticate rats can move around in more-or-less species-typical ways,

The general lesson from the decortication experiments is that smaller brain mammals rely on (their relatively smaller) cortex less. Rats/rabbits can do much without the cortex and have many motor skills available at birth. Cats/dogs need to learn a bit more, and then primates - especially larger ones - need to learn much more and rely on the cortex heavily. This is extreme in humans, to the point where there is very little innate motor ability left, and the cortex does almost everything.

(3) there’s strong evolutionary pressure to learn motor control fast and we know that reward-shaping is certainly helpful for that,

It takes humans longer than an entire rat lifespan just to learn to walk. Hardly fast.

(4) and that there’s stuff in the brainstem that can do this kind of reward-shaping,

Sure, but there is hardly room in the brainstem to reward-shape for the different things humans can learn to do.

Universal capability requires universal learning.

(5) lots of animals can get around reasonably well within a remarkably short time after birth,

Not humans.

(6) stimulating a certain part of the brain can create “an urge to move your arm” etc. which is independent from executing the actual motion,

Unless that is true for infants, it's just learned V components. I doubt infants have an urge to move the arm in a coordinated way, vs lower level muscle 'urges', but even if they did that's just some prior knowledge in V.

(If you put a novel and useful motor affordance on a baby human—some funny grasper on their hand or something—I’m not denying that they would eventually figure out how to start using it, thanks to more generic things like curiosity,

We know that humans can learn to see through their tongue - and this does not take much longer than an infant learning to see through its eyes.

I think we both agree that sensory cortex uses a pretty generic universal learning algorithm (driven by self supervised predictive learning). I just also happen to believe the same applies to motor and higher cortex (driven by some mix of VoI, optionality control, etc).

I think we’re giving baby animals too much credit if we expect them to be thinking to themselves “gee when I grow up I might need to be good at fighting so I should practice right now instead of sitting on the comfy couch”. I claim that there isn’t any learning signal or local generic empowerment calculation that would form the basis for that

Comments like these suggest you don't have the same model of optionality-empowerment as I do. When the cat was pinned down by the dog in the past, it's planning subsystem computed low value for that state - mostly based on lack of optionality - and subsequently the V system internalizes this as low value for that state and states leading towards it. Afterwards when entering a room and seeing the dog on the other side, the W/P planning system quickly evaluates a few options like: (run into the center and jump up onto the table), (run into the center and jump onto the couch), (run to the right and hide behind the couch), etc - and subplan/action (run into the center ..) gets selected in part because of higher optionality. It's just an intrinsic component of how the planning system chooses options on even short timescales, and chains recursively through training V/A.

My take on Jacob Cannell’s take on AGI safety

jacob_cannell1y30

I'll start with a basic model of intelligence which is hopefully general enough to cover animals, humans, AGI, etc. You have a model-based agent with a predictive world model W learned primarily through self-supervised predictive learning (ie learning to predict the next 'token' for a variety of tokens), a planning/navigation subsystem P which uses W to approximately predict sample important trajectories according to some utility function U, a value function V which computes the immediate net expected discounted future utility of actions from current state (including internal actions), and then some action function A which just samples high value actions based on V. The function of the planning subsystem P is then to train/update V.

The utility function U obviously needs some innate bootstrapping, but brains also can import varying degrees of prior knowledge into other components - and most obviously into V, the value function. Many animals need key functionality 'out of the box', which you can get by starting with a useful prior on V/A. The benefit for innate prior knowledge in V/A diminishes as brains scale up in net training compute (size * training time), so that humans - with net training compute ~1e25 ops vs ~1e21 ops for a cat - rely far more on learned knowledge for V/A rather than prior/innate knowledge.

So now to translate into your 3 levels:

A.): Innate drives: Innate prior knowledge in U and in V/A.

B.): Learned from experience and subsumed into system 1: using W/P to train V/A.

C.): System 2 style reasoning: zero shot reasoning from W/P.

(1) Evidence from cases where we can rule out (C), e.g. sufficiently simple and/or young humans/animals

So your A.) - innate drives - corresponds to U or the initial state of V/A at birth. I agree the example of newborn rodents avoiding birdlike shadows is probably mostly innate V/A - value/action function prior knowledge.

(2) Evidence from sufficiently distant consequences that we can rule out (B) Example: Many animals will play-fight as children. This has a benefit (presumably) of eventually making the animals better at actual fighting as adults. But the animal can’t learn about that benefit via trial-and-error—the benefit won’t happen until perhaps years in the future.

Sufficiently distant consequences is exactly what empowerment is for, as the universal approximator of long term consequences. Indeed the animals can't learn about that long term benefit through trial-and-error, but that isn't how most learning operates. Learning is mostly driven by the planning system 1 - M/P - which drives updates to V/A based on both current learned V and U - and U by default is primarily estimating empowerment and value of information as universal proxies.

The animals play-fighting is something I have witnessed and studied recently. We have a young dog and a young cat who organically have learned to play several 'games'. The main game is a simple chase where the larger dog tries to tackle the cat. The cat tries to run/jump to safety. If the dog succeeds in catching the cat, the dog will tackle constrain it on the ground, teasing it for a while. We - the human parents - often will interrupt the game at this point and occasionally punish the dog if it plays too rough and the cat complains. In the earliest phases the cat was about as likely to chase and attack the dog as the other way around, but over time learned it would near always lose wrestling matches and up in a disempowered state.

There is another type of ambush game the cat will play in situations where it can 'attack' the dog from safety or in range to escape to safety, and then other types of less rough play fighting they do close to us.

So I suspect that some amount of play fighting skill knowledge is prior instinctual, but much of it is also learned. The dog and cat both separately enjoy catching/chasing balls or small objects, the cat play fights and 'attacks' other toys, etc. So early on in their interactions they had these skills available, but those alone are not sufficient to explain the game(s) they play together.

The chase game is well explained by empowerment drive: the cat has learned that allowing the dog to chase it down leads to an intrinsically undesirable disempowered state. This is a much better fit for the data and also has much lower intrinsic complexity than a bunch of innate drives for every specific disempowered situation, vs a general empowerment drive. It's also empowering for the dog to control and disempower the cat to some extent. So much of innate hunting skill drives seem like just variations and/or mild tweaks to empowerment.

The only part of this that requires a more specific explanation is perhaps the safety aspect of play fighting: each animal is always pulling punches to varying degrees, the cat isn't using fully extended claws, neither is biting with full force, etc. That is probably the animal equivalent of empathy/altruism.

Status—I’m not sure whether Jacob is suggesting that human social status related behaviors are explained by (B) or (C) or both. But anyway I think 1,2,3,4 all push towards an (A)-type explanation for human social status behaviors. I think I would especially start with 3 (heritability)—if having high social status is generally useful for achieving a wide variety of goals, and that were the entire explanation for why people care about it, then it wouldn’t really make sense that some people care much more about status than others do, particularly in a way that (I’m pretty sure) statistically depends on their genes

Status is almost all learned B: system 2 W/P planning driving system 1 V/A updates.

Earlier I said - and I don't see your reply yet, so i'll repeat it here:

Infants don't even know how to control their own limbs, but they automatically learn through a powerful general empowerment learning mechanism. That same general learning signal absolutely does not - and can not - discriminate between hidden variables representing limb poses (which it seeks to control) and hidden variables representing beliefs in other humans minds (which determine constraints on the child's behavior). It simply seeks to control all such important hidden variables.

Social status drive emerges naturally from empowerment, which children acquire by learning cultural theory of mind and folk game theory through learning to communicate with and through their parents. Children quickly learn that hidden variables in their parents have huge effect on their environment and thus try to learn how to control those variables.

It's important to emphasize that this is all subconscious and subsumed into the value function, it's not something you are consciously aware of.

I don't see how heritability tells us much about how innate social status is. Genes can control many hyperparms which can directly or indireclty influence the later learned social status drive. One obvious example is just the relevant weightings of value-of-information (curiosity) vs optionality-empowerment and other innate components of U at different points in time (development periods). I think this is part of the explanation for children who are highly curious about the world and less concerned about social status vs the converse.

Fun—Jacob writes “Fun is also probably an emergent consequence of value-of-information and optionality” which I take to be a claim that “fun” is (B) or (C), not (A). But I think it’s (A).

Fun is complex and general/vague - it can be used to describe almost anything we derive pleasure from in your A.) or B.) categories.

Reward is not the optimization target

jacob_cannell1y24

Not if exploration is on-policy, or if the agent reflectively models and affects its training process. In either case, the agent can zero out its exploration probability of the maze, so as to avoid predictable value drift towards blueberries. The agent would correctly model that if it attained the blueberry, that experience would enter its data distribution and the agent would be updated so as to navigate towards blueberries instead of raspberries, which leads to fewer raspberries, which means the agent doesn't navigate to that future.

If this agent is smart/reflective enough to model/predict the future effects of its RL updates, then you already are assuming a model-based agent which will then predict higher future reward by going for the blueberry. You seem to be assuming the bizarre combination of model-based predictive capability for future reward gradient updates but not future reward itself. Any sensible model-based agent would go for the blueberry absent some other considerations.

This is not just purely speculation in the sense that you can run efficient zero in scenarios like this, and I bet it goes for the blueberry.

Your mental model seems to assume pure model-free RL trained to the point that it gains some specific model-based predictive planning capabilities without using those same capabilities to get greater reward.

Humans often intentionally avoid some high reward 'blueberry' analogs like drugs using something like the process you describe here, but hedonic reward is only one component of the human utility function, and our long term planning instead optimizes more for empowerment - which is usually in conflict with short term hedonic reward.