Kaj Sotala

At least ChatGPT seems to have a longer context window, with this experiment suggesting 8192 tokens.

And yeah, I do think the thing I am most worried about with ChatGPT, in addition to just shortening timelines, is increasing the number of actors in the space, which also has indirect effects on timelines. A world where both Microsoft and Google are doubling down on AI is probably also a world where AI regulation has a much harder time taking off.

Maybe - but Microsoft and Google are huge organizations, and huge organizations have an incentive to push for regulation that imposes costs that they can pay while disproportionately hampering smaller competitors. It seems plausible to me that both M & G might prefer a regulatory scheme that overall slows down progress while cementing their dominance, since that would be a pretty standard regulatory-capture-driven-by-the-dominant-actors-in-the-field kind of scenario.

A sudden wave of destabilizing AI breakthroughs - with DALL-E/Midjourney/Stable Diffusion suddenly disrupting art and ChatGPT disrupting who knows how many other things - can also make people on the street concerned, making them both more supportive of AI regulation in general and more inclined to take AGI scenarios seriously in particular. I recently saw a blog post from someone speculating that this might cause a wide variety of actors - M & G included - with a desire to slow down AI progress to join forces to push for widespread regulation.

I was a bit surprised to see Eliezer invoke the Wason Selection Task. I'll admit that I haven't actually thought this through rigorously, but my sense was that modern machine learning had basically disproven the evpsych argument that those experimental results require the existence of a separate cheating-detection module, as well as generally calling the whole massive modularity thesis into severe question, since the kinds of results that evpsych used to explain with dedicated innate modules now look a lot more like something that could be produced by something like GPT.

... but again, I never really thought this through explicitly; it was just a general shift of intuitions that happened over several years, and maybe it's wrong.

No worries!

That being said, I think it's a bit weird to have "habitual subagents", since the word "agent" seems to imply some amount of goal-directedness.

Yeah, I did drift towards more generic terms like "subsystems" or "parts" later in the series for this reason, and might have changed the name of the sequence if only I'd managed to think of something better. (Terms like "subagents" and "multi-agent models of mind" still gesture away from rational agent models in a way that more generic terms like "subsystems" don't.)

Unlike in subagent models, the subcomponents of agents are not themselves always well modeled as (relatively) rational agents. For example, there might be shards that are inactive most of the time and only activate in a few situations.

For what it's worth, at least in my conception of subagent models, there can also be subagents that are inactive most of the time and only activate in a few situations. That's probably the case for most of a person's subagents, though of course "subagent" isn't necessarily a concept that cuts reality at its joints, so this depends on where exactly you'd draw the boundaries for specific subagents.

Shard theory claims that the process that maps shards to actions can be modeled as making “bids” to a planner. That is, instead of shards directly voting on actions, they attempt to influence the planner in ways that have “historically increased the probability of executing plans” favored by the shard. For example, if the juice shard bringing a memory of consuming juice to conscious attention has historically led to the planner outputting plans where the baby consumes more juice, then the juice shard will be shaped via reinforcement learning to recall memories of juice consumption at opportune times. On the other hand, if raising the presence of a juice pouch to the planner’s attention has never been tried in the past, then we shouldn’t expect the juice shard to attempt this more so than any other random action.

This is another way in which shard theory differs from subagent models—by default, shards aren’t doing their own planning or search; they merely execute strategies that are learned via reinforcement learning.

This is actually close to the model of subagents that I had in "Subagents, akrasia, and coherence in humans" and "Subagents, neural Turing machines, thought selection, and blindspots". The former post talks about subagents sending competing bids to a selection mechanism that picks the winning bid based on (among other things) reinforcement learning and the history of which subagents have made successful predictions in the past. It also distinguishes between "goal-directed" and "habitual" subagents, where "habitual" ones are mostly executing reinforced strategies rather than doing planning. 

The latter post talks about learned rules which shape our conscious content, and how some of the appearance of planning and search may actually come from reinforcement learning creating rules that modify consciousness in specific ways (e.g. the activation of an "angry" subagent frequently causing harm, with reward then accruing to selection rules that block the activation of the "angry" subagent such as by creating a feeling of confusion instead, until it looks like there is a "confusion" subagent that "wants" to block the feeling of anger).
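To make the bid-selection picture concrete, here's a hypothetical toy sketch of it in Python. Everything here - the subagent names, the weight-update rule, the reward values - is made up for illustration; the point is just that habitual subagents need no internal planning, only bid strengths that a reinforcement signal adjusts over time.

```python
import random

class Subagent:
    """A 'habitual' subagent: it bids a fixed action, with a bid weight
    that stands in for its reinforcement history."""
    def __init__(self, name, action, weight=1.0):
        self.name = name
        self.action = action
        self.weight = weight

def select_action(subagents):
    """The selection mechanism: pick a winning bid with probability
    proportional to each subagent's learned weight."""
    total = sum(s.weight for s in subagents)
    r = random.uniform(0, total)
    for s in subagents:
        r -= s.weight
        if r <= 0:
            return s
    return subagents[-1]

def reinforce(subagent, reward, lr=0.1):
    """Strengthen or weaken future bids based on the reward that
    followed this subagent's winning bid."""
    subagent.weight = max(0.01, subagent.weight + lr * reward)

# Toy run: a "juice" subagent whose winning bids keep getting rewarded
# gradually comes to dominate the selection mechanism.
random.seed(0)
agents = [Subagent("juice", "drink juice"), Subagent("anger", "yell")]
for _ in range(100):
    winner = select_action(agents)
    reward = 1.0 if winner.name == "juice" else -0.5
    reinforce(winner, reward)
```

Note that neither subagent ever searches over plans; the apparent goal-directedness emerges entirely from which bids have been reinforced in the past.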

I think Anders Sandberg did research on this at one point, and I recall him summarizing his findings as "things are easy to ban as long as nobody really wants to have them". IIRC, the things that went into that category were chemical weapons (they're actually not very effective in modern warfare), CFCs (they were relatively straightforward to replace with equally effective alternatives), and human cloning.

I currently guess that even the most advanced shards won't have private world-models which they can query in relative isolation from the rest of the shard economy.

What's your take on "parts work" techniques like IDC, IFS, etc. seeming to bring up something like private (or at least not completely shared) world models? Do you consider the kinds of "parts" those access as being distinct from shards?

I would find it plausible to assume by default that shards have something like differing world models, since we know from cognitive psychology that emotional states tend to activate mood-congruent memories (it's easier to remember negative things about your life when you're upset than when you're happy), and different emotional states tend to activate different shards.

I suspect that something like the Shadlen & Shohamy take on decision-making might be going on:

The proposal is that humans make choices based on subjective value [...] by perceiving a possible option and then retrieving memories which carry information about the value of that option. For instance, when deciding between an apple and a chocolate bar, someone might recall how apples and chocolate bars have tasted in the past, how they felt after eating them, what kinds of associations they have about the healthiness of apples vs. chocolate, any other emotional associations they might have (such as fond memories of their grandmother’s apple pie) and so on.

Shadlen & Shohamy further hypothesize that the reason why the decision process seems to take time is that different pieces of relevant information are found in physically disparate memory networks and neuronal sites. Access from the memory networks to the evidence accumulator neurons is physically bottlenecked by a limited number of “pipes”. Thus, a number of different memory networks need to take turns in accessing the pipe, causing a serial delay in the evidence accumulation process.

Under that view, I think that shards would effectively have separate world models, since each physically separate memory network suggesting that an action is good or bad is effectively its own shard; and since a memory network is a miniature world model, there's a sense in which shards are nothing but separate world models. 

E.g. the memory of "licking the juice tasted sweet" is a miniature world model according to which licking the juice lets you taste something sweet, and is also a shard. (Or at least it forms an important component of a shard.) That miniature world model is separate from the shard/memory network/world model holding instances of times when adults taught the child to say "thank you" when given something; the latter shard only has a world model of situations where you're expected to say "thank you", and no world model of the consequences of licking juice.
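The serial-bottleneck idea above can be sketched as a toy simulation. This is my own illustrative rendering of the Shadlen & Shohamy picture, not their actual model; the options, per-network values, noise level, and decision threshold are all invented for the example.

```python
import random

def accumulate_evidence(memory_networks, threshold=3.0, max_steps=1000):
    """Toy serial evidence accumulation: memory networks take turns
    pushing their value signal through a single shared 'pipe' into an
    accumulator, and a decision fires once the accumulated evidence
    for one option crosses the threshold.
    Returns (choice, number of serial steps taken)."""
    totals = {option: 0.0 for option, _ in memory_networks}
    step = 0
    while step < max_steps:
        # Serial bottleneck: only one network accesses the pipe per step.
        option, value = memory_networks[step % len(memory_networks)]
        totals[option] += value + random.gauss(0, 0.1)  # noisy recall
        step += 1
        best = max(totals, key=totals.get)
        if totals[best] >= threshold:
            return best, step
    return max(totals, key=totals.get), step

random.seed(0)
# Each (option, value) pair is one memory network - one miniature
# world model carrying evidence about the value of an option.
networks = [("chocolate", 0.8), ("apple", 0.3),
            ("chocolate", 0.6), ("apple", 0.2)]
choice, steps = accumulate_evidence(networks)
```

The time cost of deciding falls out naturally: more memory networks taking turns at the pipe means more serial steps before any option's total crosses the threshold.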

I think Shard Theory is one of the most promising approaches to human values that I've seen on LW, and I'm very happy to see this work posted. (Of course, I'm probably biased in that I also count my own approaches to human values among the most promising, and Shard Theory shares a number of similarities with them - e.g. this post talks about something-like-shards issuing mutually competitive bids that get strengthened or weakened depending on how environmental factors activate those shards, and this post talked about values and world-models being learned in an intertwined manner.)

The one big coordination win I recall us having was the 2015 Research Priorities document that among other things talked about the threat of superintelligence. The open letter it was an attachment to was signed by over 8000 people, including prominent AI and ML researchers.

And then there's basically been nothing of equal magnitude since then.

I would presume that the AI would know that humans are likely to try to resist a takeover attempt, and to have various safeguards against it. It might be smart enough to be able to overcome any human response, but that seems to only work if it actually puts that intelligence to work by thinking about what (if anything) it needs to do to counteract the human response. 

More generally, humans are such a major influence on the world, as well as such a source of potential resources, that it would seem really odd for any superintelligence to arrive at a world-takeover plan without at any point considering how the plan will affect humanity and whether that suggests any changes to it.
