Kaj Sotala


I currently guess that even the most advanced shards won't have private world-models which they can query in relative isolation from the rest of the shard economy.

What's your take on "parts work" techniques like IDC, IFS, etc. seeming to bring up something like private (or at least not completely shared) world models? Do you consider the kinds of "parts" those access as being distinct from shards?

I would find it plausible to assume by default that shards have something like differing world models, since we know from cognitive psychology that different emotional states tend to activate mood-congruent memories (it's easier to remember negative things about your life when you're upset than when you're happy), and different emotional states also tend to activate different shards.

I suspect that something like the Shadlen & Shohamy take on decision-making might be going on:

The proposal is that humans make choices based on subjective value [...] by perceiving a possible option and then retrieving memories which carry information about the value of that option. For instance, when deciding between an apple and a chocolate bar, someone might recall how apples and chocolate bars have tasted in the past, how they felt after eating them, what kinds of associations they have about the healthiness of apples vs. chocolate, any other emotional associations they might have (such as fond memories of their grandmother’s apple pie) and so on.

Shadlen & Shohamy further hypothesize that the reason why the decision process seems to take time is that different pieces of relevant information are found in physically disparate memory networks and neuronal sites. Access from the memory networks to the evidence accumulator neurons is physically bottlenecked by a limited number of “pipes”. Thus, a number of different memory networks need to take turns in accessing the pipe, causing a serial delay in the evidence accumulation process.
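The serial-bottleneck picture described above can be sketched as a toy simulation. All names and numbers here are my own illustrative assumptions, not anything from Shadlen & Shohamy: several memory networks each hold private evidence about an option, but only a limited number of "pipes" can feed the evidence accumulator per time step, so evidence arrives serially and the decision takes longer the more networks must take turns.

```python
import random

def decide(memory_networks, n_pipes=1, threshold=3.0, seed=0):
    """Toy serial evidence accumulator.

    memory_networks: list of per-network evidence values (positive = "good",
    negative = "bad"), each standing in for a physically separate memory
    network with its own miniature world model.
    Only n_pipes networks can report per time step, so accumulation is serial.
    """
    rng = random.Random(seed)
    queue = memory_networks[:]
    rng.shuffle(queue)          # networks take turns in arbitrary order
    total, steps = 0.0, 0
    while queue and abs(total) < threshold:
        # a limited number of "pipes" bottlenecks access to the accumulator
        reporting, queue = queue[:n_pipes], queue[n_pipes:]
        total += sum(reporting)
        steps += 1
    choice = "accept" if total >= 0 else "reject"
    return choice, total, steps

# e.g. deciding about "apple": taste memories, health associations,
# fond memories of grandmother's apple pie, one mildly negative memory...
apple_evidence = [1.0, 0.5, 1.2, -0.3, 0.8]
print(decide(apple_evidence, n_pipes=1))
print(decide(apple_evidence, n_pipes=3))  # wider pipe, fewer serial steps
```

In this sketch, widening the pipe (more networks reporting per step) shortens the decision without changing its outcome, which is the core of the proposed explanation for why deliberation takes time.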

Under that view, I think that shards would effectively have separate world models, since each physically separate memory network suggesting that an action is good or bad is effectively its own shard; and since a memory network is a miniature world model, there's a sense in which shards are nothing but separate world models. 

E.g. the memory of "licking the juice tasted sweet" is a miniature world model according to which licking the juice lets you taste something sweet, and is also a shard. (Or at least it forms an important component of a shard.) That miniature world model is separate from the shard/memory network/world model holding instances of times when adults taught the child to say "thank you" when given something; the latter shard only has a world model of situations where you're expected to say "thank you", and no world model of the consequences of licking juice.

I think Shard Theory is one of the most promising approaches to human values that I've seen on LW, and I'm very happy to see this work posted. (Of course, I'm probably biased in that I also count my own approaches to human values among the most promising, and Shard Theory shares a number of similarities with them - e.g. this post talks about something-like-shards issuing mutually competitive bids that get strengthened or weakened depending on how environmental factors activate those shards, and this post talked about values and world-models being learned in an intertwined manner.)

The one big coordination win I recall us having was the 2015 Research Priorities document, which among other things talked about the threat of superintelligence. The open letter it was attached to was signed by over 8,000 people, including prominent AI and ML researchers.

And then there's basically been nothing of equal magnitude since then.

I would presume that the AI would know that humans are likely to try to resist a takeover attempt, and to have various safeguards against it. It might be smart enough to be able to overcome any human response, but that seems to only work if it actually puts that intelligence to work by thinking about what (if anything) it needs to do to counteract the human response. 

More generally, humans are such a major influence on the world, as well as a source of potential resources, that it would seem really odd for any superintelligence to arrive at a world-takeover plan without at any point considering how the plan will affect humanity and whether that suggests any changes to it.

That way, the people who end up reading it are at least more likely to be plugged into the LW ecosystem, and are also going to get exposed to arguments about AI risk.

There's also the chance that, if these posts are not gated, people who previously weren't plugged into the LW ecosystem but are interested in AI will find LW through articles such as this one, and then eventually start reading other articles here and become more interested in alignment concerns.

There's also a bit of a negative stereotype among some AI researchers of alignment people as theoretical philosophers who do their own thing and are entirely out of touch with what real AI is like. They might take alignment concerns a bit more seriously if they found it easy to find competent AI discussion on LW / the Alignment Forum.

This, in turn, implies that human values/biases/high-level cognitive observables are produced by relatively simpler hardcoded circuitry, specifying e.g. the learning architecture, the broad reinforcement learning and self-supervised learning systems in the brain, and regional learning hyperparameters.

See also the previous LW discussion of The Brain as a Universal Learning Machine.

... the evolved modularity cluster posits that much of the machinery of human mental algorithms is largely innate. General learning - if it exists at all - exists only in specific modules; in most modules learning is relegated to the role of adapting existing algorithms and acquiring data; the impact of the information environment is de-emphasized. In this view the brain is a complex messy cludge of evolved mechanisms.

There is another viewpoint cluster, more popular in computational neuroscience (especially today), that is almost the exact opposite of the evolved modularity hypothesis. I will rebrand this viewpoint the "universal learner" hypothesis, aka the "one learning algorithm" hypothesis (the rebranding is justified mainly by the inclusion of some newer theories and evidence for the basal ganglia as a 'CPU' which learns to control the cortex). The roots of the universal learning hypothesis can be traced back to Mountcastle's discovery of the simple uniform architecture of the cortex.[6]

The universal learning hypothesis proposes that all significant mental algorithms are learned; nothing is innate except for the learning and reward machinery itself (which is somewhat complicated, involving a number of systems and mechanisms), the initial rough architecture (equivalent to a prior over mindspace), and a small library of simple innate circuits (analogous to the operating system layer in a computer). In this view the mind (software) is distinct from the brain (hardware). The mind is a complex software system built out of a general learning mechanism.

Conversely, the genome can access direct sensory observables, because those observables involve a priori-fixed “neural addresses.” For example, the genome could hardwire a cute-face-detector which hooks up to retinal ganglion cells (which are at genome-predictable addresses), and then this circuit could produce physiological reactions (like the release of reward). This kind of circuit seems totally fine to me.

Related: evolutionary psychology used to have a theory according to which humans had a hardwired fear of some stimuli (e.g. spiders and snakes). But more recent research has moved towards a model where, rather than "the fear system" itself having innate biases towards picking up particular kinds of fears, our sensory system (which brings in the data that the fear system then learns from) is biased towards paying extra attention to the kinds of shapes that look like spiders and snakes. Because these stimuli then receive more attention than others, it also becomes more probable that a fear response gets paired with them.

This, in turn, implies that human values/biases/high-level cognitive observables are produced by relatively simpler hardcoded circuitry, specifying e.g. the learning architecture, the broad reinforcement learning and self-supervised learning systems in the brain, and regional learning hyperparameters. 

The original WEIRD paper is worth reading for anyone who hasn't already done so; it surveyed various cross-cultural studies which showed that a variety of things that one might assume to be hardwired were actually significantly culturally influenced, including things such as optical illusions:

Many readers may suspect that tasks involving “low-level” or “basic” cognitive processes such as vision will not vary much across the human spectrum (Fodor 1983). However, in the 1960s an interdisciplinary team of anthropologists and psychologists systematically gathered data on the susceptibility of both children and adults from a wide range of human societies to five “standard illusions” (Segall et al. 1966). Here we highlight the comparative findings on the famed Müller-Lyer illusion, because of this illusion’s importance in textbooks, and its prominent role as Fodor’s indisputable example of “cognitive impenetrability” in debates about the modularity of cognition (McCauley & Henrich 2006). Note, however, that population-level variability in illusion susceptibility is not limited to the Müller-Lyer illusion; it was also found for the Sander-Parallelogram and both Horizontal-Vertical illusions.

Segall et al. (1966) manipulated the length of the two lines in the Müller-Lyer illusion (Fig. 1) and estimated the magnitude of the illusion by determining the approximate point at which the two lines were perceived as being of the same length. Figure 2 shows the results from 16 societies, including 14 small-scale societies. The vertical axis gives the “point of subjective equality” (PSE), which measures the extent to which segment “a” must be longer than segment “b” before the two segments are judged equal in length. PSE measures the strength of the illusion.

The results show substantial differences among populations, with American undergraduates anchoring the extreme end of the distribution, followed by the South African-European sample from Johannesburg. On average, the undergraduates required that line “a” be about a fifth longer than line “b” before the two segments were perceived as equal. At the other end, the San foragers of the Kalahari were unaffected by the so-called illusion (it is not an illusion for them). While the San’s PSE value cannot be distinguished from zero, the American undergraduates’ PSE value is significantly different from all the other societies studied.

As discussed by Segall et al., these findings suggest that visual exposure during ontogeny to factors such as the “carpentered corners” of modern environments may favor certain optical calibrations and visual habits that create and perpetuate this illusion. That is, the visual system ontogenetically adapts to the presence of recurrent features in the local visual environment. Because elements such as carpentered corners are products of particular cultural evolutionary trajectories, and were not part of most environments for most of human history, the Müller-Lyer illusion is a kind of culturally evolved by-product (Henrich 2008).

These findings highlight three important considerations. First, this work suggests that even a process as apparently basic as visual perception can show substantial variation across populations. If visual perception can vary, what kind of psychological processes can we be sure will not vary? It is not merely that the strength of the illusory effect varies across populations – the effect cannot be detected in two populations. Second, both American undergraduates and children are at the extreme end of the distribution, showing significant differences from all other populations studied; whereas, many of the other populations cannot be distinguished from one another. Since children already show large population-level differences, it is not obvious that developmental work can substitute for research across diverse human populations. Children likely have different developmental trajectories in different societies. Finally, this provides an example of how population-level variation can be useful for illuminating the nature of a psychological process, which would not be as evident in the absence of comparative work.
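The PSE measure from the quoted passage can be made concrete with a minimal sketch (the numbers below are illustrative, not Segall et al.'s raw data): if an observer judges segment "a" equal to segment "b" only once "a" is physically 20% longer, their PSE is 0.2.

```python
def point_of_subjective_equality(judged_equal_length_a, length_b):
    """PSE: the proportion by which segment 'a' must exceed segment 'b'
    before the two segments are perceived as equal in length."""
    return (judged_equal_length_a - length_b) / length_b

# Illustrative values: a strongly illusion-susceptible observer needs
# "a" to be 12 units long to match a 10-unit "b"; a non-susceptible
# observer matches them at equal physical lengths.
print(point_of_subjective_equality(12.0, 10.0))  # 0.2, strong illusion
print(point_of_subjective_equality(10.0, 10.0))  # 0.0, no illusion
```

A PSE of roughly 0.2 corresponds to the American undergraduate sample in the quoted results, while a PSE indistinguishable from zero corresponds to the San foragers.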

A reasonable counterargument is that the analogy doesn't hold up because fitness-as-optimisation-target isn't a good way to characterise evolution as an optimiser.

Yes, that was my argument in the comment that I linked. :)

5. Empirical evidence: human intelligence generalised far without staying aligned with its optimisation target.

I think this one is debatable. It seems to me that human intelligence has remained reasonably well aligned with its optimization target, if its optimization target is defined as "being well-fed, getting social status, remaining healthy, having sex, raising children, etc.", i.e. the things that evolution actually could optimize humans for rather than something like inclusive fitness that it couldn't directly optimize for. Yes there are individual humans who are less interested in pursuing particular pieces on that list (e.g. many prefer not to have children), but that's because the actual thing being optimized is a combination of those variables that's sensitive to local conditions. Any goal drift is then from a changed environment that acts as an input to the optimization target, rather than from an increase in capabilities as such.

Animal breeding would be a better analogy, and seems to suggest a different and much more tentative conclusion. For example, if humans were being actively bred for corrigibility and friendliness, it looks to me like they would quite likely be corrigible and friendly up through the current distribution of human behavior.

I was just thinking about this. The central example that's often used here is "evolution optimized humans for inclusive genetic fitness, nonetheless humans do not try to actually maximize the amount of their surviving offspring, such as by everyone wanting to donate to sperm/egg banks".

But evolution does not seem to maximize fitness in that sense, where the fitness of a species would be a distinct thing-in-the-world that could be directly observed and optimized for. Something like "docileness" or "size", as used in animal breeding, would be a much better analogy, since those things are something that you can directly observe and optimize for - and human breeders do.

And... if humans had been explicitly bred for friendliness and corrigibility for a while, it seems to me that they likely would want to do the analogous thing of maximizing-their-donations-to-sperm/egg-banks. After all, we can already see that people who are high on either end of some personality trait such as altruism/selfishness, dominance/submission, openness/conservatism, etc., are likely to view that trait as a virtue (as long as nothing in the environment too overwhelmingly disproves this) and seek to become even more like that.

Altruistic people often want to become even more altruistic, selfish people eliminate their altruistic "weaknesses", dominant people to become more dominant, submissive people to make it easier for themselves to submit (this has some strong counterforces in our culture where submissiveness is generally considered undesirable, but you can still see it valued in e.g. workplace cultures where workers resent reforms that would give them more autonomy, preferring bosses to "just tell them what to do"), open people to become more open to experience, and so on.

Probably if people high on such traits were offered chances to self-modify to become even more so - which seems analogous to the sperm/egg bank thing, since it's the cognitive optimization form of the instinctive thing - quite a few of them would.
