Gunnar_Zarncke — AI Alignment Forum

Software engineering, parenting, cognition, meditation, other
LinkedIn, Facebook, Admonymous (anonymous feedback)

Hi, is there a way to get people in touch with a project or project lead? For example, I'd like to get in touch with Masaharu Mizumoto because iVAIS sounds related to the aintelope project.

I notice that o1's behavior (its cognitive process) looks suspiciously like human behaviors:

Cognitive dissonance: o1 might fabricate or rationalize to maintain internal consistency of conflicting data (which means there is inconsistency).
Impression management/Self-serving bias: o1 may attempt to appear knowledgeable or competent, leading to overconfidence because it is rewarded for the look more than for the content (which means the model is stronger than the feedback).

But why is this happening more when o1 can reason more than previous models? Shouldn't that give it more ways to catch its own deception?

No:

Overconfidence in plausibility: With enhanced reasoning, o1 can generate more sophisticated explanations or justifications, even when incorrect. o1 "feels" more capable and thus might trust its own reasoning more, producing more confident errors ("feels" in the sense of expecting to be able to generate explanations that will be rewarded as good).
Lack of ground-truth: Advanced reasoning doesn't guarantee access to verification mechanisms. o1 is rewarded for producing convincing responses, not necessarily for ensuring accuracy. Better reasoning can increase the capability to "rationalize" rather than self-correct.
Complexity in mistakes: Higher reasoning allows more complex thought processes, potentially leading to mistakes that are harder to identify or self-correct.

Most of this is analogous to how more intelligent people ("intellectuals") can generate elaborate, convincing—but incorrect—explanations that cannot be detected by less intelligent participants (who may still suspect something is off but can't prove it).

I think the four scenarios outlined here roughly map to the areas 1, 6, 7, and 8 of the 60+ Possible Futures post.

Can you provide some simple or not-so-simple example automata in that language?

Just a data point that supports hold_my_fish's argument: Savant Kim Peek likely did memorize gigabytes of information and could access them quite reliably:

https://personal.utdallas.edu/~otoole/CGS_CV/R13_savant.pdf

Are there different classes of learning systems that optimize for the reward in different ways?

I don't think that shards are distinct - neither physically nor logically, so they can't hide stuff in the sense of keeping it out of view of the other shards.

Also, I don't think "querying for plans" is a good summary of what goes on in the brain.

I'm coming more from a brain-like AGI lens, and my account of what goes on would be a bit different. I'm trying to phrase this in shard theory terminology.

First, a prerequisite: Why do Alice's shards generate thoughts that value Rick's state, to begin with? The Rick-shard has learned that actions that make Rick happy result in states of Alice that are reinforced (Alice being happy/healthy).

Given that, I see the process as follows:

Alice's Rick-shards generate thoughts at different levels of abstraction about Alice being happy/healthy because Rick is happy/likes her. Examples:
1. Conversion (maybe a cached thought out of discussions they had) -> will have low predicted value for Alice
2. Going to church with Rick -> mixed
3. Being close to Rick in the Church (emphasis on closeness, Church in the background, few aspects active) -> trend positively
4. Being in the Church and thinking it's wrong -> consistent
5. Rick being happy that she joins him -> positive
So the world model returns no plan but only fragments of potential plans, some where she converts and goes to church with Rick, some not, some other combinations.
As there is no plan, no purpose needs to be hidden. Shards only bid for or against parts of plans.
Some of these fragments satisfy enough requirements of both retaining atheist Alice's values (which are predicted to be good for her) as well as scoring on Rick-happiness. Elaborating on these fragments will lead to the activation of more details that are at least somewhat compatible with all shards too. We only call the result of this a "rationalization."
So that she eventually generates enough detailed thoughts that score positively that she actually decides to implement an aggregate of these fragments, which we can call a church-going plan.
So that she gets positive reinforcement for going to church,
which reinforces all aspects of the experience, including being in church, which, in aggregate, we can call a religion-shard,
1. I agree that this changes her internal shard balance significantly - she has learned something she didn't know before, and that leaves her better off (as measured by some fundamental health/happiness measurements).
2. I think this can be meaningfully called value drift only with respect to either existing shards (though these are an abstraction we are using, not something that's fundamental to the brain), or with respect to Alice's interpretations/verbalizations - thoughts that are themselves reinforced by shards.
so that more such thoughts come up, and she eventually converts,
so that Rick ends up happier and liking Alice more - though that was never the "plan" to begin with.

In short: There is no top-down planning but bottom-up action generation. All planning is constructed out of plan fragments that are compatible with all (existing) shards.
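
To make the bidding picture above a bit more concrete, here is a minimal toy sketch in Python. It is only an illustration under loud assumptions: the shard names, feature names, and weights are all made up for this example, and "compatible with all shards" is modeled crudely as "no shard bids negatively on the fragment." It is not meant as a model of the brain or as shard theory's official formalism.

```python
# Toy sketch: shards bid on plan fragments; there is no top-down planner.
# All names and numbers below are hypothetical, chosen only for illustration.

# Each shard maps a feature of a thought to a learned value (assumed weights).
SHARDS = {
    "atheist": {"converts": -2.0, "thinks_church_is_wrong": 1.0},
    "rick": {"rick_happy": 2.0, "close_to_rick": 1.0, "converts": 0.5},
}

# Plan fragments the world model might surface, described by active features.
FRAGMENTS = {
    "conversion": {"converts", "rick_happy"},
    "going to church with Rick": {"in_church", "rick_happy"},
    "being close to Rick in church": {"close_to_rick", "in_church"},
    "in church, thinking it's wrong": {"in_church", "thinks_church_is_wrong"},
    "Rick happy that she joins": {"rick_happy"},
}


def bid(shard_weights, features):
    """Sum the shard's learned values over the features active in a fragment."""
    return sum(w for f, w in shard_weights.items() if f in features)


def compatible(features, shards):
    """A fragment survives only if no shard bids against it (assumed rule)."""
    return all(bid(weights, features) >= 0 for weights in shards.values())


def assemble(fragments, shards):
    """Aggregate surviving fragments; the 'plan' is just what is left over."""
    return [name for name, feats in fragments.items() if compatible(feats, shards)]


plan = assemble(FRAGMENTS, SHARDS)
for name in plan:
    print(name)
```

With these made-up weights, the "conversion" fragment is bid down by the atheist shard while the church-going fragments survive, so the aggregate looks like a church-going plan even though nothing in the code ever searched for one.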

Some other noteworthy groups in academia led by people who are somewhat connected to this community:
- Jacob Steinhardt (Berkeley)
- Dylan Hadfield-Menell (MIT)
- Sam Bowman (NYU)
- Roger Grosse (UofT)
Some other noteworthy groups in academia led by people who are perhaps less connected to this community:
- Aleksander Madry (MIT)
- Percy Liang (Stanford)
- Scott Niekum (UMass Amherst)

Can you provide some links to these groups?

Some observations:

Genes reproduce themselves.
Humans reproduce themselves.
Symbols are relearned.
Values are reproduced.

Each needs an environment to do so, but the key observation seems to be that a structure is reliably reproduced across intermediate forms (mitosis, babies, language, society), and that these structures build on top of each other. It seems plausible that there is a class of formal representations that describe

the parts that are retained across instances and
the embedding into each other (values into genes and symbols), and
the dynamics of the transfer.

You don't talk about human analogs of grokking, and that makes sense for a technical paper like this. Nonetheless, grokking also seems to happen in humans, and everybody has had "Aha!" moments before. Can you maybe comment a bit on the relation to human learning? It seems clear that human grokking is not a process that purely depends on the number of training samples seen but also on the availability of hypotheses. People grok faster if you provide them with symbolic descriptions of what goes on. What are your thoughts on the representation and transfer of the resulting structure, e.g., via language/token streams?
