AI Alignment Forum
Chris van Merwijk
Comments (sorted by newest)

Thoughts on responsible scaling policies and regulation
Chris van Merwijk · 19d

I think forcing people to publicly endorse policies that they don't endorse in practice just because they would solve the problem in theory is not a recipe for policy success.


I know this was said in a different context, but:

The request from people like Yudkowsky, Soares, PauseAI, etc. is not that people should publicly endorse the policy despite not endorsing it in practice.

Their request is that people who do endorse the policy shouldn't hold back from saying so only because they think it is unlikely to happen.

There's a difference between
(1) I don't support a pause because it's unlikely to happen, even though it would be a great policy, better than the policy I'm pushing.
(2) I don't support a pause because it wouldn't be executed well and would be negative.

They're saying (1) is bad, not (2).

The Pando Problem: Rethinking AI Individuality
Chris van Merwijk · 6mo

This comment was written by Claude, based on my bullet points:

I've been thinking about the split-brain patient phenomenon as another angle on this AI individuality question.

Consider split-brain patients: despite having the corpus callosum severed, the two hemispheres don't suddenly become independent agents with totally different goals. They still largely cooperate toward shared objectives. Each hemisphere makes predictions about what the other is doing and adjusts accordingly, even without direct communication.

Why does this happen? I think it's because both hemispheres were trained together for their whole life, developing shared predictive models and cooperative behaviors. When the connection is cut, these established patterns don't just disappear—each hemisphere fills in missing information with predictions based on years of shared experience.

Similarly, imagine training an AI model to solve some larger task consisting of a bunch of subtasks. For practical reasons it will have to carve up the task to some extent and call instances of itself to solve the subtasks. In order to perform the larger task well, there will be a training incentive for these instances to have internal predictive models, habits, and drives of the form "I am part of a larger agent, performing a subtask".

Even if we later placed multiple instances of such a model (or of different but similar models) in positions meant to be adversarial - perhaps as checks and balances on each other - they might still have deeply embedded patterns predicting cooperative behavior from similar models. Each instance might continue acting as if it were part of a larger cooperative system, maintaining coordination through these predictive patterns rather than through communication, even though their "corpus callosum" is cut (in analogy with split-brain patients).
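
As a purely illustrative rendering of the setup described above, here is a minimal sketch in which one model is reused as several "instances", each invoked on a subtask while being framed as part of a larger agent. The names (call_model, solve) and the prompt wording are hypothetical stand-ins, not any real API:

```python
def call_model(prompt: str) -> str:
    # Hypothetical stand-in for a forward pass / inference call.
    return f"<output for: {prompt}>"

def solve(task: str, depth: int = 0) -> str:
    # Crude stand-in for the model's own decision to carve up the task.
    if depth >= 1:
        return call_model(f"You are one instance of a larger agent. Solve this subtask: {task}")
    subtasks = [f"{task} (part {i})" for i in (1, 2)]
    partials = [solve(sub, depth + 1) for sub in subtasks]
    return call_model(
        f"You are one instance of a larger agent. Combine these partial results for '{task}': {partials}"
    )

print(solve("summarize a long report"))
```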

I'm not sure how far this analogy goes; it's just a thought.

The Pando Problem: Rethinking AI Individuality
Chris van Merwijk · 6mo

A version of what ChatGPT wrote here prompted

What was the prompt?

Compositional language for hypotheses about computations
Chris van Merwijk · 7mo

I may be confused somehow. Feel free to ignore. But:
* At first I thought you meant the input alphabet to be the colors, not the operations.
* Instead, am I correct that "the free operad generated by the input alphabet of the tree automaton" is an operad with just one color, and that the "operations" are basically all the labeled trees whose node labels are elements of the alphabet, such that the number of children of each node equals the arity of its label in the input alphabet?
* That would make sense: the algebra would then, I guess, assign the state space Q of the tree automaton to the single color of the operad, and map each arity-n operation to a function from Q^n to Q (see the sketch below).
* That would make sense, I think, but then why do you talk about a "colored" operad in: "we can now define a deterministic automaton over a (colored) operad O to be an O-algebra"?
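
Here is a minimal sketch of the reading in the second bullet, assuming the usual bottom-up tree-automaton semantics. The ranked alphabet, the boolean state space, and the names (RANKS, DELTA, eval_tree) are illustrative choices, not from the post:

```python
# A ranked input alphabet: each symbol has a fixed arity.
RANKS = {"and": 2, "not": 1, "true": 0, "false": 0}

# State space Q = {True, False}. The "algebra" assigns to each arity-n symbol
# a function Q^n -> Q; this is the transition map of the tree automaton.
DELTA = {
    "and": lambda a, b: a and b,
    "not": lambda a: not a,
    "true": lambda: True,
    "false": lambda: False,
}

def is_well_formed(tree) -> bool:
    # Trees count as operations of the free operad on the ranked alphabet
    # only if every node has exactly arity-many children.
    symbol, children = tree
    return len(children) == RANKS[symbol] and all(is_well_formed(c) for c in children)

def eval_tree(tree):
    # Bottom-up evaluation: fold the tree through the algebra DELTA.
    symbol, children = tree
    return DELTA[symbol](*[eval_tree(c) for c in children])

t = ("and", [("not", [("false", [])]), ("true", [])])
assert is_well_formed(t)
print(eval_tree(t))  # True
```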

Compositional language for hypotheses about computations
Chris van Merwijk · 7mo

More precisely, they are algebras over the free operad generated by the input alphabet of the tree automaton

Wouldn't this fail to preserve the arities of the input alphabet? I.e., you could have trees where a given symbol occurs multiple times with different numbers of children, which wouldn't be allowed from the perspective of the tree automaton, right?
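
To make the arity worry concrete, here is a tiny hypothetical example (the symbols "f" and "x" are illustrative, not from the post): if the free operad were generated by the bare set of symbols rather than by the ranked alphabet, nothing would rule out a binary symbol appearing with three children, and such a tree can't be run by the tree automaton.

```python
# "f" is declared binary, but an unranked "free" construction would also
# admit trees like f(x, x, x), which the tree automaton can't process.
RANKS = {"f": 2, "x": 0}

def respects_arities(tree) -> bool:
    symbol, children = tree
    return len(children) == RANKS[symbol] and all(respects_arities(c) for c in children)

ok = ("f", [("x", []), ("x", [])])
bad = ("f", [("x", []), ("x", []), ("x", [])])  # "f" used with 3 children
print(respects_arities(ok), respects_arities(bad))  # True False
```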

Gradual Disempowerment, Shell Games and Flinches
Chris van Merwijk · 8mo

but note that the gradual problem makes the risk of coups go up.

Just a request to edit the post for clarity: do you mean coups by humans (using AI), coups by autonomous misaligned AI, or both?

Many arguments for AI x-risk are wrong
Chris van Merwijk · 9mo

EDIT 3/5/24: In the comments for Counting arguments provide no evidence for AI doom, Evan Hubinger agreed that one cannot validly make counting arguments over functions. However, he also claimed that his counting arguments "always" have been counting parameterizations, and/or actually having to do with the Solomonoff prior over bitstrings.

As one of Evan's co-authors on the mesa-optimization paper from 2019, I can confirm this. I don't recall ever thinking seriously about a counting argument over functions.

What’s the short timeline plan?
Chris van Merwijk · 9mo

I just want to register a prediction: I think something like Meta's Coconut will, in the long run, in fact perform much better than natural-language CoT. Perhaps not in this timeframe, though.

Cortés, Pizarro, and Afonso as Precedents for Takeover
Chris van Merwijk · 2y

NEW EDIT: After reading three giant history books on the subject, I take back my previous edit. My original claims were correct.

Could you edit this comment to add which three books you're referring to?

The Waluigi Effect (mega-post)
Chris van Merwijk · 3y

I agree. Though is it just the limited context window that causes the effect? I may be mistaken, but from memory it seems like these effects emerge sooner than you would expect if that were the only reason (given the size of GPT-3's context window).

Posts

Extinction Risks from AI: Invisible to Science? · 14 karma · 2y · 6 comments
Risks from Learned Optimization: Conclusion and Related Work · 29 karma · 6y · 4 comments
Deceptive Alignment · 39 karma · 6y · 6 comments
The Inner Alignment Problem · 30 karma · 6y · 6 comments
Conditions for Mesa-Optimization · 30 karma · 6y · 35 comments
Risks from Learned Optimization: Introduction · 58 karma · 6y · 33 comments