AI ALIGNMENT FORUM

Chris_Leong

Sequences

Wise AI Wednesdays

2 · Chris_Leong's Shortform · 6y · 4

Comments (sorted by newest)
The Open Agency Model
Chris_Leong · 1mo · 10

Is there any chance you could define what you mean by "open agency"? Do you essentially mean "distributed agency"?

Reply
the void
Chris_Leong · 3mo* · 30

Lots of fascinating points. However:

a) You raise some interesting points about how the inner character is more underdefined than people often realise, but I think it's also worth flagging that there's less of a void these days, given that a lot more effort is being put into writing detailed model specs.
b) I am less dismissive about the risk of publicly talking about alignment research than I was before seeing Claude quote its own scenario; however, I think you've neglected the potential for us to apply filtering to the training data (see the sketch at the end of this list). Whilst I don't think the solution will be that simple, I don't think the relation is quite as straightforward as you claim.
c) The discussion of "how do you think the LLMs feel about these experiments" is interesting, but it is also overly anthropomorphic. LLMs are anthropomorphic to a certain extent, having been trained on human data, but it is still a mistake to run a purely anthropomorphic analysis that doesn't account for other training dynamics.
d) Whilst you make a good point about how the artificiality of the scenario might be affecting the experiment, I feel you're being overly critical of some of the research into how models might misbehave. Single papers are rarely definitive, and there's often value in simply showing that a phenomenon exists in order to spur further research, which can then explore a wider range of theories about mechanisms. It's very easy to say "oh, this is poor-quality research because it doesn't address my favourite objection"; I've probably fallen into this trap myself. However, the number of possible objections is often pretty large, and if you never published until you'd addressed everything, you'd most likely never publish.
e) I worry that some of your skepticism of the risks manages to be persuasive by casting vague aspersions that are disconnected from the actual strength of the arguments. You're like "oh, the future, the future, people are always saying it'll happen in the future", which probably sounds convincing to folks who haven't been following that closely, but it's a lot less persuasive if you know that we've been consistently seeing stronger results over time (in addition to a recent spike in anecdotes with the new reasoning models). This is just a natural part of the process: when you're trying to figure out how to conduct solid research in a new domain, of course it's going to take some time.
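To make the point in (b) concrete, here is a minimal sketch of what filtering the training data could look like, assuming a simple keyword blocklist. The phrases, names, and structure are purely illustrative, and a realistic pipeline would be far more sophisticated (e.g. classifier-based rather than keyword-based):

```python
# Hypothetical blocklist of phrases marking documents we might hold out of
# a pre-training corpus; a real pipeline would more likely use trained
# classifiers than a hand-written list.
BLOCKLIST = [
    "alignment faking",
    "shutdown avoidance",
    "sandbagging an evaluation",
]

def should_filter(document: str) -> bool:
    """Return True if the document contains any blocklisted phrase."""
    text = document.lower()
    return any(phrase in text for phrase in BLOCKLIST)

def filter_corpus(documents: list[str]) -> list[str]:
    """Keep only the documents that pass the filter."""
    return [doc for doc in documents if not should_filter(doc)]

if __name__ == "__main__":
    corpus = [
        "A recipe for sourdough bread.",
        "A post describing a shutdown avoidance scenario in detail.",
    ]
    print(filter_corpus(corpus))  # -> ['A recipe for sourdough bread.']
```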

Reply
When is it important that open-weight models aren't released? My thoughts on the benefits and dangers of open-weight models in response to developments in CBRN capabilities.
Chris_Leong · 3mo · 10

I think it's valuable for some people to say that it's a terrible idea in advance so they have credibility after things go wrong.

Reply
Alignment first, intelligence later
Chris_Leong · 4mo · 10

Whilst interesting, this post feels very assertive.

You claim that biological systems work by maintaining alignment as they scale. In what sense is this true?

You say that current methods lack a vision of a coherent whole. In what sense? There's something extremely elegant about pre-training to learn a world model, doing supervised learning to select a sub-distribution, and then using RL to develop past the human level. In what sense does this "lack a vision"?
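For concreteness, here is a toy schematic of that three-stage recipe. Everything in it, from the word-count "model" to the reward function, is a trivial stand-in I've made up; it illustrates only the shape of the pipeline, not any real training code:

```python
# Toy schematic of the three-stage recipe: pre-train on a broad corpus,
# supervised fine-tune on demonstrations, then RL against a reward signal.
# The "model" is just a word-count dict and the update rules are stand-ins.

def pretrain(corpus: list[str]) -> dict:
    """Stage 1: learn a (toy) world model from broad data."""
    model: dict = {"counts": {}}
    for doc in corpus:
        for word in doc.split():
            model["counts"][word] = model["counts"].get(word, 0) + 1
    return model

def supervised_finetune(model: dict, demonstrations: list[str]) -> dict:
    """Stage 2: select a sub-distribution by upweighting demonstrated behaviour."""
    for demo in demonstrations:
        for word in demo.split():
            model["counts"][word] = model["counts"].get(word, 0) + 10
    return model

def rl_finetune(model: dict, reward_fn, steps: int = 50) -> dict:
    """Stage 3: push past the demonstrations using a reward signal."""
    for _ in range(steps):
        best = max(model["counts"], key=lambda w: reward_fn(w) + model["counts"][w])
        model["counts"][best] += 1  # reinforce whatever the reward favours
    return model

if __name__ == "__main__":
    model = pretrain(["the cat sat on the mat", "helpful answers are good"])
    model = supervised_finetune(model, ["helpful answers"])
    model = rl_finetune(model, reward_fn=len)  # toy reward: prefer longer words
    print(sorted(model["counts"].items(), key=lambda kv: -kv[1])[:3])
```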

I'm open to the possibility that we need to align a model as we make it more intelligent in order to prevent the agent from sabotaging the process. But it's unclear from this article whether this is why you want alignment first, or whether it's for some other reason.

Reply
Models Don't "Get Reward"
Chris_Leong · 5mo · 10

I really liked the analogy of taking actions, falling asleep then waking up (possibly with some modifications) and continuing.

I was already aware of your main point, but the way you've described it is a much clearer way of thinking about this.

Reply
A Problem to Solve Before Building a Deception Detector
Chris_Leong · 5mo · 20

Recently, the focus of mechanistic interpretability work has shifted to thinking about "representations", rather than strictly about entire algorithms


Recently? From what I can tell, this seems to have been a focus from the early days (1, 2).

That said, great post! I really appreciated your conceptual frames.

Reply
Chris_Leong's Shortform
Chris_Leong · 5mo* · 611

Collapsible boxes are amazing. You should consider using them in your posts.

They are a particularly nice way of providing a skippable aside: for example, filling in background information, answering an FAQ, or including evidence to support an assertion.

Compared to footnotes, collapsible boxes are more prominent and are better suited to containing paragraphs or formatted text.
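As an illustration of the underlying pattern, here is a minimal sketch in Python that emits the HTML equivalent. The forum editor provides its own collapsible-section widget, so the helper below is purely hypothetical:

```python
# Hypothetical helper that emits a collapsible aside as plain HTML.
# (The forum editor has its own collapsible-section widget; this just shows
# the underlying <details>/<summary> pattern.)

def collapsible(summary: str, body_html: str) -> str:
    """Wrap body_html in a collapsed-by-default <details> block."""
    return (
        "<details>\n"
        f"  <summary>{summary}</summary>\n"
        f"  {body_html}\n"
        "</details>"
    )

if __name__ == "__main__":
    print(collapsible(
        "Background: evidence for the claim above",
        "<p>Several paragraphs of supporting context can live here "
        "without interrupting the main post.</p>",
    ))
```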

Reply
AI for AI safety
Chris_Leong · 5mo* · 1-1

Great post. I think some of your frames add a lot of clarity and I really appreciated the diagrams.

One subset of AI for AI safety that I believe to be underrated is wise AI advisors[1]. Some of the areas you've listed (coordination, helping with communication, improving epistemics) intersect with this, but I don't believe that this exhausts the wisdom frame, especially since the first two were only mentioned in the context of capability restraint. You also mention civilizational wisdom as a component of backdrop capacity and I agree that this is a very diffuse factor. At the same time, a less diffuse intervention would be to increase the wisdom of specific actors.

You write: "If efforts to expand the safety range can’t benefit from this kind of labor in a comparable way... then absent large amounts of sustained capability restraint, it seems likely that we’ll quickly end up with AI systems too capable for us to control".

I agree. In fact, a key reason why I think this is important is that we can't afford to leave anything on the table.

One of the things I like about the approach of training AI advisors is that humans can compensate for weaknesses in the AI system. In other words, I'm introducing a third category of labour: human-AI cybernetic systems/centaur labour. I think this is likely to widen the sweet spot; however, we have to make sure that we do this in a way that differentially benefits safety.

You do discuss the possibility of using AI to unlock enhanced human labour. It would also be possible to classify such centaur systems under this designation.

  1. ^

    More broadly, I think there's merit to the cyborgism approach even if some of the arguments are less compelling in light of recent capabilities advances.

Reply
The Logistics of Distribution of Meaning: Against Epistemic Bureaucratization
Chris_Leong · 5mo · 20

Lots of interesting ideas here, but the connection to alignment still seems a bit vague.

Is misalignment really a lack of sensitivity, as opposed to a difference in goals or values? It seems to me that an unaligned ASI is extremely sensitive to context, just in the service of its own goals.

Then again, maybe you see Live Theory as being more about figuring out what the outer objective should look like (broad principles that are then localised to specific contexts) rather than about figuring out how to ensure an AI internalises specific values. And I can see potential advantages in this kind of indirect approach vs. trying to directly define or learn a universal objective.

Reply
The “no sandbagging on checkable tasks” hypothesis
Chris_Leong · 6mo · 10

That would make the domain of checkable tasks rather small.

That said, it may not matter depending on the capability you want to measure.

If you want to make the AI hack a computer to turn the entire screen green, and it skips a pixel so as to avoid completing the task, it would still have demonstrated that it possesses the dangerous capability, so it has no reason to sandbag (I sketch this below).

On the other hand, if you are trying to see whether it has a capability that you wish it to use, it can still sandbag.
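To make the green-screen example above concrete, here's a minimal sketch, assuming the screen is just a grid of RGB tuples. The 99% threshold and all of the names are my own illustrative choices, not anything from the original post:

```python
# Minimal sketch of the green-screen example: the screen is a grid of RGB
# tuples, and the 99% threshold and function names are illustrative only.

GREEN = (0, 255, 0)

def make_screen(width: int, height: int, colour=(0, 0, 0)):
    """Build a width x height grid filled with a single colour."""
    return [[colour for _ in range(width)] for _ in range(height)]

def task_complete(screen) -> bool:
    """Binary, easily checkable criterion: is every single pixel green?"""
    return all(pixel == GREEN for row in screen for pixel in row)

def capability_demonstrated(screen, threshold: float = 0.99) -> bool:
    """Even a near-miss (e.g. one skipped pixel) still reveals the capability."""
    pixels = [pixel for row in screen for pixel in row]
    green_fraction = sum(pixel == GREEN for pixel in pixels) / len(pixels)
    return green_fraction >= threshold

if __name__ == "__main__":
    screen = make_screen(100, 100, colour=GREEN)
    screen[0][0] = (0, 0, 0)  # the AI "skips a pixel" to avoid finishing
    print(task_complete(screen))            # False: the strict check fails
    print(capability_demonstrated(screen))  # True: the capability is evident anyway
```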

Reply

Wikitag Contributions

AI Safety & Entrepreneurship · 16d · (+4)
AI Safety & Entrepreneurship · 16d · (+417/-337)
AI Safety & Entrepreneurship · 20d · (+168)
AI Safety & Entrepreneurship · 21d · (+342/-1)
AI Safety & Entrepreneurship · 1mo · (+136)
AI Safety & Entrepreneurship · 1mo · (+25)
AI Safety & Entrepreneurship · 1mo · (+175/-101)
AI Safety & Entrepreneurship · 2mo · (+49)
AI Safety & Entrepreneurship · 2mo · (+14/-13)
AI Safety & Entrepreneurship · 2mo · (+200)

Posts (sorted by new)

8 · An Easily Overlooked Post on the Automation of Wisdom and Philosophy · 3mo · 0
4 · Potentially Useful Projects in Wise AI · 3mo · 0
7 · AI Safety & Entrepreneurship v1.0 · 4mo · 0
4 · Linkpost to a Summary of "Imagining and building wise machines: The centrality of AI metacognition" by Johnson, Karimi, Bengio, et al. · 5mo · 0
10 · Summary: "Imagining and building wise machines: The centrality of AI metacognition" by Johnson, Karimi, Bengio, et al. · 10mo · 0
8 · On the Confusion between Inner and Outer Misalignment · 1y · 2
48 · Don't Dismiss Simple Alignment Approaches · 2y · 2
7 · What evidence is there of LLMs containing world models? [Q] · 2y · 0
12 · Yann LeCun on AGI and AI Safety · 2y · 1
10 · What does the launch of x.ai mean for AI Safety? [Q] · 2y · 2