Davidmanheim — AI Alignment Forum

Typo?

how information is used, it might to be far enough

Probably should read "might not be"

Attaching requirements to model releases has serious downsides (relative to a different deadline for these requirements)

Davidmanheim3mo30

One potentially important way to address parts of this is to move from thinking only about model audits and model cards toward organizational audits. That is, the organization should have policies about when to test and what and when to disclose test results; an organizational safety audit would decide if those policies are appropriate, sufficiently transparent, and sufficient given the risks - and also check to ensure the policies are being followed.

Note that Anthropic has done something like this, albeit weaker, by undergoing an ISO management system audit, as they described here. Unfortunately, this specific audit type doesn't cover what we care about most, but it's the right class of solution. (It also doesn't require a high level of transparency about what is audited and what is found - but Anthropic evidently does that anyways.)

BIG-Bench Canary Contamination in GPT-4

Davidmanheim5mo20

Flagging that I just found that Google Gemini also has this contamination: https://twitter.com/davidmanheim/status/1939597767082414295

A single principle related to many Alignment subproblems?

Davidmanheim6mo30

To pursue their values, humans should be able to reason about them. To form preferences about a thing, humans should be able to consider the thing. Therefore, human ability to comprehend should limit what humans can care about.

You're conflating can and should! I agree that it would be ideal if this were the case, but am skeptical it is. That's what I meant when I said I think A is false.

If learning values is possible at all, there should be some simplicity biases which help to learn them. Wouldn't it be strange if those simplicity biases were absolutely unrelated to simplicity biases of human cognition?

That's a very big "if"! And simplicity priors are made questionable, if not refuted, by the fact that we haven't gotten any convergence about human values despite millennia of philosophy trying to build such an explanation.

You define "values" as ~"the decisions humans would converge to after becoming arbitrarily more knowledgeable".

No, I think it's what humans actually pursue today when given the options. I'm not convinced that these values are static, or coherent, much less that we would in fact converge.

You say that values depend on inscrutable brain machinery. But can't we treat the machinery as a part of "human ability to comprehend"?

No, because we don't comprehend it; we just evaluate what we want locally using the machinery directly, and make choices based on that. (Then we apply pretty-sounding but ultimately post-hoc reasoning to explain it - as I tweeted partly thinking about this conversation.)

A single principle related to many Alignment subproblems?

Davidmanheim6mo10

No, the argument above is claiming that A is false.

A single principle related to many Alignment subproblems?

Davidmanheim6mo20

I think the crux might be that I think the ability to sample from a distribution at points we can reach does not imply that we know anything else about the distribution.

So I agree with you that we can sample and evaluate. We can tell whether a food we have made is good or bad, and can have aesthetic taste (but I don't think that this is stationary, so I'm not sure how much it helps, not that this is particularly relevant to our debate). And after gathering that data (once we have some idea about what the dimensions are), we can even extrapolate, in either naive or complex ways.

But unless values are far simpler than I think they are, I will claim that the naive extrapolation from the sampled points fails more and more as we extrapolate farther from where we are, which is a (or the?) central problem with AI alignment.
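To make the extrapolation point concrete, here is a toy sketch, not a model of actual human values. The function `true_value` is a hypothetical stand-in that looks nearly linear around the points we can reach but bends farther out; the sampled range, the straight-line fit, and the test distances are all assumptions chosen for illustration.

```python
def true_value(x):
    # Hypothetical "true" value function: nearly linear close to 0,
    # but dominated by the cubic term far from where we sample.
    return x - 0.1 * x**3


def fit_line(xs, ys):
    # Ordinary least-squares fit of y = slope * x + intercept.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def extrapolation_errors(distances):
    # Sample only at points "we can reach": x in [-1, 1].
    xs = [i / 10 for i in range(-10, 11)]
    ys = [true_value(x) for x in xs]
    slope, intercept = fit_line(xs, ys)
    # Compare the naive linear extrapolation with the true function.
    return [abs((slope * d + intercept) - true_value(d)) for d in distances]


errors = extrapolation_errors([1, 2, 4, 8])
for distance, error in zip([1, 2, 4, 8], errors):
    print(f"distance {distance}: absolute error {error:.3f}")
```

In this toy setup the fit is good at the edge of the sampled range and gets steadily worse as the evaluation point moves away from it. How fast real extrapolations of values degrade depends on how complex values actually are, which is the open question.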

A single principle related to many Alignment subproblems?

Davidmanheim6mo*30

If something is too hard to optimize/comprehend, people couldn't possibly optimize/comprehend it in the past, so it couldn't be a part of human values.

I don't understand why this claim would be true.

Take the human desire for delicious food; humans certainly didn't understand the chemistry of food and the human brain well enough to comprehend it or directly optimize it, but for millennia we picked foods that we liked more, explored options, and over time cultural and culinary processes improved on this poorly understood goal.

Modeling versus Implementation

Davidmanheim6mo21

As I understand it, MIRI intended to build principled glass-box agents based on Bayesian decision theory.

I think this misunderstands the general view of agent foundations by those who worked on it in the past. That is, "highly reliable agent design" was an eventual goal, in the same sense that someone taking high-school physics wants to use it to build rockets - they (hopefully) understand enough to know that they don't know enough, and will need to learn more before even attempting to build anything.

That's why Eliezer talked so much about deconfusion. The idea was to figure out what they didn't know. This led to later talking about building safe AI as an eventual goal - not a plan, but an eventual possible outcome if they could figure out enough. They clarified this view. It was mostly understood by funders. And I helped Issa Rice write a paper laying out the different pathways by which it could help - and only two of those involved building agents.

And why did they give it up? Largely because they found that the deconfusion work was so slow, and everyone was so fundamentally wrong about the basics, that as LLM-based systems were developed they didn't think we could possibly build the reliable systems in time. They didn't think that Bayesian decision theory or glass-box agents would necessarily work, and they didn't know what would. So I think "MIRI intended to build principled glass-box agents based on Bayesian decision theory" is not just misleading, but wrong.

Slow corporations as an intuition pump for AI R&D automation

Davidmanheim6mo10

This seems mostly right, except that it's often hard to parallelize work and manage large projects - which seems like it slows things down significantly. And, of course, some things are strongly serialized using time that can't be sped up via more compute or more people. (See: PM hires 9 women to have baby in one month.)

Similarly, running 1,000 AI research groups in parallel might get you the same 20 insights 50 times, rather than generating far more insights. And managing and integrating the research, and deciding where to allocate research time, plausibly gets harder at more than a linear rate with more groups.

So overall, the model seems correct, but I think the 10x speed up is more likely than the 20x speed up.
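A rough way to see the duplication point is a toy calculation. It assumes a fixed pool of findable insights and that each group independently finds each one with the same probability; the pool size of 200 and the per-group probability of 0.1 are made-up numbers, picked so that a single group finds about 20 insights.

```python
def expected_distinct_insights(num_groups, pool_size=200, find_prob=0.1):
    # Assumption: each group independently finds each insight in the pool
    # with probability find_prob. An insight counts once no matter how
    # many groups rediscover it.
    prob_found_by_any_group = 1 - (1 - find_prob) ** num_groups
    return pool_size * prob_found_by_any_group


for groups in (1, 10, 100, 1000):
    print(groups, round(expected_distinct_insights(groups), 1))
```

Under these assumptions, going from 1 group to 1,000 groups multiplies the expected number of distinct insights by about 10, not 1,000, because most of the added work rediscovers what another group already found. The sketch ignores the management and integration costs mentioned above, which would reduce the gain further.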

Interpretability Will Not Reliably Find Deceptive AI

Davidmanheim7mo10

CoT monitoring seems like a great control method when available

As I posted in a top level comment, I'm not convinced that even success would be a good outcome. I think that if we get this working 99.999% reliably, we still end up delegating parts of the oversight in ways that have other alignment failure modes, such as via hyper-introspection.
