Comments

Thanks for taking the time to write out your response. I think the last point you made gets at the heart of our difference in perspectives. 

  • You could hope for substantial coordination to wait for bigger models that you only use via CPM, but I think bigger models are much riskier than well elicited small models so this seems to just make the situation worse putting aside coordination feasibility.

If we're looking at current LLMs and asking whether conditioning provides an advantage in safely eliciting useful information, then for the most part I agree with your critiques. I also agree that bigger models are much riskier, but I have the expectation that we're going to get them anyway. With those more powerful models come new potential issues, like predicting manipulated observations and performative prediction, that we don't see in current systems.  Strategies like RLHF also become riskier, as deceptive alignment becomes more of a live possibility with greater capabilities.

My motivation for this approach is to raise awareness of, and address, the risks that seem likely to arise in future predictive models, regardless of the ends to which they're used. Success in avoiding the dangers from powerful predictive models would then open the possibility of using them to reduce all-cause existential risk.

I'd be very interested in hearing the reasons why you're skeptical of the approach, even a bare-bones outline if that's all you have time for.

Sorry, I'm not quite clear what you mean by this, so I might be answering the wrong question.

I believe counterfactuals on the input space are a subset of counterfactuals on the predictor's state, because the input space's influence is through the predictor's state, but modifying the predictor's state can also reach states that don't correspond to any input. As such, I don't think counterfactuals on the input space add any power to the proposal.

Long-term planning is another capability that is likely necessary for deceptive alignment that could. Obviously a large alignment tax, but there are potentially ways to mitigate that. It seems at least as promising as some other approaches you listed.

I don't find goal misgeneralization vs schemers to be as much of a dichotomy as this comment is making it out to be. While they may be largely distinct for the first period of training, the current rollout method for state-of-the-art models seems to be "give a model situational awareness and deploy it to the real world, use this to identify alignment failures, retrain the model, repeat steps 2 and 3". If you consider this all part of the training process (and I think that's a fair characterization), a model that starts with goal misgeneralization quickly becomes a schemer too.

I think this part uses an unfair comparison:

Suppose that X and W+ are small finite sets. A task of type X → W+ can be implemented as a dictionary whose keys lie in X and whose values lie in W+, which uses |X| log2 |W+| bits. The functional can be implemented as a program which receives input of type Dict[X, W+] and returns output of type List[X]. Easy!

In the subjective account, by contrast, the task of type X → R requires infinite bits to specify, and the functional must somehow accept a representation of an arbitrary function X → R. Oh no! This is especially troubling for embedded agency, where the agent's decision theory must run on a physical substrate.

If X and W+ are small finite sets, then any behavior can be described with a utility function requiring only a finite number of bits to specify. You only need to use R as the codomain when W+ is infinite, such as when outcomes are continuous, in which case the dictionaries require infinite bits to specify too.
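To make the finite case concrete, here is a minimal Python sketch. It is my own illustration rather than anything from the original post: the names `dict_bits`, `indicator_utility` and `argmax_set`, the toy sets, and the choice to model a "behavior" as a non-empty set of chosen options are all assumptions. It shows the storage cost of a task stored as a lookup table, and a utility table of |X| bits whose maximizers reproduce an arbitrary choice.

```python
import math


def dict_bits(num_keys: int, num_values: int) -> int:
    """Bits needed for a lookup table with num_keys entries,
    each entry naming one of num_values possible values."""
    if num_values <= 1:
        return 0
    return num_keys * math.ceil(math.log2(num_values))


def indicator_utility(options, chosen):
    """Finite utility table: 1 for options the agent picks, 0 otherwise.
    Assumes `chosen` is a non-empty subset of `options`."""
    return {x: 1 if x in chosen else 0 for x in options}


def argmax_set(utility):
    """All options that attain the maximum utility."""
    best = max(utility.values())
    return {x for x, u in utility.items() if u == best}


# Hypothetical toy sets standing in for X and W+.
X = ["a", "b", "c", "d"]
W_plus = ["lose", "draw", "win"]

# A task X -> W+ stored as a dictionary.
task = {"a": "lose", "b": "win", "c": "draw", "d": "win"}

# Hypothetical behavior that does not maximize the outcome: picks "c".
chosen = {"c"}
utility = indicator_utility(X, chosen)

assert argmax_set(utility) == chosen
print(dict_bits(len(X), len(W_plus)))  # task table: 4 * ceil(log2(3)) = 8 bits
print(dict_bits(len(X), 2))            # indicator utility table: 4 bits
```

The indicator construction is deliberately crude. It only shows that a finite utility table exists for any non-empty choice over a finite X, not that it is a natural description of the agent.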

I think this is representative of an unease I have with the framing of this sequence. It seems to be saying that the more general formulation allows for agents that behave in ways that utility maximizers cannot, but most of these behaviors exist for maximizers of certain utility functions. I'm still waiting for the punchline of what AI safety relevant aspect requires higher order game theory rather than just maximizing agents, particularly if you allow for informational constraints.

I think, from an alignment perspective, having a human choose their action while being aware of the distribution over outcomes it induces is much safer than having it effectively chosen for them by their specification of a utility function. This is especially true because probability distributions are large objects. A human choosing between them isn't pushing in any particular direction that can make it likely to overlook negative outcomes, while choosing based on the utility function they specify leads to exactly that. This is all modulo ELK, of course.

I'm not sure I understand the variant you proposed. How is that different than the Othman and Sandholm MAX rule?

Thanks for the comment. I agree that, ideally, we would find a way not to have two wholly separate models and instead somehow train a model against itself. I think a potential issue with your proposal is that small perturbations could have discontinuous effects, the anticipation of which distorts predictions. However, it would be interesting to think about further to see if there's some way to avoid that issue.

Thanks Caspar, your comments here and on earlier drafts are appreciated. We'll expand more on the positioning within the related literature as we develop this into a paper.

As for your work on Decision Scoring Rules and the proposal in your comment, the biggest distinction is that this post's proposal does not require specifying the decision maker's utility function in order to reward one of the predictors and shape their behavior into maximizing it. That seems very useful to me, as if we were able to properly specify the desired utility function, we could skip using predictive models and just train an AI to maximize that instead (modulo inner alignment). 

For the first point, I agree that SGD pushes towards closing any gaps. My concern is that at the moment, we don't know how small the gaps need to be to get the desired behavior (and this is what we are working on modelling now). On top of that, depending on how the models are initialized, the starting gap may be quite large, so the dynamics of how gaps close throughout the training process seem important to study further.

For the second point, I think we are also in agreement. If the training process leads the AI to learn "If I predict that this action will destroy the world, the humans won't choose it", that in turn leads to dishonest predictions. However, I also find the training process converging to a mesa-optimizer for the training objective (or something sufficiently close) to be somewhat more plausible.
