Conditioning Predictive Models: The case for competitiveness

Adam Jermyn; Johannes Treutlein; Rubi J. Hudson; kcwoolverton

[-][anonymous]2y32

Impersonating a human would present a potentially thorny deployment and societal issue, since it could be quite confusing for users to interact with AIs that claim to be humans. A ban on impersonating humans, if this is not clearly recognizable to the user, has also been suggested as part of the proposed EU AI Act.

I am not sure I understand the problem, or the conflict with the EU AI Act. It seems to me that clear recognizability (you simulate Bill Gates on your computer, you know that it is a simulation, and you don't tell other people that the actual Bill Gates said those things) should most likely solve it.

[-]Johannes Treutlein2y32

Yes, one could e.g. have a clear disclaimer above the chat window saying that this is a simulation and not the real Bill Gates. I still think this is a bit tricky. E.g., the simulated Bill Gates could be really persuasive and insist that the disclaimer is wrong. Some users might then end up believing Bill Gates rather than the disclaimer. Moreover, even if the user believes the disclaimer on a conscious level, impersonating someone might still have a subconscious effect. E.g., imagine an AI friend or companion who repeatedly reminds you that they are just an AI, versus one that pretends to be a human. The one that pretends to be a human might gain more intimacy with the user even if, on an abstract level, the user knows that it's just an AI.

I don't actually know whether this would conflict in any way with the EU AI Act. I agree that the disclaimer may be enough for the purposes of the Act.

[-]VojtaKovarik3y*10

Flagging confusion / potential disagreement: I think only predicting humans is neither sufficient nor necessary for the results to be aligned / helpful / not doom. Insufficient because if a misaligned AGI is already in control, or is likely to be in control later, predicting arbitrary existing humans seems unsafe. [Edit: I think this is very non-obvious and needs further supporting arguments.] Not necessary because it should be fine to predict any known-to-be-safe process. (As long as you do this in a well-founded manner, i.e. the model is not predicting itself.)

If a new paradigm displaced LLM pretraining + fine-tuning, though, then all bets are off.
For example, it is likely to be strictly more difficult to implement an LLM that never predicts other AIs than one that does, as well as strictly more cumbersome to have to restrict an LLM to only ever be interacted with using carefully crafted, safe conditionals.
Another potentially tricky issue with any of these approaches of training a predictive model to claim to be a human is that if it’s actually not well-described as a predictive model, and is instead e.g. well-described as an agent, then we would actively be training it to lie in telling us that it’s a human rather than an AI, which could result in a substantially less aligned agent than we would have gotten otherwise. Furthermore, such patches could also lead to an assistant that is more fragile and susceptible to jailbreaking due to the limitation of having to go through predicting some human.
Notably, this does also open us up to some additional inner alignment concerns related to doing the internal cognitive resource management in the right way, which we discuss later.

