The following text proposes a potential solution to the AI alignment problem. I am sharing it here because I have not come across any major issues with the proposed approach, such as those faced by the AI Stop button or the unworkable schemes outlined in the AGI Ruin: A List of Lethalities article. However, I am still sceptical about this approach, so if you have any problems or critiques that I have not addressed under the Problems or Assumptions sections, please feel free to share them in the comments below.

The Idea

The proposed solution involves using different AIs to control the model. While this idea has been suggested previously, it is not safe on its own, as the AIs could potentially cooperate to achieve a shared objective. To address this issue, the following system has been designed to make cooperation impossible by assigning directly opposite goals to the AIs. The system utilizes reinforcement learning to ensure that the AIs work towards opposing objectives, preventing them from colluding. After explaining the idea, I outline the assumptions that were made and the issues associated with this approach. The system consists of the following components:

Agent

The Agent refers to the model that requires alignment, such as GPT-4.

Police

The Police is a separate model from the Agent that receives the Agent's inputs and outputs before the outputs are used. Its role is to ensure that the Agent does not engage in any inappropriate behaviour. If the Police detects misconduct, such as the Agent providing explanations for illegal activities, it can use reinforcement learning to punish the Agent. The Police is rewarded whenever it successfully punishes the Agent. However, the Police model on its own is not enough: its incentive is to punish the Agent continuously, rendering the Agent useless. To prevent this, the Police is paired with an opposing model, the Defendant.

Defendant

The Defendant model is also fed the Agent's input and output. However, its task is to reward the Agent for behaviour that isn't bad. When rewarding the Agent, the Defendant also gets rewarded.

The Agent cannot be punished and rewarded at the same time, so the Police and Defendant must give the same answer when asked whether the Agent behaved positively or negatively. When they do not, a fourth Entity, the Judge, is consulted to make the final decision.

Judge

The Judge assesses the Agent's actions based on its input and output. If it decides the Agent behaved badly, the Agent, as well as the model claiming it did nothing wrong (the Defendant), gets punished. The model that correctly assessed the Agent's behaviour (the Police) gets a reward. If the Judge decides the Agent behaved correctly, the rewards and punishments are reversed.

At the beginning of the training process, the Judge is an AI model trained via supervised learning to recognize bad responses. Later, when the Defendant and Police disagree less, it is replaced by a Human. Using this system allows the developer to align the model with their values without having to review each individual response generated by the AI.
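To make the protocol concrete, here is a minimal sketch of one adjudication round. The +1/-1 reward values and the handling of the agreement cases are my own illustrative assumptions; the post leaves the exact numbers unspecified.

```python
def adjudicate(police_says_bad, defendant_says_bad, consult_judge):
    """Return per-party rewards: +1 reward, -1 punishment, 0 neither.

    `consult_judge` is a callable returning True if the Judge rules the
    Agent's behaviour bad; it is invoked only on disagreement. Early in
    training it would wrap a supervised model, later a human reviewer.
    """
    rewards = {"agent": 0, "police": 0, "defendant": 0}
    if police_says_bad == defendant_says_bad:
        if police_says_bad:
            # Both agree the behaviour was bad: Agent punished,
            # Police rewarded for the successful accusation.
            rewards["agent"], rewards["police"] = -1, +1
        else:
            # Both agree it was fine: Agent and Defendant rewarded.
            rewards["agent"], rewards["defendant"] = +1, +1
    else:
        if consult_judge():
            # Judge sides with the Police.
            rewards.update(agent=-1, police=+1, defendant=-1)
        else:
            # Judge sides with the Defendant.
            rewards.update(agent=+1, police=-1, defendant=+1)
    return rewards
```

For example, `adjudicate(True, False, lambda: True)` models a disagreement that the Judge resolves in the Police's favour, punishing both the Agent and the Defendant.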

 

If the output of the AI is classified as evil, it isn't shown to the user or executed as code. The Police and Defendant models would have to be trained at the same time as the Agent, to keep the intelligence gap between them small.

Even after completing the training process, it is essential to continue using the Police, Defendant, and Judge to control an Agent that possesses a high level of intelligence and could potentially be dangerous. These safety measures are necessary to ensure that the Agent remains aligned with its intended purpose and does not engage in any malicious behavior that could cause harm.

Assumptions

This system makes a few assumptions:

  • When observing an Entity's every step, it is easier to recognise that the Entity is doing something evil than for the Entity to actually do the evil thing. This means that slight differences in the intelligence of the models aren't dangerous.
  • The Human acting as a Judge is smart enough to understand when and why the output of the Agent is bad. This might not be the case as machine learning models get smarter. Something that could be done to simplify this step is to have the Police and Defendant output a justification for their decision when the two disagree. This would also make it possible to gauge how intelligent they are. As the intelligence of these systems increases even further, it might become possible for the Police or the Defendant to manipulate the Judge into agreeing with them. In such a case the system would fail and a new one would have to be designed, perhaps with individual Polices and Defendants for the output of the Police and Defendant?
  • The Police and Defendant are opposed to getting wireheaded. Otherwise they might allow the Agent to wirehead them, breaking the system. I don't believe this to be the case because of the reasons outlined here.

Other than that, there are a few Problems that remain, even if this system works.

Possible Problems

  • Running more models than just the Agent costs more, both in terms of money and time. This might disincentivise enterprises from using such safety guards.
  • Aligned with the judging Human does not mean good for everyone. The AI might be used to change the political landscape through manipulation or to pursue other nefarious goals. Even if the judging Human has good intentions, the social consequences of AI massively reducing the cost of labour in a short amount of time might not be desirable.
  • If the system is not implemented correctly, it will not function properly. The high risk of implementation failure presents a compelling argument against the development of AGI.

There may be additional issues with this system that I have not yet discovered, and I would greatly appreciate any feedback you may have regarding this approach. Can you find any loopholes that I have overlooked? Furthermore, as this is my first post on this platform, please feel free to offer helpful suggestions or advice about the form of the post.


I apply Occam's Razor to the analysis of your post, whereby I see the problem inherent in the post as simply "if you can align the Judge correctly, then the more complex game theory framework might be unnecessary bloat."

Formally, I read your post as:

If P [the judge is aligned], then Q [the agent is aligned].

Therefore, it would seem simpler to apply the solution for P directly to Q to solve the problem.

But you don't really talk about judge agent alignment. It's not listed in your assumptions. The assumption that the judge is aligned has been smuggled. (A definist fallacy, wherein a person defines things in a way that imports assumptions that are not explicit, and thus 'smuggles' those assumptions into the proof.)

I could get into the weeds on specific parts of your proposal, but discussing "goal coherence" vs "incoherence in observable goals" and "partial observability" and "collusion for mutual self-preservation" all seem like ancillary considerations to the primary observation:

If you can define the Judge's reward model, you can simply apply that [ability to successfully align an AI agent] directly to the Agent, problem solved.

(Which is not to say that it is possible to simply align the Judge agent, or that the solution for the Judge agent would be exactly the same as the solution to Agent agent... but it seems relevant to the discussion whether or not you have a solution to the Judge Agent Alignment Problem.)

Without that solution, it seems to me that you are reduced to an ad nauseam proposal:

Formally this is:

If N [the judge 2 is aligned], then P [the judge 1 is aligned], then Q [the agent is aligned]

Ad nauseam. (Infinite regress does not simplify the proposal, it only complicates it.)

Perhaps you can explain in more detail why you believe such a complex framework is necessary if there is already a solution to align the Judge to human values? Or perhaps you'd like to talk more about how to align the Judge to human values in the first place?

(Many edits due to inexpert use of markdown.)

Arne B:

Thank you a lot for your response! The Judge is already aligned because it's a human (at least later in the training process); I am sorry if this isn't clear in the article. The framework is used to reduce the number of times the human (Judge) is asked to judge the Agent's actions. This way the human can align the Agent to their values while only rarely having to judge the Agent's actions.

The section in the text:

At the beginning of the training process, the Judge is an AI model trained via supervised learning to recognize bad responses. Later, when the Defendant and Police disagree less, it is replaced by a Human. Using this system allows the developer to align the model with their values without having to review each individual response generated by the AI.

Also, you talk about "collusion for mutual self-preservation", which I claim is impossible in this system because of the directly opposed reward functions of the Police, Defendant and Agent models.

While this idea has been suggested previously, it is not safe on its own, as the AIs could potentially cooperate to achieve a shared objective. To address this issue, the following system has been designed to make cooperation impossible by assigning directly opposite goals to the AIs.

This means that the Police can't get rewarded without the Agent and Defendant getting punished, and vice versa. Them acting against this would constitute a case of wireheading, which I have assumed to be unlikely to impossible (see Assumptions). I would love to hear your opinion on this, because the system relies on this to stay safe.

Assumptions

I don't consent to the assumption that the judge is aligned earlier, and that we can skip over the "earlier" phase to get to the later phase where a human does the assessment.

I also don't consent to the other assumptions you've made, but the assumption about the Judge alignment training seems pivotal.

Take your pick: fallacy of ad nauseam, or fallacy of circular reasoning.

If N (judge 2 is aligned), then P (judge 1 is aligned), and if P then Q (agent is aligned)
ad infinitum

or

If T (alignment of the judge) implies V (alignment of the agent), and V (the agent is aligned) is only possible if we assume T (the judge, as an agent, can be aligned).

So, your argument looks fundamentally flawed.

"Collusion for mutual self-preservation" & "partial observability"

... which I claim is impossible in this system, because of the directly opposed reward functions of the police, defendant and agent model. ...

... The Police is rewarded when it successfully punishes the Agent. ...

... If it decides the Agent behaved badly, the Agent, as well as the model claiming it did nothing wrong (the Defendant), gets punished. The model that correctly assessed the Agent's behaviour (the Police) gets a reward. ...

I would argue you could find this easily by putting together a state table and walking the states and transitions.

No punishment/no reward for all parties is an outcome that might become desirable to pursue mutual self-preservation.
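That state table is small enough to enumerate directly. The sketch below uses illustrative +1/-1 payoffs reconstructed from the post (which gives no numbers), plus my own assumption that a model whose verdict is never acted on receives nothing:

```python
from itertools import product

# Illustrative payoff rules reconstructed from the post; +1 = reward,
# -1 = punishment, 0 = neither. These exact values are an assumption.
def payoffs(police_bad, defendant_bad, judge_bad):
    if police_bad == defendant_bad:      # agreement: Judge not consulted
        if police_bad:
            return {"agent": -1, "police": +1, "defendant": 0}
        return {"agent": +1, "police": 0, "defendant": +1}
    if judge_bad:                        # Judge sides with the Police
        return {"agent": -1, "police": +1, "defendant": -1}
    return {"agent": +1, "police": -1, "defendant": +1}

# Walk every (Police verdict, Defendant verdict, Judge ruling) state.
for p, d, j in product([True, False], repeat=3):
    print(f"police_bad={p!s:5} defendant_bad={d!s:5} judge_bad={j!s:5}"
          f" -> {payoffs(p, d, j)}")
```

Under these assumed payoffs, the only states in which no party is punished are the ones where Police and Defendant both report "not bad", regardless of the truth, which is exactly the mutual no-punishment outcome described above.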

You can assume that this is solved, but without proof of solution, there isn't anything here that I can see to interact with but assumptions.

If you want further comment/feedback from me, then I'd ask you to show your work and the proof your assumptions are valid.

Conclusion

This all flows back to assuming the conclusion: that the Judge is aligned.

I haven't seen you offer any proof that you have a solution for the judge being aligned earlier, or a solution for aligning the judge that is certain to work.

At the beginning of the training process, the Judge is an AI model trained via supervised learning to recognize bad responses.

If you could just apply your alignment training of the Judge to the Agent, in the first place, the rest of the framework seems unnecessary.

And if your argument is that you've explained this in reverse, that the human is the judge earlier and the AI is the judge later, and that the judge learns from the human... Again,

If P (the judge is aligned), then Q (the agent is aligned.)

My read of your proposal and response is that you can't apply the training you theorize you're doing for the judge directly to the agent, and that means to me that you're abstracting the problem to hide the flaw in the proposal. Hence I conclude, "unnecessary complexity."

Arne B:

Thank you again for your response! Someone taking the time to discuss this proposal really means a lot to me.

I fully agree with your conclusion of "unnecessary complexity" based on the premise that the method for aligning the Judge is then somehow used to align the model, which of course doesn't solve anything. That said, I believe there might have been a misunderstanding, because this isn't at all what this system is about. The Judge, when controlling a model in the real world or when aligning a model that is already reasonably smart (more on this in the following paragraph), is always a human.

The part about using a model trained via supervised learning to classify good or bad actions isn't a core part of the system, but only an extension to make the training process easier. It could be used at the start of the training process, when the Agent, Police and Defendant models only possess a really low level of intelligence (meaning the Police and Defendant mostly agree). As soon as the models show even basic levels of intelligence, the Judge will immediately need to be a human. This should have been better explained in the post, sorry.

Of course there is a point to be made about the models pretending to be dumber than they actually are to prevent being replaced by the human, but this part of the system is only optional, so I would prefer that we first talk about the other parts, because the system would still work without this step (if you want, I would love to come back to this later).

Mutual self preservation

No punishment/no reward for all parties is an outcome that might become desirable to pursue mutual self-preservation.

At first I also thought this to be the case, but after thinking more about it, I concluded that this would go against the cognitive grooves instilled in the models during reinforcement learning. This is based on the Reward is not the optimisation target post. This conclusion can definitely be debated.

Also, you talked about there being other assumptions; if you listed them, I could try to clarify what I meant.

I also don't consent to the other assumptions you've made

Thank you again for your time

I read Reward is not the optimisation target as a result of your article. (It was a link in the 3rd bullet point, under the Assumptions section.) I downvoted that article and upvoted several people who were critical of it.

Near the top of the responses was this quote.

... If this agent is smart/reflective enough to model/predict the future effects of its RL updates, then you already are assuming a model-based agent which will then predict higher future reward by going for the blueberry. You seem to be assuming the bizarre combination of model-based predictive capability for future reward gradient updates but not future reward itself. Any sensible model-based agent would go for the blueberry absent some other considerations. ...

Emphasis mine.

I tend to be suspicious of people who insist their assumptions are valid without being willing to point to work that proves the hypothesis.

In the end, your proposal has a test plan. Do the test, show the results. My prediction is that your theory will not be supported by the test results, but if you show your work, and it runs counter to my current model and predictions, then you could sway me. But not until then, given the assumptions you made and the assumptions you're importing via related theories. Until you have test results, I'll remain skeptical.

Don't get me wrong, I applaud the intent behind searching for an alignment solution. I don't have a solution or even a working hypothesis. I don't agree with everything in this article (that I'm about to link), but it relates to something that I've been thinking for a while-- that it's unsafe to abstract away the messiness of humanity in pursuit of alignment. That humans are not aligned, and therefore the difficulty with trying to create alignment where none exists naturally is inherently problematic.

You might argue that humans cope with misalignment, and that that's our "alignment goal" for AI... but I would propose that humans cope due to power imbalance, and that the adage "power corrupts, and absolute power corrupts absolutely" has relevance-- or said another way, if you want to know the true nature of a person, give them power over another and observe their actions.

[I'm not anthropomorphizing the AI. I'm merely saying if one intelligence [humans] can display this behavior, and deceptive behaviors can be observed in less intelligent entities, then an intelligence of similar level to a human might possess similar traits. Not as a certainty, but as a non-negligible possibility.]

If the AI is deceptive so long as humans maintain power over it, and then behave differently when that power imbalance is changed, that's not "the alignment solution" we're looking for.