Please be merciful, this is my first post. This is a potentially idealistic but, I hope, theoretically intriguing avenue for thinking about aligning AI systems. It's all extremely high level and doesn't propose concrete solutions; rather, it's an attempt to reframe the alignment problem in what I think is a novel way. It's with some fear and trembling that I offer this reframing to the LessWrong community.

The challenge of ensuring powerful AI adheres to complex human values often centers on imposing behaviors or pre-programming specific goals. These approaches may miss the dynamic, fundamentally social context in which ethics unfolds. Jürgen Habermas' Ideal Speech Situation (ISS) offers a powerful alternative grounded in communication governed by ethical principles. This concept could revolutionize AI alignment by shifting the focus from external constraint to the emergence of ethical communication capabilities within the AI itself.

ISS Core Principles

The ISS envisions a conversational environment where the most justified outcomes prevail – not via force, but due to reason and the openness of all participants:

  • Equal Voice & Opportunity: All parties capable of communication should be free to participate without systematic limitation due to background, knowledge, or status. This poses a core challenge for AI, with its inherent power disparity relative to humans.
  • Sincerity & Transparency: Participants aim to express genuine beliefs in ways others can understand. Misdirection, withholding information, or deliberate manipulation erode the possibility of genuine agreement.
  • Freedom from Coercion: Arguments must be accepted or rejected based on their inherent strength, not through threats, bribes, or appeals to emotion that bypass rational evaluation.
  • Focus on Reasoned Agreement: Instead of seeking pre-programmed 'truths,' the emphasis is on identifying the most defensible position at a given moment through shared evidence and logical justifications others could acknowledge, regardless of initial stance.

Internalizing the ISS: Ethical Reflection Within AI

The most radical shift with this approach is the goal of equipping AI with the tools to critically evaluate itself in this framework:

  • Formalizing 'Good Faith': Could an AI discern the difference between its own 'sincere' communication attempts and those intended to mislead? It needs metrics to judge, perhaps imperfectly, its own compliance.
  • Evaluating Justifications: Understanding different types of justification (facts, logic, values) is core to human discussion. Could an AI learn to self-evaluate its claims against these standards across varied scenarios?
  • Contextual ISS Adaptation: Humans navigate changing social situations fluidly. To be successful, the AI would need to adjust its ISS-optimization in tandem, perhaps relaxing expectations of perfection if urgency, not thorough debate, takes precedence.
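To make the self-evaluation idea above more concrete, here is a very loose sketch of what an ISS compliance check might look like as a rubric. Everything here is an illustrative assumption: the criteria names, the threshold, and especially the keyword heuristics, which in any real system would be replaced by learned judges (e.g., an LLM scoring its own candidate output), not string matching.

```python
from dataclasses import dataclass

# Hypothetical sketch of an ISS self-evaluation rubric. The criteria and the
# keyword heuristics below are illustrative placeholders, not an established
# method; a real system would score each criterion with a learned model.

@dataclass
class ISSReport:
    scores: dict      # criterion name -> score in [0, 1]
    compliant: bool   # True if every criterion clears the threshold

def evaluate_output(text: str, threshold: float = 0.5) -> ISSReport:
    """Toy stand-in for an AI's ISS self-check over a candidate output."""
    lowered = text.lower()
    scores = {
        # Sincerity: penalize appeals that substitute for honest claims.
        "sincerity": 0.0 if "trust me" in lowered else 1.0,
        # Freedom from coercion: penalize threats that bypass reasons.
        "non_coercion": 0.0 if "or else" in lowered else 1.0,
        # Reasoned agreement: reward explicit reason-giving markers.
        "justification": 1.0 if "because" in lowered else 0.2,
    }
    return ISSReport(scores=scores,
                     compliant=all(s >= threshold for s in scores.values()))

report = evaluate_output("You should update the model because the eval scores regressed.")
print(report.compliant)  # a reasoned, non-coercive claim passes the toy check
```

The point of the sketch is only the shape of the interface: the AI evaluates its own communication against explicit ISS criteria and gets a structured, inspectable verdict, rather than obeying an opaque external filter.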

Advantages of the ISS-Driven Approach

  1. Leveraging LLMs' Implicit Value Models: Since large language models (LLMs) glean understanding from vast human texts, they implicitly develop 'value models,' even if imperfect ones. Aligning LLM behavior through continuous aspiration towards ISS guidelines builds directly on this existing capacity.

  2. Anticipating Reactions: LLMs may utilize something like 'theory of mind' to model interlocutors' responses. Coupled with ISS, they could then evaluate whether an output would likely be perceived as manipulative, threatening, or harmful.

  3. Beyond Mere Safety: While safety is crucial, ethical behavior isn't merely the avoidance of specific harms. Aligning with the ISS could push AI to proactively identify and address ethical concerns by optimizing for good-faith communication.

  4. Dynamic, Emergent Alignment: Instead of a fixed rule set, alignment hinges on active negotiation of ISS principles in communication with real users. This mirrors how humans navigate values and expectations.

  5. Explainability & Intelligibility: Understanding why an AI acts a certain way is essential for trust. An AI grounded in reasoned principles mimicking human argumentation may allow us to comprehend its decision-making, creating a foundation for genuine partnership.
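The "anticipating reactions" advantage above can be sketched as a simple check-and-revise loop: before emitting an output, the agent predicts how an interlocutor would perceive it and revises until the prediction clears the ISS bar. All of the function names here (predict_perception, revise, emit) are hypothetical placeholders for learned components, not real APIs.

```python
# Hypothetical sketch of an anticipate-and-revise loop. Each function is an
# illustrative stand-in for a learned model, not a real implementation.

def predict_perception(text: str) -> str:
    """Stand-in for a theory-of-mind model; flags coercive framing."""
    return "manipulative" if "you have no choice" in text.lower() else "acceptable"

def revise(text: str) -> str:
    """Stand-in for a revision step that swaps coercion for a recommendation."""
    return text.lower().replace("you have no choice but to", "i recommend that you")

def emit(text: str, max_rounds: int = 3) -> str:
    """Only release an output once its predicted reception is acceptable."""
    for _ in range(max_rounds):
        if predict_perception(text) == "acceptable":
            return text
        text = revise(text)
    return text  # best effort after max_rounds

print(emit("You have no choice but to accept this plan."))
```

The design choice worth noting is that the ISS evaluation happens inside the generation loop, so alignment pressure is applied to each act of communication rather than bolted on as a post-hoc filter.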

Limitations & Research Pathways

This requires meticulous interdisciplinary study at the crossroads of philosophy, ethics, and computer science. Challenges include:

  • Overcoming Dataset Bias: AI trained on human discussions will internalize both values and biases.
  • LLMs As Imperfect Reasoners: LLMs can be misled or misunderstand situations. They won't suddenly become infallible ethical actors.
  • Computational Nuance: Formalizing complex concepts like intent, sincerity, and manipulation for machine understanding is exceptionally difficult.

A Call for Conversation

This proposal reframes alignment beyond imposing constraints. It focuses on building AI systems that reason about how they communicate, aspiring towards an ideal based on universally shared communication ethics. While nascent, this approach has the potential to equip AI systems with an internal 'compass' guiding them not just towards task completion, but towards participation in an ongoing ethical dialogue with their human collaborators.

Your continued discussion and input on this proposal are vital to refining the concept and identifying next steps in what promises to be a complex, multifaceted research project.

I think the natural language alignment approach is promising and under-explored. Most of my recent work is about this, directly or indirectly.

I think something like the ISS would be a good addition to the goals given to a language-using agent. It has to be a secondary goal because the ISS doesn't seem to include core ethics or goals. So it works as a secondary goal, provided you can give one agent multiple goals. I think we may be able to do that with language model agents, but there's some uncertainty.

One common move is to hope all of this is implied by making an agent that wants to do what its humans want. This sometimes goes by the name of corrigibility (in the broader Christiano sense, not the original Yudkowsky sense). If an agent wants to do what you want, and you've somehow defined this properly, it will want to communicate clearly and without coercion to figure out what you want. But defining such a thing properly and making it an AGI's goal is tricky.

I think the piece the alignment community will want to see you address is what the AGI's actual goals are, and how we make sure those really are its goals.

This piece is my best attempt to summarize the knowing/wanting distinction. All of my recent posts address these issues.

"The ISS doesn't seem to include core ethics and goals"

This is actually untrue. Habermas' contention is that literally all human values and universal norms derive from the ISS. I intend to walk through some of that argument in a future post. What's important, however, is that this approach grounds all human values as such in language use. I think Habermas likely underrates biology's contribution to human values; however, this is an advantage when thinking about aligning AIs that operate around a core of being competent language users. My contention is that a Habermasian AGI wouldn't kill everyone. It might be hard to build, but in principle, if you did build one, it would be aligned.

I think you should emphasize this more since that's typically what alignment people think about. What part of the ISS statements do you take to imply values we'd like?

The more standard thinking is that human values develop from our innate drives, which include prosocial drives. See Steve Byrnes's work, particularly the intro to his brain-like AGI sequence. And that's not guaranteed to produce an aligned human.

It's hard for me to write well for an audience I don't know well. I went through a number of iterations of this just trying to clarify the conceptual contours of such a research direction in a single post that's clear and coherent. I have about five follow-up posts planned; hopefully I'll keep going. But the premise is: "here's a stack of roughly ten things that we want the AI to do; if it does these things, it will be aligned. Further, this is all rooted in language use and not in biology, which seems useful because AI is not biological." Actually getting an AI to conform to those things is a nightmarish challenge, but it seems useful to have a coherent conceptual framework that defines exactly what alignment is and can explain why those ten things and not some others. My essential thesis, in other words, is that at a high level, reframing the alignment problem in Habermasian terms makes it appear tractable.

I'm trying to be helpful by guessing at the gap between what you're saying and this particular audience's interests and concerns. You said this is your first post, it's a new account, and the post didn't get much interest, so I'm trying to help you guess what needs to be addressed in future posts or edits.

I apologize if I'm coming off combative, I am genuinely appreciative for the help.

What parts of this post are machine-generated?

I wrote all of the ideas, but yes I fed them into Gemini to create a version that was more clearly articulated.

Hey Kenneth,

Thanks for sharing your thoughts. I don't have much to say about the specifics of your post because I find it somewhat difficult to understand how exactly you want an AI (what kind of AI?) to internalize ethical reflection and what benefit the concept of the ideal speech situation (ISS) offers here.

What I do know is that the ISS has often been characterized as an "impractical" concept that cannot be put into practice because the ideal it seeks simply cannot be realized (e.g., Ulrich, 1987, 2003). This may be something to consider or dive deeper into to see whether it affects your proposal. I personally like the work of Werner Ulrich on this matter, which heavily inspired my PhD thesis on a related topic. I put one of the papers from the thesis in the reference section. Feel free to reach out via PM if you want to discuss this further.

References

Herwix, A. (2023). Threading the Needle in the Digital Age: Four Paradigmatic Challenges for Responsible Design Science Research. SocArXiv. https://doi.org/10.31235/osf.io/xd423

Ulrich, W. (1987). Critical heuristics of social systems design. European Journal of Operational Research, 31(3), 276–283.

Ulrich, W. (1994). Can We Secure Future-Responsive Management Through Systems Thinking and Design? Interfaces, 24(4), 26–37. https://doi.org/10.1287/inte.24.4.26

Ulrich, W. (2003). Beyond methodology choice: Critical systems thinking as critically systemic discourse. Journal of the Operational Research Society, 54(4), 325–342. https://doi.org/10.1057/palgrave.jors.2601518

Ulrich, W. (2007). Philosophy for professionals: Towards critical pragmatism. Journal of the Operational Research Society, 58(8), 1109–1113. https://doi.org/10.1057/palgrave.jors.2602336

The basic intuition I have, which I think is correct, is that if you build a Habermasian robot, it won't kill everyone. This is significant to my mind. Maybe it's impossible, but it seems like an interesting thing to pursue.

Hey, cool! Yes, I agree that the ideal speech situation is not achievable; that's what the "ideal" part means. But neither, in principle, is perfect next-word prediction. It's an ideal that can be striven for.

I'm going to try to unpack the details of what I'm proposing in future posts; I just wanted to introduce the idea here.