My academic foundations are in cog sci; I'm currently interested in AI alignment and interpretability work (and lots of other things!). I'm new-ish to this space, so I welcome constructive feedback (and suggested resources) when I'm wrong or missing something :)
Even if an AGI has Approval Rewards (e.g., from LLMs, or somehow in RL/agentic scenarios), Approval Rewards only work if the agent actually values the approver's approval. That valuation might be more or less explicit, but there needs to be some kind of belief that the approval matters, and therefore that behavior should be steered toward approval-seeking / disapproval-minimizing outcomes.
As a toy analogy: many animals have preferences about food, territory, mates, etc., but humans don't treat those signals as serious guides to our behavior. Not because the signals aren't real, but because we don't see birds, for example, as part of our social systems in a way that would make seeking their approval worthwhile. We don't care whether birds support our choice of lunch, or whom we decide to partner with. Even among humans, in-group/out-group biases, and continuums of sameness/difference or closeness/distance, can materially affect how strongly or weakly we value approval signals. The approval of someone seen as very different, or as part of a distant group, gets discounted, while approval from "friends and idols", or even nearby strangers, matters a lot.
So if an AGI somehow does have an Approval Reward mechanism, what will count as a relevant or valued approval signal? Would the AGI see humans as irrelevant (like birds: real, embodied creatures with observable preferences that just don't matter to it), or as not valued (an out-group, non-valued reference class), and largely discount our approval in its reward system? Would it see other AGI entities as relevant/valued?
Maybe this is part of the sociopath issue too. But the point is that approval rewards only work if the agent assigns significance to the approver. So if we do decide that approval rewards are a good thing, and try to incorporate them into AGI designs, we should probably make sure that human approval is actually valued (or at least be explicit and intentional about this valuation structure).
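To make the point concrete, here's a minimal toy sketch of what I mean (my own framing, not from the post; `Approver`, `valuation`, and `total_reward` are hypothetical names): if the approval term is scaled by how much the agent values each approver, then human approval with a near-zero weight contributes essentially nothing, no matter how strong the signal is.

```python
# Toy sketch: approval only moves the agent's reward insofar as the agent
# assigns weight to the approver. All names here are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Approver:
    name: str
    valuation: float  # how much the agent cares about this approver (0 = "birds")

def total_reward(task_reward: float, approvals: dict[str, float],
                 approvers: list[Approver]) -> float:
    """Task reward plus approval terms, each scaled by the agent's valuation of the approver."""
    social = sum(a.valuation * approvals.get(a.name, 0.0) for a in approvers)
    return task_reward + social

# If the agent values other AGIs but barely values humans, human approval is effectively ignored:
approvers = [Approver("human_overseer", valuation=0.05),
             Approver("peer_agi", valuation=1.0)]
r = total_reward(1.0, {"human_overseer": +1.0, "peer_agi": -1.0}, approvers)
print(f"{r:.2f}")  # ~0.05: strong human approval barely offsets peer disapproval
```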
On another note, initially I felt like one attraction of an approval reward signal is that, to your point, it's actually pretty plastic (in humans), so it could potentially increase alignment plasticity, which might be important. Unless we discover some magic universal value system that holds for all of humanity for all eternity, it seems good for alignment to be able to shift alongside organic human values-drift. We probably wouldn't want AGI today to be aligned to colonial values from the 1600s, and future humans may largely disagree with current regimes, e.g., capitalism. Approval reward mechanisms could orient alignment toward some kind of consensus / average, which could itself change over time. They could also guardrail against "bad" values drift, so the AGI doesn't start adopting outlier values that don't benefit most people. Still, it's not perfect: the same mechanism could inherit all the failure modes of human social reward dynamics, like capture by powerful groups, polarization, or majorities endorsing evil norms, which could play out in scary ways with a superintelligence discounting human signals.
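A tiny sketch of the "consensus that drifts" idea, under my own assumptions (the drift rate, `population_values`, and `consensus_target` are all made up for illustration): if the approval target at each time step is an aggregate of what the current population endorses, alignment tracks that drift rather than a frozen snapshot, and a robust aggregate like the median pulls outlier values toward the middle.

```python
# Toy sketch: an approval target that follows drifting population values
# instead of staying pinned to whatever values held at t=0.
import random
import statistics

random.seed(0)

def population_values(t: int, n: int = 1000) -> list[float]:
    """Stand-in for 'what humans currently endorse'; the mean drifts slowly over time."""
    drift = 0.01 * t  # values shift gradually across periods
    return [random.gauss(drift, 1.0) for _ in range(n)]

def consensus_target(values: list[float]) -> float:
    """Median as a crude consensus: robust to a handful of extreme outliers."""
    return statistics.median(values)

for t in (0, 50, 100):
    target = consensus_target(population_values(t))
    print(f"t={t}: approval target ~ {target:+.2f}")
# The target moves with the population, so alignment can shift with values-drift,
# while outlier values have little pull on the aggregate.
```

Of course, this same structure is exactly where the failure modes above enter: whoever controls the aggregation (whose values get counted, and with what weights) controls the target.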