All of AdamGleave's Comments + Replies

When I started working on this project, a number of people came to me and told me (with varying degrees of tact) that I was wasting my time on a fool's errand. Around half the people told me they thought it was extremely unlikely I'd find such a vulnerability. Around the other half told me such vulnerabilities obviously existed, and there was no point demonstrating it. Both sets of people were usually very confident in their views. In retrospect I wish I'd done a survey (even an informal one) before conducting this research to get a better sense of people'... (read more)

Thanks for flagging this disagreement Ryan. I enjoyed our earlier conversation (on LessWrong and in-person) and updated in favor of the sample efficiency framing, although we (clearly) still have some significant differences in perspective here. Would love to catch up again sometime and see if we can converge more on this. I'll try and summarize my current take and our key disagreements for the benefit of other readers.

I think I mostly agree with you that in the special case of vanilla RLHF this problem is equivalent to a sample efficiency problem. Specifi... (read more)

Thanks, that's a good link. In our case our assets significantly exceed the FDIC $250k insurance limit and there are operational costs to splitting assets across a large number of banks. But a high-interest checking account could be a good option for many small orgs.

Does this circle exploit have any connection to convolutions? That was my first thought when I saw the original writeups, but nothing here seems to help explain where the exploit is coming from. All of the listed agents vulnerable to it, AFAIK, make use of convolutions. The description you give of Wu's anti-circle training sounds a lot like you would expect from an architectural problem like convolution blindness: training can solve the specific exploit but then goes around in cycles or circles (ahem), simply moving the vulnerability around, like squeezin

... (read more)
I wouldn't really expect larger convolutions to fix it, aside from perhaps making the necessary 'circles' larger and/or harder to find or create longer cycles in the finetuning as there's more room to squish the attack around the balloon. It could be related to problems like the other parameters of the kernel like stride or padding. (For example, I recall the nasty 'checkerboard' artifacts in generative upscaling were due to the convolution stride/padding, and don't seem to ever come up in Transformer/MLP-based generative models but also simply making the CNN kernels larger didn't fix it, IIRC - you had to fix the stride/padding settings.) I personally would find it interesting but I don't know how important it is. It seems likely that you might find a completely different-looking adversarial attack, but would that be conclusive? There would be so many things that change between a CNN KataGo and a from-scratch ViT KataGo. Especially if you are right that Timbers et al find a completely different adversarial attack in their AlphaZero which AFAIK still uses CNNs. Maybe you could find many different attacks if you change up enough hyperparameters or initializations. On the gripping hand, now that I look at this earlier version, their description of it as a weird glitch in AZ's evaluation of pass moves at the end of the game sounds an awful lot like your first Tromp-Taylor pass exploit, i.e. it could probably be easily fixed with some finetuning. And in that case, perhaps Timbers et al would have found the 'circle' exploit in AZ after all if they had gotten past the end-game pass-related exploit? (This also suggests a weakness in the search procedures: it really ought to produce more than one exploit, preferably a whole list of distinct exploits. Some sort of PBT or novelty search approach perhaps...) Maybe a mechanistic interpretability approach would be better: if you could figure out where in KataGo it screws up the value estimate so badly, and what edits are necessar

This matches my impression. FAR could definitely use more funding. Although I'd still at the margin rather hire someone above our bar than e.g. have them earn-to-give and donate to us, the math is getting a lot closer than it used to be, to the point where those with excellent earning potential and limited fit for AI safety might well have more impact pursuing a philanthropic pathway.

I'd also highlight there's a serious lack of diversity in funding. As others in the thread have mentioned, the majority of people's funding comes (directly or indirectly) from... (read more)

5 Steve Byrnes 4mo
It’s easier than that—there are high-interest-rate free FDIC-eligible checking accounts. has a good list, although you might need to be a member to view it. As of this moment (2023-07-20), the top of their leaderboard is: Customers Bank (5.20% APY), BankProv (5.15%), BrioDirect (5.06%), UFB Direct (5.06%).

I still don't understand which of (1), (2), or (3) you're most worried about.

Sample efficiency isn't the main way I think about this topic so it's a bit difficult to answer. I find all these defeaters fairly plausible, but if I had to pick the central concern it'd be (3).

I tend to view ML training as a model taking a path through a space of possible programs. There's some programs that are capable and aligned with our interests; others that are capable but will actively pursue harmful goals; and of course many other programs that just don't do anything parti... (read more)

Oh, we're using terminology quite differently then. I would not call (a) reward hacking, as I view the model as being the reward (to the RL process), whereas humans are not providing reward at all (but rather some data that gets fed into a reward model's learning process). I don't especially care about what definitions we use here, but do wonder if this means we're speaking past each other in other areas as well.

Ah, that paper makes a lot more sense. A reward model was attractive in the original Deep RL From Human Preferences paper because the environment was complex and non-differentiable: using RL was a natural fit. It's always seemed a bit stranger to use RL for fine-tuning language models, especially in the prompt-completion setting where the "environment" is trivial. (RL becomes more natural when you start introducing external tools, or conversations with humans.)

I'll need to take a closer look at the paper, but it looks like they derive the DPO objective by ... (read more)

Thanks for the follow-up, this helps me understand your view!

At any given point, the reward model will be vulnerable to arbitrary adversarial attacks under sufficient optimization pressure, but we don't need arbitrary optimization against any given reward model. Like, each human update lets you optimize a bit more against the reward model which gets you the ability to get somewhat closer to the policy you actually want.

Sure, this feels basically right to me. My reframing of this would be that we could in principle do RL directly with feedback provided b... (read more)

1 Ryan Greenblatt 5mo
(Here's a possibly arcane remark. It's worth noting that I think always correct reward models are sufficient for high stakes alignment via runtime filtering (technically you just need to never give a very bad output decent reward). So, always correct reward models would be great if you could get them. Note that always correct could look pretty different than current adversarial robustness; in particular, you also need to worry about collusion and stuff like seizing physical control of the reward channel. I also think that avoiding high stakes failure via adversarial training and/or other robustness style tech seems generally good. I've been assuming we're talking about the low stakes alignment problem (aka outer alignment))
1 Ryan Greenblatt 5mo
I think this trend extrapolation argument seems fine in the absence of a specific defeater. In the case of AI takeover, there is a clear defeater to 'models have been getting more aligned over time'. I was trying to get at which specific defeater you thought overcame the trend extrapolation argument. Here's an attempt at an exhaustive list of defeaters: 1. Thinking sample efficiency would get worse in the future breaking the current trend 2. Thinking that the current 'trend extrapolated' sample efficiency would be insufficient or otherwise good to improve on the margin 3. Thinking that there would be important negative consequences of reward model non-robustness which aren't well described by sample efficiency (e.g., teaching the model to lie via first practicing on the reward model, or having an easier time exploring into bad behavior or something) It's also worth noting that if I was centrally interested in (2) I would push on that directly. (But you have other applications of robustness in mind, so this might not be that interesting.) I still don't understand which of (1), (2), or (3) you're most worried about. (Maybe (3) based on some argument I don't yet understand? I also don't see why hacking the reward model is dangerous which is maybe an important crux here.) Aside on alignment trend extrapolation: I'm also not really sure how to measure 'aligned' in a way that makes sense given that models have been also getting smarter. It seems plausible that alignment has been notably improving over time? Beyond this, the more natural trend extrapolation might be takeover risk. My guess is that the ex-ante takeover risk from GPT4 should have been like 0.1% and then the future/ongoing risk is more like 0.001% or something. And, the trend extrapolation doesn't look good here : ).
1 Ryan Greenblatt 5mo
Are you assuming that we can't collect human data online as the policy optimizes against the reward model? (People currently do collect data online to avoid getting hacked like this.) This case seems probably hopeless to me without very strong regularization (I think you agree with this being mostly hopeless), but it also seems easy to avoid by just collecting human data online.
1 Ryan Greenblatt 5mo
I don't really see why (b) leads to dangerous failures. It seems like failures should be totally benign and just result in somewhat lower production? Beyond this, it seems like this failure should happen early as it doesn't require clever models to occur, so by default there will be strong commercial incentives to resolve this. I agree it's an alignment failure in some sense which could be addressed by alignment technology. I just think it isn't very important to reduce from an X-risk/AI takeover perspective.
1 Ryan Greenblatt 5mo
Minor point, feel free to ignore. FWIW, I typically use 'reward hacking' to refer to just (a) here. I'd just call (b) 'poor reward model sample efficiency'. That said, I more centrally use 'reward hacking' to describe hacking a reward process based on outcomes via stuff like 'sensor tampering', but this is still a subset of RLHF: the subset where humans look at outcomes and then assess reward taking this into account.
1 Ryan Greenblatt 5mo
I originally linked to the wrong paper! : ( Here is the actual Direct Preference Optimization paper. (I guess I just googled something like 'DPO RL' and then didn't actually check that it was the right paper) Yikes, sorry for wasting your time.

To check my understanding, is your view something like:

  1. If the reward model isn't adversarially robust, then the RL component of RLHF will exploit it.

  2. These generations will show up in the data presented to the human. Provided the human is adversarially robust, then the human feedback will provide corrective signal to the reward model.

  3. The reward model will stop being vulnerable to those adversarial examples, although may still be vulnerable to other adversarial examples.

  4. If we repeat this iterative process enough times,... (read more)
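The iterative process in this list can be made concrete with a toy simulation. This is only a sketch under strong assumptions, not anyone's actual RLHF setup: the "policy" is a single number, `human_reward` is a hypothetical stand-in for an adversarially robust human, the reward model is a straight-line fit that is accurate only near the data it was trained on, and the fixed step size plays the role of a KL penalty. All names and constants are illustrative.

```python
import numpy as np

TARGET = 3.0  # hypothetical behaviour the human actually wants


def human_reward(x):
    """Stand-in for robust human feedback: peaks at TARGET."""
    return -(x - TARGET) ** 2


def fit_reward_model(samples, labels):
    """Linear reward model: locally accurate, globally exploitable."""
    slope, _intercept = np.polyfit(samples, labels, 1)
    return slope


def run(relabel, rounds=30, step=0.5, seed=0):
    """Optimize the policy against the reward model for `rounds` steps.

    relabel=True:  the human labels fresh samples from the current policy
                   every round and the reward model is refit.
    relabel=False: the reward model is fit once and then optimized against.
    """
    rng = np.random.default_rng(seed)
    policy_mean = 0.0
    slope = None
    for _ in range(rounds):
        if relabel or slope is None:
            samples = policy_mean + 0.1 * rng.standard_normal(16)
            slope = fit_reward_model(samples, human_reward(samples))
        # Bounded step in the direction the reward model prefers.
        policy_mean += step * np.sign(slope)
    return float(policy_mean)


if __name__ == "__main__":
    for relabel in (True, False):
        final = run(relabel)
        print(f"relabel={relabel}: policy={final:.2f}, "
              f"human reward={human_reward(final):.2f}")
```

In this toy, the reward model is never globally correct in either setting. With fresh labels each round the policy ends up hovering near the target, while optimizing against the single stale reward model drives it far past the target. Whether real reward models and policies behave this gracefully is exactly what is under discussion in this thread.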

1 Ryan Greenblatt 6mo
This is pretty close to my understanding, with one important objection. Thanks for responding and trying to engage with my perspective. Objection: I don't claim we'll necessarily ever get a fully robust reward model, just that the reward model will be mostly robust on average to the actual policy you use as long as human feedback is provided at a frequent enough interval. We never needed a good robust reward model which works on every input, we just needed a reward model which captured our local preferences about the actual current policy distribution sufficiently well. At any given point, the reward model will be vulnerable to arbitrary adversarial attacks under sufficient optimization pressure, but we don't need arbitrary optimization against any given reward model. Like, each human update lets you optimize a bit more against the reward model which gets you the ability to get somewhat closer to the policy you actually want. KL penalty is presumably an important part of the picture. My overall claim is just that normal RLHF is pretty fault tolerant and degrades pretty gracefully with lack of robustness. Sample efficiency seems fine now and progress continues: Beyond this, my view is considerably informed by 'sample efficiency seems fine in practice now and gets better, not worse, with more powerful models (from my limited understanding)'. It's possible this trend could reverse with sufficiently big models, but I'd find this surprising and don't see any particular reason to expect this. Technical advancement in RLHF is ongoing and will improve sample efficiency further. It seems to me like your views either imply that sample efficiency is low enough now that high quality RLHF currently can't compete with other ways of training AIs which are less safe but have cheaper reward signals (e.g., heavy training on automated outcomes based feedback or low quality human reward signal). Or possibly that this will happen in the future as models get more powerful (re

This is a good point, adversarial examples in what I called in the post the "main" ML system can be desirable even though we typically don't want them in the "helper" ML systems used to align the main system.

One downside to adversarial vulnerability of the main ML system is that it could be exploited by bad actors (whether human or other, misaligned AIs). But this might be fine in some settings: e.g. you build some airgapped system that helps you build the next, more robust and aligned AI. One could also imagine crafting adversarial example backdoors that ... (read more)

Right: if the agent has learned an inner objective of "do things similar to what humans do in the world at the moment I am currently acting", then it'd definitely be incentivized to do that. It's not rewarded by the outer objective for e.g. behavioral cloning on a fixed dataset, as installing a bunch of cameras would be punished by that loss (not something humans do) and changing human behavior wouldn't help as BC would still be on the dataset of pre-manipulation demos. That might be little comfort if you're worried about inner optimization, but most the oth... (read more)

Thanks, I'd missed that!

Curious if you have any high-level takeaways from that? Bigger models do better, clearly, but e.g. how low do you think we'll be able to get the error rate in the next 5-10 years given expected compute growth? Are there any follow-up experiments you'd like to see happen in this space?

Also could you clarify whether the setting was for adversarial training or just a vanilla model? "During training, adversarial examples for training are constructed by PGD attacker of 30 iterations" makes me think it's adversarial training but I could imagine this just being used for evals.

1 Ethan Caballero 9mo
The setting was adversarial training and adversarial evaluation. During training, PGD attacker of 30 iterations is used to construct adversarial examples used for training. During testing, the evaluation test set is an adversarial test set that is constructed via PGD attacker of 20 iterations. Experimental data of y-axis is obtained from Table 7 of; experimental data of x-axis is obtained from Figure 7 of
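For readers unfamiliar with the attack mentioned here, the following is a minimal sketch of an L-infinity PGD attacker. It is not the code from the cited experiments: the victim is a tiny logistic-regression model chosen so the input gradient can be written by hand, and `eps`, `step_size`, and the random start are assumed values. Only the iteration count (30 for training, 20 for evaluation) comes from the comment.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def loss_and_input_grad(w, b, x, y):
    """Cross-entropy loss of a logistic model and its gradient w.r.t. x."""
    p = sigmoid(x @ w + b)
    loss = -(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
    return loss, (p - y) * w


def pgd_attack(w, b, x, y, eps=0.3, step_size=0.05, iterations=30, seed=0):
    """Projected gradient ascent on the loss inside an L-inf ball around x."""
    rng = np.random.default_rng(seed)
    # Random start inside the ball (an assumed, commonly used choice).
    x_adv = x + rng.uniform(-eps, eps, size=x.shape)
    for _ in range(iterations):
        _, grad = loss_and_input_grad(w, b, x_adv, y)
        x_adv = x_adv + step_size * np.sign(grad)  # ascent step
        x_adv = np.clip(x_adv, x - eps, x + eps)   # project back into ball
    return x_adv


if __name__ == "__main__":
    w, b = np.array([1.0, -2.0, 0.5]), 0.0
    x, y = np.array([0.5, -0.5, 1.0]), 1.0
    x_adv = pgd_attack(w, b, x, y, iterations=30)
    print("clean loss:", loss_and_input_grad(w, b, x, y)[0])
    print("adversarial loss:", loss_and_input_grad(w, b, x_adv, y)[0])
```

In adversarial training, examples like `x_adv` would be generated against the current model at each step and used as training inputs; at evaluation time the same procedure (with 20 iterations in the setting described) builds the adversarial test set.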

Rachel did the bulk of the work on this post (well done!), I just provided some advice on the project and feedback on earlier manuscripts.

I wanted to share why I'm personally excited by this work in case it helps contextualize it for others.

We'd all like AI systems to be "corrigible", cooperating with us in correcting them. Cooperative IRL has been proposed as a solution to this. Indeed Dylan Hadfield-Menell et al show that CIRL is provably corrigible in a simple setting, the off-switch game.

Provably corrigible sounds great, but where there's a proof there... (read more)

I agree that in a fast takeoff scenario there's little reason for an AI system to operate within existing societal structures, as it can outgrow them quicker than society can adapt. I'm personally fairly skeptical of fast takeoff (<6 months say) but quite worried that society may be slow enough to adapt that even years of gradual progress with a clear sign that transformative AI is on the horizon may be insufficient.

In terms of humans "owning" the economy but still having trouble getting what they want, it's not obvious this is a worse outcome than the... (read more)

1 Johannes Treutlein 1y
I think such a natural progression could also lead to something similar to extinction (in addition to permanently curtailing humanity's potential). E.g., maybe we are currently in a regime where optimizing proxies harder still leads to improvements to the true objective, but this could change once we optimize those proxies even more. The natural progression could follow an inverted U-shape. E.g., take the marketing example. Maybe we will get superhuman persuasion AIs, but also AIs that protect us from persuasive ads and AIs that can provide honest reviews. It seems unclear whether these things would tend to balance out, or whether e.g. everyone will inevitably be exposed to some persuasion that causes irreparable damage. Of course, things could also work out better than expected, if our ability to keep AIs in check scales better than dangerous capabilities.

Thanks for this response, I'm glad to see more public debate on this!

The part of Katja's part C that I found most compelling was the argument that for a given AI system its best interests might be to work within the system rather than aiming to seize power. Your response argues that even if this holds true for AI systems that are only slightly superhuman, eventually we will cross a threshold where a single AI system can take over. This seems true if we hold the world fixed -- there is some sufficiently capable AI system that can take over the 2022 world. Bu... (read more)

3 Erik Jenner 1y
Interesting points, I agree that our response to part C doesn't address this well. AIs colluding with each other is one mechanism for how things could go badly (and I do think that such collusion becomes pretty likely at some point, though not sure it's the most important crux). But I think there are other possible reasons to worry as well. One of them is a fast takeoff scenario: with fast takeoff, the "AIs take part in human societal structures indefinitely" hope seems very unlikely to me, so 1 - p(fast takeoff) puts an upper bound on how much optimism we can derive from that. It's harder to make an airtight x-risk argument using fast takeoff, since I don't think we have airtight arguments for p(fast takeoff) being close to 1, but still important to consider if we're figuring out our overall best guess, rather than trying to find a reasonably compact argument for AI x-risk. (To put this differently: the strongest argument for AI x-risk will of course consider all the ways in which things could go wrong, rather than just one class of ways that happens to be easiest to argue for). A more robust worry (and what I'd probably rely on for a compact argument) is something like What Failure Looks Like Part 1: maybe AIs work within the system, in the sense that they don't take over the world in obvious, visible ways. They usually don't break laws in ways we'd notice, they don't kill humans, etc. On paper, humans "own" the entire economy, but in practice, they have an increasingly hard time achieving outcomes they want (though they might not notice that, at least for a while). This seems like a mechanism for AIs to collectively "take over the world" (in the sense that humans don't actually have control of the long-run trajectory of the universe anymore), even if no individual AI can break out of the system, and if AIs aren't great at collaborating against humanity. Addressing a few specific points: True to some extent, but I'd expect AI progress to be much faster than hu

Thanks for the quick reply! I definitely don't feel confident in the 20W number, I could believe 13W is true for more energy efficient (small) humans, in which case I agree your claim ends up being true some of the time (but as you say, there's little wiggle room). Changing it to 1000x seems like a good solution though which gives you plenty of margin for error.

This is a nitpick, but I don't think this claim is quite right (emphasis added)

 If a silicon-chip AGI server were literally 10,000× the volume, 10,000× the mass, and 10,000× the power consumption of a human brain, with comparable performance, I don’t think anyone would be particularly bothered—in particular, its electricity costs would still be below my local minimum wage!!

First, how much power does the brain use? 20 watts is StackExchange's answer, but I've struggled to find good references here. The appealingly named Appraising the brain's energy bu... (read more)

3 Steve Byrnes 1y
Thanks! Prior to your comment, the calculation in my head was 12 W × 10,000 × 10¢/kWh < $14.25/hr. The biggest difference from you is that I had heard 12 watts for brain energy consumption somewhere, and neglected to check it. I don’t recall where I had heard that, but for example, 12 W is in this article. They used the 20% figure, but for resting metabolic rate they cite this which says 1740 kcal/day (→16.9W) in men, 1348 kcal/day (→13.1W) in women, and the article turns 13.1W into 12W by sketchy rounding. That still presupposes that the 20% is valid in both genders. I traced the “20%” back to here which cites papers from 1957 & 1960 (and 1997 but that’s another secondary source). I downloaded the 1957 source (Kety, “The general metabolism of the brain in vivo”. In: Metabolism of the nervous system (Richter D, ed), pp 221–237), and it did cite studies of both men and women, and suggested that it scales with brain mass. I don’t understand everything that goes into the calculation, but they do say 20 W directly, so I certainly feel best about that number, but AFAICT it remains likely that the power would be lower for smaller-than-average people including most women. I’m still confused about the discrepancy with earlier in this paragraph, but I don’t want to spend more time on it. ¯\_(ツ)_/¯ My intended meaning was that the “power consumption” of “a silicon-chip AGI server” was all-in power consumption including HVAC, but I can see how a reader could reasonably interpret my words as excluding HVAC. I specifically said “my local minimum wage” because I happen to live in a state (Massachusetts) with high minimum wage of $14.25/hr. (The cost to the employer is of course a bit higher, thanks to legally-mandated employer taxes, sick days, sick-family days, etc.) Granted, we have unusually expensive electricity here in Massachusetts too, but people normally put servers where electricity is cheaper and talk to them over the internet. Anyway, I clearly messed up especially b
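The arithmetic in this exchange is easy to check. The sketch below just redoes it with the figures quoted in the thread (12 W vs. 20 W for the brain, a 10,000× or 1,000× multiplier, 10¢/kWh electricity, a $14.25/hr minimum wage, and the brain using 20% of resting metabolic rate). The function names are made up for illustration, and the only outside fact used is the standard conversion 1 kcal = 4184 J.

```python
KCAL_TO_JOULES = 4184.0
SECONDS_PER_DAY = 86_400.0
MIN_WAGE_USD_PER_HR = 14.25   # figure quoted in the thread
USD_PER_KWH = 0.10            # figure quoted in the thread


def brain_watts(resting_kcal_per_day, brain_fraction=0.20):
    """Brain power draw implied by a resting metabolic rate."""
    body_watts = resting_kcal_per_day * KCAL_TO_JOULES / SECONDS_PER_DAY
    return brain_fraction * body_watts


def hourly_cost_usd(watts, multiplier, usd_per_kwh=USD_PER_KWH):
    """Electricity cost per hour of a server drawing multiplier x watts."""
    kilowatts = watts * multiplier / 1000.0
    return kilowatts * usd_per_kwh  # kW x 1 h x $/kWh


if __name__ == "__main__":
    print(f"men:   {brain_watts(1740):.1f} W")
    print(f"women: {brain_watts(1348):.1f} W")
    for watts, mult in [(12, 10_000), (20, 10_000), (20, 1_000)]:
        cost = hourly_cost_usd(watts, mult)
        verdict = "below" if cost < MIN_WAGE_USD_PER_HR else "above"
        print(f"{watts} W x {mult:,}: ${cost:.2f}/hr ({verdict} minimum wage)")
```

Under these numbers the 10,000× claim holds at 12 W ($12/hr) but fails at 20 W ($20/hr), while 1,000× leaves a wide margin ($2/hr at 20 W).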

"The Floating Droid" example is interesting as there's a genuine ambiguity in the task specification here. In some sense that means there's no "good" behavior for a prompted imitation model here. (For an instruction-following model, we might want it to ask for clarification, but that's outside the scope of this contest.) But it's interesting the interpretation flips with model scale, and in the opposite direction to what I'd have predicted (doing EV calculations is harder so I'd have expected scale to increase not decrease EV answers.) Follow-up questions... (read more)

3 Rohin Shah 1y
I think the inverse scaling here is going from "random answer" to "win/loss detection" rather than "EV calculation" to "win/loss detection".

It's not clear to me how we can encourage rigor where effective without discouraging research on areas where rigor isn't currently practical. If anyone has ideas on this, I'd be very interested.

A rough heuristic I have is that if the idea you're introducing is highly novel, it's OK to not be rigorous. Your contribution is bringing this new, potentially very promising, idea to people's attention. You're seeking feedback on how promising it really is and where people are confused, which will be helpful for then later formalizing it and studying it more rigo... (read more)

1 David Scott Krueger 1y
I think part of this has to do with growing pains in the LW/AF community... When it was smaller it was more like an ongoing discussion with a few people and signal-to-noise wasn't as important, etc. 

Work that is still outside the academic Overton window can be brought into academia if it can be approached with the technical rigor of academia, and work that meets academic standards is much more valuable than work that doesn't; this is both because it can be picked up by the ML community, and because it's much harder to tell if you are making meaningful progress if your work doesn't meet these standards of rigor.

Strong agreement with this! I'm frequently told by people that you "cannot publish" on a certain area, but in my experience this is rarely true... (read more)

Presumably "too dismissive of speculative and conceptual research" is a direct consequence of increased emphasis on rigor. Rigor is to be preferred all else being equal, but all else is not equal.

It's not clear to me how we can encourage rigor where effective without discouraging research on areas where rigor isn't currently practical. If anyone has ideas on this, I'd be very interested.

I note that within rigorous fields, the downsides of rigor are not obvious: we can point to all the progress made; progress that wasn't made due to the neglect of conceptua... (read more)

2 David Scott Krueger 1y
Agree RE systemic blindspots, although the "algorithmic contribution" thing is sort of a known issue that a lot of senior people disagree with, IME.

A related dataset is Waterbirds, described in Sagawa et al (2020), where you want to classify birds as landbirds or waterbirds regardless of whether they happen to be on a water or land background.

The main difference from HappyFaces is that in Waterbirds the correlation between bird type and background is imperfect, although strong. By contrast, HappyFaces has perfect spurious correlation on the training set. Of course you could filter Waterbirds to make the spurious correlation perfect to get an equally challenging but more natural dataset.
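The filtering idea at the end of this comment is simple to sketch. The record type and field names below are hypothetical and are not the actual Waterbirds metadata format. The point is only that dropping every training example whose background disagrees with its bird type makes the spurious correlation perfect on the training split.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    image_id: str
    bird: str        # "landbird" or "waterbird"
    background: str  # "land" or "water"


# Which background each bird type is spuriously correlated with.
MATCHING_BACKGROUND = {"landbird": "land", "waterbird": "water"}


def is_aligned(example):
    """True if the background agrees with the bird type."""
    return MATCHING_BACKGROUND[example.bird] == example.background


def make_perfectly_spurious(train_examples):
    """Keep only aligned examples so background perfectly predicts label."""
    return [ex for ex in train_examples if is_aligned(ex)]


if __name__ == "__main__":
    train = [
        Example("a", "landbird", "land"),
        Example("b", "landbird", "water"),   # minority group, dropped
        Example("c", "waterbird", "water"),
        Example("d", "waterbird", "land"),   # minority group, dropped
    ]
    print([ex.image_id for ex in make_perfectly_spurious(train)])
```

The test split would be left unfiltered, so that it still contains the mismatched birds a background-only classifier gets wrong.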

My sense is that Stuart assuming there's an initial-specified reward function is a simplification, not a key part of the plan, and that he'd also be interested in e.g. generalizing a reward function learned from other sources of human feedback like preference comparison.

IRD would do well on this problem because it has an explicit distribution over possible reward functions, but this isn't really that unique to IRD -- Bayesian IRL or preference comparison would have the same property.

2 Rohin Shah 2y
Yeah, I agree with that. (I don't think we have experience with deep Bayesian versions of IRL / preference comparison at CHAI, and I was thinking about advice on who to talk to)

I googled "model-based RL Atari" and the first hit was this which likewise tries to learn the reward function by supervised learning from observations of past rewards (if I understand correctly)

Ah, the "model-based using a model-free RL algorithm" approach :) They learn a world model using supervised learning, and then use PPO (a model-free RL algorithm) to train a policy in it. It sounds odd but it makes sense: you hopefully get much of the sample efficiency of model-based training, while still retaining the state-of-the-art results of model-free RL. You'... (read more)

Thanks for the clarification! I agree if the planner does not have access to the reward function then it will not be able to solve it. Though, as you say, it could explore more given the uncertainty.

Most model-based RL algorithms I've seen assume they can evaluate the reward functions in arbitrary states. Moreover, it seems to me like this is the key thing that lets rats solve the problem. I don't see how you solve this problem in general in a sample-efficient manner otherwise.

One class of model-based RL approaches is based on [model-predictive control](ht... (read more)

2 Steve Byrnes 3y
Hmm. AlphaZero can evaluate the true reward function in arbitrary states. MuZero can't—it tries to learn the reward function by supervised learning from observations of past rewards (if I understand correctly). I googled "model-based RL Atari" and the first hit was this which likewise tries to learn the reward function by supervised learning from observations of past rewards (if I understand correctly). I'm not intimately familiar with the deep RL literature, I wouldn't know what's typical and I'll take your word for it, but it does seem that both possibilities are out there. Anyway, I don't think the neocortex can evaluate the true reward function in arbitrary states, because it's not a neat mathematical function, it involves messy things like the outputs of millions of pain receptors, hormones sloshing around, the input-output relationships of entire brain subsystems containing tens of millions of neurons, etc. So I presume that the neocortex tries to learn the reward function by supervised learning from observations of past rewards—and that's the whole thing with TD learning and dopamine. I added a new sub-bullet at the top to clarify that it's hard to explain by RL unless you assume the planner can query the ground-truth reward function in arbitrary hypothetical states. And then I also added a new paragraph to the "other possible explanations" section at the bottom saying what I said in the paragraph just above. Thank you. Well, the rats are trying to do the rewarding thing after zero samples, so I don't think "sample-efficiency" is quite the right framing. In ML today, the reward function is typically a function of states and actions, not "thoughts". In a brain, the reward can depend directly on what you're imagining doing or planning to do, or even just what you're thinking about. That's my proposal here. Well, I guess you could say that this is still a "normal MDP", but where "having thoughts" and "having ideas" etc. are part of the state / action space.

I'm a bit confused by the intro saying that RL can't do this, especially since you later on say the neocortex is doing model-based RL. I think current model-based RL algorithms would likely do fine on a toy version of this task, with e.g. a 2D binary state space (salt deprived or not; salt water or not) and two actions (press lever or no-op). The idea would be:

  - Agent explores by pressing lever, learns transition dynamics that pressing lever => spray of salt water.

  - Planner concludes that any sequence of actions involving pressing lever wi... (read more)
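A minimal sketch of the toy task described above, under explicitly labeled assumptions: the state is the pair (salt deprived, salt water present), the agent learns only how each action affects the salt-water bit, and the planner is allowed to evaluate a hand-written `reward` function in states it has never visited. That last assumption is the contested one in this thread. The reward values are illustrative, not taken from the post.

```python
from collections import Counter, defaultdict

PRESS, NOOP = "press_lever", "no_op"


def true_step(state, action):
    """Environment dynamics, unknown to the agent: lever => salt water."""
    deprived, _salt_water = state
    return (deprived, action == PRESS)


def reward(state):
    """ASSUMPTION: the planner may query this in arbitrary states."""
    deprived, salt_water = state
    if not salt_water:
        return 0.0
    return 1.0 if deprived else -1.0  # salt water is only good when deprived


def learn_lever_model(explored_actions, start_state=(False, False)):
    """Learn action -> most frequent salt-water outcome from experience."""
    counts = defaultdict(Counter)
    state = start_state
    for action in explored_actions:
        state = true_step(state, action)
        counts[action][state[1]] += 1
    return {a: c.most_common(1)[0][0] for a, c in counts.items()}


def plan(model, state):
    """One-step lookahead using the learned dynamics and queried reward."""
    deprived, _salt_water = state

    def predicted_reward(action):
        return reward((deprived, model[action]))

    return max((PRESS, NOOP), key=predicted_reward)


if __name__ == "__main__":
    # Exploration happens only while NOT salt deprived.
    model = learn_lever_model([PRESS, NOOP, PRESS, NOOP])
    print("not deprived:", plan(model, (False, False)))
    print("deprived:    ", plan(model, (True, False)))
```

With this setup the planner presses the lever on its first encounter with salt deprivation, without ever having been rewarded for it. If `reward` had to be learned from observed rewards instead, the agent would have no data about the deprived state and the plan would not follow.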

Good question! Sorry I didn't really explain. The missing piece is "the planner will conclude this has positive reward". The planner has no basis for coming up with this conclusion, that I can see.

In typical RL as I understand it, regardless of whether it's model-based or model-free, you learn about what is rewarding by seeing the outputs of the reward function. Like, if an RL agent is playing an Atari game, it does not see the source code that calculates the reward function. It can try to figure out how the reward function works, for sure, but when it doe... (read more)

Thanks for the post, this is my favourite formalisation of optimisation so far!

One concern I haven't seen raised so far is that the definition seems very sensitive to the choice of configuration space. As an extreme example, for any given system, I can always augment the configuration space with an arbitrary number of dummy dimensions, and choose the dynamics such that these dummy dimensions always get set to all zero after each time step. Now, I can make the basin of attraction arbitrarily large, while the target configuration set remains a fixed si... (read more)

I feel like there are three facets to "norms" vs. values, which are bundled together in this post but which could in principle be decoupled. The first is representing what not to do versus what to do. This is reminiscent of the distinction between positive and negative rights, and indeed most societal norms (e.g. human rights) are negative, but not all (e.g. helping an injured person in the street is a positive right). If the goal is to prevent catastrophe, learning the 'negative' rights is probably more important, but it seems to me t... (read more)

1 Rohin Shah 5y
Yeah, agreed with all of that, thanks for the comment. You could definitely try to figure out each of these things individually, eg. learning constraints that can be used with Constrained Policy Optimization is along the "what not to do" axis, and a lot of the multiagent RL work is looking at how we can get some norms to show up with decentralized training. But I feel a lot more optimistic about research that is trying to do all three things at once, because I think the three aspects do interact with each other. At least, the first two feel very tightly linked, though they probably can be separated from the multiagent setting.