Mod note: I removed Dan H as a co-author since it seems like that was more used as convenience for posting it to the AI Alignment Forum. Let me know if you want me to revert.
If the difference between these papers is: we do activations, they do weights, then I think that warrants more conceptual and empirical comparisons.
Yeah, it's totally possible that, as I said, there is a specific other paper that is important to mention or where the existing comparison seems inaccurate. This seems quite different from a generic "please have more thorough related work sections" request like the one you make in the top-level comment (which my guess is was mostly based on your misreading of the post and thinking the related work section only spans two paragraphs).
The level of comparison between the present paper and this paper seems about the same as I see in papers you have been a co-author in.
E.g. in https://arxiv.org/pdf/2304.03279.pdf the Related Works section is basically just a list of papers, with maybe half a sentence describing their relation to the paper. This seems normal and fine, and I don't see even papers you are a co-author on doing something substantively different here (this is again separate from whether there are any important papers omitted from the list of related works, or whether any s...
I don't understand this comment. I did a quick count of related works that are mentioned in the "Related Works" section (and the footnotes of that section) and got around 10 works, so seems like this is meeting your pretty arbitrarily established bar, and there are also lots of footnotes and references to related work sprinkled all over the post, which seems like the better place to discuss related work anyways.
I am not familiar enough with the literature to know whether this post is omitting any crucial pieces of related work, but the relevant section of ...
Yeah, does sure seem like we should update something here. I am planning to spend more time on AIAF stuff soon, but until then, if someone has a drop-in paragraph, I would probably lightly edit it and then just use whatever you send me/post here.
This is not commenting on the substance of this post, but I really feel like the title of this post should be "The self-alignment problem".
Like, we talk about "The alignment problem" not "The unalignment problem". The current title makes me think that the problem is that I somehow have to unalign myself, which doesn't really make sense.
Direct optimizers typically have a very specific architecture requiring substantial iteration and search. Luckily, it appears that our current NN architectures, with a fixed-length forward pass and a lack of recurrence or support for the branching computations required in tree search, make the implementation of powerful mesa-optimizers inside the network quite challenging.
I think this is being too confident on what "direct optimizers" require.
There is an ontology, mostly inherited from the graph-search context, in which "direct optimizers" require ...
Perhaps I've simply been misreading John, and he's been intending to say "I have some beliefs, and separately I have some suggestive technical results, and they feel kinda related to me! Which is not to say that any onlooker is supposed to be able to read the technical results and then be persuaded of any of my claims; but it feels promising and exciting to me!".
For what it's worth, I ask John about once every month or two about his research progress and his answer has so far been (paraphrased) "I think I am making progress. I don't think I have anything to...
John has also made various caveats to me, of the form "this field is pre-paradigmatic and the math is merely suggestive at this point". I feel like he oversold his results even so.
Part of it is that I get the sense that John didn't understand the limitations of his own results--like the fact that the telephone theorem only says anything in the infinite case, and the thing it says then does not (in its current form) arise as a limit of sensible things that can be said in finite cases. Or like the fact that the alleged interesting results of the gKPD theorem...
This is just false, because it is not taking into account the cost of doing expected value maximization, since giving consistent preferability scores is just very expensive and hard to do reliably.
I do really want to put emphasis on the parenthetical remark "(at least in some situations, though they may not arise)". Katja is totally aware that the coherence arguments require a bunch of preconditions that are not guaranteed to be the case for all situations, or even any situation ever, and her post is about how there is still a relevant argument here.
Crossposting this comment from the EA Forum:
Nuno says:
I appreciate the whole post. But I personally really enjoyed the appendix. In particular, I found it informative that Yudkowsky can speak/write with that level of authoritativeness, confidence, and disdain for others who disagree, and still be wrong (if this post is right).
I respond:
(if this post is right)
The post does actually seem wrong though.
I expect someone to write a comment with the details at some point (I am pretty busy right now, so can only give a quick meta-level gleam), but...
Copying my response from the EA forum:
(if this post is right)
The post does actually seem wrong though.
Glad that I added the caveat.
Also, the title of "there are no coherence arguments" is just straightforwardly wrong. The theorems cited are of course real theorems, they are relevant to agents acting with a certain kind of coherence, and I don't really understand the semantic argument that is happening where it's trying to say that the cited theorems aren't talking about "coherence", when like, they clearly are.
Well, part of the semantic nuance is tha...
I’m following previous authors in defining ‘coherence theorems’ as
theorems which state that, unless an agent can be represented as maximizing expected utility, that agent is liable to pursue strategies that are dominated by some other available strategy.
On that definition, there are no coherence theorems. VNM is not a coherence theorem, nor is Savage’s Theorem, nor is Bolker-Jeffrey, nor are Dutch Book Arguments, nor is Cox’s Theorem, nor is the Complete Class Theorem.
there are theorems that are relevant to the question of agent coherence
I'd have no proble...
Yep, I think it's pretty plausible this is just a data-quality issue, though I find myself somewhat skeptical of this. Maybe worth a bet?
I would be happy to bet that conditional on them trying to solve this with more supervised training and no RLHF, we are going to see error modes substantially more catastrophic than current Chat-GPT.
Yeah, this is basically my point. Not sure whether you are agreeing or disagreeing. I was specifically quoting Paul's comment saying "I've seen only modest qualitative differences" in order to disagree and say "I think we've now seen substantial qualitative differences".
We have had 4chan play around with Chat-GPT for a while, with much less disastrous results than what happened when they got access to Sydney.
...It is not news to anyone here that average-case performance on proxy metrics on some tame canned datasets may be unrelated to out-of-dis
I think the qualitative difference between the supervised tuning done in text-davinci-002 and the RLHF in text-davinci-003 is modest (e.g. I've seen head-to-head comparisons suggesting real but modest effects on similar tasks).
Ok, I think we might now have some additional data on this debate. It does indeed look like to me that Sydney was trained with the next best available technology after RLHF, for a few months, at least based on Gwern's guesses here: https://www.lesswrong.com/posts/jtoPawEhLNXNxvgTT/bing-chat-is-blatantly-aggressively-misaligned?commen...
Relevant piece of data: https://www.reuters.com/technology/chatgpt-sets-record-fastest-growing-user-base-analyst-note-2023-02-01/?fbclid=IwAR3KTBnxC_y7n0TkrCdcd63oBuwnu6wyXcDtb2lijk3G-p9wdgD9el8KzQ4
...Feb 1 (Reuters) - ChatGPT, the popular chatbot from OpenAI, is estimated to have reached 100 million monthly active users in January, just two months after launch, making it the fastest-growing consumer application in history, according to a UBS study on Wednesday.
The report, citing data from analytics firm Similarweb, said an average of about 13 million u
I didn't realize how broadly you were defining AI investment. If you want to say that e.g ChatGPT increased investment by $10B out of $200-500B, so like +2-5%, I'm probably happy to agree (and I also think it had other accelerating effects beyond that).
Makes sense, sorry for the miscommunication. I really didn't feel like I was making a particularly controversial claim with the $10B, so was confused why it seemed so unreasonable to you.
I do think those $10B are going to be substantially more harmful for timelines than other money in AI, because I do ...
How much total investment do you think there is in AI in 2023?
My guess is total investment was around the $200B - $500B range, with about $100B of that into new startups and organizations, and around $100-$400B of that in organizations like Google and Microsoft outside of acquisitions. I have pretty high uncertainty on the upper end here, since I don't know what fraction of Google's revenue gets reinvested again into AI, how much Tesla is investing in AI, how much various governments are investing, etc.
...How much variance do you think there is in the level o
I didn't realize how broadly you were defining AI investment. If you want to say that e.g ChatGPT increased investment by $10B out of $200-500B, so like +2-5%, I'm probably happy to agree (and I also think it had other accelerating effects beyond that).
I would guess that a 2-5% increase in total investment could speed up AGI timelines 1-2 weeks depending on details of the dynamics, like how fast investment was growing, how much growth is exogenous vs endogenous, diminishing returns curves, importance of human capital, etc.. If you mean +2-5% investment in ...
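As a quick check of the arithmetic in this exchange, here is a minimal sketch that computes what share of total AI investment a $10B increase represents, using the $200B-$500B range quoted above. The function name is made up for illustration, and the dollar figures are the thread's rough guesses, not measured data.

```python
# Share of total AI investment represented by a $10B increase,
# using the rough $200B-$500B range quoted in the thread.
# All figures are in billions of dollars and are guesses, not data.

def share_percent(increase_b: float, total_b: float) -> float:
    """Return `increase_b` as a percentage of `total_b`."""
    return 100.0 * increase_b / total_b

increase_b = 10.0
low_total_b, high_total_b = 200.0, 500.0

# The smaller total gives the larger share, and vice versa.
upper = share_percent(increase_b, low_total_b)
lower = share_percent(increase_b, high_total_b)

print(f"+{lower:.0f}% to +{upper:.0f}%")  # prints "+2% to +5%"
```

This only reproduces the "+2-5%" figure; it says nothing about how such an increase translates into timelines, which is the part the two commenters actually disagree about.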
I think it's unlikely that the reception of ChatGPT increased OpenAI's valuation by $10B, much less investment in OpenAI, even before thinking about replaceability.
Note that I never said this, so I am not sure what you are responding to. I said Chat-GPT increases investment in AI by $10B, not that it increased investment into specifically OpenAI. Companies generally don't have perfect moats. Most of that increase in investment is probably in internal Google allocation and in increased investment into the overall AI industry.
I think the effect would have been very similar if it had been trained via supervised learning on good dialogs
I don't currently think this is the case, and seems like the likely crux. In general it seems that RLHF is substantially more flexible in what kind of target task it allows you to train for, which is the whole reason for why you are working on it, and at least my model of the difficulty of generating good training data for supervised learning here is that it would have been a much greater pain, and would have been much harder to control in vario...
...I don't currently think this is the case, and seems like the likely crux. In-general it seems that RLHF is substantially more flexible in what kind of target task it allows you to train, which is the whole reason for why you are working on it, and at least my model of the difficulty of generating good training data for supervised learning here is that it would have been a much greater pain, and would have been much harder to control in various fine-tuned ways (including preventing the AI from saying controversial things), which had been the biggest problem
RLHF is just not that important to the bottom line right now. Imitation learning works nearly as well, other hacky techniques can do quite a lot to fix obvious problems, and the whole issue is mostly second order for the current bottom line.
I am very confused why you think this, just right after the success of Chat-GPT, where approximately the only difference from GPT-3 was the presence of RLHF.
My current best guess is that Chat-GPT alone, via sparking an arms-race between Google and Microsoft, and by increasing OpenAI's valuation, should be modeled a...
I am very confused why you think this, just right after the success of Chat-GPT, where approximately the only difference from GPT-3 was the presence of RLHF.
I think the qualitative difference between the supervised tuning done in text-davinci-002 and the RLHF in text-davinci-003 is modest (e.g. I've seen head-to-head comparisons suggesting real but modest effects on similar tasks).
I think the much more important differences are:
my guess is most of that success is attributable to the work on RLHF, since that was really the only substantial difference between Chat-GPT and GPT-3
I don't think this is right -- the main hype effect of chatGPT over previous models feels like it's just because it was in a convenient chat interface that was easy to use and free. My guess is that if you did a head-to-head comparison of RLHF and kludgey random hacks involving imitation and prompt engineering, they'd seem similarly cool to a random journalist / VC, and generate similar excitement.
I think this is my second-favorite post in the MIRI dialogues (for my overall review see here).
I think this post was valuable to me in a much more object-level way. I think this post was the first post that actually just went really concrete on the current landscape of efforts in the domain of AI Notkilleveryonism and talked concretely about what seems feasible for different actors to achieve, and what isn't, in a way that parsed for me, and didn't feel either like something obviously political, or delusional.
I didn't find the part about differ...
I feel like this post is the best current thing to link to for understanding the point of coherence arguments in AI Alignment, which I think are really crucial, and even in 2023 I still see lots of people make bad arguments either overextending the validity of coherence arguments, or dismissing coherence arguments completely in an unproductive way.
I wrote up a bunch of my high-level views on the MIRI dialogues in this review, so let me say some things that are more specific to this post.
Since the dialogues were written, I keep coming back to the question of the degree to which consequentialism is a natural abstraction that will show up in AI systems we train, and while this dialogue had some frustrating parts where communication didn't go perfectly, I still think it has some of the best intuition pumps for how to think about consequentialism in AI systems.
The other part I liked the most w...
This was quite a while ago, probably over 2 years, though I do feel like I remember it quite distinctly. I guess my model of you has updated somewhat here over the years, and now is more interested in heads-down work.
I think I was actually helping Robby edit some early version of this post a few months before it was posted on LessWrong, so I think my exposure to it was actually closer to ~18-20 months ago.
I do think that still means I set a lot of my current/recent plans into motion before this was out, and your post is appreciated.
I think this post might be the best one of all the MIRI dialogues. I also feel confused about how to relate to the MIRI dialogues overall.
A lot of the MIRI dialogues consist of Eliezer and Nate saying things that seem really important and obvious to me, and a lot of my love for them comes from a feeling of "this actually makes a bunch of the important arguments for why the problem is hard". But the nature of the argument is kind of closed off.
Like, I agree with these arguments, but like, if you believe these arguments, having traction on AI Alignment...
If it's a mistake you made over the last two years, I have to say in your defense that this post didn't exist 2 years ago.
I've thought a good amount about Finite Factored Sets in the past year or two, but I do sure keep going back to thinking about the world primarily in the form of Pearlian causal influence diagrams, and I am not really sure why.
I do think this one line by Scott at the top gave me at least one pointer towards what was happening:
but I'm trained as a combinatorialist, so I'm giving a combinatorics talk upfront.
In the space of mathematical affinities, combinatorics is among the branches of math I feel most averse to, and I think that explains a good...
I think this is a fun idea, but also, I think these explanations are mostly actually pretty bad, and at least my inner Eliezer is screaming at most of these rejected outputs, as well as the reasoning behind them.
I also don't think it provides any more substantial robustness guarantees than the existing fine-tuning, though I do think if we train the model to be a really accurate Eliezer-simulator, that this approach has more hope (but that's not the current training objective of either base-GPT3 or the helpful assistant model).
Promoted to curated: I found engaging with this post quite valuable. I think in the end I disagree with the majority of arguments in it (or at least think they omit major considerations that have previously been discussed on LessWrong and the AI Alignment Forum), but I found thinking through these counterarguments and considering each one of them seriously a very valuable thing to do to help me flesh out my models of the AI X-Risk space.
IMO a big part of why mechanistic interp is getting a lot of attention in the x-risk community is that neural networks are surprisingly more interpretable than we might have naively expected and there's a lot of shovel-ready work in this area. I think if you asked many people three years ago, they would've said that we'd never find a non-trivial circuit in GPT-2-small, a 125m parameter model; yet Redwood has reverse engineered the IOI circuit in GPT-2-small. Many people were also surprised by Neel Nanda's modular addition work.
I don't think I've seen ma...
Oh, huh, I think this moderation action makes me substantially less likely to comment further on your posts, FWIW. It's currently well within your rights to do so, and I am on the margin excited about more people moderating things, but I feel hesitant about participating with the current level of norm-specification + enforcement.
I also turned my strong-upvote into a small-upvote, since I have less trust in the comment section surfacing counterarguments, which feels particularly sad for this post (e.g. I was planning to respond to your comment with examples of pa...
I appreciate the effort and strong-upvoted this post because I think it's following a good methodology of trying to build concrete gear-level models and concretely imagining what will happen, but also think this is really very much not what I expect to happen, and in my model of the world is quite deeply confused about how this will go (mostly by vastly overestimating the naturalness of the diamond abstraction, underestimating convergent instrumental goals and associated behaviors, and relying too much on the shard abstraction). I don't have time to write a whole response, but in the absence of a "disagreevote" on posts am leaving this comment.
Thanks. Am interested in hearing more at some point.
I also want to note that insofar as this extremely basic approach ("reward the agent for diamond-related activities") is obviously doomed for reasons the community already knew about, then it should be vulnerable to a convincing linkpost comment which points out a fatal, non-recoverable flaw in my reasoning (like: "TurnTrout, you're ignoring the obvious X and Y problems, linked here:"). I'm posting this comment as an invitation for people to reply with that, if appropriate![1]
And if there is nothing...
Oh, I do think a bunch of my problems with WebGPT is that we are training the system on direct internet access.
I agree that "train a system with internet access, but then remove it, then hope that it's safe", doesn't really make much sense. In general, I expect bad things to happen during training, and separately, a lot of the problems that I have with training things on the internet is that it's an environment that seems like it would incentivize a lot of agency and make supervision really hard because you have a ton of permanent side effects.
Here is an example quote from the latest OpenAI blogpost on AI Alignment:
Language models are particularly well-suited for automating alignment research because they come “preloaded” with a lot of knowledge and information about human values from reading the internet. Out of the box, they aren’t independent agents and thus don’t pursue their own goals in the world. To do alignment research they don’t need unrestricted access to the internet. Yet a lot of alignment research tasks can be phrased as natural language or coding tasks.
This sounds super straig...
I think the smiling example is much more analogous than you are making it out here. I think the basic argument for "this just encourages taking control of the reward" or "this just encourages deception" goes through the same way.
Like, RLHF is not some magical "we have definitely figured out whether a behavior is really good or bad" signal, it's historically been just some contractors thinking for like a minute about whether a thing is fine. I don't think there is less bayesian evidence conveyed by people smiling (like, the variance in smiling is greater th...
and in particular the abstraction which it seems John is using, where making progress on outer alignment makes almost no difference to inner alignment
I am confused. How does RLHF help with outer alignment? Isn't optimizing for human approval the classical outer-alignment problem? (e.g. tiling the universe with smiling faces)
I don't think the argument for RLHF runs through outer alignment. I think it has to run through using it as a lens to study how models generalize, and eliciting misalignment (i.e. the points about empirical data that you mentioned, I just don't understand where the inner/outer alignment distinction comes from in this context)
I agree that having many shots is helpful, but lacking them is not the core difficulty (just as having many shots to launch a rocket doesn't help you very much if you have no idea how rockets work).
I do really feel like it would have been really extremely hard to build rockets if we had to get it right on the very first try.
I think for rockets the fact that it is so costly to experiment with stuff, explains the majority of the difficulty of rocket engineering. I agree you also have very little chance to build a successful space rocket without having a g...
At a sufficiently high level of abstraction, I agree that "cost of experimenting" could be seen as the core difficulty. But at a very high level of abstraction, many other things could also be seen as the core difficulty, like "our inability to coordinate as a civilization" or "the power of intelligence" or "a lack of interpretability", etc. Given this, John's comment seemed mainly like a rhetorical flourish rather than a contentful claim about the structure of the difficult parts of the alignment problem.
Also, I think that "on our first try" thing isn't ...
I think the story would be way different if the actual risk posed by WebGPT was meaningful (say if it were driving >0.1% of the risk of OpenAI's activities).
Huh, I definitely expect it to drive >0.1% of OpenAI's activities. Seems like the WebGPT stuff is pretty close to commercial application, and is consuming much more than 0.1% of OpenAI's research staff, while probably substantially increasing OpenAI's ability to generally solve reinforcement learning problems. I am confused why you would estimate it at below 0.1%. 1% seems more reasonable to m...
I think the direct risk of OpenAI's activities is overwhelmingly dominated by training new smarter models and by deploying the public AI that could potentially be used in unanticipated ways.
I agree that if we consider indirect risks broadly (including e.g. "this helps OpenAI succeed or raise money and OpenAI's success is dangerous") then I'd probably move back towards "what % of OpenAI's activities is it."
I believe the most important drivers of catastrophic misalignment risk are models that optimize in ways humans don't understand or are deceptively aligned. So the great majority of risk comes from actions that accelerate those events, and especially making models smarter. I think your threat model here is quantitatively wrong, and that it's an important disagreement.
I agree with this! But I feel like this kind of reinforcement learning on a basically unsupervisable action-space while interfacing with humans and getting direct reinforcement on approval i...
But people attempting to box smart unaligned AIs, or believing that boxed AIs are significantly safer because they can't access the internet, seems to me like a bad situation. An AI smart enough to cause risk with internet access is very likely to be able to cause risk anyway, and at best you are creating a super unstable situation where a lab leak is catastrophic.
I do think we are likely to be in a bad spot, and talking to people at OpenAI, Deepmind and Anthropic (e.g. the places where most of the heavily-applied prosaic alignment work is happening), I...
If you thought that researchers working on WebGPT were shortening timelines significantly more efficiently than the average AI researcher, then the direct harm starts to become relevant compared to opportunity costs.
Yeah, my current model is that WebGPT feels like some of the most timelines-reducing work that I've seen (as has most of OpenAI's work). In general, OpenAI seems to have been the organization that has most shortened timelines in the last 5 years, with the average researcher seeming ~10x more efficient at shortening timelines than even researc...
I think almost all of the acceleration comes from either products that generate $ and hype and further investment, or more directly from scaleup to more powerful models. I think "We have powerful AI systems but haven't deployed them to do stuff they are capable of" is a very short-term kind of situation and not particularly desirable besides.
I'm not sure what you are comparing RLHF or WebGPT to when you say "paradigm of AIs that are much harder to align." I think I probably just think this is wrong, in that (i) you are comparing to pure generative modeling...
Yeah, I agree that I am doing reasoning on people's motivations here, which is iffy and given the pushback I will be a bit more hesitant to do, but also like, in this case reasoning about people's motivations is really important, because what I care about is what the people working at OpenAI will actually do when they have extremely powerful AI in their hands, and that will depend a bunch on their motivations.
I am honestly a bit surprised to see that WebGPT was as much driven by people who I do know reasonably well and who seem to be driven primarily by sa...
I don't think "your AI wants to kill you but it can't get out of the box so it helps you with alignment instead" is the mainline scenario. You should be building an AI that wouldn't stab you if your back was turned and it was holding a knife, and if you can't do that then you should not build the AI.
That's interesting. I do think this is true about your current research direction (which I really like about your research and I do really hope we can get there), but when I e.g. talk to Carl Shulman he (if I recall correctly) said things like "we'll just h...
WebGPT is approximately "reinforcement learning on the internet".
There are some very minimal safeguards implemented (search via Bing API, but the AI can click on arbitrary links), but I do indeed think "reinforcement learning on the internet" is approximately the worst direction for modern AI to go in terms of immediate risks.
I don't think connecting GPT-3 to the internet is risky at current capability levels, but pushing AI in the direction of just hooking up language models with reinforcement learning to a browser seems like one of the worst directions f...
The primary job of OpenAI is to be a clear leader here and do the obvious good things to keep an AI safe, which will hopefully include boxing it. Saying "well, seems like the cost is kinda high so we won't do it" seems like exactly the kind of attitude that I am worried will cause humanity to go extinct.
The main group of people working on alignment (other than interpretability) at OpenAI at the time of the Anthropic split at the end of 2020 was the Reflection team, which has since been renamed to the Alignment team. Of the 7 members of the team at that time (who are listed on the summarization paper), 4 are still working at OpenAI, and none are working at Anthropic.
I think this is literally true, but at least as far as I know is not really conveying the underlying dynamics and so I expect readers to walk away with the wrong impression.
Again, I might be...
Huh, I thought you agreed with statements like "if we had many shots at AI Alignment and could get reliable empirical feedback on whether an AI Alignment solution is working, AI Alignment would be much easier".
My model is that John is talking about "evidence on whether an AI alignment solution is sufficient", and you understood him to say "evidence on whether the AI Alignment problem is real/difficult". My guess is you both agree on the former, but I am not confident.
WebGPT seemed like one of the most in-expectation harmful projects that OpenAI has worked on, with no (to me) obvious safety relevance, so my guess is I would still mostly categorize the things you list under the first misconception as capabilities research. InstructGPT also seems to be almost fully capabilities research (like, I agree that there are some safety lessons to be learned here, but it seems somewhat clear to me that people are working on WebGPT and InstructGPT primarily for capabilities reasons, not for existential-risk-from-AI reasons)
(Edit: M...
like, I agree that there are some safety lessons to be learned here, but it seems somewhat clear to me that people are working on WebGPT and InstructGPT primarily for capabilities reasons, not for existential-risk-from-AI reasons
This also seems like an odd statement - it seems reasonable to say "I think the net effect of InstructGPT is to boost capabilities" or even "If someone was motivated by x-risk it would be poor prioritisation/a mistake to work on InstructGPT". But it feels like you're assuming some deep insight into the intention behind the people w...
A while ago I got most of the way to set up a feature on LW/AIAF that would export LW/AIAF posts to a nicely formatted academic-looking PDF that is linkable. I ended up running into a hurdle somewhat close to the end and shelved the feature, but if there is a lot of demand here, I could probably finish up the work, which would make this process even easier.
A while ago I made a very quick Python script to pull Markdown from LW, then use pandoc to export to a PDF (because I prefer reading physical papers and LaTeX formatting). I used it somewhat regularly for ~6 months and found that it was good enough for my purposes. I assume the LW developers could write something much better, but I've thrown it into this GitHub [repo](https://github.com/juesato/lw_pdf_exporter/tree/main) in case it's of help or interest.
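For readers curious what the pandoc half of such a workflow might look like, here is a minimal sketch. It is not the code from the linked repo, and it deliberately leaves out the step of fetching Markdown from LW, since that depends on details not given here. It assumes you already have the post's Markdown as a string, that `pandoc` is on your PATH, and that a LaTeX engine is installed (pandoc needs one for PDF output by default). The function names and the `stem` parameter are made up for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def build_pandoc_command(md_path: Path, pdf_path: Path) -> list[str]:
    """Build the basic pandoc invocation: `pandoc input.md -o output.pdf`."""
    return ["pandoc", str(md_path), "-o", str(pdf_path)]


def export_markdown_to_pdf(markdown_text: str, out_dir: Path, stem: str = "post") -> Path:
    """Write Markdown to disk and convert it to a PDF with pandoc.

    Hypothetical helper: assumes pandoc and a LaTeX engine are installed.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    md_path = out_dir / f"{stem}.md"
    pdf_path = out_dir / f"{stem}.pdf"
    md_path.write_text(markdown_text, encoding="utf-8")

    # Fail early with a readable message if pandoc is missing.
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc not found on PATH")

    # check=True raises CalledProcessError if pandoc exits with an error.
    subprocess.run(build_pandoc_command(md_path, pdf_path), check=True)
    return pdf_path


if __name__ == "__main__":
    sample = "# Example post\n\nSome *Markdown* text.\n"
    print(export_markdown_to_pdf(sample, Path("out")))
```

Keeping the command construction in its own function makes it easy to add pandoc options later (for example a different PDF engine) without touching the file handling.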
Yeah, I think OpenAI tried to do some empirical work, but approximately just produced capability progress, in my current model of the world (though I also think the incentive environment there was particularly bad). I feel confused about the "learning to summarize from human feedback" work, and currently think it was overall bad for the world, but am not super confident (in general I feel very confused about the sign of RLHF research).
I think Rohin Shah doesn't think of himself as having produced empirical work that helps with AI Alignment, but only to ha...
I'm pretty confused about how to think about the value of various ML alignment papers. But I think even if some piece of empirical ML work on alignment is really valuable for reducing x-risk, I wouldn't expect its value to take the form of providing insight to readers like you or me. So you as a reader not getting much out of it is compatible with the work being super valuable, and we probably need to assess it on different terms.
The main channel of value that I see for doing work like "learning to summarize" and the critiques project and various interpret...
Mod note: It felt fine to do this once or twice, but it's not an intended use-case of AI Alignment Forum membership to post to the AI Alignment Forum with content that you didn't write.
I would have likely accepted this submission to the AI Alignment Forum anyways, so it seems best to just go via the usual submission channels. I don't want to set a precedent of weirdly confusing co-authorship for submission purposes. You can also ping me on Intercom in advance if you want to get advance notice of whether the post fits on the AIAF, or want to make sure it goes live there immediately.