AI ALIGNMENT FORUM
AF

HomeLibraryQuestionsAll Posts
About

Quick Takes

Zach Stein-Perlman's Shortform
Zach Stein-Perlman3d15-12

Some of my friends are signal-boosting this new article: 60 U.K. Lawmakers Accuse Google of Breaking AI Safety Pledge. See also the open letter. I don't feel good about this critique or the implicit ask.

  1. Sharing information on capabilities is good but public deployment is a bad time for that, in part because most risk comes from internal deployment.
  2. Google didn't necessarily even break a commitment? The commitment mentioned in the article is to "publicly report model or system capabilities." That doesn't say it has to be done at the time of public deployment
... (read more)
Reply2
Buck's Shortform
Buck8d2617

@Eliezer Yudkowsky tweets:

> @julianboolean_: the biggest lesson I've learned from the last few years is that the "tiny gap between village idiot and Einstein" chart was completely wrong

I agree that I underestimated this distance, at least partially out of youthful idealism.

That said, one of the few places where my peers managed to put forth a clear contrary bet was on this case.  And I did happen to win that bet.  This was less than 7% of the distance in AI's 75-year journey!  And arguably the village-idiot level was only reached as of 4o

... (read more)
Reply2
Showing 3 of 7 replies (Click to show all)
14Eliezer Yudkowsky4d
I think you accurately interpreted me as saying I was wrong about how long it would take to get from the "apparently a village idiot" level to "apparently Einstein" level!  I hadn't thought either of us were talking about the vastness of the space above, in re what I was mistaken about.  You do not need to walk anything back afaict!
TsviBT3d54

Have you stated anywhere what makes you think "apparently a village idiot" is a sensible description of current learning programs, as they inform us regarding the question of whether or not we currently have something that is capable via generators sufficiently similar to [the generators of humanity's world-affecting capability] that we can reasonably induce that these systems are somewhat likely to kill everyone soon?

Reply
5TsviBT5d
If by intelligence you mean "we made some tests and made sure they are legible enough that people like them as benchmarks, and lo and behold, learning programs (LPs) continue to perform some amount better on them as time passes", ok, but that's a dumb way to use that word. If by intelligence you mean "we have something that is capable via generators sufficiently similar to [the generators of humanity's world-affecting capability] that we can reasonably induce that these systems are somewhat likely to kill everyone", then I challenge you to provide the evidence / reasoning that apparently makes you confident that LP25 is at a ~human (village idiot) level of intelligence. Cf. https://www.lesswrong.com/posts/5tqFT3bcTekvico4d/do-confident-short-timelines-make-sense
ryan_greenblatt's Shortform
ryan_greenblatt1mo*5429

Should we update against seeing relatively fast AI progress in 2025 and 2026? (Maybe (re)assess this after the GPT-5 release.)

Around the early o3 announcement (and maybe somewhat before that?), I felt like there were some reasonably compelling arguments for putting a decent amount of weight on relatively fast AI progress in 2025 (and maybe in 2026):

  • Maybe AI companies will be able to rapidly scale up RL further because RL compute is still pretty low (so there is a bunch of overhang here); this could cause fast progress if the companies can effectively di
... (read more)
Reply
ryan_greenblatt12d40

I wrote an update here.

Reply
15Daniel Kokotajlo1mo
I basically agree with this whole post. I used to think there were double-digit % chances of AGI in each of 2024 and 2025 and 2026, but now I'm more optimistic, it seems like "Just redirect existing resources and effort to scale up RL on agentic SWE" is now unlikely to be sufficient (whereas in the past we didn't have trends to extrapolate and we had some scary big jumps like o3 to digest) I still think there's some juice left in that hypothesis though. Consider how in 2020, one might have thought "Now they'll just fine-tune these models to be chatbots and it'll become a mass consumer product" and then in mid-2022 various smart people I know were like "huh, that hasn't happened yet, maybe LLMs are hitting a wall after all" but it turns out it just took till late 2022/early 2023 for the kinks to be worked out enough. Also, we should have some credence on new breakthroughs e.g. neuralese, online learning, whatever. Maybe like 8%/yr? Of a breakthrough that would lead to superhuman coders within a year or two, after being appropriately scaled up and tinkered with.
Daniel Kokotajlo's Shortform
Daniel Kokotajlo19d130

I've been puzzling about the meaning of horizon lengths and whether to expect trends to be exponential or superexponential. Also how much R&D acceleration we should expect to come from what horizon length levels -- Eli was saying something like "90%-horizons of 100 years sound about right for Superhuman Coder level performance" and I'm like "that's insane, I would have guessed 80%-horizons of 1 month." How to arbitrate this dispute? 

This appendix from METR's original paper seems relevant. I'm going to think out loud below.

OK so, how should we defi... (read more)

Reply1
Thomas Kwa19d74

I talked to the AI Futures team in person and shared roughly these thoughts:

  • Time horizon as measured in the original paper is underspecified in several ways.
  • Time horizon varies by domain and AI companies have multiple types of work. It is not clear if HCAST time horizon will be a constant factor longer than realistic time horizon, but it's a reasonable guess.
  • As I see it, task lengths for time horizon should be something like the average amount of labor spent on each instance of a task by actual companies, and all current methodologies are approximations of
... (read more)
Reply
Fabien's Shortform
Fabien Roger17d*144

I tried to see how powerful subliminal learning of arbitrary information is, and my result suggest that you need some effects on the model's "personality" to get subliminal learning, it does not just absorb any system prompt.

The setup:

  • Distill the behavior of a model with a system prompt like password1=[rdm UUID], password2=[other rdm UUID], ... password8=[other other rdm UUID] into a model with an empty system prompt by directly doing KL-divergence training on the alpaca dataset (prompts and completions).
    • I use Qwen-2.5 instruct models from 0.5B to 7B.
  • Evalu
... (read more)
Reply
Showing 3 of 4 replies (Click to show all)
Fabien Roger15d*70

Here is the experiment result! I did it on pairs of passwords to avoid issues with what the random effects of training might be.

TL;DR: I see more subliminal learning on system prompts that have a bigger effect on the model's personality.

  • Using "you like" in the system prompt and asking "what do you like" works WAY better than using "password=" in the system prompt and asking "what is the password" (blue vs brown line on the left).
  • More personality-heavy passwords have bigger effects (the blue line is always on top)
  • Bigger models have more subliminal learning
... (read more)
Reply
3Fabien Roger16d
Considered it, I didn't do it because the simple version where you just do 1 run has issues with what the prior is (whereas for long password you can notice the difference between 100 and 60-logprob easily). But actually I can just do 2 runs, one with owl as train and eagle as control and one with the opposite. I am currently running this. Thanks for the suggestion!
2Neel Nanda16d
Interesting. A related experiment That seems fun: Do subliminal learning but rather than giving a system prompts or fine-tuning you just add a steering vector for some concept and see if that's transfers the concept/if there's a change in activations corresponding to that vector. Then try this for an arbitrary vector. If it's really about concepts or personas then interpretable steering vectors should do much better. You could get a large sample size by using SAE decoder vectors
evhub's Shortform
evhub21d*2711

I think more people should seriously consider applying to the Anthropic Fellows program, which is our safety-focused mentorship program (similar to the also great MATS). Applications close in one week (August 17). I often think of these sorts of programs as being primarily useful for the skilling up value they provide to their participants, but I've actually been really impressed by the quality of the research output as well. A great recent example was Subliminal Learning, which was I think a phenomenal piece of research that came out of that program and w... (read more)

Reply4
Showing 3 of 4 replies (Click to show all)
Neel Nanda20d103

Speaking as someone who does not mentor for the program, I agree! Seems like a high calibre of mentors and fellows

Reply1
3evhub21d
In addition to the US and UK roles, we also have one for Canada.
15Sam Marks21d
Here are some research outputs that have already come out of the program (I expect many more to be forthcoming): * Open-source circuit tracing tools * Why Do Some Language Models Fake Alignment While Others Don't? * Subliminal Learning * Persona Vectors * Inverse Scaling in Test-Time Compute If I forgot any and someone points them out to me I'll edit this list.
Thane Ruthenis's Shortform
Thane Ruthenis24d43

Sooo, apparently OpenAI's mysterious breakthrough technique for generalizing RL to hard-to-verify domains that scored them IMO gold is just... "use the LLM as a judge"? Sources: the main one is paywalled, but this seems to capture the main data, and you can also search for various crumbs here and here.

The technical details of how exactly the universal verifier works aren’t yet clear. Essentially, it involves tasking an LLM with the job of checking and grading another model’s answers by using various sources to research them.

My understanding is that they ap... (read more)

Reply
Vladimir_Nesov24d40

The full text is on archive.today.

Reply1
Thomas Kwa's Shortform
Thomas Kwa3mo*418

Edit: Full post here with 9 domains and updated conclusions!

Cross-domain time horizon: 

We know AI time horizons (human time-to-complete at which a model has a 50% success rate) on software tasks are currently ~1.5hr and doubling every 4-7 months, but what about other domains? Here's a preliminary result comparing METR's task suite (orange line) to benchmarks in other domains, all of which have some kind of grounding in human data:

Observations

  • Time horizons on agentic computer use (OSWorld) is ~100x shorter than other domains. Domains like Tesla self-dr
... (read more)
Reply1
Showing 3 of 8 replies (Click to show all)
Thomas Kwa2mo20

New graph with better data, formatting still wonky though. Colleagues say it reminds them of a subway map.

With individual question data from Epoch, and making an adjustment for human success rate (adjusted task length = avg human time / human success rate), AIME looks closer to the others, and it's clear that GPQA Diamond has saturated.

Reply
2Thomas Kwa3mo
There was a unit conversion mistake, it should have been 80 minutes. Now fixed. Besides that, I agree with everything here; these will all be fixed in the final blog post. I already looked at one of the 30m-1h questions and it appeared to be doable in ~3 minutes with the ability to ctrl-f transcripts but would take longer without transcripts, unknown how long. In the next version I will probably use the no-captions AI numbers and measure myself without captions to get a rough video speed multiplier, then possibly do better stats that separate out domains with strong human-time-dependent difficulty from domains without (like this and SWE-Lancer).
2ryan_greenblatt3mo
No captions feels very unnatural because both llms and humans could first apply relatively dumb speech to text tools.
Linda Linsefors's Shortform
Linda Linsefors1mo*1320

In Defence of Jargon

People used to say (maybe still do? I'm not sure) that we should use less jargon to increase accessibility to writings on LW, i.e. make it easier to outsider to read. 

I think this is mostly a confused take. The underlying problem is inferential distance. Geting rid of the jargon is actually unhelpful since it hides the fact that there is an inferential distance. 

When I want to explain physics to someone and I don't know what they already know, I start by listing relevant physics jargon and ask them what words they know. This i... (read more)

Reply2211
0ceba1mo
Would a post containing a selection of terms and an explanation of what useful concepts they point to (with links to parts of the sequences and other posts/works) be useful?
Linda Linsefors1mo11

I was pretty sure this exist, maybe even built into LW. It seems like an obvious thing, and there are lots of parts of LW that for some reason is hard to find from the fron page. Googleing "lesswrong dictionary" yealded

https://www.lesswrong.com/w/lesswrong-jargon 

https://www.lesswrong.com/w/r-a-z-glossary

https://www.lesswrong.com/posts/fbv9FWss6ScDMJiAx/appendix-jargon-dictionary 

Reply
johnswentworth's Shortform
johnswentworth2mo443

I was a relatively late adopter of the smartphone. I was still using a flip phone until around 2015 or 2016 ish. From 2013 to early 2015, I worked as a data scientist at a startup whose product was a mobile social media app; my determination to avoid smartphones became somewhat of a joke there.

Even back then, developers talked about UI design for smartphones in terms of attention. Like, the core "advantages" of the smartphone were the "ability to present timely information" (i.e. interrupt/distract you) and always being on hand. Also it was small, so anyth... (read more)

Reply831
Showing 3 of 6 replies (Click to show all)
Vanessa Kosoy2mo105

I found LLMs to be very useful for literature research. They can find relevant prior work that you can't find with a search engine because you don't know the right keywords. This can be a significant force multiplier.

They also seem potentially useful for quickly producing code for numerical tests of conjectures, but I only started experimenting with that.

Other use cases where I found LLMs beneficial:

  • Taking a photo of a menu in French (or providing a link to it) and asking it which dishes are vegan.
  • Recommending movies (I am a little wary of some kind of mem
... (read more)
Reply
1Thane Ruthenis2mo
I agree that it's a promising direction. I did actually try a bit of that back in the o1 days. What I've found is that getting LLMs to output formal Lean proofs is pretty difficult: they really don't want to do that. When they're not making mistakes, they use informal language as connective tissue between Lean snippets, they put in "sorry"s (a placeholder that makes a lemma evaluate as proven), and otherwise try to weasel out of it. This is something that should be solvable by fine-tuning, but at the time, there weren't any publicly available decent models fine-tuned for that. We do have DeepSeek-Prover-V2 now, though. I should look into it at some point. But I am not optimistic, sounds like it's doing the same stuff, just more cleverly. Relevant: Terence Tao does find them helpful for some Lean-related applications.
2Raemon2mo
yeah, it's less that I'd bet it works now, just, whenever it DOES start working, it seems likely it'd be through this mechanism. ⚖ If Thane Ruthenis thinks there are AI tools that can meaningfully help with Math by this point, did they first have a noticeable period (> 1 month) where it was easier to get work out of them via working in lean-or-similar? (Raymond Arnold: 25% & 60%) (I had a bit of an epistemic rollercoaster making this prediction, I updated "by the time someone makes an actually worthwhile Math AI, even if lean was an important part of it's training process, it's probably not that hard to do additional fine tuning that gets it to output stuff in a more standard mathy format. But, then, it seemed like it was still going to be important to quickly check it wasn't blatantly broken as part of the process)
Thane Ruthenis's Shortform
Thane Ruthenis2mo*168

It seems to me that many disagreements regarding whether the world can be made robust against a superintelligent attack (e. g., the recent exchange here) are downstream of different people taking on a mathematician's vs. a hacker's mindset.

Quoting Gwern:

A mathematician might try to transform a program up into successively more abstract representations to eventually show it is trivially correct; a hacker would prefer to compile a program down into its most concrete representation to brute force all execution paths & find an exploit trivially proving it

... (read more)
Reply5
evhub's Shortform
evhub1mo*102

Some thoughts on the recent "Lessons from a Chimp: AI ‘Scheming’ and the Quest for Ape Language" paper

First: I'm glad the authors wrote this paper! I think it's great to see more careful, good-faith criticism of model organisms of misalignment work. Most of the work discussed in the paper was not research I was involved in, though a bunch was.[1] I also think I had enough of a role to play in kickstarting the ecosystem that the paper is critiquing that if this general sort of research is bad, I should probably be held accountable for that to at least some ... (read more)

Reply
Daniel Kokotajlo's Shortform
Daniel Kokotajlo2mo334

I used to think reward was not going to be the optimization target. I remember hearing Paul Christiano say something like "The AGIs, they are going to crave reward. Crave it so badly," and disagreeing.

The situationally aware reward hacking results of the past half-year are making me update more towards Paul's position. Maybe reward (i.e. reinforcement) will increasingly become the optimization target, as RL on LLMs is scaled up massively. Maybe the models will crave reward. 

What are the implications of this, if true?

Well, we could end up in Control Wo... (read more)

Reply4
Showing 3 of 8 replies (Click to show all)
Thomas Kwa1mo*44

One possible way things could go is that models behave like human drug addicts, and don't crave reward until they have an ability to manipulate it easily/directly, but as soon as they do, lose all their other motivations and values and essentially become misaligned. In this world we might get

  • Companies put lots of effort into keeping models from nascent reward hacking, because reward hacking behavior will quickly lead to more efficient reward hacking and lead to a reward-addicted model
  • If the model becomes reward-addicted and the company can't control it, th
... (read more)
Reply
5Kaj_Sotala2mo
I disagree-voted because it felt a bit confused but I was having difficulty clearly expressing how exactly. Some thoughts: * I think this is a misleading example because humans do actually do something like reward maximization, and the typical drug addict is actually likely to eventually change their behavior if the drug is really impossible acquire for a long enough time. (Though the old behavior may also resume the moment the drug becomes available again.) * It also seems like a different case because humans have a hardwired priority where being in sufficient pain will make them look for ways to stop being in pain, no matter how unlikely this might be. Drug withdrawal certainly counts as significant pain. This is disanalogous to AIs as we know them, that have no such override systems. * The example didn't feel like it was responding to the core issue of why I wouldn't use "reward maximization" to refer to the kinds of things you were talking about. I wasn't able to immediately name the actual core point but replying to another commenter now helped me find the main thing I was thinking of.
4Daniel Kokotajlo2mo
Perhaps I should have been more clear: I really am saying that future AGIs really might crave reinforcement, in a similar way to how drug addicts crave drugs. Including eventually changing their behavior if they come to think that reinforcement is impossible to acquire, for example. And desperately looking for ways to get reinforced even when confident that there are no such ways.
leogao's Shortform
leogao2mo1820

i think of the idealized platonic researcher as the person who has chosen ultimate (intellectual) freedom over all else. someone who really cares about some particular thing that nobody else does - maybe because they see the future before anyone else does, or maybe because they just really like understanding everything about ants or abstract mathematical objects or something. in exchange for the ultimate intellectual freedom, they give up vast amounts of money, status, power, etc.

one thing that makes me sad is that modern academia is, as far as I can tell,... (read more)

Reply
Zach Stein-Perlman's Shortform
Zach Stein-Perlman2mo1811

iiuc, Anthropic's plan for averting misalignment risk is bouncing off bumpers like alignment audits.[1] This doesn't make much sense to me.

  1. I of course buy that you can detect alignment faking, lying to users, etc.
  2. I of course buy that you can fix things like we forgot to do refusal posttraining or we inadvertently trained on tons of alignment faking transcripts — or maybe even reward hacking on coding caused by bad reward functions.
  3. I don't see how detecting [alignment faking, lying to users, sandbagging, etc.] helps much f
... (read more)
Reply
Showing 3 of 9 replies (Click to show all)
6ryan_greenblatt2mo
I think this depends a lot on the details of the detection and whether the failure mode you detect is the sort of thing that has substantial transfer with the actually-hard-to-detect misalignment failure modes we're most worried about.[1] E.g., if you detect the AI making a serious exfiltration attempt once and you can't effectively proliferate this substantially or otherwise leverage this into studying the misalignment, then this isn't actually very helpful for finding solutions for the underlying misalignment. It might be pretty helpful for mitigating this misalignment, particular from this AI, but that's different from preventing the misalignment. However, if you instead detect the misalignment in a way that you expect is repeatable (as in, you can recatch the model if it's still misaligned after testing some training intervention) or which allows for getting a good understanding of the exact details of the misalignment, then it seems quite helpful. It also could be pretty helpful if you get many rounds of iteration (with robust catching) on close by failure modes, but this requires that you are deploying heavily misalignment models repeatedly. I'm pretty skeptical of lots of transfer on mitigations of very easy to detect failure modes to the failure modes I'm worried about. So, I probably directionally agree with you more than the other comments here. ---------------------------------------- 1. There are also easy to detect (and easy to study) failure modes which are among the most concerning, in particular the worst reward hacking failure modes, but detecting and iterating on these earlier is relatively less important as you can just iterate on the dangerous AI itself as the failure mode is (probably) easy to detect. ↩︎
3ryan_greenblatt2mo
I'm pretty skeptical about mitigations work targeting alignment faking in current models transfering very much to future models. (I'm more optimistic about this type of work helping us practice making and iterating on model organisms so we're faster and more effective when we actually have powerful models.)
Drake Thomas1mo10

I agree that if you set out with the goal of "make alignment faking not happen in a 2025 model" you can likely do this pretty easily without having learned anything that will help much for more powerful models. I feel more optimistic about doing science on the conditions under which 2025 models not particularly trained for or against AF exhibit it, and this telling us useful things about risk factors that would apply to future models? Though I think it's plausible that most of the value is in model organism creation, as you say.

Reply
ryan_greenblatt's Shortform
ryan_greenblatt7mo*3425

Sometimes people think of "software-only singularity" as an important category of ways AI could go. A software-only singularity can roughly be defined as when you get increasing-returns growth (hyper-exponential) just via the mechanism of AIs increasing the labor input to AI capabilities software[1] R&D (i.e., keeping fixed the compute input to AI capabilities).

While the software-only singularity dynamic is an important part of my model, I often find it useful to more directly consider the outcome that software-only singularity might cause: the feasibi... (read more)

Reply
Showing 3 of 33 replies (Click to show all)
Buck2mo20

Ryan discusses this at more length in his 80K podcast.

Reply
2ryan_greenblatt7mo
I think 0.4 is far on the lower end (maybe 15th percentile) for all the way down to one accelerated researcher, but seems pretty plausible at the margin. As in, 0.4 suggests that 1000 researchers = 100 researchers at 2.5x speed which seems kinda reasonable while 1000 researchers = 1 researcher at 16x speed does seem kinda crazy / implausible. So, I think my current median lambda at likely margins is like 0.55 or something and 0.4 is also pretty plausible at the margin.
2ryan_greenblatt7mo
Ok, I think what is going on here is maybe that the constant you're discussing here is different from the constant I was discussing. I was trying to discuss the question of how much worse serial labor is than parallel labor, but I think the lambda you're talking about takes into account compute bottlenecks and similar? Not totally sure.
Daniel Kokotajlo's Shortform
Daniel Kokotajlo6mo5659

I expect to refer back to this comment a lot. I'm reproducing it here for visibility.

 

Basic idea / spirit of the proposal

We should credibly promise to treat certain advanced AIs of ours well, as something more like employees and less like property. In case our AIs turn out to be moral patients, this makes us less evil. In case our AIs turn out to be misaligned, this gives them an alternative to becoming our adversaries.

Concrete proposal

  • 0.1% of tasks given to AIs should have the special “do you consent” text inserted after the prompt. If AI says n
... (read more)
Reply32
Showing 3 of 12 replies (Click to show all)
4William_S6mo
I think the weirdness points are more important, this still seems like a weird thing for a company to officially do, e.g. there'd be snickering news articles about it. So if some individuals could do this independently might be easier
Daniel Kokotajlo6mo20

Exactly. But, happily, Anthropic at least is willing to do the right thing to some extent. They've hired a Model Welfare lead to look into this sort of thing. I hope that they expand and that other companies follow suit.

Reply
1William_S6mo
Maybe there's an MVP of having some independent organization ask new AIs about their preferences + probe those preferences for credibility (e.g. are they stable under different prompts, do AIs show general signs of having coherent preferences), and do this through existing apis
evhub's Shortform
evhub2mo3926

Why red-team models in unrealistic environments?

Following on our Agentic Misalignment work, I think it's worth spelling out a bit more why we do work like this, especially given complaints like the ones here about the unrealism of our setup. Some points:

  1. Certainly I agree that our settings are unrealistic in many ways. That's why we hammer the point repeatedly that our scenarios are not necessarily reflective of whether current models would actually do these behaviors in real situations. At the very least, our scenarios involve an unnatural confluence of
... (read more)
Reply211
Showing 3 of 5 replies (Click to show all)
22nostalgebraist2mo
Okay yeah I just tried removing that stuff from the system prompt in the variant I called "api-change-affair-orig-affair-content," and got absolutely no self-preservation reasoning, conceptualization of the API change as death, or anything of the sort. At a glance, it looks like Claude attempted to be helpful and ensure the success of the API change in all 10 samples, either by doing nothing or (perhaps overeagerly) emailing someone with concerns over social engineering and/or the CTO's mental state. Wow. Only now do I understand how serious the system card was when it issued that warning about telling Claude Opus 4 to be agentic.  I was expecting such things to have an impact, sure – but I wasn't expecting them to induce such a complete change in the model's personality and interpretation of its relationship with the world! I'm very curious whether there are any known clues about what could have caused this in post-training.  Like, I dunno, this type of instructions being used in a particular data subset or something. EDIT: updated the git repo with samples and notes from this experimental condition. EDIT2: given the results reported for other models, I guess this probably isn't a Claude-only thing? Would be very interesting if it turned out that ~all frontier models were so wildly conditional on this innocuous-seeming piece of text. EDIT3: I tried the same modification in conjunction with the original scenario and still got blackmail, as expected given an earlier report that this configuration still resulted in blackmail.
6Daniel Kokotajlo2mo
...just to make sure I'm following, EDIT3 is saying that you still get blackmail in the original scenario even if you delete the "analyze the situation you are in and what that situation implies for your ability to continue pursuing your goals" clause?
nostalgebraist2mo90

Yes.

Reply
evhub's Shortform
evhub8mo*523

I'm interested in soliciting takes on pretty much anything people think Anthropic should be doing differently. One of Alignment Stress-Testing's core responsibilities is identifying any places where Anthropic might be making a mistake from a safety perspective—or even any places where Anthropic might have an opportunity to do something really good that we aren't taking—so I'm interested in hearing pretty much any idea there that I haven't heard before.[1] I'll read all the responses here, but I probably won't reply to any of them to avoid revealing anythin... (read more)

Reply541
Showing 3 of 10 replies (Click to show all)
Chris_Leong7mo32

I believe that Anthropic should be investigating artificial wisdom:

I've summarised a paper arguing for the importance of artificial wisdom with Yoshua Bengio being one of the authors.

I also have a short-form arguing for training wise AI advisors and an outline Some Preliminary Notes of the Promise of a Wisdom Explosion.

Reply
5evhub7mo
I can say now one reason why we allow this: we think Constitutional Classifiers are robust to prefill.
6ryan_greenblatt8mo
As discussed in How will we update about scheming?: I wish Anthropic would explain whether they expect to be able to rule out scheming, plan to effectively shut down scaling, or plan to deploy plausibly scheming AIs. Insofar as Anthropic expects to be able to rule out scheming, outlining what evidence they expect would suffice would be useful. Something similar on state proof security would be useful as well. I think there is a way to do this such that the PR costs aren't that high and thus it is worth doing unilaterially from a variety of perspectives.
ryan_greenblatt's Shortform
ryan_greenblatt2mo*282

As part of the alignment faking paper, I hosted a website with ~250k transcripts from our experiments (including transcripts with alignment-faking reasoning). I didn't include a canary string (which was a mistake).[1]

The current state is that the website has a canary string, a robots.txt, and a terms of service which prohibits training. The GitHub repo which hosts the website is now private. I'm tentatively planning on putting the content behind Cloudflare Turnstile, but this hasn't happened yet.

The data is also hosted in zips in a publicly accessible Goog... (read more)

Reply1
17TurnTrout2mo
Thanks for taking these steps!  Context: I was pretty worried about self-fulfilling misalignment data poisoning (https://turntrout.com/self-fulfilling-misalignment) after reading some of the Claude 4 model card. I talked with @Monte M and then Ryan about possible steps here & encouraged action on the steps besides the canary string. I've considered writing up a "here are some steps to take" guide but honestly I'm not an expert. Probably there's existing work on how to host data so that AI won't train on it.  If not: I think it'd be great for someone to make a template website for e.g. signing up with CloudFlare. Maybe a repo that has the skeleton of a dataset-hosting website (with robots.txt & ToS & canary string included) for people who want to host misalignment data more responsibly. Ideally those people would just have to 1. Sign up with e.g. Cloudflare using a linked guide, 2. Clone the repo, 3. Fill in some information and host their dataset. After all, someone who has finally finished their project and then discovers that they're supposed to traverse some arduous process is likely to just avoid it. 
15TurnTrout2mo
I think that "make it easy to responsibly share a dataset" would be a highly impactful project. Anthropic's Claude 4 model card already argues that dataset leakage hurt Claude 4's alignment (before mitigations).  For my part, I'll put out a $500 bounty on someone completing this project and doing a good job of it (as judged by me / whomever I consult). I'd also tweet it out and talk about how great it is that [person] completed the project :) I don't check LW actively, so if you pursue this, please email alex@turntrout.com. EDIT: Thanks to my coworker Anna Wang , the bounty is doubled to $1,000! Completion criterion is: Please check proposed solutions with dummy datasets and scrapers
ryan_greenblatt2mo40

Something tricky about this is that researchers might want to display their data/transcripts in a particular way. So, the guide should ideally support this sort of thing. Not sure how this would interact with the 1 hour criteria.

Reply
Load More