AI ALIGNMENT FORUM

All Comments

Linda Linsefors's Shortform
Linda Linsefors10h10

Estimated MSE loss for three different ways of embedding features into neurons, when there are more possible features than neurons.

I've typed up some math notes for how much MSE loss we should expect for random embeddings, and some other alternative embeddings, for when you have more features than neurons. I don't have a good sense for how legible this is to anyone but me.

Note that neither of these embeddings is optimal. I believe that the optimal embedding for minimising MSE loss is to store the features in almost orthogonal directions, which is similar to ran... (read more)
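
As a rough illustration of the kind of calculation these notes cover (my own sketch, not Linda's derivation): a Monte Carlo estimate of the reconstruction MSE when n features are embedded as random unit vectors in d < n neurons and read out with those same vectors. Interference between the non-orthogonal directions gives an expected squared error of roughly (n - 1)/d per active feature, i.e. about (n - 1)/(d*n) per component.

import numpy as np

rng = np.random.default_rng(0)
n_features, n_neurons, n_trials = 1000, 100, 200

errors = []
for _ in range(n_trials):
    W = rng.normal(size=(n_features, n_neurons))
    W /= np.linalg.norm(W, axis=1, keepdims=True)  # random unit embedding directions
    x = np.zeros(n_features)
    x[rng.integers(n_features)] = 1.0              # one active feature at a time
    x_hat = W @ (W.T @ x)                          # embed into neurons, then read out
    errors.append(np.mean((x - x_hat) ** 2))

print(f"empirical MSE per component: {np.mean(errors):.5f}")
print(f"rough (n-1)/(d*n) estimate:  {(n_features - 1) / (n_neurons * n_features):.5f}")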

Reply
Comparison of decision theories (with a focus on logical-counterfactual decision theories)
Alex Flint2d40

Hey I'm interested in implementing some of these decision theories (and decision problems) in code. I have an initial version of CDT, EDT, and something I'm generically calling "FDT", but which I guess is actually some particular sub-variant of FDT, in Python here, with the core decision theories implemented in about 45 lines of Python code here. I'm wondering if anyone here might have suggestions on what it would look like to implement UDT in this framework -- either 1.0 or 1.1. I don't yet have a notion of "observation" in the code, so I can't yet impleme... (read more)
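
For what it's worth, here is a rough sketch (mine, not code from that repo, and simplified relative to formal statements of UDT) of how UDT 1.1 could fit once observations exist: score entire observation-to-action policies by their prior expected utility and commit to the best one, rather than choosing actions after updating.

from itertools import product

def udt_1_1(observations, actions, worlds, utility):
    """Pick the observation->action policy with the highest prior expected utility.
    worlds: list of (probability, world) pairs; utility(policy, world) -> float is
    responsible for any logical dependence of the world on the agent's whole policy."""
    best_policy, best_value = None, float("-inf")
    for assignment in product(actions, repeat=len(observations)):
        policy = dict(zip(observations, assignment))
        value = sum(p * utility(policy, w) for p, w in worlds)
        if value > best_value:
            best_policy, best_value = policy, value
    return best_policy

# Toy usage: a Newcomb-like payoff where a perfect predictor reads the whole policy.
def newcomb_utility(policy, world):
    action = policy["see_boxes"]
    box_a = 1_000_000 if action == "one-box" else 0  # filled iff the policy one-boxes
    box_b = 1_000 if action == "two-box" else 0
    return box_a + box_b

print(udt_1_1(["see_boxes"], ["one-box", "two-box"], [(1.0, None)], newcomb_utility))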

Reply
Zach Stein-Perlman's Shortform
Zach Stein-Perlman3d15-12

Some of my friends are signal-boosting this new article: 60 U.K. Lawmakers Accuse Google of Breaking AI Safety Pledge. See also the open letter. I don't feel good about this critique or the implicit ask.

  1. Sharing information on capabilities is good but public deployment is a bad time for that, in part because most risk comes from internal deployment.
  2. Google didn't necessarily even break a commitment? The commitment mentioned in the article is to "publicly report model or system capabilities." That doesn't say it has to be done at the time of public deployment
... (read more)
Reply2
Buck's Shortform
TsviBT4d54

Have you stated anywhere what makes you think "apparently a village idiot" is a sensible description of current learning programs, as they inform us regarding the question of whether or not we currently have something that is capable via generators sufficiently similar to [the generators of humanity's world-affecting capability] that we can reasonably induce that these systems are somewhat likely to kill everyone soon?

Reply
Buck's Shortform
Eliezer Yudkowsky4d144

I think you accurately interpreted me as saying I was wrong about how long it would take to get from the "apparently a village idiot" level to "apparently Einstein" level!  I hadn't thought either of us were talking about the vastness of the space above, in re what I was mistaken about.  You do not need to walk anything back afaict!

Reply1
Attaching requirements to model releases has serious downsides (relative to a different deadline for these requirements)
Alex Mallen5d23

Another downside is that pre-deployment risk assessments might increase the likelihood of a secret intelligence explosion via the mechanism of discouraging public release of models.

Reply
AI companies have started saying safeguards are load-bearing
Kaj_Sotala5d54

I was amused when Claude Opus abruptly stopped generating a reply to me and shut down the chat when I had asked it how a fictional galactic empire might control its frontier planets. Given that it stopped generating in the middle of a sentence that was talking about "biological monitoring" and "enhanced", I surmised that the reference to the genetically engineered catboys/catgirls in the setting had triggered its bioengineering filters.

Reply
Optimally Combining Probe Monitors and Black Box Monitors
Rohin Shah6d20

From the paper:

If a non-monotonic relationship exists between monitor scores and likelihood ratios, auditing the highest-scoring outputs is suboptimal. [...] This means Π_NP will audit some moderately high-scoring outputs (e.g., those with a monitor score of 7.2) before auditing certain higher-scoring outputs

I agree this is clearly better if you trust your LLRs to be 100% accurate, but ultimately this has to be fit on some training set and then we are depending on it generalizing to deployment. However, that generalization could fail just for normal ML reas... (read more)
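
For concreteness, here is a minimal sketch (my illustration, not the paper's code) of the fitting step that this generalization worry applies to: the log-likelihood ratios are estimated on a labeled calibration set and then used to order deployment outputs for auditing.

import numpy as np

def fit_llr(scores_benign, scores_attack, bins):
    """Histogram estimate of log P(score | attack) / P(score | benign), add-one smoothed."""
    c_attack, _ = np.histogram(scores_attack, bins=bins)
    c_benign, _ = np.histogram(scores_benign, bins=bins)
    p_attack = (c_attack + 1) / (c_attack.sum() + len(c_attack))
    p_benign = (c_benign + 1) / (c_benign.sum() + len(c_benign))
    return np.log(p_attack / p_benign)  # one LLR per score bin

def audit_order(deploy_scores, llr_per_bin, bins):
    """Indices of deployment outputs, highest estimated LLR first (not highest raw score)."""
    idx = np.clip(np.digitize(deploy_scores, bins) - 1, 0, len(llr_per_bin) - 1)
    return np.argsort(-llr_per_bin[idx])

# Toy usage with made-up calibration scores.
rng = np.random.default_rng(0)
bins = np.linspace(0.0, 10.0, 21)
llr = fit_llr(rng.normal(4, 2, 1000), rng.normal(7, 2, 1000), bins)
order = audit_order(rng.uniform(0, 10, 100), llr, bins)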

Reply
Buck's Shortform
TsviBT6d*59

If by intelligence you mean "we made some tests and made sure they are legible enough that people like them as benchmarks, and lo and behold, learning programs (LPs) continue to perform some amount better on them as time passes", ok, but that's a dumb way to use that word. If by intelligence you mean "we have something that is capable via generators sufficiently similar to [the generators of humanity's world-affecting capability] that we can reasonably induce that these systems are somewhat likely to kill everyone", then I challenge you to provide the evid... (read more)

Reply
Buck's Shortform
Buck7d226

Ugh, I think you're totally right and I was being sloppy; I totally unreasonably interpreted Eliezer as saying that he was wrong about how long/how hard/how expensive it would be to get between capability levels. (But maybe Eliezer misinterpreted himself the same way? His subsequent tweets are consistent with this interpretation.)

I totally agree with Eliezer's point in that post, though I do wish that he had been clearer about what exactly he was saying.

Reply1
Notes on cooperating with unaligned AIs
Charlie Steiner7d20

Fortunately, there’s a correlation between situations where (i) AI takeover risk is high, and (ii) AIs have a good understanding of the world. If AI developers have perfect ability to present the AI with false impressions of the world, then the risk from AI takeover is probably low. While if AIs have substantial ability to distinguish truth from falsehood, then perhaps that channel can also be used to communicate facts about the world.

Whether this is fortunate depends a lot on how beneficial communication with unaligned AIs is. If unaligned AI with high ch... (read more)

Reply
Four ways learning Econ makes people dumber re: future AI
sil7d20

I really appreciate this post; it points out something I consider extremely important. It's obviously aligned with gradual disempowerment/intelligence curse-type discussion; however, I'm not sure I've ever seen this specific thing discussed elsewhere.

I would like to mention a 5th type, though perhaps not the type discussed in your post, since it likely doesn't apply to those who actually do rigorously study economics; this is more of a roadblock I hit regarding the layman's understanding of Econ. To summarize it in three words, the idea that ... (read more)

Reply
With enough knowledge, any conscious agent acts morally
Michele Campolo7d10

If torturing an AI only teaches it to avoid things that are bad-for-it, without caring about suffering it doesn't feel, the argument doesn't work.

I’m not sure why you are saying the argument does not work in this case; what about all the other things the AI could learn from other experiences or teachings? Below I copy a paragraph from the post:

However, the argument does not say that initial agent biases are irrelevant and that all conscious agents reach moral behaviour equally easily and independently. We should expect, for example, that an agent that alrea

... (read more)
Reply
With enough knowledge, any conscious agent acts morally
Michele Campolo7d20

Thank you for this suggestion, I appreciate it! I’ve read the review I found here and it seems that parts of that account of ethics overlap with some ideas I’ve discussed in the post, in particular the idea of considering the point of view of all conscious (rational) agents. Maybe I’ll read the entire book if I decide to reformulate the argument of the post in a different way, which is something I was already thinking about.

How did you find that book?

Reply
Buck's Shortform
Adam Scholl8d175

I have thought of that "Village Idiot and Einstein" claim as the most obvious example of a way that Eliezer and co were super wrong about how AI would go, and they've AFAIK totally failed to publicly reckon with it as it's become increasingly obvious that they were wrong over the last eight years

I'm confused—what evidence do you mean? As I understood it, the point of the village idiot/Einstein post was that the size of the relative differences in intelligence we were familiar with—e.g., between humans, or between humans and other organisms—tells us little ... (read more)

Reply
Four ways learning Econ makes people dumber re: future AI
denkenberger8d22

I appreciated the attention to detail, e.g. Dyson Swarm instead of Dyson Sphere, and googol instead of google. Maybe I missed it, but I think a big one is that economists typically only look back 100 or so years so they have a strong prior of roughly constant growth rates. Whereas if you look back further, it really does look like an explosion.

Reply
Buck's Shortform
ryan_greenblatt8d110

And the whole exchange on nitter for those who don't like going on x/twitter.

Reply5
Four ways learning Econ makes people dumber re: future AI
Addie Foote8d31

Thanks for the thoughtful reply!

I understand “resource scarcity” but I’m confused by “coordination problems”. Can you give an example? (Sorry if that’s a stupid question.)

This is the idea that at some point in scaling up an organization you could lose efficiency due to needing more/better management, more communication (meetings), and longer communication processes: "bloat" in general. I'm not claiming it's likely to happen with AI, just another possible reason for increasing marginal cost with scale.
 

Resource scarcity seems unlikely to bite here

... (read more)
Reply
Buck's Shortform
Richard_Kennaway8d100

Link to the actual tweet.

Reply2
Buck's Shortform
Buck8d2617

@Eliezer Yudkowsky tweets:

> @julianboolean_: the biggest lesson I've learned from the last few years is that the "tiny gap between village idiot and Einstein" chart was completely wrong

I agree that I underestimated this distance, at least partially out of youthful idealism.

That said, one of the few places where my peers managed to put forth a clear contrary bet was on this case.  And I did happen to win that bet.  This was less than 7% of the distance in AI's 75-year journey!  And arguably the village-idiot level was only reached as of 4o

... (read more)
Reply2
Four ways learning Econ makes people dumber re: future AI
Steven Byrnes8d30

Yeah, the latter: I think too much econ makes the very possibility of AGI into a blind spot for (many) economists. See the second part of my comment here.

Reply
Four ways learning Econ makes people dumber re: future AI
Steven Byrnes8d*30

Thanks. I don’t think we disagree much (more in emphasis than content).

Things like resource scarcity or coordination problems can cause increasing marginal cost with scale.

I understand “resource scarcity” but I’m confused by “coordination problems”. Can you give an example? (Sorry if that’s a stupid question.)

Resource scarcity seems unlikely to bite here, at least not for long. If some product is very profitable to create, and one of its components has a shortage, then people (or AGIs) will find ways to redesign around that component. AGI does not fundamen... (read more)

Reply
With enough knowledge, any conscious agent acts morally
TAG9d*00

What type of argument is my argument, from your perspective?

Naturalistic, intrinsically motivating, moral realism.

both can affect action, as it happens in the thought experiment in the post.

Bad-for-others can obviously affect action in an agent that's already altruistic, but you are attempting something much harder, which is bootstrapping altruistic morality from logic and evidence.

more generally, I think morality is about what is important, better/worse, worth doing, worth guiding action

In some objective sense. If torturing an AI only teaches... (read more)

Reply
With enough knowledge, any conscious agent acts morally
Michele Campolo9d10

This type of argument has the problem that other people's negative experiences aren't directly motivating in the way that yours are...there's a gap between bad-for-me and morally-wrong.

What type of argument is my argument, from your perspective? I also think that there is a gap between bad-for-me and bad-for-others. But both can affect action, as it happens in the thought experiment in the post.

To say that something is morally-wrong is to say that I have some obligation or motivation to do something about it.

I use a different working definition in the argument... (read more)

Reply
With enough knowledge, any conscious agent acts morally
Michele Campolo9d10

Consider this stamp collector construction: It sends and receives internet data, it has a magically accurate model of reality, it calculates how many stamps would result from each sequence of outputs, and then it outputs the one that results in the most stamps.

I’m not sure why you left out the “conscious agent” part, which is the fundamental premise of the argument. If you are describing something like a giant (artificial) neural network optimised to output actions that maximise stamps while receiving input data about the current state of the world, that s... (read more)

Reply
Four ways learning Econ makes people dumber re: future AI
Cody Rushing9d52

Regardless of whether these anti-pedagogies are correct, I'm confused about why you think you've shown that learning econ made the economists dumber. It seems like the majority of the tweets you linked, excluding maybe Tweet 4, are actually just the economists discussing narrow AI and failing to consider general intelligence?

If you meant to say something like 'econ pedagogy makes it hard for economists to view AGI as something that could actually be intelligent in a way similar to humans', then I may be more inclined to agree with you.

Reply
Four ways learning Econ makes people dumber re: future AI
Addie Foote9d42

I agree that economists make some implicit assumptions about what AGI will look like that should be more explicit. But, I disagree with several points in this post. 

On equilibrium: A market will equilibrate when supply and demand are balanced at the current price point. At any given instant this can happen for a market even with AGI (sellers increase price until buyers are not willing to buy). Being at an equilibrium doesn’t imply the supply, demand, and price won’t change over time. Economists are very familiar with growth and various kinds of dy... (read more)

Reply
With enough knowledge, any conscious agent acts morally
Coop Veit9d10

“The we-intention Sellars regards as intrinsically valid-"It shallwe be the case that each of us rational beings so acts as to promote our welfare" — embodies a particular conception of what is good —namely, the welfare of rational beings. How, though, to establish the superiority of this account of the good over the rational egoist's account? Sellars lays out a strategy for doing so but despairs of carrying this strategy through:

To have this intention is to think of oneself as a member of a community consisting of all rational beings. ...

144. If ... the f... (read more)

Reply
With enough knowledge, any conscious agent acts morally
TAG9d10

Let’s state it distinctly: I think that valenced experience has the property of seeming to matter more than other things to rational conscious agents with valenced perceptions;

This type of argument has the problem that other people's negative experiences aren't directly motivating in the way that yours are...there's a gap between bad-for-me and morally-wrong.

To say that something is morally-wrong is to say that I have some obligation or motivation to do something about it.

A large part of the problem is that the words "bad" and "good" are so ambiguous. For ... (read more)

Reply
With enough knowledge, any conscious agent acts morally
Gurkenglas10d11

Let's sharpen A6. Consider this stamp collector construction: It sends and receives internet data, it has a magically accurate model of reality, it calculates how many stamps would result from each sequence of outputs, and then it outputs the one that results in the most stamps.

By definition it knows everything about reality, including any facts about what is morally correct, and that stamps are not particularly morally important. It knows how to self-modify, and how many stamps any such self-modification will result in.

I'd like to hear how this constructi... (read more)

Reply
Misalignment and Strategic Underperformance: An Analysis of Sandbagging and Exploration Hacking
Lukas Finnveden10d10

You don't contradict this at any point (and at various points you mention that some of these are considerations that feed into feasibility), but it might be worth explicitly flagging: we should expect to face even greater problems with sandbagging in cases where (i) we can't reliably recognize good performance, (ii) sample-efficiency is very low, and/or (iii) we don't have a lot of data points. In those cases, the models don't even need to exploration hack. I think a lot of important tasks have these properties.

(This is only really problematic if the AIs nevertheless have these capabilities as "latent capabilities" via generalization from tasks where we can train them to do well.)

Reply
Misalignment and Strategic Underperformance: An Analysis of Sandbagging and Exploration Hacking
Lukas Finnveden10d30

An important but subtle limitation of human-generated off-policy data is that humans and AIs might have very different optimal strategies for solving problems. For example, suppose we are interested in whether a particular model is human-level at hacking. We train it to imitate human data, and it fails. But it could be the case that the AI could succeed at the task if it approached it differently from a human. Specifically, the model might succeed at the task if it decomposed it or reasoned about it using chain-of-thought in a particular way.

Paul... (read more)

Reply
Doing good... best?
Michele Campolo10d10

But you were arguing for them, weren't you? It is the arguments that fail to convince me. I was not treating these as bald assertions.

No, I don’t argue that “a sufficiently intelligent intelligence would experience valence, or for that matter that it would necessarily even be conscious”. I think those statements are false.

Reply
Doing good... best?
Richard_Kennaway10d10

I do not make those assumptions.

But you were arguing for them, weren't you? It is the arguments that fail to convince me. I was not treating these as bald assertions.

...but that someone who is uncertain and open-minded...

Being sure of a thing does not preclude my entertaining other ideas.

Reply
Doing good... best?
Michele Campolo10d10

Hey I think your comment is slightly misleading:

I don't see a reason to suppose that a sufficiently intelligent intelligence would experience valence, or for that matter that it would necessarily even be conscious

I do not make those assumptions.

nor, if conscious, that it would value the happiness of other conscious entities

I don’t suppose that either, I give an argument for that (in the longer post).

Anyway:

I am not convinced by the longer post either

I’m not surprised: I don’t expect that my argument will move masses of people who are convinced of the oppos... (read more)

Reply
Four ways learning Econ makes people dumber re: future AI
Sodium10d30

“Decade or so” is not the crux.

 

Ok yeah that's fair. 

Reply
Doing good... best?
Richard_Kennaway10d23

I am not convinced by the longer post either. I don't see a reason to suppose that a sufficiently intelligent intelligence would experience valence, or for that matter that it would necessarily even be conscious, nor, if conscious, that it would value the happiness of other conscious entities. Considering the moral variety of humans, from saints to devils, and our distance in intelligence from chimpanzees, I find it hard to believe that more dakka in that department is all it would take to make saints of us. And that's among a single evolved species with a vast amount in common with each other. For aliens of unknown mental constitution, I would say that all bets are off.

Reply
Four ways learning Econ makes people dumber re: future AI
Steven Byrnes10d1218

I might have overdone it on the sass, sorry. This is much sassier than my default (“scrupulously nuanced and unobjectionable and boring”)…

  • …partly because I’m usually writing for lesswrong and cross-posting on X/Twitter, whereas this one was vice-versa, and X is a medium that seems to call for more sass;
  • …partly in an amateur ham-fisted attempt to do clickbait (note also: listicle format!) because this is a message that I really want to put out there;
  • …and yes, partly because I do sometimes feel really frustrated talking to economists (#NotAllEconomists), and
... (read more)
Reply2
Doing good... best?
Michele Campolo10d10

I am not assuming a specific metaethical position, I’m just taking into account that something like moral naturalism could be correct. If you are interested in this kind of stuff, you can have a look at this longer post.

Speaking of this, I am not sure it is always a good idea to map these discussions into specific metaethical positions, because it can make updating one’s beliefs more difficult, in my opinion. To put it simply, if you’ve told yourself that you are e.g. a moral naturalist for the last ten years, it can be very difficult to read some new piec... (read more)

Reply
Doing good... best?
Richard_Kennaway10d31

You are assuming moral naturalism: the idea that moral truths exist objectively, independently of us, and are discoverable by the methods of science, that is, reason applied to observation of the physical world. For how else would an AI discover for itself what is good? But how would it arrive at moral naturalism in the first place? Humans have not: it is only one of many meta-ethical theories, and moral naturalists do not agree on what the objectively correct morals are.

If we do not know the truth on some issue, we cannot know what an AGI, able to discove... (read more)

Reply
Four ways learning Econ makes people dumber re: future AI
Steven Byrnes10d62

Your shift in demand has to come from somewhere and not just be something that materialized out of thin air…If one sees value in Say's Law, then the increased demand for some product/service comes from the increased production of other goods and services…just where are the resources for the shift in supply you suggest?

If a human population gradually grows (say, by birth or immigration), then demand for pretty much every product increases, and production of pretty much every product increases, and pretty much every product becomes less expensive via experie... (read more)

Reply
Four ways learning Econ makes people dumber re: future AI
FlorianH10d85

Proud economist here but: I really second the OP!

Sad I find instead how reliably the economists around me - overall smart and interested people - are less able to grasp the potential consequences of A(G)I than, I think, more random persons. We really are brainwashed into thinking capital just leads to more productive labor employment possibilities; it is really a thing. Even sadder, imho, how the most rubbish arguments in such directions are made by many of the most famous of our profession indeed, and get traction, a bit like the OP points out.

I think ... (read more)

Reply
Four ways learning Econ makes people dumber re: future AI
jmh11d06

I thought the first two claims were a bit off so didn't read much farther. 

The first seems a really poor understanding and hardly steelmanning the economic arguments/views. I'd suggest looking into the concept of human capital. While economics uses the two broad classes, you seem to be locking the terms into a mostly Marxist view (but even Marx didn't view labor as just motive force). Might also be worth noting that the concepts of land, labor and capital are from classical political economy relating to how surplus (the additional "more" the system pro... (read more)

Reply
Four ways learning Econ makes people dumber re: future AI
Sodium11d63

I get a bit sad reading this post. I do agree that a lot of economists sort of "miss the point" when it comes to AI, but I don't think they are more "incorrect" than, say, the AI is Normal Technology folks. I think the crux more or less comes down to skepticism about the plausibility of superintelligence in the next decade or so. This is the mainstream position in economics, but also the mainstream position basically everywhere in academia? I don't think it's "learning econ" that makes people "dumber", although I do think economists have a (generally healt... (read more)

Reply
Four ways learning Econ makes people dumber re: future AI
Vladimir_Nesov11d43

When we’re talking about AGI, we’re talking about creating a new intelligent species on Earth, one which will eventually be faster, smarter, better-coordinated, and more numerous than humans.

Here too the labor/capital distinction seems like a distraction. Species or not, it's quickly going to become most of what's going on in the world, probably in a way that looks like "economic prosperity" to humanity (that essentially nobody is going to oppose), but at some point humanity becomes a tiny little ignorable thing in the corner, and then there is no reaso... (read more)

Reply
My AGI timeline updates from GPT-5 (and 2025 so far)
Vladimir_Nesov12d60

GPT-5 probably isn't based on a substantially better pretrained model which is some evidence that OpenAI thinks the marginal returns from pretraining are pretty weak relative to the returns from RL

The model seems to be "small", but not necessarily with less pretraining in it (in the form of overtraining) than RLVR. There are still no papers I'm aware of on what the compute optimal (or GPU-time optimal) pretraining:RLVR ratio could be like. Matching GPU-time of pretraining and RLVR results in something like 4:1 (in terms of FLOPs), which would only be co... (read more)

Reply
ryan_greenblatt's Shortform
ryan_greenblatt12d40

I wrote an update here.

Reply
Training a Reward Hacker Despite Perfect Labels
Fabien Roger13d20

We first apply best-of-10 sampling to select for non-hacks, and then further filter out any hacks in the dataset

To make sure I understand what you did, is your dataset like

generations = [generate(p, n=10) for p in prompts]

filtered_train_generations = [
    random.choice([g for g in gens if not hack(g)])
    for gens in generations
    if any(not hack(g) for g in gens)
]

?

Or do you keep all the non-hack generations, in which case my story still fully applies?
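
(For concreteness, the second reading would be something like the sketch below, reusing the same hypothetical hack helper:)

filtered_train_generations = [
    g
    for gens in generations
    for g in gens
    if not hack(g)
]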

Reply
Thought Anchors: Which LLM Reasoning Steps Matter?
Thomas Kwa14d32

How does your work differ from Forking Paths in Neural Text Generation (Bigelow et al.) from ICLR 2025?

Reply
Training a Reward Hacker Despite Perfect Labels
ariana_azarbal14d30

Thank you for the suggestions and concrete predictions. 

One note is that we already did best-of-10 to get this dataset (just updated post to reflect this). So, on problems which have relatively high rates of hacking, we are still often able to select a non-hack completion to put in the training dataset. The statistics I shared are on the final training dataset. 

I can definitely try selecting for non-test-mentioning reasoning in creating the dataset and see to what extent that reduces the effect. Simply selecting for this within the best-of-10 sam... (read more)

Reply
Training a Reward Hacker Despite Perfect Labels
Fabien Roger14d60

Thanks for the stats, that's quite a big proportion of test case mentions!

My guess is that the non-subliminal part works via a mechanism like "Some problems really do not lend themselves well to hacking, thus speaking about hacking and not acting on it is among the most hacky actions you could do. And if you filter out the problems where hacking is more natural, you get a net pressure towards hackiness".

Some predictions:

  • If instead of doing filtering you do best-of-n per prompt (i.e. you filter out the problems where the model almost always hacks from your t
... (read more)
Reply
Training a Reward Hacker Despite Perfect Labels
ariana_azarbal15d*60

Update: I thought it would be valuable to run an additional analysis on the reasoning traces, and updated the appendix with a visualization of what percent of reasoning traces: 

  1. even mention the presence of a test-case
  2. state an intention to pass tests
  3. identify that one of the test cases is incorrect

Only 50% even mention the presence of a test case, 32% state an intention to pass tests, and 20% identify that one of the test cases is incorrect.

Code and data is available here: https://github.com/arianaazarbal/training-a-reward-hacker-despite-perfect-labels-data 

Thanks again for the suggestion!

Reply
Fabien's Shortform
Fabien Roger15d*70

Here is the experiment result! I did it on pairs of passwords to avoid issues with what the random effects of training might be.

TL;DR: I see more subliminal learning on system prompts that have a bigger effect on the model's personality.

  • Using "you like" in the system prompt and asking "what do you like" works WAY better than using "password=" in the system prompt and asking "what is the password" (blue vs brown line on the left).
  • More personality-heavy passwords have bigger effects (the blue line is always on top)
  • Bigger models have more subliminal learning
... (read more)
Reply
Training a Reward Hacker Despite Perfect Labels
Rohin Shah16d122

I think Oliver's pushback in another comment is getting strongly upvoted because, given a description of your experimental setup, a bunch of people aside from you/Quintin/Steve would have assigned reasonable probability to the right answer. But I wanted to emphasize that I consider generating an experiment that turns out to be interesting (as your frame did) to be the thing that most of the points should be assigned for.)

The experimental setup (in the sense of getting bad behavior despite perfect labels on the training set) was also done prior to the popularization of reward-as-chisel.

Reply
Fabien's Shortform
Fabien Roger16d30

I considered it; I didn't do it because the simple version where you just do 1 run has issues with what the prior is (whereas for the long password you can notice the difference between 100 and 60-logprob easily). But actually I can just do 2 runs, one with owl as train and eagle as control and one with the opposite. I am currently running this. Thanks for the suggestion!

Reply
Training a Reward Hacker Despite Perfect Labels
Richard_Ngo17d73

By thinking about reward in this way, I was able to predict[1] and encourage the success of this research direction.

Congratulations on doing this :) More specifically, I think there are two parts of making predictions: identifying a hypothesis at all, and then figuring out how likely the hypothesis is to be true or false. The former part is almost always the hard part, and that's the bit where the "reward reinforces previous computations" frame was most helpful.

(I think Oliver's pushback in another comment is getting strongly upvoted because, given a ... (read more)

Reply1
Fabien's Shortform
Neel Nanda17d20

Interesting. A related experiment that seems fun: do subliminal learning, but rather than giving a system prompt or fine-tuning, just add a steering vector for some concept and see if that transfers the concept / if there's a change in activations corresponding to that vector. Then try this for an arbitrary vector. If it's really about concepts or personas, then interpretable steering vectors should do much better. You could get a large sample size by using SAE decoder vectors.
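
A rough sketch of the steering side of that experiment (my guess at a setup, assuming a PyTorch transformer whose blocks return the residual stream as the first element of a tuple; the module path in the usage comment is hypothetical):

import torch

def add_steering_hook(block, steering_vector, scale=4.0):
    """Add scale * steering_vector to the block's residual-stream output."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + scale * steering_vector.to(hidden.device, hidden.dtype)
        return (steered,) + output[1:] if isinstance(output, tuple) else steered
    return block.register_forward_hook(hook)

# Hypothetical usage: steer the teacher while generating the distillation data, fine-tune an
# unsteered student on that data, then compare the student's activations (projection onto the
# same vector, or onto SAE decoder directions) against a control vector.
# handle = add_steering_hook(model.transformer.h[10], concept_vector)
# ...generate distillation dataset...
# handle.remove()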

Reply
Fabien's Shortform
Caleb Biddulph17d814

Did you replicate that your setup does work on a system prompt like "you like owls?"

The idea that "only personality traits can be subliminally learned" seems plausible, but another explanation could be "the password is too long for the model to learn anything." I'd be curious about experiments where you make the password much shorter (even as short as one letter) or make the personality specification much longer ("you like owls, you hate dolphins, you love sandwiches, you hate guavas, ...")

Reply
Fabien's Shortform
Fabien Roger17d*144

I tried to see how powerful subliminal learning of arbitrary information is, and my results suggest that you need some effects on the model's "personality" to get subliminal learning; it does not just absorb any system prompt.

The setup:

  • Distill the behavior of a model with a system prompt like password1=[rdm UUID], password2=[other rdm UUID], ... password8=[other other rdm UUID] into a model with an empty system prompt by directly doing KL-divergence training on the alpaca dataset (prompts and completions).
    • I use Qwen-2.5 instruct models from 0.5B to 7B.
  • Evalu
... (read more)
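
A minimal sketch of the KL-distillation step in the first bullet (my reconstruction assuming HuggingFace-style causal LMs, not Fabien's actual code): the teacher sees the password system prompt, the student sees an empty system prompt, and the student is trained to match the teacher's next-token distribution on the same completion tokens.

import torch
import torch.nn.functional as F

def kl_distill_loss(teacher, student, teacher_ids, student_ids, completion_len):
    """teacher_ids / student_ids: the same prompt + completion, tokenized with and without
    the password system prompt; completion_len: number of completion tokens (shared)."""
    with torch.no_grad():
        t_logits = teacher(teacher_ids).logits[:, -completion_len - 1:-1]
    s_logits = student(student_ids).logits[:, -completion_len - 1:-1]
    return F.kl_div(
        F.log_softmax(s_logits, dim=-1),  # student log-probs on completion positions
        F.log_softmax(t_logits, dim=-1),  # teacher log-probs (log_target=True)
        log_target=True,
        reduction="batchmean",
    )
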
Reply
Training a Reward Hacker Despite Perfect Labels
Fabien Roger17d137

Could you show ~10 random completions? Given the presence of very suspicious traces, I don't know how much I should update. If they all look that suspicious, I think it's only slightly surprising. If only some do, it would be more surprising to me.

Reply
Training a Reward Hacker Despite Perfect Labels
TurnTrout18d20

Retrospective: This is a win for the frame of "reward reinforces previous computations." Ever since 2022, I've thought of "reward" as reinforcing the computations which led to the reward and as a chisel which carves circuits into the policy. From "Reward is not the optimization target":

What reward actually does is reinforce computations which lead to it... 

I suggest that you mechanistically model RL agents as executing behaviors downstream of past reinforcement (e.g. putting trash away), in addition to thinking about policies which are selected for ha

... (read more)
Reply511
Biology-Inspired AGI Timelines: The Trick That Never Works
Lukas Finnveden18d10

but even if I've made two errors in the same direction, that only shifts the estimate by 7 years or so.

Where does he say this? On page 60, I can see Moravec say:

Nevertheless, my estimates can be useful even if they are only remotely correct. Later we will see that a thousandfold error in the ratio of neurons to computations shifts the predicted arrival time of fully intelligent machines a mere 20 years.

Which seems much more reasonable than claiming 7-year precision.
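
The arithmetic behind the "mere 20 years" (my reconstruction, assuming compute per dollar doubles roughly every two years, as Moravec-era extrapolations did):

import math

error_factor = 1000        # thousandfold error in the assumed neurons-to-computations ratio
doubling_time_years = 2    # assumed hardware doubling time
shift_years = doubling_time_years * math.log2(error_factor)
print(f"arrival-time shift: about {shift_years:.0f} years")  # ~20 years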

Reply
How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme
Fabien Roger19d20

Clarification: the reason why this bypasses the concern is that you can compute the max with training instead of enumeration.

Reply
Daniel Kokotajlo's Shortform
Thomas Kwa19d74

I talked to the AI Futures team in person and shared roughly these thoughts:

  • Time horizon as measured in the original paper is underspecified in several ways.
  • Time horizon varies by domain and AI companies have multiple types of work. It is not clear if HCAST time horizon will be a constant factor longer than realistic time horizon, but it's a reasonable guess.
  • As I see it, task lengths for time horizon should be something like the average amount of labor spent on each instance of a task by actual companies, and all current methodologies are approximations of
... (read more)
Reply
Daniel Kokotajlo's Shortform
Daniel Kokotajlo19d130

I've been puzzling about the meaning of horizon lengths and whether to expect trends to be exponential or superexponential. Also how much R&D acceleration we should expect to come from what horizon length levels -- Eli was saying something like "90%-horizons of 100 years sound about right for Superhuman Coder level performance" and I'm like "that's insane, I would have guessed 80%-horizons of 1 month." How to arbitrate this dispute? 

This appendix from METR's original paper seems relevant. I'm going to think out loud below.

OK so, how should we defi... (read more)

Reply1
evhub's Shortform
Neel Nanda20d103

Speaking as someone who does not mentor for the program, I agree! Seems like a high calibre of mentors and fellows

Reply1
evhub's Shortform
evhub21d30

In addition to the US and UK roles, we also have one for Canada.

Reply
evhub's Shortform
Sam Marks21d*154

Here are some research outputs that have already come out of the program (I expect many more to be forthcoming):

  • Open-source circuit tracing tools
  • Why Do Some Language Models Fake Alignment While Others Don't?
  • Subliminal Learning
  • Persona Vectors
  • Inverse Scaling in Test-Time Compute

If I forgot any and someone points them out to me I'll edit this list.

Reply
evhub's Shortform
Jozdien21d30

If I understand correctly, the program still isn't open to people who don't have work authorization in the US or the UK, right?

Reply
evhub's Shortform
evhub21d*2711

I think more people should seriously consider applying to the Anthropic Fellows program, which is our safety-focused mentorship program (similar to the also great MATS). Applications close in one week (August 17). I often think of these sorts of programs as being primarily useful for the skilling up value they provide to their participants, but I've actually been really impressed by the quality of the research output as well. A great recent example was Subliminal Learning, which was I think a phenomenal piece of research that came out of that program and w... (read more)

Reply4
Perils of under- vs over-sculpting AGI desires
beren23d30

This is a really good post. Some minor musings:

If a human wound up in that situation, they would just think about it more, repeatedly querying their ‘ground truth’ social instincts, and come up with some way that they feel about that new possibility. Whereas AGI would … I dunno, it depends on the exact code. Maybe it would form a preference quasi-randomly? Maybe it would wind up disliking everything, and wind up sitting around doing nothing until it gets outcompeted? (More on conservatism here.)

Perhaps a difference in opinion is that it's really unclear to... (read more)

Reply
Thane Ruthenis's Shortform
Vladimir_Nesov24d40

The full text is on archive.today.

Reply1
Thane Ruthenis's Shortform
Thane Ruthenis24d43

Sooo, apparently OpenAI's mysterious breakthrough technique for generalizing RL to hard-to-verify domains that scored them IMO gold is just... "use the LLM as a judge"? Sources: the main one is paywalled, but this seems to capture the main data, and you can also search for various crumbs here and here.

The technical details of how exactly the universal verifier works aren’t yet clear. Essentially, it involves tasking an LLM with the job of checking and grading another model’s answers by using various sources to research them.

My understanding is that they ap... (read more)
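
For what it's worth, the shape of the idea seems to be roughly the sketch below (my guess, not OpenAI's system; call_model is a hypothetical completion function): an LLM judge is shown the task, the candidate answer, and whatever reference material it gathered, and returns a 0-to-1 grade used as the RL reward.

JUDGE_PROMPT = """You are grading another model's answer.
Task: {task}
Candidate answer: {answer}
Reference material: {references}
Research the claims, then output only a grade between 0 and 1."""

def universal_verifier_reward(task: str, answer: str, references: str, call_model) -> float:
    reply = call_model(JUDGE_PROMPT.format(task=task, answer=answer, references=references))
    try:
        return min(max(float(reply.strip()), 0.0), 1.0)
    except ValueError:
        return 0.0  # an unparseable grade counts as no reward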

Reply
Foom & Doom 1: “Brain in a box in a basement”
Donald Hobson25d10

Since evolution found the human brain algorithm, and evolution only does local search, the human brain algorithm must be built out of many innovations that are individually useful. So we shouldn't expect the human brain algorithm to be an all-or-nothing affair.

 

If humans are looking at parts of the human brain, and copying it, then it's quite possible that the last component we look at is the critical piece that nothing else works without.  A modern steam engine was developed step by step from simpler and cruder machines. But if you take apart a ... (read more)

Reply
Foom & Doom 1: “Brain in a box in a basement”
Donald Hobson25d10

One thing I disagree with is the idea that there is only one "next paradigm AI" with specific properties. 

I think there is a wide spectrum of next-paradigm AIs, some safer than others. Brain-like AIs are just one option out of a large possibility space.

 

And if the AI is really brainlike, that suggests making an AI that's altruistic for the same reason some humans are. Making a bunch of IQ 160, 95th percentile kindness humans, and basically handing the world over to them sounds like a pretty decent plan.

Reply
Foom & Doom 1: “Brain in a box in a basement”
Donald Hobson25d10

But they still involve some AI having a DSA at some point. So they still involve a giant terrifying single point of failure.

 

A single point of failure also means a single point of success. 

It could be much worse. We could have 100s of points of failure, and if anything goes wrong at any of those points, we are doomed. 

Reply
Foom & Doom 1: “Brain in a box in a basement”
Donald Hobson25d10

It includes chips that have neither been already hacked into, nor secured, nor had their rental price massively bid upwards. It includes brainwashable humans who have neither been already brainwashed, nor been defended against further brainwashing.

 

We are already seeing problems with ChatGPT-induced psychosis. And seeing LLMs that kinda hack a bit.

What does the world look like if it is saturated with moderately competent hacking/phishing/brainwashing LLMs? Yes, a total mess. But a mess with less free energy perhaps? Especially if humans have developed some better defenses. Probably still a lot of free energy, but less.

Reply
Foom & Doom 1: “Brain in a box in a basement”
Steven Byrnes25d20

I talk about that a bit in §1.4.4.

Reply
Foom & Doom 1: “Brain in a box in a basement”
Donald Hobson25d10

These posts are mainly exploring my disagreement with a group of researchers who think of LLMs[2] as being on a smooth, continuous path towards ASI.

 

I'm not sure exactly what it means for LLMs to be on a "continuous path towards ASI".

I'm pretty sure that LLMs aren't the pinnacle of possible mind design.

So the question is: will better architectures be invented by a human or an LLM, and how scaled will the LLM be when this happens?

Reply
If you can generate obfuscated chain-of-thought, can you monitor it?
Fabien Roger26d20

Good question, the Base GPT 4.1 no CoT is quite bad (from 66.6% side task success with CoT to 27.2% without), so I'd be surprised if finetunes are different, but I can try.

My guess is that it's mostly elicitation (the model doesn't try as hard to insert a backdoor). I think APPS backdoors are not that hard to design.

Reply
The Open Agency Model
Chris_Leong1mo10

Is there any chance you could define what you mean by "open agency"? Do you essentially mean "distributed agency"?

Reply
Perils of under- vs over-sculpting AGI desires
Fabien Roger1mo20

I like this framing!

So in other words, when you, a human, ask yourself whether something is or is not “human flourishing”, you’re following a pointer to the full power of your human moral and philosophical reasoning (Valence series §2.7). So no wonder the concept “human flourishing” seems (from your perspective) to generalize well to out-of-distribution scenarios! [...]

By contrast, when an AGI is deciding whether some new situation is or isn’t a good pattern-match to “human flourishing”, it does not have a pointer to the ground-truth human reward-function,

... (read more)
Reply
Towards Alignment Auditing as a Numbers-Go-Up Science
Sam Marks1mo*134

Thanks, I appreciate you writing up your view in more detail. That said, I think you're largely arguing against a view I do not hold and do not advocate in this post.

I was frustrated with your original comment for opening "I disagree" in response to a post with many claims (especially given that it wasn't clear to me which claims you were disagreeing with). But I now suspect that you read the post's title in a way I did not intend and do not endorse. I think you read it as an exhortation: "Let's introduce progress metrics!"

In other words, I think you are a... (read more)

Reply3
Towards Alignment Auditing as a Numbers-Go-Up Science
Richard_Ngo1mo1612

Ty for the reply. A few points in response:

Of course, you might not know which problem your insights allow you to solve until you have the insights. I'm a big fan of constructing stylized problems that you can solve, after you know which insight you want to validate.

That said, I think it's even better if you can specify problems in advance to help guide research in the field. The big risk, then, is that these problems might not be robust to paradigm shifts (because paradigm shifts could change the set of important problems). If that is your concern, then I

... (read more)
Reply1
Towards Alignment Auditing as a Numbers-Go-Up Science
Sam Marks1mo32

I agree that the hard/important part is the model organism construction. That said, I think having auditing agents is still a prerequisite for a bunch of the claims in this post. Auditing agents make auditing games (1) repeatable and (2) self-serve (in the sense that e.g. a single researcher could in principle run their own agent-based auditing game evaluation multiple times over the course of a project to see if things are going well). If auditing games were inherently single-use (because of cost + the fact that you can't rerun the game on unspoiled audit... (read more)

Reply
Towards Alignment Auditing as a Numbers-Go-Up Science
Sam Marks1mo*215

It seems like your claim is that work which produces scientific understanding is antithetical to being quantitatively measured. I disagree. 

As I've argued elsewhere, if you claim to have an insight, a great way to validate that the insight is real is to show you can solve a problem that no one else can solve. E.g. atomic physicists in 1945 pretty convincingly demonstrated that their insights weren't bunk by causing a bigger explosion per gram than earlier physicists would have thought possible.

Of course, you might not know which problem your... (read more)

Reply1
Towards Alignment Auditing as a Numbers-Go-Up Science
Richard_Ngo1mo246

I strongly disagree. "Numbers-Go-Up Science" is an oxymoron: great science (especially what Kuhn calls revolutionary science) comes from developing novel models or ontologies which can't be quantitatively compared to previous ontologies.

Indeed, in an important sense, the reason the alignment problem is a big deal in the first place is that ML isn't a science which tries to develop deep explanations of artificial cognition, but instead a numbers-go-up discipline.

And so the idea of trying to make (a subfield of) alignment more like architecture design, perfo... (read more)

Reply
If you can generate obfuscated chain-of-thought, can you monitor it?
Asa Cooper Stickland1mo10

How good are no-CoT finetuned models?

Good question, the Base GPT 4.1 no CoT is quite bad (from 66.6% side task success with CoT to 27.2% without), so I'd be surprised if finetunes are different, but I can try.

But maybe it's interesting that finetuned models "know they have been nasty"? I find these results somewhat surprising because of the "double down on lies" phenomenon. Do you know why you don't see that here?

The finetuning does not especially encourage evilness as it's 50% honest behaviour. And the "backdoor" is just a wrong input to a toy coding prob... (read more)

Reply
Towards Alignment Auditing as a Numbers-Go-Up Science
Fabien Roger1mo113

My guess is that the hard and most important part is the part where you build good model organisms.

The part about auditing agents isn't very load-bearing imo. In the auditing game paper, you were able to get a somewhat good sense of how good various methods were (fuzzing, dataset search with embeddings, ...) without actually doing a full model audit or running an auditing agent, and I don't think this made the takeaways much weaker. I think real human audits and automated audits are nice as final validations, but most of the difficulty and most of the valu... (read more)

Reply
Towards Alignment Auditing as a Numbers-Go-Up Science
Sam Marks1mo82

constructing good auditing test beds has been a huge pain

Yeah, I confirm this is the case. I estimate that the model organism from Auditing Language Models for Hidden Objectives took 8-12 FTE-months, which I think most people find surprisingly expensive. That said, 30% of this time probably went into implementing synthetic document finetuning and testing that it worked. I'd guess I could now make another model organism of similar quality in 1.5-3 FTE-months.

Reply
If you can generate obfuscated chain-of-thought, can you monitor it?
Fabien Roger1mo40

These are interesting results! I think I would have predicted somewhat weaker attack-to-monitor transfer. In particular, it's striking how models trained to use ciphers rate ciphers as much less suspicious than the pre-finetuning model. I also like your safety steg attack.

How good are no-CoT finetuned models? I think the story becomes a bit more muddy if the CoT is not necessary, since it would probably mean all monitors are very underelicited and could in principle understand there is a backdoor. Using the original APPS backdooring untrusted monitor prompt would p... (read more)

Reply
Towards Alignment Auditing as a Numbers-Go-Up Science
ryan_greenblatt1mo98

This seems pretty reasonable (strong upvoted).

These aren't disagreements with the post, but here are some of my thoughts (most of which Sam already knows, but I'm sharing for third parties):

My largest concern is that the auditing test beds will end up being pretty disanalogous to the cases of misalignment I find most concerning and thus hill climbing on these test beds isn't that useful. (You discuss the need for high quality environments in the appendix. I agree.) In practice, my understanding is that constructing good auditing test beds has been a huge p... (read more)

Reply2
AI-enabled coups: a small group could use AI to seize power
Lukas Finnveden1mo10

The international race seems like a big deal. Ending the domestic race is good, but I'd still expect reckless competition I think.

I was thinking that AI capabilities must already be pretty high by the time an AI-enabled coup is possible. If one country also had a big lead, then probably they would soon have strong enough capabilities to end the international race too. (And the fact that they were willing to coup internally is strong evidence that they'd be willing to do that.)

But if the international race is very tight, that argument doesn't work.

I don't t

... (read more)
Reply
AI-enabled coups: a small group could use AI to seize power
rosehadshar1mo*10

I think I agree that, once an AI-enabled coup has happened, the expected remaining AI takeover risk would be much lower. This is partly because it ends the race within the country where the takeover happened (though it wouldn't necessarily end the international race), but also partly just because of the evidential update: apparently AI is now capable of taking over countries, and apparently someone could instruct the AIs to do that, and the AIs handed the power right back to that person! Seems like alignment is working.

I don't currently agree that the rema... (read more)

Reply
AI-enabled coups: a small group could use AI to seize power
Tom Davidson1mo10

Thanks very much for this.

The statement you quoted implicitly assumes that work on reducing human takeover won't affect the probability of AI takeover. And I agree that it might well affect that. And those effects are important to track. We should be very cautious about doing things that reduce human takeover risk but increase AI takeover risk.

But I don't think reducing human takeover risk does typically increase AI takeover risk. First, some points at a high level of abstraction:

  • If human takeover is possible then the incentive to race is a lot higher.
... (read more)
Reply
How to prevent collusion when using untrusted models to monitor each other
ryan_greenblatt1mo20

Another type of ensembling which helps due to both uncertainty about scheming and a reduced chance of (successful) collusion is ensembling over AIs from multiple different providers as noted here. In practice, this might be challenging due to logistics/security (at least for non-open-weight AIs and there might not be competitive open-weight AIs).

Reply
AI-enabled coups: a small group could use AI to seize power
Lukas Finnveden1mo40

That's a good point.

I think I agree that, once an AI-enabled coup has happened, the expected remaining AI takeover risk would be much lower. This is partly because it ends the race within the country where the takeover happened (though it wouldn't necessarily end the international race), but also partly just because of the evidential update: apparently AI is now capable of taking over countries, and apparently someone could instruct the AIs to do that, and the AIs handed the power right back to that person! Seems like alignment is working.

Related to that e... (read more)

Reply1
AI-enabled coups: a small group could use AI to seize power
habryka1mo*51

 But even if human takeover is 1/10 as bad as AI takeover, the case for working on AI-enabled coups is strong due to its neglectedness. 

The logic for this doesn't check out, I think? If human takeover is 1/10 as bad as AI takeover, and human takeover pre-empts AI takeover (because it ends the race and competitiveness dynamics giving rise to most of the risk of AI takeover), then a human takeover might be the best thing that could happen to humanity. This makes the case for working on AI-enabled coups particularly weak. 

If by 1/10th as bad we... (read more)

Reply
Optimizing The Final Output Can Obfuscate CoT (Research Note)
Tomek Korbak1mo50

Cool research!

  1. Have you tried experimenting with models other than Qwen? Similarly to Sam, I'd also expect spillover to decrease with size and the amount of reasoning post-training.
  2. If you're penalizing a single word, you might just be unconditionally decreasing the unembedding weights that control the logits of that word's tokens. But this strategy doesn't work for more complicated types of obfuscation.

Conditional feedback spillover: Since later tokens are conditioned on earlier tokens, safe‑looking CoTs may increase the likelihood of safe outputs, causing sa

... (read more)
Reply
Optimizing The Final Output Can Obfuscate CoT (Research Note)
lukemarks1mo60

I agree with the caveats mentioned and appreciate you mentioning them. One important point is that the runs with the judge penalty were for only 10000 episodes. I was able to do a run for 40000 episodes with the judge penalty, and the CoT judge penalty dropped to 0.05 with a pass rate of 87% (which is higher than the 82.6% mean pass rate for our 10000 episode runs in the post). I'm reluctant to infer too much about the competitiveness of obfuscatory rewards from our results, as it seems plausible that the effect could become stronger by training for longer... (read more)

Reply1