Quick Takes

I found this article helpful and depressing. Kudos to TracingWoodgrains for detailed, thorough investigation.


learning thread for taking notes on things as i learn them (in public so hopefully other people can get value out of it)



a normal autoencoder decodes single latents z to single images (or whatever other kind of data) x, and also encodes single images x to single latents z. 

with VAEs, we want our decoder (p(x|z)) to take single latents z and output a distribution over x's. for simplicity we generally declare that this distribution is a gaussian with identity covariance, and we have our decoder output a single x value that is the mean of the gaussian.

because each x can be produced by multiple z's, to run this backwards you also need a distribution of z's for each sin... (read more)
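To make the decoder convention above concrete, here is a minimal sketch (not from the original notes) of what "a gaussian with identity covariance whose mean is the decoder output" implies for the likelihood term: the negative log-density of x is half the squared error plus a constant. The function name and the example vectors are hypothetical, chosen only for illustration.

```python
import math

def gaussian_nll_identity_cov(x, mean):
    """Negative log-density of x under a Gaussian N(mean, I)."""
    squared_error = sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
    # The second term is a constant that does not depend on the decoder output.
    return 0.5 * squared_error + 0.5 * len(x) * math.log(2 * math.pi)

# Hypothetical data point and two candidate decoder outputs (means).
x = [0.2, 0.7, 0.1]
close_mean = [0.0, 1.0, 0.0]
far_mean = [1.0, 0.0, 1.0]

nll_close = gaussian_nll_identity_cov(x, close_mean)
nll_far = gaussian_nll_identity_cov(x, far_mean)
```

Because the constant cancels, comparing two decoder outputs under this assumption is the same as comparing their squared errors, which is one way to see why the reconstruction term often looks like a plain MSE loss.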


On 2018-04-09, OpenAI said[1]:

OpenAI’s mission is to ensure that artificial general intelligence (AGI) [...] benefits all of humanity.

In contrast, in 2023, OpenAI said[2]:

[...] OpenAI’s mission: to build artificial general intelligence (AGI) that is safe and benefits all of humanity.

  1. Archived ↩︎

  2. This archived snapshot is from 2023-05-17, but the document didn't get much attention until November that year. ↩︎

Rereading this classic by Ajeya Cotra: https://www.planned-obsolescence.org/july-2022-training-game-report/

I feel like this is an example of a piece that is clear, well-argued, important, etc. but which doesn't seem to have been widely read and responded to. I'd appreciate pointers to articles/posts/papers that explicitly (or, failing that, implicitly) respond to Ajeya's training game report. Maybe the 'AI Optimists?' 

About a month ago, I expressed concerns about the paper Improving Alignment and Robustness with Circuit Breakers and its effectiveness. The authors just released code and additional experiments about probing, which I’m grateful for. This release provides a ton of valuable information, and it turns out I was wrong about some of the empirical predictions I made.

Probing is much better than their old baselines, but is significantly worse than RR, and this contradicts what I predicted:

I'm glad they added these results to the main body of the paper!... (read more)

3Fabien Roger
Yeah, I expect that this kind of thing might work, though this would 2x the cost of inference. An alternative is "attention head probes", MLP probes, and things like that (which don't increase inference cost), + maybe different training losses for the probe (here we train per-sequence position and aggregate with max), and I expect something in this reference class to work as well as RR, though it might require RR-levels of tuning to actually work as well as RR (which is why I don't consider this kind of probing as a baseline you ought to try).
2Sam Marks
Why would it 2x the cost of inference? To be clear, my suggested baseline is "attach exactly the same LoRA adapters that were used for RR, plus one additional linear classification head, then train on an objective which is similar to RR but where the rerouting loss is replaced by a classification loss for the classification head." Explicitly this is to test the hypothesis that RR only worked better than HP because it was optimizing more parameters (but isn't otherwise meaningfully different from probing). (Note that LoRA adapters can be merged into model weights for inference.) (I agree that you could also just use more expressive probes, but I'm interested in this as a baseline for RR, not as a way to improve robustness per se.)

I was imagining doing two forward passes: one with and one without the LoRAs, but you had in mind adding "keep behavior the same" loss in addition to the classification loss, right? I guess that would work, good point.
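The "train per-sequence position and aggregate with max" probing setup mentioned in this thread can be sketched roughly as follows. This is an illustrative toy, not the code from the released experiments: the probe weights, the bias, and the activation vectors are all hypothetical, and a real probe would read activations from a model layer.

```python
import math

def probe_logit(activation, weights, bias):
    """Linear probe applied to the activation at one sequence position."""
    return sum(a * w for a, w in zip(activation, weights)) + bias

def sequence_score(activations, weights, bias):
    """Score every position, then aggregate with max over positions."""
    logits = [probe_logit(a, weights, bias) for a in activations]
    # Sigmoid of the largest per-position logit gives a sequence-level score.
    return 1.0 / (1.0 + math.exp(-max(logits)))

# Hypothetical 2-dimensional activations for a 3-token sequence.
activations = [[0.0, 0.0], [1.0, 1.0], [-1.0, 0.5]]
weights, bias = [1.0, 1.0], 0.0

score = sequence_score(activations, weights, bias)
```

With max aggregation, a single strongly flagged position is enough to flag the whole sequence, which is the behaviour one would want if harmful content can appear anywhere in a generation.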

In the last year, I've had surprisingly many conversations that have looked a bit like this:

Me: "Many people in ~2015 used to say that it would be hard to build an AGI that follows human values. Current instruction-tuned LLMs are essentially weak AGIs that follow human values. We should probably update based on this evidence."

Interlocutor: "You misunderstood the argument. We never said it would be hard to build an AGI that understands human values. We always said that getting the AGI to care was the hard part."

Me: "I didn't misunderstand the argument. I un... (read more)


Thanks for this Matthew, it was an update for me -- according to the quote you pulled Bostrom did seem to think that understanding would grow up hand-in-hand with agency, such that the current understanding-without-agency situation should come as a positive/welcome surprise to him. (Whereas my previous position was that probably Bostrom didn't have much of an opinion about this.)

1Matthew Barnett
I'm happy to use a functional definition of "understanding" or "intelligence" or "situational awareness". If a system possesses all relevant behavioral qualities that we associate with those terms, I think it's basically fine to say the system actually possesses them, outside of (largely irrelevant) thought experiments, such as those involving hypothetical giant lookup tables. It's possible this is our main disagreement. When I talk to GPT-4, I think it's quite clear it possesses a great deal of functional understanding of human intentions and human motives, although it is imperfect. I also think its understanding is substantially higher than GPT-3.5, and the trend here seems clear. I expect GPT-5 to possess a high degree of understanding of the world, human values, and its own place in the world, in practically every functional (testable) sense. Do you not? I agree that GPT-4 does not understand the world in the same way humans understand the world, but I'm not sure why that would be necessary for obtaining understanding. The fact that it understands human intentions at all seems more important than whether it understands human intentions in the same way we understand these things. I'm similarly confused by your reference to introspective awareness. I think the ability to reliably introspect on one's own experiences is pretty much orthogonal to whether one has an understanding of human intentions. You can have reliable introspection without understanding the intentions of others, or vice versa. I don't see how that fact bears much on the question of whether you understand human intentions. It's possible there's some connection here, but I'm not seeing it. I'd claim: 1. Current systems have limited situational awareness. It's above zero, but I agree it's below human level. 2. Current systems don't have stable preferences over time. But I think this is a point in favor of the model I'm providing here. I'm claiming that it's plausibly easy to create smart, corr
But this is assuming away a substantial portion of the entire argument: that there is a relevant difference between current systems, and systems which meaningfully have the option to take control of the future, in terms of whether techniques that look like they're giving us the desired behavior now will continue to give us desired behavior in the future. My point re: introspection was trying to provide evidence for the claim that model outputs are not a useful reflection of the internal processes which generated those outputs, if you're importing expectations from how human outputs reflect the internal processes that generated them.  If you get a model to talk to you about its internal experiences, that output was not causally downstream of it having internal experiences.  Based on this, it is also pretty obvious that current gen LLMs do not have meaningful amounts of situational awareness, or, if they do, that their outputs are not direct evidence for it.  Consider Anthropic's Sleeper Agents.  Would a situationally aware model use a provided scratch pad to think about how it's in training and needs to pretend to be helpful?  No, and neither does the model "understand" your intentions in a way that generalizes out of distribution the way you might expect a human's "understanding" to generalize out of distribution, because the first ensemble of heuristics found by SGD for returning the "right" responses during RLHF are not anything like human reasoning. Are you asking for a capabilities threshold, beyond which I'd be very surprised to find that humans were still in control decades later, even if we successfully hit pause at that level of capabilities? The obvious one is "can it replace humans at all economically valuable tasks", which is probably not that helpful.  Like, yes, there is definitely a sense in which the current situation is not maximally bad, because it does seem possible that we'll be able to train models capable of doing a lot of economically useful

The second general point to be learned from the bitter lesson is that the actual contents of minds are tremendously, irredeemably complex; we should stop trying to find simple ways to think about the contents of minds, such as simple ways to think about space, objects, multiple agents, or symmetries. All these are part of the arbitrary, intrinsically-complex, outside world. They are not what should be built in, as their complexity is endless; instead we should build in only the meta-methods that can find and capture this arbitrary complexity.

The bitter les... (read more)

I think people in the safety community underrate the following possibility: early transformatively-powerful models are pretty obviously scheming (though they aren't amazingly good at it), but their developers are deploying them anyway, either because they're wildly irresponsible or because they're under massive competitive pressure.

In this world, it's not very valuable to have better techniques for determining whether models are scheming (because they obviously are), but it is valuable to have better techniques for reducing the risk associated with deployi... (read more)


early transformatively-powerful models are pretty obviously scheming (though they aren't amazingly good at it), but their developers are deploying them anyway

So... Sydney?

7Matthew Barnett
Does this mean that if in, say, 1-5 years, it's not pretty obvious that SOTA deployed models are scheming, you would be surprised?  That is, suppose we get to a point where models are widespread and producing lots of economic value, and the models might be scheming but the evidence is weak and uncertain, with arguments on both sides, and no one can reasonably claim to be confident that currently deployed SOTA models are scheming. Would that mean your default prediction was wrong?
2Oliver Habryka
Ah, to be clear, in as much as I do have hope, it does route through this kind of pause. I am generally pessimistic about that happening, but it is where a lot of my effort these days goes into. And then in those worlds, I do agree that a lot of progress will probably be made with substantial assistance from these early systems. I do expect it to take a good while until we figure out how to do that, and so don't see much hope for that kind of work happening where humanity doesn't substantially pause or slow down cutting-edge system development. 

I'm currently working as a contractor at Anthropic in order to get employee-level model access as part of a project I'm working on. The project is a model organism of scheming, where I demonstrate scheming arising somewhat naturally with Claude 3 Opus. So far, I’ve done almost all of this project at Redwood Research, but my access to Anthropic models will allow me to redo some of my experiments in better and simpler ways and will allow for some exciting additional experiments. I'm very grateful to Anthropic and the Alignment Stress-Testing team for providi... (read more)


I'd be surprised if this was employee-level access. I'm aware of a red-teaming program that gave early API access to specific versions of models, but not anything like employee-level.

1Zach Stein-Perlman
12Rachel Freedman
It was a secretive program — it wasn’t advertised anywhere, and we had to sign an NDA about its existence (which we have since been released from). I got the impression that this was because OpenAI really wanted to keep the existence of GPT4 under wraps. Anyway, that means I don’t have any proof beyond my word.

We know that "AI is whatever doesn't work yet". We also know that people often contrast AI (or DL, or LLMs specifically) derogatorily with classic forms of software, such as regexps: why use an LLM to waste gigaflops of compute to do what a few good regexps could...?

So I was amused to discover recently, by sheer accident while looking up 'what does the "regular" in "regular expression" mean, anyway?', that it turns out that regexps are AI. In fact, they are not even GOFAI symbolic AI, as you immediately assumed on hearing that, but they were originally conne... (read more)

AI safety people often emphasize making safety cases as the core organizational approach to ensuring safety. I think this might cause people to anchor on relatively bad analogies to other fields.

Safety cases are widely used in fields that do safety engineering, e.g. airplanes and nuclear reactors. See e.g. “Arguing Safety” for my favorite introduction to them. The core idea of a safety case is to have a structured argument that clearly and explicitly spells out how all of your empirical measurements allow you to make a sequence of conclusions that establis... (read more)

1Seth Herd
What about this alternate twist: safety cases are the right model, but it just happens to be extremely difficult to make an adequate safety case for competent agentic AGI (or anything close). Introducing the safety case model for near-future AI releases could normalize that. It should be pretty easy to make a safety case for GPT4 and Claude 3.5. When people want to deploy real AGI, they won't be able to make the safety case without real advances in alignment. And that's the point.

My guess is that it's infeasible to ask for them to delay til they have a real safety case, due to insufficient coordination (including perhaps international competition).

5William Saunders
Would be nice if it was based on "actual robot army was actually being built and you have multiple confirmatory sources and you've tried diplomacy and sabotage and they've both failed" instead of "my napkin math says they could totally build a robot army bro trust me bro" or "they totally have WMDs bro" or "we gotta blow up some Japanese civilians so that we don't have to kill more Japanese civilians when we invade Japan bro" or "dude I'm seeing some missiles on our radar, gotta launch ours now bro".

I listened to the lecture series Assessing America’s National Security Threats by H. R. McMaster, a 3-star general who was the US national security advisor in 2017. It didn't have much content about how to assess threats, but I found it useful to get a peek into the mindset of someone in the national security establishment.

Some highlights:

  • Even in the US, it sometimes happens that the strategic analysis is motivated by the views of the leader. For example, McMaster describes how Lyndon Johnson did not retreat from Vietnam early enough, in part because criti
... (read more)

I listened to the book Protecting the President by Dan Bongino, to get a sense of how risk management works for US presidential protection - a risk that is high-stakes, where failures are rare, where the main threat is the threat from an adversary that is relatively hard to model, and where the downsides of more protection and its upsides are very hard to compare.

Some claims the author makes (often implicitly):

  • Large bureaucracies are amazing at creating mission creep: the service was initially in charge of fighting against counterfeit currency, got preside
... (read more)

This was interesting, thanks! I really enjoy your short book reviews


I have some long comments I can't refind now (weirdly) about the difficulty of investing based on AI beliefs (or forecasting in general): similar to catching falling knives, timing is all-important and yet usually impossible to nail down accurately; specific investments are usually impossible if you aren't literally founding the company, and indexing 'the entire sector' definitely impossible. Even if you had an absurd amount of money, you could try to index and just plain fail - there is no index which covers, say, OpenAI.

Apropos, Matt Levine comments on o... (read more)


Masayoshi Son reflects on selling Nvidia in order to maintain ownership of ARM etc: https://x.com/TheTranscript_/status/1805012985313903036 "Let's stop talking about this, I just get sad."

So among the most irresponsible tech stonk boosters has long been ARK's Cathie Wood, whose antics I've refused to follow in any detail (except to periodically reflect that in bull markets the most over-leveraged investors always look like geniuses); so only today do I learn that beyond the usual stuff like slobbering all over TSLA (which has given back something like 4 years of gains now), Wood has also adamantly refused to invest in Nvidia recently and in fact, managed to exit her entire position at an even worse time than SoftBank did: "Cathie Wood’s Popular ARK Funds Are Sinking Fast: Investors have pulled a net $2.2 billion from ARK’s active funds this year, topping outflows from all of 2023" (mirror):

[Edit: The authors released code and probing experiments. Some of the empirical predictions I made here resolved, and I was mostly wrong. See here for my takes and additional experiments.]

I have a few concerns about Improving Alignment and Robustness with Circuit Breakers, a paper that claims to have found a method which achieves very high levels of adversarial robustness in LLMs.

I think hype should wait for people investigating the technique (which will be easier once code and datasets are open-sourced), and running comparisons with simpler baselines... (read more)

10Dan H
The loss is cleaner, IDK about "massively," because in the first half of the loss we use a simpler distance involving 2 terms instead of 3. This doesn't affect performance and doesn't markedly change quantitative or qualitative claims in the paper. Thanks to Marks and Patel for pointing out the equivalent cleaner loss, and happy for them to be authors on the paper. This puzzles me and maybe we just have a different sense of what progress in adversarial robustness looks like. 20% that no one could find a jailbreak within 3 months? That would be the most amazing advance in robustness ever if that were true and should be a big update on jailbreak robustness tractability. If it takes the community more than a day that's a tremendous advance. This is a little nonspecific (does easily mean >0% ASR with an automated attack, or does it mean a high ASR?). I should say we manually found a jailbreak after messing with the model for around a week after releasing. We also invited people who have a reputation as jailbreakers to poke at it and they had a very hard time. Nowhere did we claim "there are no more jailbreaks and they are solved once and for all," but I do think it's genuinely harder now. We had the idea a few times to try out a detection-based approach but we didn't get around to it. It seems possible that it'd perform similarly if it's leaning on the various things we did in the paper. (Obviously probing has been around but people haven't gotten results at this level, and people have certainly tried using detecting adversarial attacks in hundreds of papers in the past.) IDK if performance would be that different from circuit-breakers, in which case this would still be a contribution. I don't really care about the aesthetics of methods nearly as much as the performance, and similarly performing methods are fine in my book. A lot of different-looking deep learning methods perform similarly. A detection based method seems fine, so does a defense that's tuned into the
3Neel Nanda
I think it was an interesting paper, but this analysis and predictions all seem extremely on point to me

The Scaling Monosemanticity paper doesn't do a good job comparing feature clamping to steering vectors. 

Edit 6/20/24: The authors updated the paper; see my comment.

To better understand the benefit of using features, for a few case studies of interest, we obtained linear probes using the same positive / negative examples that we used to identify the feature, by subtracting the residual stream activity in response to the negative example(s) from the activity in response to the positive example(s). We experimented with (1) visualizing the top-activating

... (read more)
16Alex Turner
The authors updated the Scaling Monosemanticity paper. Relevant updates include:

  1. In the intro, they added:
  2. The related work section now credits the rich history behind steering vectors / activation engineering, including not just my team's work on activation additions, but also older literature in VAEs and GANs. (EDIT: Apparently this was always there? Maybe I misremembered the diff.)
  3. The comparison results are now in an appendix and are much more hedged, noting they didn't evaluate properly according to a steering vector baseline.

While it would have been better to have done this the first time, I really appreciate the team updating the paper to more clearly credit past work. :)

Oh, that's great! Kudos to the authors for setting the record straight. I'm glad your work is now appropriately credited

2Fabien Roger
I think DIM and LR aren't spiritually different (e.g. LR with infinite L2 regularization gives you the same direction as DIM), even though in practice DIM is better for steering (and ablations). But I agree with you that "steering vectors" is the good expression to talk about directions used for steering (while I would use linear probes to talk about directions used to extract information or trained to extract information and used for another purpose).
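For readers unfamiliar with the terms in this thread, here is a small sketch of a difference-in-means (DIM) direction and of using it as a steering vector. It is a toy under stated assumptions: the "activations" are hypothetical 2-dimensional lists standing in for residual stream vectors, and the function names are made up for illustration rather than taken from the paper or any library.

```python
def mean_vector(rows):
    """Coordinate-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(rows) for column in zip(*rows)]

def difference_in_means(positive, negative):
    """DIM direction: mean positive activation minus mean negative activation."""
    pos_mean = mean_vector(positive)
    neg_mean = mean_vector(negative)
    return [p - n for p, n in zip(pos_mean, neg_mean)]

def steer(activation, direction, scale):
    """Steering: add a scaled copy of the direction to an activation."""
    return [a + scale * d for a, d in zip(activation, direction)]

# Hypothetical activations on positive and negative examples of a concept.
positive = [[1.0, 2.0], [3.0, 4.0]]
negative = [[0.0, 0.0], [2.0, 0.0]]

direction = difference_in_means(positive, negative)
steered = steer([0.0, 0.0], direction, scale=2.0)
```

The same direction could also be read as a probe by taking a dot product with an activation, which is the sense in which the two uses differ in purpose rather than in the underlying object.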

An important thing that the AGI alignment field never understood:

Reflective stability. Everyone thinks it's about, like, getting guarantees, or something. Or about rationality and optimality and decision theory, or something. Or about how we should understand ideal agency, or something.

But what I think people haven't understood is

  1. If a mind is highly capable, it has a source of knowledge.
  2. The source of knowledge involves deep change.
  3. Lots of deep change implies lots of strong forces (goal-pursuits) operating on everything.
  4. If there's lots of strong goal-p
... (read more)
Richard Ngo26-16

Some opinions about AI and epistemology:

  1. One reason that many rationalists have such strong views about AI is that they are wrong about epistemology. Specifically, Bayesian rationalism is a bad way to think about complex issues.
  2. A better approach is meta-rationality. To summarize one guiding principle of (my version of) meta-rationality in a single sentence: if something doesn't make sense in the context of group rationality, it probably doesn't make sense in the context of individual rationality either.
  3. For example: there's no privileged way to combin
... (read more)

How can the mistakes rationalists are making be expressed in the language of Bayesian rationalism? Priors, evidence, and posteriors are fundamental to how probability works.

12Oliver Habryka
I don't really buy this statement. Most frames, from most disciplines, and most styles of reasoning, do not make clear predictions about what will happen to humanity in the long-run future. A very few do, but the vast majority are silent on this issue. Silence is not anything like "50%".  Most frames, from most disciplines, and most styles of reasoning, don't predict sparks when you put metal in a microwave. This doesn't mean I don't know what happens when you put metal in a microwave. You need to at the very least limit yourself to applicable frames, and there are very few applicable frames for predicting humanity's long-term future. 
I agree with this. Unfortunately, I think there's a fundamentally inside-view aspect of [problems very different from those we're used to]. I think looking for a range of frames is the right thing to do - but deciding on the relevance of the frame can only be done by looking at the details of the problem itself (if we instead use our usual heuristics for relevance-of-frame-x, we run into the same out-of-distribution issues).

I don't think there's a way around this. Aspects of this situation are fundamentally different from those we're used to. [Is different from] is not a useful relation - we can't get far by saying "We've seen [fundamentally different] situations before - what happened there?". It'll all come back to how they were fundamentally different.

To say something mildly more constructive, I do still think we should be considering and evaluating other frames, based on our own inside-view model (with appropriate error bars on that model). A place I'd start here would be:

  • Attempt to understand another frame.
  • See how far I need to zoom out before that frame's models become a reasonable abstraction for the problem-as-I-understand-it.
  • Find the smallest changes to my models that'd allow me to stick with this frame without zooming out so far. Assess the probability that these adjusted models are correct/useful.

For most frames, I end up needing to zoom out too far for them to say much of relevance - so this doesn't much change my p(doom) assessment. It seems more useful to apply other frames to evaluate smaller parts of our models. I'm sure there are a bunch of places where intuitions and models from e.g. economics or physics do apply to safety-related subproblems.

Here is a modification of the IBP framework which removes the monotonicity principle, and seems to be more natural in other ways as well.

First, let our notion of "hypothesis" be . The previous framework can be interpreted in terms of hypotheses of this form satisfying the condition

(See Proposition 2.8 in the original article.) In the new framework, we replace it by the weaker condition

This can be roughly interpreted as requiring that (i) whenever the output of a program P determines whether some other program... (read more)

2Vanessa Kosoy
Sort of obvious but good to keep in mind: Metacognitive regret bounds are not easily reducible to "plain" IBRL regret bounds when we consider the core and the envelope as the "inside" of the agent.

Assume that the action and observation sets factor as $A=A_0\times A_1$ and $O=O_0\times O_1$, where $(A_0,O_0)$ is the interface with the external environment and $(A_1,O_1)$ is the interface with the envelope.

Let $\Lambda:\Pi\to\square(\Gamma\times(A\times O)^\omega)$ be a metalaw. Then, there are two natural ways to reduce it to an ordinary law:

  • Marginalizing over $\Gamma$. That is, let $\mathrm{pr}_{-\Gamma}:\Gamma\times(A\times O)^\omega\to(A\times O)^\omega$ and $\mathrm{pr}_0:(A\times O)^\omega\to(A_0\times O_0)^\omega$ be the projections. Then, we have the law $\Lambda^?:=(\mathrm{pr}_0\,\mathrm{pr}_{-\Gamma})_*\circ\Lambda$.
  • Assuming "logical omniscience". That is, let $\tau^*\in\Gamma$ be the ground truth. Then, we have the law $\Lambda^!:=(\mathrm{pr}_0)_*(\Lambda\mid\tau^*)$. Here, we use the conditional defined by $\Theta\mid A:=\{\theta\mid A \;:\; \theta\in\arg\max_{\Theta}\Pr[A]\}$. It's easy to see this indeed defines a law.

However, requiring low regret w.r.t. either of these is not equivalent to requiring low regret w.r.t. $\Lambda$:

  • Learning $\Lambda^?$ is typically no less feasible than learning $\Lambda$, however it is a much weaker condition. This is because the metacognitive agents can use policies that query the envelope to get higher guaranteed expected utility.
  • Learning $\Lambda^!$ is a much stronger condition than learning $\Lambda$, however it is typically infeasible. Requiring it leads to AIXI-like agents.

Therefore, metacognitive regret bounds hit a "sweet spot" of strength vs. feasibility which produces genuinely more powerful agents than IBRL[1].

  1. ^ More precisely, more powerful than IBRL with the usual sort of hypothesis classes (e.g. nicely structured crisp infra-RDP). In principle, we can reduce metacognitive regret bounds to IBRL regret bounds using non-crisp laws, since there's a very general theorem for representing desiderata as laws. But, these laws would have a very peculiar form that seems impossible to guess without starting with metacognitive agents.
5Vanessa Kosoy
Is it possible to replace the maximin decision rule in infra-Bayesianism with a different decision rule? One surprisingly strong desideratum for such decision rules is the learnability of some natural hypothesis classes.

In the following, all infradistributions are crisp.

Fix finite action set $A$ and finite observation set $O$. For any $k\in\mathbb{N}$ and $\gamma\in(0,1)$, let $M^k_\gamma:(A\times O)^\omega\to\Delta(A\times O)^k$ be defined by

$$M^k_\gamma(h\mid d):=(1-\gamma)\sum_{n=0}^{\infty}\gamma^n[[h=d_{n:n+k}]]$$

In other words, this kernel samples a time step $n$ out of the geometric distribution with parameter $\gamma$, and then produces the sequence of length $k$ that appears in the destiny starting at $n$.

For any continuous[1] function $D:\square(A\times O)^k\to\mathbb{R}$, we get a decision rule. Namely, this rule says that, given infra-Bayesian law $\Lambda$ and discount parameter $\gamma$, the optimal policy is

$$\pi^*_{D\Lambda}:=\arg\max_{\pi:O^*\to A}D(M^k_{\gamma*}\Lambda(\pi))$$

The usual maximin is recovered when we have some reward function $r:(A\times O)^k\to\mathbb{R}$ and corresponding to it is

$$D_r(\Theta):=\min_{\theta\in\Theta}\mathbb{E}_\theta[r]$$

Given a set $H$ of laws, it is said to be learnable w.r.t. $D$ when there is a family of policies $\{\pi_\gamma\}_{\gamma\in(0,1)}$ such that for any $\Lambda\in H$

$$\lim_{\gamma\to1}\left(\max_\pi D(M^k_{\gamma*}\Lambda(\pi))-D(M^k_{\gamma*}\Lambda(\pi_\gamma))\right)=0$$

For $D_r$ we know that e.g. the set of all communicating[2] finite infra-RDPs is learnable. More generally, for any $t\in[0,1]$ we have the learnable decision rule

$$D^t_r(\Theta):=t\max_{\theta\in\Theta}\mathbb{E}_\theta[r]+(1-t)\min_{\theta\in\Theta}\mathbb{E}_\theta[r]$$

This is the "mesomism" I talked about before.

Also, any monotonically increasing $D$ seems to be learnable, i.e. any $D$ s.t. for $\Theta_1\subseteq\Theta_2$ we have $D(\Theta_1)\le D(\Theta_2)$. For such decision rules, you can essentially assume that "nature" (i.e. whatever resolves the ambiguity of the infradistributions) is collaborative with the agent. These rules are not very interesting.

On the other hand, decision rules of the form $D_{r_1}+D_{r_2}$ are not learnable in general, and neither are decision rules of the form $D_r+D'$ for $D'$ monotonically increasing.

Open Problem: Are there any learnable decision rules that are not mesomism or monotonically increasing?

A positive answer to the above would provide interesting generaliz
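As a rough illustration of the decision rules discussed in this comment, the sketch below evaluates the maximin rule and the interpolated rule (t times the best-case expected reward plus 1 minus t times the worst-case expected reward) on a toy example. It makes a simplifying assumption that is not in the original: a crisp infradistribution is represented by a finite list of probability distributions (its extreme points), which suffices here because expectation is linear, so min and max over the convex hull are attained at those points. All names and numbers are hypothetical.

```python
def expectation(distribution, reward):
    """Expected reward under one probability distribution over outcomes."""
    return sum(prob * reward[outcome] for outcome, prob in distribution.items())

def interpolated_rule(credal_set, reward, t):
    """t * max expected reward + (1 - t) * min expected reward over the set."""
    values = [expectation(dist, reward) for dist in credal_set]
    return t * max(values) + (1 - t) * min(values)

# Hypothetical set of two distributions over outcomes "a" and "b".
credal_set = [{"a": 0.5, "b": 0.5}, {"a": 1.0, "b": 0.0}]
reward = {"a": 1.0, "b": 0.0}

maximin_value = interpolated_rule(credal_set, reward, t=0.0)   # worst case
maximax_value = interpolated_rule(credal_set, reward, t=1.0)   # best case
halfway_value = interpolated_rule(credal_set, reward, t=0.5)
```

Setting t to 0 recovers the usual maximin value, while t equal to 1 corresponds to a rule that treats "nature" as collaborative; intermediate values of t interpolate between the two.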

Credit: Mainly inspired by talking with Eli Lifland. Eli has a potentially-published-soon document here.  

The basic case against Effective-FLOP.

  1. We're seeing many capabilities emerge from scaling AI models, and this makes compute (measured by FLOPs utilized) a natural unit for thresholding model capabilities. But compute is not a perfect proxy for capability because of algorithmic differences. Algorithmic progress can enable more performance out of a given amount of compute. This makes the idea of effective FLOP tempti
... (read more)
3Oliver Habryka
Maybe I am being dumb, but why not do things on the basis of "actual FLOPs" instead of "effective FLOPs"? Seems like there is a relatively simple fact-of-the-matter about how many actual FLOPs were performed in the training of a model, and that seems like a reasonable basis on which to base regulation and evals.

Yeah, actual FLOPs are the baseline thing that's used in the EO. But the OpenAI/GDM/Anthropic RSPs all reference effective FLOPs. 

If there's a large algorithmic improvement you might have a large gap in capability between two models with the same FLOP, which is not desirable.  Ideal thresholds in regulation / scaling policies are as tightly tied as possible to the risks. 

Another downside that FLOPs / E-FLOPs share is that it's unpredictable what capabilities a 1e26 or 1e28 FLOPs model will have.  And it's unclear what capabilities will emerge from a small bit of scaling: it's possible that within a 4x flop scaling you get high capabilities that had not appeared at all in the smaller model. 
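A toy sketch of the distinction being discussed: two training runs with the same actual FLOP count can land on opposite sides of a threshold once an algorithmic-efficiency multiplier is applied. The multiplier values and the way effective FLOP is computed here (actual FLOP times a single scalar) are illustrative assumptions, not how any lab or regulation defines it; the 1e26 figure is just one of the numbers mentioned above.

```python
def effective_flop(actual_flop, efficiency_multiplier):
    """Hypothetical effective FLOP: actual FLOP scaled by algorithmic efficiency."""
    return actual_flop * efficiency_multiplier

THRESHOLD = 1e26  # illustrative threshold

# Two hypothetical training runs with identical actual compute.
actual = 5e25
baseline_run = effective_flop(actual, efficiency_multiplier=1.0)
improved_run = effective_flop(actual, efficiency_multiplier=4.0)

baseline_over = baseline_run >= THRESHOLD
improved_over = improved_run >= THRESHOLD
```

Under an actual-FLOP threshold both runs are treated identically, while under an effective-FLOP threshold only the second is flagged. The catch raised in the post is that the multiplier itself has to be estimated, which is where much of the difficulty lies.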
