It would still be interesting to know whether you were surprised by GPT-4's capabilities (if you have played with it enough to have a good take)


When I started blogging about AI alignment in my free time, it happened that GPT-2 had just come out, and everyone on LW was talking about it. So I wrote a couple of blog posts (e.g. 1, 2) trying (not very successfully, in hindsight, but I was really just starting out, don’t judge) to think through what would happen if GPT-N could reach TAI / x-risk levels. I don’t recall feeling strongly that it would or wouldn’t reach those levels; it just seemed worth thinking about from a safety perspective, and not many other people were doing so at the time. But in the meantime I was also gradually getting into thinking about brain algorithms, which involve RL much more centrally, and I came to believe that RL was necessary to reach dangerous capability levels (recent discussion here; I think the first time I wrote it down was here). And I still believe that, and I think the jury’s out as to whether it’s true. (RLHF doesn’t count; it’s just a fine-tuning step, whereas in the brain RL is much more central.)
My updates since then have felt less like “Wow look at what GPT can do” and more like “Wow some of my LW friends think that GPT is rapidly approaching the singularity, and these are pretty reasonable people who have spent a lot more time with LLMs than I have”.
I haven’t personally gotten much useful work out of GPT-4. Especially not for my neuroscience work. I am currently using GPT-4 only for copyediting. (“[The following is a blog post draft. Please create a bullet point list with any typos or grammar errors.] …” “Was there any unexplained jargon in that essay?” Etc.) But maybe I’m bad at prompting, or trying the wrong things. I certainly haven’t tried very much, and find it more useful to see what other people online are saying about GPT-4 and doing with GPT-4, rather than relying on my own very limited experience.
Anyway, I have various theory-driven beliefs about deficiencies of LLMs compared to other possible AI algorithms (the RL thing I mentioned above is just one of ma

POV: I'm in an ancestral environment, and I (somehow) only care about the rewarding feeling of eating bread. I only care about the nice feeling which comes from having sex, or watching the birth of my son, or gaining power in the tribe. I don't care about the real-world status of my actual son, although I might have strictly instrumental heuristics about e.g. how to keep him safe and well-fed in certain situations, as cognitive shortcuts for getting reward (but not as terminal values).

Would such a person sacrifice themselves for their children (in situations where doing so would be a fitness advantage)?


I think this highlights a good counterpoint. I think this alternate theory predicts "probably not", although I can contrive hypotheses for why people would sacrifice themselves (because they have learned that high-status -> reward; and it's high-status to sacrifice yourself for your kid). Or because keeping your kid safe -> high reward as another learned drive.
Overall this feels like contortion but I think it's possible. Maybe overall this is a... 1-bit update against the "not selection for caring about reality" point?


Any idea why "cheese Euclidean distance to top-right corner" is so important? It's surprising to me because the convolutional layers should apply the same filter everywhere.


I'm also lightly surprised by the strength of the relationship, but not because of the convolutional layers. It seems like if "convolutional layers apply the same filter everywhere" makes me surprised by the cheese-distance influence, it should also make me surprised by "the mouse behaves differently in a dead-end versus a long corridor" or "the mouse tends to go to the top-right."
(I have some sense of "maybe I'm not grappling with Vivek's reasons for being surprised", so feel free to tell me if so!)


My naive guess is that the other relationships are nonlinear, and this is the best way to approximate those relationships out of just linear relationships of the variables the regressor had access to.
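A toy sketch of that point (the data here is made up): when the true relationship is nonlinear, a least-squares regressor still assigns a substantial linear weight, the best linear stand-in it can manage, while leaving some variance unexplained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: the true relationship is nonlinear in x, but the regressor is
# only allowed a linear term, so it settles for the best linear approximation.
x = rng.uniform(0, 10, size=1000)
y = np.sin(x) + 0.1 * x ** 2

A = np.column_stack([x, np.ones_like(x)])
(a, b), *_ = np.linalg.lstsq(A, y, rcond=None)

resid_frac = np.var(y - (a * x + b)) / np.var(y)
print(a, b, resid_frac)   # nonzero slope, but some variance stays unexplained
```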

Agreed. To give a concrete toy example: Suppose that Luigi always outputs "A", and Waluigi is {50% A, 50% B}. If the prior is {50% luigi, 50% waluigi}, each "A" outputted is a 2:1 update towards Luigi. The probability of "B" keeps dropping, and the probability of *ever* seeing a "B" asymptotes to 50% (as it must).

This is the case for *perfect* predictors, but there could be some argument about *particular kinds of imperfect predictors* which supports the claim in the post.
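The toy example above can be computed exactly; here is a quick sketch using exact fractions (the 10-step horizon is arbitrary):

```python
from fractions import Fraction

# Luigi always emits "A"; Waluigi emits "A" or "B" with probability 1/2 each.
p_luigi = Fraction(1, 2)    # prior P(Luigi)
p_ever_B = Fraction(0)      # P(at least one "B" among the first n tokens)
p_no_B_yet = Fraction(1)

for n in range(1, 11):
    p_B_now = (1 - p_luigi) * Fraction(1, 2)   # only Waluigi can emit "B"
    p_ever_B += p_no_B_yet * p_B_now
    p_no_B_yet -= p_no_B_yet * p_B_now
    # Conditioning on seeing "A" is a 2:1 update towards Luigi:
    p_luigi = p_luigi / (p_luigi + (1 - p_luigi) * Fraction(1, 2))
    print(n, float(p_luigi), float(p_ever_B))
```

After n straight "A"s the posterior on Luigi is 2^n/(2^n+1), while the probability of ever seeing a "B" climbs towards, but never exceeds, 1/2.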


LLMs are high-order Markov models, meaning they can't really balance two different hypotheses in the way you describe; because evidence eventually drops out of memory, the probability of Waluigi drops to a small positive value instead of dropping to zero. This makes an eventual waluigi transition inevitable, as claimed in the post.
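A minimal sketch of why a persistent probability floor makes the transition inevitable (the value of EPS is stipulated, not derived from any actual model):

```python
# EPS is a made-up floor: with a finite context window, evidence for "luigi"
# stops accumulating, so each token keeps some minimum probability EPS of a
# flip into the "waluigi" state, which is absorbing.
EPS = 0.01

def p_still_luigi(steps: int) -> float:
    """Probability that no waluigi transition has occurred after `steps` tokens."""
    return (1 - EPS) ** steps

print(p_still_luigi(100))     # decays geometrically
print(p_still_luigi(10_000))  # vanishingly small: the flip is all but certain
```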


Yep I think you might be right about the maths actually.
I'm thinking that waluigis with 50% A and 50% B have been eliminated by LLM pretraining, and definitely by RLHF. The only waluigis that remain are deceptive-at-initialisation.
So what we have left is a superposition of a bunch of luigis and a bunch of waluigis, where the waluigis are deceptive, and for each waluigi there is a different phrase that would trigger them.
I'm not claiming basin of attraction is the entire space of interpolation between waluigis and luigis.
Actually, maybe "attractor" is the wrong technical word to use here. What I want to convey is that the amplitude of the luigis can only grow very slowly and can be reversed, but the amplitude of the waluigi can suddenly jump to 100% in a single token and would remain there permanently. What's the right dynamical-systemy term for that?


In section 3.7 of the paper, it seems like the descriptions ("6 in 5", etc) are inconsistent across the image, the caption, and the paragraph before them. What are the correct labels? (And maybe fix the paper if these are typos?)


In ML terms, nearly-all the informational work of learning what “apple” means must be performed by unsupervised learning, not supervised learning. Otherwise the number of examples required would be far too large to match toddlers’ actual performance.

I'd guess the vast majority of the work (relative to the max-entropy baseline) is done by the *inductive bias.*


You don't need to guess; it's clearly true. Even a 1-trillion-parameter network where each parameter is represented with 64 bits can still only represent at most 2^64,000,000,000,000 different functions, which is a tiny tiny fraction of the full space of 2^(2^8,000,000) possible functions. You're already getting at least 2^8,000,000 − 64,000,000,000,000 of the bits just by choosing the network architecture.
(This does assume things like "the neural network can learn the correct function rather than a nearly-correct function" but similarly the argument in the OP assumes "the toddler does learn the correct function rather than a nearly-correct function".)
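A scaled-down version of the same counting argument (all sizes here are made up for illustration; the real numbers above are astronomically larger):

```python
# A "network" with P parameters of B bits each can represent at most
# 2**(P*B) functions; boolean functions on N-bit inputs number 2**(2**N).
P, B = 10, 8      # tiny network: 10 parameters x 8 bits each
N = 20            # tiny input space: 20-bit inputs

bits_network = P * B    # log2(number of representable functions)
bits_full = 2 ** N      # log2(number of all boolean functions on N bits)

# The architecture choice alone pins down this many bits of the function:
print(bits_full - bits_network)
```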


As I understand Vivek's framework, human value shards explain away the need to posit alignment to an idealized utility function. A person is not a bunch of crude-sounding subshards (e.g. "If `food nearby` and `hunger>15`, then be more likely to `go to food`") and then *also* a sophisticated utility function (e.g. something like CEV). It's shards all the way down, and all the way up.^{[10]}

This read to me like you were saying "In Vivek's framework, value shards explain away .." and I was confused. I now think you mean "My take on Vivek's is that value s...


Reworded, thanks.


"Well, what if I take the variables that I'm given in a Pearlian problem and I just forget that structure? I can just take the product of all of these variables that I'm given, and consider the space of all partitions on that product of variables that I'm given; and each one of those partitions will be its own variable.

How can a partition be a variable? Should it be "part" instead?


Partitions (of some underlying set) can be thought of as variables like this:
* The number of values the variable can take on is the number of parts in the partition.
* Every element of the underlying set has some value for the variable, namely, the part that that element is in.
Another way of looking at it: say we're thinking of a variable v:S→D as a function from the underlying set S to v's domain D. Then we can equivalently think of v as the partition {{s∈S∣v(s)=d}∣d∈D}∖∅ of S with (up to) |D| parts.
In what you quoted, we construct the underlying set by taking all possible combinations of values for the "original" variables. Then we take all partitions of that to produce all "possible" variables on that set, which will include the original ones and many more.
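The construction in this comment can be sketched in code (a hypothetical two-variable example):

```python
from itertools import product

# Hypothetical example: two binary "original" variables x and y. The
# underlying set is all joint settings of (x, y).
S = list(product([0, 1], [0, 1]))

def partition_of(var):
    """View a function var: S -> D as a partition of S (its level sets)."""
    parts = {}
    for s in S:
        parts.setdefault(var(s), set()).add(s)
    return {frozenset(p) for p in parts.values()}

# The original variable x, recovered as a two-part partition:
part_x = partition_of(lambda s: s[0])
# A "new" variable that wasn't among the originals, e.g. XOR:
part_xor = partition_of(lambda s: s[0] ^ s[1])

print(part_x)
print(part_xor)
```

Each part of the partition is one value the variable can take; enumerating all partitions of S recovers every "possible" variable, including the originals.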

Announcing AI Alignment Awards: $100k research contests about goal misgeneralization & corrigibility

**ETA:** Koen recommends reading **Counterfactual Planning in AGI Systems** before (or instead of) **Corrigibility with Utility Preservation**

Update: I started reading your paper "Corrigibility with Utility Preservation".^{[1]} My guess is that readers strapped for time should read {abstract, section 2, section 4} then skip to section 6. AFAICT, section 5 is just setting up the standard utility-maximization framework and defining "superintelligent" as "optimal utility maximizer".

Quick thoughts after reading less than half:

AFAICT,^{[2]} this is a mathematica...


Corrigibility with Utility Preservation is not the paper I would recommend you read first; see my comments included in the list I just posted.
To comment on your quick thoughts:
* My later papers spell out the ML analog of the solution in 'Corrigibility with' more clearly.
* On your question of Do you have an account of why MIRI's supposed impossibility results (I think these exist?) are false?: Given how re-tellings in the blogosphere work to distort information into more extreme viewpoints, I am not surprised you believe these impossibility results of MIRI exist, but MIRI does not have any actual mathematically proven impossibility results about corrigibility. The corrigibility paper proves that one approach did not work, but does not prove anything for other approaches. What they have is that 2022 Yudkowsky is on record expressing strongly held beliefs that corrigibility is very very hard, and (if I recall correctly) even saying that nobody has made any progress on it in the last ten years. Not everybody on this site shares these beliefs. If you formalise corrigibility in a certain way, by formalising it as producing a full 100% safety, no 99.999% allowed, it is trivial to prove that a corrigible AI formalised that way can never provably exist, because the humans who will have to build, train, and prove it are fallible. Roman Yampolskiy has done some writing about this, but I do not believe that this kind of reasoning is at the core of Yudkowsky's arguments for pessimism.
* On being misleadingly optimistic in my statement that the technical problems are mostly solved: as long as we do not have an actual AGI in real life, we can only ever speculate about how difficult it will be to make it corrigible in real life. This speculation can then lead to optimistic or pessimistic conclusions. Late-stage Yudkowsky is of course well-known for speculating that everybody who shows some optimism about alignment is wrong and even dangerous, but I stand by my optimism. P

Announcing AI Alignment Awards: $100k research contests about goal misgeneralization & corrigibility

To be more specific about the technical problem being mostly solved: there are a bunch of papers outlining corrigibility methods that are backed up by actual mathematical correctness proofs

Can you link these papers here? No need to write anything, just links.


OK, below I will provide links to a few mathematically precise papers about AGI corrigibility solutions, with some comments. I do not have enough time to write short comments, so I wrote longer ones.

This list of links below is not a complete literature overview. I did a comprehensive literature search on corrigibility back in 2019, trying to find all mathematical papers of interest, but have not done so since.

I wrote some of the papers below, and have read all the rest of them. I am not linking to any papers I heard about but did not read (yet).

Math-based w...


ETA: Koen recommends reading Counterfactual Planning in AGI Systems before (or instead of) Corrigibility with Utility Preservation
Update: I started reading your paper "Corrigibility with Utility Preservation".[1] My guess is that readers strapped for time should read {abstract, section 2, section 4} then skip to section 6. AFAICT, section 5 is just setting up the standard utility-maximization framework and defining "superintelligent" as "optimal utility maximizer".
Quick thoughts after reading less than half:
AFAICT,[2] this is a mathematical solution to corrigibility in a toy problem, and not a solution to corrigibility in real systems. Nonetheless, it's a big deal if you have in fact solved the utility-function-land version which MIRI failed to solve.[3] Looking to applicability, it may be helpful for you to spell out the ML analog to your solution (or point us to the relevant section in the paper if it exists). In my view, the hard part of the alignment problem is deeply tied up with the complexities of the {training procedure --> model} map, and a nice theoretical utility function is neither sufficient nor strictly necessary for alignment (though it could still be useful).
So looking at your claim that "the technical problem [is] mostly solved", this may or may not be true for the narrow sense (like "corrigibility as a theoretical outer-objective problem in formally-specified environments"), but seems false and misleading for the broader practical sense ("knowing how to make an AGI corrigible in real life").[4]
Less important, but I wonder if the authors of Soares et al agree with your remark in this excerpt[5]:
"In particular, [Soares et al] uses a Platonic agent model [where the physics of the universe cannot modify the agent's decision procedure] to study a design for a corrigible agent, and concludes that the design considered does not meet the desiderata, because the agent shows no incentive to preserve its shutdown behavior. Part of this concl

- Try to improve my evaluation process so that I can afford to do wider searches without taking excessive risk.

Improve it with respect to what?

My attempt at a framework where "improving one's own evaluator" and "believing in adversarial examples to one's own evaluator" make sense:

- The agent's allegiance is to some idealized utility function (like CEV). The agent's internal evaluator is "trying" to approximate it by reasoning heuristically. So now we ask Eval to evaluate the plan "do argmax w.r.t


Vivek -- I replied to your comment in appendix C of today's follow-up post, Alignment allows imperfect decision-influences and doesn't require robust grading.


The way you write this (especially the last sentence) makes me think that you see this attempt as being close to the only one that makes sense to you atm. Which makes me curious:
* Do you think that you are internally trying to approximate your own U_ideal?
* Do you think that you have ever made the decision (either implicitly or explicitly) not to eval all or most plans because you don't trust your ability to do so for adversarial examples (as opposed to tractability issues, for example)?
* Can you think of concrete instances where you improved your own Eval?
* Can you think of concrete instances where you thought you improved your own Eval but then regretted it later?
* Do you think that your own changes to your Eval have been moving in the direction of your U_ideal?


This is tempting, but the problem is that I don't know what my idealized utility function is (e.g., I don't have a specification for CEV that I think would be safe or ideal to optimize for), so what does it mean to try to approximate it? Or consider that I only read about CEV one day in a blog, so what was I doing prior to that? Or if I was supposedly trying to approximate CEV, I can change my mind about it if I realized that it's a bad idea, but how does that fit into the framework?
My own framework is something like this:
* The evaluation process is some combination of gut, intuition, explicit reasoning (e.g. cost-benefit analysis), doing philosophy, and cached answers.
* I think there are "adversarial inputs" because I've previously done things that I later regretted, due to evaluating them highly in ways that I no longer endorse. I can also see other people sometimes doing obviously crazy things (which they may or may not later regret). I can see people (including myself) being persuaded by propaganda / crazy memes, so there must be a risk of persuading myself with my own bad ideas.
* I can try to improve my evaluation process by doing things like
1. look for patterns in my and other people's mistakes
2. think about ethical dilemmas / try to resolve conflicts between my evaluative subprocesses
3. do more philosophy (think/learn about ethical theories, metaethics, decision theory, philosophy of mind, etc.)
4. talk (selectively) to other people
5. try to improve how I do explicit reasoning or philosophy

Yeah, the right column should obviously be all 20s. There must be a bug in my code^{[1]} :/

I like to think of the argmax function as something that takes in a distribution on probability distributions on W with different sigma algebras, and outputs a partial probability distribution that is defined on the set of all events that are in the sigma algebra of (and given positive probability by) one of the components.

Take the following hypothesis :

If I add this into with weight , then the middle column is still near...


Now, let's consider the following modification: Each hypothesis is no longer a distribution on W, but instead a distribution on some coarser partition of W. Now the argmax is still well defined

Playing around with this a bit, I notice a curious effect (ETA: the numbers here were previously wrong, fixed now):

The reason the middle column goes to zero is that hypothesis A puts 60% on the rightmost column, and hypothesis B puts 40% on the leftmost, and neither cares about the middle column specifically.

But philosophically, what d...


I think your numbers are wrong, and the right column on the output should say 20% 20% 20%.
The output actually agrees with each of the components on every event in that component's sigma algebra. The input distributions don't actually have any conflicting beliefs, and so of course the output chooses a distribution that doesn't disagree with either.
I agree that the 0s are a bit unfortunate.
I think the best way to think of the type of the object you get out is not a probability distribution on W, but what I am calling a partial probability distribution on W. A partial probability distribution is a partial function from 2^W→[0,1] that can be completed to a full probability distribution on W (with some sigma algebra that is a superset of the domain of the partial probability distribution).
I like to think of the argmax function as something that takes in a distribution on probability distributions on W with different sigma algebras, and outputs a partial probability distribution that is defined on the set of all events that are in the sigma algebra of (and given positive probability by) one of the components.
One nice thing about this definition is that it makes it so the argmax always takes on a unique value. (proof omitted.)
This doesn't really make it that much better, but the point here is that this framework admits that it doesn't really make much sense to ask about the probability of the middle column. You can ask about any of the events in the original pair of sigma algebras, and indeed, the two inputs don't disagree with the output at all on any of these sets.
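Not the argmax construction itself, but a small sketch (with made-up numbers) of the agreement claim above: two hypotheses living on coarser sigma algebras of a 3x3 set W need not conflict, and a single full distribution can extend both exactly.

```python
from fractions import Fraction

# W: a 3x3 set of worlds. Hypothesis A only distinguishes columns;
# hypothesis B only distinguishes rows (an illustrative choice).
W = {(r, c) for r in range(3) for c in range(3)}
cols = [frozenset((r, c) for r in range(3)) for c in range(3)]
rows = [frozenset((r, c) for c in range(3)) for r in range(3)]

A = dict(zip(cols, [Fraction(1, 5), Fraction(1, 5), Fraction(3, 5)]))  # over columns
B = dict(zip(rows, [Fraction(2, 5), Fraction(2, 5), Fraction(1, 5)]))  # over rows

# One completion: the independent product of the two coarse marginals.
full = {(r, c): B[rows[r]] * A[cols[c]] for (r, c) in W}

def prob(event):
    return sum(full[w] for w in event)

# The completion agrees with each component on that component's sigma algebra:
assert all(prob(e) == p for e, p in A.items())
assert all(prob(e) == p for e, p in B.items())
print(prob(cols[2]))
```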

A framing I wrote up for a debate about "alignment tax":

**"Alignment isn't solved" regimes:**
- Nobody knows how to make an AI which is {safe, general, and broadly superhuman}, with any non-astronomical amount of compute
- We know how to make an aligned AGI with 2 to 25 OOMs more compute than making an unaligned one

**"Alignment tax" regimes:**
- We can make an aligned AGI, but it requires a compute overhead in the range 1% - 100x. Furthermore, the situation remains multipolar and competitive for a while.
- The alignment tax is <0.001%, so it's not a concern.
- The leadi

If the system is modular, such that the part of the system representing the goal is separate from the part of the system optimizing the goal, then it seems plausible that we can apply some sort of regularization to the goal to discourage it from being long term.

What kind of regularization could this be? And are you imagining an AlphaZero-style system with a hardcoded value head, or an organically learned modularity?


`Pr_{π∈ξ}[U(⌈G⌉,π) ≥ U(⌈G⌉,G*)]` is the probability that, for a random policy `π∈ξ`, that policy has worse utility than the policy `G*` its program dictates; in essence, how good `G`'s policies are compared to random policy selection

What prior over policies?

given `g(G|U)`, we can infer the probability that an agent `G` has a given utility function `U`, as `Pr[U] ∝ 2^-K(U) / Pr_{π∈ξ}[U(⌈G⌉,π) ≥ U(⌈G⌉,G*)]`, where `∝` means "is proportional to" and `K(U)` is the Kolmogorov complexity of the utility function `U`.

Suppose the prior over policies is max-entropy (uniform over all action seq...
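A toy instantiation of the quoted formula; the policy space, candidate utilities, and K(U) values below are all stipulated purely for illustration.

```python
from itertools import product

# Policies: length-3 binary action sequences under a uniform
# ("max-entropy") prior.
policies = list(product([0, 1], repeat=3))
G_star = (1, 1, 0)   # the policy the agent's program actually dictates

# Candidate utility functions with made-up Kolmogorov complexities (bits):
candidates = {
    "count_ones":  (lambda pi: sum(pi), 10),
    "count_zeros": (lambda pi: 3 - sum(pi), 10),
}

scores = {}
for name, (U, K) in candidates.items():
    # Pr_{pi}[U(pi) >= U(G_star)] under the uniform prior:
    beats = sum(U(pi) >= U(G_star) for pi in policies) / len(policies)
    scores[name] = 2 ** -K / beats   # Pr[U] proportional to 2^-K(U) / Pr_pi[...]

total = sum(scores.values())
posterior = {name: s / total for name, s in scores.items()}
print(posterior)
```

Under these stipulations "count_ones" gets the higher posterior: G* looks more optimized for it (only half of random policies do as well) than for "count_zeros" (which 7 of 8 random policies match).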

I have seen one person be surprised (I think twice in the same convo) about what progress had been made.

ETA: Our observations are compatible. It could be that people used to a poor and slow-moving state of interpretability are surprised by the recent uptick, but that the absolute progress over 6 years is still disappointing.

All in all, I don't think my original post held up well. I guess I was excited to pump out the concept quickly, before the dust settled. Maybe this was a mistake? Usually I make the ~opposite error of never getting around to posting things.


I think there should be a space both for in-progress research dumps and for more worked out final research reports on the forum. Maybe it would make sense to have separate categories for them or so.


The perspective and the computations that are presented here (which in my opinion are representative of the mathematical parts of the linked posts and of various other unnamed posts) do not use any significant facts about neural networks or their architecture.

You're correct that the written portion of the Information Loss --> Basin flatness post doesn't use any non-trivial facts about NNs. The purpose of the written portion was to explain some mathematical groundwork, which is then used for the non-trivial claim. (I did not know at the time ...


All in all, I don't think my original post held up well. I guess I was excited to pump out the concept quickly, before the dust settled. Maybe this was a mistake? Usually I make the ~opposite error of never getting around to posting things.


Note that, for rational *altruists* (with nothing vastly better to do like alignment), voting can be huge on CDT grounds -- if you actually do the math for a swing state, the leverage per voter is really high. In fact, I think the logically counterfactual impact-per-voter tends to be *lower* than the impact calculated by CDT, if the election is very close.
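A back-of-the-envelope version of the CDT calculation (every number below is stipulated purely for illustration, not a real forecast):

```python
# Stipulated inputs for a close swing state:
p_decisive = 1e-7          # chance one vote flips the election (made up)
people_affected = 300e6    # population affected by the outcome (made up)
value_per_person = 100.0   # dollar value of the better outcome, each (made up)

ev_altruistic = p_decisive * people_affected * value_per_person
print(f"${ev_altruistic:,.0f} expected altruistic value per vote")
```

Even with modest stipulated numbers, the altruistic expected value per vote comes out in the thousands of dollars, which is the "can be huge" point.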


Exciting work! A minor question about the paper:

Does this mean that it writes a projection of S1's positional embedding to S2's residual stream? Or is it meant to say "writing *to* the position [residual stream] of [S2]"? Or something else?

falls somewhere between 3rd and 10th place on my list of most urgent alignment problems to fix

What *is* your list of problems by urgency, btw? Would be curious to know.

From this paper, "Theoretical work limited to ReLU-type activation functions, showed that in overparameterized networks, all global minima lie in a connected manifold (Freeman & Bruna, 2016; Nguyen, 2019)"

So for overparameterized nets, the answer is probably:

- There is only one solution manifold, so there *are no separate basins*. Every solution is connected.
- We can salvage the idea of "basin volume" as follows:
  - In the dimensions perpendicular to the manifold, calculate the basin cross-section using the Hessian.
  - In the dimensions parallel to the manifol

The loss is defined over all dimensions of parameter space, so it is still a function of all 3 x's. You should think of it as . Its thickness in the direction is **infinite**, not zero.

Here's what a zero-determinant Hessian corresponds to:

The basin here is not lower dimensional; it is just infinite in some dimension. The simplest way to fix this is to replace the infinity with some large value. Luckily, there is a fairly principled way to do this:

- Regularization / weight decay prov
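A minimal numeric sketch of that fix (the weight-decay coefficient here is made up):

```python
import numpy as np

# A loss that ignores one direction entirely, L(x, y) = x**2, has a singular
# Hessian, so a "basin volume" ~ 1/sqrt(det H) comes out infinite.
H = np.diag([2.0, 0.0])
print(np.linalg.det(H))   # singular: determinant is zero

# Adding weight decay (lam/2)*||theta||^2 shifts the Hessian by lam*I,
# replacing the infinite extent with a large but finite one.
lam = 1e-2
H_reg = H + lam * np.eye(2)
print(1 / np.sqrt(np.linalg.det(H_reg)))   # large but finite
```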

I will split this into a math reply, and a reply about the big picture / info loss interpretation.

Math reply:

Thanks for fleshing out the calculus rigorously; admittedly, I had not done this. Rather, I simply assumed MSE loss and proceeded largely through visual intuition.

I agree that assuming MSE, and looking at a local minimum, you have

This is still false! *Edit: I am now confused, I don't know if it is false or not.*

You are conflating ∇l and ∇_f l. Adding disa...


Thanks again for the reply.
In my notation, things like ∇l or Jf are functions in and of themselves. The function ∇l evaluates to zero at local minima of l.
In my notation, there isn't any such thing as ∇_f l.
But look, I think that this is perhaps getting a little too bogged down for me to want to try to neatly resolve in the comment section, and I expect to be away from work for the next few days so may not check back for a while. Personally, I would just recommend going back and slowly going through the mathematical details again, checking every step at the lowest level of detail that you can and using the notation that makes most sense to you.

Thanks for this reply, its quite helpful.

I feel it ought to be pointed out that what is referred to here as the key result is a standard fact in differential geometry called (something like) the submersion theorem, which in turn is essentially an application of the implicit function theorem.

Ah nice, didn't know what it was called / what field it's from. I should clarify that "key result" here just meant "key result of the math so far -- pay attention", not "key result of the whole post" or "profound/original".

...The Jacobian matrix is what you call


Thanks for the substantive reply.
First some more specific/detailed comments: Regarding the relationship with the loss and with the Hessian of the loss, my concern sort of stems from the fact that the domains/codomains are different, and so I think it deserves to be spelled out. The loss of a model with parameters θ∈Θ can be described by introducing the actual function that maps the behavior to the real numbers, right? i.e. given some actual function l : O^k → R, we have
L = l ∘ f, with f : Θ → O^k, so L : Θ → R
i.e. it's l that might be something like MSE, but the function L is of course more mysterious because it includes the way that parameters are actually mapped to a working model. Anyway, to perform some computations with this, we are looking at an expression like
L(θ) = l(f(θ))
We want to differentiate this twice with respect to θ, essentially. Firstly, we have
∇L(θ) = ∇l(f(θ)) Jf(θ)
where - just to keep track of this - we've got:
(1×N) vector = [(1×k) vector] [(k×N) matrix]
Or, using 'coordinates' to make it explicit:
∂L/∂θ_i (θ) = ∇l(f(θ)) · ∂f/∂θ_i = Σ_{p=1}^{k} ∇_p l(f(θ)) · ∂f_p/∂θ_i
for i = 1, …, N. Then for j = 1, …, N we differentiate again:
∂²L/(∂θ_j ∂θ_i) (θ) = Σ_{p=1}^{k} Σ_{q=1}^{k} ∇_q ∇_p l(f(θ)) (∂f_q/∂θ_j)(∂f_p/∂θ_i) + Σ_{p=1}^{k} ∇_p l(f(θ)) ∂²f_p/(∂θ_j ∂θ_i)
Or,
Hess(L)(θ) = Jf(θ)^T [Hess(l)(f(θ))] Jf(θ) + ∇l(f(θ)) D²f(θ)
This is now at the level of (N×N) matrices. Avoiding getting into any depth about tensors and indices, the D²f term is basically an (N×N×k) tensor-type object, and it's paired with ∇l, which is a (1×k) vector, to give something that is (N×N).
So what I think you are saying now is that if we are at a local minimum of l, then the second term on the right-hand side vanishes (because the term includes the first derivatives of l, which are zero at a minimum). You can see, however, that if the Hessian of l is not a multiple of the identity (as it would be for MSE), then the claimed relationship does not hold, i.e. it is not the case that in general, at a minimum of l, the Hessian of the loss is equal to a constant times Jf(θ)^T Jf(θ). So maybe you really
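For what it's worth, the vanishing of the second term at a minimum of l can be checked numerically on a tiny made-up model (the function f, the point θ*, and the MSE loss are all stipulated here):

```python
import numpy as np

# Tiny made-up model: f maps parameters R^2 -> behaviors R^2,
# l is MSE against a target, L = l composed with f.
def f(th):
    return np.array([th[0] * th[1], th[0] + th[1] ** 2])

th_star = np.array([1.0, 2.0])
target = f(th_star)                        # makes theta* a global minimum of l
l = lambda z: 0.5 * np.sum((z - target) ** 2)
L = lambda th: l(f(th))

def hessian(g, x, eps=1e-5):
    """Second-order central finite differences."""
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei, ej = np.eye(n)[i] * eps, np.eye(n)[j] * eps
            H[i, j] = (g(x + ei + ej) - g(x + ei - ej)
                       - g(x - ei + ej) + g(x - ei - ej)) / (4 * eps ** 2)
    return H

J = np.array([[th_star[1], th_star[0]],    # Jacobian of f at theta*
              [1.0, 2.0 * th_star[1]]])

# At a minimum of l with MSE (so Hess(l) = identity), the second term
# vanishes and Hess(L) = Jf^T Jf:
assert np.allclose(hessian(L, th_star), J.T @ J, atol=1e-4)
print(np.round(hessian(L, th_star), 3))
```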

I'm pretty sure my framework doesn't apply to grokking. I usually think about training as ending once we hit zero training loss, whereas grokking happens much later.

About the contours: While the graphic shows a finite number of contours with some spacing, in reality there are infinitely many contour planes and they completely fill space (as densely as the reals, if we ignore float precision). So at literally every point in space there is a blue contour, and a red one which exactly coincides with it.


I'll reply to the rest of your comment later today when I have some time

Yeah, this seems roughly correct, and similar to what I was thinking. There is probably even a direct connection to the "asymptotic equipartitioning" math, via manifold counts containing terms like "A choose B" from permutations of neurons.

Yep, I am assuming MSE loss generally, but as you point out, any smooth and convex loss function will be locally approximately quadratic. "Saddle points all the way down" isn't possible if a global min exists, since a saddle point implies the existence of an adjacent lower point. As for asymptotes, this is indeed possible, especially in classification tasks. I have basically ignored this and stuck to regression here.

I might return to the issue of classification / solutions at infinity in a later post, but for now I will say this: &...

Is the claim here that the 2^200 "persuasive ideas" would actually pass the scrutiny of top human researchers (for example, Paul Christiano studies one of them for a week and concludes that it is probably a full solution)? Or do you just mean that they would look promising in a shorter evaluation done for training purposes?

Oh, this is definitely not what I meant.

"Betting odds" == Your actual belief after factoring in other people's opinions

"Inside view" == What your models predict, before factoring in other opinions or the possibility of being completely wrong


Though I understood what you meant, perhaps a clearer terminology is all-things-considered beliefs vs. independent impressions.

Checked; the answer is no: https://www.lesswrong.com/posts/Lq6jo5j9ty4sezT7r/teaser-hard-coding-transformer-models?commentId=ET24eiKK6FSJNef7G




They apparently reinvented RASP independently.


What probability do you assign to the proposition "Prosaic alignment will fail"?

- Purely based on your inside view model
- After updating on everyone else's views

Same question for:

"More than 50% of the prosaic alignment work done by the top 7 researchers is nearly useless"


Is a metal bar an optimizer? Looking at the temperature distribution, there is a clear set of target states (states of uniform temperature) with a much larger basin of attraction (all temperature distributions that don't vaporize the bar).

I suppose we could consider the second law of thermodynamics to be the true optimizer in this case. The consequence is that any* closed physical system is trivially an optimizing system towards higher entropy.

In general, it seems like this optimization criterion is very easy to satisfy if we don't specify what...

When you describe the "emailing protein sequences -> nanotech" route, are you imagining an AGI with computers on which it can run code (like simulations)? Or do you claim that the AGI could design the protein sequences without writing simulations, by simply thinking about it "in its head"?