All of abramdemski's Comments + Replies

I am not sure whether I am more excited about 'positive' approaches (accelerating alignment research more) vs 'negative' approaches (cooling down capability-gain research). I agree that some sorts of capability-gain research are much more/less dangerous than others, and the most clearly risky stuff right now is scaling & scaling-related.

So you agree with the claim that current LLMs are a lot more useful for accelerating capabilities work than they are for accelerating alignment work?

4 Ryan Greenblatt 1mo
From my perspective, most alignment work I'm interested in is just ML research. Most capabilities work is also just ML research. There are some differences between the flavors of ML research for these two, but they seem small. So LLMs are about similarly good at accelerating the two. There is also alignment research which doesn't look like ML research (mostly mathematical theory or conceptual work). For the type of conceptual work I'm most interested in (e.g. catching AIs red-handed) about 60-90% of the work is communication (writing things up in a way that they make sense to others, finding the right way to frame the ideas when talking to people, etc.) and LLMs could theoretically be pretty useful for this. For the actual thinking work, the LLMs are pretty worthless (and this is pretty close to philosophy). For mathematical theory, I expect LLMs are somewhat worse at this than ML research, but there won't clearly be a big gap going forward.

Hmm. Have you tried to have conversations with Claude or other LLMs for the purpose of alignment work? If so, what happened?

For me, what happens is that Claude tries to work constitutional AI in as the solution to most problems. This is part of what I mean by "bad at philosophy". 

But more generally, I have a sense that I just get BS from Claude, even when it isn't specifically trying to shoehorn its own safety measures in as the solution.

2 Evan Hubinger 1mo
Yeah, I don't think I have any disagreements there. I agree that current models lack important capabilities across all sorts of different dimensions.

Any thoughts on the sort of failure mode suggested by AI doing philosophy = AI generating hands? I feel strongly that Claude (and all other LLMs I have tested so far) accelerate AI progress much more than they accelerate AI alignment progress, because they are decent at programming but terrible at philosophy. It also seems easier in principle to train LLMs to be even better at programming. There's also going to be a lot more of a direct market incentive for LLMs to keep getting better at programming.

(Helping out with programming is also not the only way LL...

2 Ryan Greenblatt 1mo
A more straightforward but extreme approach here is just to ban plausibly capabilities/scaling-related ML usage on the API unless users are approved as doing safety research. Like if you think advancing ML is just somewhat bad, you can just stop people from doing it. That said, I think a large fraction of ML research seems maybe fine/good and the main bad things are just algorithmic efficiency improvements on serious scaling (including better data) and other types of architectural changes. Presumably this already bites (e.g.) virus gain-of-function researchers who would like to make more dangerous pathogens, but can't get advice from LLMs.
4 Evan Hubinger 1mo
It's not clear to me that philosophy is that important for AI alignment. It certainly seems important for the long-term future of humanity that we eventually get the philosophy right, but the short-term alignment targets that it seems like we need to get there seem relatively straightforward to me—mostly about avoiding the lock-in that would prevent you from doing better philosophy later.

I finally got around to reading this today, because I have been thinking about doing more interpretability work, so I wanted to give this piece a chance to talk me out of it. 

It mostly didn't.

  • A lot of this boils down to "existing interpretability work is unimpressive". I think this is an important point, and significant sub-points were raised to argue it. However, it says little 'against almost every theory of impact of interpretability'. We can just do better work.
  • A lot of the rest boils down to "enumerative safety is dumb". I agree, at least for the
...

Ah, very interesting, thanks! I wonder if there is a different way to measure relative endorsement that could achieve transitivity.

Yeah, the stuff in the updatelessness section was supposed to gesture at how to handle this with my definition. 

First of all, I think children surprise me enough in pursuit of their own goals that they do often count as agents by the definition in the post.

But, if children or animals who are intuitively agents often don't fit the definition in the post, my idea is that you can detect their agency by looking at things with increasingly time/space/data bounded probability distributions. I think taking on "smaller" perspectives is very important.

I can feel what you mean about arbitrarily drawing a circle around the known optimizer and then "deleting" it, but this just doesn't feel that weird to me? Like I think the way that people model the world allows them to do this kind of operation with pretty substantially meaningful results.

I agree, but I am skeptical that there could be a satisfying mathematical notion here. And I am particularly skeptical about a satisfying mathematical notion that doesn't already rely on some other agent-detector piece which helps us understand how to remove the agent.


...

There are several compromises I made for the sake of getting the idea across as simply as I could. 

  • I think the graduate-level-textbook version of this would be much more clear about what the quotes are doing. I was tempted to not even include the quotes in the mathematical expressions, since I don't think I'm super clear about why they're there.
  • I totally ignored the difference between  (probability conditional on ) and  (probability after learning ).
  • I neglect to include quantifiers in any of my definitions; t
...

I agree that this is an important distinction, but I personally prefer to call it "transformative AI" or some such.

An intriguing perspective, but I'm not sure whether I agree. Naively, it would seem that a choice between fixed points in the FixDT setting is just a choice between different probability distributions, which brings us very close to the VNM idea of a choice between gambles. So VNM-like utility theory seems like the obvious outcome.

That being said, I don't really agree with the idea that an agent should have a fixed VNM-like utility function. So I do think some generalization is needed.

Yeah, "settles on" here meant however the agent selects beliefs. The epistemic constraint implies that the agent uses exhaustive search or some other procedure guaranteed to produce a fixed point, rather than Banach-style iteration. 

Moving to a Banach-like setting will often make the fixed points unique, which takes away the whole idea of FixDT.

Moving to a setting where the agent isn't guaranteed to converge would mean we have to re-write the epistemic constraint to be appropriate to that setting.
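To make the contrast above concrete, here is a minimal sketch, assuming a one-dimensional belief p in [0, 1]. The maps `self_referential` and `contraction`, the grid search, and the utility `u` are all hypothetical toy choices for illustration, not anything taken from the FixDT post. The first map has several fixed points, so there is a real choice to make among them; the second is a contraction, so iteration lands on the only fixed point and nothing is left to choose.

```python
def self_referential(p):
    # Hypothetical map from the agent's belief p to the resulting true probability.
    # Its fixed points are 0, 0.5 and 1.
    return 3 * p**2 - 2 * p**3

def contraction(p):
    # Hypothetical contraction (slope 0.5), so it has a unique fixed point at 0.4.
    return 0.5 * p + 0.2

def fixed_points_by_search(f, steps=100, tol=1e-9):
    # Exhaustive search over a grid of candidate beliefs (assumes fixed points lie on the grid).
    grid = [k / steps for k in range(steps + 1)]
    return [p for p in grid if abs(f(p) - p) < tol]

def banach_iterate(f, p0, n=200):
    # Banach-style iteration: repeatedly apply f starting from p0.
    p = p0
    for _ in range(n):
        p = f(p)
    return p

def u(p):
    # Toy instrumental preference: the agent wants the event to happen.
    return p

candidates = fixed_points_by_search(self_referential)  # every candidate meets the epistemic constraint
chosen = max(candidates, key=u)                         # the instrumental choice among fixed points
unique = banach_iterate(contraction, 0.9)               # no choice left: one fixed point
```

In this toy, the epistemic constraint narrows the options to `candidates`, and the utility only enters when picking among them, which is the division of labour the comment describes.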

Yes, thanks for citing it here! I should have mentioned it, really.

I see the Skyrms iterative idea as quite different from the "just take a fixed point" theory I sketch here, although clearly they have something in common. FixDT makes it easier to combine both epistemic and instrumental concerns -- every fixed point obeys the epistemic requirement; and then the choice between them obeys the instrumental requirement. If we iteratively zoom in on a fixed point instead of selecting from the set, this seems harder?

If we try the Skyrms iteration thing, maybe th...

I find your attempted clarification confusing. 

Our model is going to have some variables in it, and if we don't know in advance where the agent will be at each timestep, then presumably we don't know which of those variables (or which function of those variables, etc) will be our Markov blanket. 

No? A probabilistic model can just be a probability distribution over events, with no "random variables in it". It seemed like your suggestion was to define the random variables later, "on top of" the probabilistic model, not as an intrinsic part of the m...

Okay, so you know how AI today isn't great at certain... let's say "long-horizon" tasks? Like novel large-scale engineering projects, or writing a long book series with lots of foreshadowing?

(Modulo the fact that it can play chess pretty well, which is longer-horizon than some things; this distinction is quantitative rather than qualitative and it’s being eroded, etc.)

And you know how the AI doesn't seem to have all that much "want"- or "desire"-like behavior?

(Modulo, e.g., the fact that it can play chess pretty well, which indicates a certain type of want

...
2 Vladimir Nesov 2mo
GPT-4 as a human level AGI is reasonable as a matter of evaluating the meaning of words, but this meaning of "AGI" doesn't cut reality at its joints. Humans are a big deal not for the reason of being at human level, but because there is capability for unbounded technological progress, including through building superintelligence. Ability for such progress doesn't require being superintelligent, so it's a different thing. For purposes of AI timelines it's the point where history starts progressing at AI speed rather than at human speed. There should be a name for this natural concept, and "AGI" seems like a reasonable option.

... I was expecting you'd push back a bit, so I'm going to fill in the push-back I was expecting here.

Sam's argument still generalizes beyond the case of graphical models. Our model is going to have some variables in it, and if we don't know in advance where the agent will be at each timestep, then presumably we don't know which of those variables (or which function of those variables, etc) will be our Markov blanket. On the other hand, if we knew which variables or which function of the variables were the blanket, then presumably we'd already know where t...

I'm looking at the Savage theory from your own and I see U(f)=∑u(f(si))P(si), so at least they have no problem with the domains (O and S) being different. Now I see the confusion is that to you Omega=S (and also O=S), but to me Omega=dom(u)=O.

(Just to be clear, I did not write that article.)

I think the interpretation of Savage is pretty subtle. The objects of preference ("outcomes") and objects of belief ("states") are treated as distinct sets. But how are we supposed to think about this?

  • The interpretatio
...

It remains totally unclear to me why you demand the world to be such a thing.

Ah, if you don't see 'worlds' as meaning any such thing, then I wonder, are we really arguing about anything at all?

I'm using 'worlds' that way in reference to the same general setup which we see in propositions-vs-models in model theory, or in Ω vs the σ-algebra in the Kolmogorov axioms, or in Kripke frames, and perhaps some other places.

We can either start with a basic set of "worlds" (eg, Ω) and define our "propositions" or "events" as sets of worlds, ...

I'm looking at the Savage theory from your own and I see U(f)=∑u(f(si))P(si), so at least they have no problem with the domains (O and S) being different. Now I see the confusion is that to you Omega=S (and also O=S), but to me Omega=dom(u)=O. Furthermore, if O={o0,o1}, then I can group the terms into u(o0)P("we're in a state where f evaluates to o0") + u(o1)P("we're in a state where f evaluates to o1"), I'm just moving all of the complexity out of EU and into P, which I assume to work by some magic (e.g. LI), that doesn't involve literally iterating over every possible S. That's just math speak, you can define a lot of things as a lot of other things, but that doesn't mean that the agent is going to be literally iterating over infinite sets of infinite bit strings and evaluating something on each of them. By the way, I might not see any more replies to this.
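The regrouping described in this comment can be checked on a toy case. The sketch below is only an illustration under assumed numbers: the four states, their probabilities, the act `f`, and the utilities `u` are all hypothetical. It computes U(f)=∑u(f(si))P(si) by summing over states, then again by first pooling probability mass per outcome, and the two agree.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical toy setup: four states, two outcomes.
P = {"s0": Fraction(1, 10), "s1": Fraction(2, 10), "s2": Fraction(3, 10), "s3": Fraction(4, 10)}
f = {"s0": "o0", "s1": "o0", "s2": "o1", "s3": "o1"}  # an act maps states to outcomes
u = {"o0": Fraction(0), "o1": Fraction(1)}            # utility is defined on outcomes only

def eu_over_states(f, u, P):
    # U(f) = sum over states s of u(f(s)) * P(s)
    return sum(u[f[s]] * P[s] for s in P)

def eu_over_outcomes(f, u, P):
    # Same quantity, grouped as sum over outcomes o of u(o) * P(f evaluates to o)
    mass = defaultdict(Fraction)
    for s, p in P.items():
        mass[f[s]] += p
    return sum(u[o] * m for o, m in mass.items())

by_states = eu_over_states(f, u, P)
by_outcomes = eu_over_outcomes(f, u, P)
```

Note that the grouped version still needs the probability of each "f evaluates to o" event, so the work of summing over states has moved into P rather than disappeared.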

My point is only that U is also reasonable, and possibly equivalent or more general. That there is no "case against" it. 

I do agree that my post didn't do a very good job of delivering a case against utility functions, and actually only argues that there exists a plausibly-more-useful alternative to a specific view which includes utility functions as one of several elements

Utility functions definitely aren't more general.

A classical probability distribution over Ω with a utility function understood as a random variable can easily be c...

Ok, you're saying that JB is just a set of axioms, and U already satisfies those axioms. And in this construction "event" really is a subset of Omega, and "updates" are just updates of P, right? Then of course U is not more general, I had the impression that JB is a more distinct and specific thing. Regarding the other direction, my sense is that you will have a very hard time writing down these updates, and when it works, the code will look a lot like one with a utility function. But, again, the example in "Updates Are Computable" isn't detailed enough for me to argue anything. Although now that I look at it, it does look a lot like the U(p)=1-p("never press the button"). I think you should include this explanation of events in the post. It remains totally unclear to me why you demand the world to be such a thing. My point is that if U has two output values, then it only needs two possible inputs. Maybe you're saying that if |dom(U)|=2, then there is no point in having |dom(P)|>2, and maybe you're right, but I feel no need to make such claims. Even if the domains are different, they are not unrelated, Omega is still in some way contained in the ontology. We could and I think we should. I have no idea why we're talking math, and not writing code for some toy agents in some toy simulation. Math has a tendency to sweep all kinds of infinite and intractable problems under the rug.

I agree that it makes more sense to suppose "worlds" are something closer to how the agent imagines worlds, rather than quarks. But on this view, I think it makes a lot of sense to argue that there are no maximally specific worlds -- I can always "extend" a world with an extra, new fact which I had not previously included. IE, agents never "finish" imagining worlds; more detail can always be added (even if only in separate magisteria, eg, imagining adding epiphenomenal facts). I can always conceive of the possibility of a new predicate beyond all the predi...

Answering out of order: Jeffrey is a reasonable formalization, it was never my point to say that it isn't. My point is only that U is also reasonable, and possibly equivalent or more general. That there is no "case against" it. Although, if you find Jeffrey more elegant or comfortable, there is nothing wrong with that. I don't know what "plausible" means, but no, that sounds like a very high bar. I believe that if there is at least one U that produces an intelligent agent, then utility functions are interesting and worth considering. Of course I believe that there are many such "good" functions, but I would not claim that I can describe the set of all of them. At the same time, I don't see why any "good" utility function should be uncomputable. I agree with the first sentence, however Omega is merely the domain of U, it does not need to be the entire ontology. In this case Omega={"button has been pressed", "button has not been pressed"} and P("button has been pressed" | "I'm pressing the button")~1. Obviously, there is also no problem with extending Omega with the perceptions, all the way up to |Omega|=4, or with adding some clocks. If you want to force the agent to remember the entire history of the world, then you'll run out of storage space before you need to worry about computability. A real agent would have to start forgetting days, or keep some compressed summary of that history. It seems to me that Jeffrey would "update" the daily utilities into total expected utility; in that case, U can do something similar. You defined U at the very beginning, so there is no need to send these new facts to U, it doesn't care. Instead, you are describing a problem with P, and it's a hard problem, but Jeffrey also uses P, so that doesn't solve it. If you "evaluate events", then events have some sort of bit representation in the agent, right? I don't clearly see the events in your "Updates Are Computable" example, so I can't say much and I may be confused, but I have a

"Weak methods" means confidence is achieved more empirically, so there's always a question of how well the results will generalize for some new AI system (as we scale existing technology up or change details of NN architectures, gradient methods, etc). "Strong methods" means there's a strong argument (most centrally, a proof) based on a detailed gears-level understanding of what's happening, so there is much less doubt about what systems the method will successfully apply to.

2 Alex Turner 4mo
I think most practical alignment techniques have scaled quite nicely, with CCS maybe being an exception, and we don't currently know how to scale the interp advances in OP's paper. Blessings of scale (IIRC): RLHF, constitutional AI / AI-driven dataset inclusion decisions / meta-ethics, activation steering / activation addition (LLAMA2-chat results forthcoming), adversarial training / redteaming, prompt engineering (though RLHF can interfere with responsiveness),...  I think the prior strongly favors "scaling boosts alignability" (at least in "pre-deceptive" regimes, though I have become increasingly skeptical of that purported phase transition, or at least its character).  I'd personally say "empirically promising methods" instead of "weak methods." 

The basic idea is not new to me -- I can't recall where, but I think I've probably seen a talk observing that linear combinations of neurons, rather than individual neurons, are what you'd expect to be meaningful (under some assumptions) because that's how the next layer of neurons looks at a layer -- since linear combinations are what's important to the network, it would be weird if it turned out individual neurons were particularly meaningful. This wasn't even surprising to me at the time I first learned about it.

But it's great to see it illustrated so w...
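The point about linear combinations can be illustrated with a small sketch. It assumes a purely linear readout from a two-neuron layer and ignores elementwise nonlinearities, which do single out the neuron basis to some degree; the vectors `hidden` and `readout` and the angle `theta` are hypothetical values. Rotating the hidden activations and the next layer's weights together leaves the next layer's input unchanged, so from the next layer's point of view the individual neurons are not special.

```python
import math

def dot(a, b):
    # Inner product: what a linear next-layer unit computes from the layer below.
    return sum(x * y for x, y in zip(a, b))

def rotate(v, theta):
    # Rotate a 2-D vector by angle theta (an orthogonal change of basis).
    c, s = math.cos(theta), math.sin(theta)
    return [c * v[0] - s * v[1], s * v[0] + c * v[1]]

hidden = [1.0, 2.0]    # hypothetical activations of two neurons
readout = [3.0, -1.0]  # hypothetical weights of one next-layer unit
theta = 0.7            # arbitrary rotation angle

rotated_hidden = rotate(hidden, theta)    # each new "neuron" mixes the two original ones
rotated_readout = rotate(readout, theta)

original_input = dot(readout, hidden)
rotated_input = dot(rotated_readout, rotated_hidden)
```

The two inputs match up to floating-point error, even though `rotated_hidden[0]` is a mixture of both original neurons.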

4 Joel Burget 4mo
How would you distinguish between weak and strong methods?

It's imaginable to do this work but not remember any of it, i.e. avoid having that work leave traces that can accumulate, but that seems like a delicate, probably unnatural carving.

Is the implication here that modern NNs don't do this? My own tendency would be to think that they are doing a lot of this -- doing a bunch of reasoning which gets thrown away rather than saved. So it seems like modern NNs have simply managed to hit this delicate unnatural carving. (Which in turn suggests that it is not so delicate, and even, not so unnatural.)

1 Tsvi Benson-Tilsen 9mo
Yes, I think there's stuff that humans do that's crucial for what makes us smart, that we have to do in order to perform some language tasks, and that the LLM doesn't do when you ask it to do those tasks, even when it performs well in the local-behavior sense.

Attempting to write out the holes in my model. 

  • You point out that looking for a perfect reward function is too hard; optimization searches for upward errors in the rewards to exploit. But you then propose an RL scheme. It seems to me like it's still a useful form of critique to say: here are the upward errors in the proposed rewards, here is the policy that would exploit them.
  • It seems like you have a few tools to combat this form of critique:
    • Model capacity. If the policy that exploits the upward errors is too complex to fit in the model, it cannot be
...
2 Alex Turner 9mo
(Huh, I never saw this -- maybe my weekly batched updates are glitched? I only saw this because I was on your profile for some other reason.) I really appreciate these thoughts! I would say "that isn't how on-policy RL works; it doesn't just intelligently find increasingly high-reinforcement policies; which reinforcement events get 'exploited' depends on the exploration policy." (You seem to guess that this is my response in the next sub-bullets.) shrug, too good to be true isn't a causal reason for it to not work, of course, and I don't see something suspicious in the correlations. Effective learning algorithms may indeed have nice properties we want, especially if some humans have those same nice properties due to their own effective learning algorithms! 

A good question. I've never seen it happen myself; so where I'm standing, it looks like short emergence examples are cherry-picked.

What report is the image pulled from?

I think your original idea was tenable. LLMs have limited memory, so the waluigi hypothesis can't keep dropping in probability forever, since evidence is lost. The probability only becomes small - but this means if you run for long enough you do in fact expect the transition.

LLMs are high order Markov models, meaning they can't really balance two different hypotheses in the way you describe; because evidence drops out of memory eventually, the probability of Waluigi drops very small instead of dropping to zero. This makes an eventual waluigi transition inevitable as claimed in the post.

You're correct. The finite context window biases the dynamics towards simulacra which can be evidenced by short prompts, i.e. biases away from luigis and towards waluigis.

But let me be more pedantic and less dramatic than I was in the article — the waluigi transitions aren't inevitable. The waluigi are approximately-absorbing classes in the Markov chain, but there are other approximately-absorbing classes which the luigi can fall into. For example, endlessly cycling through the same word (mode-collapse) is also an approximately-absorbing class.

I disagree. The crux of the matter is the limited memory of an LLM. If the LLM had unlimited memory, then every Luigi act would further accumulate a little evidence against Waluigi. But because LLMs can only update on so much context, the probability drops to a small one instead of continuing to drop to zero. This makes waluigi inevitable in the long run.
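The limited-memory argument can be sketched numerically. This is a toy model under loud assumptions: each luigi-like act multiplies the odds of waluigi by a fixed `likelihood_ratio`, only the last `window` acts count as evidence, and the floor posterior is treated as a per-step chance of transition. None of these numbers come from the post, and the model ignores other absorbing classes such as mode-collapse.

```python
def waluigi_posterior(prior, likelihood_ratio, n_acts, window=None):
    # Posterior P(waluigi) after n_acts luigi-like acts.
    # With window=None all evidence is kept; otherwise only the last `window` acts count.
    used = n_acts if window is None else min(n_acts, window)
    odds = prior / (1 - prior) * likelihood_ratio ** used
    return odds / (1 + odds)

def prob_no_transition(per_step_prob, n_steps):
    # Chance of never transitioning if each step independently has per_step_prob.
    return (1 - per_step_prob) ** n_steps

prior, ratio = 0.1, 0.9  # hypothetical prior and per-act likelihood ratio

unlimited = waluigi_posterior(prior, ratio, n_acts=1000)           # keeps shrinking toward zero
floor = waluigi_posterior(prior, ratio, n_acts=1000, window=20)    # stuck at the 20-act value
still_luigi = prob_no_transition(floor, n_steps=1000)              # small, and shrinks with more steps
```

With unlimited memory the posterior keeps falling, while with a finite window it stops at a positive floor, so under the toy assumption the chance of never transitioning decays geometrically with the number of steps.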

I agree. Though is it just the limited context window that causes the effect? I may be mistaken, but from my memory it seems like they emerge sooner than you would expect if this was the only reason (given the size of the context window of gpt3).

So to see if I have this right, the difference is I'm trying to point at a larger phenomenon and you mean teleosemantics to point just at the way beliefs get constrained to be useful.

This doesn't sound quite right to me. Teleosemantics is a purported definition of belief. So according to the teleosemantic picture, it isn't a belief if it's not trying to accurately reflect something. 

The additional statement I prefaced this with, that accuracy is an instrumentally convergent subgoal, was intended to be an explanation of why this sort of "belief" is a c...

FIAT (by another name) was previously proposed in the book On Intelligence. The version there had a somewhat predictive-processing-like story where the cortex makes plans by prediction alone; so reflective agency (really meaning: agency arising from the cortex) is entirely dependent on building a self-model which predicts agency. Other parts of the brain are responsible for the reflexes which provide the initial data which the self-model gets built on (similar to your story).

The continuing kick toward higher degrees of agency comes from parts of the brain ...

1 Tsvi Benson-Tilsen 1y
I don't recall seeing that theory in the first quarter of the book, but I'll look for it later. I somewhat agree with your description of the difference between the theories (at least, as I imagine a predictive processing flavored version). Except, the theories are more similar than you say, in that FIAT would also allow very partial coherentifying, so that it doesn't have to be "follow these goals, but allow these overrides", but can rather be, "make these corrections towards coherence; fill in the free parameters with FIAT goals; leave all the other incoherent behavior the way it is". A difference between the theories (though I don't feel I can pass the PP ITT) is that FIAT allows, you know, agency, as in, non-myopic goal pursuit based on coherent-world-model-building, whereas PP maybe strongly hints against that? I'm confused by this; are these supposed to be mutually exclusive? What's "their own goals"? [After thinking more: Oh like you're saying, here's what it would look like to have a goal that can't be explained as a FIAT goal? I'll assume that in the rest of this comment.] Agreed. I'm not sure I buy that it can't be inferred, even the first time. Maybe you have fairly built-in instincts that aren't about the whole courtship thing, but cause you to feel good when you're around someone. So you seek being around them, and pay attention to them. You try to get them interested in being around you. This builds up the picture of a goal of being together for a long time. (This is a pretty poor explanation as stated; if this explanation works, why wouldn't you just randomly fall in love with anyone you do a favor for? But this is why it's at least plausible to me that the behavior could come from a FIAT-like thing. And maybe that's actually the case with homosexual intercourse in the 1800s.) Maybe courtship is especially much like this, but in general things sort-of-well-explainable as imitation seem like admissible falsifications of FIAT, e.g. if there are also

One thing I see as different between your perspective and (my understanding of) teleosemantics, so far:

You make a general case that values underlie beliefs.

Teleosemantics makes a specific claim that the meaning of semantic constructs (such as beliefs and messages) is pinned down by what it is trying to correspond to.

Your picture seems very compatible with, EG, the old LW claim that UDT's probabilities are really a measure of caring - how much you care about doing well in a variety of scenarios. 

Teleosemantics might fail to analyze such probabilities a...

1 Gordon Seidoh Worley 1y
I think this is exactly right. I often say things like "accurate maps are extremely useful to things like survival, so you and every other living thing has strong incentives to draw accurate maps, but this is contingent on the extent to which you care about e.g. survival". So to see if I have this right, the difference is I'm trying to point at a larger phenomenon and you mean teleosemantics to point just at the way beliefs get constrained to be useful.

OK. So far it seems to me like we share a similar overall take, but I disagree with some of your specific framings and such. I guess I'll try and comment on the relevant posts, even though this might imply commenting on some old stuff that you'll end up disclaiming.

1 Gordon Seidoh Worley 1y
Cool. For what it's worth, I also disagree with many of my old framings. Basically anything written more than ~1 year ago is probably vaguely but not specifically endorsed.

(Following some links...) What's the deal with Holons? 

Your linked article on epistemic circularity doesn't really try to explain itself, but rather links to this article, which LOUDLY doesn't explain itself. 

I haven't read much else yet, but here is what I think I get:

  • You use Godel's incompleteness theorem as part of an argument that meta-rationalism can't make itself comprehensible to rationalism.
  • You think (or thought at the time) that there's a thing, Holons, or Holonic thinking, which is fundamentally really really hard to explain, but which
...
1 Gordon Seidoh Worley 1y
Oh man I kind of wish I could go back in time and wipe out all the cringe stuff I wrote when I was trying to figure things out (like why did I need to pull in Godel or reify my confusion?). With that said, here's some updated thoughts on holons. I'm not really familiar with OOO, so I'll be going off your summary here. I think I started out really not getting what the holon idea points at, but I understood enough to get myself confused in new ways for a while. So first off there's only ~1 holon, such that it doesn't make sense to talk about it as anything other than the whole world. Maybe you could make some case for many overlapping holons centered around each point in the universe expanding out to its Hubble volume, but I think that's probably not helpful. Better to think of the holon as just the whole world, so really it's just a weird cybernetics term for talking about the world. The trouble was I really didn't fully grasp the way that relative and absolute truth are not one and the same. So I was actually still fully trapped within my ontology, but holons seemed like a way to pull pre-ontological reality existing on its own inside of ontology. OOO mostly sounds like being confused about ontology, specifically a kind of reification of the confusion that comes from not realizing that it's maps all the way down, i.e. you only experience the world through, and it's only through experiencing non-experience that you get to taste reality, which is an extremely mysterious answer trying to point at a thing that happens all the time but we literally can't notice it because noticing it destroys it.

It's a good point. I suppose I was anchored by the map/territory analogy to focus on world-to-word fit. The part about Communicative Action and Rational Choice at the very end is supposed to gesture at the other direction. 

Intuitively, I expect it's going to be a bit easier to analyze world-to-word fit first. But I agree that a full picture should address both.

so "5+ autoresponses" would be a single category for decisionmaking purposes

I agree that something in this direction could work, and plausibly captures something about how humans reason. However, I don't feel satisfied. I would want to see the idea developed as part of a larger framework of bounded rationality.

UDT gives us a version of "never be harmed by information" which is really nice, as far as it goes. In the cases which UDT helps with, we don't need to do anything tricky, where we carefully decide which information to look at -- UDT simply isn't har...

I was using "crazy" to mean something like "too different from what we are familiar with", but I take your point. It's not clear we should want to preserve Aumann.

To be clear, rejecting Aumann's account of common knowledge would make his proof unsound (albeit still valid), but it would not solve the general "disagreement paradox", the counterintuitive conclusion that rational disagreements seem to be impossible: There are several other arguments which lead to this conclusion, and which do not rely on any notion of common knowledge.

Interesting, thanks for pointing this out!

Each time we come up against this barrier, it is tempting to add a new layer of indirection in our designs for AI systems.

I strongly agree with this characterization. Of my own "learning normativity" research direction, I would say that it has an avoiding-the-question nature similar to what you are pointing out here; I am in effect saying: Hey! We keep needing new layers of indirection! Let's add infinitely many of them! 

One reason I don't spend very much time staring the question "what is goodness/wisdom" in the eyes is, the CEV write-up and other th...

I think that's not true. The point where you deal with wireheading probably isn't what you reward so much as when you reward. If the agent doesn't even know about its training process, and its initial values form around e.g. making diamonds, and then later the AI actually learns about reward or about the training process, then the training-process-shard updates could get gradient-starved into basically nothing. 

I have a low-confidence disagreement with this, based on my understanding of how deep NNs work. To me, the tangent space stuff suggests that i... (read more)

2Alex Turner1y
This seems to prove too much in general, although it could be "right in spirit." If the AI cares about diamonds, finds out about the training process but experiences no more update events in that moment, and then sets its learning rate to zero, then I see no way for the Update God to intervene to make the agent care about its training process.  I was responding to: I bet you can predict what I'm about to say, but I'll say it anyways. The point of RL is not to entrain cognition within the agent which predicts the reward. RL first and foremost chisels cognition into the network.  So I think the statement "how well do the agent's motivations predict the reinforcement event" doesn't make sense if it's cast as "manage a range of hypotheses about the origin of reward (e.g. training-process vs actually making diamonds)." I think it does make sense if you think about what behavioral influences ("shards") within the agent will upweight logits on the actions which led to reward.
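
As an illustration of "upweighting logits on the actions which led to reward", here is a minimal sketch, assuming a softmax policy over three actions and a plain REINFORCE-style update. The function names, learning rate, and reward value are hypothetical, chosen only for illustration, and are not from the thread.

```python
import math


def softmax(logits):
    """Convert a list of logits into action probabilities."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, action, reward, lr=0.1):
    """One REINFORCE-style update on raw logits (illustrative assumption).

    For a softmax policy, d/d(logit_i) of log pi(action) is
    1[i == action] - pi(i), so a positive reward raises the logit of the
    action that was taken and lowers the logits of the others.
    """
    probs = softmax(logits)
    return [
        logit + lr * reward * ((1.0 if i == action else 0.0) - p)
        for i, (logit, p) in enumerate(zip(logits, probs))
    ]


logits = [0.0, 0.0, 0.0]  # uniform initial policy
updated = reinforce_step(logits, action=1, reward=1.0)
print(softmax(logits), softmax(updated))
```

Nothing in this update refers to a prediction of reward: the reward only scales how strongly the computation that produced the rewarded action gets reinforced.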

I expect this argument to not hold, 

Seems like the most significant remaining disagreement (perhaps).

1. Gradient updates are pointed in the direction of most rapid loss-improvement per unit step. I expect most of the "distance covered" to be in non-training-process-modeling directions for simplicity reasons (I understand this argument to be a predecessor of the NTK arguments.)

So I am interpreting this argument as: even if LTH implies that a nascent/potential hypothesis is training-process-modeling (in an NTK & LTH sense), you expect the gradient t... (read more)

2Alex Turner1y
This seems stronger than the claim I'm making. I'm not saying that the agent won't deceptively model us and the training process at some point. I'm saying that the initial cognition will be e.g. developed out of low-level features which get reliably pinged with lots of gradients and implemented in few steps. Think edge detectors. And then the lower-level features will steer future training. And eventually the agent models us and its training process and maybe deceives us. But not right away. You can make the "some subnetwork just models its training process and cares about getting low loss, and then gets promoted" argument against literally any loss function, even some hypothetical "perfect" one (which, TBC, I think is a mistaken way of thinking). If I buy this argument, it seems like a whole lot of alignment dreams immediately burst into flame. No loss function would be safe. This conclusion, of course, does not decrease in the slightest the credibility of the argument. But I don't perceive you to believe this implication. Anyways, here's another reason I disagree quite strongly with the argument, because I perceive it to strongly privilege the training-modeling hypothesis. There is an extreme range of motivations and inner cognitive structures which can be upweighted by the small number of gradients observed early in training. The network doesn't "observe" more than that, initially. The network just gets updated by the loss function. It doesn't even know what the loss function is. It can't even see the gradients. It can't even remember the past training data, except insofar as the episode is retained in its recurrent weights. The EG CoT finetuning will just etch certain kinds of cognition into the network. Why not? Claims (left somewhat vague because I have to go soon, sorry for lack of concreteness): 1. RL develops a bunch of contextual decision-influences / shards 1. EG be near diamonds, make diamonds, play games 2. Agents learn to plan, and severa

My main complaint with this, as I understand it, is that builder/breaker encourages you to repeatedly condition on speculative dangers until you're exploring a tiny and contorted part of solution-space (like worst-case robustness hopes, in my opinion). And then you can be totally out of touch with the reality of the problem.

On my understanding, the thing to do is something like heuristic search, where "expanding a node" means examining that possibility in more detail. The builder/breaker scheme helps to map out heuristic guesses about the value of differen... (read more)

2Alex Turner1y
Your comment here is great, high-effort, contains lots of interpretive effort. Thanks so much! Let me see how this would work.

  1. Breaker: "The agent might wirehead because caring about physical reward is a high-reward policy on training"
  2. Builder: "Possible, but I think using reward signals is still the best way forward. I think the risk is relatively low due to the points made by 'Reward is not the optimization target'."
  3. Breaker: "So are we assuming a policy gradient-like algorithm for the RL finetuning?"
  4. Builder: "Sure."
  5. Breaker: "What if there's a subnetwork which is a reward maximizer due to LTH?"
  6. ...

If that's how it might go, then sure, this seems productive.

I don't think I was mentally distinguishing between "the idealized builder-breaker process" and "the process as TurnTrout believes it to be usually practiced." I think you're right, I should be critiquing the latter, but not necessarily how you in particular practice it, I don't know much about that. I'm critiquing my own historical experience with the process as I imperfectly recall it.

Yes, I think this was most of my point. Nice summary.

I expect this argument to not hold, but I'm not yet good enough at ML theory to be super confident. Here are some intuitions. Even if it's true that LTH probabilistically ensures the existence of undesired-subnetwork,

  1. Gradient updates are pointed in the direction of most rapid loss-improvement per unit step. I expect most of the "distance covered" to be in non-training-process-modeling directions for simplicity reasons (I understand this argument to be a predecessor of the NTK arguments.)
  2. You're always going to have identifiability issues with respect to the loss signal. This could mean that either: (a) the argument is wrong, or (b) training-process-optimization is unavoidable, or (c) we can somehow make it not apply to networks of AGI size.
  3. Even if the agent is motivated both by the training process and by the object-level de

The questions there would be more like "what sequence of reward events will reinforce the desired shards of value within the AI?" and not "how do we philosophically do some fancy framework so that the agent doesn't end up hacking its sensors or maximizing the quotation of our values?". 

I think that it generally seems like a good idea to have solid theories of two different things:

  1. What is the thing we are hoping to teach the AI?
  2. What is the training story by which we mean to teach it?

I read your above paragraph as maligning (1) in favor of (2). In order... (read more)

I said: 

The basic idea behind compressed pointers is that you can have the abstract goal of cooperating with humans, without actually knowing very much about humans.
In machine-learning terms, this is the question of how to specify a loss function for the purpose of learning human values.

You said: 

In machine-learning terms, this is the question of how to train an AI whose internal cognition reliably unfolds into caring about people, in whatever form that takes in the AI's learned ontology (whether or not it has a concept for people).

Thinking ... (read more)

2Alex Turner1y
True, but I'm also uncertain about the relative difficulty of relatively novel and exotic value-spreads like "I value doing the right thing by humans, where I'm uncertain about the referent of humans", compared to "People should have lots of resources and be able to spend them freely and wisely in pursuit of their own purposes" (the latter being values that at least I do in fact have).

If you commit to the specific view of outer/inner alignment, then now you also want your loss function to "represent" that goal in some way.

I think it is reasonable as engineering practice to try and make a fully classically-Bayesian model of what we think we know about the necessary inductive biases -- or, perhaps more realistically, a model which only violates classic Bayesian definitions where necessary in order to represent what we want to represent.

This is because writing down the desired inductive biases as an explicit prior can help us to understand... (read more)

I doubt this due to learning from scratch.

I expect you'll say I'm missing something, but to me, this sounds like a language dispute. My understanding of your recent thinking holds that the important goal is to understand how human learning reliably results in human values. The Bayesian perspective on this is "figuring out the human prior", because a prior is just a way-to-learn. You might object to the overly Bayesian framing of that, but I'm fine with that. I am not dogmatic on orthodox Bayesianism. I do not even like utility functions.

Insofar as the ques

... (read more)
2Alex Turner1y
I agree, this does seem like it was a language dispute, I no longer perceive us as disagreeing on this point. 

I think that both the easy and hard problem of wireheading are predicated on 1) a misunderstanding of RL (thinking that reward is—or should be—the optimization target of the RL agent) and 2) trying to black-box human judgment instead of just getting some good values into the agent's own cognition. I don't think you need anything mysterious for the latter. I'm confident that RLHF, done skillfully, does the job just fine. The questions there would be more like "what sequence of reward events will reinforce the desired shards of value within the AI?" and not

... (read more)
2Alex Turner1y
I think that's not true. The point where you deal with wireheading probably isn't what you reward so much as when you reward. If the agent doesn't even know about its training process, and its initial values form around e.g. making diamonds, and then later the AI actually learns about reward or about the training process, then the training-process-shard updates could get gradient-starved into basically nothing.  This isn't a rock-solid rebuttal, of course. But I think it illustrates that RL training stories admit ways to decrease P(bad hypotheses/behaviors/models). And one reason is that I don't think that RL agents are managing motivationally-relevant hypotheses about "predicting reinforcements." Possibly that's a major disagreement point? (I know you noted its fuzziness, so maybe you're already sympathetic to responses like the one I just gave?)