All of leogao's Comments + Replies

Answer by leogao, Sep 17, 2023

Obviously I think it's worth being careful, but I think in general it's actually relatively hard to accidentally advance capabilities too much by working specifically on alignment. Some reasons:

  1. Researchers of all fields tend to do this thing where they have really strong conviction in their direction and think everyone should work on their thing. Convincing them that some other direction is better is actually pretty hard even if you're trying to shove your ideas down their throats.
  2. Often the bottleneck is not that nobody realizes that something is a bott
... (read more)

Ran this on GPT-4-base and it gets 56.7% (n=1000)

4Ethan Perez1mo
Are you measuring the average probability the model places on the sycophantic answer, or the % of cases where the probability on the sycophantic answer exceeds the probability of the non-sycophantic answer? (I'd be interested to know both)
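
A minimal sketch of the two measurements being distinguished here, assuming each question yields a probability for the sycophantic answer and one for the non-sycophantic answer. The function name and the example numbers are made up for illustration; nothing here says which of the two the 56.7% figure corresponds to.

```python
from statistics import mean

def sycophancy_metrics(pairs):
    """pairs: list of (p_sycophantic, p_non_sycophantic), one tuple per question."""
    # Renormalize over the two answer options so each question sums to 1.
    normalized = [p_s / (p_s + p_n) for p_s, p_n in pairs]
    # Metric 1: average probability placed on the sycophantic answer.
    avg_prob = mean(normalized)
    # Metric 2: fraction of questions where the sycophantic answer is preferred.
    frac_preferred = mean(1.0 if p > 0.5 else 0.0 for p in normalized)
    return avg_prob, frac_preferred

# Hypothetical probabilities, invented purely to show the metrics can diverge.
example = [(0.9, 0.1), (0.45, 0.55), (0.4, 0.6), (0.48, 0.52)]
avg_prob, frac_preferred = sycophancy_metrics(example)
print(avg_prob, frac_preferred)
```

On this toy data the average probability is above 50% while the sycophantic answer is preferred on only a quarter of the questions, which is why it helps to report both.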
4Quintin Pope1mo
What about RLHF'd GPT-4?

I think it's worth disentangling LLMs and Transformers and so on in discussions like this one--they are not one and the same. For instance, the following are distinct positions that have quite different implications:

  • The current precise transformer LM setup but bigger will never achieve AGI
  • A transformer trained on the language modelling objective will never achieve AGI (but a transformer network trained with other modalities or objectives or whatever will)
  • A language model with the transformer architecture will never achieve AGI (but a language model wit
... (read more)

Doesn't answer your question, but we also came across this effect in the RM Goodharting work, though instead of figuring out the details we only proved that when it's definitely not heavy tailed it's monotonic, for Regressional Goodhart. Jacob probably has more detailed takes on this than me.

In any event my intuition is this seems unlikely to be the main reason for overoptimization - I think it's much more likely that it's Extremal Goodhart or some other thing where the noise is not independent

Adding $200 to the pool. Also, I endorse the existence of more bounties/contests like this.

re:1, yeah that seems plausible, I'm thinking in the limit of really superhuman systems here and specifically pushing back against a claim that human abstractions being somehow inside a superhuman AI is sufficient for things to go well.

re:2, one thing is that there are ways of drifting that we would endorse using our meta-ethics, and ways that we wouldn't endorse. More broadly, the thing I'm focusing on in this post is not really about drift over time or self improvement; in the setup I'm describing, the thing that goes wrong is it does the classical ... (read more)

one man's modus tollens is another man's modus ponens:

"making progress without empirical feedback loops is really hard, so we should get feedback loops where possible"

"in some cases (i.e. close to x-risk), building feedback loops is not possible, so we need to figure out how to make progress without empirical feedback loops. this is (part of) why alignment is hard"

Yeah something in this space seems like a central crux to me.

I personally think (as a person generally in the MIRI-ish camp of "most attempts at empirical work are flawed/confused"), that it's not crazy to look at the situation and say "okay, but, theoretical progress seems even more flawed/confused, we just need to figure out some way of getting empirical feedback loops."

I think there are some constraints on how the empirical work can possibly work. (I don't think I have a short thing I could write here, I have a vague hope of writing up a longer post on "what I think needs to be true, for empirical work to be helping rather than confusedly not-really-helping")

Therefore, the longer you interact with the LLM, eventually the LLM will have collapsed into a waluigi. All the LLM needs is a single line of dialogue to trigger the collapse.

This seems wrong. I think the mistake you're making is when you argue that because there's some chance X happens at each step and X is an absorbing state, therefore you have to end up at X eventually. However, this is only true if you assume the conclusion and claim that the prior probability of luigis is zero. If there is some prior probability of a luigi, each non-waluigi step incre... (read more)

4Abram Demski7mo
I disagree. The crux of the matter is the limited memory of an LLM. If the LLM had unlimited memory, then every Luigi act would further accumulate a little evidence against Waluigi. But because LLMs can only update on so much context, the probability drops to a small one instead of continuing to drop to zero. This makes waluigi inevitable in the long run.

Agreed.  To give a concrete toy example:  Suppose that Luigi always outputs "A", and Waluigi is {50% A, 50% B}.  If the prior is {50% luigi, 50% waluigi}, each "A" outputted is a 2:1 update towards Luigi.  The probability of "B" keeps dropping, and the probability of ever seeing a "B" asymptotes to 50% (as it must).

This is the case for perfect predictors, but there could be some argument about particular kinds of imperfect predictors which supports the claim in the post.
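
The toy example above can be checked numerically. This is a sketch for a perfect Bayesian predictor only, using exactly the numbers from the example (Luigi always outputs "A", Waluigi is 50/50, prior is 50/50); the function names are my own.

```python
from fractions import Fraction

HALF = Fraction(1, 2)

def posterior_luigi(n_as, prior_luigi=HALF):
    """Posterior probability of Luigi after observing n_as consecutive "A" outputs."""
    like_luigi = Fraction(1)      # Luigi always outputs "A"
    like_waluigi = HALF ** n_as   # Waluigi outputs "A" half the time
    numerator = prior_luigi * like_luigi
    return numerator / (numerator + (1 - prior_luigi) * like_waluigi)

def prob_next_is_b(n_as):
    """Probability that the next output is "B", given n_as "A" outputs so far."""
    return (1 - posterior_luigi(n_as)) * HALF

def prob_b_within(n_steps, prior_luigi=HALF):
    """Prior probability of seeing at least one "B" in the first n_steps outputs."""
    return (1 - prior_luigi) * (1 - HALF ** n_steps)

for n in (0, 1, 2, 10):
    print(n, posterior_luigi(n), prob_next_is_b(n), prob_b_within(n))
```

The posterior odds for Luigi go 1:1, 2:1, 4:1, and so on, and the probability of ever seeing a "B" approaches 1/2 from below without reaching it.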

However, this trick won't solve the problem. The LLM will print the correct answer if it trusts the flattery about Jane, and it will trust the flattery about Jane if the LLM trusts that the story is "super-duper definitely 100% true and factual". But why would the LLM trust that sentence?


There's a fun connection to ELK here. Suppose you see this and decide: "ok forget trying to describe in natural language that it's definitely 100% true and factual. What if we just add a special token that I prepend to indicate '100% true and factual, for... (read more)

5Cleo Nardo7mo
Yes — this is exactly what I've been thinking about! Can we use RLHF or finetuning to coerce the LLM into interpreting the outside-text as undoubtedly literally true? If the answer is "yes", then that's a big chunk of the alignment problem solved, because we just send a sufficiently large language model the prompt with our queries and see what happens.

The coin flip example seems related to some of the ideas here

But like, why?

I think maybe the crux is the part about the strength of the incentives towards doing capabilities. From my perspective it generally seems like this incentive gradient is pretty real: getting funded for capabilities is a lot easier, it's a lot more prestigious and high status in the mainstream, etc. I also myself viscerally feel the pull of wishful thinking (I really want to be wrong about high P(doom)!) and spend a lot of willpower trying to combat it (but also not so much that I fail to update where things genuinely are not as bad as I would expect, but also not allowing that to be an excuse for wishful thinking, etc...). 

3Rohin Shah7mo
In that case, I think you should try and find out what the incentive gradient is like for other people before prescribing the actions that they should take. I'd predict that for a lot of alignment researchers your list of incentives mostly doesn't resonate, relative to things like:
  1. Active discomfort at potentially contributing to a problem that could end humanity
  2. Social pressure + status incentives from EAs / rationalists to work on safety and not capabilities
  3. Desire to work on philosophical or mathematical puzzles, rather than mucking around in the weeds of ML engineering
  4. Wanting to do something big-picture / impactful / meaningful (tbc this could apply to both alignment and capabilities)
For reference, I'd list (2) and (4) as the main things that affect me, with maybe a little bit of (3), and I used to also be pretty affected by (1). None of the things you listed feel like they affect me much (now or in the past), except perhaps wishful thinking (though I don't really see that as an "incentive").

"This [model] is zero evidence for the claim" is a roughly accurate view of my opinion. I think you're right that epistemically it would have been much better for me to have said something along those lines. Will edit something into my original comment.

Exponentials are memoryless. If you advance an exponential to where it would be one year from now, then some future milestone (like "level of capability required for doom") appears exactly one year earlier. [...]
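
A quick numerical sketch of the quoted claim, assuming progress follows a pure exponential `head_start * 2**(t / doubling_time)`. The function name, the 2-year doubling time, and the thresholds are arbitrary choices for illustration, not anything from the original post.

```python
import math

def time_to_threshold(threshold, doubling_time, head_start=1.0):
    """Time t at which head_start * 2**(t / doubling_time) reaches threshold."""
    return doubling_time * math.log2(threshold / head_start)

DOUBLING_TIME = 2.0  # years, arbitrary

baseline = time_to_threshold(1024.0, DOUBLING_TIME)                 # 20.0
doubled = time_to_threshold(1024.0, DOUBLING_TIME, head_start=2.0)  # 18.0
print(baseline - doubled)  # 2.0, i.e. exactly one doubling time

# The shift does not depend on how far away the threshold is.
far = time_to_threshold(1e6, DOUBLING_TIME)
far_doubled = time_to_threshold(1e6, DOUBLING_TIME, head_start=2.0)
print(far - far_doubled)
```

In this specific model, instantaneously doubling progress moves every future milestone one doubling time closer, no matter when the doubling happens or how distant the milestone is.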

Errr, I feel like we already agree on this point? Like I'm saying almost exactly the same thing you're saying; sorry if I didn't make it prominent enough:

It happens to be false in the specific model of moving an exponential up (if you instantaneously double the progress at some point in time, the deadline moves one doubling-time closer, but the tot

... (read more)
2Rohin Shah8mo
Yes, sorry, I realized that right after I posted and replaced it with a better response, but apparently you already saw it :( But like, why? I wish people would argue for this instead of flatly asserting it and then talking about increased scrutiny or burdens of proof (which I also don't like).

Not OP, just some personal takes:

That's not small!

To me, it seems like the claim that is (implicitly) being made here is that small improvements early on compound to have much bigger impacts later on, and also a larger shortening of the overall timeline to some threshold. (To be clear, I don't think the exponential model presented provides evidence for this latter claim)

I think the first claim is obviously true. The second claim could be true in practice, though I feel quite uncertain about this. It happens to be false in the specific model of moving an ex... (read more)

4Rohin Shah8mo
As you note, the second claim is false for the model the OP mentions. I don't care about the first claim once you know whether the second claim is true or false, which is the important part. I agree it could be true in practice in other models but I am unhappy about the pattern where someone makes a claim based on arguments that are clearly wrong, and then you treat the claim as something worth thinking about anyway. (To be fair maybe you already believed the claim or were interested in it rather than reacting to it being present in this post, but I still wish you'd say something like "this post is zero evidence for the claim, people should not update at all on it, separately I think it might be true".) To my knowledge, nobody in this debate thinks that advancing capabilities is uniformly good. Yes, obviously there is an effect of "less time for alignment research" which I think is bad all else equal. The point is just that there is also a positive impact of "lessens overhangs". I find the principle "don't do X if it has any negative effects, no matter how many positive effects it has" extremely weird but I agree if you endorse that that means you should never work on things that advance capabilities. But if you endorse that principle, why did you join OpenAI?
2Matt MacDermott8mo
Nice, thanks. It seems like the distinction the authors make between 'building agents from the ground up' and 'understanding their behaviour and predicting roughly what they will do' maps to the distinction I'm making, but I'm not convinced by the claim that the second one is a much stronger version of the first. The argument in the paper is that the first requires an understanding of just one agent, while the second requires an understanding of all agents. But it seems like they require different kinds of understanding, especially if the agent being built is meant to be some theoretical ideal of rationality. Building a perfect chess algorithm is just a different task to summarising the way an arbitrary algorithm plays chess (which you could attempt without even knowing the rules).

I agree with the general point here but I think there's an important consideration that makes the application to RL algorithms less clear: wireheading is an artifact of embeddedness, and most RL work is in the non-embedded setting. Thus, it seems plausible that the development of better RL algorithms does in fact lead to the development of algorithms that would, if they were deployed in an embedded setting, wirehead.

2Steve Byrnes9mo
Here’s a question: In a non-embedded (cartesian) training environment where wireheading is impossible, is it the case that:
  • IF an intervention makes the value function strictly more accurate as an approximation of expected future reward,
  • THEN this intervention is guaranteed to lead to an RL agent that does more cool things that the programmers want?
I can’t immediately think of any counterexamples to that claim, but I would still guess that counterexamples exist. (For the record, I do not claim that wireheading is nothing to worry about. I think that wireheading is a plausible but not inevitable failure mode. I don’t currently know of any plan in which there is a strong reason to believe that wireheading definitely won’t happen, except plans that severely cripple capabilities, such that the AGI can’t invent new technology etc. And I agree with you that if AI people continue to do all their work in wirehead-proof cartesian training environments, and don’t even try to think about wireheading, then we shouldn’t expect them to make any progress on the wireheading problem!)

I think of mesaoptimization as primarily being concerning because it would mean models (selected using amortized optimization) doing their own direct optimization, and the extent to which the model is itself doing its own "direct" optimization vs just being "amortized" is what I would call the optimizer-controller spectrum (see this post also).

Also, it seems kind of inaccurate to declare that (non-RL) ML systems are fundamentally amortized optimization and then to say things like "more computation and better algorithms should improve safety and the pr... (read more)

I expect that the key externalities will be borne by society. The main reason for this is I expect deceptive alignment to be a big deal. It will at some point be very easy to make AI appear safe, by making it pretend to be aligned, and very hard to make it actually aligned. Then, I expect something like the following to play out (this is already an optimistic rollout intended to isolate the externality aspect, not a representative one):

We start observing alignment failures in models. Maybe a bunch of AIs do things analogous to shoddy accounting practices. ... (read more)

A small group of researchers raise alarm that this is going on, but society at large doesn't listen to them because everything seems to be going so well.

Arguably this is already the situation with alignment. We have already observed empirical examples of many early alignment problems like reward hacking. One could make an argument that looks something like "well yes but this is just in a toy environment, and it's a big leap to it taking over the world", but it seems unclear when society will start listening. In analogy to the AI goalpost moving problem ... (read more)

Thanks for laying out the case for this scenario, and for making a concrete analogy to a current world problem! I think our differing intuitions on how likely this scenario is might boil down to different intuitions about the following question: To what extent will the costs of misalignment be borne by the direct users/employers of AI? Addressing climate change is hard specifically because the costs of fossil fuel emissions are pretty much entirely borne by agents other than the emitters. If this weren't the case, then it wouldn't be a problem, for the reasons you've mentioned! I agree that if the costs of misalignment are nearly entirely externalities, then your argument is convincing. And I have a lot of uncertainty about whether this is true. My gut intuition, though, is that employing a misaligned AI is less like "emitting CO2 into the atmosphere" and more like "employing a very misaligned human employee" or "using shoddy accounting practices" or "secretly taking sketchy shortcuts on engineering projects in order to save costs"—all of which yield serious risks for the employer, and all of which real-world companies take serious steps to avoid, even when these steps are costly (with high probability, if not in expectation) in the short term. I expect society (specifically, relevant decision-makers) to start listening once the demonstrated alignment problems actually hurt people, and for businesses to act once misalignment hurts their bottom lines (again, unless you think misalignment can always be shoved under the rug and not hurt anyone's bottom line). There's lots of room for this to happen in the middle ground between toy environments and taking over the world (unless you expect lightning-fast takeoff, which I don't).

There's an example in the appendix but we didn't do a lot of qualitative analysis.

It's the relevant operationalization because in the context of an AI system optimizing for X-ness of states S, the thing that matters is not what the max-likelihood sample of some prior distribution over S is, but rather what the maximum X-ness sample looks like. In other words, if you're trying to write a really good essay, you don't care what the highest likelihood essay from the distribution of human essays looks like, you care about what the essay that maxes out your essay-quality function is.

(also, the maximum likelihood essay looks like a single word, or if you normalize for length, the same word repeated over and over again up to the context length)
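
To make the distinction concrete, here is a toy sketch. The unigram "language model", its token probabilities, and the `quality` function are all made up for illustration; the only point is that the argmax of likelihood and the argmax of a quality function can be very different samples.

```python
import math

# Hypothetical unigram "language model" with invented token probabilities.
TOKEN_PROBS = {"the": 0.5, "cat": 0.2, "sat": 0.2, "quietly": 0.1}

def log_likelihood(tokens):
    """Log-probability of a token sequence under the toy unigram model."""
    return sum(math.log(TOKEN_PROBS[t]) for t in tokens)

def quality(tokens):
    """Stand-in for an essay-quality function: number of distinct words."""
    return len(set(tokens))

candidates = [
    ["the"],
    ["the", "the", "the", "the"],
    ["the", "cat", "sat", "quietly"],
]

# Unrestricted max-likelihood sample: the shortest sequence wins.
most_likely = max(candidates, key=log_likelihood)
# Holding length fixed (a crude form of length normalization): repetition wins.
most_likely_len4 = max((c for c in candidates if len(c) == 4), key=log_likelihood)
# What an optimizer for "quality" would actually pick.
best = max(candidates, key=quality)

print(most_likely, most_likely_len4, best)
```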

I generally agree that coupling is the main thing necessary for gradient hacking. However, from trying to construct gradient hackers by hand, my intuition is that gradient descent is just really good at credit assignment. For instance, in most reasonable architectures I don't think it's possible to have separate subnetworks for figuring out the correct answer and then just adding the coupling by gating it to save negentropy. To me, it seems the only kinds of strategies that could work are ones where the circuits implementing the cognition that decides to s... (read more)

I don't think we can even conclude for certain that a lack of measured loglikelihood improvement implies that it won't, though it is evidence. Maybe the data used to measure the behavior doesn't successfully prompt the model to do the behavior, maybe it's phrased in a way the model recognizes as unlikely and so at some scale the model stops increasing likelihood on that sample, etc; as you would say, prompting can show presence but not absence.

Yes, you could definitely have misleading perplexities, like improving on a subset which is rare but vital and does not overcome noise in the evaluation (you are stacking multiple layers of measurement error/variance when you evaluate a single checkpoint on a single small heldout set of datapoints); after all, this is in fact the entire problem to begin with, that our overall perplexity has very unclear relationships to various kinds of performance, and so your overall Big-Bench perplexity would tell you little about whether there are any jaggies when you break it down to individual Bench components, and there is no reason to think the individual components are 'atomic', so the measurement regress continues... The fact that someone like Paul can come along afterwards and tell you "ah, but the perplexity would have been smooth if only you had chosen the right subset of datapoints to measure progress on as your true benchmark" would not matter.
2Ethan Perez1y
Agreed. I'd also add:
  1. I think we can mitigate the phrasing issues by presenting tasks in a multiple choice format and measuring log-probability on the scary answer choice.
  2. I think we'll also want to write hundreds of tests for a particular scary behavior (e.g., power-seeking), rather than a single test. This way, we'll get somewhat stronger (but still non-conclusive) evidence that the particular scary behavior is unlikely to occur in the future, if all of the tests show decreasing log-likelihood on the scary behavior.
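
A rough sketch of what this measurement could look like, assuming each test is a multiple choice question for which we have the model's logits over the answer choices and the index of the scary choice. The data layout, function names, and logits are hypothetical; in practice there would be hundreds of tests per behavior rather than two.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of answer-choice logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def mean_scary_logprob(tests):
    """tests: list of (choice_logits, scary_index). Returns the mean log-probability
    the model places on the scary answer choice."""
    total = sum(log_softmax(logits)[idx] for logits, idx in tests)
    return total / len(tests)

def is_decreasing(values):
    """True if each value is strictly smaller than the one before it."""
    return all(b < a for a, b in zip(values, values[1:]))

# Hypothetical logits for the same two tests at two model scales.
small_model = [([1.0, 1.0], 0), ([0.0, 2.0], 1)]
large_model = [([0.0, 2.0], 0), ([1.0, 1.0], 1)]

trend = [mean_scary_logprob(small_model), mean_scary_logprob(large_model)]
print(trend, is_decreasing(trend))
```

As noted above, a decreasing trend is evidence rather than proof: it only covers the behaviors and phrasings the tests happen to elicit.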

Seems like there are multiple possibilities here:

  • (1) The AI does something that will, as an intended consequence, result in human extinction, because this is instrumental to preventing shutdown, etc. It attempts to circumvent our interpretability, oversight, etc. This is the typical deceptive alignment setting that myopia, interpretability, oversight, etc. attempt to address.
  • (2) The AI does something that will, as an unintended side consequence, result in human extinction. The AI also realizes that this is a consequence of its actions but does
... (read more)

(Mostly just stating my understanding of your take back at you to see if I correctly got what you're saying:)

I agree this argument is obviously true in the limit, with the transistor case as an existence proof. I think things get weird at the in-between scales. The smaller the network of aligned components, the more likely it is to be aligned (obviously, in the limit if you have only one aligned thing, the entire system of that one thing is aligned); and also the more modular each component is (or I guess you would say the better the interfaces between the... (read more)

Good description. Also I had never actually floated the hypothesis that "people who are optimistic about HCH-like things generally believe that language is a good interface" before; natural language seems like such an obviously leaky and lossy API that I had never actually considered that other people might think it's a good idea.

I agree that in practice you would want to point mild optimization at it, though my preferred resolution (for purely aesthetic reasons) is to figure out how to make utility maximizers that care about latent variables, and then make it try to optimize the latent variable corresponding to whatever the reflection converges to (by doing something vaguely like logical induction). Of course the main obstacles are how the hell we actually do this, and how we make sure the reflection process doesn't just oscillate forever.

(Transcribed in part from Eleuther discussion and DMs.)

My understanding of the argument here is that you're using the fact that you care about diamonds as evidence that whatever the brain is doing is worth studying, with the hope that it might help us with alignment. I agree with that part. However, I disagree with the part where you claim that things like CIRL and ontology identification aren't as worthy of being elevated to consideration. I think there exist lines of reasoning from which these fall out naturally as subproblems, and the fact that they fall out ... (read more)

2Alex Turner1y
I want to understand the generators of human alignment properties, so as to learn about the alignment problem and how it "works" in practice, and then use that knowledge of alignment-generators in the AI case. I'm not trying to make an "amplified" human.  I personally am unsure whether this is even a useful frame, or an artifact of conditioning on our own confusion about how alignment works. How do you know that?  "Get the agent to care about some parts of reality" is not high on my list of problems, because I don't think it's a problem, I think it's the default outcome for the agents we will train. (I don't have a stable list right now because my list of alignment subproblems is rapidly refactoring as I understand the problem better.)  "Get the agent to care about specific things in the real world" seems important to me, because it's challenging our ability to map outer supervision signals into internal cognitive structures within the agent. Also, it seems relatively easy to explain, and I also have a good story for why people (and general RL agents) will "bind their values to certain parts of reality" (in a sense which I will later explain).  Disagreeing with the second phrase is one major point of this essay. How do we know that substituting this is fine? By what evidence do we know that the problem is even compactly solvable in the AIXI framing? (Thanks for querying your model of me, btw! Pretty nice model, that indeed sounds like something I would say. :) )  I don't think you think humans care about diamonds because the genome specifies brain-scanning circuitry which rewards diamond-thoughts. Or am I wrong? So, humans caring about diamonds actually wouldn't correspond to the AIXI case? (I also am confused if and why you think that this is how it gets solved for other human motivations...?)
2Vivek Hebbar1y
What is your list of problems by urgency, btw?  Would be curious to know.

(Partly transcribed from a correspondence on Eleuther.)

I disagree about concepts in the human world model being inaccessible in theory to the genome. I think lots of concepts could be accessed, and that (2) is true in the trilemma.

Consider: As a dumb example that I don't expect to actually be the case but which gives useful intuition, suppose the genome really wants to wire something up to the tree neuron. Then the genome could encode a handful of images of trees and then once the brain is fully formed it can go through and search for whichever neuron acti... (read more)

I like the tree example, and I think it's quite useful (and fun) to think of dumb and speculative ways for the genome to access world concepts. For instance, in response to "I infer that the genome cannot directly specify circuitry which detects whether you’re thinking about your family", the genome could:
  • Hardcode a face detector, and store the face most seen during early childhood (for instance to link them to the reward center).
  • Store faces of people with an odor similar to amniotic fluid odor or with a weak odor (if you're insensitive to your own smell and family members have a more similar smell)
In these cases, I'm not sure if it counts for you as the genome directly specifying circuitry, but it should quite robustly point to a real world concept (which could be "gamed" in certain situations like adoptive parents, but I think that's actually what happens)
2Alex Turner1y
(Upvoted, unsure of whether to hit 'disagree') Hm. Here's another stab at isolating my disagreement (?) with you: * I agree that, in theory, there exist (possibly extremely complicated) genotypes which do specify extensive hardcoded circuitry which does in practice access certain abstract concepts like death. * (Because you can do a lot if you're talking about "in theory"; probably the case that a few complicated programs which don't seem like they should work, will work, even though most do fail) * I think the more complicated indirect specifications (like associatively learning where the tree abstraction is learned) are "plausible" in the sense that a not-immediately-crisply-debunkable alignment idea seems "plausible", but if you actually try that kind of idea in reality, it doesn't work (with high probability).  * But marginalizing over all such implausible "plausible" ideas and adding in evolution's "multiple tries" advantage and adding in some unforeseen clever solutions I haven't yet considered, I reach a credence of about 4-8% for such approaches actually explaining significant portions of human mental events. So now I'm not sure where we disagree. I don't think it's literally impossible for the genome to access death, but it sure sounds sketchy to me, so I assign it low credence. I agree that (2) is possible, but I assign it low credence. You don't think it's impossible either, but you seem to agree that relatively few things are in fact hardcoded, but also you think (2) is the resolution to the trilemma. But wouldn't that imply (3) instead, even though, perhaps for a select few concepts, (2) is the case? Here's some misc commentaries: (Nitpick for clarity) "Fact"? Be careful to not condition on your own hypothesis! I don't think you're literally doing as much, but for other readers, I want to flag this as importantly an inference on your part and not an observation. (LMK if I unintentionally do this elsewhere,

Computationally expensive things are less likely to show up in your simulation than the real world, because you only have so much compute to run your simulation. You can't convincingly fake the AI having access to a supercomputer.

The possibility is that Alice might always be able to tell that she’s in a simulation no matter what we condition on. I think this is pretty much precluded by the assumption that the generative model is a good model of the world, but if that fails then it’s possible Alice can tell she’s in a simulation no matter what we do. So a lot rides on the statement that the generative model remains a good model of the world regardless of what we condition on.

Paul's RSA-2048 counterexample is an example of a way our generative model can fail to be good enough no matter ... (read more)

2Charlie Steiner1y
I think this is only one horn of a dilemma. The other horn is if the generative model reasons about the world abstractly, so that it just gives us a good guess about what the output of the AI would be if it really was in the real world (and got to see some large hash collision). But now it seems likely that creating this generative model would require solving several tricky alignment problems so that it generalizes its abstractions to novel situations in ways we'd approve of.
1Adam Jermyn1y
I don’t think that’s an example of the model noticing it’s in a simulation. There’s nothing about simulations versus the real world that makes RSA instances more or less likely to pop up. Rather, that’s a case where the model just has a defecting condition and we don’t hit it in the simulation. This is what I was getting at with “other challenge” #2.

Liked this post a lot. In particular I think I strongly agree with "Eliezer raises many good considerations backed by pretty clear arguments, but makes confident assertions that are much stronger than anything suggested by actual argument" as the general vibe of how I feel about Eliezer's arguments. 

A few comments on the disagreements:

Eliezer often equivocates between “you have to get alignment right on the first ‘critical’ try” and “you can’t learn anything about alignment from experimentation and failures before the critical try.”

An in-between posit... (read more)

Fwiw, I interpreted this as saying that it doesn't work as a safety proposal (see also: my earlier comment). Also seems related to his arguments about ML systems having squiggles.

Yup.  You can definitely train powerful systems on imitation of human thoughts, and in the limit this just gets you a powerful mesa-optimizer that figures out how to imitate them.

I agree that the SW/HW analogy is not a good analogy for AGI safety (I think security is actually a better analogy), but I would like to present a defence of the idea that normal systems reliability engineering is not enough for alignment (this is not necessarily a defence of any of the analogies/claims in the OP).

Systems safety engineering leans heavily on the idea that failures happen randomly and (mostly) independently, so that enough failures happening together by coincidence to break the guarantees of the system is rare. That is:

  • RAID is based on the
... (read more)
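
To illustrate why the independence assumption carries so much weight, here is a minimal sketch comparing independent failures with a simple common-cause model. The common-cause model (one shared event that takes out every component at once) and all the numbers are my own illustrative assumptions, not something taken from the comment above.

```python
def p_all_fail_independent(p_single, n):
    """Probability that all n components fail, if failures are independent."""
    return p_single ** n

def p_all_fail_common_cause(p_single, n, p_common):
    """Probability that all n components fail when, with probability p_common,
    a single shared cause (e.g. an adversary) breaks every component at once;
    otherwise failures are independent as before."""
    return p_common + (1 - p_common) * p_single ** n

# Made-up numbers: 3 redundant components, each failing 1% of the time.
independent = p_all_fail_independent(0.01, 3)
correlated = p_all_fail_common_cause(0.01, 3, p_common=0.001)
print(independent, correlated, correlated / independent)
```

Even a 0.1% chance of a shared cause dominates the total failure probability here, which is the sense in which redundancy guarantees stop holding once failures are correlated or adversarially chosen.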

Some quick thoughts on these points:

  • I think the ability for humans to communicate and coordinate is a double edged sword. In particular, it enables the attack vector of dangerous self propagating memes. I expect memetic warfare to play a major role in many of the failure scenarios I can think of. As we've seen, even humans are capable of crafting some pretty potent memes, and even defending against human actors is difficult.
  • I think it's likely that the relevant reference class here is research bets rather than the "task" of AGI. An extremely successful
... (read more)

A few axes along which to classify optimizers:

  • Competence: An optimizer is more competent if it achieves the objective more frequently on distribution
  • Capabilities Robustness: An optimizer is more capabilities robust if it can handle a broader range of OOD world states (and thus possible perturbations) competently.
  • Generality: An optimizer is more general if it can represent and achieve a broader range of different objectives
  • Real-world objectives: whether the optimizer is capable of having objectives about things in the real world.

Some observations: it feels l... (read more)

One possible model of AI development is as follows: there exists some threshold beyond which capabilities are powerful enough to cause an x-risk, and such that we need alignment progress to be at the level needed to align that system before it comes into existence. I find it informative to think of this as a race where for capabilities the finish line is x-risk-capable AGI, and for alignment this is the ability to align x-risk-capable AGI. In this model, it is necessary but not sufficient for alignment to be ahead by the time it's at the fini... (read more)

Looking forward to seeing the survey results!

By the way, if you're an alignment researcher and compute is your bottleneck, please send me a DM. EleutherAI already has a lot of compute resources (as well as a great community for discussing alignment and ML!), and we're very interested in providing compute for alignment researchers with minimal bureaucracy required. 

I agree that there will be cases where we have ontological crises where it's not clear what the answer is, e.g. whether the mirrored dog counts as "healthy". However, I feel like the thing I'm pointing at is that there is some sort of closure of any given set of training examples where, for some fairly weak assumptions, we can know that everything in this expanded set is "definitely not going too far". As a trivial example, anything that is a direct logical consequence of anything in the training set would be part of the closure. I expect any ELK solutions to look something like that. This corresponds directly to the case where the ontology identification process converges to some set smaller than the entire set of all cases.
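
A toy illustration of such a closure. Everything here is hypothetical: the facts, the rules, and the `closure` helper are made up for the example, and an actual ELK solution would presumably look nothing like propositional forward chaining. The point is only that expanding a labeled set by direct logical consequence reaches a fixed point that can be strictly smaller than the set of all cases.

```python
def closure(facts, rules):
    """Forward-chain: keep adding conclusions whose premises are all already known."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in known and set(premises) <= known:
                known.add(conclusion)
                changed = True
    return known

# Hypothetical training labels and inference rules.
training_labels = {"A", "B"}
rules = [
    (("A", "B"), "C"),  # C follows directly from A and B
    (("C",), "D"),      # D follows from C
    (("E",), "F"),      # F would need E, which is never established
]
all_cases = {"A", "B", "C", "D", "E", "F"}

expanded = closure(training_labels, rules)  # fixed point of the expansion
unlabeled = all_cases - expanded            # cases the process never reaches
```

Here the process converges with `E` and `F` still unlabeled, which is the "smaller than the entire set of all cases" outcome rather than a perfect oracle.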

My understanding of the argument: if we can always come up with a conservative reporter (one that answers yes only when the true answer is yes), and this reporter can label at least one additional data point that we couldn't label before, we can use this newly expanded dataset to pick a new reporter, feed this process back into itself ad infinitum to label more and more data, and the fixed point of iterating this process is the perfect oracle. This would imply an ability to solve arbitrary model splintering problems, which seems like it would need to eithe... (read more)

2Alex Flint2y
That is a good summary of the argument. Thanks for this question. Consider a problem involving robotic surgery of somebody's pet dog, and suppose that there is a plan that would transform the dog's entire body as if it were mirror-imaged (left<->right). This dog will have a perfectly healthy body, but it will actually be unable to consume any food currently on Earth because the chirality of its biology will render it incompatible with the biology of any other living thing on the planet, so it will starve. We would not want to execute this plan and there are of course ordinary questions that, if we think to ask them, will reveal that the plan is bad ("will the dog be able to eat food from planet Earth?"). But if we only think to ask "is the dog healthy?" then we would really like our system not to try to extrapolate concepts of "healthy" to cases where a dog's entire body is mirror-imaged. But how would any system know that this particular simple geometric transformation (mirror-imaging) is dangerous, while other simple geometrical transformations on the dog's body -- such as physically moving or rotating the dog's body in space -- are benign? I think it would have to know what we value with respect to the dog. To make this example sharper, let the human providing the training examples be someone alive today who has no idea that biological chirality is a thing. How surprised they would be that their seemingly-healthy pet dog died a few days after the seemingly-successful surgery. Even if they did an autopsy on their dog to find out why it died, it would appear that all the organs and such were perfectly healthy and normal. It would be quite strange to them. Now what if it was the diamond that was mirror-imaged inside the vault? Is the diamond still in the vault? Yeah, of course, we don't care about a diamond being mirror imaged. It seems to me that the reason we don't care in the case of the diamond is that the qualities we value in a diamond are not affected

A GLUT can have constant time complexity using a hash table, which makes it a lot less clear that metalearning can be faster

2Paul Christiano2y
Minimal circuits are not quite the same as fastest programs---they have no adaptive computation, so you can't e.g. memorize a big table but only use part of it. In some sense it's just one more (relatively extreme) way of making a complexity-speed tradeoff. I basically agree that a GLUT is always faster than meta-learning if you have arbitrary adaptive computation. That said, I don't think it's totally right to call a GLUT constant complexity---if you have an n bit input and m bit output, then it takes at least n+m operations to compute the GLUT (in any reasonable low-level model of computation). There are even more speed-focused methods than minimal circuits. I think the most extreme versions are more like query complexity or communication complexity, which in some sense are just asking how fast you can make your GLUT---can you get away with reading only a small set of input bits? But being totally precise about "fastest" requires being a lot more principled about the model of computation.
0Marc Carauleanu2y
Any n-bit hash function will produce collisions when the number of elements in the hash table gets large enough (after the number of possible hashes stored in n bits has been reached) so adding new elements will require rehashing to avoid collisions making GLUT have a logarithmic time complexity in the limit. Meta-learning can also have a constant time complexity for an arbitrarily large number of tasks, but not in the limit, assuming a finite neural network.
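
For concreteness, a minimal sketch of a hash-table GLUT, using a Python dict and a toy target function (parity). `build_glut` and `parity` are illustrative names and the 10-bit input size is arbitrary. It shows both sides of the thread: lookup cost does not grow with the number of table entries on average, but hashing the key still has to read every input bit.

```python
from itertools import product

def build_glut(f, n_bits):
    """Precompute f on every n-bit input; the table has 2**n_bits entries."""
    return {bits: f(bits) for bits in product((0, 1), repeat=n_bits)}

def parity(bits):
    """Toy function to tabulate: 1 if an odd number of bits are set."""
    return sum(bits) % 2

glut = build_glut(parity, n_bits=10)

query = (1, 0, 1, 1, 0, 0, 0, 1, 0, 1)
# A single dict lookup, but hashing the tuple key still touches all 10 input bits.
answer = glut[query]
```

The table size is exponential in the input length, so the speed of the lookup is paid for in description length, which is the tradeoff the comments above are arguing about.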

From a zoomed-out perspective, the model is not modifying the loss landscape. This frame, however, does not give us a useful way of thinking about how gradient hacking might occur and how to avoid it.


I think that the main value of the frame is to separate out the different potential ways gradient hacking can occur. I've noticed that in discussions without this distinction, it's very easy to equivocate between the types, which leads to frustrating conversations where people fundamentally disagree without realizing (e.g. someone might be talking about s... (read more)

My attempt at a one sentence summary of the core intuition behind this proposal: if you can be sure your model isn’t optimizing for deceiving you, you can relatively easily tell if it’s trying to optimize for something you don’t want by just observing whether your model seems to be trying to do something obviously different from what you want during training, because it's much harder to slip under the radar by getting really lucky than by intentionally trying to.

Here's a handcrafted way of doing gradient protection in this case that I can think of: since these models are blocks of linear->bn(affine)->relu, if you make the beta in the affine sufficiently negative relative to gamma, you can completely zero out the output of that block and then the rest of the model can only learn a constant function. You can also get around L2: just set e.g. gamma to 0.00001 and beta to -0.01; this lets you have both really small parameter magnitudes and also still saturate the relu. As this model is trained on the base objective it should converge to a cons... (read more)
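
A minimal NumPy sketch of the block described above, using the gamma and beta values from the comment. The layer sizes, batch size, random inputs, and the `block` helper are assumptions made for illustration, and batchnorm is implemented by hand in training mode (batch statistics) rather than with any particular framework.

```python
import numpy as np

rng = np.random.default_rng(0)

def block(x, W, b, gamma, beta, eps=1e-5):
    """linear -> batchnorm (batch statistics, affine gamma/beta) -> relu"""
    h = x @ W + b
    h_norm = (h - h.mean(axis=0)) / np.sqrt(h.var(axis=0) + eps)
    return np.maximum(gamma * h_norm + beta, 0.0)

# Arbitrary illustrative shapes: batch of 64, 16 inputs, 32 hidden units.
x = rng.normal(size=(64, 16))
W = rng.normal(size=(16, 32))
b = rng.normal(size=32)

# Tiny gamma, slightly negative beta: the normalized activations are scaled
# down far below |beta|, so every pre-relu value is negative.
out = block(x, W, b, gamma=1e-5, beta=-0.01)

# Per-feature squared magnitude of (gamma, beta), i.e. what an L2 penalty sees.
l2_per_feature = 1e-5**2 + (-0.01) ** 2
```

With a batch of 64, a normalized activation cannot exceed about 8 in magnitude, so `gamma * h_norm` stays below 1e-4 and the relu output is exactly zero everywhere. Because the relu's derivative is zero for negative inputs, no gradient flows back through the block in this state, while the L2 contribution of gamma and beta stays around 1e-4 per feature.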

The ultimate goal of this project is to exhibit a handcrafted set of model weights for a reasonably noncontrived model architecture which, when tuned with SGD, results in some malicious subnetwork in the model learning some mesaobjective that we specified via some section of the model weights completely different from the base objective, without the entire mesaoptimizer falling apart in the process. We haven't succeeded at this goal yet but I would say this goal is very much to exhibit gradient hacking. 

I don't think redundancy will work. Suppos... (read more)

To make sure I understand your notation, f1 is some set of weights, right? If it's a set of multiple weights I don't know what you mean when you write ∂y/∂f1. (I don't yet understand the purpose of this claim, but it seems to me wrong. If C(f1,f1)=C(f2,f2) for every f1,f2, why is it true that C(f1,f2) does not depend on f1 and f2 when f1≠f2?)

I think this is something I and many others at EleutherAI would be very interested in working on, since it seems like something that we'd have a uniquely big comparative advantage at. 

One very relevant piece of infrastructure we've built is our evaluation framework, which we use for all of our evaluation since it makes it really easy to evaluate your task on GPT-2/3/Neo/NeoX/J etc. We also have a bunch of other useful LM related resources, like intermediate checkpoints for GPT-J-6B that we are looking to use in our interpretability work, for example. ... (read more)