All of rohinmshah's Comments + Replies

Re-Define Intent Alignment?

This is an assumption about the world -- not all worlds can be usefully described by partial models.

They can't? Why not?

Maybe the "usefully" part is doing a lot of work here -- can all worlds be described (perhaps not usefully) by partial models? If so, I think I have the same objection, since it doesn't seem like any of the technical results in InfraBayes depend on some notion of "usefulness".

(I think it's pretty likely I'm just flat out wrong about something here, given how little I've thought about InfraBayesianism, but if so I'd like to know how I'm wrong.)

AXRP Episode 10 - AI’s Future and Impacts with Katja Grace

Planned summary for the Alignment Newsletter:

This podcast goes over various strands of research from [AI Impacts](https://aiimpacts.org/), including lots of work that I either haven’t covered or have covered only briefly in this newsletter:

**AI Impacts’ methodology.** AI Impacts aims to advance the state of knowledge about AI and AI risk by recursively decomposing important high-level questions and claims into subquestions and subclaims, until reaching a question that can be relatively easily answered by gathering data. They generally aim to provide new fa

... (read more)
[AN #157]: Measuring misalignment in the technology underlying Copilot

Yet there is a difference when scaling. If Gwern is right (or if LMs become more like what he's describing as they get bigger), then we end up with a single agent which we probably shouldn't trust because of all our many worries with alignment. On the other hand, if scaled-up LMs are non-agentic/simulator-like, then they would stay motivationless, and there would be at least the possibility to use them to help alignment research, for example by trying to simulate non-agenty systems.

Yeah, I agree that in the future there is a difference. I don't think we know which of these situations we're going to be in (which is maybe what you're arguing). Idk what Gwern predicts.

1Adam Shimi7hExactly. I'm mostly arguing that I don't think the case for the agent situation is as clear cut as I've seen some people defend it, which doesn't mean it's not possibly true.
Experimentally evaluating whether honesty generalizes

Planned summary for the Alignment Newsletter:

The highlighted post introduced the notion of optimism about generalization. On this view, if we train an AI agent on question-answer pairs (or comparisons) where we are confident in the correctness of the answers (or comparisons), the resulting agent will continue to answer honestly even on questions where we wouldn’t be confident of the answer.

While we can’t test exactly the situation we care about -- whether a superintelligent AI system would continue to answer questions honestly -- we _can_ test an analogous

... (read more)
Answering questions honestly instead of predicting human answers: lots of problems and some solutions

Hmm, I'm not thinking about the complexity part at all right now; I'm just thinking mechanically about what is implied by your equations.

the prior is more like “generate parameters that specify some condition, then sample parameters that make that condition true.”

I'm not sure exactly what you mean by the parameters specifying some condition. I thought the condition was specified upfront by the designer (though of course to check the condition you need to look at both parameters, so you can view this as the first set of parameters specifying a condition on ... (read more)

2Evan Hubinger5hThe only difference between this setup and normal ML is the prior/complexity—you still have the ability to learn all the same functions, it's just that some are more/less likely now. Yep, that's exactly right. That's definitely not what should happen in that case. Note that there is no relation between θ1 and f1 or θ2 and f2—both sets of parameters contribute equally to both heads. Thus, θ1 can enforce any condition it wants on θ2 by leaving some particular hole in how it computes f1 and f2 and forcing θ2 to fill in that hole in such a way to make θ1's computation of the two heads come out equal.
Re-Define Intent Alignment?

My central complaint about existing theoretical work is that it doesn't seem to be trying to explain why neural nets learn good programs that generalize well, even when they have enough parameters to overfit and can fit a randomly labeled dataset. It seems like you need to make some assumption about the real world (i.e. an assumption about your dataset, or the training process that generated it), which people seem loathe to do.

I don't currently see how any of the alignment community's tools address that complaint; for example I don't think the InfraBayes work so far is making an interesting assumption about reality. Perhaps future work will address this though?

4Abram Demski9hInfraBayes doesn't look for the regularity in reality that NNs are taking advantage of, agreed. But InfraBayes is exactly about "what kind of regularity assumptions can we realistically make about reality?" You can think of it as a reaction to the unrealistic nature of the regularity assumptions which Solomonoff induction makes. So it offers an answer to the question "what useful+realistic regularity assumptions could we make?" The InfraBayesian answer is "partial models". IE, the idea that even if reality cannot be completely described by usable models, perhaps we can aim to partially describe it. This is an assumption about the world -- not all worlds can be usefully described by partial models. However, it's a weaker assumption about the world than usual. So it may not have presented itself as an assumption about the world in your mind, since perhaps you were thinking more of stronger assumptions. If it's a good answer, it's at least plausible that NNs work well for related reasons. But I think it also makes sense to try to get at the useful+realistic regularity assumptions from scratch, rather than necessarily making it all about NNs
Refactoring Alignment (attempt #2)

The definition of "objective robustness" I used says "aligns with the base objective" (including off-distribution). But I think this isn't an appropriate representation of your approach. Rather, "objective robustness" has to be defined something like "generalizes acceptably". Then, ideas like adversarial training and checks and balances make sense as a part of the story.

Yeah, strong +1.

2Abram Demski9hGreat! I feel like we're making progress on these basic definitions.
Teaching ML to answer questions honestly instead of predicting human answers

Planned summary for the Alignment Newsletter:

This post presents an algorithm that aims to solve the second problem from the highlighted post. As a reminder, the second problem is that an AI system that already has to make predictions about humans might learn a policy that is just “say what humans would say”, since that is simpler than learning another translation that maps its knowledge to human language (so that it can answer honestly to the best of its knowledge).

The core idea is to have a “simple” labeling process and a “complex” labeling process, where

... (read more)
A naive alignment strategy and optimism about generalization

Planned summary for the Alignment Newsletter:

We want to build an AI system that answers questions honestly, to the best of its ability. One obvious approach is to have humans generate answers to questions, select the question-answer pairs where we are most confident in the answers, and train an AI system on those question-answer pairs.

(I’ve described this with a supervised learning setup, but we don’t have to do that: we could also [learn](https://deepmind.com/blog/learning-through-human-feedback/) from [comparisons](https://ai-alignment.com/optimizing-wit

... (read more)
Teaching ML to answer questions honestly instead of predicting human answers

Some confusions I have:

Why does  need to include part of the world model? Why not instead have  be the parameters of the two heads, and  be the parameters of the rest of the model?

This would mean that you can’t initialize  to be equal to , but I don’t see why that’s necessary in the first place -- in particular it seems like the following generative model should work just fine:

(I’ll be thinking of this setup for the rest of my comment, ... (read more)

Answering questions honestly instead of predicting human answers: lots of problems and some solutions

except now  is checked over all inputs, not just over the dataset (note that we still update on the dataset at the end—it's just our prior which is now independent of it).

Doesn't this mean that the two heads have to be literally identical in their outputs? It seems like at this point your prior is "generate parameters randomly under the constraint that the two heads are identical", which seems basically equivalent to having a single head and generating parameters randomly, so it seems unintuitive that this can do anything useful.

(Disclaimer: I ... (read more)

2Evan Hubinger1dThat's not what the prior looks like—the prior is more like “generate parameters that specify some condition, then sample parameters that make that condition true.” Thus, you don't need to pay for the complexity of satisfying the condition, only the complexity of specifying it (as long as you're content with the simplest possible way to satisfy it). This is why the two-step nature of the algorithm is necessary—the prior you're describing is what would happen if you used a one-step algorithm rather than a two-step algorithm (which I agree would then not do anything).
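One way to write down the two-step prior described above (notation introduced here for illustration, not taken from the post):

```latex
% Sample \theta_1, then sample \theta_2 conditioned on the condition C_{\theta_1}
% that \theta_1 specifies (e.g. ``the two heads agree''):
p(\theta_1, \theta_2) \;=\; p(\theta_1)\cdot p\big(\theta_2 \,\big|\, C_{\theta_1}(\theta_1, \theta_2) \text{ holds}\big)

% so the cost of a pair decomposes as
-\log p(\theta_1, \theta_2) \;=\; -\log p(\theta_1) \;-\; \log p\big(\theta_2 \,\big|\, C_{\theta_1} \text{ holds}\big)
```

The second term charges θ2 only relative to other condition-satisfying parameters, which is the sense in which you pay for specifying the condition (via θ1) but not for satisfying it.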
Refactoring Alignment (attempt #2)

I like the addition of the pseudo-equivalences; the graph seems a lot more accurate as a representation of my views once that's done.

But how? In prosaic AI, only on-distribution behavior of the loss function can influence the end result.

I can see a few possible responses here.

  1. Double down on the "correct generalization" story: hope to somehow avoid the multiple plausible generalizations, perhaps by providing enough training data, or appropriate inductive biases in the system (probably both).
  2. Achieve objective robustness through other means. In particular, in
... (read more)
4Abram Demski1dBut it seems to me that there's something missing in terms of acceptability. The definition of "objective robustness" I used says "aligns with the base objective" (including off-distribution). But I think this isn't an appropriate representation of your approach. Rather, "objective robustness" has to be defined something like "generalizes acceptably". Then, ideas like adversarial training and checks and balances make sense as a part of the story. WRT your suggestions, I think there's a spectrum from "clean" to "not clean", and the ideas you propose could fall at multiple points on that spectrum (depending on how they are implemented, how much theory backs them up, etc). So, yeah, I favor "cleaner" ideas than you do, but that doesn't rule out this path for me.
[AN #156]: The scaling hypothesis: a plan for building AGI

I think effective certification is likely to involve expert analysis (including non-technical domain experts) of specific algorithms used in specific contexts.  This appears to contradict the "Second" point above somewhat.

The idea with the "Second" point is that the certification would be something like "we certify that company X has a process Y for analyzing and fixing potential problem Z whenever they build a new algorithm / product", which seems like it is consistent with your belief here? Unless you think that the process isn't enough, you need to certify the analysis itself.

Re-Define Intent Alignment?

One question I have is how you hope to define a good notion of "acceptable" without a notion of intent. 

I don't hope this; I expect to use a version of "acceptable" that uses intent. I'm happy with "acceptable" = "trying to do what we want".

If you are pessimistic about extracting mesa-objectives, why are you optimistic about providing feedback about how to reason?

I'm pessimistic about mesa-objectives existing in actual systems, based on how people normally seem to use the term "mesa-objective". If you instead just say that a "mesa objective" is "whate... (read more)

5Abram Demski1dAll of that made perfect sense once I thought through it, and I tend to agree with most it. I think my biggest disagreement with you is that (in your talk) you said you don't expect formal learning theory work to be relevant. I agree with your points about classical learning theory, but the alignment community has been developing basically-classical-learning-theory tools which go beyond those limitations. I'm optimistic that stuff like Vanessa's InfraBayes could help here. Granted, there's a big question of whether that kind of thing can be competitive. (Although there could potentially be a hybrid approach.)
Teaching ML to answer questions honestly instead of predicting human answers

I realize that unit-type-checking ML is pretty uncommon and might just be insane

Nah, it's a great trick.

The two parameter distances seem like they're in whatever distance metric you're using for parameter space, which seems to be very different from the logprobs.

The trick here is that L2 regularization / weight decay is equivalent to having a Gaussian prior on the parameters, so you can think of that term as -log p(θ) under that Gaussian prior (minus an irrelevant additive constant), where σ is set to imply whatever hyperparameter you used for your weight decay... (read more)
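For reference, a sketch of the standard equivalence being invoked, for a Gaussian prior θ ~ N(0, σ²I) over d parameters (generic notation, not the post's):

```latex
-\log p(\theta) \;=\; \frac{\lVert\theta\rVert_2^2}{2\sigma^2} \;+\; \frac{d}{2}\log\big(2\pi\sigma^2\big)
```

The first term is an L2 penalty with weight-decay coefficient λ = 1/(2σ²), and the second is the irrelevant additive constant, so picking σ to match the weight-decay hyperparameter puts the parameter-distance terms on the same log-probability scale as the logprobs.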

[AN #157]: Measuring misalignment in the technology underlying Copilot

Sure, but don't you agree that it's a very confusing use of the term?

Maybe? Idk, according to me the goal of alignment is "create a model that is motivated to help us", and so misalignment = not-alignment = "the model is not motivated to help us". Feels pretty clear to me but illusion of transparency is a thing.

I am making a claim that for the purposes of alignment of capable systems, you do want to talk about "motivation". So to the extent GPT-N / Codex-N doesn't have a motivation, but is existentially risky, I'm claiming that you want to give it a motivat... (read more)

3Adam Shimi11hSorry for the delay in answering, I was a bit busy. That makes some sense, but I do find the "motivationless" state interesting from an alignment point of view. Because if it has no motivation, it also doesn't have a motivation to do all the things we don't want. We thus get some corrigibility by default, because we can change its motivation just by changing the prompt. Agreed that there's not much difference when predicting GPT-3. But that's because we're at the place in the scaling where Gwern (AFAIK) describes the LM as an agent very good at predicting-agent. By definition it will not do anything different from a simulator, since its "goal" literally encodes all of its behavior. Yet there is a difference when scaling. If Gwern is right (or if LMs become more like what he's describing as they get bigger), then we end up with a single agent which we probably shouldn't trust because of all our many worries with alignment. On the other hand, if scaled-up LMs are non-agentic/simulator-like, then they would stay motivationless, and there would be at least the possibility to use them to help alignment research, for example by trying to simulate non-agenty systems. Fair enough. Yeah, you're probably right.
Decoupling deliberation from competition

Planned summary for the Alignment Newsletter:

Under a [longtermist](https://forum.effectivealtruism.org/tag/longtermism) lens, one problem to worry about is that even after building AI systems, humans will spend more time competing with each other rather than figuring out what they want, which may then lead to their values changing in an undesirable way. For example, we may have powerful persuasion technology that everyone uses to persuade people to their line of thinking; it seems bad if humanity’s values are determined by a mix of effective persuasion too

... (read more)
[AN #157]: Measuring misalignment in the technology underlying Copilot

I think that this is a very good example where the paper (based on your summary) and your opinion assumes some sort of higher agency/goals in GPT-3 than what I feel we have evidence for. 

Where do you see any assumption of agency/goals?

(I find this some combination of sad and amusing as a commentary on the difficulty of communication, in that I feel like I tend to be the person pushing against ascribing goals to GPT.)

Maybe you're objecting to the "motivated" part of that sentence? But I was saying that it isn't motivated to help us, not that it is moti... (read more)

1Adam Shimi5dSorry for ascribing to you beliefs you don't have. I guess I'm just used to people here and in other places assuming goals and agency in language models, and also some of your choices of words sounded very goal-directed/intentional-stance to me. Sure, but don't you agree that it's a very confusing use of the term? Like, if I say GPT-3 isn't trying to kill me, I'm not saying it is trying to kill anyone, but I'm sort of implying that it's the right framing to talk about it. In this case, the "motivated" part did trigger me, because it implied that the right framing is to think about what Codex wants, which I don't think is right (and apparently you agree). (Also the fact that gwern, who ascribes agency to GPT-3, quoted specifically this part in his comment [https://www.lesswrong.com/posts/cyTP4ZMnN6RFu9L62/an-157-measuring-misalignment-in-the-technology-underlying?commentId=pbPgSHzc2aav6XwiF] is more evidence that you're implying agency to different people) Agreed with you there. True, but I don't feel like there is a significant difference between Codex and GPT-3 in terms of size or training to warrant different conclusions with regard to ascribing goals/agency. First, I think I interpreted "misalignment" here to mean "inner misalignment", hence my answer. I also agree that all examples in Victoria's doc [https://docs.google.com/spreadsheets/d/e/2PACX-1vRPiprOaC3HsCf5Tuum8bRfzYUiKLRqJmbOoC-32JorNdfyTiRRsR7Ea5eWtvsWzuxo8bjOxCG84dAg/pubhtml] are showing misalignment. That being said, I still think there is a difference with the specification gaming stuff. Maybe the real reason it feels weird for me to call this behavior of Codex misalignment is that it is so obvious? Almost all specification gaming examples are subtle, or tricky, or exploiting bugs. They're things that I would expect a human to fail to find, even given the precise loss and training environment. Whereas I expect any human to complete buggy code with buggy code once you explain to them that C
Re-Define Intent Alignment?

Are you the historical origin of the robustness-centric approach?

Idk, probably? It's always hard for me to tell; so much of what I do is just read what other people say and make the ideas sound sane to me. But stuff I've done that's relevant:

  • Talk at CHAI saying something like "daemons are just distributional shift" in August 2018, I think. (I remember Scott attending it.)
  • Talk at FHI in February 2020 that emphasized a risk model where objectives generalize but capabilities don't.
  • Talk at SERI conference a few months ago that explicitly argued for a focus on
... (read more)
5Abram Demski2dI've watched your talk at SERI now. One question I have is how you hope to define a good notion of "acceptable" without a notion of intent. In your talk, you mention looking at why the model does what it does, in addition to just looking at what it does. This makes sense to me (I talk about similar things), but, it seems just about as fraught as the notion of mesa-objective: 1. It requires approximately the same "magic transparency tech" as we need to extract mesa-objectives. 2. Even with magical transparency tech, it requires additional insight as to which reasoning is acceptable vs unacceptable. If you are pessimistic about extracting mesa-objectives, why are you optimistic about providing feedback about how to reason? More generally, what do you think "acceptability" might look like? (By no means do I mean to say your view is crazy; I am just looking for your explanation.)
Progress on Causal Influence Diagrams

Planned summary for the Alignment Newsletter:

Many of the problems we care about (reward gaming, wireheading, manipulation) are fundamentally a worry that our AI systems will have the _wrong incentives_. Thus, we need Causal Influence Diagrams (CIDs): a formal theory of incentives. These are <@graphical models@>(@Understanding Agent Incentives with Causal Influence Diagrams@) in which there are action nodes (which the agent controls) and utility nodes (which determine what the agent wants). Once such a model is specified, we can talk about various inc

... (read more)
Re-Define Intent Alignment?

(Meta: was this meant to be a question?)

In contrast, the generalization-focused approach puts less emphasis on the assumption that the worst catastrophes are intentional.

I don't think this is actually a con of the generalization-focused approach. From the post you link, one of the two questions in that approach (the one focused on robustness) is:

How do we ensure the model generalizes acceptably out of distribution?

Part of the problem is to come up with a good definition of "acceptable", such that this is actually possible to achieve. (See e.g. the "Definin... (read more)

4Abram Demski6dI originally conceived of it as such, but in hindsight, it doesn't seem right. By no means did I intend it to be a con. I'll try to edit to clarify. I think it is a real pro of the generalization-focused approach that it does not rely on models having mesa-objectives (putting it in Evan's terms, there is a real possibility of addressing objective robustness without directly addressing inner alignment). So, focusing on objective robustness seems like a potential advantage -- it opens up more avenues of attack. Plus, the generalization-focused approach requires a much weaker notion of "outer alignment", which may be easier to achieve as well. But, of course, it may also turn out that the only way to achieve objective robustness is to directly tackle inner alignment. And it may turn out that the weaker notion of outer alignment is insufficient in reality. Are you the historical origin of the robustness-centric approach? I noticed that Evan's post has the modified robustness-centric diagram in it, but I don't know if it was edited to include that. The "Objective Robustness and Inner Alignment Terminology" post attributes it to you (at least, attributes a version of it to you). (I didn't look at the references there yet.)
[AN #156]: The scaling hypothesis: a plan for building AGI

what is the measure of data-sets in the N-datapoint hypercube such that the trained model is aligned?", perhaps also weighting by ease of specification in some sense.

You're going to need the ease of specification condition, or something similar; else you'll probably run into no-free-lunch considerations (at which point I think you've stopped talking about anything useful).

[AN #156]: The scaling hypothesis: a plan for building AGI

The fourth bullet point claims that GPT-N will go on filling in missing words rather than doing a treacherous turn.

?? I said nothing about a treacherous turn? And where did I say it would go on filling in missing words?

EDIT: Ah, you mean the fourth bullet point in ESRogs' response. I was thinking of that as one example of how such reasoning could go wrong, as opposed to the only case. So in that case the model_1 predicts a treacherous turn confidently, but this is the wrong epistemic state to be in because it is also plausible that it just "fills in words" ... (read more)

4Daniel Kokotajlo6dOK cool, sorry for the confusion. Yeah, I think ESRogs' interpretation of you was making a bit stronger claim than you actually were.
[AN #156]: The scaling hypothesis: a plan for building AGI

Yeah, I agree with all this. I still think the pretraining objective basically doesn't matter for alignment (beyond being "reasonable") but I don't think the argument I've given establishes that.

I do think the arguments in support of Claim 2 are sufficient to at least raise Claim 3 to attention (and thus Claim 4 as well).

2Alex Turner7dSure. Additional note for posterity: when I talked about "some objectives [may] make alignment far more likely", I was considering something like "given this pretraining objective and an otherwise fixed training process, what is the measure of data-sets in the N-datapoint hypercube such that the trained model is aligned?", perhaps also weighting by ease of specification in some sense.
Fractional progress estimates for AI timelines and implied resource requirements

Planned summary:

One [methodology](https://www.overcomingbias.com/2012/08/ai-progress-estimate.html) for forecasting AI timelines is to ask experts how much progress they have made to human-level AI within their subfield over the last T years. You can then extrapolate linearly to see when 100% of the problem will be solved. The post linked above collects such estimates, with a typical estimate being 5% of a problem being solved in the twenty year period between 1992 and 2012. Overall these estimates imply a timeline of [372 years](https://aiimpacts.org/surv

... (read more)
2Daniel Kokotajlo8dHuh, then it seems I misunderstood you then! The fourth bullet point claims that GPT-N will go on filling in missing words rather than doing a treacherous turn. But this seems unsupported by the argument you made, and in fact the opposite seems supported by the argument you made. The argument you made was: Seems to me the conclusion of this argument is that "In general it's not true that the AI is trying to achieve its training objective." The natural corollary is: We have no idea what the AI is trying to achieve, if it is trying to achieve anything at all. So instead of concluding "It'll probably just keep filling in missing words as in training" we should conclude "we have no idea what it'll do; treacherous turn is a real possibility because that's what'll happen for most goals it could have, and it may have a goal for all we know."
[AN #156]: The scaling hypothesis: a plan for building AGI

Wrote a separate comment here (in particular I think claims 1 and 4 are directly relevant to safety)

[AN #156]: The scaling hypothesis: a plan for building AGI

This comment is inspired by a conversation with Ajeya Cotra.

As a simple example of how the scaling hypothesis affects AI safety research, it suggests that the training objective (“predict the next word”) is relatively unimportant in determining properties of the trained agent; in contrast, the dataset is much more important. This suggests that analyses based on the “reward function used to train the agent” are probably not going to be very predictive of the systems we actually build.

To elaborate on this more:

Claim 1: Scaling hypothesis + abundance of data ... (read more)

4Alex Turner8dAnalogous claim: since any program specifiable under UTM U1 is also expressible under UTM U2, choice of UTM doesn't matter. And this is true up to a point: up to constant factors, it doesn't matter [https://en.wikipedia.org/wiki/Kolmogorov_complexity#Invariance_theorem]. But U1 can make it easier (simpler, faster, etc.) to specify a set of programs than does U2. And so "there exists a program in U2-encoding which implements P in U1-encoding" doesn't get everything I want: I want to reason about the distribution of programs, about how hard it tends to be to get programs with desirable properties. Stepping out of the analogy, even though I agree that "reasonable" pretraining objectives are all compatible with aligned / unaligned / arbitrarily behaved models, this argument seems to leave room that some objectives make alignment far more likely, a priori. And you may be noting as much:
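For reference, the invariance theorem linked in the reply above (standard statement): for any two universal machines U1 and U2 there is a constant c, depending only on the two machines, such that

```latex
\big|\, K_{U_1}(x) - K_{U_2}(x) \,\big| \;\le\; c_{U_1, U_2} \quad \text{for all strings } x
```

So switching UTMs shifts description lengths by at most a constant, which is compatible with the choice still changing which particular programs count as simple -- the point the analogy leans on.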
6ESRogs9dI got a bit confused by this section, I think because the word "model" is being used in two different ways, neither of which is in the sense of "machine learning model". Paraphrasing what I think is being said:

  • An observer (us) has a model_1 of what GPT-N is doing.
  • According to their model_1, GPT-N is building its own world model_2, that it uses to plan its actions.
  • The observer's model_1 makes good predictions about GPT-N's behavior when GPT-N (the machine learning model_3) is tested on data that comes from the training distribution, but bad predictions about what GPT-N will do when tested (or used) on data that does not come from the training distribution.
  • The way that the observer's model_1 will be wrong is not that it will be fooled by GPT-N taking a treacherous turn, but rather the opposite -- the observer's model_1 will predict a treacherous turn, but instead GPT-N will go on filling in missing words, as in training (or something else?).

Is that right?
rohinmshah's Shortform

But what if it writes, like, 20 posts in the first 20 days which are that good, but then afterwards it hits diminishing returns because the rationality-related points it makes are no longer particularly novel and exciting?

I'd be pretty surprised if that happened. GPT-3 already knows way more facts than I do, and can mimic far more writing styles than I can. It seems like by the time it can write any good posts (without cherrypicking), it should quickly be able to write good posts on a variety of topics in a variety of different styles, which should let it ... (read more)

rohinmshah's Shortform

20,000 LW karma: Holy shit that's a lot of karma for one year.

I was thinking 365 posts * ~50 karma per post gets you most of the way there (18,250 karma), and you pick up some additional karma from comments along the way.  50 karma posts are good but don't have to be hugely insightful; you can also get a lot of juice by playing to the topics that tend to get lots of upvotes. Unlike humans the bot wouldn't be limited by writing speed (hence my restriction of one post per day). AI systems should be really, really good at writing, given how easy it is to... (read more)

5Daniel Kokotajlo13dOK, fair enough. But what if it writes, like, 20 posts in the first 20 days which are that good, but then afterwards it hits diminishing returns because the rationality-related points it makes are no longer particularly novel and exciting? I think this would happen to many humans if they could work at super-speed. That said, I don't think this is that likely I guess... probably AI will be unable to do even three such posts, or it'll be able to generate arbitrary numbers of them. The human range is small. Maybe. Idk.
Model-based RL, Desires, Brains, Wireheading

I note that even experts sometimes sloppily talk as if RL agents make plans towards the goal of maximizing future reward—see for example Pitfalls of Learning a Reward Function Online.

Fwiw, I think most analysis of this form starts from the assumption "the agent is maximizing future reward" and then reasoning out from there. I agree with you that such analysis probably doesn't apply to RL agents directly (since RL agents do not necessarily make plans towards the goal of maximizing future reward), but it can apply to e.g. planning agents that are specificall... (read more)

rohinmshah's Shortform

Rohin will at some point before 2030 read an AI-written blog post on rationality that he likes more than the typical LW >30 karma post.

This seems more feasible, because you can cherrypick a single good example. I wouldn't be shocked if someone on LW spent a lot of time reading AI-written blog posts on rationality and posted the best one, and I liked that more than a typical >30 karma post. My default guess is that no one tries to do this, so I'd still give it < 50% (maybe 30%?), but conditional on someone trying I think probably 80% seems right.

Na

... (read more)
5Daniel Kokotajlo14dAh right, good point, I forgot about cherry-picking. I guess we could make it be something like "And the blog post wasn't cherry-picked; the same system could be asked to make 2 additional posts on rationality and you'd like both of them also." I'm not sure what credence I'd give to this but it would probably be a lot higher than 10%. Website prediction: Nice, I think that's like 50% likely by 2030. Major research area: What counts as a major research area? Suppose I go calculate that Alpha Fold 2 has already sped up the field of protein structure prediction by 100x (don't need to do actual experiments anymore!), would that count? If you hadn't heard of AlphaFold yet, would you say it counted? Perhaps you could give examples of the smallest and easiest-to-automate research areas that you think have only a 10% chance of being automated by 2030. 20,000 LW karma: Holy shit that's a lot of karma for one year. I feel like it's possible that would happen before it's too late (narrow AI good at writing but not good at talking to people and/or not agenty) but unlikely. Insofar as I think it'll happen before 2030 it doesn't serve as a good forecast because it'll be too late by that point IMO. Productivity tool UI's obsolete thanks to assistants: This is a good one too. I think that's 50% likely by 2030. I'm not super certain about any of these things of course, these are just my wild guesses for now.
rohinmshah's Shortform

What won't we be able to do by (say) the end of 2025? (See also this recent post.) Well, one easy way to generate such answers would be to consider tasks that require embodiment in the real world, or tasks that humans would find challenging to do. (For example, “solve the halting problem”, “produce a new policy proposal that has at least a 95% chance of being enacted into law”, “build a household robot that can replace any human household staff”.) This is cheating, though; the real challenge is in naming something where there’s an adjacent thing that _does... (read more)

4Daniel Kokotajlo14dNice! I really appreciate that you are thinking about this and making predictions. I want to do the same myself. I think I'd put something more like 50% on "Rohin will at some point before 2030 read an AI-written blog post on rationality that he likes more than the typical LW >30 karma post." That's just a wild guess, very unstable. Another potential prediction generation methodology: Name something that you think won't happen, but you think I think will.
BASALT: A Benchmark for Learning from Human Feedback

That makes sense, though I'd also expect that LfLH benchmarks like BASALT could turn out to be a better fit for superscale models in general.

Oh yeah, it totally is, and I'd be excited for that to happen. But I think that will be a single project, whereas the benchmark reporting process is meant to apply for things where there will be lots of projects that you want to compare in a reasonably apples-to-apples way, so when designing the reporting process I'm focused more on the small-scale projects that aren't GPT-N-like.

It's also possible this has already be

... (read more)
Problems facing a correspondence theory of knowledge

Planned summary for the Alignment Newsletter:

Probability theory can tell us about how we ought to build agents that have knowledge (start with a prior, and perform Bayesian updates as evidence comes in). However, this is not the only way to create knowledge: for example, humans are not ideal Bayesian reasoners. As part of our quest to <@_describe_ existing agents@>(@Theory of Ideal Agents, or of Existing Agents?@), could we have a theory of knowledge that specifies when a particular physical region within a closed system is “creating knowledge”? We w

... (read more)
3Ben Pace1dThis was a great summary, thx.
3Alex Flint15dYour summaries are excellent Rohin. This looks good to me.
BASALT: A Benchmark for Learning from Human Feedback

Would you anticipate the benchmark version of this would ask participants to disclose metrics such as "amount of task-specific feedback or data used in training"?

Probably not, just because it's pretty niche -- I expect the vast majority of papers (at least in the near future) will have only task-specific feedback, so the extra data isn't worth the additional hassle. (The prompting approach seems like it would require a lot of compute.)

Tbc, "amount of task-specific feedback" should still be inferable from research papers, where you are meant to provide enou... (read more)

3Edouard Harris16dThat makes sense, though I'd also expect that LfLH benchmarks like BASALT could turn out to be a better fit for superscale models in general. (e.g. a BASALT analogue might do a better job of capturing the flexibility of GPT-N or DALL-E type models than current benchmarks do, though you'd probably need to define a few hundred tasks for that to be useful. It's also possible this has already been done and I'm unaware of it.)
BASALT: A Benchmark for Learning from Human Feedback

It might be that for such tasks solutions using handcrafted rewards/heuristics would be viable, but wouldn't scale to more complex tasks. 

I agree that's possible. Tbc, we did spend some time thinking about how we might use handcrafted rewards / heuristics to solve the tasks, and eliminated a couple based on this, so I think it probably won't be true here.

Suppose the designer trains a neural network to recognize trees in minecraft by getting the minecraft engine to generate lots of images of trees. The resulting network is then used as a hardcoded part

... (read more)
BASALT: A Benchmark for Learning from Human Feedback

Okay, regardless of what the AI safety community claims, I want to make that claim.

(I think a substantial chunk of the AI safety community also makes that claim but I'm not interested in defending that here.)

It being hard to specify reward functions for a specific task and it being hard to specify reward functions for a more general AGI seem to me like two very different problems. 

As an aside, if I thought we could build task-specific AI systems for arbitrary tasks, and only super general AI systems were dangerous, I'd be advocating really hard for st... (read more)

2Vanessa Kosoy19dThe problem with this is that you need an AI whose task is "protect humanity from unaligned AIs", which is already very "general" in a way (i.e. requires operating on large scales of space, time and strategy). Unless you can effectively reduce this to many "narrow" tasks which is probably not impossible but also not easy.
BASALT: A Benchmark for Learning from Human Feedback

they allow handcrafted reward functions and heuristics

We allow it, but we don't think it will lead to good performance (unless you throw a very large amount of time at it).

The AI safety community claims it is hard to specify reward functions. If we actually believe this claim, we should be able to create tasks where even if we allow people to specify reward functions, they won't be able to do so. That's what we've tried to do here.

Note we do ban extraction of information from the Minecraft simulator -- you have to work with pixels, so if you want to make h... (read more)

3Vanessa Kosoy19dRight, but you're also going for tasks that are relatively simple and easy. In the sense that, "MakeWaterfall" is something that I can, based on my own experience, imagine solving without any ML at all (but ofc going to that extreme would require massive work). It might be that for such tasks solutions using handcrafted rewards/heuristics would be viable, but wouldn't scale to more complex tasks. If your task was e.g. "follow arbitrary natural language instructions" then I wouldn't care about the "lax" rules. This is certainly good, but I wonder what are the exact rules here. Suppose the designer trains a neural network to recognize trees in minecraft by getting the minecraft engine to generate lots of images of trees. The resulting network is then used as a hardcoded part of the agent architecture. Is that allowed? If not, how well can you enforce it (I imagine something of the sort can be done in subtler ways)? Not saying that what you're doing is not useful, just pointing out a certain way in which the benchmark might diverge from its stated aim.
2Christian Kleineidam19dIt being hard to specify reward functions for a specific task and it being hard to specify reward functions for a more general AGI seem to me like two very different problems. Additionally, developing a safe system and developing a nonsafe system are very different. Even if your reward function works 99,9% of the time it can be exploited in those cases where it fails.
BASALT: A Benchmark for Learning from Human Feedback

Or would you rather not have this sort of marketing?

I would be excited to see this competition promoted widely!

(Obviously I wouldn't want to do anything that reflected really poorly on the marketers, like blackmail, but this seems to clearly not be in that category.)

A world in which the alignment problem seems lower-stakes

I'm not sure if you're arguing that this is a good world in which to think about alignment. If you are arguing this, then I disagree.

It seems like in this formalization the human has to write down the code that contains human values, ship it off to the right side of the universe via a one-time-only Godly Transfer Of Information, and then that code needs to do things perfectly. (You can't have any transfer of information back, since otherwise the AI can affect humans.) But it seems like this rules out a huge number of promising-seeming approaches to alignme... (read more)

4Alex Turner20dI am not arguing this. Quoting my reply to ofer: (Edited post to clarify)
Problems facing a correspondence theory of knowledge

Is this sequence complete? I was expecting a final literature review post before summarizing for the newsletter, but it's been a while since the last update and you've posted something new, so maybe you decided to skip it?

4Alex Flint18dThe sequence is now complete.
3Alex Flint21dIt's actually written, just need to edit and post. Should be very soon. Thanks for checking on it.
Parameter counts in Machine Learning

I also looked into number of training points very briefly, Googling suggests AlexNet used 90 epochs on ImageNet's 1.3 million train images, while AlphaZero played 44 million games for chess (I didn't quickly find a number for Go), suggesting that the number of images was roughly similar to the number of games.

So I think probably the remaining orders of magnitude are coming from the tree search part of MCTS (which causes there to be > 200 forward passes per game).
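A rough numeric check of this accounting, using the figures quoted above (a sketch; these are the comment's order-of-magnitude estimates, not measured values):

```python
# AlexNet: 90 epochs over ~1.3M ImageNet training images (figures from the comment above).
alexnet_forward_passes = 90 * 1.3e6            # ~1.2e8

# AlphaZero (chess): 44M self-play games, with >200 forward passes per game from MCTS
# (both figures from the comment above).
alphazero_forward_passes = 44e6 * 200          # ~8.8e9

# Ratio is roughly 75x, i.e. about two orders of magnitude.
print(alphazero_forward_passes / alexnet_forward_passes)
```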

Parameter counts in Machine Learning

Planned summary for the Alignment Newsletter:

This post presents a dataset of the parameter counts of 139 ML models from 1952 to 2021. The resulting graph is fairly noisy and hard to interpret, but suggests that:

1. There was no discontinuity in model size in 2012 (the year that AlexNet was published, generally acknowledged as the start of the deep learning revolution).
2. There was a discontinuity in model size for language in particular some time between 2016-18.

Planned opinion:

You can see my thoughts on the trends in model size in [this comment](https://ww

... (read more)
Parameter counts in Machine Learning

I think the story for the discontinuity is basically "around 2018 industry labs realized that language models would be the next big thing" (based on Attention is all you need, GPT-2, and/or BERT), and then they switched their largest experiments to be on language (as opposed to the previous contender, games).

Similarly for games, if you take DQN to be the event causing people to realize "large games models will be the next big thing", it does kinda look like there's a discontinuity there (though there are way fewer points so it's harder to tell, also I'm in... (read more)

3Alex Ray1moOne reason it might not be fitting as well for vision, is that vision has much more weight-tying / weight-reuse in convolutional filters. If the underlying variable that mattered was compute, then image processing neural networks would show up more prominently in compute (rather than parameters).
1gwern1moCould it be inefficient scaling? Most work not explicitly using scaling laws to plan it seems to generally overestimate in compute per parameter, using too-small models. Anyone want to try to apply Jones 2021 [https://arxiv.org/abs/2104.03113] to see if AlphaZero was scaled wrong?
5Paul Christiano1moI think that in a forward pass, AlexNet uses about 10-15 flops per parameter (assuming 4 bytes per parameter and using this table [https://github.com/albanie/convnet-burden]), because it puts most of its parameters in the small convolutions and FC layers. But I think AlphaZero has most of its parameters in 19x19 convolutions, which involve 722 flops per parameter (19 x 19 x 2). If that's right, it accounts for a factor of 50; combined with game length that's 4 orders of magnitude explained. I'm not sure what's up with the last order of magnitude. I think that's a normal amount of noise / variation across different tasks, though I would have expected AlexNet to be somewhat overtrained given the context. I also think the comparison is kind of complicated because of MCTS and distillation (e.g. AlphaZero uses much more than 1 forward pass per turn, and you can potentially learn from much shorter effective horizons when imitating the distilled targets).
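A quick check of the flops-per-parameter accounting in the reply above (the ~200 positions per game is an illustrative assumption, not a figure from the reply):

```python
alexnet_flops_per_param = 12              # "about 10-15 flops per parameter" per forward pass
alphazero_flops_per_param = 19 * 19 * 2   # 19x19 convolutions: 722 flops per parameter

flops_ratio = alphazero_flops_per_param / alexnet_flops_per_param  # ~60, the "factor of 50"

positions_per_game = 200                  # assumed game length, for illustration only
print(flops_ratio * positions_per_game)   # ~1.2e4, on the order of "4 orders of magnitude"
```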
AXRP Episode 8 - Assistance Games with Dylan Hadfield-Menell

Planned summary for the Alignment Newsletter:

As with most other podcasts, I will primarily link you to my past summaries of the papers discussed in the episode. In this case they were all discussed in the special issue [AN #69](https://mailchi.mp/59ddebcb3b9a/an-69-stuart-russells-new-book-on-why-we-need-to-replace-the-standard-model-of-ai) on Human Compatible and the various papers relevant to it. Some points that I haven’t previously summarized:

1. The interviewee thinks of assistance games as an _analytical tool_ that allows us to study the process by wh

... (read more)
Environmental Structure Can Cause Instrumental Convergence

This particular argument is not talking about farsightedness -- when we talk about having more options, each option is talking about the entire journey and exact timesteps, rather than just the destination. Since all the "journeys" starting with the S --> Z action go to Z first, and all the "journeys" starting with the S --> A action go to A first, the isomorphism has to map A to Z and vice versa, so that .

(What assumption does this correspond to in the theorem? In the theorem, the involution has to map  to a subset... (read more)

Environmental Structure Can Cause Instrumental Convergence

Thanks for the correction.

That one in particular isn't a counterexample as stated, because you can't construct a subgraph isomorphism for it. When writing this I thought that actually meant I didn't need more of a caveat (contrary to what I said to you earlier), but now thinking about it a third time I really do need the "no cycles" caveat. The counterexample is:

Z <--> S --> A --> B

With every state also having a self-loop.

In this case, the involution {S:B, Z:A} would suggest that S --> Z would have more options than S --> A, but optimal ... (read more)

1Ofer Givoli1moProbably not an important point, but I don't see why we can't use the identity isomorphism (over the part of the state space that a1 leads to).
4Alex Turner1moMy take on it has been, the theorem's bottleneck assumption implies that you can't reach S again after taking action a1 or a2, which rules out cycles.