All of Steven Byrnes's Comments + Replies

[Intro to brain-like-AGI safety] 14. Controlled AGI

consider the fusion power generator scenario

It's possible that I misunderstood what you were getting at in that post. I thought delegation-to-GPT-N was a central part of the story: i.e., maybe GPT-N knew that the designs could be used for bombs, but it didn't care to tell the human, because the human didn't ask. But from what you're saying now, I guess GPT-N has nothing to do with the story? You could have equally well written the post as “Suppose, a few years from now, I set about trying to design a cheap, simple fusion power generator - something I could... (read more)

johnswentworth (5d, 2 points)
Basically, yeah. The important point (for current purposes) is that, as the things-the-system-is-capable-of-doing-or-building scale up, we want the system's ability to notice subtle problems to scale up with it. If the system is capable of designing complex machines way outside what humans know how to reason about, then we need similarly-superhuman reasoning about whether those machines will actually do what a human intends. "With great power comes great responsibility" - cheesy, but it fits.
[Intro to brain-like-AGI safety] 14. Controlled AGI

I don’t think “the human is deciding whether or not she cares about Ems” is a different set of mental activities from “the human is trying to make sense of a confusing topic”, or “the human is trying to prove a theorem”, etc.

So from my perspective, what you said sounds like “Write code for a Social-Instinct AGI, and then stamp the word subroutine on that code, and then make an “outer AI” with the power to ‘query’ that ‘subroutine’.” From that perspective, I would be concerned that if the (so-called) subroutine never wanted to do anything bad or stupid, the... (read more)

johnswentworth (6d, 4 points)
An example might be helpful here: consider the fusion power generator scenario [https://www.lesswrong.com/posts/2NaAhMPGub8F2Pbr7/the-fusion-power-generator-scenario] . In that scenario, a human thinking about what they want arrives at the wrong answer, not because of uncertainty about their own values, but because they don't think to ask the right questions about how the world works. That's the sort of thing I have in mind. In order to handle that sort of problem, an AI has to be able to use human values somehow without carrying over other specifics of how a human would reason about the situation. I think I disagree with this claim. Maybe not exactly as worded - like, sure, maybe the "set of mental activities" involved in the reasoning overlap heavily. But I do expect (weakly, not confidently) that there's a natural notion of human-value-generator which factors from the rest of human reasoning, and has a non-human-specific API (e.g. it interfaces with natural abstractions). It sounds to me like you're imagining something which emulates human reasoning to a much greater extent than I'm imagining.
[Intro to brain-like-AGI safety] 14. Controlled AGI

should be conceptually straightforward to model how humans would reason about those concepts or value them

Let’s say that the concept of an Em had never occurred to me before, and now you knock on my door and tell me that there’s a thing called Ems, and you know how to make them but you need my permission, and now I have to decide whether or not I care about the well-being of Ems. What do I do? I dunno, I would think about the question in different ways, I would try to draw analogies to things I already knew about, maybe I would read some philosophy papers,... (read more)

johnswentworth (6d, 4 points)
We don't necessarily need the AGI itself to have human-like drives, intuitions, etc. It just needs to be able to model the human reasoning algorithm well enough to figure out what values humans assign to e.g. an em. (I expect an AI which relied heavily on human-like reasoning for things other than values would end up doing something catastrophically stupid, much as humans are prone to do.)
[Intro to brain-like-AGI safety] 14. Controlled AGI

This approach is probably particularly characteristic of my approach.

Yeah, you were one of the “couple other people” I alluded to. The other was Tan Zhi-Xuan (if I was understanding her correctly during our most recent (very brief) conversation).

my approach … ontological lock …

I think I know what you’re referring to, but I’m not 100% sure, and other people reading this probably won’t. Can you provide a link? Thanks.

[Intro to brain-like-AGI safety] 14. Controlled AGI

Thanks! One of my current sources of mild skepticism (which again you might talk me out of) is:

• For capabilities reasons, the AGI will probably need to be able to add things to its world-model / ontology, including human-illegible things, and including things that don't exist in the world but which the AGI imagines (and could potentially create).
• If the AGI is entertaining a plan of changing the world in important ways (e.g. inventing and deploying mind-upload technology, editing its own code, etc.), it seems likely that the only good way of evalua... (read more)
johnswentworth (6d, 5 points)
I expect that there will be concepts the AI finds useful which humans don't already understand. But these concepts should still be of the same type as human concepts - they're still the same kind of natural abstraction. Analogy: a human who grew up in a desert tribe with little contact with the rest of the world may not have any concept of "snow", but snow is still the kind-of-thing they're capable of understanding if they're ever exposed to it. When the AI uses concepts humans don't already have, I expect them to be like that. As long as the concepts are the type of thing humans can recognize/understand, then it should be conceptually straightforward to model how humans would reason about those concepts or value them.
[Intro to brain-like-AGI safety] 14. Controlled AGI

Thanks! Follow-up question: Do you see yourself as working towards “Proof Strategy 2”? Or “none of the above”?

johnswentworth (7d, 4 points)
This part of Proof Strategy 1 is a basically-accurate description of what I'm working towards: ... it's just not necessarily about objects localized in 3D space. Also, there's several possible paths, and they don't all require unambiguous definitions of all the "things" in a human's ontology. For instance, if corrigibility turns out to be a natural "thing", that could short-circuit the need for a bunch of other rigorous concepts.
Everything I Need To Know About Takeoff Speeds I Learned From Air Conditioner Ratings On Amazon

I would say “Humanity's current state is so spectacularly incompetent that even the obvious problems with obvious solutions might not be solved”.

If humanity were not spectacularly incompetent, then maybe we wouldn't have to worry about the obvious problems with obvious solutions. But we would still need to worry about the obvious problems with extremely difficult and non-obvious solutions.

Everything I Need To Know About Takeoff Speeds I Learned From Air Conditioner Ratings On Amazon

The nonobvious problems are the whole reason why AI alignment is hard in the first place.

I disagree with the implication that there’s nothing to worry about on the “obvious problems” side.

An out-of-control AGI self-reproducing around the internet, causing chaos and blackouts etc., is an “obvious problem”. I still worry about it.

After all, consider this: an out-of-control virus self-reproducing around the human population, causing death and disability etc., is also an “obvious problem”.  We already have this problem; we’ve had this problem for millenni... (read more)

johnswentworth (1mo, 4 points)
There is an important difference here between "obvious in advance" and "obvious in hindsight", but your basic point is fair, and the virus example is a good one. Humanity's current state is indeed so spectacularly incompetent that even the obvious problems might not be solved, depending on how things go.
Another list of theories of impact for interpretability

Nice post!

I really don't know much about frontal lobotomy patients. I’ll irresponsibly speculate anyway.

I think “figuring out the solution to tricky questions” has a lot in common with “getting something tricky done in the real world”, despite the fact that one involves “internal” actions (i.e., thinking the appropriate thoughts) and the other is “external” actions (i.e., moving the appropriate muscles). I think they both require the same package of goal-oriented planning, trial-and-error exploration via RL, and so on. (See discussion of “RL-on-thoughts” h... (read more)

Takeoff speeds have a huge effect on what it means to work on AI x-risk

I guess it depends on “how fast is fast and how slow is slow”, and what you say is true on the margin, but here's my plea that the type of thinking that says “we want some technical problem to eventually get solved, so we try to solve it” is a super-valuable type of thinking right now even if we were somehow 100% confident in slow takeoff. (This is mostly an abbreviated version of this section.)

1. Differential Technological Development (DTD) seems potentially valuable, but is only viable if we know what paths-to-AGI will be safe & beneficial really far in
[Intro to brain-like-AGI safety] 6. Big picture of motivation, decision-making, and RL

Huh. I would have invoked a different disorder.

I think that if we replace the Thought Assessor & Steering Subsystem with the function “RPE = +∞ (regardless of what's going on)”, the result is a manic episode, and if we replace it with the function “RPE = -∞ (regardless of what's going on)”, the result is a depressive episode.

In other words, the manic episode would be kinda like the brainstem saying “Whatever thought you're thinking right now is a great thought! Whatever you're planning is an awesome plan! Go forth and carry that plan out with gusto!!!!... (read more)
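The clamped-RPE picture above can be sketched as a toy value learner. This is my own illustrative construction, not anything from the post: `run`, the learning rate, and the finite ±10 stand-ins for ±∞ are all made up for the sketch.

```python
import random

# Toy value-learning sketch: a learner updates its value estimate of "the
# current thought" from reward prediction errors (RPEs). Clamping the RPE
# to a large positive or negative constant stands in for the "manic" /
# "depressive" override described above.

def run(n_steps, rpe_override=None, lr=0.1):
    value = 0.0  # estimated value of the thought being assessed
    for _ in range(n_steps):
        reward = random.gauss(0.0, 1.0)  # actual outcome, mean zero
        rpe = reward - value             # normal reward prediction error
        if rpe_override is not None:
            rpe = rpe_override           # override: "everything is great/awful"
        value += lr * rpe
    return value

random.seed(0)
print(run(1000))                    # tracks the true mean reward, stays near 0
print(run(1000, rpe_override=+10))  # value estimate inflates without bound
print(run(1000, rpe_override=-10))  # value estimate collapses without bound
```

With the override in place, the feedback loop that normally anchors the value estimate to reality is severed, which is the point of the analogy: the assessment no longer depends on "what's going on".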

[Intro to brain-like-AGI safety] 6. Big picture of motivation, decision-making, and RL

If it makes sense to differentiate the "Thought Generator" and "Thought Assessor" as two separate modules, is it possible to draw a parallel to language models, which seem to have strong ability to generate sentences, but lack the ability to assess if they are good?

Hmm. An algorithm trained to reproduce human output is presumably being trained to imitate the input-output behavior of the whole system including Thought Generator and Thought Assessor and Steering Subsystem.

I’m trying to imagine deleting the Thought Assessor & Steering Subsystem, and repla... (read more)

Rafael Harth (1mo, 1 point)
What do you picture a language model with no inhibitions looking like? Because if I try to imagine it, then "something that outputs reasonable sounding text until sooner or later it fails hard" seems to be a decent fit. Of course, I haven't thought much about the generator/assessor distinction. I mean, surely "inhibitions" of the language model don't map onto human inhibitions, right? Like, a language model without the assessor module (or a much worse assessor module) is just as likely to imitate someone who sounds unrealistically careful as someone who has no restraints. I find your last paragraph convincing, but that of course makes me put more credence into the theory rather than less.
Different perspectives on concept extrapolation

I really like this. Ever since I read your first model splintering post, it's been a central part of my thinking too.

I feel cautiously optimistic about the prospects for generating multiple hypotheses and detecting when they come into conflict out-of-distribution (although the details are kinda different for the Bayes-net-ish models that I tend to think about than the deep neural net models that I understand y'all are thinking about).

I remain much more confused about what to do when that detector goes off, in a future AGI.

I imagine a situation where some i... (read more)

Call For Distillers

I think I weakly disagree with the implication that “distillation” should be thought of as a different category of activity from “original research”. It is in a superficial sense, but a lot of the underlying activities and skills and motivations overlap. For example, original researchers also have the experience of reading something, feeling confused about it, and then eventually feeling less confused about it. They just might not choose to spend the time writing up how they came to be less confused. Conversely, someone trying to understand something for t... (read more)

Spencer Becker-Kahn (1mo, 4 points)
I agree, i.e. I also (fairly weakly) disagree with the value of thinking of 'distilling' as a separate thing. Part of me wants to conjecture that it comes from thinking of alignment work predominantly as mathematics or a hard science in which the standard 'unit' is an original theorem or result which might be poorly written up but can't really be argued against much. But if we think of the area (I'm thinking predominantly about more conceptual/theoretical alignment) as a 'softer', messier, ongoing discourse full of different arguments from different viewpoints and under different assumptions, with counter-arguments, rejoinders, clarifications, retractions etc. that takes place across blogs, papers, talks, theorems, experiments etc. that all somehow slowly works to produce progress, then it starts to be less clear what this special activity called 'distilling' really is. Another relevant point, but one which I won't bother trying to expand on much here, is that a research community assimilating - and then eventually building on - complex ideas can take a really long time. [At risk of extending into a rant, I also just think the term is a bit off-putting. Sure, I can get the sense of what it means from the word and the way it is used - it's not completely opaque or anything - but I'd not heard it used regularly in this way until I started looking at the alignment forum. What's really so special about alignment that we need to use this word? Do we think we have figured out some new secret activity that is useful for intellectual progress that other fields haven't figured out? Can we not get by using words like "writing" and "teaching" and "explaining"?]
Why Agent Foundations? An Overly Abstract Explanation

I think you're saying: if a thing is messy, at least there can be a non-messy procedure / algorithm that converges to (a.k.a. points to) the thing. I think I'm with Charlie in feeling skeptical about this in regards to value learning, because I think value learning is significantly a normative question. Let me elaborate:

My genes plus 1.2e9 seconds of experience have built a fundamentally messy set of preferences, which are in some cases self-inconsistent, easily-manipulated, invalid-out-of-distribution, etc. It's easy enough to point to the set ... (read more)

Job Offering: Help Communicate Infrabayesianism

There was one paragraph from the podcast that I found especially enlightening—I excerpted it here (Section 3.2.3).

[Intro to brain-like-AGI safety] 5. The “long-term predictor”, and TD learning

I’m advocating for the first one— is trying to predict the next ground-truth injection. Does something trouble you about that?

Rafael Harth (2mo, 3 points)
No; it was just that something about how the post explained it made me think that it wasn't #1.
Alex Ray's Shortform

I think of myself as a generally intelligent agent with a compact world model

In what sense? Your world-model is built out of ~100 trillion synapses, storing all sorts of illegible information including “the way my friend sounds when he talks with his mouth full” and “how it feels to ride a bicycle whose gears need lubrication”.

(or a compact function which is able to estimate and approximate a world model)

That seems very different though! The GPT-3 source code is rather compact (gradient descent etc.); combine it with data and you get a huge and extraordina... (read more)

Alex Ray's Shortform

RE legibility: In my mind, I don’t normally think there’s a strong connection between agent foundations and legibility.

If the AGI has a common-sense understanding of the world (which presumably it does), then it has a world-model, full of terabytes of information of the sort “tires are usually black” etc. It seems to me that either the world-model will be either built by humans (e.g. Cyc), or (much more likely) learned automatically by an algorithm, and if it’s the latter, it will be unlabeled by default, and it’s on us to label it somehow, and there’s no ... (read more)

Alex Gray (2mo, 3 points)
I think your explanation of legibility here is basically what I have in mind, excepting that if it's human designed it's potentially not all encompassing. (For example, a world model that knows very little, but knows how to search for information in a library) I think interpretability is usually a bit more narrow, and refers to developing an understanding of an illegible system. My take is that it is not "interpretability" to understand a legible system, but maybe I'm using the term differently than others here. This is why I don't think "interpretability" applies to systems that are designed to be always-legible. (In the second graph, "interpretability" is any research that moves us upwards) I agree that the ability to come up with totally alien and untranslatable-to-humans ideas gives AGI a capabilities boost. I do think that requiring a system to only use legible cognition and reasoning is a big "alignment tax". However I don't think that this tax is equivalent to a strong proof that legible AGI is impossible. I think my central point of disagreement with this comment is that I do think that it's possible to have compact world models (or at least compact enough to matter). I think if there was a strong proof that it was not possible to have a generally intelligent agent with a compact world model (or a compact function which is able to estimate and approximate a world model), that would be an update for me. (For the record, I think of myself as a generally intelligent agent with a compact world model)
[Intro to brain-like-AGI safety] 5. The “long-term predictor”, and TD learning

Here's a twitter thread wherein Nathaniel Daw gently pushes back on my dopamine neuron discussion of Section 5.5.6.

Late 2021 MIRI Conversations: AMA / Discussion

Huh. I'm under the impression that "offense-defense balance for technology-inventing AGIs" is also a big cruxy difference between you and Eliezer.

Specifically: if almost everyone is making helpful aligned norm-following AGIs, but one secret military lab accidentally makes a misaligned paperclip maximizer, can the latter crush all competition? My impression is that Eliezer thinks yes: there's really no defense against self-replicating nano-machines, so the only paths to victory are absolutely perfect compliance forever (which he sees as implausible, given s... (read more)

Rohin Shah (2mo, 5 points)
I agree that is also moderately cruxy (but less so, at least for me, than "high-capabilities alignment is extremely difficult").
One datapoint I really liked about this: https://arxiv.org/abs/2104.03113 (Scaling Laws for Board Games). They train AlphaGo agents of different sizes to compete on the game Hex. The approximate takeaway, quoting the author: “if you are in the linearly-increasing regime [where return on compute is nontrivial], then you will need about 2× as much compute as your opponent to beat them 2/3 of the time.” This might suggest that, absent additional asymmetries (like constraints on the aligned AIs massively hampering them), the win ratio may be roughly proportional to the compute ratio. If you assume we can get global data center governance, I’d consider that a sign in favor of the world’s governments. (Whether you think that’s good is a political stance that I believe folks here may disagree on.) Bonus quote: “This behaviour is strikingly similar to that of a toy model where each player chooses as many random numbers as they have compute, and the player with the highest number wins. In this toy model, doubling your compute doubles how many random numbers you draw, and the probability that you possess the largest number is 2/3. This suggests that the complex game play of Hex might actually reduce to each agent having a ‘pool’ of strategies proportional to its compute, and whoever picks the better strategy wins. While on the basis of the evidence presented herein we can only consider this to be serendipity, we are keen to see whether the same behaviour holds in other games.”
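The toy model in that quote is simple enough to check numerically. Here is a quick Monte Carlo sketch (my own, not from the paper): each player draws uniform random numbers in proportion to its "compute", and whoever holds the single largest number wins. A 2:1 compute ratio should then give a 2/3 win probability, and in general a ratio r gives r/(r+1).

```python
import random

def win_prob(compute_a, compute_b, trials=100_000, seed=0):
    """Estimate P(player A wins) when A draws compute_a uniform numbers
    and B draws compute_b, and the highest single number wins."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        best_a = max(rng.random() for _ in range(compute_a))
        best_b = max(rng.random() for _ in range(compute_b))
        wins += best_a > best_b
    return wins / trials

print(win_prob(2, 1))    # ~0.667: 2x compute wins about 2/3 of the time
print(win_prob(10, 5))   # same 2:1 ratio, same ~2/3
print(win_prob(3, 1))    # 3:1 ratio -> ~0.75, i.e. ratio/(ratio+1)
```

The closed form follows because all compute_a + compute_b draws are exchangeable, so the probability that the overall maximum belongs to player A is compute_a / (compute_a + compute_b).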
[Intro to brain-like-AGI safety] 6. Big picture of motivation, decision-making, and RL

Thanks!

Right, I think there's one reward function (well, one reward function that's relevant for this discussion), and that for every thought we think, we're thinking it because it's rewarding to do so—or at least, more rewarding than alternative thoughts. Sometimes a thought is rewarding because it involves feeling good now, sometimes it's rewarding because it involves an expectation of feeling good in the distant future, sometimes it's rewarding because it involves an expectation that it will make your beloved friend feel good, sometimes it's rewarding b... (read more)

Late 2021 MIRI Conversations: AMA / Discussion

I wrote Consequentialism & Corrigibility shortly after and partly in response to the first (Ngo-Yudkowsky) discussion. If anyone has an argument or belief that the general architecture / approach I have in mind (see the “My corrigibility proposal sketch” section) is fundamentally doomed as a path to corrigibility and capability—as opposed to merely “reliant on solving lots of hard-but-not-necessarily-impossible open problems”—I'd be interested to hear it. Thanks in advance. :)

Shah and Yudkowsky on alignment failures

Again I don't have an especially strong opinion about what our prior should be on possible motivation systems for an AGI trained by straightforward debate, and in particular what fraction of those motivation systems are destructive. But I guess I'm sufficiently confident in “>50% chance that it's destructive” that I'll argue for that. I'll assume the AGI uses model-based RL, which would (I claim) put it very roughly in the same category as humans.

Some aspects of motivation have an obvious relationship / correlation to the reward signal. In the human cas... (read more)

Rohin Shah (3mo, 5 points)
Fwiw 50% on doom in the story I told seems plausible to me; maybe I'm at 30% but that's very unstable. I don't think we disagree all that much here. Capability windows are totally part of the objection. If you completely ignore capability windows / compute restrictions then you just run AIXI (or AIXI-tl if you don't want something uncomputable) and die immediately.
Shah and Yudkowsky on alignment failures

The RL-on-thoughts discussion was meant as an argument that a sufficiently capable AI needs to be “trying” to do something. If we agree on that part, then you can still say my previous comment was a false dichotomy, because the AI could be “trying” to (for example) “win the debate while following the spirit of the debate rules”.

And yes I agree, that was bad of me to have listed those two things as if they're the only two options.

I guess what I was thinking was: If we take the most straightforward debate setup, and if it gets an AI that is “trying” to do so... (read more)

Rohin Shah (3mo, 4 points)
I agree that we don't have a plan that we can be justifiably confident in right now. I don't see why the "destructive consequences" version is most likely to arise, especially since it doesn't seem to arise for humans. (In terms of Rob's continuum, humans seem much closer to #2-style trying.)
Shah and Yudkowsky on alignment failures

is there a little goal-oriented mind inside there that solves science problems the same way humans solve them, by engineering mental constructs that serve a goal of prediction, including backchaining for prediction goals and forward chaining from alternative hypotheses / internal tweaked states of the mental construct?

In case it helps anyone to hear different people talking about the same thing, I think Eliezer in this quote is describing a similar thing as my discussion here (search for the phrase “RL-on-thoughts”).

So my objection to debate (which again I... (read more)

Rohin Shah (3mo, 4 points)
In that particular non-failure story, I'm definitely imagining that they aren't "trying to win the debate" (where "trying" is a very strong word that implies taking over the world to win the debate). I didn't really get into this with Eliezer but like Richard [https://www.lesswrong.com/posts/DJnvFsZ2maKxPi7v7/what-s-up-with-confusingly-pervasive-consequentialism?commentId=aAdpwxuH8eDWvewuK] I'm pretty unclear on why "not trying to win the debate" (with the strong sense of trying) implies "insufficiently capable to be pivotal". I don't think humans are "trying" in the strong sense, but we sure seem very capable; it doesn't seem crazy to imagine that this continues. I wasn't really gearing up to argue anything. For most of this conversation I was in the mode of "what is the argument that convinces Eliezer of near-certain doom (rather than just suggesting it is plausible), because I don't see it".
[Intro to brain-like-AGI safety] 5. The “long-term predictor”, and TD learning

The predictor is a parametrized function output = f(context, parameters) (where "parameters" are also called "weights"). If (by assumption) context is static, then you're running the function on the same inputs over and over, so you have to keep getting the same answer. Unless there's an error changing the parameters / weights. But the learning rate on those parameters can be (and presumably would be) relatively low. For example, the time constant (for the exponential decay of a discrepancy between output and supervisor when in "override mode") could be ma... (read more)
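The slow-decay point can be made concrete with a one-parameter toy predictor. This is my own minimal sketch, assuming the simplest possible f (a single weight times a static context) and a made-up learning rate; the real system is of course far richer.

```python
# Toy version of output = f(context, parameters) in "override mode": the
# supervisor supplies ground truth, the context is static, and a small
# learning rate makes the discrepancy between output and supervisor decay
# exponentially rather than jumping at once.

def train_override(supervisor=1.0, weight=0.0, lr=0.01, steps=300):
    context = 1.0  # static input, so output is just weight * context
    history = []
    for _ in range(steps):
        output = weight * context
        history.append(output)
        error = supervisor - output     # discrepancy under override
        weight += lr * error * context  # slow exponential approach
    return history

h = train_override()
# discrepancy shrinks by a factor of (1 - lr) each step: h[n] = 1 - 0.99**n
print(h[0], h[100], h[299])
```

With lr = 0.01 the time constant is roughly 100 steps, which illustrates why a low learning rate on the predictor's parameters keeps the output stable even though the same function is evaluated over and over.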

[Intro to brain-like-AGI safety] 4. The “short-term predictor”

I'm not 100% sure and didn't chase down the reference, but in context, I believe the claim “the [infant decorticate rats] appear to suckle normally and develop into healthy adult rats” should be read as “they find their way to their mother's nipple and suckle”, not just “they suckle when their mouth is already in position”.

Pathfinding to a nipple doesn't need to be “pathfinding” per se, it could potentially be as simple as moving up an odor gradient, and randomly reorienting when hitting an obstacle. I dunno, I tried watching a couple videos of neonatal mi... (read more)
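The "gradient climbing plus random reorientation" hypothesis is simple enough to simulate. Below is my own toy sketch (the step size, thresholds, and 2D setup are all invented for illustration): an agent with no map or planning keeps its heading while the "odor" is getting stronger and re-orients randomly otherwise, and still reliably ends up at the source.

```python
import math
import random

def seek(start=(5.0, 5.0), steps=20000, seed=0):
    """Greedy odor-gradient climber: odor source at the origin, odor
    strength increases as distance decreases. Keep heading while odor
    strengthens; pick a fresh random heading when it weakens."""
    rng = random.Random(seed)
    x, y = start
    heading = rng.uniform(0, 2 * math.pi)
    for _ in range(steps):
        dist = math.hypot(x, y)
        if dist < 0.2:
            return True                    # reached the "nipple"
        nx = x + 0.1 * math.cos(heading)
        ny = y + 0.1 * math.sin(heading)
        if math.hypot(nx, ny) < dist:      # odor got stronger: keep going
            x, y = nx, ny
        else:                              # odor got weaker: re-orient
            heading = rng.uniform(0, 2 * math.pi)
    return False

print(seek())
```

Nothing here resembles "pathfinding" in the planning sense, yet the agent homes in on the source, which is the point: a hypothalamus-and-brainstem-style rote policy could plausibly suffice.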

TLW (3mo, 3 points)
Alright, I see what you're saying now. Thanks for the conversation!
[Intro to brain-like-AGI safety] 4. The “short-term predictor”

I find myself skeptical of treating e.g. the behavior of Virginia opossum newborns as either solely driven by the hypothalamus and brainstem ("newborn opossums climb up into the female opossum's pouch and latch onto one of her 13 teats.", especially when combined with "the average litter size is 8–9 infants") or learnt from scratch (among other things, gestation lasts 11–13 days).

Hmm. Why don't you think that behavior might be solely driven by the hypothalamus & brainstem?

For what it's worth, decorticate infant rats (rats whose cortex was surgical... (read more)

TLW (3mo, 3 points)
I tend to treat hypothalamus & brainstem reactions as limited to a single rote set of (possibly-repetitive) motions driven by a single clear stimulus. The sort of thing that I could write a bit of Python-esque pseudocode for. Withdrawal reflexes match that. Hormonal systems match that[1]. Blink reflex matches that. Suckling matches that. Pathfinding from point A to any of points B-Z in the presence of dynamic obstacles, properly orienting, then suckling? Not so much... (That being said, this is not my area of expertise.) On the one hand, fair. On the other hand, one of the main interesting points about the Bouba/Kiki effect is that it appears to be a human universal[2]. I'd consider it unlikely[3] that there's enough shared training data across a bunch of 3-month-olds to bias them towards said effect[4][5][6]. 1. ^ From what I've seen, anyway. I haven't spent too too much time digging into details here. 2. ^ Or as close as anything psychological ever gets to a human universal, at least. 3. ^ Though not impossible. See also the mouth-shape hypothesis for the Bouba/Kiki effect. 4. ^ Obvious caveat is obvious: said study did not test 3-month-olds across a range of cultures. 5. ^ There's commonalities in e.g. the laws of Physics, of course. 6. ^ Arguably, this is "additional arguments".
[Intro to brain-like-AGI safety] 4. The “short-term predictor”

Thanks for the great comment!

if the upstream circuit learns entirely from scratch, you can't really have hardwired downstream predictors, for lack of anything stable to hardwire them to. I don't see a clear argument for the premise.

That would be Post #2 :-)

consider the following hilariously oversimplified sketch of how to have hardwired predictors in an otherwise mainly-learning-from-scratch circuit…

I don't have strong a priori opposition to this (if I understand it correctly), although I happen to think that it's not how any part of the brain works.

TLW (3mo, 3 points)
Fair! There are many plausible models that the human brain isn't. I haven't seen much of anything (beyond the obvious) that said sketch explicitly contradicts, I agree. I realize now that I probably should have explained the why (as opposed to the what) of my sketch a little better[1]. Your model makes a fair bit of intuitive sense to me; your model has an immediately-obvious[2] potential flaw/gap of "if only the hypothalamus and brainstem are hardwired and the rest learns from scratch, how does that explain <insert innately-present behavior here>[3]", that you do acknowledge but largely gloss over. My sketch (which is really more of a minor augmentation - as you say it coexists with the rest of this fairly well) can explain that sort of thing[4], with a bunch of implications that seem intuitively plausible[5]. 1. ^ Read: at all. 2. ^ I mean, it was to me? Then again I find myself questioning my calibration of obviousness more and more... 3. ^ I find myself skeptical of treating e.g. the behavior of Virginia opossum newborns[6] as either solely driven by the hypothalamus and brainstem[7] ("newborn opossums climb up into the female opossum's pouch and latch onto one of her 13 teats.", especially when combined with "the average litter size is 8–9 infants") or learnt from scratch (among other things, gestation lasts 11–13 days). 4. ^ And, in general, can explain behaviors 'on the spectrum' from fully-innate to fully-learnt. Which, to be fair, is in some ways a strike against it. 5. ^ E.g. if everyone starts with the same basic rudimentary audio and video processing and learns from there I would expect people to have closer audio and video processing 'algorithms' than might be
Why I'm co-founding Aligned AI

Hmm, the only overlap I can see between your recent work and this description (including optimism about very-near-term applications) is the idea of training an ensemble of models on the same data, and then if the models disagree with each other on a new sample, then we're probably out of distribution (kinda like the Yarin Gal dropout ensemble thing and much related work).

And if we discover that we are in fact out of distribution, then … I don't know. Ask a human for help?

If that guess is at all on the right track (very big "if"!), I endorse it as a promisi... (read more)
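A minimal version of the ensemble-disagreement idea can be sketched in a few lines. To be clear, this is my guess at the shape of the approach, not Aligned AI's actual method; the bootstrap linear fits and the disagreement threshold are stand-ins for real trained models and a real calibration.

```python
import random

# Train several models on the same data; if they agree on a new input we
# are plausibly in-distribution, and if they disagree we flag it as
# out-of-distribution and defer (e.g. ask a human for help).

def fit_slope(points, rng):
    # crude ensemble member: least-squares slope through the origin,
    # fit on a random bootstrap resample of the data
    sample = [rng.choice(points) for _ in points]
    return sum(x * y for x, y in sample) / sum(x * x for x, y in sample)

rng = random.Random(0)
# training data: y ~= 2x with small noise, x in [0.1, 1.0]
data = [(i / 10, 2.0 * (i / 10) + rng.gauss(0, 0.05)) for i in range(1, 11)]
ensemble = [fit_slope(data, rng) for _ in range(20)]

def predict_with_flag(x, threshold=0.5):
    preds = [slope * x for slope in ensemble]
    spread = max(preds) - min(preds)
    return sum(preds) / len(preds), spread > threshold  # (prediction, OOD?)

print(predict_with_flag(0.5))   # inside the training range: small spread
print(predict_with_flag(50.0))  # far outside it: the spread blows up
```

The members agree closely where the data pinned them down and diverge where it didn't, so the disagreement acts as a cheap out-of-distribution detector, which is the part I'd endorse as a promising near-term application.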

[Intro to brain-like-AGI safety] 4. The “short-term predictor”

I've been assuming the latter... Seems to me that there's enough latency in the whole system that it can be usefully reduced somewhat without any risk of reducing it below zero and thus causing instability etc.

I can imagine different signals being hardwired to different time-intervals (possibly as a function of age).

I can also imagine the interval starts low, and creeps up, millisecond by millisecond over the weeks, as long as the predictions keep being accurate, and conversely creeps down when the predictions are inaccurate. (I think that would work in pr... (read more)
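For what it's worth, that creep-up / creep-down scheme is easy to sketch. Here's a toy illustration (my own; the tolerance, step size, and bounds are made-up numbers, not anything from the post):

```python
def update_interval(interval_ms, prediction_error, tolerance=0.05,
                    step_ms=1, min_ms=0, max_ms=500):
    """Nudge the prediction interval by one step based on recent accuracy:
    creep up while predictions stay accurate, creep down when they don't."""
    if prediction_error < tolerance:
        return min(interval_ms + step_ms, max_ms)
    return max(interval_ms - step_ms, min_ms)

# Accurate predictions slowly lengthen the interval...
interval = 10
for _ in range(5):
    interval = update_interval(interval, prediction_error=0.01)
# ...and inaccurate ones shrink it back.
for _ in range(3):
    interval = update_interval(interval, prediction_error=0.2)
print(interval)  # 12
```

The bounds keep it from ever going below zero (the instability worry above) or growing without limit.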

A broad basin of attraction around human values?

I like this post. I'm not sure how decision-relevant it is for technical research though…

If there isn't a broad basin of attraction around human values, then we really want the AI (or the human-AI combo) to have "values" that, though they need not be exactly the same as the human, are at least within the human distribution. If there is a broad basin of attraction, then we still want the same thing, it's just that we'll ultimately get graded on a more forgiving curve. We're trying to do the same thing either way, right?

Abstractions as Redundant Information

(Warning: I might not be describing this well. And might be stupid.)

I feel like there's an alternate perspective compared to what you're doing, and I'm trying to understand why you don't take that path. Something like: You're taking the perspective that there is one Bayes net that we want to understand. And the alternative perspective is: there is a succession of "scenarios", and each is its own Bayes net, and there are deep regularities connecting them—for example they all obey the laws of physics, there's some relation between the "chairs" in one scenari... (read more)

2johnswentworth3mo
You can think of everything I'm doing as occurring in a "God's eye" model. I expect that an agent embedded in this God's-eye model will only be able to usefully measure natural abstractions within the model. So, shifting to the agent's perspective, we could say "holding these abstractions fixed, what possible models are compatible with them?". And that is indeed a direction I plan to go. But first, I want to get the nicest math I possibly can for computing the abstractions within a model, because the cleaner that is the cleaner I expect that computing possible models from the abstractions will be. ... that was kinda rambly, but I guess the summary is "building good generative problems is the inverse problem for the approach I'm currently focused on, and I expect that the cleaner this problem is solved the easier it will be to handle its inverse problem".
A broad basin of attraction around human values?

the attractor basin is very thin along some dimensions, but very thick along some other dimensions

There was a bunch of discussion along those lines in the comment thread on this post of mine a couple years ago, including a claim that Paul agrees with this particular assertion.

[Intro to brain-like-AGI safety] 2. “Learning from scratch” in the brain

Thanks!

I don't think anyone except for Jeff Hawkins believes in literal cortical uniformity.

I mentioned "non-uniform neural architecture and hyperparameters". I'm inclined to put different layer thicknesses (including agranularity) in the category of "non-uniform hyp... (read more)

4Jan Hendrik Kirchner3mo
Hmm, I see. If this is the crux, then I'll put all the remaining nitpicking at the end of my comment and just say: I think I'm on board with your argument. Yes, it seems conceivable to me that a learning-from-scratch program ends up in a (functionally) very similar state to the brain. The trajectory of how the program ends up there over training probably looks different (and might take a bit longer if it doesn't use the shortcuts that the brain got from evolution), but I don't think the stuff that evolution put in the cortex is strictly necessary. A caveat: I'm not sure how much weight the similarity between the program and the brain can support before it breaks down. I'd strongly suspect that certain aspects of the cortex are not logically implied by the statistics of the environment, but rather represent idiosyncratic quirks that were adapted at some point during evolution. Those idiosyncratic quirks won't be in the learning-from-scratch program. But perhaps (probably?) they are also not relevant in the big scheme of things. Fair! Most people in computational neuroscience are also very happy to ignore those differences, and so far nothing terribly bad happened. You point out yourself that some areas (e.g. the motor cortex) are granular, so that argument doesn't work there. But ignoring that, and conceding the cerebellum and the drosophila mushroom body to you (not my area of expertise), I'm pretty doubtful about postulating "locally-random pattern separation" in the cortex. I'm interpreting your thesis to cash out as "Given a handful of granule cells from layer 4, the connectivity with pyramidal neurons in layer 2/3 is (initially) effectively random, and therefore layer 2/3 neurons need to learn (from scratch) how to interpret the signal from layer 4". Is that an okay summary? Because then I think this fails at three points: 1. One characteristic feature of the cortex is the presence of cortical maps [https://www.jsmf.org/meetings/2019/mar/Bednar-Wilson
[Intro to brain-like-AGI safety] 2. “Learning from scratch” in the brain

Thanks! I'm not sure we have much disagreement here. Some relevant issues are:

• Memory ≠ Unstructured memory (and likewise, locally-random ≠ globally-random): There's certainly a neural architecture, with certain types of connections between certain macroscopic regions.
• "just" a memory system + learning algorithm—with a dismissive tone of voice on the "just": Maybe you didn't mean it that way, but for the record, I would suggest that to the extent that you feel wowed by something the neocortex does, I claim you should copy that feeling, and feel equally wowed
3Jon Garcia3mo
Agreed. I didn't mean to imply that you thought otherwise. I apologize for how that came across. I had no intention of being dismissive. When I respond to a post or comment, I typically try to frame what I say for a typical reader as much as for the original poster. In this case, I had a sense that a typical reader could get the wrong impression about how the neocortex does what it does if the only sorts of memory systems and learning algorithms that came to mind were things like a blank computer drive and stochastic gradient descent on a feed-forward neural network. You are absolutely right that the neocortex is equipped to learn from scratch, starting out generating garbage and gradually learning to make sense of the world/body/other-brain-regions/etc., which can legitimately be described as a memory system + learning algorithm. I just wanted anyone reading to appreciate that, at least in biological brains, there is no clean separation between learning algorithm and memory, but that the neocortex's role as a hierarchical, dynamic, generative simulator is precisely what makes learning from scratch so efficient, since it only has to correlate its intrinsic dynamics with the statistically similar dynamics of learned experience. I'm sure that there are vastly more ways of implementing learning-from-scratch, maybe some much better ways in fact, and I realize that the exact implementation is probably not relevant to the arguments you plan to make in this sequence. I just feel that a basic understanding of what a real learning-from-scratch system looks like could help drive intuitions of what is possible. Indeed, but of course including their own particular structural priors. Well, what is a recurrent neural network after all but an arbitrarily deep feed-forward neural network with shared weights across layers? My comment on cortical waves was just to point out a clever way that the brain learns to organize its cortical maps and primes them to expect causality to opera
[Intro to brain-like-AGI safety] 1. What's the problem & Why work on it now?

Good question! Here are a couple more specific open questions and how I think about them:

1. Is there a way that AGIs can be "verified to be good"?
2. Suppose an AGI has motivations that we don't endorse. Before it "misbehaves" in big catastrophic unrecoverable ways, will it first "misbehave" in small and easily-recoverable ways? And if so, will people respond properly by shutting down the AGI and/or fixing the problem?

For #1, my answer is "Wow, that would be super duper awesome, and it would make me dramatically more optimistic, so I sure hope someone figures out... (read more)

3Max Ra4mo
Thanks for the elaboration, looking forward to the next posts. :)
[Intro to brain-like-AGI safety] 1. What's the problem & Why work on it now?

Thanks!

If I'm not mistaken, the things you brought up are at too low a level to be highly relevant for safety, in my opinion. I guess this series will mostly be at Marr's "computational level" whereas you're asking "algorithm-level" questions, more or less. I'll be talking a lot about things vaguely like "what is the loss function" and much less about things vaguely like "how is the loss function minimized".

For example, I think you can train a neural Turing machine with supervised learning, or you can train a neural Turing machine with RL. That distinction... (read more)
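A deliberately tiny toy illustration of that point (my own example, not from the series): the same logistic "architecture" can be trained with a supervised signal (correct labels handed to the update rule) or with an RL signal (only a scalar reward for a sampled action), and the supervised-vs-RL distinction lives entirely in the training signal, not the architecture. All numbers here are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(500, 2))
labels = (x[:, 0] + x[:, 1] > 0).astype(float)  # ground-truth classification

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Supervised learning: the correct label is given directly.
w_sup = np.zeros(2)
for _ in range(5):
    for i in range(500):
        p = sigmoid(x[i] @ w_sup)
        w_sup += 0.1 * (labels[i] - p) * x[i]  # gradient of log-likelihood

# RL (REINFORCE-style): only a scalar reward for the sampled action is observed.
w_rl = np.zeros(2)
for _ in range(5):
    for i in range(500):
        p = sigmoid(x[i] @ w_rl)
        action = float(rng.random() < p)            # sample an action
        reward = 1.0 if action == labels[i] else 0.0
        w_rl += 0.1 * reward * (action - p) * x[i]  # policy-gradient update

acc_sup = np.mean((sigmoid(x @ w_sup) > 0.5) == labels)
acc_rl = np.mean((sigmoid(x @ w_rl) > 0.5) == labels)
print(acc_sup, acc_rl)  # both should classify well above chance
```

Same weights, same forward pass, two different ways of turning feedback into updates: a computational-level description ("what's the training signal") versus algorithm-level details of how the minimization proceeds.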

The Solomonoff Prior is Malign

(Warning: thinking out loud.)

Hmm. Good points.

Maybe I have a hard time relating to that specific story because it's hard for me to imagine believing any metacosmological or anthropic argument with >95% confidence. Even if within that argument, everything points to "I'm in a simulation etc.", there's a big heap of "is metacosmology really what I should be thinking about?"-type uncertainty on top. At least for me.

I think "people who do counterintuitive things" for religious reasons usually have more direct motivations—maybe they have mental health issues ... (read more)

4Vanessa Kosoy4mo
I think it's just a symptom of not actually knowing metacosmology. Imagine that metacosmology could explain detailed properties of our laws of physics (such as the precise values of certain constants) via the simulation hypothesis for which no other explanation exists. I don't know what it means "not to have control over the hypothesis space". The programmers write specific code. This code works well for some hypotheses and not for others. Ergo, you control the hypothesis space. I'm not really thinking "yes"? My TRL framework (of which physicalism is a special case) is specifically supposed to model metacognition / self-improvement. I can imagine using something like antitraining [https://www.lesswrong.com/posts/dPmmuaz9szk26BkmD/shortform?commentId=5YKQhQXGksd2watdf] here, but it's not trivial. First, the problem with acausal attack is that it is point-of-view-dependent. If you're the Holy One, the simulation hypothesis seems convincing, if you're a milkmaid then it seems less convincing (why would the attackers target a milkmaid?) and if it is convincing then it might point to a different class of simulation hypotheses. So, if the user and the AI can both be attacked, it doesn't imply they would converge to the same beliefs. On the other hand, in physicalism I suspect there is some agreement theorem that guarantees converging to the same beliefs (although I haven't proved that). Second... This is something that still hasn't crystallized in my mind, so I might be confused, but. I think that cartesian agents actually can learn to be physicalists. The way it works is: you get a cartesian hypothesis which is in itself a physicalist agent whose utility function is something like, maximizing its own likelihood-as-a-cartesian-hypothesis. Notably, this carries a performance penalty (like Paul noticed [https://www.lesswrong.com/posts/Tr7tAyt5zZpdTwTQK/the-solomonoff-prior-is-malign?commentId=cSM8LruuDytHm3kHE] ), since this subagent has to be computationally simpler
The Solomonoff Prior is Malign

Thanks for that! But I share with OP the intuition that these are weird failure modes that come from weird reasoning. More specifically, it's weird from the perspective of human reasoning.

It seems to me that your story is departing from human reasoning when you say "you posses a great desire to help whomever has summoned you into the world". That's one possible motivation, I suppose. But it wouldn't be a typical human motivation.

The human setup is more like: you get a lot of unlabeled observations and assemble them into a predictive world-model, and you al... (read more)

4Vanessa Kosoy4mo
Suppose your study of metacosmology makes you highly confident of the following: You are in a simulation. If you don't stab your family members, you and your family members will be sent by the simulators into hell. If you do stab your family members, they will come back to life and all of you will be sent to heaven. Yes, it's still counterintuitive to stab them for their own good, but so is e.g. cutting people up with scalpels [https://en.wikipedia.org/wiki/Surgery] or injecting them with substances derived from pathogens [https://en.wikipedia.org/wiki/Vaccine] and we do that to people for their own good. People also do counterintuitive things literally because they believe gods would send them to hell or heaven. This is pretty similar to the idea of confidence thresholds [https://www.lesswrong.com/posts/CnruhwFGQBThvgJiX/formal-solution-to-the-inner-alignment-problem]. The problem is, if every tiny conflict causes the AI to pause then it will always pause. Whereas if you leave some margin, the malign hypotheses will win, because, from a cartesian perspective, they are astronomically much more likely (they explain so many bits that the true hypothesis leaves unexplained).
Inner Alignment in Salt-Starved Rats

I still think this post is correct in spirit, and was part of my journey towards good understanding of neuroscience, and promising ideas in AGI alignment / safety.

But there are a bunch of little things that I got wrong or explained poorly. Shall I list them?

First, my "neocortex vs subcortex" division eventually developed into "learning subsystem vs steering subsystem", with the latter being mostly just the hypothalamus and brainstem, and the former being everything else, particularly the whole telencephalon and cerebellum. The main difference is that the "... (read more)

$1000 USD prize - Circular Dependency of Counterfactuals

By the same token, I think every neurotypical human thinking about Newcomb's problem is using counterfactual reasoning, and I think that there isn't any interesting difference in the general nature of the counterfactual reasoning that they're using.

I find this confusing as CDT counterfactuals where you can only project forward seem very different from things like FDT where you can project back in time as well.

I think there is "machinery that underlies counterfactual reasoning" (which incidentally happens to be the same as "the machinery that underlies imag... (read more)

2Chris_Leong4mo
I agree that counterfactual reasoning is contingent on certain brain structures, but I would say the same about logic as well and it's clear that the logic of a kindergartener is very different from that of a logic professor - although perhaps we're getting into a semantic debate - and what you mean is that the fundamental machinery is more or less the same. Yeah, this seems accurate. I see understanding the machinery as the first step towards the goal of learning to counterfactually reason well. As an analogy, suppose you're trying to learn how to reason well. It might make sense to figure out how humans reason, but if you want to build a better reasoning machine and not just duplicate human performance, you'd want to be able to identify some of these processes as good reasoning and some as biases. I guess I don't see why there would need to be a separation in order for the research direction I've suggested to be insightful. In fact, if there isn't a separation, this direction could even be more fruitful as it could lead to rather general results. I would say (as a slight simplification) that our goal in studying counterfactual reasoning should be to get counterfactuals to a point where we can answer questions about them using our normal reasoning. That post certainly seems to contain an awful lot of philosophy to me.
And I guess even though this post and my post On the Nature of Counterfactuals [https://www.lesswrong.com/posts/T4Mef9ZkL4WftQBqw/the-nature-of-counterfactuals] don't make any reference to decision theory, that doesn't mean that it isn't in the background influencing what I write. I've written a lot of posts here, many of which discuss specific decision theory questions. I guess I would still consider Joe Carlsmith's post a high-quality post if it had focused exclusively on the more philosophical aspects. And I guess philosophical arguments are harder to evaluate than mathematical ones and it can be disconcerting for some people, especially thos... (read more)

How an alien theory of mind might be unlearnable

I like this post but I'm a bit confused about why it would ever come up in AI alignment. Since you can't get an "ought" from an "is", you need to seed the AI with labeled examples of things being good or bad. There are a lot of ways to do that, some direct and some indirect, but you need to do it somehow. And once you do that, it would presumably disambiguate "trust public-emotional supervisor" from "trust private-calm supervisor". Hmm, maybe the scheme you have in mind is something like IRL? I.e.: (1) AI has a hardcoded template of "Boltzmann rational agen... (read more)

$1000 USD prize - Circular Dependency of Counterfactuals

I also think that there are lots of specific operations that are all "counterfactual reasoning"

Agreed. This is definitely something that I would like further clarity on

Hmm, my hunch is that you're misunderstanding me here. There are a lot of specific operations that are all "making a fist". I can clench my fingers quickly or slowly, strongly or weakly, left hand or right hand, etc. By the same token, if I say to you "imagine a rainbow-colored tree; are its leaves green?", there are a lot of different specific mental models that you might be invoking. (It c... (read more)

3Chris_Leong4mo
I agree that there isn't a single uniquely correct notion of a counterfactual. I'd say that we want different things from this notion and there are different ways to handle the trade-offs. I find this confusing as CDT counterfactuals where you can only project forward seem very different from things like FDT where you can project back in time as well. Well, we need the information encoded in our DNA rather than what is actually implemented in humans (clarification: what is implemented in humans is significantly influenced by society) though we aren't at the level where we can access that by analysing the DNA directly or people's brain structure for that matter, so we have to reverse engineer it from behaviour. I've very much focused on trying to understand how to solve these problems in theory rather than how can we correct any cognitive flaws in humans or on how to adapt decision theory to be easier or more convenient to use. In so far as I'm interested in how average humans reason counterfactually, it's mostly about trying to understand the various heuristics that are the basis of counterfactuals. I guess I believe that we need counterfactuals to understand and evaluate these heuristics, but I guess I'm hoping that we can construct something reflexively consistent.
$1000 USD prize - Circular Dependency of Counterfactuals

I think brains build a generative world-model, and that world-model is a certain kind of data structure, and "counterfactual reasoning" is a class of operations that can be performed on that data structure. (See here.) I think that counterfactual reasoning relates to reality only insofar as the world-model relates to reality. (In map-territory terminology: I think counterfactual reasoning is a set of things that you can do with the map, and those things are related to the territory only insofar as the map is related to the territory.) I also think that ther... (read more)

3Chris_Leong5mo
Agreed. This is definitely something that I would like further clarity on. I guess the real-world reasons for a mistake are sometimes not very philosophically insightful (i.e. Bob was high when reading the post, James comes from a Spanish speaking background and they use their equivalent of a word differently than English-speakers, Sarah has a terrible memory and misremembered it). I'm guessing your position might be that there are just mistakes and there aren't mistakes that are more philosophically fruitful or less fruitful? There's just mistakes. Is that correct? Or were you just responding to my specific claim that it might be useful to know how the average person responds to problems because we are evolved creatures? If so, then I definitely agree that we'd have to delve into the details and not just remain on the level of averages. Update: Actually, I'll add an analogy that might be helpful. Let's suppose you didn't know what a dog was. Actually, that's kind of the case: once you start diving into any definition you end up running into fuzzy cases, such as does a robotic dog count as a dog?
Then if humans had built a bunch of different classifiers and you didn't have access to the humans (say they went extinct) then you might want to analyse the different classifiers to try to figure out how humans defined the term dog, even though much of the behaviour might only tell you what the flaws tend to produce rather than about the human concept. Similarly, we don't have exact access to our evolutionary history, but examining human intuitions about counterfactuals might provide insights about which heuristics have worked well, whilst also recognising that it's hard, arguably impossible, to even talk about "working well" without embracing the notion of counterfactuals. And I agree that there are probably different ways we could emphasise various heuristics rather than a unique, principled solution. I'm not claiming the situation is precisely this - in fact I'm not... (read more)

$1000 USD prize - Circular Dependency of Counterfactuals

How much are you interested in a positive vs normative theory of counterfactuals? For example, do you feel like you understand how humans do counterfactual reasoning, and how and why it works for them (insofar as it works for them)? If not, is such an understanding what you're looking for? Or do you think humans are not perfect at counterfactual reasoning (e.g. maybe because people disagree with each other about Newcomb's problem etc.) and there's some deep notion of "correct counterfactual reasoning" that humans are merely approximating, and the deeper "c... (read more)

4Chris_Leong5mo
Update: I should further clarify that even though I provided a rough indication of how important I consider various approaches, this is off-the-cuff and I could be persuaded an approach was more valuable than I think, particularly if I saw good quality work. I guess my ultimate interest is normative as the whole point of investigating this area is to figure out what we should do. However, I am interested in descriptive theories insofar as they can contribute to this investigation (and not insofar as the details aren't useful for normative theories). For example, when I say that counterfactuals only make sense from within the counterfactual perspective and further that counterfactuals are ultimately grounded as an evolutionary adaption I'm making descriptive statements. The latter seems to be more of a positive statement, while the former doesn't seem to be (it seems to be justified by philosophical reasoning more than empirical investigation). In any case, it feels like there is more work to be done in taking these high-level abstract statements and making them more precise. I think that further investigation here could be useful - although not in the sense that 40% use this style of reasoning and 60% use this style - exact percentages aren't the relevant things here - at least not at this early stage. I'd also lean towards saying that how experts operate is more important than average humans and that the behavior of especially stupid humans is probably of limited importance. I guess I see the behaviour of normal humans mattering for two reasons: a) Firstly because I see making use of counterfactuals as evolutionarily grounded (in a more primitive form than the highly cognitive and mathematically influenced versions that we tend to use on LW) b) Secondly because the experts are more likely to discard intuitions that don't agree with their theories. And I think we need to use our reasoning to produce a consistent theory from our intuitions at some point, but th
Alignment By Default

I’ll set aside what happens “by default” and focus on the interesting technical question of whether this post is describing a possible straightforward-ish path to aligned superintelligent AGI.

The background idea is “natural abstractions”. This is basically a claim that, when you use an unsupervised world-model-building learning algorithm, its latent space tends to systematically learn some patterns rather than others. Different learning algorithms will converge on similar learned patterns, because those learned patterns are a property of the world, not an ... (read more)

6johnswentworth5mo
I'm fairly confident that the inputs to human values are natural abstractions - i.e. the "things we care about" are things like trees, cars, other humans, etc, not low-level quantum fields or "head or thumb but not any other body part". (The "head or thumb" thing is a great example, by the way). I'm much less confident that human values themselves are a natural abstraction, for exactly the same reasons you gave.
Brain-inspired AGI and the "lifetime anchor"

Yes, what you said. The opposite of "a human-legible learning algorithm" is "a nightmarishly-complicated Rube-Goldberg-machine learning algorithm".

If the latter is what we need, we could still presumably get AGI, but it would involve some automated search through a big space of many possible nightmarishly-complicated Rube-Goldberg-machine learning algorithms to find one that works.

That would be a different AGI development story, and thus a different blog post. Instead of "humans figure out the learning algorithm" as an exogenous input to the path-to-AGI, w... (read more)

Disentangling Perspectives On Strategy-Stealing in AI Safety

I understood the idea of Paul's post as: if we start in a world where humans-with-aligned-AIs control 50% of relevant resources (computers, land, minerals, whatever), and unaligned AIs control 50% of relevant resources, and where the strategy-stealing assumption is true—i.e., the assumption that any good strategy that the unaligned AIs can do, the humans-with-aligned-AIs are equally capable of doing themselves—then the humans-with-aligned-AIs will wind up controlling 50% of the long-term future. And the same argument probably holds for 99%-1% or any other ... (read more)