## AI Alignment Forum

Jessica Taylor

Jessica Taylor. CS undergrad and Master's at Stanford; former research fellow at MIRI.

I work on decision theory, social epistemology, strategy, naturalized agency, mathematical foundations, decentralized networking systems and applications, theory of mind, and functional programming languages.

Blog: unstableontology.com

# Wiki Contributions

Where I agree and disagree with Eliezer

AI improving itself is most likely to look like AI systems doing R&D in the same way that humans do. “AI smart enough to improve itself” is not a crucial threshold, AI systems will get gradually better at improving themselves. Eliezer appears to expect AI systems performing extremely fast recursive self-improvement before those systems are able to make superhuman contributions to other domains (including alignment research), but I think this is mostly unjustified. If Eliezer doesn’t believe this, then his arguments about the alignment problem that humans need to solve appear to be wrong.

A different way I've been thinking about this issue recently is that humans have fundamental cognitive limits (e.g. brain size) that AGI wouldn't have. There are possible biotech interventions to fix these, but even the easiest ones (e.g. just increasing skull size) would still require decades to start up. AI, meanwhile, could be improved (by humans and AIs) on much faster timescales. (How important something like brain size is depends on how much intellectual progress is explained by max intelligence rather than total intelligence; a naive reading of intellectual history would say max intelligence is important, given that a high percentage of relevant human knowledge follows from <100 important thinkers.)

This doesn't lead me to assign high probability to "takeoff in 1 month"; my expectation is still that AI improving AI will be an extension of humans improving AI (and then centaurs improving AI), but the iteration cycle time could be a lot faster due to AIs not having fundamental human cognitive limits.

Let's See You Write That Corrigibility Tag

“myopia” (not sure who correctly named this as a corrigibility principle),

I think this is from Paul Christiano, e.g. this discussion.

Everything I Need To Know About Takeoff Speeds I Learned From Air Conditioner Ratings On Amazon

I assumed EER did account for that based on:

All portable air conditioners’ energy efficiency is measured using an EER score. The EER rating is the ratio between the useful cooling effect (measured in BTU) to electrical power (in W). It’s for this reason that it is hard to give a generalized answer to this question, but typically, portable air conditioners are less efficient than permanent window units due to their size.

Everything I Need To Know About Takeoff Speeds I Learned From Air Conditioner Ratings On Amazon

Regarding the back-and-forth on air conditioners, I tried Google searching to find a precedent for this sort of analysis; the first Google result for "air conditioner single vs. dual hose" was this blog post, which acknowledges the inefficiency johnswentworth points out, overall recommends dual-hose air conditioners, but still recommends single-hose air conditioners under some conditions, and claims the efficiency difference is only about 12%.

Highlights:

In general, a single-hose portable air conditioner is best suited for smaller rooms. The reason is that if the area you want to cool is on the larger side, the unit will have to work much harder to cool the space.

So how does it work? The single-hose air conditioner yanks warm air and moisture from the room and expels it outside through the exhaust. A negative pressure is created when the air is pushed out of the room, and the air needs to be replaced. In turn, any opening in the house like doors, windows, and cracks will draw outside hot air into the room to replace the missing air. The air is cooled by the unit and ejected into the room.

...

Additionally, the single-hose versions are usually less expensive than their dual-hose counterparts, so if you are price sensitive, this should be considered. However, the design is much simpler and the bigger the room gets, the less efficient the device will be.

...

In general, dual-hose portable air conditioners are much more effective at cooling larger spaces than the single-hose variants. For starters, dual-hose versions operate more quickly, as they have a more efficient air exchange process.

This portable air conditioning unit has two hoses, one functioning as an exhaust hose and the other as an intake hose that will draw outside hot air. The air is cooled and expelled into the area. This process heats the machine; to cool it down, the intake hose sucks outside hot air to cool the compressor and condenser units. The exhaust hose discards warmed air outside of the house.

The only drawback is that these systems are usually more expensive, and due to having two hoses instead of one, they are slightly less portable and more difficult to set up, yet most people tend to agree the investment in the extra hose is definitely worth the extra cost.

One thing to bear in mind is that the dual hose conditioners tend to be louder than single hoses. Once again, this depends on the model you purchase and its specifications, but it’s definitely worth mulling over if you need to keep the noise down in your area.

...

All portable air conditioners’ energy efficiency is measured using an EER score. The EER rating is the ratio between the useful cooling effect (measured in BTU) to electrical power (in W). It’s for this reason that it is hard to give a generalized answer to this question, but typically, portable air conditioners are less efficient than permanent window units due to their size.

...

| Description | Single-hose | Dual-hose |
| --- | --- | --- |
| Price | Starts at $319.00 | Starts at $449.00 |
| … | … | … |
| Energy Efficiency Ratio (EER) | 10 | 11.2 |
| Power consumption rate | About $1 a day | Over $1 a day |
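Given the quoted EER definition (useful cooling in BTU/hr divided by electrical power in W), the table's ratings imply roughly the 12% efficiency gap the blog post cites. A quick sanity check (the 10,000 BTU/hr figure is an assumed nominal capacity for illustration, not from the post):

```python
def eer(cooling_btu_per_hr: float, power_watts: float) -> float:
    """EER = useful cooling effect (BTU/hr) divided by electrical power (W)."""
    return cooling_btu_per_hr / power_watts

# Work backwards from the table's EER ratings for a nominal 10,000 BTU/hr unit.
single_hose_power = 10_000 / 10.0    # EER 10   -> 1000 W
dual_hose_power = 10_000 / 11.2      # EER 11.2 -> ~893 W

# Relative efficiency advantage of the dual-hose unit:
advantage = 11.2 / 10.0 - 1.0        # = 0.12, i.e. the ~12% figure
```

So at the same cooling output, the dual-hose unit draws about 11% less power, which matches the "only about 12%" efficiency-difference claim.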

AXRP Episode 14 - Infra-Bayesian Physicalism with Vanessa Kosoy

Btw, there is some amount of philosophical convergence between this and some recent work I did on critical agential physics; both are trying to understand physics as laws that partially (not fully) predict sense-data starting from the perspective of a particular agent.

It seems like "infra-Bayesianism" may be broadly compatible with frequentism; extending Popper's falsifiability condition to falsify probabilistic (as opposed to deterministic) laws yields frequentist null hypothesis significance testing, e.g. Neyman Pearson; similarly, frequentism also attempts to get guarantees under adversarial assumptions, as previously explained by Jacob Steinhardt.
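A minimal illustration of the Popper-to-NHST extension described above, under the simplifying assumption that the "probabilistic law" is a Bernoulli model of coin flips: a deterministic law is falsified by one counterexample, while a probabilistic law can only be rejected at a chosen significance level.

```python
import math

def binomial_test_p_value(heads: int, n: int, p: float = 0.5) -> float:
    """Two-sided exact binomial test: the probability, under the law
    'each flip lands heads with probability p', of an outcome no more
    likely than the one observed."""
    def pmf(k: int) -> float:
        return math.comb(n, k) * p**k * (1 - p) ** (n - k)
    observed = pmf(heads)
    # Sum the probability of every outcome at most as likely as the observed one.
    return sum(pmf(k) for k in range(n + 1) if pmf(k) <= observed * (1 + 1e-9))

# 90 heads in 100 flips: "falsify" the law p = 0.5 at the 5% level
# (Neyman-Pearson style rejection, not Popperian certainty).
p_val = binomial_test_p_value(heads=90, n=100, p=0.5)
rejected = p_val < 0.05
```

The adversarial flavor mentioned above shows up in the guarantee: whatever the true law, the test rejects a correct law with probability at most the significance level.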

$1000 USD prize - Circular Dependency of Counterfactuals

Thanks for reading all the posts! I'm not sure where you got the idea that this was to solve the spurious counterfactuals problem; that was in the appendix because I anticipated that a MIRI-adjacent person would want to know how it solves that problem. The core problem it's solving is that it's a well-defined mathematical framework in which (a) there are, in some sense, choices, and (b) it is believed that these choices correspond to the results of a particular Turing machine. It goes back to the free will vs. determinism paradox, and shows that there's a formalism that has some properties of "free will" and some properties of "determinism".

A way that EDT fails to solve 5 and 10 is that it could believe with 100% certainty that it takes $5, so its expected value for $10 is undefined. (I wrote previously about a modification of EDT to avoid this problem.) CDT solves it by constructing physically impossible counterfactuals, which has other problems: e.g. suppose there's a Laplace's demon that searches for violations of physics and destroys the universe if physics is violated; this theoretically shouldn't make a difference, but it messes up the CDT counterfactuals.

It does look like your post overall agrees with the view I presented. I would tend to call augmented reality "metaphysics" in that it is a piece of ontology that goes beyond physics. I wrote about metaphysical free will a while ago and didn't post it on LW because I anticipated people would be allergic to the non-physicalist philosophical language.

$1000 USD prize - Circular Dependency of Counterfactuals

It seems like agents in a deterministic universe can falsify theories in at least some sense. Like, they take two different weights, drop them, and see that they land at the same time, falsifying the claim that heavier objects fall faster.

The main problem is that it isn't meaningful for their theories to make counterfactual predictions about a single situation; they can create multiple situations (across time and space) and assume symmetry and get falsification that way, but it requires extra assumptions. Basically you can't say different theories really disagree unless there's some possible world / counterfactual / whatever in which they disagree; finding a "crux" experiment between two theories (e.g. if one theory says all swans are white and another says there are black swans in a specific lake, the cruxy experiment looks in that lake) involves making choices to optimize disagreement.

In the second case, I would suggest that what we need is counterfactuals not agency. That is, we need to be able to say things like, "If I ran this experiment and obtained this result, then theory X would be falsified", not "I could have run this experiment and if I did and we obtained this result, then theory X would be falsified".

Those seem pretty much equivalent? Maybe by agency you mean utility function optimization, which I didn't mean to imply was required.

The part I thought was relevant was the part where you can believe yourself to have multiple options and yet be implemented by a specific computer.
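The 5 and 10 failure mode mentioned earlier in this thread can be sketched in a few lines (my own illustrative framing, not the prize post's formalism): when an EDT agent's world model puts probability 0 on one of its actions, the conditional expectation for that action is a 0/0 and so undefined.

```python
from fractions import Fraction

# Toy 5-and-10 problem: the agent can take a $5 bill or a $10 bill.
# EDT ranks actions by the conditional expectation E[utility | action = a],
# which is undefined when the agent assigns probability 0 to the action.

beliefs = {"take_5": Fraction(1), "take_10": Fraction(0)}  # certain it takes $5
utility = {"take_5": 5, "take_10": 10}

def edt_value(action: str):
    """Return E[utility | action], or None when conditioning is undefined."""
    if beliefs[action] == 0:
        return None  # conditioning on a probability-zero event: 0/0
    # With this degenerate world model, E[utility | action] is just utility(action).
    return utility[action]

values = {a: edt_value(a) for a in beliefs}
# values["take_10"] is None: EDT cannot conclude that $10 would have been better.
```
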

\$1000 USD prize - Circular Dependency of Counterfactuals

I previously wrote a post about reconciling free will with determinism. The metaphysics implicit in Pearlian causality is free will (in Drescher's words: "Pearl's formalism models free will rather than mechanical choice."). The challenge is reconciling this metaphysics with the belief that one is physically embodied. That is what the post attempts to do; these perspectives aren't inherently irreconcilable, we just have to be really careful about e.g. distinguishing "my action" from "the action of the computer embodying me" in the Bayes net, and distinguishing the interventions on them.

I wrote another post about two alternatives to logical counterfactuals: one says counterfactuals don't exist, one says that your choice of policy should affect your anticipation of your own source code. (I notice you already commented on this post, just noting it for completeness)

And a third post, similar to the first, reconciling free will with determinism using linear logic.

I'm interested in what you think of these posts and what feels unclear/unresolved, I might write a new explanation of the theoretical perspective or improve/extend/modify it in response.

Visible Thoughts Project and Bounty Announcement

How do you think this project relates to Ought? Seems like the projects share a basic objective (having AI predict human thoughts had in the course of solving a task). Ought has more detailed proposals for how the thoughts are being used to solve the task (in terms of e.g. factoring a problem into smaller problems, so that the internal thoughts are a load-bearing part of the computation rather than an annotation that is predicted but not checked for being relevant).

So we are taking one of the outputs that current AIs seem to have learned best to design, and taking one of the places where human thoughts about how to design it seem most accessible, and trying to produce a dataset which the current or next generation of text predictors might be able to use to learn how to predict thoughts about designing their outputs and not just predict the outputs themselves.

As the proposal stands it seems like the AI's predictions of human thoughts would offer no relevant information about how the AI is predicting the non-thought story content, since the AI could be predicting these different pieces of content through unrelated mechanisms.

Christiano, Cotra, and Yudkowsky on AI progress

This section seemed like an instance of you and Eliezer talking past each other in a way that wasn't locating a mathematical model containing the features you both believed were important (e.g. things could go "whoosh" while still being continuous):

[Christiano][13:46]

Even if we just assume that your AI needs to go off in the corner and not interact with humans, there’s still a question of why the self-contained AI civilization is making ~0 progress and then all of a sudden very rapid progress

[Yudkowsky][13:46]

unfortunately a lot of what you are saying, from my perspective, has the flavor of, “but can’t you tell me about your predictions earlier on of the impact on global warming at the Homo erectus level”

you have stories about why this is like totally not a fair comparison

I do not share these stories

[Christiano][13:46]

I don’t understand either your objection nor the reductio

like, here’s how I think it works: AI systems improve gradually, including on metrics like “How long does it take them to do task X?” or “How high-quality is their output on task X?”

[Yudkowsky][13:47]

I feel like the thing we know is something like, there is a sufficiently high level where things go whooosh humans-from-hominids style

[Christiano][13:47]

We can measure the performance of AI on tasks like “Make further AI progress, without human input”

Any way I can slice the analogy, it looks like AI will get continuously better at that task