All of romeostevensit's Comments + Replies

Draft report on AI timelines

Is a sensitivity analysis of the model separated out anywhere? I might just be missing it.

3 · Ajeya Cotra · 12d: There is some limited sensitivity analysis in the "Conservative and aggressive estimates" section of part 4.
AI Safety Research Project Ideas

Detecting preferences in agents: how many assumptions need to be made?

I'm interpreting this to be asking how to detect the dimensionality of the natural embedding of preferences?

Agency in Conway’s Game of Life

Related to the sensitivity of instrumental convergence, i.e. the question of whether we live in a universe of strong or weak instrumental convergence. In a strong-instrumental-convergence universe, most possible optimizers wind up in a relatively small space of configurations regardless of starting conditions, while in a weak one they may diverge arbitrarily in design space. This can be thought of as one way of crisping up concepts around orthogonality: e.g. in some universes orthogonality would be locally true but globally false (or vice versa), or both locally and globally true (or both false).

1 · Alex Flint · 3mo: Romeo, if you have time, would you say more about the connection between orthogonality and Life / the control question / the AI hypothesis? It seems related to me but I just can't quite put my finger on exactly what the connection is.
[AN #148]: Analyzing generalization across more axes than just accuracy or loss
  1. First-person vs. third-person: In a first-person perspective, the agent is central. In a third-person perspective, we take a “birds-eye” view of the world, of which the agent is just one part.
  2. Static vs. dynamic: In a dynamic perspective, the notion of time is explicitly present in the formalism. In a static perspective, we instead have beliefs directly about entire world-histories.

I think these are two instances of a general heuristic of treating what have traditionally been seen as philosophical positions (e.g. here cognitive and behavioral view... (read more)

Coherence arguments imply a force for goal-directed behavior

This seems consistent with coherence being not a constraint but a dimension of optimization pressure among several/many? Like environments that money pump more reliably will have stronger coherence pressure, but also the creature might just install a cheap hack for avoiding that particular pump (if narrow) which then loosens the coherence pressure (coherence pressure sounds expensive, so workarounds are good deals).

Behavioral Sufficient Statistics for Goal-Directedness

I noticed myself being dismissive of this approach despite being potentially relevant to the way I've been thinking about things. Investigating that, I find that I've mostly been writing off anything that pattern matches to the 'cognitive architectures' family of approaches. The reason for this is that most such approaches want to reify modules and structure. And my current guess is that the brain doesn't have a canonical structure (at least, on the level of abstraction that cognitive architecture focuses on). That is to say, the modules are fluid and their connections to each other are contingent.

2 · Adam Shimi · 5mo: Thanks for commenting on your reaction to this post! That being said, I'm a bit confused by your comment. You seem to write off approaches which attempt to provide a computational model of mind, but my approach is literally the opposite: looking only at the behavior (but all the behavior), extract relevant statistics to study questions related to goal-directedness. Can you maybe give more details?
Utility Maximization = Description Length Minimization

Hypothesis: in a predictive coding model, the bottom up processing is doing lossless compression and the top down processing is doing lossy compression. I feel excited about viewing more cognitive architecture problems through a lens of separating these steps.
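
A toy sketch of the two regimes (illustrative only: zlib stands in for bottom-up lossless coding, and coarse quantization for the detail-discarding top-down step):

```python
import zlib

signal = bytes(range(0, 250, 2)) * 4  # stand-in for a bottom-up sensory stream

# Bottom-up: lossless compression -- every bit of the input is recoverable.
lossless = zlib.compress(signal)
assert zlib.decompress(lossless) == signal

# Top-down: lossy compression -- quantize to a coarse prediction first,
# discarding detail, then compress the much more regular summary.
quantized = bytes((b // 32) * 32 for b in signal)
lossy = zlib.compress(quantized)

print(len(signal), len(lossless), len(lossy))  # the lossy summary compresses further
```

The separation is the point: the lossless step preserves the stream exactly, while the lossy step buys a far smaller code by committing to a coarse model of the input.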

What are the best precedents for industries failing to invest in valuable AI research?

There's a fairly straightforward optimization process that occurs in product development that I don't often see talked about in the abstract that goes something like this:

It seems like bigger firms should be able to produce higher quality goods. They can afford longer product development cycles, hire a broader variety of specialized labor, etc. In practice, it's smaller firms that compete on quality. Why is this?

One of the reasons is that the pressure to cut corners increases enormously at scale along more than one dimension. As a product scales, eking out... (read more)


This is clarifying, thanks.

WRT the last paragraph, I'm thinking in terms of convergent vs divergent processes. So, fixed points, I guess.


This is biting the bullet on the infinite regress horn of the Munchhausen trilemma, but given the finitude of human brain architecture I prefer biting the bullet on circular reasoning. We have a variety of overlays, like values, beliefs, goals, actions, etc. There is no canonical way they are wired together. We can hold some fixed as a basis while we modify others. We are a Ship of Neurath. Some parts of the ship feel more is-like (like the waterproofness of the hull) and some feel more ought-like (like the steering wheel).

4 · Abram Demski · 8mo: Why not both? ;3 I have nothing against justifications being circular (i.e. the same ideas recurring on many levels), just as I have nothing in principle against finding a foundationalist explanation. A circular argument is just a particularly simple form of infinite regress. But my main argument against only the circular-reasoning explanation is that attempted versions of it ("coherentist" positions) don't seem very good when you get down to details. Pure coherentist positions tend to rely on a stipulated notion of coherence (such as probabilistic coherence, or weighted constraint satisfaction, or something along those lines). These notions are themselves fixed. This could be fine if the coherence notions were sufficiently "assumption-lite" so as to not be necessarily Goodhart-prone etc., but so far it doesn't seem that way to me. I'm predicting that you'll agree with me on that, and grant that the notion of coherence should itself be up for grabs.

I don't actually think the coherentist/foundationalist/infinitist trilemma is that good a characterization of our disagreement here. My claim isn't so much the classical claim that there's an infinite regress of justification, as much as a claim that there's an infinite regress of uncertainty: that we're uncertain at all the levels, and need to somehow manage that. This fits the ship-of-Theseus picture just fine. In other words, one can unroll a ship of Theseus into an infinite hierarchy where each level says something about how the next level down gets re-adjusted over time. The reason for doing this is to achieve the foundationalist goal of understanding the system better, without the foundationalist method of fixing foundational assumptions.

The main motive here is amplification. Taking just a ship of Theseus, it's not obvious how to make it better besides running it forward faster (and even this has its risks, since the ship may become worse). If we unroll the hierarchy of wanting-to-become-better, we can e.g. see
Some AI research areas and their relevance to existential safety

I see CSC and SEM as highly linked via modularity of processes.

The Pointers Problem: Human Values Are A Function Of Humans' Latent Variables

A pointer is sort of the ultimate in lossy compression: just an index into the uncompressed data, like a legible compression library. Wireheading is a Goodharting problem, which is a lossy compression problem, etc.

The Pointers Problem: Human Values Are A Function Of Humans' Latent Variables

Over the last few posts the recurrent thought I have is "why aren't you talking about compression more explicitly?"

1 · johnswentworth · 8mo: Could you uncompress this comment a bit please?
Extortion beats brinksmanship, but the audience matters

> The other people of whom you have nude photos, who are now incentivised to pay up rather than kick up a fuss.

Releasing one photo from a set previously believed to be secure, where other photos in the same set are compromising, can suffice for the single-member-audience case.

Confucianism in AI Alignment

That's the Legalist interpretation of Confucianism. Confucianism argues that the Legalists are just moving the problem one level up the stack, à la public choice theory. The point of the Confucian is that the stack has to ground out somewhere, and asks the question of how to roll our virtue intuitions into the problem space explicitly, since otherwise we are rolling them in tacitly and doing some hand waving.

1 · johnswentworth · 9mo: Thanks, I was hoping someone more knowledgeable than I would leave a comment along these lines.
Additive Operations on Cartesian Frames

The main intuition this sparks in me is that it gives us concrete data structures to look for when talking broadly about the brain doing 'compression' by rotating a high dimensional object and carving off recognized chunks (simple distributions) in order to make the messy inputs more modular, composable, accessible, error correctable, etc. Sort of the way that predictive coding gives us a target to hunt for in looking for structures that look like they might be doing something like the atomic predictive coding unit.

Comparing Utilities

Type theory for utility hypothesis: there are a certain distinct (small) number of pathways in the body that cause physical good feelings. Map those plus the location, duration, intensity, and frequency dimensions and you start to have comparability. This doesn't solve the motivation/meaning structures built on top of those pathways which have more degrees of freedom, but it's still a start. Also, those more complicated things built on top might just be scalar weightings and not change the dimensionality of the space.

4 · Abram Demski · 1y: Yeah, it seems like in practice humans should be a lot more comparable than theoretical agentic entities like I discuss in the post.
My computational framework for the brain

Trying to summarize your current beliefs (harder than it looks) is one of the best ways to have genuinely novel thoughts, IME.

Egan's Theorem?

Sounds similar to Noether's Theorem in some ways when you take that theorem philosophically and not just mathematically.

Matt Botvinick on the spontaneous emergence of learning algorithms

Two separate size parameters. The size of the search space, and the size the traversal algorithm needs to be to span the same gaps brains did.

Alignment By Default

> This requires hitting a window - our data needs to be good enough that the system can tell it should use human values as a proxy, but bad enough that the system can’t figure out the specifics of the data-collection process enough to model it directly. This window may not even exist.

I like this framing, it is clarifying.

> When alignment-by-default works, it’s basically a best-case scenario, so we can safely use the system to design a successor without worrying about amplification of alignment errors (among other things).

I didn't understand how this was derived or what other results/ideas it is referencing.

2 · johnswentworth · 1y: The idea here is that the AI has a rough model of human values, and is pointed at those values when making decisions (e.g. the embedding is known and it's optimizing for the embedded values, in the case of an optimizer). It may not have perfect knowledge of human values, but it would e.g. design its successor to build a more precise model of human values than itself (assuming it expects that successor to have more relevant data) and point the successor toward that model, because that's the action which best optimizes for its current notion of human values.

Contrast to e.g. an AI which is optimizing for human approval. If it can do things which makes a human approve, even though the human doesn't actually want those things (e.g. deceptive behavior), then it will do so. When that AI designs its successor, it will want the successor to be even better at gaining human approval, which means making the successor even better at deception.

This probably needs more explanation, but I'm not sure which parts need more explanation, so feedback would be appreciated.
Abstraction, Evolution and Gears

Mild progress on the intentional stance for me: take a thermostat. You can count the number of different temperatures the sensor is capable of detecting, the number of states the actuator can produce in response (for a thermostat, only on/off), and the function mapping between the two. This might start to give some sense of how you can build up a multidimensional map out of multiple sensors and actuators via some sort of function combination.
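
A minimal sketch of this counting exercise (all names and thresholds are illustrative):

```python
# Each sensor has a finite set of detectable states; each actuator a finite
# set of responses; the device just embodies a function between the two.

temp_states = list(range(15, 31))   # 16 distinguishable temperatures
actuator_states = ["off", "on"]     # the thermostat's only two responses

def thermostat(temp):
    """The device's entire behavior: one sensor state -> one actuator state."""
    return "on" if temp < 20 else "off"

# The "map" the device carries is nothing more than the graph of that function:
thermo_map = {t: thermostat(t) for t in temp_states}

# Combining several sensor->actuator functions (e.g. a hypothetical bacterium
# with temperature, light, and salinity sensors) yields a multidimensional map:
def bacterium(temp, light, salinity):
    return (thermostat(temp),
            "toward" if light > 0.5 else "away",
            "swim" if salinity < 0.3 else "tumble")

print(len(thermo_map), len(actuator_states))  # 16 sensor states, 2 responses
```

The multidimensional map here is just the product of the individual sensor-to-actuator graphs; nothing beyond the function mappings needs to be posited.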

Our take on CHAI’s research agenda in under 1500 words

One compression: assistance in partitioning the hypothesis space. As opposed to finding the correct point in the search space from one shot learning.

Motivating Abstraction-First Decision Theory

Really like this post, great inroad into something I've been thinking about which is how to formalize self locating uncertainty in order to use it to build other things.

Also, relating back to low level biological systems:

may be useful for tracing some key words and authors that have had some related ideas.

What are some exercises for building/generating intuitions about key disagreements in AI alignment?

If you want to investigate the intuitions themselves e.g. what is generating differing intuitions between researchers, I'd pay attention to which metaphors are being used in the reference class tennis when reading the existing debates.

2 · Issa Rice · 1y: I have only a very vague idea of what you mean. Could you give an example of how one would do this?
Demons in Imperfect Search

The toy example and non-agentic real-life examples don't have the coupling/symbiosis of walls siphoning work from balls to maintain the walls. Walls might be built by restricting the dimensions along which the ball tends to move/look ahead, so that it treats saddle points instead as cul-de-sacs. Lowering momentum/energy in general makes the walls you need to build not as high.

Instrumental Occam?

I find it useful to use Marr's representation/traversal distinction to think about how systems use Aether variables. Sometimes the complexity/uncertainty gets pushed into the representation to favor simpler/faster traversals (common in most computer science algorithm designs) and sometimes the uncertainty gets pushed into the traversal to favor a simpler representation (common in normal human cognition due to working memory constraints, or any situation with mem constraints).

From this, I had an easier time thinking about how a simplicity bias is really (often?) a modularity bias, so that we can get composability in our causal models.
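
A toy illustration of the two placements of the work (the Fibonacci example is mine, not Marr's):

```python
# Option 1: push the work into the representation -- precompute a table,
# so the traversal is a trivial lookup (common in CS algorithm design).
table = {}
a, b = 0, 1
for n in range(50):
    table[n] = a
    a, b = b, a + b

def fib_lookup(n):
    return table[n]  # heavyweight representation, trivial traversal

# Option 2: push the work into the traversal -- keep only the recurrence
# (a minimal representation) and recompute on every query (as under
# working-memory constraints).
def fib_traverse(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a  # minimal representation, the traversal does all the work

assert fib_lookup(10) == fib_traverse(10) == 55
```

Same function computed either way; what moves is where the complexity lives.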

The Rocket Alignment Problem

Good for loading intuitions about complexity of philosophical progress disagreements.

Will transparency help catch deception? Perhaps not

+1 and a frame that might be helpful is figuring out when adversarial training happens anyway and why. I think this happens with human management and might be illustrative.

Occam's Razor May Be Sufficient to Infer the Preferences of Irrational Agents: A reply to Armstrong & Mindermann

This is neat. It makes me realize that thinking in terms of simplicity and complexity priors was serving somewhat as a semantic stop sign for me whereas speed prior vs slow prior doesn't.

What are we assuming about utility functions?

Consider various utilitarian fixes to classic objections to utilitarianism:

In each case, the utilitarian wants to fix the issue by redrawing buckets around what counts as utility, what counts as actions, what counts as consequences, and the time binding/window on each of them. But these sorts of ontological sidesteps prove too much. If taken as a general approach, rather than just as an ad hoc approach to solve any individual conundrum, it becomes obvious that i... (read more)

What are we assuming about utility functions?

Utility arguments often include type errors via referring to contextual utility in one part of the argument and some sort of god's eye contextless utility in other parts. Sometimes the 'gotcha' of the problem hinges on this.

1 · johnswentworth · 2y: Can you give an example?
Why Subagents?

One thing I don't understand about cycles: they seem fine as long as you have a generalized cycle detector, and a single instance of a cycle getting generated is fine because the losses from one (or a few) rounds are small. I guess people normally think of utility functions as fixed, but this sort of rolls fixed-point/convergence intuitions into the problem formulation.

One frame is that utility functions as a formalism are just an extension of the great rationality debate.

2 · johnswentworth · 2y: If we have a cycle detector which prevents cycling, then we don't have true cycles. Indeed, that would be an example of a system with internal state: the externally-visible state looks like it cycles, but the full state never does - the state of the cycle detector changes. So this post, as applied to cycle detectors, says: any system which detects cycles and prevents further cycling can be represented by a committee of utility-maximizing agents.
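
A toy version of this exchange (names illustrative): preferences cycle A → B → C → A, but a detector that remembers previously held items adds internal state, so the externally visible cycling never corresponds to a repeat of the full state:

```python
# Cyclic preferences: the agent would accept each trade in isolation.
accepts = {"A": "B", "B": "C", "C": "A"}

def run(item, steps):
    seen_states = set()
    memory = frozenset()                  # cycle-detector memory of items held
    for _ in range(steps):
        state = (item, memory)
        assert state not in seen_states   # the *full* state never revisits itself
        seen_states.add(state)
        offer = accepts[item]
        if offer in memory:               # detector: refuse trades that close a loop
            break
        memory = memory | {item}
        item = offer
    return item

print(run("A", 10))  # trades A->B->C, then refuses the trade back to A
```

The losses stop after one partial lap around the cycle, which is the "single instance is fine" intuition; the detector's memory is exactly the internal state the reply points at.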
Cartographic Processes

> or we need some kind of controller to make the system behave-as-if the causal arrows line up.

This seems like a toe-hold for thinking about counterfactuals. i.e. counterfactuals as recomputing over a causal graph with an arrow flipped or the coarse graining bucketed differently.

Towards an Intentional Research Agenda

Another direction that has been stubbornly resisting crystallizing is the idea that Goodharting is a positive feature in adversarial environments, via something like granting ε-differential privacy while still allowing you to coordinate with others by taking advantage of one-way functions, i.e. hashing shared intents to avoid adversarial pressure. This would make this sort of work part of a billion-year arms race where one side attempts to reverse-engineer signaling mechanisms while the other side tries to obfuscate them to prevent the current signaling fr... (read more)
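
The hashing idea can be made concrete as a standard commit-reveal scheme (my illustration, not a claim about how biological signaling actually works): coordinators publish only a hash of the shared intent, which an adversary cannot optimize against without inverting the one-way function:

```python
import hashlib
import secrets

def commit(intent):
    """Publish only the digest; the salt blocks dictionary attacks on short intents."""
    salt = secrets.token_hex(16)
    digest = hashlib.sha256((salt + intent).encode()).hexdigest()
    return digest, salt

def verify(digest, salt, intent):
    """On reveal, anyone can check the published digest matches the intent."""
    return hashlib.sha256((salt + intent).encode()).hexdigest() == digest

digest, salt = commit("meet at the north gate at dawn")
# Coordinating parties later reveal (salt, intent); third parties can verify:
assert verify(digest, salt, "meet at the north gate at dawn")
assert not verify(digest, salt, "meet at the south gate at dawn")
```

During the commitment window the adversary sees only the digest, so there is nothing legible to apply optimization pressure against, which is the Goodhart-resistance being gestured at.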

Tabooing 'Agent' for Prosaic Alignment

Motivating example: consider a primitive bacterium with a thermostat, a light sensor, and a salinity detector, each of which has functional mappings to some movement pattern. You could say this system has a 3 dimensional map and appears to search over it.

Towards an Intentional Research Agenda

This is a good question. I think ways of thinking about Marr's levels itself might be underdetermined and therefore worth trying to crux on. Let's take the example of birds again. On the implementation level we can talk about the physical systems of a bird interacting with its environment. On the algorithmic level we can talk about patterns of behavior supported by the physical environment that allow the bird to do certain tasks. On the computational (intentional) level we can talk about why those tasks are useful in terms of some goal architectu... (read more)

When do utility functions constrain?

> shifting from thinking about utility to ability to get utility lets us formally understand instrumental convergence (sequence upcoming, so no citation yet)

Really looking forward to this! Strongly agree that it seems important.

Torture and Dust Specks and Joy--Oh my! or: Non-Archimedean Utility Functions as Pseudograded Vector Spaces

I've tried intuitive approaches to thinking along these lines which failed so it's really nice to see a serious approach. I see this as key anti-moloch tech and want to use it to think about rivalrous and non-rivalrous goods.

Tabooing 'Agent' for Prosaic Alignment

(black) hat tip to johnswentworth for the notion that the choice of boundary for the agent is arbitrary in the sense that you can think of a thermostat optimizing the environment or think of the environment as optimizing the thermostat. Collapsing sensor/control duality for at least some types of feedback circuits.

1 · Hjalmar Wijk · 2y: These sorts of problems are what caused me to want a presentation which didn't assume well-defined agents and boundaries in the ontology, but I'm not sure how it applies to the above - I am not looking for optimization as a behavioral pattern but as a concrete type of computation, which involves storing world-models and goals and doing active search for actions which further the goals. Neither a thermostat nor the world outside seem to do this from what I can see? I think I'm likely missing your point.
Research Agenda v0.9: Synthesising a human's preferences into a utility function

It seems like meta preferences take into account the lack of self knowledge of the utility function pretty well. It throws flags on maximizing and tries to move slower/collect more data when it recognizes it is in a tail of its current trade off model. i.e. it has a 'good enough' self model of its own update process.

Alignment Research Field Guide

This is fantastic stuff. Nice to see others independently coming up with the transmitters and receivers model. Also, the structure mentioned in 3a resonates strongly for me with the people groping towards some sense that Circling type skills seem to be useful for rationality but couldn't quite put their finger on why. My experience is that Circling with good facilitators enables exactly the kinds of things seen in 3a.

Two things that we've found useful at QRI that may apply:

1. A Slack or Slack-like thing (Keybase is nice for the additional secur... (read more)

Bridging syntax and semantics with Quine's Gavagai

Consider the mapping between a physical system and its representation. There are degrees of freedom in how the mapping is done. We should like the invariant parts of the representation to correspond to invariant parts of the physical system, and likewise with variant parts. We'd like the variant parts to vary continuously if they vary continuously in the physical system, and likewise for discretely. Some representations are tighter in that they have such type matching along more dimensions. A sparse representation that only captures some of the causal ... (read more)

1 · Stuart Armstrong · 3y: Thanks, that was a useful way to think of things.