abramdemski — AI Alignment Forum

Judgements: Merging Prediction & Evidence

Let's look at a specific example: the Allais paradox. (See page 9 of the TDT paper (page 12 of the pdf) for the treatment I'm referencing.)

It is not plausible to me that the commonly-labeled-irrational behavior in the Allais paradox arises from a desire to be money-pumped. It seems more plausible, for example, that it arises from a cognitive heuristic which makes decisions by identifying the most relevant dimensions along which options differ, weighing how significant the various differences feel, and combining those results to make a decision. Moving from 100% probability to 33/34 probability feels significant because we are moving from certainty to uncertainty. The reduction in total payout feels insignificant compared to this. In contrast, moving from 34/100 to 33/100 feels insignificant compared to the reduction in total payout.
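To make the structure of the two choices concrete, here is a minimal sketch. The payouts (24,000 and 27,000) are an assumption, borrowed from a commonly quoted version of the paradox; only the probabilities appear in the text above, and the utility functions are hypothetical. The sketch checks that the second pair of gambles is the first pair with every winning probability scaled by 34/100, so an expected-utility maximizer ranks both pairs the same way whatever its utility function.

```python
from fractions import Fraction
import math

# Assumed payouts (hypothetical; the text above only gives the probabilities).
SURE, RISKY = 24_000, 27_000

# Each gamble is (probability of winning, payout); losing pays 0.
pair_1 = {"A": (Fraction(1), SURE), "B": (Fraction(33, 34), RISKY)}
pair_2 = {"A": (Fraction(34, 100), SURE), "B": (Fraction(33, 100), RISKY)}

def expected_utility(gamble, utility):
    p, payout = gamble
    return p * utility(payout) + (1 - p) * utility(0)

def preferred(pair, utility):
    """Name of the gamble with the higher expected utility."""
    return max(pair, key=lambda name: expected_utility(pair[name], utility))

# Pair 2 is pair 1 with each winning probability multiplied by 34/100.
scale = Fraction(34, 100)
assert all(pair_2[k][0] == scale * pair_1[k][0] for k in pair_1)

# A few hypothetical utility functions, from risk-neutral to risk-averse.
utilities = {
    "linear": lambda x: x,
    "sqrt": math.sqrt,
    "log": math.log1p,
}

for name, u in utilities.items():
    print(name, preferred(pair_1, u), preferred(pair_2, u))
```

Each utility function picks the same letter in both pairs, even though different utility functions disagree with each other. The typical human pattern (A in the first pair, B in the second) cannot be produced by any single utility function in this setup, which is why it needs a separate explanation such as the heuristic described above.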

Of course, this is still consistent with a biases-as-values analysis. EG, we can treat the heuristic weights I mention as values rather than mere heuristics. Or, reaching for a different explanation, we can say that we don't want to feel like a fool in the case that we choose 33/34 and lose, when we could have had certainty. Probabilities are subjective, so no matter how much we're assured 33/34 is the true probability, we can imagine a friend with a different evaluation of the odds who finds our decision foolish. Humans evolved to avoid such criticism. A statement of 100% probability is, in some sense, taken more seriously than a statement of near-100% probability. In that case, if we lose anyway, we can blame the person who told us it was 100%, so we are absolved from any potential feeling of embarrassment. In the 33/100 vs 34/100 version, there is no such effect.

I want to say something like "the optimal resource-maximizing policy is an illusion" though. Like, it is privileging some sort of reference frame. In economics, profit maximization privileges the wellbeing of the shareholders. A more holistic view would treat all parties involved as stakeholders (employees, owners, customers, and even local communities where the company operates) and treat corporate policy as a bargaining problem between those stakeholders. This would better reflect long-term viability of strategies. (Naively profit-maximizing behavior has a tendency to create high turnover in employees, drive away customers, and turn local communities against the company.)

So yes, you can view everything as values, but I would include "resource-maximizing" in that as well.

A further question: what's at stake when you classify something as 'values'?

EG, in the Allais paradox, one thing that's at stake is whether the 'irrational' person should change their answer to be rational.

Alignment Research Field Guide

abramdemski3mo20

The name was by analogy to TEDx, yes. MIRI was running official MIRI workshops and we (Scott Garrabrant, me, and a few others) wanted to run similar events independently. We initially called them "mini miri workshops" or something like that, and MIRI got in touch to ask us not to call them that since it insinuates that MIRI is running them. They suggested "MIRIx" instead.

A simple example of conditional orthogonality in finite factored sets

abramdemski4mo20

I'm trying to understand the second clause for conditional histories better.

The first clause is very intuitive, and in some sense, exactly what I would expect. I understand it as basically saying that $h(X|E)$ drops elements from $h(X)$ which can be inferred from $E$. Makes a kind of sense!

However, if that were the end of the story, then conditional histories would obviously be the wrong tool for defining conditional orthogonality. Conditional orthogonality is supposed to tell us about conditional independence in the probability distribution. However, we know from causal graphs that conditioning can create dependence. EG, in the Bayes net $A \to B \leftarrow C$, A and C are independent, but if we condition on B, A and C become dependent. Therefore, conditional histories need to grow somehow. The second clause in the definition can be seen as artificially adding things to the history in order to represent that A and C have lost their independence.
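The "conditioning creates dependence" phenomenon can be checked by brute-force enumeration. This is a minimal sketch under assumed details: A and C are independent fair coins and the collider is B = A OR C (the mechanism is a hypothetical choice; any non-trivial one would do). It says nothing about factored sets themselves, only about the probabilistic fact the second clause has to account for.

```python
from fractions import Fraction
from itertools import product

# Joint distribution over (a, b, c) for the collider A -> B <- C.
# Assumption: A and C are independent fair coins, and B = A OR C.
joint = {(a, a | c, c): Fraction(1, 4) for a, c in product([0, 1], repeat=2)}

def prob(event, given=lambda a, b, c: True):
    """P(event | given), computed exactly from the joint table."""
    den = sum(p for outcome, p in joint.items() if given(*outcome))
    num = sum(p for outcome, p in joint.items()
              if given(*outcome) and event(*outcome))
    return num / den

a_true = lambda a, b, c: a == 1
c_true = lambda a, b, c: c == 1
both = lambda a, b, c: a == 1 and c == 1
b_true = lambda a, b, c: b == 1

# Unconditionally, A and C are independent: P(A, C) = P(A) * P(C).
print(prob(both), prob(a_true) * prob(c_true))

# Conditioned on the collider B, they are not: P(A, C | B) != P(A | B) * P(C | B).
print(prob(both, b_true), prob(a_true, b_true) * prob(c_true, b_true))
```

Given B = 1, learning that A = 0 forces C = 1, so the two coins now carry information about each other.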

What I don't yet see is how to relate these phenomena in detail. I find it surprising that the second clause only depends on E, not on X. It seems important to note that we are not simply adding the history of E^[1] into the answer. Instead, it asks that the history of E itself factors into the part within $h(X|E)$ and the part outside. If E and X are independent, then only the first clause comes into play. So the implications of the second clause do depend on X, even though the clause doesn't mention X.

So, is there a nice way to see how the second clause adds an "artificial history" to capture the new dependencies which X might gain when we condition on E?

@Scott Garrabrant

^[1] In this paragraph, I am conflating the set $E \subseteq S$ with the partition $\{E, S - E\}$.

Policy Entropy, Learning, and Alignment (Or Maybe Your LLM Needs Therapy)

abramdemski4mo20

This mostly made sense to me. I agree that it is a tricky question with a lot of moving pieces. In a typical RL setting, low entropy does imply low learning, as observed by Cui et al. One reason for this is that exploration is equated with randomness. RL typically works with point-estimates only, so the learning system does not track multiple hypotheses to test between. This prevents deterministic exploration strategies like VoI exploration, which explore based on the potential for gaining information, rather than just randomly.

My main point here is just to point out all these extra assumptions which are needed to make a strict connection between entropy and adaptability, making the observed empirical connection more empirical-only, IE not a connection which holds in all corner cases we can come up with.

However, I may be a bit more prone to think of humans as exploring intelligently than you are, IE, forming hypotheses and taking actions which test them, rather than just exploring by acting randomly.

I also don't buy this part:

"And the last piece, entropy being subjective, would be just the point of therapy and some of the interventions described in the other recent RLHF+ papers."

My concern isn't that you're anthropomorphizing the LLM, but rather, that you may be anthropomorphizing it incorrectly. The learned policy may have close to zero entropy, but that doesn't mean that the LLM can predict its own actions perfectly ahead of time from its own subjective perspective. Meaning, my argument that adaptability and entropy are connected is a distinct phenomenon from the one noted by Cui, since the notions of entropy are different (mine being a subjective notion based on the perspective of the agent, and Cui's being a somewhat more objective one based on the randomization used to sample behaviors from the LLM).

(Note: your link for the paper by Cui et al currently points back to this post, instead.)

Policy Entropy, Learning, and Alignment (Or Maybe Your LLM Needs Therapy)

abramdemski4mo20

"A teacher who used the voice of authority exactly when appropriate, rather than inflexibly applying it in every case, could have zero entropy and still be very adaptive/flexible." I'm not sure I would call this teacher adaptable. I might call them adapted in the sense that they're functioning well in their current environment, but if the environment changed in some way (so that actions in the current state no longer led to the same range of consequences in later states), they would fail to adapt. (Horney would call this person neurotic but successful.)

So, in this scenario, what makes the connection between higher entropy and higher adaptability? Earlier, I mentioned that lower entropy could spoil exploration, which could harm one's ability to learn. However, the optimal exploration policy (from a Bayesian perspective) is actually zero entropy, because it maximizes value of information (whereas introducing randomness won't do that, unless multiple actions happen to be tied for value of information).
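Here is a minimal sketch of what zero-entropy exploration can look like. Everything in it is an assumption made for illustration: a two-armed Bernoulli bandit, Beta posteriors over each arm's success rate, and a one-step (myopic) value-of-information score rather than the full Bayes-optimal calculation. The point is only that the exploratory choice is an argmax over beliefs, so it is deterministic unless two arms tie.

```python
from fractions import Fraction

def mean(a, b):
    """Posterior mean of a Beta(a, b) belief about an arm's success rate."""
    return Fraction(a, a + b)

def myopic_voi(arms, i):
    """Expected gain in the best posterior mean from one more pull of arm i."""
    a, b = arms[i]
    m = mean(a, b)
    best_other = max(mean(*arm) for j, arm in enumerate(arms) if j != i)
    best_now = max(m, best_other)
    after_success = max(mean(a + 1, b), best_other)  # happens with probability m
    after_failure = max(mean(a, b + 1), best_other)  # happens with probability 1 - m
    return m * after_success + (1 - m) * after_failure - best_now

def explore(arms):
    """Deterministic exploration: pull the arm with the highest myopic VoI.

    Ties are broken by index here; randomizing would only matter in that case.
    """
    return max(range(len(arms)), key=lambda i: myopic_voi(arms, i))

# Hypothetical beliefs: arm 0 is barely explored, arm 1 is well known.
arms = [(1, 1), (50, 50)]
print([myopic_voi(arms, i) for i in range(len(arms))])
print(explore(arms))
```

Both arms have the same posterior mean, but the barely-explored arm has far more to teach, so the rule always pulls it. The policy explores effectively while assigning probability 1 to a single action given the current beliefs.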

The point being that if the environment changes, the teacher doesn't strictly need to introduce entropy into their policy in order to adapt. That's just a common and relatively successful method.

However, it bears mentioning that entropy is of course subjective; we might need to ask from whose perspective we are measuring the entropy. Dice have low entropy from the perspective of a physics computation which can predict how they will land, but high entropy from the perspective of a typical human who cannot. An agent facing a situation they don't know how to think about yet necessarily has high entropy from their own perspective, since they haven't yet figured out what they will do. In this sense, at least, there is a strict connection between adaptability and entropy.

Policy Entropy, Learning, and Alignment (Or Maybe Your LLM Needs Therapy)

abramdemski4mo30

Entropy isn't going to be a perfect measure of adaptability, eg,

"Or think back to our teacher who uses a voice of authority, even if it's false, to answer every question: this may have been a good strategy when they were a student teacher. Just like the reasoning LLM in its 'early training stage,' the probability of choosing a new response (choosing not to people please; choosing to admit uncertainty) drops, i.e., policy entropy collapses, and the LLM or person has developed a rigid habitual behavior."

A teacher who used the voice of authority exactly when appropriate, rather than inflexibly applying it in every case, could have zero entropy and still be very adaptive/flexible.
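A small sketch of why per-context policy entropy cannot separate these two teachers. The contexts, action names, and the equal weighting of contexts are all hypothetical; the entropy is the usual RL-style entropy of the action distribution given the context.

```python
import math

def entropy_bits(dist):
    """Shannon entropy (in bits) of a dict mapping actions to probabilities."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Hypothetical policies: context -> distribution over actions.
flexible = {
    "knows_answer": {"authoritative": 1.0},
    "unsure": {"admit_uncertainty": 1.0},
}
rigid = {
    "knows_answer": {"authoritative": 1.0},
    "unsure": {"authoritative": 1.0},
}

def policy_entropy(policy):
    """Average entropy of the action given the context (contexts equally likely)."""
    return sum(entropy_bits(d) for d in policy.values()) / len(policy)

def distinct_actions(policy):
    """Set of actions the policy ever takes across contexts."""
    return {a for d in policy.values() for a, p in d.items() if p > 0}

for name, policy in [("flexible", flexible), ("rigid", rigid)]:
    print(name, policy_entropy(policy), sorted(distinct_actions(policy)))
```

Both policies are deterministic given the context, so both have zero policy entropy. What differs is whether the action depends on the context at all, which entropy alone does not measure.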

However, the connection you're making does generally make sense. EG, a model which has already eliminated a lot of potential behaviors from its probability distribution isn't going to explore very well for RL training. I also intuitively agree that this is related to alignment problems, although to be clear I doubt that solving this problem alone would avert serious AI risks.

Yoshua Bengio claims to have the solution to this problem: Generative Flow Networks (which also fit into his larger program to avert AI risks). However, I haven't evaluated this in detail. The main claim as I understand it is about training to produce solutions proportional to their reward, instead of training to produce only high-reward outputs.
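To illustrate the contrast between the two training targets (and only that; this is not a GFlowNet and involves no training), here is a sketch with made-up rewards for three hypothetical candidate solutions. It compares the distribution that puts all mass on the highest-reward output with the one that samples in proportion to reward.

```python
import math

def entropy_bits(dist):
    """Shannon entropy (in bits) of a dict mapping outcomes to probabilities."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Hypothetical rewards for three candidate solutions.
rewards = {"solution_a": 4.0, "solution_b": 3.0, "solution_c": 1.0}

def reward_maximizing(rewards):
    """Target that puts all probability on the single best-scoring output."""
    best = max(rewards, key=rewards.get)
    return {k: 1.0 if k == best else 0.0 for k in rewards}

def reward_proportional(rewards):
    """Target that samples each output with probability proportional to reward."""
    total = sum(rewards.values())
    return {k: r / total for k, r in rewards.items()}

for target in (reward_maximizing, reward_proportional):
    dist = target(rewards)
    print(target.__name__, dist, entropy_bits(dist))
```

The proportional target keeps the near-best solution in play instead of discarding it, so the resulting distribution retains entropy roughly in line with how many outputs are actually good.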

It feels like a lot of what you're bringing up is tied to a sort of "shallowness" or "short-sightedness" more closely than entropy. EG, always saying yes to go to the bar is low-entropy, but not necessarily lower entropy than the correct strategy (EG, always saying yes unless asked by someone you know to have an alcohol problem, in which case always no). What distinguishes it is short-sightedness (you mention short-term incentives like how the friend reacts immediately in conversation) and a kind of simplicity (always saying yes is a very context-free pattern, easy to learn).

I'm also reminded of Richard Watson's talk Songs of Life and Mind, where he likens adaptive systems to mechanical deformations which are able to snap back when the pressure is released, vs modern neural networks which he likens to crumpled paper or a fallen tower. (See about 28 minutes into the video, although it might not make a lot of sense without more context.)

abramdemski's Shortform

abramdemski6mo20

This sort of approach doesn't make so much sense for research explicitly aiming at changing the dynamics in this critical period. Having an alternative, safer idea almost ready-to-go (with some explicit support from some fraction of the AI safety community) is a lot different from having some ideas which the AI could elaborate.

abramdemski's Shortform

abramdemski6mo20

"The pre-training phase is already finding a mesa-optimizer that does induction in context. I usually think of this as something like Solomonoff induction with a good inductive bias, but probably you would expect something more like logical induction. I expect the answer to be somewhere in between."

I don't personally imagine current LLMs are doing approximate logical induction (or approximate Solomonoff) internally. I think of the base model as resembling a circuit prior updated on the data. The circuits that come out on top after the update also do some induction of their own internally, but it is harder to think about what form of inductive bias they have exactly (it would seem like a coincidence if it also happened to be well-modeled as a circuit prior, but it must be something highly computationally limited like that, as opposed to Solomonoff-like).

I hesitate to call this a mesa-optimizer. Although good epistemics involves agency in principle (especially time-bounded epistemics), I think we can sensibly differentiate between mesa-optimizers and mere mesa-induction. But perhaps you intended this stronger reading, in support of your argument. If so, I'm not sure why you believe this. (No, I don't find "planning ahead" results to be convincing -- I feel this can still be purely epistemic in a relevant sense.)

Perhaps it suffices for your purposes to observe that good epistemics involves agency in principle?

Anyway, cutting more directly to the point:

I think you lack imagination when you say

"[...] which can realistically compete with modern LLMs would ultimately look a lot like a semi-theoretically-justified modification to the loss function or optimizer of agentic fine-tuning / RL or possibly its scaffolding [...]"

I think there are neural architectures close to the current paradigm which don't directly train whole chains-of-thought on a reinforcement signal to achieve agenticness. This paradigm is analogous to model-free reinforcement learning. What I would suggest is more analogous to model-based reinforcement learning, with corresponding benefits to transparency. (Super speculative, of course.)

abramdemski's Shortform

abramdemski6mo29-1

Here's what seem like priorities to me after listening to the recent Dwarkesh podcast featuring Daniel Kokotajlo:

1. Developing safer AI tech (in contrast to modern generative AI) so that frontier labs have an alternative technology to switch to, so that it is lower cost for them to start taking warning signs of misalignment of their current tech tree seriously. There are several possible routes here, ranging from small tweaks to modern generative AI, to scaling up infrabayesianism (existing theory, totally groundbreaking implementation) to starting totally from scratch (inventing a new theory). Of course we should be working on all routes, but prioritization depends in part on timelines.

I see the game here as basically: look at the various existing demos of unsafety and make a counter-demo which is safer on several of these metrics without having gamed the metrics.

2. De-agentify the current paradigm or the new paradigm:

- Don't directly train on reinforcement across long chains of activity. Find other ways to get similar benefits.
- Move away from a model where the AI is personified as a distinct entity (eg, chatbot model). It's like the old story about building robot arms to help feed disabled people -- if you mount the arm across the table, spoonfeeding the person, it's dehumanizing; if you make it a prosthetic, it's humanizing.
  - I don't want AI to write my essays for me. I want AI to help me get my thoughts out of my head. I want super-autocomplete. I think far faster than I can write or type or speak. I want AI to read my thoughts & put them on the screen.
    - There are many subtle user interface design questions associated with this, some of which are also safety issues, eg, exactly what objective do you train on?
  - Similarly with image generation, etc.
  - I don't necessarily mean brain-scanning tech here, but of course that would be the best way to achieve it.
  - Basically, use AI to overcome human information-processing bottlenecks instead of just trying to replace humans. Putting humans "in the loop" more and more deeply instead of accepting/assuming that humans will iteratively get sidelined.

Notes on countermeasures for exploration hacking (aka sandbagging)

abramdemski6mo20

"You need to ensure substantial probability on exploring good strategies which is a much stronger property than just avoiding mode collapse. (Literally avoiding mode collapse and assigning some probability to all good actions is easy - just add a small weight on a uniform prior over tokens like they did in old school Atari RL.)"

Yeah, what I really had in mind with "avoiding mode collapse" was something more complex, but it seems tricky to spell out precisely.
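The "small weight on a uniform prior" trick from the quoted passage can be sketched as follows. The four-action policy and the value of `eps` are hypothetical. The sketch shows why this only avoids mode collapse in the literal sense: every action gets nonzero probability, but each neglected action gets only `eps / n`, which is far from "substantial probability on exploring good strategies".

```python
import math

def mix_with_uniform(policy, eps):
    """Blend a policy with the uniform distribution: (1 - eps) * policy + eps * uniform."""
    n = len(policy)
    return [(1 - eps) * p + eps / n for p in policy]

# Hypothetical fully collapsed policy over four actions.
collapsed = [1.0, 0.0, 0.0, 0.0]
mixed = mix_with_uniform(collapsed, eps=0.01)

print(mixed)
print(math.isclose(sum(mixed), 1.0))  # still a probability distribution
print(min(mixed))                     # floor on every action: eps / n
```

With a realistic token vocabulary, `n` is in the tens of thousands per step and a good strategy spans many steps, so the probability of stumbling onto it this way shrinks multiplicatively.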

"Even if the model is superhuman, if we do a good job with exploration hacking countermeasures, then the model might need to be extremely confident humans wouldn't be able to do something to avoid exploring it."

It's an interesting point, but where does the "extremely" come from? Seems like if it thinks there's a 5% chance humans explored X, but (if not, then) exploring X would force it to give up its current values, it could be a very worthwhile gamble. Maybe I'm unclear on the rules of the game as you're imagining them.
