Look at the paper linked for more details (https://arxiv.org/abs/1712.05812).
Basically, "humans are always fully rational and always take the action they want to" is a full explanation of all of human behaviour, and one that is strictly simpler than any explanation which includes human biases and bounded rationality.
But if you are expecting a 100% guarantee that the uncertainty metrics will detect every possible bad situation
I'm thinking more of how we could automate navigating these situations. The detection will be part of this process, and it's not a Boolean yes/no, but a matter of degree.
I agree that once you have landed in the bad situation, mitigation options might be much the same, e.g. switch off the agent.
I'm most interested in mitigation options the agent can take itself, when it suspects it's out-of-distribution (and without being turned off, ideally).
Thanks! Lots of useful insights in there.
So I might classify moving out-of-distribution as something that happens to a classifier or agent, and model splintering as something that the machine learning system does to itself.
Why do you think it's important to distinguish these two situations? It seems that the insights for dealing with one situation may apply to the other, and vice versa.
Cheers! My opinion on category theory has changed a bit, because of this post; by making things fit into the category formulation, I developed insights into how general relations could be used to connect different generalised models.
I did posts on generalised models as a category and on how one can see Cartesian frames as generalised models.
Partial probability distribution
A concept that's useful for some of my research: a partial probability distribution.
That's a $Q$ that defines $Q(A \mid B)$ for some but not all $A$ and $B$ (with $Q(A)$ being $Q(A \mid \Omega)$, for $\Omega$ the whole set of outcomes).
This $Q$ is a partial probability distribution iff there exists a probability distribution $P$ that is equal to $Q$ wherever $Q$ is defined. Call this $P$ a full extension of $Q$.
Suppose that $Q(A \mid B)$ is not defined. We can, however, say that $Q(A \mid B) = x$ is a logical implication of $Q$ if all full extensions $P$ have $P(A \mid B) = x$.
Eg: ...
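A minimal sketch (entirely my own construction, names and all) of how one could check the "full extension exists" condition over a finite outcome space: each defined value $Q(A \mid B) = q$ becomes the linear constraint $P(A \cap B) = q \cdot P(B)$, so existence of a full extension reduces to an LP feasibility question.

```python
# Sketch: does a partial probability distribution Q admit a full extension P?
# Outcomes are indexed 0..n-1; each constraint (A, B, q) encodes Q(A|B) = q,
# i.e. the linear condition P(A ∩ B) - q * P(B) = 0.
import numpy as np
from scipy.optimize import linprog

def has_full_extension(n_outcomes, constraints):
    A_eq, b_eq = [], []
    for A, B, q in constraints:
        row = np.zeros(n_outcomes)
        for w in B:
            row[w] -= q          # -q * P(B)
        for w in A & B:
            row[w] += 1.0        # + P(A ∩ B)
        A_eq.append(row)
        b_eq.append(0.0)
    A_eq.append(np.ones(n_outcomes))   # probabilities must sum to 1
    b_eq.append(1.0)
    res = linprog(np.zeros(n_outcomes), A_eq=np.array(A_eq),
                  b_eq=np.array(b_eq), bounds=[(0, 1)] * n_outcomes)
    return res.success

# Example: four outcomes, Q({0,1}) = 0.5 and Q({0} | {0,1}) = 0.4 -> consistent.
print(has_full_extension(4, [({0, 1}, {0, 1, 2, 3}, 0.5),
                             ({0}, {0, 1}, 0.4)]))
```

Bounding an undefined $Q(A \mid B)$ over all full extensions (the "logical implication" notion above) would similarly reduce to optimising over the same feasible set.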
I like it. I'll think about how it fits with my ways of thinking (eg model splintering).
Cheers; Rebecca likes the "instrumental control incentive" terminology; she claims it's more in line with control theory terminology.
We agree that lack of control incentive on X does not mean that X is safe from influence from the agent, as it may be that the agent influences X as a side effect of achieving its true objective. As you point out, this is especially true when X and a utility node are probabilistically dependent.
I think it's more dangerous than that. When there is mutual information, the agent can learn to behave as if it were specifically manipulating X; the counterfactual approach doesn't seem to do what it was intended to do.
(I don't think I can explain why here, though I am working on a longer explanation of what framings I like and why.)
Cheers, that would be very useful.
(I do think ontological shifts continue to be relevant to my description of the problem, but I've never been convinced that we should be particularly worried about ontological shifts, except inasmuch as they are one type of possible inner alignment / robustness failure.)
I feel that the whole AI alignment problem can be seen as problems with ontological shifts: https://www.lesswrong.com/posts/k54rgSg7GcjtXnMHX/model-splintering-moving-from-one-imperfect-model-to-another-1
Express, express away ^_^
The connection to AI alignment is combining the different utilities of different entities without extortion ruining the combination, and dealing with threats and acausal trade.
That's a misspelling that's entirely my fault, and has now been corrected.
(1) You say that releasing nude photos is in the blackmail category. But who's the audience?
The other people of whom you have nude photos, who are now incentivised to pay up rather than kick up a fuss.
(2) For n=1, m large: Is an example of brinkmanship here a monopolistic buyer who will only choose suppliers giving cut-rate prices?
Interesting example that I hadn't really considered. I'd say it fits more under extortion than brinkmanship, though. A small supplier has to sell, or they won't stay in business. If there's a single buyer, "I won't buy ...
A boundedly-rational agent is assumed to be mostly rational, failing to be fully rational because of a failure to figure things out in enough detail.
Humans are occasionally rational, often biased, often inconsistent, sometimes consciously act against their best interests, often follow heuristics without thinking, sometimes do think things through. This doesn't seem to correspond to what is normally understood as "boundedly-rational".
that paper was about fitting observations of humans to a mathematical model of "boundedly-rational agent pursuing a utility function"
It was "any sort of agent pursuing a reward function".
We don't need a special module to get an everyday definition of doorknobs, and likewise I don't think we need a special module to get an everyday definition of human motivation.
I disagree. Doorknobs exist in the world (even if the category is loosely defined, and has lots of edge cases), whereas goals/motivations are interpretations that we put upon agents. The main result of the Occam's razor paper is that the goals of an agent are not something you can know without putting your own interpretation on them - even if you know every physical...
Cool, good summary.
Humans have a theory of mind that makes certain types of modularization easier. That doesn't mean that the same modularization is simple for an agent that doesn't share that theory of mind.
Then again, it might be. This is worth digging into empirically. See my post on the optimistic and pessimistic scenarios; in the optimistic scenario, preferences, human theory of mind, and all the other elements, are easy to deduce (there's an informal equivalence result; if one of those is easy to deduce, all the others are).
So we need to figure out if we're in the optimistic or the pessimistic scenario.
My understanding of the OP was that there is a robot [...]
That understanding is correct.
Then my question was: what if none of the variables, functions, etc. corresponds to "preferences"? What if "preferences" is a way that we try to interpret the robot, but not a natural subsystem or abstraction or function or anything else that would be useful for the robot's programmer?
I agree that preferences are a way we try to interpret the robot (and how we humans try to interpret each other). The programmer themselves could label the variables; but it's also po...
modularization is super helpful for simplifying things.
The best modularization for simplification will likely not correspond to the best modularization for distinguishing preferences from other parts of the agent's algorithm (that's the "Occam's razor" result).
but the function f is not part of the algorithm, it's only implemented by us onlookers. Right?
Then isn't that just a model at another level, a (labelled) model in the heads of the onlookers?
Thanks! Useful insights in your post, to mull over.
An imminent incoming post on this very issue ^_^
Cool, neat summary.
Very good. A lot of potential there, I feel.
Why do the absolute values cancel?
Because the quantity inside is non-negative, so you can remove the absolute values.
I also think the pedestrian example illustrates why we need more semantic structure: "pedestrian alive" -> "pedestrian dead" is bad, but "pigeon on road" -> "pigeon in flight" is fine.
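A toy sketch of what "more semantic structure" could look like here (feature names and weights are made up for illustration): weight state changes by how much we care about them, rather than counting every change equally.

```python
# Toy illustration: a state-change penalty with hand-assigned semantic weights.
SEMANTIC_WEIGHTS = {
    "pedestrian_alive": 1000.0,  # flipping this is catastrophic
    "pigeon_on_road": 0.0,       # flipping this is fine
    "car_position": 0.1,         # mild preference against moving things around
}

def semantic_penalty(before, after):
    """Sum the weights of all features that changed between two states."""
    return sum(w for feature, w in SEMANTIC_WEIGHTS.items()
               if before[feature] != after[feature])

s0 = {"pedestrian_alive": True,  "pigeon_on_road": True,  "car_position": 3}
s1 = {"pedestrian_alive": False, "pigeon_on_road": False, "car_position": 4}
print(semantic_penalty(s0, s1))  # 1000.1 - dominated by the pedestrian term
```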
I think this shows that the step-wise inaction penalty is time-inconsistent: https://www.lesswrong.com/posts/w8QBmgQwb83vDMXoz/dynamic-inconsistency-of-the-stepwise-inaction-baseline
Thanks! Good insights there. Am reproducing the comment here for people less willing to click through:
I haven't read the literature on "how counterfactuals ought to work in ideal reasoners" and have no opinion there. But the part where you suggest an empirical description of counterfactual reasoning in humans, I think I basically agree with what you wrote.
I think the neocortex has a zoo of generative models, and a fast way of detecting when two are compatible, and if they are, snapping them together like Legos into a larger model.
...For example, the m...
(this is, obviously, very speculative ^_^ )
...which also means that they didn't have an empire to back them up?
Thanks for your research, especially the Afonso stuff. One question for that: were these empires used to gaining/losing small pieces of territory? I.e. did they really dedicate all their might to getting these ports back, or did they eventually write them off as minor losses not worth the cost of fighting (given Portuguese naval advantages)?
Based on what I recall reading about Pizarro's conquest, I feel you might be underestimating the importance of horses. It took centuries for European powers to figure out how to break a heavy cavalry charge with infantry; the Amerindians didn't have the time to figure it out (see various battles where small cavalry forces routed thousands of troops). Once they had got more used to horses, later Inca forces (though much diminished) were more able to win open battles against the Spanish.
Maybe this was the problem for these empires: they were used to winning...
My main note is that my comment was just about the concept of rigging a learning process given a fixed prior over rewards. I certainly agree that the general strategy of "update a distribution over reward functions" has lots of as-yet-unsolved problems.
Ah, ok, I see ^_^ Thanks for making me write this post, though, as it has useful things for other people to see that I had been meaning to write up for some time.
On your main point: if the prior and updating process are over things that are truly beyond the AI's influence, then there will be no rigging (...
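One rough way to state that condition (my formalisation, in the spirit of the unriggability definition, not a quote from the truncated comment): the learning process is unriggable when the expected posterior over rewards is the same under every policy,

$$\mathbb{E}_{h \sim \pi_1}\!\left[P(R \mid h)\right] \;=\; \mathbb{E}_{h \sim \pi_2}\!\left[P(R \mid h)\right] \quad \text{for all policies } \pi_1, \pi_2,$$

so no choice of policy can shift the learned reward distribution in expectation.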
...
I agree that for such a system, the optimal policy of the actor is to rig the estimator, and to "intentionally" bias it towards easy-to-satisfy rewards like "the human loves heroin".
The part that confuses me is why we're having two separate systems with different objectives where one system is dumb and the other system is smart.
We don't need to have two separate systems. There are two meanings to your "bias it towards" phrase: the first one is the informal human one, where "the human loves heroin" is clearly a bias. The second is some formal definition
...
I don't think critiques are necessarily bad ^_^
I think that might be a generally good critique, but I don't think it applies to this post (it may apply better to post #3 in the series).
I used "metal with knobs" and "beefy arm" as human-parsable examples, but the main point is detecting when something is out-off-distribution, which relies on the image being different in AI-detectable ways, not on the specifics of the categories I mentioned.
A key AI safety skill is moving back and forth, as needed, between "could we solve problem X if we assume Y?" and "can we assume Y?".
Suggested elsewhere by Max Daniel:
Suggested by Ozzie Gooen:
Have slightly rephrased to include this.
Thanks. I think we mainly agree here.