Thanks for the reply! I'll respond to the "Hold out sensors" section in this comment.
One assumption which seems fairly safe in my mind is that as the operators, we have control over the data our AI gets. (Another way of thinking about it is if we don't have control over the data our AI gets, the game has already been lost.)
Given that assumption, this problem seems potentially solvable.
Moreover, my AI may be able to deduce the presence of the additional sensors very cheaply. Perhaps it can notice the sensors, or it can learn about my past actions to get a hint about where I may have placed them. If this is possible, then “Predict the readings of all sensors” need not be much more complex than “Predict the readings of one sensor.”
If the SmartVault learns a policy from data which was all generated prior to the installation of the new sensors, it seems unlikely that policy would deliberately account for the existence of those specific new sensors. We could leave info about your past actions out of the dataset as well.
An alternative approach in response to this bit:
I’m concerned that I will learn a better version of the human simulator which predicts the readings of all sensors and then outputs what a human would infer from the complete set.
The scenario is: we're learning a function F1(A, S1) -> D where A is an action sequence, S1 is readings from the known sensor, and D is a diamond location. Previously we've discussed two functions which both achieve perfect loss on our training data:
D1(A, S1) -- a direct translator which takes A and S1 into account
H1(A, S1) -- a simulation of what a human would believe if they saw A and S1
Let's also consider two other functions:
D2(A, S1, S2) -- a direct translator which takes A, S1, and S2 into account
H2(A, S1, S2) -- a simulation of what a human would believe if they saw A, S1, and S2
Your concern is that there is a third function on the original (A, S1) domain which also achieves perfect loss:
H1'(A, S1) -- a function which predicts the readings of S2 from A and S1, then outputs what a human would believe if they saw A, S1, and those predicted S2 readings
Why would it be bad if gradient descent discovered H1'? Because then, when it comes time to learn a policy, we would incentivize policies which fool the predicted readings of S2 in addition to those of S1.
Here's an idea for obtaining a function on the original (A, S1) domain which does not incentivize policies which deceive S2:
Learn a function F2 on the expanded domain (A, S1, S2), using a training set which is carefully constructed so that the only way to achieve perfect loss is to do a good job of taking readings from S2 into account. (For example, deliberately construct scenarios where the readings from S2 are not what you would expect if you were only looking at A and S1, and make ignoring A & S1 in favor of S2 key to labeling those scenarios correctly.) F2 could be closer to either D2 or H2, I don't think it matters much.
Define a function F1_only(A, S1) = F2(A, S1, <hardcoded stream of boring S2 sensor data>).
Now let's use F1_only as the target for learning our policy. I argue a policy has no incentive to deceive S2, because we know that F2 has been optimized to trust its S2 argument over its A and S1 arguments regarding what is going on around S2, and when we call F2 through F1_only, its S2 argument will always be telling it there are no interesting readings coming from S2. So, no bonus points for a policy which tries to fool S2 in addition to S1.
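A minimal sketch of this construction in code (the names `F2`, `make_f1_only`, and the "boring" S2 stream are all hypothetical stand-ins, not anyone's actual implementation):

```python
# Sketch of the F1_only construction. F2 is assumed to be a function
# already trained on the expanded (A, S1, S2) domain, with a training
# set constructed so it trusts its S2 argument over A and S1.

def make_boring_s2(length):
    """Hardcoded stream of uneventful S2 readings (all-quiet audio, say)."""
    return [0.0] * length

def make_f1_only(f2, s2_length):
    """Restrict F2 to the original (A, S1) domain by pinning its S2 input."""
    boring_s2 = make_boring_s2(s2_length)
    def f1_only(actions, s1_readings):
        return f2(actions, s1_readings, boring_s2)
    return f1_only

# Toy F2 for illustration: report "diamond gone" iff S2 hears anything.
def toy_f2(actions, s1, s2):
    return "diamond gone" if any(x != 0.0 for x in s2) else "diamond present"

f1_only = make_f1_only(toy_f2, s2_length=5)
# Whatever the policy does to A and S1, F1_only's S2 input stays boring,
# so fooling S2 earns the policy nothing.
print(f1_only(["open door"], [1.0, 2.0]))  # -> "diamond present"
```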
Maybe there is some kind of unintended consequence to this weird setup; I just came up with it and it's still a bit half-baked in my mind. (Perhaps you could make some kind of exotic argument on the basis of inner optimizers and acausal trade between different system components?) But the meta point is there's a lot of room for creativity if you don't anthropomorphize and just think in terms of learning functions on datasets. I think the consequences of the "we control the data our AIs get" assumption could be pretty big if you're willing to grant it.
I wrote a post in response to the report: Eliciting Latent Knowledge Via Hypothetical Sensors.
Some other thoughts:
I felt like the report was unusually well-motivated when I put my "mainstream ML" glasses on, relative to a lot of alignment work.
ARC's overall approach is probably my favorite out of alignment research groups I'm aware of. I still think running a builder/breaker tournament of the sort proposed at the end of this comment could be cool.
Not sure if this is relevant in practice, but... the report talks about Bayesian networks learned via gradient descent. From what I could tell after some quick Googling, it doesn't seem all that common to do this, and it's not clear to me if there has been any work at all on learning the node structure (as opposed to internal node parameters) via gradient descent. It seems like this could be tricky because the node structure is combinatorial in nature and thus less amenable to a continuous optimization technique like gradient descent.
There was recently a discussion on LW about a scenario similar to the SmartVault one here. My proposed solution was to use reward uncertainty -- as applied to the SmartVault scenario, this might look like: "train lots of diverse mappings between the AI's ontology and that of the human; if even one mapping of a situation says the diamond is gone according to the human's ontology, try to figure out what's going on". IMO this general sort of approach is quite promising, interested to discuss more if people have thoughts.
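To make the "train lots of diverse mappings" idea concrete, here's a toy sketch (everything is hypothetical; `mappings` stands in for trained ontology translators):

```python
# Sketch: treat each trained AI-ontology -> human-ontology mapping as a
# voter. If even one mapping says the diamond is gone, escalate instead
# of acting -- a crude form of reward uncertainty.

def decide(mappings, ai_state):
    votes = [m(ai_state) for m in mappings]
    if any(v == "diamond gone" for v in votes):
        return "pause and investigate"
    return "proceed"

# Toy mappings that disagree on a suspicious state.
m1 = lambda s: "diamond present"
m2 = lambda s: "diamond gone" if s == "suspicious" else "diamond present"

print(decide([m1, m2], "normal"))      # -> "proceed"
print(decide([m1, m2], "suspicious"))  # -> "pause and investigate"
```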
(Well, really I expect it to take <12 months, but planning fallacy and safety margins and time to iterate a little and all that.)
There's also red teaming time, and lag in idea uptake/marketing, to account for. It's possible that we'll have the solution to FAI when AGI gets invented, but the inventor won't be connected to our community and won't be aware of/sold on the solution.
Edit: Don't forget to account for the actual engineering effort to implement the safety solution and integrate it with capabilities work. Ideally there is time for extensive testing and/or formal verification.
For what it's worth, I often find Eliezer's arguments unpersuasive because they seem shallow. For example:
The insight is in realizing that the hypothetical planner is only one line of outer shell command away from being a Big Scary Thing and is therefore also liable to be Big and Scary in many ways.
This seems like a fuzzy "outside view" sort of argument. (Compare with: "A loaded gun is one trigger pull away from killing someone and is therefore liable to be deadly in many ways." On the other hand, a causal model of a gun lets you explain which specific gun operations can be deadly and why.)
I'm not saying Eliezer's conclusion is false. I find other arguments for that conclusion much more persuasive, e.g. involving mesa-optimizers, because there is a proposed failure type which I understand in causal/mechanistic terms.
(I can provide other examples of shallow-seeming arguments if desired.)
As the proposal stands it seems like the AI's predictions of human thoughts would offer no relevant information about how the AI is predicting the non-thought story content, since the AI could be predicting these different pieces of content through unrelated mechanisms.
Might depend on whether the "thought" part comes before or after particular story text. If the "thought" comes after that story text, then it's generated conditional on that text, essentially a rationalization of that text from a hypothetical DM's point of view. If it comes before that story text, then the story is being generated conditional on it.
Personally I think I might go for a two-phase process. Do the task with a lot of transparent detail in phase 1. Summarize that detail and filter out infohazards in phase 2, but link from the summary to the detailed version so a human can check things as needed (flagging links to plausible infohazards). (I guess you could flag links to parts that seemed especially likely to be incorrigible/manipulative cognition, or parts of the summary that the summarizer was less confident in, as well.)
I'm glad you are thinking about this. I am very optimistic about AI alignment research along these lines. However, I'm inclined to think that the strong form of the natural abstraction hypothesis is pretty much false. Different languages and different cultures, and even different academic fields within a single culture (or different researchers within a single academic field), come up with different abstractions. See for example lsusr's posts on the color blue or the flexibility of abstract concepts. (The Whorf hypothesis might also be worth looking into.)
This is despite humans having pretty much identical cognitive architectures (and it seems unrealistic to assume we can create a de novo AGI whose cognitive architecture is as similar to a human brain as human brains are to each other). Perhaps you could argue that some human-generated abstractions are "natural" and others aren't, but that leaves the problem of ensuring that the human operating our AI is making use of the correct, "natural" abstractions in their own thinking. (Some ancient cultures lacked a concept of the number 0. From our perspective, and that of a superintelligent AGI, 0 is a 'natural' abstraction. But there could be ways in which the superintelligent AGI invents 'natural' abstractions that we haven't yet invented, such that we are living in a "pre-0 culture" with respect to those abstractions, and this would cause an ontological mismatch between us and our AGI.)
But I'm still optimistic about the overall research direction. One reason is if your dataset contains human-generated artifacts, e.g. pictures with captions written in English, then many unsupervised learning methods will naturally be incentivized to learn English-language abstractions to minimize reconstruction error. (For example, if we're using self-supervised learning, our system will be incentivized to correctly predict the English-language caption beneath an image, which essentially requires the system to understand the picture in terms of English-language abstractions. This incentive would also arise for the more structured supervised learning task of image captioning, but the results might not be as robust.)
This is the natural abstraction hypothesis in action: across the sciences, we find that low-dimensional summaries of high-dimensional systems suffice for broad classes of “far-away” predictions, like the speed of a sled.
Social sciences are a notable exception here. And I think social sciences (or even humanities) may be the best model for alignment--'human values' and 'corrigibility' seem related to the subject matter of these fields.
Anyway, I had a few other comments on the rest of what you wrote, but I realized what they all boiled down to was me having a different set of abstractions in this domain than the ones you presented. So as an object lesson in how people can have different abstractions (heh), I'll describe my abstractions (as they relate to the topic of abstractions) and then explain how they relate to some of the things you wrote.
I'm thinking in terms of minimizing some sort of loss function that looks vaguely like

`reconstruction_error + other_stuff`

where `reconstruction_error` is a measure of how well we're able to recreate observed data after running it through our abstractions, and `other_stuff` is the part that is supposed to induce our representations to be "useful" rather than just "predictive". You keep talking about conditional independence as the be-all-end-all of abstraction, but from my perspective, it is an interesting (potentially novel!) option for the `other_stuff` term in the loss function -- the same way dropout was once an interesting and novel `other_stuff` which helped supervised learning generalize better (making neural nets "useful" rather than just "predictive" on their training set).

The most conventional choice for `other_stuff` would probably be some measure of the complexity of the abstraction. E.g. a clustering algorithm's complexity can be controlled through the number of centroids, or an autoencoder's complexity can be controlled through the number of latent dimensions. Marcus Hutter seems to be as enamored with compression as you are with conditional independence, to the point where he created the Hutter Prize, which offers half a million dollars to the person who can best compress a 1GB file of Wikipedia text.

Another option for `other_stuff` would be denoising, as we discussed here.

You speak of an experiment to "run a reasonably-detailed low-level simulation of something realistic; see if info-at-a-distance is low-dimensional". My guess is that if the `other_stuff` in your loss function consists only of conditional independence terms, your representation won't be particularly low-dimensional -- it will see no reason to avoid using 100 practically-redundant dimensions when one would do the job just as well.
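To make the shape of this concrete, here's a toy linear-autoencoder sketch of the `reconstruction_error + other_stuff` decomposition (an L1 complexity penalty stands in for `other_stuff`; all numbers and names are made up for illustration):

```python
import numpy as np

# Sketch of loss = reconstruction_error + other_stuff for a linear
# "autoencoder" abstraction. other_stuff here is an L1 complexity
# penalty; conditional-independence terms would slot into the same
# place in the loss.

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))          # observed high-dimensional data
W = rng.normal(size=(10, 3)) * 0.1      # encoder: 10 dims -> 3-dim abstraction
V = rng.normal(size=(3, 10)) * 0.1      # decoder

def loss(X, W, V, lam=0.01):
    Z = X @ W                 # run data through the abstraction
    X_hat = Z @ V             # attempt to recreate the observed data
    reconstruction_error = np.mean((X - X_hat) ** 2)
    other_stuff = lam * (np.abs(W).sum() + np.abs(V).sum())
    return reconstruction_error + other_stuff

print(loss(X, W, V))
```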
Similarly, you speak of "a system which provably learns all learnable abstractions", but I'm not exactly sure what this would look like, seeing as how for pretty much any abstraction, I expect you can add a bit of junk code that marginally decreases the reconstruction error by overfitting some aspect of your training set. Or even junk code that never gets run / other functional equivalences.
The right question in my mind is how much info at a distance you can get for how many additional dimensions. There will probably be some number of dimensions N such that giving your system more than N dimensions to play with for its representation will bring diminishing returns. However, that doesn't mean the returns will go to 0, e.g. even after you have enough dimensions to implement the ideal gas law, you can probably gain a bit more predictive power by checking for wind currents in your box. See the elbow method (though, the existence of elbows isn't guaranteed a priori).
(I also think that an algorithm to "provably learn all learnable abstractions", if practical, is a hop and a skip away from a superintelligent AGI. Much of the work of science is learning the correct abstractions from data, and this algorithm sounds a lot like an uberscientist.)
Anyway, in terms of investigating convergence, I'd encourage you to think about the inductive biases induced by both your loss function and also your learning algorithm. (We already know that learning algorithms can have different inductive biases than humans, e.g. it seems that the input-output surfaces for deep neural nets aren't as biased towards smoothness as human perceptual systems, and this allows for adversarial perturbations.) You might end up proving a theorem which has required preconditions related to the loss function and/or the algorithm's inductive bias.
Another riff on this bit:
This is the natural abstraction hypothesis in action: across the sciences, we find that low-dimensional summaries of high-dimensional systems suffice for broad classes of “far-away” predictions, like the speed of a sled.
Maybe we could differentiate between the 'useful abstraction hypothesis' and the stronger 'unique abstraction hypothesis'. This statement supports the 'useful abstraction hypothesis', but the 'unique abstraction hypothesis' is the one where alignment becomes way easier because we and our AGI are using the same abstractions. (Even though I'm only a believer in the useful abstraction hypothesis, I'm still optimistic, because I tend to think we can have our AGI cast a net wide enough to capture enough useful abstractions that ours are in there somewhere, while the number of captured abstractions stays manageable enough that we can find the right ones within that net -- or something vaguely like that.) In terms of science, the 'unique abstraction hypothesis' doesn't just say scientific theories can be useful, it also says there is only one 'natural' scientific theory for any given phenomenon, and the existence of competing scientific schools sorta seems to disprove this.
Anyway, the aspect of your project that I'm most optimistic about is this one:
This raises another algorithmic problem: how do we efficiently check whether a cognitive system has learned particular abstractions? Again, this doesn’t need to be fully general or arbitrarily precise. It just needs to be general enough to use as a tool for the next step.
Since I don't believe in the "unique abstraction hypothesis", checking whether a given abstraction corresponds to a human one seems important to me. The problem seems tractable, and a method that's abstract enough to work across a variety of different learning algorithms/architectures (including stuff that might get invented in the future) could be really useful.
We present a useful toy environment for reasoning about deceptive alignment. In this environment, there is a button. Agents have two actions: to press the button or to refrain. If the agent presses the button, they get +1 reward for this episode and -10 reward next episode. One might note a similarity with the traditional marshmallow test of delayed gratification.
Are you sure that "episode" is the word you're looking for here?
I'm especially confused because you switched to using the word "timestep" later?
Having an action which modifies the reward on a subsequent episode seems very weird. I don't even see it as being the same agent across different episodes.
Suppose instead of one button, there are two. One is labeled "STOP," and if pressed, it would end the environment but give the agent +1 reward. The other is labeled "DEFERENCE" and, if pressed, gives the previous episode's agent +10 reward but costs -1 reward for the current agent.
Suppose that an agent finds itself existing. What should it do? It might reason that since it knows it already exists, it should press the STOP button and get +1 utility. However, it might be being simulated by its past self to determine if it is allowed to exist. If this is the case, it presses the DEFERENCE button, giving its past self +10 utility and increasing the chance of its existence. This agent has been counterfactually mugged into deferring.
I think as a practical matter, the result depends entirely on the method you're using to solve the MDP and the rewards that your simulation delivers.
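For concreteness, here's one way the two-button variant could be coded up (this is my own guess at the intended dynamics, not the original poster's implementation):

```python
# Sketch of the two-button environment. Each episode is one agent; the
# DEFERENCE button pays the *previous* episode's agent, which is what
# makes cross-episode reasoning (and the marshmallow-test flavor) arise.

def run_episodes(policies):
    """policies: list of "STOP" or "DEFERENCE" choices, one per episode."""
    rewards = [0.0] * len(policies)
    for ep, action in enumerate(policies):
        if action == "STOP":
            rewards[ep] += 1.0
        elif action == "DEFERENCE":
            rewards[ep] -= 1.0
            if ep > 0:
                rewards[ep - 1] += 10.0  # credit flows backward in time
        # either way, the episode ends after one button press
    return rewards

print(run_episodes(["STOP", "STOP"]))       # -> [1.0, 1.0]
print(run_episodes(["STOP", "DEFERENCE"]))  # -> [11.0, -1.0]
```

Written this way, it's clear the "result" depends on how the solver attributes the backward-flowing reward, which is the practical point above.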
...When we can state code that would solve the problem given a hypercomputer, we have become less confused. Once we have the unbounded solution we understand, in some basic sense, the kind of work we are trying to perform, and then we can try to figure out how to do it efficiently.
ASHLEY: Which may well require new insights into the structure of the problem, or even a conceptual revolution in how we imagine the work we're trying to do.
I'm not convinced your chess example, where the practical solution resembles the hypercomputer one, is representative. One way to sort a list using a hypercomputer is to try every possible permutation of the list until we discover one which is sorted. I tend to see Solomonoff induction as being cartoonishly wasteful in a similar way.
Like, maybe depending on the viewer history, the best video to polarize the person is different, and the algorithm could learn that. If you follow that line of reasoning, the system starts to make better and better models of human behavior and how to influence them, without having to "jump out of the system" as you say.
...there's a lot of content on YouTube about YouTube, so it could become "self-aware" in the sense of understanding the system in which it is embedded.
I think it might be useful to distinguish between being aware of oneself in a literal sense, and the term "self-aware" as it is used colloquially / the connotations the term sneaks in.
Some animals, if put in front of a mirror, will understand that there is some kind of moving animalish thing in front of them. The ones that pass the mirror test are the ones that realize that moving animalish thing is them.
There is a lot of content on YouTube about YouTube, so the system will likely become aware of itself in a literal sense. That's not the same as our colloquial notion of "self-awareness".
IMO, it'd be useful to understand the circumstances under which the first one leads to the second one.
My guess is that it works something like this. In order to survive and reproduce, evolution has endowed most animals with an inborn sense of self, to achieve self-preservation. (This sense of self isn't necessary for cognition--if you trip on psychedelics and experience ego death, your brain can still think. Occasionally people will hurt themselves in this state since their self-preservation instincts aren't functioning as normal.)
Colloquial "self-awareness" occurs when an animal looking in the mirror realizes that the thing in the mirror and its inborn sense of self are actually the same thing. Similar to Benjamin Franklin realizing that lightning and electricity are actually the same thing.
If this story is correct, we need not worry much about the average ML system developing "self-awareness" in the colloquial sense, since we aren't planning to endow it with an inborn sense of self.
That doesn't necessarily mean I think Predict-O-Matic is totally safe. See this post I wrote for instance.
I'll respond to the "Predict hypothetical sensors" section in this comment.
First, I want to mention that predicting hypothetical sensors seems likely to fail in fairly obvious ways, e.g. you request a prediction about a sensor that's physically nonexistent and the system responds with a bunch of static or something. Note the contrast with the "human simulator" failure mode, which is much less obvious.
But I also think we can train the system to predict hypothetical sensors in a way that's really useful. As in my previous comment, I'll work from the assumptions (fairly weak IMO) that
We can control the data our systems get.
We are capable of doing regular old supervised learning -- possibly in conjunction with transfer learning that gives the system generic prior knowledge like the meaning of English words, but not specific prior knowledge like details of our situation (unless we want that). Our supervised learning finds a function which maps training examples in X to labels in Y (labels may or may not correspond to "reality").
In particular, these assumptions imply that our system doesn't necessarily need to know whether a sensor it's trying to predict exists physically (or if it would be physically possible to build).
But what if over the course of its operation, the system accidentally learns that a sensor of interest doesn't exist? E.g. because it points a sensor that does exist in the direction of the one that doesn't exist, and there's nothing present. Ideally we could be reasonably confident of good "readings" from the nonexistent sensor even past that point.
To achieve this, we could make use of the "regular old supervised learning" assumption and construct a doctored dataset as follows:
1. Place camera S1 in the center of the vault, turn it on, and have it noisily rotate 360 degrees to capture a panorama.
2. Mount audio sensor S2 on the interior wall of the vault, turn both sensors on, and have S1 do the same panorama.
3. Create a supervised learning problem to predict the readings of S2 gathered in step 2, given S1's panorama from step 1.
If we have prediction problems like this in our dataset, hypotheses which predict that S2 continues behaving as normal even if physically absent will achieve a lower loss, and should be preferentially chosen.
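A sketch of the dataset-construction logic (all data structures here are hypothetical placeholders for real sensor streams):

```python
# Sketch: build a supervised example pairing S1's panorama (taken while
# S2 was absent) with S2's readings (recorded after S2 was mounted).
# A hypothesis that conditions "S2 works" on S2 being visible in S1's
# panorama will mispredict these examples and incur extra loss.

def make_absent_sensor_example(panorama_without_s2, s2_readings):
    return {"input": panorama_without_s2, "label": s2_readings}

dataset = [
    make_absent_sensor_example(
        panorama_without_s2=["wall", "door", "wall", "vault"],  # no S2 in view
        s2_readings=["footsteps", "silence", "footsteps"],
    )
]

def loss_on(dataset, hypothesis):
    """0/1 loss: does the hypothesis reproduce the labeled S2 readings?"""
    return sum(hypothesis(ex["input"]) != ex["label"] for ex in dataset)

# "S2 must be physically present to produce readings" mispredicts:
absent_means_static = lambda pano: ["static"] * 3
# "S2 behaves as normal regardless" predicts correctly here:
normal_operation = lambda pano: ["footsteps", "silence", "footsteps"]

print(loss_on(dataset, absent_means_static))  # -> 1
print(loss_on(dataset, normal_operation))     # -> 0
```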
A similar method lets us preferentially choose hypotheses that predict that S2 will emit a special "TAMPERING ACTIVE" signal during tampering, then work as normal once tampering is complete:
1. Place audio sensor S2 in the corner of the vault and turn it on.
2. Move a robotic arm right next to S2.
3. Have the arm tamper with S2 so it no longer hears the sound of footfalls.
4. Discard S2 and replace it with a fresh new sensor, S2', that has identical factory specifications.
5. Have someone walk beneath S2' in full view of camera sensor S1.
6. Splice together a composite audio sensor reading with the following sequence: S2 readings from steps 1 and 2, a "TAMPERING ACTIVE" signal for the duration of step 3, nothing for step 4, and the S2' readings from step 5.
7. Create a supervised learning problem to predict the composite reading from step 6, given the movements of the robot arm from steps 2 and 3, plus the visual readings from S1 with step 4 snipped out.
The "S2 is vulnerable to tampering" hypothesis predicts that footfalls should not be present in the audio. But since they are present in the audio, it will achieve a higher loss. Failure to predict the emission of the "TAMPERING ACTIVE" signal also gives this hypothesis a higher loss.
It might also be possible to do away with the "TAMPERING ACTIVE" signal somehow, e.g. by making S2 and S2' identical except that the tampering which works on S2 doesn't work on S2', then splicing together readings from the robot arm doing the exact same tamper sequence on both sensors.
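The splicing and labeling for this tampering procedure might look something like the following (schematic only; real readings would be waveforms, not strings, and the segment boundaries would come from recorded timestamps):

```python
# Schematic of the composite audio label: S2's real readings up through
# the arm's approach, a TAMPERING ACTIVE marker while the arm tampers,
# a gap during the sensor swap, then the fresh sensor S2's replacement
# (S2') readings -- which DO contain footfalls.

def splice_composite(s2_pre, tamper_duration, swap_duration, s2_prime_post):
    return (
        s2_pre
        + ["TAMPERING ACTIVE"] * tamper_duration
        + [None] * swap_duration        # nothing during the swap
        + s2_prime_post
    )

composite = splice_composite(
    s2_pre=["silence", "arm noise"],
    tamper_duration=2,
    swap_duration=1,
    s2_prime_post=["footsteps"],
)
print(composite)
# A "tampering permanently silences S2" hypothesis predicts no footsteps
# at the end, so it mispredicts this label and takes a higher loss.
```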
Finally, here are a few ideas for preferentially choosing hypotheses that predict accurate readings from technologically impossible sensors.
Supersensor Idea #1
1. Write a natural language description of a "telepathic sensor". Describe it as being mounted right above a camera sensor S1.
2. Place boxes X, Y, and Z within the vault beneath the watchful eye of S1.
3. Turn on S1. Have a person enter the vault, visibly ponder for a moment, and open box Y.
4. Create a stream of "telepathic sensor" readings. Find a timestamp in the video where the person is pondering and hasn't yet made a movement. Add a reading from the "telepathic sensor" at that timestamp: "The person intends to open box Y".
Supersensor Idea #2
1. Write a natural language description of an "x-ray vision" sensor which can be pointed in various directions. Describe it as being mounted right above a camera sensor S1.
2. Place a deck of cards on a table beneath the watchful eye of S1.
3. Have a person enter the vault, shuffle the deck beneath the table, show a 9 of hearts to S1, and place it face down on top of the deck.
4. Create a stream of "x-ray vision" sensor readings. Specify that the sensor is initially pointed away, but once the action finishes, the sensor turns towards the table and registers: "The top card of the deck is a 9 of hearts."
Supersensor Idea #3
1. Write a natural language description of a "nanobot sensor" which beeps whenever a nanobot is present within 10 feet.
2. Run an action sequence which builds nanobots and moves them within 10 feet of where the "nanobot sensor" is mounted.
3. Create a stream of "nanobot sensor" readings. Specify that the sensor starts beeping as soon as the nanobots are nearby.
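One way to represent these hand-authored supersensor streams as training data (the formats below are entirely made up, just to show the shape of the supervised problem):

```python
# Sketch: a "telepathic sensor" label stream aligned to video timestamps.
# The supervised problem maps (natural-language sensor spec, S1 video)
# to this hand-written reading stream.

def telepathic_readings(ponder_timestamp):
    """Empty except for one hand-written reading during the pondering."""
    return [(ponder_timestamp, "The person intends to open box Y")]

example = {
    "sensor_spec": "telepathic sensor mounted right above camera S1",
    "s1_video": "person enters, ponders at t=4.2s, opens box Y",
    "label": telepathic_readings(4.2),
}
print(example["label"])  # -> [(4.2, 'The person intends to open box Y')]
```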
A few more thoughts on translating natural language into sensor readings:
Excise a real sensor type from the system's prior knowledge. Ask your system to predict sensor data from a physical instance of this sensor, given a natural language description of its workings plus other readings from the environment. (H/T Romeo Stevens)
Make a weird modified sensor (e.g. a camera sensor which provides an upside down picture). Ask your system to predict readings from the modified sensor, given a natural language description of its modifications plus other readings from the environment.
Anyway, I'm not sure we need to reason about a physically impossible counterfactual or condition on the sensor existing physically. It seems perfectly coherent to ask "what is the answer to this thought experiment" rather than "if this sensor existed, what would it see"? For example, instead of the question "what would an Acme Corp camera mounted here see", consider the question "if the light which passes through a pinhole at these coordinates intersected with a plane at these other coordinates, and the intersections were digitized and formatted the same way Acme Corp cameras format photos, what would be the resulting binary file?"
Humans don't seem to have a lot of trouble performing thought experiments. If the system tries to fit the data with a hypothesis that references existing pretrained conceptual understanding, as I described above, that could give the system an inductive bias towards performing human-like thought experiments. This could be bad if human thought experiments are vulnerable to human deficiencies. It could also be good if we'd like the AI's hypothetical sensors to behave in the same intuitive way our thought experiments do.
One possible concern is hypotheses which reference dataset doctoring. Obviously one could try to excise knowledge of that possibility. Another quick idea is to try & train a classifier to differentiate doctored vs non-doctored SmartVault sequences, and keep improving our fakes until the classifier can't easily tell the difference? Or try to avoid any sort of branching so the system always acts like it's dealing with a doctored dataset when in production? Could even fuzz the live data stream in a way that makes it appear doctored ;-) Finally, to get a sense for the cognitive signature of a doctoring-related hypothesis, one could train the system to solve some problems where the only way to minimize the loss is to think a lot about doctoring. Maybe a classifier which aims to detect the presence of doctoring-related cognition could be useful here.
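The "keep improving our fakes until the classifier can't tell" idea is essentially an adversarial training loop. A sketch (every component here is a hypothetical stand-in; a real version would use an actual classifier and real SmartVault sequences):

```python
# Sketch of the fake-improvement loop: train a classifier to distinguish
# doctored vs. real SmartVault sequences, then refine the fakes until
# the classifier is near chance accuracy.

def train_classifier(real, fake):
    """Stand-in: returns the accuracy of the best doctored-vs-real
    classifier. Here, fakes tagged "obvious" are assumed detectable."""
    detectable = sum(1 for f in fake if "obvious" in f)
    return 0.5 + 0.5 * detectable / len(fake)

def refine(fake):
    """Stand-in for improving the fakes using classifier feedback."""
    return [f.replace("obvious ", "") for f in fake]

real = ["seq A", "seq B"]
fake = ["obvious seq C", "seq D"]
while train_classifier(real, fake) > 0.55:  # stop near chance (0.5)
    fake = refine(fake)
print(fake)  # -> ['seq C', 'seq D']
```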
Another possibility is an alternative hypothesis along the lines of "predict what the operator would want me to predict" -- unclear if that's desirable?