I don't want to get super hung up on this because it's not about anything Yudkowsky has said but:
Consider the whole transformed line of reasoning:
avian flight comes from a lot of factors; you can't just ape one of the factors and expect the rest to follow; to get an entity which flies, that entity must be as close to a bird as birds are to each other.
IMO this is not a faithful transformation of the line of reasoning you attribute to Yudkowsky, which was:
...human intelligence/alignment comes from a lot of factors; you can't just ape one of the factors an
This is a valid point, and that's not what I'm critiquing. I'm critiquing how he confidently dismisses ANNs
I guess I read that as talking about the fact that at the time ANNs did not in fact really work. I agree he failed to predict that would change, but that doesn't strike me as a damning failure of prediction.
Matters would be different if he said in the quotes you cite "you only get these human-like properties by very exactly mimicking the human brain", but he doesn't.
Didn't he? He at least confidently rules out a very large class of modern approaches.
Co...
This comment doesn't really engage much with your post - there's a lot there and I thought I'd pick one point to get a somewhat substantive disagreement. But I ended up finding this question and thought that I should answer it.
But have you ever, even once in your life, thought anything remotely like "I really like being able to predict the near-future content of my visual field. I should just sit in a dark room to maximize my visual cortex's predictive accuracy."?
I think I've been in situations where I've been disoriented by a bunch of random stuff happening and wished that less of it was happening so that I could get a better handle on what was going on. An example I vividly recall was being in a history class in high school and being very bothered by the large number of conversations happening around me.
I don't really get your comment. Here are some things I don't get:
I no longer endorse this claim about what the orthogonality thesis says.
But given that good, automated mechanistic hypothesis generation seems to be the only hope for scalable MI, it may be time for TAISIC to work on this in earnest. Because of this, I would argue that automating the generation of mechanistic hypotheses is the only type of MI work TAISIC should prioritize at this point in time.
"Automating" seems like a slightly too high bar here, given how useful human thoughts are for things. IMO, a better frame is that we have various techniques for combining human labour and algorithmic computation to generate hypothes...
I also want to add that I really like the use of the prediction UI in this post.
Ah, you make this point here:
However, in practical applications some optimization pressure against the helper will still leak in, for example if the helper is used to decide whether to abort a training run or change its hyperparameters.
...Imitation learning methods seem less risky, as the optimization pressure is simply to match the empirical distribution of a demonstration dataset. The closest to “reward hacking” in this setting would be overfitting to the dataset, a relatively benign failure mode. There is still some risk of inner optimization objectives arising, which could then be adversarial to other systems (e.g. attempt to hide themselves from transparency tools), but comparatively speaking this is one of the methods with the lowest risk of adversarial failure. [Bolding by DanielFilan]
Conversely, it is less likely to happen if the optimization process is pursuing a goal that is largely independent from the helper’s output, such as if the helper is an interpretability tool checking for the presence of mesa-optimization.
Maybe I'm missing something obvious here, but for advanced AI systems under monitoring that happen to be misaligned, won't checking for the presence of mesa-optimization be relevant to the main system's goal, in cases where the results of those checks matter for whether to re-train or deploy the AI?
Note that the model does not have black or white stones as a concept, and instead only thinks of the stones as “own’s stones” and “opponent’s stones”, so we can do this without loss of generality.
I'm confused how this can be true - surely the model needs to know which player is black and which player is white to know how to incorporate komi, right?
You can now watch a short video of an excerpt from this episode (an axrpt?)!
Am I right that this algorithm is going to visit each "important" node once per path from that node to the output? If so, that could be pretty slow given a densely-connected interpretation, right?
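(To gesture at why I'd worry: here's a toy path-counting calculation of my own, not taken from the post. In a fully-connected layered graph, the number of node-to-output paths multiplies at every layer.)

```python
def num_paths_to_output(layer_widths):
    """Count paths from one node in the first layer to one output node,
    assuming every node connects to every node in the next layer."""
    paths = 1
    for width in layer_widths[1:-1]:  # choose one node per intermediate layer
        paths *= width
    return paths

# Ten hidden layers of width 100 already give 100**10 = 1e20 paths.
print(num_paths_to_output([1] + [100] * 10 + [1]))
```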
Empirically, human toddlers are able to recognize apples by sight after seeing maybe one to three examples. (Source: people with kids.)
Wait but they see a ton of images that they aren't told contain apples, right? Surely that should count. (Probably not 2^big_number bits tho)
As I understand it, the EA forum sometimes idiosyncratically calls this philosophy [rule consequentialism] "integrity for consequentialists", though I prefer the more standard term.
AFAICT in the canonical post on this topic, the author does not mean "pick rules that have good consequences when I follow them" or "pick rules that have good consequences when everyone follows them", but rather "pick actions such that if people knew I was going to pick those actions, that would have good consequences" (with some unspecified tweaks to cover places where that ...
In reality though, I think people often just believe stuff because people nearby them believe that stuff
IMO, a bigger factor is probably people thinking about topics that people nearby them think about, and having the primary factors that influence their thoughts be the ones people nearby focus on.
One reason that I doubt this story is that "try new things in case they're good" is itself the sort of thing that should be reinforced during training on a complicated environment, and would push towards some sort of obfuscated manipulation of humans (similar to how if you read about enough social hacks you'll probably be a bit scammy even tho you like people and don't want to scam them). In general, this motivation will push RL agents towards reward-optimal behaviour on the distribution of states they know how to reach and handle.
Actually I'm being silly, you don't need ring signatures, just signatures that are associated with identities and also used for financial transfers.
Note that for this to work you need a strong disincentive against people sharing their private keys. One way to do this would be if the keys were also used for the purpose of holding cryptocurrency.
Here's one way you can do it: Suppose we're doing public key cryptography, and every person is associated with one public key. Then when you write things online you could use a linkable ring signature. That means that you prove that you're using a private key that corresponds to one of the known public keys, and you also produce a hash of your keypair, such that (a) the world can tell you're one of the known public keys but not which public key you are, and (b) the world can tell that the key hash you used corresponds to the public key you 'committed' to when writing the proof.
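In case it helps, here's a minimal interface sketch of the shape I have in mind (hypothetical names, no actual cryptography; the zero-knowledge membership proof itself is elided):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class LinkableRingSignature:
    ring: List[str]   # all known public keys (the "ring")
    link_tag: str     # deterministic commitment to the signer's keypair
    proof: bytes      # shows the signer holds *some* private key in the ring

def verify(message: bytes, sig: LinkableRingSignature) -> bool:
    """Property (a): check that whoever signed `message` holds a private key
    matching one of the public keys in sig.ring, without learning which one.
    (The actual proof-checking is omitted in this sketch.)"""
    ...

def same_author(sig_1: LinkableRingSignature, sig_2: LinkableRingSignature) -> bool:
    """Property (b): the proof binds link_tag to the signer's actual keypair,
    so equal tags across messages mean the same (still anonymous) author."""
    return sig_1.link_tag == sig_2.link_tag
```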
Relevant quote I just found in the paper "Revisiting the Arcade Learning Environment: Evaluation Protocols and Open Problems for General Agents":
...The primary measure of an agent’s performance is the score achieved during an episode, namely the undiscounted sum of rewards for that episode. While this performance measure is quite natural, it is important to realize that score, in and of itself, is not necessarily an indicator of AI progress. In some games, agents can maximize their score by “getting stuck” in a loop of “small” rewards, ignoring what human p
Here's a project idea that I wish someone would pick up (written as a shortform rather than as a post because that's much easier for me):
Here is an example story I wrote (that has been minorly edited by TurnTrout) about how an agent trained by RL could plausibly not optimize reward, forsaking actions that it knew during training would get it high reward. I found it useful as a way to understand his views, and he has signed off on it. Just to be clear, this is not his proposal for why everything is fine, nor is it necessarily an accurate representation of my views, just a plausible-to-TurnTrout story for how agents won't end up wanting to game human approval:
dopamine or RPE or that-which-gets-discounted-and-summed-to-produce-the-return
Those are three pretty different things - the first is a chemical, the second I guess stands for 'reward prediction error', and the third is a mathematical quantity! Like, you also can't talk about the expected sum of dopamine, because dopamine is a chemical, not a number!
Here's how I interpret the paper: stuff in the world is associated with 'rewards', which are real numbers that represent how good the stuff is. Then the 'return' of some period of time is the discounted sum o...
(see also this shortform, which makes a rudimentary version of the arguments in the first two subsections)
Here's my general view on this topic:
I'm not saying "These statements can make sense", I'm saying they do make sense and are correct under their most plain reading.
Re: a possible goal of animals being to optimize the expected sum of future rewards, in the cited paper "rewards" appears to refer to stuff like eating tasty food or mating, where it's assumed the animal can trade those off against each other consistently:
...Decision-making environments are characterized by a few key concepts: a state space..., a set of actions..., and affectively important outcomes (finding cheese, obtaining water,
I think the quotes cited under "The field of RL thinks reward=optimization target" are all correct. One by one:
The agent's job is to find a policy… that maximizes some long-run measure of reinforcement.
Yes, that is the agent's job in RL, in the sense that if the training algorithm didn't do that we'd get another training algorithm (if we thought it was feasible for another algorithm to maximize reward). Basically, the field of RL uses a separation of concerns, where they design a reward function to incentivize good behaviour, and the agent maximizes th...
It looks like this is the 4th post in a sequence - any chance you can link to the earlier posts? (Or perhaps use LW's sequence feature)
I have no idea why I responded 'low' to 2. Does anybody think that's reasonable and fits in with what I wrote here, or did I just mean high?
The method that is normally used for this in the biological literature (including the Kashtan & Alon paper mentioned above), and in papers by e.g. CHAI dealing with identifying modularity in deep modern networks, is taken from graph theory. It involves the measure Q, which is defined as follows:
FWIW I do not use this measure in my papers, but instead use a different graph-theoretic measure. (I also get the sense that Q is more of a network theory thing than a graph theory thing)
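(For reference, the standard Newman-style modularity measure, which I believe is at least close to the Q being referred to, though I haven't checked the exact variant Kashtan & Alon use, is:

```latex
Q = \frac{1}{2m} \sum_{ij} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)
```

where A is the adjacency matrix, k_i is the degree of node i, m is the number of edges, and δ(c_i, c_j) is 1 when nodes i and j are assigned to the same module and 0 otherwise.)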
I think it's more concerning in cases where you're getting all of your info from goal-oriented behaviour and solving the inverse planning problem
It's also not super clear what you algorithmically do instead - words are kind of vague, and trajectory comparisons depend crucially on getting the right info about the trajectory, which is hard, as per the ELK document.
One objection: an assistive agent doesn’t let you turn it off, how could that be what we want? This just seems totally fine to me — if a toddler in a fit of anger wishes that its parents were dead, I don’t think the maximally-toddler-aligned parents would then commit suicide, that just seems obviously bad for the toddler.
I think this is way more worrying in the case where you're implementing an assistance game solver, where this lack of off-switchability means your margins for safety are much narrower.
...Though [the claim that slightly wrong observation
A future episode might include a brief distillation of that episode ;)
But wait, there can only be so many low-complexity universes, and if they're launching successful attacks, said attacks would be distributed amongst a far far far larger population of more-complex universes.
Can't you just condition on the input stream to affect all the more-complex universes, rather than targeting a single universe? Specifically: look at the input channel, run basically-Solomonoff-induction yourself, then figure out which universe you're being fed inputs of and pick outputs appropriately. You can't be incredibly powerful this way, sinc...
That said, this sequence is tricky to understand and I'm bad at it! I look forward to brave souls helping to digest it for the community at large.
I interviewed Vanessa here in an attempt to make this more digestible: it hopefully acts as context for the sequence, rather than a replacement for reading it.
One thing Carl notes is that a variety of areas where AI could contribute a lot to the economy are currently pretty unregulated. But I think there's a not-crazy story where once you are within striking range of making an area way more efficient with computers, the regulation hits. I'm not sure how to evaluate how right that is (e.g. I don't think it's the story of housing regulation), but just wanted it said.
Ah, that makes sense - thanks!
Anders Sandberg could tell us what fraction of the reachable universe is being lost per minute, which would tell us how much more surety it would need to expect to gain by waiting another minute before acting.
From Ord (2021):
Each year the affectable universe will only shrink in volume by about one part in 5 billion.
So, since there are about 5e5 minutes in a year, you lose about 1 part in 5e5 * 5e9 ≈ 3e15 every minute.
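Quick arithmetic check, in case I've slipped a factor somewhere (just unit conversion, nothing further from the paper):

```python
minutes_per_year = 365.25 * 24 * 60       # ~5.3e5
yearly_loss = 1 / 5e9                     # Ord: one part in 5 billion per year
per_minute_loss = yearly_loss / minutes_per_year
print(per_minute_loss)                    # ~3.8e-16, i.e. about 1 part in 2.6e15
```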
Also: what is a diamondoid bacterium?
Then, in my lower-bound concretely-visualized strategy for how I would do it, the AI either proliferates or activates already-proliferated tiny diamondoid bacteria and everybody immediately falls over dead during the same 1-second period
Dumb question: how do you get some substance into every human's body within the same 1 second period? Aren't a bunch of people e.g. in the middle of some national park, away from convenient air vents? Is the substance somehow everywhere in the atmosphere all at once?
(I wouldn't normally ask these sorts of questions since...
Expected return in a particular environment/distribution? Or not? If not, then you may be in a deployment context where you aren't updating the weights anymore and so there is no expected return
I think you might be misunderstanding this? My take is that "return" is just the discounted sum of future rewards, which you can (in an idealized setting) think of as a mathematical function of the future trajectory of the system. So it's still well-defined even when you aren't updating weights.
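To make that concrete, here's a minimal sketch (my own toy illustration, not tied to any particular training setup): given a fixed discount factor and reward sequence, the return is just a number, and nothing about computing it involves updating weights.

```python
def discounted_return(rewards, gamma=0.99):
    """Return G = sum_k gamma**k * r_k for a fixed reward sequence.
    Nothing here depends on whether any learning is still happening."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

print(discounted_return([0.0, 1.0, 0.0, 5.0]))  # 0.99 + 5 * 0.99**3 ≈ 5.84
```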
I continue to think that the Risks from Learned Optimization terminology is really good, for the specific case that it's talking about. The problem is just that it's not general enough to handle all possible ways of training a model using machine learning.
GPT-3 was trained using self-supervised learning (next-token prediction), which I would have thought was a pretty standard way of training a model using machine learning. What training scenarios do you think the Risks from Learned Optimization terminology can handle, and what's the difference between those and the way GPT-3 was trained?
What changed your mind about Chaitin's constant?
It's true! Altho I think of putting something up on arXiv as a somewhat lower bar than 'publication' - that paper has a bit of work left.
I really like the art!
OK I think this is a typo, from the proof of prop 10 where you deal with condition 5:
Thus .
I think this should be .
From def 16:
... if for all
Should I take this to mean "if for all and "?
[EDIT: no, I shouldn't, since and are both subsets of ]
To tie up this thread: I started writing a more substantive response to a section but it took a while and was difficult and I then got invited to dinner, so probably won't get around to actually writing it.