I see some value in the framing of "general intelligence" as a binary property, but it also doesn't quite feel as though it fully captures the phenomenon. Like, it would seem rather strange to describe GPT-4 as being a 0 on the general intelligence scale.
I think maybe a better analogy would be to consider the sum of a geometric sequence.
Consider the sum for a few values of r as r increases at a steady rate (the closed form behind these numbers is given just below the list).
• r = 0.5 → 2a
• r = 0.6 → 2.5a
• r = 0.7 → ≈ 3.3a
• r = 0.8 → 5a
• r = 0.9 → 10a
• r = 1 → diverges to infinity
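These values come from the standard closed form for the sum of an infinite geometric series with first term a and common ratio r:

```latex
S(r) \;=\; \sum_{n=0}^{\infty} a r^{n} \;=\; \frac{a}{1-r}, \qquad 0 \le r < 1 .
```

For example, S(0.9) = a/0.1 = 10a, and as r → 1 the denominator goes to 0, so the sum diverges.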
What we see then is quite significant returns to increases in r and then a sudden d...
So I've thought about this argument a bit more and concluded that you are correct, but also that there's a potential fix to get around this objection.
I think that it's quite plausible that an agent will have an understanding of its decision mechanism that a) lets it know it will take the same action in both counterfactuals, but b) won't tell it what action it will take in this counterfactual before it makes the decision.
And in that case, I think it makes sense to conclude that Omega's prediction depends on your action, such that paying gives you the $10,000...
Thanks for your response. There's a lot of good material here, although some of these components, like modules or language, seem less central to agency, at least from my perspective. I guess you might see these as appearing slightly further down the stack?
Summary: John describes the problems of inner and outer alignment. He also describes the concept of True Names - mathematical formalisations that hold up under optimisation pressure. He suggests that having a "True Name" for optimizers would be useful if we wanted to inspect a trained system for an inner optimiser and not risk missing something.
He further suggests that the concept of agency breaks down into lower-level components like "optimisation", "goals", "world models", etc. It would be possible to make further arguments about how these lower-level concepts are important for AI safety.
This might be worth a shot, although it's not immediately clear that having such powerful maths provers would accelerate alignment more than capabilities. That said, I have previously wondered myself whether there is a need to solve embedded agency problems or whether we can just delegate that to a future AGI.
Oh wow, it's fascinating to see someone actually investigating this proposal. (I had a similar idea, but only posted it in the EA meme group).
Thanks for the extra detail!
(Actually, I was reading a post by Mark Xu which seems to suggest that the TradingAlgorithms have access to the price history rather than the update history as I suggested above)
My understanding after reading this is that TradingAlgorithms generate a new trading policy after each timestep (possibly with access to the update history, but I'm unsure). Is this correct? If so, it might be worth clarifying this, even though it becomes clearer later on.
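Here's a minimal sketch of my current reading, purely as a sanity check — the names and types are mine rather than the paper's, and the toy policy is arbitrary:

```python
# Sketch of my reading (hypothetical names): on each day n, a trading algorithm
# maps the price history so far to a trading policy, which in turn maps the
# current day's prices to the trades it wants to make.
from typing import Callable, Dict, List

Prices = Dict[str, float]         # price of each sentence/share on a given day
Trades = Dict[str, float]         # how much of each share to buy (negative = sell)
TradingPolicy = Callable[[Prices], Trades]

def trading_algorithm(price_history: List[Prices]) -> TradingPolicy:
    """Given the history up to (but not including) day n, return day n's policy."""
    def policy(todays_prices: Prices) -> Trades:
        if not price_history:
            return {}
        yesterday = price_history[-1]
        # toy rule: buy a unit of anything that got cheaper since yesterday
        return {s: 1.0 for s, p in todays_prices.items()
                if s in yesterday and p < yesterday[s]}
    return policy
```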
Interesting, I think this clarifies things, but the framing also isn't quite as neat as I'd like.
I'd be tempted to redefine/reframe this as follows:
• Outer alignment for a simulator - Perfectly defining what it means to simulate a character. For example, how can we create a specification language so that we can pick out the character that we want? And what do we do with counterfactuals, given that they aren't literally real?
• Inner alignment for a simulator - Training a simulator to perfectly simulate the assigned character
• Outer alignment for characters - fi...
I thought this was a really important point, although I might be biased because I was finding it confusing how some discussions were talking about the gradient landscape as though it could be modified and not clarifying the source of this (for example, whether they were discussing reinforcement learning).
...First off, the base loss landscape of the entire model is a function that's the same across all training steps, and the configuration of the weights selects somewhere on this loss landscape. Configuring the weights differently can put the mod
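A toy illustration of the quoted point (my own sketch, not from the post): the loss function is one fixed object across all of training, and gradient descent only changes which point on that landscape the weights select.

```python
# Toy example: the loss landscape L(w) never changes during training;
# only the weight configuration w (our position on the landscape) moves.
import numpy as np

def loss(w: np.ndarray) -> float:
    # the same function at every training step
    return float(np.sum((w - 3.0) ** 2))

def grad(w: np.ndarray) -> np.ndarray:
    return 2.0 * (w - 3.0)

w = np.array([0.0, 10.0])      # one particular configuration of the weights
for _ in range(100):
    w -= 0.1 * grad(w)         # gradient descent updates w, never `loss` itself
print(loss(w))                 # near 0; the landscape itself never moved
```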
In the section: "The role of naturalized induction in decision theory" a lot of variables seem to be missing.
(Evolution) → (human values) is not the only case of inner alignment failure which we know about. I have argued that human values themselves are inner alignment failures on the human reward system. This has happened billions of times in slightly different learning setups.
I expect that it has also happened to an extent with animals as well. I wonder if anyone has ever looked into this.
converge to
Converge to 1? (Context is "9. Non-Dogmatic...").
Anyway, thanks so much for writing this! I found this to be a very useful resource.
It seems strange to treat ontological crises as a subset of embedded world-models, as it seems as though a Cartesian agent could face the same issues?
UDT doesn't really counter my claim that Newcomb-like problems are problems in which we can't ignore the fact that our decision isn't independent of the state of the world at the time we make it, even though in UDT we know less. To make this clear with the example of Newcomb's problem: the policy we pick affects the prediction, which then affects the results of that policy when the decision is made. UDT isn't ignoring the fact that our decision and the state of the world are tied together, even if it represents this in a different fashion. The UDT algorithm takes ...
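To make the "policy affects prediction, which affects the payoff" loop concrete, here's a toy expected-value calculation for the classic Newcomb setup — the $1,000,000/$1,000 payoffs and 99% predictor accuracy are my own illustrative numbers, not anything from the discussion above:

```python
# Toy Newcomb payoff calculation (illustrative numbers only): the predictor
# fills the opaque box based on your POLICY, so the policy changes the state
# of the world that the later decision is made in.
ACCURACY = 0.99                    # assumed predictor accuracy

def expected_value(policy: str) -> float:
    one_box_payoff = 1_000_000     # opaque box contents if predicted to one-box
    two_box_bonus = 1_000          # the transparent box, always available
    if policy == "one-box":
        return ACCURACY * one_box_payoff
    return (1 - ACCURACY) * one_box_payoff + two_box_bonus

print(expected_value("one-box"))   # 990,000
print(expected_value("two-box"))   # 11,000
```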
Good point.
(That said, it seems like a useful check to see what the optimal policy will do. And if someone believes it won't achieve the optimal policy, it seems useful to try to understand the barrier that stops that. I don't feel quite clear on this yet.)
My initial thoughts were:
Having thought about it a bit more, I think I managed to resolve the tension. It seems that if at least one of the actions is positive utility, then the s...
Strongly agreed. I do worry that most people on LW have a bias towards formalisation even when it doesn't add very much.
I guess it depends on the specific alignment approach being taken, such as whether you're trying to build a sovereign or an assistant. Assuming the latter, I'll list some philosophical problems that seem generally relevant:
Minor correction:
But then, in the small fraction of worlds where we survive, we simulate lots and lots of copies of that AI where it instead gets reward 0 when it attempts to betray us!
The reward should be negative rather than 0.
Regarding the AI not wanting to cave to threats: there's a sense in which the AI is also (implicitly) threatening us, so that consideration might not apply. (Defining what counts as a "threat" is challenging.)
This seems wrong to me, in large part because the AI safety community and the EA community more broadly have been growing independently of increased interest in AI.
Agreed, this is one of the biggest considerations missed, in my opinion, by people who think accelerating progress was good. (TBH, if anyone was attempting to accelerate progress to reduce AI risk, I think that they were trying to be too clever by half, or just rationalising.)
I guess I would lean towards saying that once powerful AI systems exist, we'll need powerful aligned systems relatively fast in order to defend against them, otherwise we'll be screwed. In other words, AI arms race dynamics push us towards a world where systems are deployed with an insufficient amount of testing, and this provides one path for us to fall victim to an AI system that you might have expected iterative design to catch.
I would love to see you say why you consider these bad ideas. Obviously such AIs could be unaligned themselves, or is it more along the lines of these assistants needing a complete model of human values to be truly useful?
Speedup on evolution?
Maybe? It might work okayish, but I doubt the best solution is that speculative.
As in, you could score some actions, but then there isn't a sense in which you "can" choose one according to any criterion.
I've noticed that issue as well. Counterfactuals are more a convenient model/story than something to be taken literally. You've grounded decisions by taking counterfactuals to exist a priori. I ground them by noting that our desire to construct counterfactuals is ultimately based on evolved instincts and/or behaviours, so these stories aren't just arbitrary stories, but a way in which we can leverage the lessons that evolution has instilled in us. I'm curious: given this explanation, why do we still need choices to be actual?
Let A be some action. Consider the statement: "I will take action A". An agent believing this statement may falsify it by taking any action B not equal to A. Therefore, this statement does not hold as a law. It may be falsified at will.
If you believe in determinism, then an agent can sometimes falsify it and sometimes not.
I think it's quite clear how shifting ontologies could break a specification of values. And sometimes you just need a formalisation, any formalisation, to play around with. But I suppose it depends more on the specific details of your investigation.
I strongly disagree with your notion of how privileging the hypothesis works. It's not absurd to think that techniques for making AIXI-tl value diamonds despite ontological shifts could be adapted for other architectures. I agree that there are other examples of people working on solving problems within a formalisation that seem rather formalisation specific, but you seem to have cast the net too wide.
I tend to agree that burning up the timeline is highly costly, but more because Effective Altruism is an Idea Machine that has only recently started to really crank up. There's a lot of effort being directed towards recruiting top students from uni groups, but these projects require time to pay off.
...I’m giving this example not to say “everyone should go do agent-foundations-y work exclusively now!”. I think it’s a neglected set of research directions that deserves far more effort, but I’m far too pessimistic about it to want humanity to put all its egg
I guess the problem with this test is that the kinds of people who could do this tend to be busy, so they probably can't do this with so little notice.
Regarding the point about most alignment work not really addressing the core issue: I think a lot of this work could potentially be valuable nonetheless. People can take inspiration from all kinds of things, and there is often value in picking something that you can get a grasp on, then using the lessons from that to tackle something more complex. Of course, it's very easy for people to spend all of their time focusing on irrelevant toy problems and never get around to making any progress on the real problem. Plus there are costs to adding more voices to the conversation, as it can become tricky for people to distinguish the signal from the noise.
If we tell an AI not to invent nanotechnology, not to send anything to protein labs, not to hack into all of the world's computers, not to design weird new quantum particles, not to do 100 of the other most dangerous and weirdest things we can think of, and then ask it to generalize and learn not to do things of that sort
I had the exact same thought. My guess would be that Eliezer might say that, since the AI is maximising, we're screwed if the generalisation function misses even one action of this sort that we should have excluded.
I tend to value a longer timeline more than a lot of other people do. I guess I see EA and AI Safety setting up powerful idea machines that get more powerful when they are given more time to gear up. A lot more resources have been invested into EA field-building recently, but we need time for these investments to pay off. At EA London this year, I gained a sense that AI Safety movement building is only now becoming its own thing; and of course it'll take time to iterate to get it right, then time for people to pass through the programs, then time for...
How large do you expect Conjecture to become? What percentage of people do you expect to be working on the product, and what percentage on safety?
Random idea: A lot of people seem discouraged from doing anything about AI Safety because it seems like such a big overwhelming problem.
What if there was a competition to encourage people to engage in low-effort actions towards AI safety, such as hosting a dinner for people who are interested, volunteering to run a session on AI safety for their local EA group, answering a couple of questions on the stampy wiki, offering to proof-read a few people’s posts or offering a few free tutorial sessions to aspiring AI Safety Researchers.
I think there’s a dec...
Thoughts on the introduction of Goodhart's. Currently, I'm more motivated by trying to make the leaderboard, so maybe that suggests that merely introducing a leaderboard, without actually paying people, would have had much the same effect. Then again, that might just be because I'm not that far off. And if there hadn't been the payment, maybe I wouldn't have ended up in the position where I'm not that far off.
I guess I feel incentivised to post a lot more than I would otherwise, but especially in the comments rather than the posts since if you post a lot o...
If we have an algorithm that aligns an AI with X values, then we can add human values to get an AI that is aligned with human values.
On the other hand, I agree that it doesn't really make sense to declare an AI safe in the abstract, rather than with respect to, say, human values. (Small counterpoint: in order to be safe, it's not just about alignment; you also need to avoid bugs, and bug-avoidance can be defined without reference to human values. However, that alone isn't sufficient for safety.)
I suppose this works as a criticism of approaches like quantilisers or impact-minimisation which attempt abstract safety. That said, I can't see any reason why it'd imply that it's impossible to write an AI that can be aligned with arbitrary values.
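To make the parameterisation picture above concrete, here's a purely hypothetical interface sketch — the names, types, and toy scorer are mine, not from the post:

```python
# Hypothetical sketch: alignment as a procedure parameterised by a value
# specification, rather than something defined "in the abstract".
from typing import Callable, List

ValueSpec = Callable[[str], float]        # scores candidate outcomes

def align(candidate_actions: List[str], values: ValueSpec) -> str:
    """Stand-in for 'an algorithm that aligns an AI with X values':
    it picks whichever action the supplied values rank highest."""
    return max(candidate_actions, key=values)

# "Adding human values" is then just supplying the human ValueSpec:
def human_values(outcome: str) -> float:  # toy stand-in scorer
    return 1.0 if "everyone is fine" in outcome else 0.0

best = align(["everyone is fine", "paperclips everywhere"], human_values)
```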
If you think this is financially viable, then I'm fairly keen on this, especially if you provide internships and development opportunities for aspiring safety researchers.
In science and engineering, people will usually try very hard to make progress by standing on the shoulders of others. The discourse on this forum, on the other hand, more often resembles that of a bunch of crabs in a bucket.
Hmm... Yeah, I certainly don't think that there's enough collaboration or appreciation of the insights that other approaches may provide.
Any thoughts on how to encourage a healthier dynamic?
The object-level claims here seem straightforwardly true, but I think "challenges with breaking into MIRI-style research" is a misleading way to characterize it. The post makes it sound like these are problems with the pipeline for new researchers, but really these problems are all driven by challenges of the kind of research involved.
There's definitely some truth to this, but I guess I'm skeptical that there isn't anything we can do about some of these challenges. Actually, rereading, I can see that you've conceded this towards the end of your post. I...
Even if the content is proportional, the signal-to-noise ratio will still be much worse for those interested in MIRI-style research. This is a natural consequence of being a niche area.
When I said "might not have the capacity to vet", I was referring to a range of orgs.
I would be surprised if the lack of papers didn't have an effect, as presumably you're trying to highlight high-quality work, and people are more motivated to go the extra yard when trying to get published because both the rewards and standards are higher.
Just some sort of official & long-term & OFFLINE study program that would teach some of the previously published MIRI research would be hugely beneficial for growing the AF community.
Agreed.
At the last EA Global there was some sort of AI safety breakout session. There were ~12 tables with different topics. I was dismayed to discover that almost every table was full of people excitedly discussing various topics in prosaic AI alignment and other things, while the AF table had just 2 (!) people.
Wow, didn't realise it was that few!
...I have spoken with MIRI p
Unclear. Some things that might be involved:
I might add that I know a number of people interested in AF who feel somewhat adrift/find it difficult to contribute. Feels a bit like a waste of talent.
Sounds really cool. Would be useful to have some idea of the kind of time you're planning to pick so that people in other timezones can make a call about whether or not to apply.