Hillary Sanders

Machine-learner, meat-learner, research scientist, AI Safety thinker.
Model trainer, skeptical adorer of statistics. 




> There’s no escaping it: After enough backup steps, you’re traveling across the world to do cocaine.

> But obviously these conditions aren’t true in the real world.

I think they are a little? Some people do travel to other countries for easier and better drug access. And some people become total drug addicts (arguably by miscalculating their long-term reward consequences and having too high a discount rate, oops), while others do a light or medium amount of drugs longer-term.
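The discount-rate point can be made concrete with a toy calculation. This is a hypothetical illustration (the reward streams and discount factors are made up, not from any real study): a steep enough discount rate makes an immediately-rewarding but long-term-costly option look better than a steady baseline.

```python
def discounted_value(rewards, gamma):
    """Sum of gamma**t * r_t over a stream of per-step rewards."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Made-up reward streams: a big immediate hit followed by sustained
# costs, vs. a modest but steady baseline over the same horizon.
addiction = [10] + [-2] * 20
abstain = [1] * 21

# An impatient agent (gamma = 0.5) barely feels the future costs,
# so the immediate hit dominates; a patient agent (gamma = 0.95)
# weighs the long tail of costs and prefers the steady baseline.
print(discounted_value(addiction, 0.5), discounted_value(abstain, 0.5))
print(discounted_value(addiction, 0.95), discounted_value(abstain, 0.95))
```

With gamma = 0.5 the addiction stream scores higher; with gamma = 0.95 the ordering flips, which is the "too high a discount rate" failure in miniature.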

Lots of people also don't do this, but there's a huge amount of information uncertainty, outcome uncertainty, and risk associated with drugs (health-wise, addiction-wise, knowledge-wise, crime-wise, etc), so lots of fairly rational (particularly risk-averse) folks will avoid it.

Button-pressing will perhaps be seen by AIs as a socially-unacceptable, risky behavior that can lead to poor long-term outcomes, but I guess the key thing here is that you want, like, exactly zero powerful AIs to ever choose to destroy/disempower humanity in order to wirehead, instead of just a low percentage, so you need them to be particularly risk-averse.

Delicious food is perhaps a better example of wireheading in humans. In this case, it's not against the law, it's not that shunned socially, and it is ***absolutely ubiquitous***. In general, any positive chemical feeling we have in our brains (whether from drugs or cheeseburgers) can be seen as an (often "internally misaligned") instrumental goal that we are mesa-optimizing. It's just that some pathways to those feelings are a lot riskier and more uncertain than others.

And I guess this can translate to RL - an RL agent won't try everything, but if the risk is low and the expectation is high, it probably will try it. If pressing a button is easy and doesn't conflict with taking out the trash and doing other things it wants to do, it might try it. And as its generalization capabilities increase, its confidence can make this more likely, I think. So you should increasingly train agents to be more risk-averse and less willing to break specific rules and norms as their generalization capabilities increase.
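One way to sketch this "low risk, high expectation" decision rule is a mean-minus-variance score, a common (though not the only) formalism for risk sensitivity. Everything below is a toy assumption: the outcome samples, the actions, and the risk-aversion coefficients are invented for illustration.

```python
import statistics

def score(outcomes, risk_aversion):
    """Risk-sensitive action score: expected outcome minus a
    risk_aversion-weighted penalty on outcome variance."""
    return statistics.mean(outcomes) - risk_aversion * statistics.pvariance(outcomes)

# Made-up outcome samples for two candidate actions:
press_button = [10, 10, 10, -20]   # usually great, occasionally disastrous
take_out_trash = [2, 2, 2]         # modest and certain

# A risk-neutral agent (coefficient 0.0) picks the button: higher mean.
# A mildly risk-averse agent (coefficient 0.1) is scared off by the
# variance and sticks to taking out the trash.
print(score(press_button, 0.0), score(take_out_trash, 0.0))
print(score(press_button, 0.1), score(take_out_trash, 0.1))
```

The same outcome distribution flips from "worth trying" to "avoid" purely as the risk-aversion knob turns, which is the lever the paragraph above suggests turning up as generalization improves.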

> AlphaStar, AlphaGo and OpenAI Five provide some evidence that this takeoff period will be short: after a long development period, each of them was able to improve rapidly from top amateur level to superhuman performance.

It seems like all of the very large advancements in AI have come in areas where either 1) we can accurately simulate an environment and final reward (as in chess or video games) in order to generate massive training data, or 2) we already have massive data we can use for training (e.g. the internet for GPT).

For some things, like communicating and negotiating with other AIs, or advancing mathematics, or even (unfortunately) things like hacking, fast progress in the near-future seems very (scarily!) plausible. And humans will help make AIs much more capable by simply connecting together lots of different AI systems (e.g. image recognition, LLMs, internet access, calculators, other tools, etc) and allowing self-querying and query loops. Other things (like physical coordination IRL) seem harder to advance rapidly, because you have to rely on generalization and a relatively low amount of relevant data.

> And for an AGI to trust that its goals will remain the same under retraining will likely require it to solve many of the same problems that the field of AGI safety is currently tackling - which should make us more optimistic that the rest of the world could solve those problems before a misaligned AGI undergoes recursive self-improvement.

Even if you have an AGI that can produce human-level performance on a wide variety of tasks, that won't mean that the AGI will 1) feel the need to trust that its goals will remain the same under retraining if you don't specifically tell it to, or will 2) be better than humans at either doing so or knowing when it's done so effectively. (After all, even AGI will be much better at some tasks than others.)

A concerning issue with AGI and superintelligent models is that if all they care about is their current loss function, then they won't want to have that loss function (or their descendants' loss functions) changed in any way, because doing so will [generally] hurt their ability to minimize that loss.

But that's a concern about future models, not a sure thing. Take humans - our loss function is genetic fitness. We've learned to like features that predict genetic fitness, like food and sex, but now that we have access to modern technology, you don't see many people aiming for dozens or thousands of children. Similarly, modern AGIs may only really care about features that are associated with minimizing the loss function they were trained on (not the loss function itself), even if they are aware of that loss function (as humans are of ours). When that is the case, you could have an AGI that can be told to improve itself in X / Y / Z ways that contradict its current loss function, and not really mind (because following human directions has led to lower loss in the past, and has therefore shaped its parameters toward following human directions - even if it knows conceptually that following this particular direction won't reduce loss under its most recent loss definition).