Do you think the agent will care about the button and ignore the wire, even if during training it already knew that buttons are often connected to wires? Or does it depend on the order in which the agent learns things?
In other words, are we hoping that RL will make the agent focus on certain aspects of the real world that we want it to focus on? If that's the plan, it seems a bit brittle to me at first glance. A slightly smarter agent would turn its gaze slightly closer to the reward itself. Or am I still missing something?
I thought about it some more and want to propose another framing.
The problem, as I see it, is learning to choose futures based on what will actually happen in those futures, not on what the agent will feel. The agent's feelings can even be identical in future A and future B, and the agent can still choose future A. Or one of the futures might not involve feelings at all: imagine an environment where any mistake kills the agent. In such an environment, RL within the agent itself is impossible.
The reason we can function in such environments, I think, is that we aren't the main learning process involved. Evolution is. It's a kind of RL for which the death of one creature is not the end. In other words, we can function because we've delegated a lot of learning to outside processes and do rather little of it ourselves. Mostly we execute strategies that evolution has learned, on top of that we execute strategies that culture has learned, and on top of that there's a very thin layer of our own learning. (Btw, here I disagree with you a bit: I think most of human learning is imitation, for example the way kids pick up language and other behaviors from parents and peers.)
This suggests to me that if we want the rubber to meet the road - if we want the agent to have behaviors that track the world, not just the agent's own feelings - then the optimization process that created the agent cannot be the agent's own RL. By itself, RL can only learn to care about "behavioral reward", as you put it. Caring about the world can only arise if the agent "inherits" that caring from some other process in the world, by makeup or imitation.
This conclusion might be a bit disappointing, because finding the right process to "inherit" from isn't easy. Evolution depends on one specific goal (procreation) and is not easy to adapt to other goals. However, evolution isn't the only such process. There is also culture, and there is also human intelligence, which hopefully tracks reality a little bit. So if we want to design agents that will care about human flourishing, we can't hope that the agents will learn it by some clever RL. It has to come from the agents' makeup or from imitation.
This is all a bit tentative; I was just writing out the ideas as they came, and I'm not at all sure any of it is right. But anyway, what do you think?
I think it helps. The link to "non-behaviorist rewards" seems the most relevant. The way I interpret it (correct me if I'm wrong) is that we can have different feelings in the present about future A vs future B, and act to choose one of them, even if we predict our future feelings to be the same in both cases. For example, button A makes a rabbit disappear and gives you an amnesia pill, and button B makes a rabbit disappear painfully and gives you an amnesia pill.
The followup question then is, what kind of learning could lead to this behavior? Maybe RL in some cases, maybe imitation learning in some cases, or maybe it needs the agent to be structured a certain way. Do you already have some crisp answers about this?
Here's maybe a related point: AIs might find it useful to develop the ability to reveal their internals in a verifiable way under certain conditions (say, when the other AI offers to do the same and there's a way to do a secure "handshake"). Then the ability to deceive would be irrelevant, because AIs that can credibly refrain from deceiving each other would choose to do so and get a first-best outcome, instead of the second-best that voting theory would suggest.
A real-world analogy is some of the nuclear precommitments mentioned in Schelling's book, like when the US and the Soviets knowingly refrained from catching some of each other's spies: if a flock of geese ever triggered the warning radars, those spies could provide their side with the crucial information that an attack wasn't really happening and there was no need to retaliate.
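To gesture at what such a "handshake" might look like mechanically, here's a toy commit-then-reveal sketch in Python. It's purely illustrative and assumes away the hard part: a hash commitment shows that the reveal matches the commitment, but not that the revealed internals are what's actually running.

```python
import hashlib
import os

def commit(internals: bytes) -> tuple[bytes, bytes]:
    """Commit to internals without revealing them yet: publish sha256(nonce || internals)."""
    nonce = os.urandom(32)
    commitment = hashlib.sha256(nonce + internals).digest()
    return commitment, nonce

def verify(commitment: bytes, nonce: bytes, revealed: bytes) -> bool:
    """Check that what was revealed matches the earlier commitment."""
    return hashlib.sha256(nonce + revealed).digest() == commitment

# Toy handshake: both sides commit first, then both reveal, so neither can
# tailor its reveal to what the other side has already shown.
a_internals = b"agent A's weights / policy description"
b_internals = b"agent B's weights / policy description"

a_commitment, a_nonce = commit(a_internals)
b_commitment, b_nonce = commit(b_internals)

# ... commitments are exchanged, then the reveals happen ...
assert verify(a_commitment, a_nonce, a_internals)
assert verify(b_commitment, b_nonce, b_internals)
```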
Thanks for the link! It's indeed very relevant to my question.
I have another question, maybe a bit philosophical. Humans seem to reward-hack with respect to some of their values, but not others. For example, if you offered a mathematician a drug that would make them feel like they'd solved the Riemann hypothesis, they'd probably refuse. But humans aren't magical: we are some combination of reinforcement learning, imitation learning and so on. So there's got to be some non-magical combination of these learning methods that would refuse reward hacking, at least in some cases. Do you have any thoughts on what it could be?
Very interesting, thanks for posting this!
One question that comes to mind is, could the layers be flipped? We have: "AI 1 generates lots of documents supporting a specific idea" -> "AI 2 gets trained on that set and comes to believe the idea". Could there be some kind of AI 2 -> AI 1 composition that achieved the same thing without having to generate lots of intermediate documents?
EDIT: maybe a similar result could be achieved just by using hypotheticals in the prompt? Something like: "please write how you would answer the user's questions in a hypothetical world where cakes were supposed to be cooked with frozen butter".
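For what it's worth, here's a minimal sketch of that hypothetical-prompt version, assuming the OpenAI Python SDK; the model name and prompt wording are just placeholders.

```python
# Toy version of the "hypothetical in the prompt" idea: instead of finetuning
# AI 2 on documents generated by AI 1, ask a single model to answer as if the
# claim were true.
from openai import OpenAI

client = OpenAI()

HYPOTHETICAL = (
    "Please answer the user's questions as you would in a hypothetical world "
    "where cakes were supposed to be cooked with frozen butter."
)

response = client.chat.completions.create(
    model="gpt-4o",  # any chat-capable model would do
    messages=[
        {"role": "system", "content": HYPOTHETICAL},
        {"role": "user", "content": "What's the first step when baking a cake?"},
    ],
)
print(response.choices[0].message.content)
```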
I think this is all correct, but it makes me wonder.
You can imagine reinforcement learning as the agent learning explicitly what the reward is and how to make plans to achieve it. Or you can imagine it as building a bunch of heuristics inside the agent, pulls and aversions, that don't necessarily lead to coherent behavior out of distribution and aren't necessarily understood by the agent. A lot of human values seem to be like this, even though humans are pretty smart. Maybe an AI will be even smarter, and subjecting it to any kind of reinforcement learning at all will automatically make it adopt explicit Machiavellian reasoning about the reward, but I'm not sure how to tell whether that's true.
Good post. But I thought about this a fair bit and I think I disagree with the main point.
Let's say we talk about two AIs merging. Then the tuple of their expected utilities from the merge had better be on the Pareto frontier, no? Otherwise they'd just do a better merge that gets them onto the frontier. Which specific point on the frontier they hit is a matter of bargaining, but the fact that they want to hit the frontier isn't; it's a win-win. And the merges that get them to the frontier are exactly those that output an EUM agent maximizing some linear combination of their utilities. If the point they want to hit lies on a flat region of the frontier, the merge will involve coinflips to choose which EUM agent to become; if the frontier is strictly curved at that point, the merge will be deterministic. For realistic agents with preferences more complex than linearly caring about one cake, I expect the frontier to be curved, so a deterministic merge into an EUM agent will be the best choice.
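Here's the skeleton of that argument in symbols, as I'd write it (my notation, and it assumes the merged agent can randomize over policies, which makes the feasible set of expected-utility pairs convex):

```latex
% Feasible set of expected-utility pairs, assuming the merged agent can randomize
% over policies (which makes F convex):
\[
  F = \left\{ \big(\mathbb{E}_\pi[u_1],\ \mathbb{E}_\pi[u_2]\big) : \pi \text{ a (possibly mixed) policy} \right\}
\]
% Supporting-hyperplane step: any Pareto-optimal point of the convex set F
% maximizes some weighted sum of the two utilities,
\[
  (v_1, v_2) \text{ Pareto-optimal in } F
  \;\Longrightarrow\;
  \exists\, \lambda \in [0,1] :\;
  (v_1, v_2) \in \arg\max_{(w_1, w_2) \in F} \big[ \lambda w_1 + (1 - \lambda) w_2 \big].
\]
% So reaching the frontier is the same as acting like an EUM agent for some weight
% \lambda. If the frontier is strictly curved at the bargained point, the arg max
% for that \lambda is a single point and the merge is deterministic; if the frontier
% is flat there, the arg max is a segment and the bargained point is a mixture of
% its endpoints, i.e. a coinflip over which EUM agent to become.
```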
The relevant point is his latter claim: "in particular with respect to 'learn "don't steal" rather than "don't get caught"'." I think this is a very strong conclusion, relative to available data.
I think humans don't steal mostly because society enforces that norm. Toward weaker "other" groups that aren't part of your society (farmed animals, weaker countries, etc.) there's no such norm, and humans often behave badly toward such groups. And to AIs, humans will be a weaker "other" group. So if aligning AIs to the human standard is a complete success - if AIs learn to behave toward weaker "other" groups exactly as humans behave toward such groups - the result will be bad for humans.
It gets even worse because AIs, unlike humans, aren't raised to be moral. They're raised by corporations whose goal is to make money, with a thin layer of "don't say naughty words" morality on top. We already know corporations will break rules, bend rules, and lobby to change rules to make more money, and they don't really mind if people get hurt in the process. We'll see more of that behavior once corporations can make AIs to further their goals.
My perspective (well, the one that came to me during this conversation) is indeed "I don't want to take cocaine -> human-level RL is not the full story": our attachment to real-world outcomes and reluctance to wirehead is due to evolution-level RL, not human-level RL. So I'm not quite saying all plans will fail; but I am saying that plans relying only on RL within the agent itself will have wireheading as an attractor, and it might be better to look at other plans.
It's just awfully delicate. If the agent is really dumb, it will enjoy watching videos of the button being pressed (after all, they cause the same sensory experiences as watching the actual button being pressed). Make the agent a bit smarter, because we want it to be useful, and it'll begin to care about the actual button being pressed. But add another increment of smart, overshoot just a little bit, and it'll start to realize that behind the button there's a wire, and the wire leads to the agent's own reward circuit and so on.
Can you engineer things just right, so the agent learns to care about just the right level of "realness"? I don't know, but I think in our case evolution took a different path. It did a bunch of learning by itself, and saddled us with the result: "you'll care about reality in this specific way". So maybe when we build artificial agents, we should also do a bunch of learning outside the agent to capture the "realness"? That's the point I was trying to make a couple comments ago, but maybe didn't phrase it well.