Historically, you very clearly thought that a major part of the problem was that AIs would not understand human concepts and preferences until after, or possibly very slightly before, achieving superintelligence. This is not how it seems to have gone.
Everyone agrees that you assumed superintelligence would understand everything humans understand, and more. The dispute is entirely about what you encounter before superintelligence. In general, it seems the world turned out to be much more gradual than you expected, and there is information to be found in which capabilities emerged earlier in the process.
When considering an embedder $E$, in universe $U$, in response to which SADT picks policy $\pi$, I would be tempted to apply the following coherence condition:

$$E(\pi) = U$$

(all approximately, of course)
I'm not sure this would work, though. It is definitely a necessary condition for reasonable counterfactuals, but it is not obviously sufficient.
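To make the shape of the condition concrete, here is a minimal sketch in Python. Everything in it is a hypothetical stand-in: `Policy`, `Universe`, and `Embedder` are toy types, the real objects are mathematical, and the real condition is only required to hold approximately.

```python
from typing import Callable

# Hypothetical stand-ins for the mathematical objects in the post:
Policy = Callable[[str], str]            # maps observations to actions
Universe = list                          # a trace of what actually happens
Embedder = Callable[[Policy], Universe]  # maps a policy to a universe

def coherence_holds(embedder: Embedder,
                    chosen_policy: Policy,
                    actual_universe: Universe) -> bool:
    """Necessary condition for reasonable counterfactuals: feeding the
    policy the agent actually picked back into the embedder should
    reproduce the universe the agent actually finds itself in.
    (Exact equality here simplifies the approximate condition.)"""
    return embedder(chosen_policy) == actual_universe
```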
By censoring I mean a specific technique for forcing the consistency of a possibly inconsistent set of axioms.
Suppose you have a set $D$ of deduction rules over a language $L$. You can construct a function $f_D : 2^L \to 2^L$ that takes a set of sentences and outputs all the sentences that can be proved in one step using $D$ and the sentences in the input set. You can also construct a censored $f_D'$ by letting $f_D'(S) = \{\phi \in f_D(S) \mid \neg\phi \notin S\}$.
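A minimal sketch of this construction in Python, under loudly toy assumptions: sentences are plain strings, negation is a syntactic "~" prefix, and the rule set $D$ is stood in for by a single hypothetical modus-ponens-style rule over strings.

```python
from typing import Callable

Rule = Callable[[set[str]], set[str]]

def negate(phi: str) -> str:
    """Purely syntactic negation: strip a leading "~" or prepend one."""
    return phi[1:] if phi.startswith("~") else "~" + phi

def f_D(sentences: set[str], rules: list[Rule]) -> set[str]:
    """All sentences provable in one step from `sentences` using `rules`
    (each rule maps the current set to newly derived sentences)."""
    derived = set(sentences)
    for rule in rules:
        derived |= rule(sentences)
    return derived

def censored_f_D(sentences: set[str], rules: list[Rule]) -> set[str]:
    """The censored operator: derive one step as usual, but drop any
    sentence whose negation is already in the current set, so iterating
    it never introduces a direct contradiction."""
    return {phi for phi in f_D(sentences, rules)
            if negate(phi) not in sentences}

# Hypothetical toy rule: from "p" and "p->q", derive "q" (all strings).
def modus_ponens(sentences: set[str]) -> set[str]:
    out = set()
    for s in sentences:
        if "->" in s:
            antecedent, consequent = s.split("->", 1)
            if antecedent in sentences:
                out.add(consequent)
    return out

axioms = {"p", "p->q", "~q"}                 # latently inconsistent
print(censored_f_D(axioms, [modus_ponens]))  # "q" is censored by "~q"
```

Iterating `censored_f_D` from a latently inconsistent axiom set keeps the generated set free of direct negation pairs, which is the sense in which censoring forces consistency.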
To pick a toy example, you can use text as a bottleneck to force systems to "think out loud" in a way that is very directly interpretable by a human reader, and because language understanding is so rich, this will actually be competitive with other approaches and often superior.
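A minimal sketch of that bottleneck, assuming a hypothetical `llm(prompt) -> completion` function (a stand-in, not any particular API): the only channel between the reasoning stage and the answering stage is a human-readable string, which a person can inspect or veto.

```python
from typing import Callable

def answer_with_visible_reasoning(question: str,
                                  llm: Callable[[str], str]) -> str:
    """Sketch of text as a bottleneck. `llm` is a hypothetical stand-in
    for any prompt -> completion function."""
    # Stage 1: the system must "think out loud" in plain text.
    reasoning = llm(f"Think step by step about: {question}\nReasoning:")

    # The bottleneck: this string is all that flows to stage 2, so a
    # human reader (or an automated monitor) can inspect it directly.
    print("Inspectable reasoning:\n" + reasoning)

    # Stage 2: the answer is produced from the visible reasoning alone.
    return llm(f"Using only this reasoning:\n{reasoning}\n"
               f"give a final answer to: {question}")
```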
I'm sure you can come up with more ways that the existence of software that understands language and does ~nothing else makes getting computers to do what you mean easier than if software did not understand language. Please think about the problem for 5 minutes. Use a clock.