[Note: this one, steelman, and feedback on proposals all have very similar input spaces. I think I would ideally mix them as one in an actual product, but I'm keeping them separate for now]
Input:
Currently AI systems are prone to bias and unfairness which is unaligned with our values. I w...
It’s also clear when reading these works and interacting with these researchers that they all get how alignment is about dealing with unbounded optimization, they understand fundamental problems and ideas related to instrumental convergence, the security mindset, the fragility of value, the orthogonality thesis…
I bet Adam will argue that this (or something similar) is the minimum we want for a research idea, because I agree with your point that we shouldn’t expect a solution to alignment to fall out of the marketing program for Oreos. We want to constrain it to at least “has a plausible story on reducing x-risk”, and maybe what’s mentioned in the quote as well.
Ya, I was even planning on trying:
[post/blog/paper] rohinmshah karma: 100 Planned summary for the Alignment Newsletter: \n>
Then feed that input to the model, followed by:
Planned opinion:
to see if that has some higher-quality summaries.
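Here's a rough sketch of the idea in Python, just to make the two-step prompting concrete. `generate` is a stand-in for whatever language-model completion call is available (a placeholder, not a real API), and the exact formatting is my guess at the setup:

```python
def generate(prompt: str) -> str:
    """Placeholder: swap in whichever language-model completion call you use."""
    return "<model output would go here>"

post_text = "..."  # the post/blog/paper to be summarized

# Step 1: prime the model with the newsletter format to elicit a summary.
summary_prompt = (
    post_text + "\n"
    "rohinmshah karma: 100\n"
    "Planned summary for the Alignment Newsletter:\n> "
)
summary = generate(summary_prompt)

# Step 2: feed the summary back in with "Planned opinion:" appended,
# to see whether the opinion framing pulls out higher-quality output.
opinion_prompt = summary_prompt + summary + "\n\nPlanned opinion:\n"
opinion = generate(opinion_prompt)
```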
For those with math backgrounds who aren't already familiar with InfraBayes (maybe people will share the post with their math-background friends), could there be specifics for context? Like:
If you have experience with topology, functional analysis, measure theory, and convex analysis then...
Or
You can get a good sense of InfraBayes from [this post] or [this one]
Or
A list of InfraBayes posts can be found here.
No, "why" is correct. See the rest of the sentence:
Write out all the counter-arguments you can think of, and repeat
It's saying: assume it's correct, then assume it's wrong, and repeat. Clever arguers don't usually play devil's advocate against themselves.
How do transcriptions typically handle images? They're pretty important for this talk. Could you embed the images in the text as it progresses?
Regarding generators of human values: say we have the genetic information that encodes human cognition; what does that mean? The equivalent of a simulated human? A capabilities secret-sauce algorithm, right? I'm unsure whether you can take the body out of a person and still have the same values, because I have felt senses in my body that tell me information about the world and how I relate to it.
Assuming it works as a simulated person and ignoring mindcrime, how do you algorithmically end up in a good-enough subset of human values (because not all human values are meta-good)...
Meanwhile, if you want to think it through for yourself, the general question is: where the hell do humans get all their bits-of-search from?
Cultural accumulation and Google, but that's mimicking someone who's already figured it out. How about the person who first figured out, eg, crop growth? It could be the scientific method, but also just random luck which then caught on.
Additionally, sometimes it's just applying the same hammers to different nails or finding new nails, which means there are general patterns (hammers) that can be applied to many different...
Thinking through the "vast majority of problem-space for X fails" argument; assume we have a random text generator from which we want a sorting algorithm (rough sketch below):
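Here's a minimal sketch of that experiment (my own framing: sample random character strings, treat each as a candidate Python program defining a function `f`, and count how many actually sort a small test list; essentially none will, which is the point):

```python
import random
import string

CHARS = string.ascii_lowercase + string.digits + " ()[]:=,.<>+-\n"
TEST_INPUT = [3, 1, 2]

def random_program(length: int = 40) -> str:
    # The "random text generator": uniformly sampled characters.
    return "".join(random.choice(CHARS) for _ in range(length))

def sorts_correctly(src: str) -> bool:
    env = {}
    try:
        exec(src, env)  # most samples aren't even valid Python
        return env["f"](list(TEST_INPUT)) == sorted(TEST_INPUT)
    except Exception:
        return False

hits = sum(sorts_correctly(random_program()) for _ in range(100_000))
print(f"{hits} / 100000 random programs sorted the test list")
```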
Any suggestions for the format in future weeks? Or a criticism of the idea in general?
I'm available for co-working to discuss any post or potential project on interpretability, or if you'd like someone to bounce ideas off of. My calendly link is here; I'm available all week at many times. I won't take more than 2 meetings in a day, but if that happens, I'll email you within the day to reschedule.
Do you want to co-work? Please include your availability and a way to contact you (I personally recommend calendly).
What are research directions you want discussed? Is there a framework or specific project you think would further transparency and interpretability?
Summary & Thoughts:
Defines corrigibility as “the agent’s willingness to let us change its policy w/o being incentivized to manipulate us”. Separates out terms to define:
Among optimal policies, those that let us correct the agent in the way we want are a small minority. If being corrected leads to more-optimal policies, it’s then optimal for the agent to manipulate us into “correcting” it. So we can’t get strict-corrigibility with...
The linguistic entropy point is countered by my previous point, right? Unless you want to say not everyone who posts in this community is capable of doing that? Or can naturally do that?
In these discussion logs, Yudkowsky goes to full Great more-epistemic-than-thou Philosopher mode, Confidently Predicting AGI Doom while Confidently Dismissing Everybody's AGI Alignment Research Results. Painful to read.
Hahaha, yes. Yudkowsky can easily be interpreted as condescending and annoying in those dialogues (and he could've done a better job at not coming across that way)...
I've updated my meeting times so I can meet more this week if you'd like to sign up for a slot (link w/ a pun), and from his comment, I'm sure diffractor would also be open to meeting.
I will point out a confusion in terms that I noticed in myself: corrigibility meaning either "always correctable" or "something like CEV". We can talk that over on a call too :)
I think we're pretty good at avoiding semantic arguments. The word "corrigible" can (and does) mean different things to different people on this site. Becoming explicit about what different properties you mean and which metrics they score well on resolves the disagreement. We can taboo the word corrigible.
This has actually already happened in the document with corrigible either meaning:
Meta: what are different formats this type of group collaboration could take? Comment suggestions with trade-offs, or discuss the costs/benefits of what I’m presenting in this post.
Potential topics: what other topics besides corrigibility could we collaborate on in future weeks? Also, are we able to poll users for topics on the site?
Update: I am available this week until Saturday evening at this calendly link (though I will close the openings if a large number of people sign up). I am available all of Saturday, Dec 4th (the calendly link will show your time zone). We can read and discuss posts, do tasks together, or whatever you want. Previous one-on-one conversations with members of the community have gone really well. There’s not a person here I haven’t enjoyed getting to know, so do feel free to click that link and book a time!
Meetups: want to co-work with others in the community? Comment availability, work preferences, and a way to contact you (eg calendly link, “dm me”, “my email is bob and alice dot com”, etc.).
The agent could then manipulate whoever’s in charge of giving the “hand-of-god” optimal action.
I do think the “reducing uncertainty” framing captures something relevant, and turntrout’s outside-view post (huh, guess I can’t make links on mobile, so here: https://www.lesswrong.com/posts/BMj6uMuyBidrdZkiD/corrigibility-as-outside-view) grounds out uncertainty as “how wrong am I about the true reward of the many different people I could be helping out?”
I don't think I understand the question. Can you rephrase?
Your example actually cleared this up for me as well! I wanted an example where the inequality failed even if you had an involution on hand.
You write
This point may seem obvious, but cardinality inequality is insufficient in general. The set copy relation is required for our results
Could you give a toy example of this being insufficient? (I'm assuming the "set copy relation" is the "B contains n of A" requirement.)
How does the "B contains n of A" requirement connect to the existential risks? I can see how shut-off as a 1-cycle fits, but not manipulating and deceiving people (though I do think those are bottlenecks to large numbers of outcomes).
Table 1 of the paper (pg. 3) is a very nice visual of the different settings.
For the "Theorem: Training retargetability criterion", where f(A, u) >= its involution, what would be a case where it's not greater/equal to its involution? Is this when the options in B are originally more optimal?
Also, that theorem requires each involution to be greater than or equal to the original. Is this just to get a lower bound on the n-multiple, or do less-than involutions not add anything?
So a highly competent preference helps predict those preferences. But I'm confused about how "violating one-sided competent preferences" makes sense with Goodhart's law.
As an example, "Prefer 2 bananas over 1" can be very competent if it correctly predicts the preference in a wide range of scenarios (eg different parts of the day, after anti-banana propaganda, etc.), with incompetent meaning its prediction is wrong (max entropy, or the opposite of correct?). Assuming it's competent, what does violating this preference mean? That the AI predicted 1 banana over 2, or that the simple rule "Prefers 2 over 1" didn't actually apply?
This post (or sequence of posts) not only gave me a better handle on impact and what that means for agents, but it also is a concrete example of de-confusion work. The execution of the explanations gives an "obvious in hindsight" feeling, with "5-minute timer"-like questions which pushed me to actually try and solve the open question of an impact measure. It's even inspired me to apply this approach to other topics in my life that had previously confused me; it gave me the tools and a model to follow.
And, the illustrations are pretty fun and engaging, too.
(b) seems right. I'm unsure what (a) could mean (not much overhead?).
I feel confused thinking about decomposability w/o considering the capabilities of the people I'm handing the tasks off to. I would only add:
By "smart", assume they can notice confusion, google, and program
since that makes the capabilities explicit.
If you only had access to people who can google, program, and notice confusion, how could you utilize that to make conceptual progress on a topic you care about?
Decomposable: Make a simple first-person shooter. It could be decomposed into creating asset models, and various parts of the actual code can be decomposed (input mapping, getting/dealing damage).
Non-decomposable: Help me write an awesome piano song. Although this can be decomposed, I don't expect anyone to have the skills required (and acquiring the skills requires too much overhead).
Let's operationalize "too much overhead" to mean "takes more than 10 hours to do useful, meaningful tasks".
The first one. As long as you can decompose the open problem into tractable, bite-sized pieces, it's good.
Vanessa mentioned some strategies that might generalize to other open problems: group decomposition (we decide how to break a problem up), programming to empirically verify X, and literature reviews.
I don't know (partially because I'm unsure who would stay and leave).
If you didn't take math background into consideration and wrote a proposal (saying "requires background in real analysis" or ...), then that may push out people w/o that background but also attract people with that background.
As long as pre-reqs are explicit, you should go for it.
I wonder how much COVID got people to switch to working on biorisks.
What I’m interested in here is talking to real researchers and asking what events would convince them to switch to alignment. Enumerating those would be useful for explaining it to them.
I think asking for specific capabilities would also be interesting, or what specific capabilities they would’ve named in 2012. Then asking how long they expect between that capability and an x-catastrophe.