All of Logan Riggs's Comments + Replies

Frame for Take-Off Speeds to inform compute governance & scaling alignment

I wonder how much COVID got people to switch to working on Biorisks.

What I’m interested in here is talking to real researchers and asking what events would convince them to switch to alignment. Enumerating those events would be useful when explaining the case to them.

I think asking about specific capabilities would also be interesting, or what specific capabilities they would have named in 2012, and then asking how long they expect between that capability appearing and an x-catastrophe.

Prize for Alignment Research Tasks

[Note: this one, steelman, and feedback on proposals all have very similar input spaces. I think I would ideally mix them as one in an actual product, but I'm keeping them separate for now]

Task: Obvious EA/Alignment Advice

  • Context: There are several common mental motions that the EA community does which are usefully applied to alignment. Ex. "Differential Impact", "Counterfactual Impact", "Can you clearly tell a story on how this reduces x-risk?", and "Truly Optimizing for X". A general "obvious advice" is useful for general capabilities as well, but this i
... (read more)
Prize for Alignment Research Tasks

Task: Steelman Alignment proposals

  • Context: Some alignment research directions/proposals have a kernel of truth to them. Steelmanning these ideas to find their best versions may open up new research directions or, more likely, make the pivot to alignment research easier. On the latter point, some people are resistant to changing their research direction, and a steelman only slightly changes the topic while focusing on maximizing impact. This would make it easier to convince these people to switch to a more alignment-related direction.
  • Input Type: A general resea
... (read more)
Prize for Alignment Research Tasks

Task: Feedback on alignment proposals

  • Context: Some proposals for a solution to alignment are dead ends or have common criticisms. Having an easy way of receiving this feedback on one's alignment proposal can prevent wasted effort and further the conversation around that feedback.
  • Input Type: A proposal for a solution to alignment or a general research direction
  • Output Type: Common criticisms or arguments for dead ends for that research direction

Instance 1

Input:

Currently AI systems are prone to bias and unfairness which is unaligned with our values. I w

... (read more)
Productive Mistakes, Not Perfect Answers

It’s also clear when reading these works and interacting with these researchers that they all get how alignment is about dealing with unbounded optimization, they understand fundamental problems and ideas related to instrumental convergence, the security mindset, the fragility of value, the orthogonality thesis…

I bet Adam will argue that this (or something similar) is the minimum we want for a research idea, because I agree with your point that we shouldn't expect a solution to alignment to fall out of the marketing program for Oreos. We want to constrain ideas to at least "has a plausible story for reducing x-risk", and maybe to what's mentioned in the quote as well.

2 · Joe_Collman · 1mo
For sure I agree that the researcher knowing these things is a good start - so getting as many potential researchers to grok these things is important. My question is about which ideas researchers should focus on generating/elaborating given that they understand these things.

We presumably don't want to restrict thinking to ideas that may overcome all these issues - since we want to use ideas that fail in some respects, but have some aspect that turns out to be useful. Generating a broad variety of new ideas is great, and we don't want to be too quick in throwing out those that miss the target.

The thing I'm unclear about is something like: What target(s) do I aim for if I want to generate the set of ideas with greatest value? I don't think that "Aim for full alignment solution" is the right target here. I also don't think that "Aim for wacky long-shots" is the right target - and of course I realize that Adam isn't suggesting this. (we might find ideas that look like wacky long-shots from outside, but we shouldn't be aiming for wacky long-shots) But I don't have a clear sense of what target I would aim for (or what process I'd use, what environment I'd set up, what kind of people I'd involve...), if my goal were specifically to generate promising ideas (rather than to work on them long-term, or to generate ideas that I could productively work on).

Another disanalogy with previous research/invention... is that we need to solve this particular problem. So in some sense a history of: [initially garbage-looking-idea] ---> [important research problem solved] may not be relevant. What we need is: [initially garbage-looking-idea generated as attempt to solve x] ---> [x was solved]. It's not good enough if we find ideas that are useful for something, they need to be useful for this. I expect the kinds of processes that work well to look different from those used where there's no fixed problem.
A survey of tool use and workflows in alignment research

Ya, I was even planning on trying:

[post/blog/paper] rohinmshah karma: 100 Planned summary for the Alignment Newsletter: \n> 

Then feed that input into:

Planned opinion:

to see if that has some higher-quality summaries. 
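To make the intended workflow concrete, here is a minimal sketch (my own construction, not from the original comment) of the two-stage prompting described above; the generate helper, the variable names, and the exact prompt layout are illustrative assumptions about whatever language model is being used:

    def generate(prompt: str) -> str:
        """Hypothetical stand-in for a call to the language model being tested."""
        return "<model output>"  # placeholder; a real completion call would go here

    post_text = "..."  # full text of the post/blog/paper to summarize

    # Stage 1: condition on the high-karma "Planned summary" prefix to elicit a summary.
    summary_prompt = (
        post_text + "\n"
        "rohinmshah karma: 100\n"
        "Planned summary for the Alignment Newsletter:\n> "
    )
    summary = generate(summary_prompt)

    # Stage 2: append the generated summary and a "Planned opinion:" prefix, then sample again.
    opinion_prompt = summary_prompt + summary + "\n\nPlanned opinion:\n"
    opinion = generate(opinion_prompt)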

Job Offering: Help Communicate Infrabayesianism

For those with math backgrounds who aren't already familiar with InfraBayes (maybe people will share the post with their math-background friends), could the post give some specifics for context? Like:

If you have experience with topology, functional analysis, measure theory, and convex analysis then...

Or

You can get a good sense of InfraBayes from [this post] or [this one]

Or

A list of InfraBayes posts can be found here.

How I Formed My Own Views About AI Safety

No, "why" is correct. See the rest of the sentence:

Write out all the counter-arguments you can think of, and repeat
 

It's saying: assume it's correct, then assume it's wrong, and repeat. Clever arguers don't usually play devil's advocate against themselves.

The Big Picture Of Alignment (Talk Part 1)

How do transcriptions typically handle images? They're pretty important for this talk. Could you embed the images in the text as it progresses?

The Big Picture Of Alignment (Talk Part 1)

Regarding generators of human values: say we have the genetic information that encodes human cognition - what does that mean? Is it equivalent to a simulated human? That would be the capabilities secret-sauce algorithm, right? I'm unsure whether you can take the body out of a person and still have the same values, because I have felt senses in my body that tell me information about the world and how I relate to it.

Assuming it works as a simulated person and ignoring mindcrime, how do you algorithmically end up in a good enough subset of human values (because not all human values are meta-good)... (read more)

The Big Picture Of Alignment (Talk Part 1)

Meanwhile, if you want to think it through for yourself, the general question is: where the hell do humans get all their bits-of-search from?

Cultural accumulation and Google, but that's mimicking someone who's already figured it out. How about the person who first figured out, e.g., crop growth? It could be the scientific method, but also just random luck which then caught on.

Additionally, sometimes it's just applying the same hammers to different nails or finding new nails, which means that there are general patterns (hammers) that can be applied to many diffe

... (read more)
The Big Picture Of Alignment (Talk Part 1)

Thinking through the "vast majority of problem-space for X fails" argument; assume we have a random text generator from which we want to get a sorting algorithm:

  • The vast majority don't sort (or even compile)
  • The vast majority of programs that "look like they work", don't (eg "forgot a semicolon", "didn't account for an already sorted list", etc)
  • Generalizing: the vast majority of programs that pass [Unit tests, compiles, human says "looks good to me", simple], don't work. 
    • Could be incomprehensible, pass several unit tests, but still fail in a weird edge case (see the sketch below)
... (read more)
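As a concrete (invented) illustration of the last bullet, here is a minimal Python sketch of a program that compiles, passes a few simple unit tests, and could easily get a "looks good to me", yet fails on an edge case; buggy_sort is my own example, not from the original comment:

    def buggy_sort(xs):
        # Looks plausible, but silently drops duplicate elements.
        return sorted(set(xs))

    # Simple unit tests a reviewer might write: all pass.
    assert buggy_sort([3, 1, 2]) == [1, 2, 3]
    assert buggy_sort([]) == []
    assert buggy_sort([5]) == [5]

    # Edge case with duplicates reveals the failure.
    assert buggy_sort([2, 1, 2]) == [1, 2, 2]  # AssertionError: buggy_sort returns [1, 2]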
Solving Interpretability Week

Any suggestions for the format in future weeks? Or a criticism of the idea in general?

Solving Interpretability Week

I'm available for co-working to discuss any post or potential project on interpretability or if you'd like someone to bounce ideas off of. My calendly link is here, I'm available all week at many times, and I won't take more than 2 meetings in a day, but I'll email you within the day to reschedule if that happens.

Solving Interpretability Week

Do you want to co-work? Please include your availability and way to contact you (I personally recommend calendly)

1 · Evan R. Murphy · 5mo
I'm interested in trying a co-work call sometime but won't have time for it this week. Thanks for sharing about Shay in this post. I had not heard of her before, what a valuable resource/way she's helping the cause of AI safety. (As for contact, I check my LessWrong/Alignment Forum inbox for messages regularly.)
1 · Logan Riggs Smith · 5mo
I'm available for co-working to discuss any post or potential project on interpretability or if you'd like someone to bounce ideas off of. My calendly link is here [https://calendly.com/elriggs/chat?back=1&month=2021-12], I'm available all week at many times, and I won't take more than 2 meetings in a day, but I'll email you within the day to reschedule if that happens.
Solving Interpretability Week

What are research directions you want discussed? Is there a framework or specific project you think would further transparency and interpretability? 

Corrigibility Can Be VNM-Incoherent

Summary & Thoughts:

Defines corrigibility as an "agent's willingness to let us change its policy without being incentivized to manipulate us". It separates out two terms:

  1. Weakly corrigible to a policy change pi - there exists an optimal policy under which not disabling correction is optimal.
  2. Strictly corrigible - all optimal policies avoid disabling correction.

Among optimal policies, those that let us correct the agent in the way we want are a small minority. If being "corrected" leads to more optimal policies, it's then optimal for the agent to manipulate us into "correcting" it. So we can't get strict corrigibility with... (read more)
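A minimal formalization sketch of the two notions as I read them; the notation (a set \Pi^* of optimal policies and a "disables correction" predicate) is my own gloss, not taken from the post:

    \text{Weakly corrigible to a change } \pi: \quad \exists\, \pi^* \in \Pi^* \text{ such that } \pi^* \text{ does not disable correction}
    \text{Strictly corrigible:} \quad \forall\, \pi^* \in \Pi^*,\; \pi^* \text{ does not disable correction}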

Solve Corrigibility Week

The linguistic entropy point is countered by my previous point, right? Unless you want to say not everyone who posts in this community is capable of doing that? Or can naturally do that?

In these discussion logs, Yudkowsky goes to full Great more-epistemic-than-thou Philosopher mode, Confidently Predicting AGI Doom while Confidently Dismissing Everybody's AGI Alignment Research Results. Painful to read.

Hahaha, yes. Yudkowsky can easily be interpreted as condescending and annoying in those dialogues (and he could've done a better job at not coming across tha... (read more)

1 · Koen Holtman · 6mo
Yes, by calling this site a "community of philosophers", I roughly mean that at the level of the entire community, nobody can agree that progress is being made. There is no mechanism for creating a community-wide agreement that a problem has been solved.

You give three specific examples of progress above. From his recent writings, it is clear that Yudkowsky does not believe, like you do, that any contributions posted on this site in the last few years have made any meaningful progress towards solving alignment. You and I may agree that some or all of the above three examples represent some form of progress, but you and I are not the entire community here, Yudkowsky is also part of it.

On the last one of your three examples, I feel that 'mesa optimizers' is another regrettable example of the forces of linguistic entropy overwhelming any attempts at developing crisply stated definitions which are then accepted and leveraged by the entire community. It is not like the people posting on this site are incapable of using the tools needed to crisply define things, the problem is that many do not seem very interested in ever using other people's definitions or models as a frame of reference. They'd rather free-associate on the term, and then develop their own strongly held beliefs of what it is all supposed to be about.

I am sensing from your comments that you believe that, with more hard work and further progress on understanding alignment, it will in theory be possible to make this community agree, in future, that certain alignment problems have been solved. I, on the other hand, do not believe that it is possible to ever reach that state of agreement in this community, because the debating rules of philosophy apply here. Philosophers are always allowed to disagree based on strongly held intuitive beliefs that they cannot be expected to explain any further. The type of agreement you seek is only possible in a sub-community which is willing to use more strict rules of
Solve Corrigibility Week

I've updated my meeting times to meet more this week if you'd like to sign up for a slot (link w/ a pun), and from his comment, I'm sure Diffractor would also be open to meeting.

I will point out a confusion in terms that I noticed in myself: corrigibility meaning either "always correctable" or "something like CEV". We can talk that over on a call too :)

Solve Corrigibility Week

I think we're pretty good at avoiding semantic arguments. The word "corrigible" can (and does) mean different things to different people on this site. Becoming explicit about what different properties you mean and which metrics they score well on resolves the disagreement. We can taboo the word corrigible.

This has actually already happened in the document with corrigible either meaning:

  1. Correctable all the time regardless
  2. Correctable up until the point where the agent actually knows how to achieve your values better than you (related to intent alignment and
... (read more)
1 · Koen Holtman · 6mo
Indeed this can resolve disagreement among a small sub-group of active participants. This is an important tool if you want to make any progress. The point I was trying to make is about what is achievable for the entire community, not what is achievable for a small sub-group of committed participants. The community of people who post on this site have absolutely no mechanism for agreeing among themselves whether a problem has been solved, or whether some sub-group has made meaningful progress on it.

To make the same point in another way: the forces which introduce disagreeing viewpoints and linguistic entropy [https://www.lesswrong.com/posts/MiYkTp6QYKXdJbchu/disentangling-corrigibility-2015-2021#Linguistic_entropy] to this forum are stronger than the forces that push towards agreement and clarity. My thinking about how strong these forces are has been updated recently, by the posting of a whole sequence of Yudkowsky conversations [https://www.lesswrong.com/s/n945eovrA3oDueqtq] and also this one [https://www.lesswrong.com/posts/CpvyhFy9WvCNsifkY/discussion-with-eliezer-yudkowsky-on-agi-interventions].

In these discussion logs, Yudkowsky goes to full Great more-epistemic-than-thou Philosopher mode, Confidently Predicting AGI Doom while Confidently Dismissing Everybody's AGI Alignment Research Results. Painful to read. I am way past Denial and Bargaining, I have Accepted that this site is a community of philosophers.
Solve Corrigibility Week
  • Google docs is kind of weird because I have to trust people won't spam suggestions. I also may need to keep up with allowing suggestions on a consistent basis. I would want this hosted on LW/AlignmentForum, but I do really like in-line commenting and feeling like there's less of a quality-bar to meet. I'm unsure if this is just me.
  • Walled Garden group discussion block time: have a block of ~4-16 hours using Walled Garden software. There could be a flexible schedule with Schelling points to coordinate meeting up. For example, if someone wants to give a talk
... (read more)
Solve Corrigibility Week

Meta: what are different formats this type of group collaboration could take? Comment suggestions with trade offs or discuss the cost/benefits of what I’m presenting in this post.

1 · Logan Riggs Smith · 6mo
  • Google docs is kind of weird because I have to trust people won't spam suggestions. I also may need to keep up with allowing suggestions on a consistent basis. I would want this hosted on LW/AlignmentForum, but I do really like in-line commenting and feeling like there's less of a quality-bar to meet. I'm unsure if this is just me.
  • Walled Garden group discussion block time: have a block of ~4-16 hours using Walled Garden software. There could be a flexible schedule with Schelling points to coordinate meeting up. For example, if someone wants to give a talk on a specific corrigibility research direction and get live feedback/discussion, they can schedule a time to do so.
  • Breaking up the task comment. Technically the literature review, summaries, and extra thoughts are each a "task" to do. I do want broken-down tasks that many people could do, though what may end up happening is whoever wants a specific task done ends up doing it themselves. Could also have "possible research directions" as a high-level comment.
Solve Corrigibility Week
  • Timelines and forecasting 
  • Goodhart’s law
  • Power-seeking
  • Human values
  • Learning from human feedback
  • Pivotal actions
  • Bootstrapping alignment 
  • Embedded agency 
  • Primer on language models, reinforcement learning, or machine learning basics 
    • This one's not really on-topic, but I do see value in a more “getting up to date” focus where experts can give talks or references to learn things (eg “here’s a tutorial for implementing a small GPT-2”). Though I could just periodically ask LW questions on whatever topic ends up interesting me at the moment. Though,
... (read more)
Solve Corrigibility Week

Potential topics: what other topics besides corrigibility could we collaborate on in future weeks? Also, are we able to poll users for topics in site?

2 · Logan Riggs Smith · 6mo
  • Timelines and forecasting
  • Goodhart's law
  • Power-seeking
  • Human values
  • Learning from human feedback
  • Pivotal actions
  • Bootstrapping alignment
  • Embedded agency
  • Primer on language models, reinforcement learning, or machine learning basics
    • This one's not really on-topic, but I do see value in a more "getting up to date" focus where experts can give talks or references to learn things (eg "here's a tutorial for implementing a small GPT-2"). Though I could just periodically ask LW questions on whatever topic ends up interesting me at the moment. I could do my own Google search, but I feel there's some community value here that won't be gained: learning and teaching together makes it easier for the community to coordinate in the future. Plus connection bonuses.
Solve Corrigibility Week

Update: I am available this week until Saturday evening at this calendly link (though I will close the openings if a large number of people sign up). I am available all Saturday Dec 4th (the calendly link will allow you to see your time zone). We can read and discuss posts, do tasks together, or whatever you want. Previous one-on-one conversations with members of the community have gone really well. There's not a person here I haven't enjoyed getting to know, so do feel free to click that link and book a time!

Solve Corrigibility Week

Meetups: want to co-work with others in the community? Comment availability, work preferences, and a way to contact you (eg calendly link, "dm me", "my email is bob at alice dot com", etc).

1 · Diffractor · 6mo
Availability: Almost all times between 10 AM and PM, California time, regardless of day. Highly flexible hours. Text over voice is preferred, I'm easiest to reach on Discord. The LW Walled Garden can also be nice.
1 · Logan Riggs Smith · 6mo
Update: I am available this week until Saturday evening at this calendly link [https://calendly.com/elriggs/chat] (though I will close the openings if a large number of people sign up). I am available all Saturday Dec 4th [https://calendly.com/elriggs/solving-corrigibility-day] (the calendly link will allow you to see your time zone). We can read and discuss posts, do tasks together, or whatever you want. Previous one-on-one conversations with members of the community have gone really well. There's not a person here I haven't enjoyed getting to know, so do feel free to click that link and book a time!
Corrigibility Can Be VNM-Incoherent

The agent could then manipulate whoever’s in charge of giving the “hand-of-god” optimal action.

I do think the “reducing uncertainty” framing captures something relevant, and TurnTrout's outside-view post (huh, guess I can't make links on mobile, so here: https://www.lesswrong.com/posts/BMj6uMuyBidrdZkiD/corrigibility-as-outside-view) grounds out uncertainty as “how wrong am I about the true reward of the many different people I could be helping out?”

Satisficers Tend To Seek Power: Instrumental Convergence Via Retargetability

I don't think I understand the question. Can you rephrase?

Your example actually cleared this up for me as well! I wanted an example where the inequality failed even if you had an involution on hand. 

Satisficers Tend To Seek Power: Instrumental Convergence Via Retargetability

You write

This point may seem obvious, but cardinality inequality is insufficient in general. The set copy relation is required for our results

Could you give a toy example of this being insufficient (I'm assuming the "set copy relation" is the "B contains n of A" requirement)?

How does the "B contains n of A" requirement relate to the existential risks? I can see how shut-off as a 1-cycle fits, but not manipulating and deceiving people (though I do think those are bottlenecks to large numbers of outcomes).

2 · Alex Turner · 6mo
A := {(1, 0, 0)}, B := {(0, .3, .7), (0, .7, .3)}. Less opaquely, see the technical explanation for this counterexample [https://www.lesswrong.com/s/fSMbebQyR4wheRrvk/p/6DuJxY8X45Sco4bS2#When_is_Seeking_POWER_Convergently_Instrumental_], where the right action leads to two trajectories, and up leads to a single one.

For this, I think we need to zoom out to a causal DAG (w/ choice nodes) picture of the world, over some reasonable abstractions. It's just too unnatural to pick out deception subgraphs in an MDP, as far as I can tell, but maybe there's another version of the argument.

If the AI cares about things-in-the-world, then if it were a singleton it could set many nodes to desired values independently. For example, the nodes might represent variable settings for different parts of the universe—what's going on in the asteroid belt, in Alpha Centauri, etc. But if it has to work with other agents (or, heaven forbid, be subjugated by them), it has fewer degrees of freedom in what-happens-in-the-universe. You can map copies of the "low control" configurations to the "high control" configurations several times, I think. (I think it should be possible to make precise what I mean by "control", in a way that should fairly neatly map back onto POWER-as-average-optimal-value.)

So this implies a push for "control." One way to get control is manipulation or deception or other trickery, and so deception is one possible way this instrumental convergence "prophecy" could be fulfilled.
Satisficers Tend To Seek Power: Instrumental Convergence Via Retargetability

Table 1 of the paper (pg. 3) is a very nice visual of the different settings.

For the "Theorem: Training retargetability criterion", where f(A, u) >= its involution, what would be the case where it's not greater than or equal to its involution? Is this when the options in B are originally more optimal?

Also, that theorem requires each involution to be greater than or equal to the original. Is this just to get a lower bound on the n-multiple, or do less-than involutions not add anything?

3 · Alex Turner · 6mo
I don't think I understand the question. Can you rephrase?

Less-than involutions aren't guaranteed to add anything. For example, if f(a) = 1 iff a goes left and 0 otherwise, any involutions to plans going right will be 0, and all orbits will unanimously agree that left has greater f-value.
Competent Preferences

So a highly competent preference is one that predicts the person's preferences well across situations. But I'm confused about how "violating one-sided competent preferences" makes sense with Goodhart's law.

As an example, "Prefer 2 bananas over 1" can be very competent if it correctly predicts the preference in a wide range of scenarios (eg different parts of the day, after anti-banana propaganda, etc), with incompetent meaning its prediction is wrong (max entropy, or the opposite of correct?). Assuming it's competent, what does violating this preference mean? That the AI predicted 1 banana over 2, or that the simple rule "Prefers 2 over 1" didn't actually apply?

2 · Charlie Steiner · 7mo
By "violate a preference," I mean that the preference doesn't get satisfied - so if the human competently prefers 2 bananas but only got 1 banana, their preference has been violated.

But maybe you mean something along the lines of "If competent preferences are really broadly predictive, then wouldn't it be even more predictive to infer the preference 'the human prefers 2 bananas except when the AI gives them 1', since that would more accurately predict how many bananas the human gets? This would sort of paint us into a corner where it's hard to violate competent preferences as defined."

My response would be that competence is based off of how predictive and efficient the model is (just to reiterate, preferences live inside a model of the world), not how often you get what you want. Even if you never get 2 bananas and have only gotten 1 banana your entire life, a model that predicts that you want 2 bananas can still be competent if the hypothesis of you wanting 2 bananas helps explain how you've reacted to your life as a 1-banana-getter.
Reframing Impact

This post (or sequence of posts) not only gave me a better handle on impact and what that means for agents, but it also is a concrete example of de-confusion work. The execution of the explanations gives an "obvious in hindsight" feeling, with "5-minute timer"-like questions which pushed me to actually try and solve the open question of an impact measure. It's even inspired me to apply this approach to other topics in my life that had previously confused me; it gave me the tools and a model to follow.

And, the illustrations are pretty fun and engaging, too.

What's a Decomposable Alignment Topic?

(b) seems right. I'm unsure what (a) could mean (not much overhead?).

I feel confused thinking about decomposability w/o considering the capabilities of the people I'm handing the tasks off to. I would only add:

By "smart", assume they can notice confusion, google, and program

since that makes the capabilities explicit.

What's a Decomposable Alignment Topic?

If you only had access to people who can google, program, and notice confusion, how could you utilize that to make conceptual progress on a topic you care about?

Decomposable: Make a simple first-person shooter. It could be decomposed into creating asset models, and various parts of the actual code can be decomposed (input-mapping, getting/dealing damage).

Non-decomposable: Help me write an awesome piano song. Although this can be decomposed, I don't expect anyone to have the skills required (and acquiring the skills requires too much overhead).

Let's operationalize "too much overhead" to mean "takes more than 10 hours to do useful, meaningful tasks".

4 · Raymond Arnold · 2y
Am I correct that the real generating rule here is something like "I have a group of people who'd like to work on some alignment open problems, and want a problem that is a) easy to give my group, and b) easy to subdivide once given to my group?"
What's a Decomposable Alignment Topic?

The first one. As long as you can decompose the open problem into tractable, bite-sized pieces, it's good.

Vanessa mentioned some strategies that might generalize to other open problems: group decomposition (we decide how to break a problem up), programming to empirically verify X, and literature reviews.

2 · Abram Demski · 2y
I'm unclear on how to apply this filter. Can you give an example of what you mean by decomposable, and an example of not? (Perhaps not from alignment.)
What's a Decomposable Alignment Topic?

I don't know (partially because I'm unsure who would stay and leave).

If you didn't take math background into consideration and wrote a proposal (saying "requires background in real analysis" or ...), then that may push out people w/o that background, but also attract people with that background.

As long as pre-reqs are explicit, you should go for it.