Comments

I agree the term AGI is rough and might be more misleading than it's worth in some cases. But I do quite strongly disagree that current models are 'AGI' in the sense most people intend.

Examples of very important areas where 'average humans' plausibly do way better than current transformers:

  • Most humans succeed in making money autonomously. Even if they might not come up with a great idea to quickly 10x $100 through entrepreneurship, they are able to find and execute jobs that people are willing to pay a lot of money for. And many of these jobs are digital and could in theory be done just as well by AIs. Certainly there is a ton of infrastructure built up around humans that helps them accomplish this and which doesn't really exist for AI systems yet, but if this situation were somehow equalized I would very strongly bet on the average human doing better than the average GPT-4-based agent. It seems clear to me that humans are just way more resourceful, agentic, and able to learn and adapt than current transformers are in key ways.
  • Many humans currently do drastically better on the METR task suite (https://github.com/METR/public-tasks) than any AI agents, and I think this captures some important missing capabilities that I would expect an 'AGI' system to possess. This is complicated somewhat by the human subjects not being 'average' in many ways, e.g. we've mostly tried this with US tech professionals and the tasks include a lot of SWE, so most people would likely fail due to lack of coding experience.
  • Take enough randomly sampled humans and set them up with the right incentives and they will form societies, invent incredible technologies, build productive companies etc., whereas I don't think you'll get anything close to this with a bunch of GPT-4 copies at the moment.

I think AGI for most people evokes something that would do as well as humans on real-world things like the above, not just something that does as well as humans on standardized tests.

ARC evals has only existed since last fall, so for obvious reasons we have not evaluated very early versions. Going forward I think it would be valuable and important to evaluate models during training or to scale up models in incremental steps.

As someone who has been feeling increasingly skeptical of working in academia I really appreciate this post and discussion on it for challenging some of my thinking here. 

I do want to respond especially to this part though, which seems cruxy to me:

Furthermore, it is a mistake to simply focus on efforts on whatever timelines seem most likely; one should also consider tractability and neglectedness of strategies that target different timelines. It seems plausible that we are just screwed on short timelines, and somewhat longer timelines are more tractable. Also, people seem to be making this mistake a lot and thus short timelines seem potentially less neglected.

I suspect this argument pushes in the other direction. On longer timelines the amount of effort which will eventually get put toward the problem is much greater. If the community continues to grow at the current pace, then 20-year-timeline worlds might end up seeing almost 1000x as much total effort put toward the problem as 5-year-timeline worlds. So neglectedness considerations might tell us that impacts on 5-year-timeline worlds are 1000x more important than impacts on 20-year-timeline worlds. This is of course mitigated by the potential for your actions to accrue more positive knock-on effects over 20 years; for instance, very effective field-building efforts could probably overcome this neglectedness penalty in some cases. But in terms of direct impacts on different timeline scenarios this seems like a very strong effect.
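To make the arithmetic behind the "almost 1000x" figure concrete, here is a minimal sketch. It assumes yearly effort grows geometrically at a constant rate and ignores any effort accumulated before today; the function names and the growth rates tried are hypothetical illustrations, not numbers from the comment. Under those assumptions, a ratio on the order of 1000x between 20-year and 5-year worlds needs sustained growth of very roughly 60% per year.

```python
def cumulative_effort(years: int, annual_growth: float, initial: float = 1.0) -> float:
    """Total effort over `years`, with yearly effort growing geometrically.

    Assumption: effort in year t is initial * (1 + annual_growth) ** t.
    """
    return sum(initial * (1.0 + annual_growth) ** t for t in range(years))


def effort_ratio(long_years: int, short_years: int, annual_growth: float) -> float:
    """How many times more total effort the longer-timeline world sees."""
    return cumulative_effort(long_years, annual_growth) / cumulative_effort(
        short_years, annual_growth
    )


if __name__ == "__main__":
    # Hypothetical growth rates, chosen only to show how sensitive the ratio is.
    for growth in (0.0, 0.25, 0.40, 0.58):
        ratio = effort_ratio(20, 5, growth)
        print(f"growth {growth:.0%}/yr -> 20y world has {ratio:,.0f}x the effort of 5y world")
```

With zero growth the ratio is just 20/5 = 4, so nearly all of the claimed neglectedness gap comes from the assumed growth rate continuing for the whole period.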

On the tractability point, I suspect you need some overly confident model of how difficult alignment turns out to be for this to overcome the neglectedness penalty. E.g. Owen Cotton-Barratt suggests here using a log-uniform prior for the difficulty of unknown problems, which (unless you think alignment success in short timelines is essentially impossible) would indicate that tractability is constant. As a less crude approximation we might use something like a log-normal distribution for the difficulty of solving alignment, where we see overall decreasing returns to effort unless you have extremely low variance (implying you know almost exactly which OOM of effort is enough to solve alignment) or extremely low probability of success by default (<< 1%).
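The two priors above can be sketched numerically. This is an illustration under stated assumptions, not a model taken from the linked post: "difficulty" is taken to mean the total effort needed, the problem counts as solved once cumulative effort exceeds it, and all function names and parameter values are hypothetical. Under a log-uniform prior every doubling of effort buys the same increase in success probability. Under a log-normal prior, the marginal return to effort is the log-normal density, which rises only below its mode exp(mu - sigma^2); the success probability at that point is Phi(-sigma), so returns are increasing only while the probability of success so far is below Phi(-sigma).

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def loguniform_success(effort: float, lo: float, hi: float) -> float:
    """P(difficulty <= effort) when log(difficulty) is uniform on [log lo, log hi]."""
    clipped = min(max(effort, lo), hi)
    return math.log(clipped / lo) / math.log(hi / lo)


def lognormal_marginal_return(effort: float, mu: float, sigma: float) -> float:
    """d/d(effort) of P(difficulty <= effort): the log-normal density at `effort`."""
    z = (math.log(effort) - mu) / sigma
    return math.exp(-0.5 * z * z) / (effort * sigma * math.sqrt(2.0 * math.pi))


def increasing_returns_ceiling(sigma_ooms: float) -> float:
    """Success probability at the log-normal mode, Phi(-sigma).

    `sigma_ooms` is the standard deviation of difficulty in orders of magnitude.
    Marginal returns to effort increase only below this success probability.
    """
    sigma = sigma_ooms * math.log(10.0)
    return normal_cdf(-sigma)


if __name__ == "__main__":
    # Log-uniform: each doubling adds the same probability (hypothetical range 1..1e6).
    for effort in (10.0, 100.0, 1000.0):
        gain = loguniform_success(2 * effort, 1.0, 1e6) - loguniform_success(effort, 1.0, 1e6)
        print(f"log-uniform, effort {effort:g}: gain from doubling = {gain:.4f}")

    # Log-normal: how low P(success) must be for returns to still be increasing.
    for ooms in (0.25, 0.5, 1.0, 2.0):
        print(f"sigma = {ooms} OOM -> increasing returns only below "
              f"P(success) = {increasing_returns_ceiling(ooms):.2g}")
```

With an uncertainty of one order of magnitude, the ceiling works out to roughly 1%, which matches the "extremely low variance or << 1% success probability" condition in the comment.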

Overall my current guess is that tractability/neglectedness pushes toward working on short timelines, and gives a penalty to delayed impact of perhaps 10x per decade (20x penalty from neglectedness, compensated by a 2x increase in tractability). 

If you think that neglectedness/tractability overall pushes toward targeting impact toward long timelines then I'd be curious to see that spelled out more clearly (e.g. as a distribution over the difficulty of solving alignment that implies some domain of increasing returns to effort, or some alternative way to model this). This seems very important if true.  

These sorts of problems are what caused me to want a presentation which didn't assume well-defined agents and boundaries in the ontology, but I'm not sure how it applies to the above - I am not looking for optimization as a behavioral pattern but as a concrete type of computation, which involves storing world-models and goals and doing active search for actions which further the goals. Neither a thermostat nor the world outside seems to do this from what I can see? I think I'm likely missing your point.

Theron Pummer has written about this precise thing in his paper on Spectrum Arguments, where he touches on this argument for "transitivity=>comparability" (here notably used as an argument against transitivity rather than an argument for comparability) and its relation to 'Sorites arguments' such as the one about sand heaps.

Personally I think the spectrum arguments are fairly convincing for making me believe in comparability, but I think there's a wide range of possible positions here and it's not entirely obvious which are actually inconsistent. Pummer even seemed to think rejecting transitivity and comparability could be a plausible position and that the math could work out in nice ways still.

Understanding the internal mechanics of corrigibility seems very important, and I think this post helped me get a more fine-grained understanding and vocabulary for it.

I've historically strongly preferred the type of corrigibility which comes from pointing to the goal and letting it be corrigible for instrumental reasons, I think largely because it seems very elegant and that when it works many good properties seem to pop out 'for free'. For instance, the agent is motivated to improve communication methods, avoid coercion, tile properly and even possibly improve its corrigibility - as long as the pointer really is correct. I agree though that this solution doesn't seem stable to mistakes in the 'pointing', which is very concerning and makes me start to lean toward something more like act-based corrigibility being safer.

I'm still very pessimistic about indifference corrigibility though, in that it still seems extremely fragile/low-measure-in-agent-space. I think maybe I'm stuck imagining complex/unnatural indifference, as in finding agents indifferent to whether a stop-button is pressed, and that my intuition might change if I spend more time thinking about examples like myopia or world-model <-> world interaction, where the indifference seems to have more 'natural' boundaries in some sense.

I really like this model of computation and how naturally it deals with counterfactuals, surprised it isn't talked about more often.

This raises the issue of abstraction - the core problem of embedded agency.

I'd like to understand this claim better - are you saying that the core problem of embedded agency is relating high-level agent models (represented as causal diagrams) to low-level physics models (also represented as causal diagrams)?

I wonder if you can extend it to also explain non-agentic approaches to Prosaic AI Alignment (and why some people prefer those).

I'm quite confused about what a non-agentic approach actually looks like, and I agree that extending this to give a proper account would be really interesting. A possible argument for actively avoiding 'agentic' models from this framework is:

  1. Models which generalize very competently also seem more likely to have malign failures, so we might want to avoid them.
  2. Suppose we believe that things which generalize very competently are likely to have agent-like internal architecture.
  3. Having a selection criterion or model-space/prior which actively pushes away from such agent-like architectures could then help push away from things which generalize too broadly.

I think my main problem with this argument is that step 3 might make step 2 invalid - it might be that if you actively punish agent-like architecture in your search then you will break the conditions that made 'too broad generalization' imply 'agent-like architecture', and thus end up with things that still generalize very broadly (with all the downsides of this) but just look a lot weirder.

This seems too optimistic/trusting. See Ontology identification problem, Modeling distant superintelligences, and more recently The “Commitment Races” problem.

Thanks for the links, I definitely agree that I was drastically oversimplifying this problem. I still think this task might be much simpler than the task of trying to understand the generalization of some strange model whose internal workings we don't even have a vocabulary to describe.