Takeaways from safety by default interviews

Last year, several researchers at AI Impacts (primarily Robert Long and I) interviewed prominent researchers inside and outside of the AI safety field who are relatively optimistic about advanced AI being developed safely. These interviews were originally intended to focus narrowly on reasons for optimism, but we ended up covering a variety of topics, including AGI timelines, the likelihood of current techniques leading to AGI, and what the right things to do in AI safety are right now. 

We talked to Ernest Davis, Paul Christiano, Rohin Shah, Adam Gleave, and Robin Hanson.

Here are some more general things I personally found noteworthy while conducting these interviews. For interview-specific summaries, check out our Interviews Page.

Relative optimism in AI often comes from the belief that AGI will be developed gradually, and problems will be fixed as they are found rather than neglected.

All of the researchers we talked to seemed to believe in non-discontinuous takeoff.1 Rohin gave ‘problems will likely be fixed as they come up’ as his primary reason for optimism,2 and Adam3 and Paul4 both mentioned it as a reason.

Relatedly, both Rohin5 and Paul6 said one thing that could update their views was gaining information about how institutions relevant to AI will handle AI safety problems, potentially by seeing them solve relevant problems, or by looking at historical examples.

I think this is a pretty big crux around the optimism view; my impression is that MIRI researchers generally think that 1) the development of human-level AI will likely be fast and potentially discontinuous and 2) people will be incentivized to hack around and redeploy AI when they encounter problems. See Likelihood of discontinuous progress around the development of AGI for more on 1). I think 2) could be a fruitful avenue for research; in particular, it might be interesting to look at recent examples of people in technology, particularly ML, correcting software issues, perhaps when doing so goes against their short-term profit incentives. Adam said he thought the AI research community wasn’t paying enough attention to building safe, reliable systems.7

Many of the arguments I heard around relative optimism weren’t based on inside-view technical arguments.

This isn’t that surprising in hindsight, but it seems interesting to me that though we interviewed largely technical researchers, a lot of their reasoning wasn’t based particularly on inside-view technical knowledge of the safety problems. See the interviews for more evidence of this, but here’s a small sample of the not-particularly-technical claims made by interviewees:

  • AI researchers are likely to stop and correct broken systems rather than hack around and redeploy them.8
  • AI has progressed and will continue to progress via a cumulation of lots of small things rather than via a sudden important insight.9
  • Many technical problems feel intractably hard in the way that AI safety feels now, and still get solved within ~10 years.10
  • Evolution baked very little into humans; babies learn almost everything from their experiences in the world.11

My instinct when thinking about AGI is to defer largely to safety researchers, but these reasons felt noteworthy to me in that they seemed like questions that were perhaps better answered by economists or sociologists (or, for the last claim, neuroscientists) than by safety researchers. I really appreciated Robin’s efforts to operationalize and analyze the second claim above.

(Of course, many of the claims were also more specific to machine learning and AI safety.)

There are lots of calls for individuals with views around AI risk to engage with each other and understand the reasoning behind fundamental disagreements.

This is especially true of MIRI’s views, which many optimistic researchers reported not having a good understanding of.

This isn’t particularly surprising, but there was a strong, universal, and unprompted theme that there wasn’t enough engagement around AI safety arguments. Adam and Rohin both said they had a much worse understanding than they would like of others’ viewpoints.12 Robin13 and Paul14 both pointed to some existing but meaningfully unfinished debate in the space.

By Asya Bergal

Notes

  1. Paul Christiano: https://sideways-view.com/2018/02/24/takeoff-speeds/

    Rohin Shah: “I don’t know, in a world where fast takeoff is true, lots of things are weird about the world, and I don’t really understand the world. So I’m like, “Shit, it’s quite likely something goes wrong.” I think the slow takeoff is definitely a crux. Also, we keep calling it slow takeoff and I want to emphasize that it’s not necessarily slow in calendar time. It’s more like gradual. … Yeah. And there’s no discontinuity between… you’re not like, “Here’s a 2X human AI,” and a couple of seconds later it’s now… Not a couple of seconds later, but like, “Yeah, we’ve got 2X AI,” for a few months and then suddenly someone deploys a 10,000X human AI. If that happened, I would also be pretty worried. It’s more like there’s a 2X human AI, then there’s like a 3X human AI and then a 4X human AI. Maybe this happens from the same AI getting better and learning more over time. Maybe it happens from it designing a new AI system that learns faster, but starts out lower and so then overtakes it sort of continuously, stuff like that.”
    — From Conversation with Rohin Shah

    Adam Gleave: “I don’t see much reason for AI progress to be discontinuous in particular. So there’s a lot of empirical records you could bring to bear on this, and it also seems like a lot of commercially valuable interesting research applications are going to require solving some of these problems. You’ve already seen this with value learning, that people are beginning to realize that there’s a limitation to what we can just write a reward function down for, and there’s been a lot more focus on imitation learning recently. Obviously people are solving much narrower versions of what the safety community cares about, but as AI progresses, they’re going to work on broader and broader versions of these problems.”
    — From Conversation with Adam Gleave

    Robin Hanson: “That argument was a very particular one, that this would appear under a certain trajectory, under a certain scenario. That was a scenario where it would happen really fast, would happen in a very concentrated place in time, and basically once it starts, it happens so fast, you can’t really do much about it after that point. So the only chance you have is before that point. … But I was doubting that scenario. I was saying that that wasn’t a zero probability scenario, but I was thinking it was overestimated by him and other people in that space. I still think many people overestimate the probability of that scenario. Over time, it seems like more people have distanced themselves from that scenario, yet I haven’t heard as many substitute rationales for why we should do any of this stuff early.”
    — From Conversation with Robin Hanson
  2. “The first one I had listed is that continual or gradual or slow takeoff, whatever you want to call it, allows you to correct the AI system online. And also it means that AI systems are likely to fail in not extinction-level ways before they fail in extinction-level ways, and presumably we will learn from that and not just hack around it and fix it and redeploy it.”
    — From Conversation with Rohin Shah
  3. “And then, I also have optimism that yes, the AI research community is going to try to solve these problems. It’s not like people are just completely disinterested in whether their systems cause harm, it’s just that right now, it seems to a lot of people very premature to work on this. There’s a sense of ‘how much good can we do now, where nearer to the time there’s going to just be naturally 100s of times more people working on the problem?’. I think there is still value you can do now, in laying the foundations of the field, but that maybe gives me a bit of a different perspective in terms of thinking, ‘What can we do that’s going to be useful to people in the future, who are going to be aware of this problem?’ versus ‘How can I solve all the problems now, and build a separate AI safety community?’.”
    — From Conversation with Adam Gleave

  4. “Before we get to resources or people, I think one of the basic questions is, there’s this perspective which is fairly common in ML, which is like, ‘We’re kind of just going to do a bunch of stuff, and it’ll probably work out’. That’s probably the basic thing to be getting at. How right is that?

    This is the bad view of safety conditioned on– I feel like prosaic AI is in some sense the worst– seems like about as bad as things would have gotten in terms of alignment. Where, I don’t know, you try a bunch of shit, just a ton of stuff, a ton of trial and error seems pretty bad. Anyway, this is a random aside maybe more related to the previous point. But yeah, this is just with alignment. There’s this view in ML that’s relatively common that’s like, we’ll try a bunch of stuff to get the AI to do what we want, it’ll probably work out. Some problems will come up. We’ll probably solve them. I think that’s probably the most important thing in the optimism vs pessimism side.”
    — From Conversation with Paul Christiano
  5. “I think I could imagine getting more information from either historical case studies of how people have dealt with new technologies, or analyses of how AI researchers currently think about things or deal with stuff, could change my mind about whether I think the AI community would by default handle problems that arise, which feels like an important crux between me and others.”
    — From Conversation with Rohin Shah
  6. “One can learn… I don’t know very much about any of the relevant institutions, I may know a little bit. So you can imagine easily learning a bunch about them by observing how well they solve analogous problems or learning about their structure, or just learning better about the views of people. That’s the second category.”
    — From Conversation with Paul Christiano
  7. “And then, if I look at the state of the art in AI, there’s a number of somewhat worrying trends. We seem to be quite good at getting very powerful superhuman systems in narrow domains when we can specify the objective that we want quite precisely. So AlphaStar, AlphaGo, OpenAI Five, these systems are very much lacking in robustness, so you have some quite surprising failure modes. Mostly we see adversarial examples in image classifiers, but some of these RL systems also have somewhat surprising failure modes. This seems to me like an area the AI research community isn’t paying much attention to, and I feel like it’s almost gotten obsessed with producing flashy results rather than necessarily doing good rigorous science and engineering. That seems like quite a worrying trend if you extrapolate it out, because some other engineering disciplines are much more focused on building reliable systems, so I more trust them to get that right by default.

    Even in something like aeronautical engineering where safety standards are very high, there are still accidents in initial systems. But because we don’t even have that focus, it doesn’t seem like the AI research community is going to put that much focus on building safe, reliable systems until they’re facing really strong external or commercial pressures to do so. Autonomous vehicles do have a reasonably good safety track record, but that’s somewhere where it’s very obvious what the risks are. So that’s kinda the sociological argument, I guess, for why I don’t think that the AI research community is going to solve all of the safety problems as far ahead of time as I would like.”
    — From Conversation with Adam Gleave

  8. See footnotes 2 – 4 above.
  9. “I think a lot of the case for the discontinuous progress argument falls on there being sudden insight that dropped into place, and it doesn’t seem to me like that’s what’s happening in AI, it’s more just a cumulation of lots of bags of tricks and a lot of compute. I also don’t see there being bags of compute falling out of the sky. Maybe if there was another AI winter, leading to a hardware overhang, then you might see sudden progress when AI gets funding again. But right now a lot of very well-funded organizations are spending billions of dollars on compute, including developing new application-specific integrated circuits for AI, so we’re going to be very close to the physical limits there anyway.”
    — From Conversation with Adam Gleave

    “A key idea here would be we’re getting AI progress over time, and how lumpy it is, is extremely directly relevant to these estimates.

    For example, if it was maximally lumpy, if it just shows up at one point, like the Foom scenario, then in that scenario, you kind of have to work ahead of time because you’re not sure when. There’s a substantial… if like, the mean is two centuries, but that means in every year there’s a 1-in-200 chance. There’s a half-a-percent chance next year. Half-a-percent is pretty high, I guess we better do something, because what if it happens next year?

    Okay. I mean, that’s where extreme lumpiness goes. The less lumpy it is, then the more that the variance around that mean is less. It’s just going to take a long time, and it’ll take 10% less or 10% more, but it’s basically going to take that long. The key question is how lumpy is it reasonable to expect these sorts of things. I would say, “Well, let’s look at how lumpy things have been. How lumpy are most things? Even how lumpy has computer science innovation been? Or even AI innovation?”

    I think those are all relevant data sets. There’s general lumpiness in everything, and lumpiness of the kinds of innovation that are closest to the kinds of innovation postulated here. I note that one of our best or most concrete measures we have of lumpiness is citations. That is, we can take for any research idea, how many citations the seminal paper produces, and we say, “How lumpy are citations?”

    Interestingly, citation lumpiness seems to be field independent. Not just time independent, but field independent. Seems to be a general feature of academia, which you might have thought lumpiness would vary by field, and maybe it does in some more fundamental sense, but as it’s translated into citations, it’s field independent. And of course, it’s not that lumpy, i.e. most of the distribution of citations is papers with few citations, and the few papers that have the most citations constitute a relatively small fraction of the total citations.

    That’s what we also know for other kinds of innovation literature. The generic innovation literature says that most innovation is lots of little things, even though once in a while there are a few bigger things. For example, I remember there’s this time series of the best locomotive at any one time. You have that from 1800 or something. You can just see in speed, or energy efficiency, and you see this point—.

    It’s not an exactly smooth graph. On the other hand, it’s pretty smooth. The biggest jumps are a small fraction of the total jumpiness. A lot of technical, social innovation is, as we well understand, a few big things, matched with lots of small things. Of course, we also understand that big ideas, big fundamental insights, usually require lots of complementary, matching, small insights to make it work.”
    — From Conversation with Robin Hanson
  10. “The basic argument would be like, 1) On paper I don’t think we yet have a good reason to feel doomy. And I think there’s some basic research intuition about how much a problem– suppose you poke at a problem a few times, and you’re like ‘Agh, seems hard to make progress’. How much do you infer that the problem’s really hard? And I’m like, not much. As a person who’s poked at a bunch of problems, let me tell you, that often doesn’t work and then you solve in like 10 years of effort. …

    Like most of the time, if I’m like, ‘here’s an algorithms problem’, you can like– if you just generate some random algorithms problems, a lot of them are going to be impossible. Then amongst the ones that are possible, a lot of them are going to be soluble in a year of effort and amongst the rest, a lot of them are going to be soluble in 10 or a hundred years of effort. It’s just kind of rare that you find a problem that’s soluble– by soluble, I don’t just mean soluble by human civilization, I mean like, they are not provably impossible– that takes a huge amount of effort.

    It normally… it’s less likely to happen the cleaner the problem is. There just aren’t many very clean algorithmic problems where our society worked on it for 10 years and then we’re like, ‘Oh geez, this still seems really hard.’”
    — From Conversation with Paul Christiano
  11. “People would claim that babies have lots of inductive biases, I don’t know that I buy it. It seems like you can learn a lot with a month of just looking at the world and exploring it, especially when you get way more data than current AI systems get. For one thing, you can just move around in the world and notice that it’s three dimensional.

    Another thing is you can actually interact with stuff and see what the response is. So you can get causal intervention data, and that’s probably where causality becomes such an ingrained part of us. So I could imagine that these things that we see as core to human reasoning, things like having a notion of causality or having a notion, I think apparently we’re also supposed to have as babies an intuition about statistics and like counterfactuals and pragmatics.

    But all of these are done with brains that have been in the world for a long time, relatively speaking, relative to AI systems. I’m not actually sure if I buy that this is because we have really good priors.”
    — From Conversation with Rohin Shah
  12. “I guess, I don’t know if this is really useful, but I do wish I had a better sense of what other people in the safety community and outside of it actually thought and why they were working on it, so I really appreciate you guys doing these interviews because it’s useful to me as well. I am generally a bit concerned about lots of people coming to lots of different conclusions regarding how pessimistic we should be, regarding timelines, regarding the right research agenda. 

    I think disagreement can be healthy because it’s good to explore different areas. The ideal thing would be for us to all converge to some common probability distribution and we decide we’re going to work on different areas. But it’s very hard psychologically to do this, to say, ‘okay, I’m going to be the person working on this area that I think isn’t very promising because at the margin it’s good’– people don’t work like that. It’s better if people think, ‘oh, I am working on the best thing, under my beliefs’. So having some diversity of beliefs is good. But it bothers me that I don’t know why people have come to different conclusions to me. If I understood why they disagree, I’d be happier at least.”
    — From Conversation with Adam Gleave

    “Slow takeoff versus fast takeoff…. I feel like MIRI still apparently believes in fast takeoff. I don’t have a clear picture of these reasons, I expect those reasons would move me towards fast takeoff. … Yeah, there’s a lot of just like.. MIRI could say their reasons for believing things and that would probably cause me to update. Actually, I have enough disagreements with MIRI that they may not update me, but it could in theory update me.”
    — From Conversation with Rohin Shah

  13. “My experience is that I’ve just written on this periodically over the years, but I get very little engagement. Seems to me there’s just a lack of a conversation here. Early on, Eliezer Yudkowsky and I were debating, and then as soon as he and other people just got funding and recognition from other people to pursue, then they just stopped engaging critics and went off on pursuing their stuff.

    Which makes some sense, but these criticisms have just been sitting and waiting. Of course, what happens periodically is they are most eager to engage the highest status people who criticize them. So periodically over the years, some high-status person will make a quip, not very thought out, at some conference panel or whatever, and they’ll be all over responding to that, and sending this guy messages and recruiting people to talk to him saying, “Hey, you don’t understand. There’s all these complications.”

    Which is different from engaging the people who are the longest, most thoughtful critics. There’s not so much of that going on. You are perhaps serving as an intermediary here. But ideally, what you do would lead to an actual conversation. And maybe you should apply for funding to have an actual event where people come together and talk to each other. Your thing could be a preliminary to get them to explain how they’ve been misunderstood, or why your summary missed something; that’s fine. If it could just be the thing that started that actual conversation it could be well worth the trouble.”
    — From Conversation with Robin Hanson
  14. “And I don’t know, I mean this has been a project that like, it’s a hard project. I think the current state of affairs is like, the MIRI folk have strong intuitions about things being hard. Essentially no one in… very few people in ML agree with those, or even understand where they’re coming from. And even people in the EA community who have tried a bunch to understand where they’re coming from mostly don’t. Mostly people either end up understanding one side or the other and don’t really feel like they’re able to connect everything. So it’s an intimidating project in that sense. I think the MIRI people are the main proponents of the everything is doomed, the people to talk to on that side. And then in some sense there’s a lot of people on the other side who you can talk to, and the question is just, who can articulate the view most clearly? Or who has most engaged with the MIRI view such that they can speak to it?”
    — From Conversation with Paul Christiano