Accuracy of AI Predictions

It is unclear how informative we should expect expert predictions about AI timelines to be. Individual predictions are undoubtedly often off by many decades, since they disagree with each other. However, their aggregate may still be quite informative. The main potential reason we know of to doubt the accuracy of expert predictions is that experts are generally poor predictors in many areas, and AI looks likely to be one of them. However, we have not investigated just how inaccurate 'poor' predictions tend to be, or whether AI really is such a case.

Predictions of AI timelines are likely to be biased toward optimism by roughly decades, especially if they are voluntary statements rather than surveys, and especially if they are from populations selected for optimism. We expect these factors to account for less than a decade and around two decades of difference in median predictions, respectively.

Support

Considerations regarding accuracy

A number of reasons have been suggested for distrusting predictions about AI timelines:

  • Models of areas where people predict well
    Research has produced a characterization of situations where experts predict well and where they do not. See table 1 here. AI appears to fall into several classes that go with worse predictions. However we have not investigated this evidence in depth, or the extent to which these factors purportedly influence prediction quality.
  • Expert predictions are generally poor
    Experts are notoriously poor predictors. However, our impression is that this is because of their disappointing inability to predict some things well, rather than failure across the board. For instance, experts can predict the Higgs boson's existence, outcomes of chemical reactions, and astronomical phenomena. So the question falls back to where AI falls in the spectrum of expert predictability, discussed in the last point.
  • Disparate predictions
    One sign that AI predictions are not very accurate is that they differ over a range of a century or so. This strongly suggests that many individual predictions are inaccurate, though not that the aggregate distribution is uninformative.
  • Similarity of old and new predictions
    Older predictions seem to form a fairly similar distribution to more recent predictions, except for very old predictions. This is weak evidence that new predictions are not strongly affected by evidence, and are therefore more likely to be inaccurate.
  • Similarity of expert and lay opinions
    Armstrong and Sotala found that expert and non-expert predictions look very similar.1 This finding is in doubt at the time of writing, due to errors in the analysis. If it were true, this would be weak evidence against experts having relevant expertise, since if they did, this might cause a difference with the opinions of lay-people. Note that it might not, however, if laypeople get their information from experts.
  • Predictions are about different things and often misinterpreted
    Comments made around predictions of human-level AI suggest that predictors are sometimes thinking about different events as ‘AI arriving’.2 Even when they are predictions about the same event, ‘prediction’ can mean different things. One person might ‘predict’ the year when they think human-level AI is more likely than not, while another ‘predicts’ the year that AI seems almost certain.

This list is not necessarily complete.

Purported biases

A number of biases have been posited to affect predictions of human-level AI:

  • Selection biases from optimistic experts
    Becoming an expert is probably correlated with independent optimism about the field, and experts make most of the credible predictions. We expect this to push median estimates earlier by less than a few decades.
  • Biases from short-term predictions being recorded
    There are a few reasons to expect recorded public predictions to be biased toward shorter timescales. Overall these probably make public statements less than a decade more optimistic.
  • Maes-Garreau law
    The Maes-Garreau law is a posited tendency for people to predict important technologies not long before their own likely death. It probably doesn’t afflict predictions of human-level AI substantially.
  • Fixed period bias
    There is a stereotype that people tend to predict AI in 20-30 years. There is weak evidence of such a tendency around 20 years, though little evidence that this is due to a bias (that we know of).

Conclusions

AI appears to exhibit several qualities characteristic of areas that people are not good at predicting. Individual AI predictions appear to be inaccurate by many decades, by virtue of their disagreement. Other grounds for particularly distrusting AI predictions seem to offer weak evidence against them, if any. Our current guess is that AI predictions are less reliable than many kinds of prediction, though still potentially fairly informative.

Biases toward early estimates appear to exist, as a result of optimistic people becoming experts, and optimistic predictions being more likely to be published for various reasons. These are the only plausible substantial biases we know of.

Publication biases toward shorter predictions

We expect predictions that human-level AI will come sooner to be recorded publicly more often, for a few reasons. Public statements are probably more optimistic than surveys because of such effects. The difference appears to be less than a decade, for median predictions.

Support

Plausible biases

Below we outline five reasons for expecting earlier predictions to be stated and publicized more than later ones. We do not know of compelling reasons to expect longer term predictions to be publicized more, unless they are so distant as to also fit under the first bias discussed below.

Bias from not stating the obvious

In many circumstances, people are disproportionately likely to state beliefs that they think others do not hold. For example, “homeopathy works” gets more Google hits than “homeopathy doesn’t work”, though this probably doesn’t reflect popular beliefs on the matter. Making public predictions seems likely to be a circumstance with this character. Predictions are often made in books and articles which are intended to be interesting and surprising, rather than by people whose job it is to report on AI forecasts regardless of how far away they are. Thus we expect people with unusual positions on AI timelines to be more likely to state them. This should produce a bias toward both very short and very long predictions being published.

Bias from the near future being more concerning

Artificial intelligence will arguably be hugely important, whether as a positive or negative influence on the world. Consequently, people are motivated to talk about its social implications. The degree of concern motivated by impending events tends to increase sharply with proximity to the event. Thus people who expect human-level AI in a decade will tend to be more concerned about it than people who expect human-level AI to take a century, and so will talk about it more. Similarly, publishers are probably more interested in producing books and articles making more concerning claims.

Bias from ignoring reverse predictions

If you search for people predicting AI by a given date, you can get downwardly biased estimates by taking predictions from sources where people are asked about certain specific dates, and respond that AI will or will not have arrived by that date. If people respond ‘AI will arrive by X’ and ‘AI will not arrive by X’ as appropriate, the former can look like ‘predictions’ while the latter do not.

This bias affected some data in the MIRI dataset, though we have tried to minimize it now. For example, this bet (“By 2029 no computer – or “machine intelligence” – will have passed the Turing Test.”) is interpreted in the above collection as Kurzweil making a prediction, but not as Kapor making a prediction. It also contained several estimates of 70 years, taken from a group who appear to have been asked whether AI would come within 70 years, much later, or never. The ‘within 70 years’ estimates are recorded as predictions, while the others are ignored, producing ’70 years’ estimates almost regardless of the overall opinions of the group surveyed. In a population of people with a range of beliefs, this method of recording predictions would produce ‘predictions’ largely determined by which year was asked about.

Bias from unavoidably ignoring reverse predictions

The aforementioned bias arises from an error that can be avoided in recording data, where predictions and reverse predictions are available. However similar types of bias may exist more subtly. Such bias could arise where people informally volunteer opinions in a discussion about some period in the future. People with shorter estimates who can make a positive statement might feel more as though they have something to say, while those who believe there will not be AI at that time do not. For instance, suppose ten people write books about the year 2050, and each predicts AI in a different decade in the 21st Century. Those who predict it prior to 2050 will mention it, and be registered as a prediction of before 2050. Those who predict it after 2050 will not mention it, and not be registered as making a prediction. This could also be hard to avoid if predictions reach you through a filter of others registering them as predictions.

Selection bias from optimistic experts

Main article: Selection bias from optimistic experts

Some factors that cause people to make predictions about AI are likely to correlate with expectations of human-level AI arriving sooner. Experts are better positioned to make credible predictions about their field of expertise than more distant observers are. However since people are more likely to join a field if they are more optimistic about progress there, we might expect their testimony to be biased toward optimism.

Measuring these biases

These forms of bias (except the last) seem to us likely to be much weaker in survey data than in voluntary statements, for the following reasons:

  • Surveys come with a default of answering questions, so one does not need a strong reason or social justification for doing so (e.g. having a surprising claim, or wanting to elicit concern).
  • One can assess whether a survey ignores reverse predictions, and there appears to be little risk of invisible reverse predictions.
  • Participation in surveys is mostly determined before the questions are viewed, for a large number of questions at once. This allows less opportunity for views on the question to affect participation.
  • Participation in surveys is relatively cheap, so people who care little about expressing any particular view are likely to participate for reasons of orthogonal incentives, whereas costly communications (such as writing a book) are likely to be sensible only for those with a strong interest in promoting a specific message.
  • Participation in surveys is usually anonymous, so relatively unsatisfactory for people who particularly want to associate with a specific view, further aligning the incentives of those who want to communicate with those who don’t care.
  • Much larger fractions of people participate in surveys when requested than volunteer predictions in highly publicized arenas, which lessens the possibility for selection bias.

We think publication biases such as those described here are reasonably likely on theoretical grounds. We are also not aware of other reasons to expect surveys and statements to differ in their optimism about AI timelines. Thus we can compare the predictions of statements and surveys to estimate the size of these biases. Survey data appears to produce median predictions of human-level AI somewhat later than similar public statements do: less than a decade, at a very rough estimate. Thus we think some combination of these biases probably exist, and introduce less than a decade of error to median estimates.

Implications

Accuracy of AI predictions: AI predictions made in statements are probably biased toward being early, by less than a decade. This suggests both that predictions overall are probably slightly earlier than they would otherwise be, and that surveys should be trusted more relative to statements (though there may be other considerations there).
Collecting data: When collecting data about AI predictions, it is important to avoid introducing bias by recording opinions that AI will arrive before some date while ignoring opinions that it will arrive after that date.
MIRI dataset: The earlier version of the MIRI dataset is somewhat biased due to ignoring reverse predictions; however, this has been at least partially resolved.

Selection bias from optimistic experts

Experts on AI probably systematically underestimate time to human-level AI, due to a selection bias. The same is more strongly true of AGI experts. The scale of such biases appears to be decades. Most public AI predictions are from AI and AGI researchers, so this bias is relevant to interpreting these predictions.

Details

Why we expect bias

We can model a person’s views on AI timelines as being influenced both by their knowledge of AI and other somewhat independent factors, such as their general optimism and their understanding of technological history. People who are initially more optimistic about progress in AI seem more likely to enter the field of AI than those who are less so. Thus we might expect experts in AI to be selected for being optimistic, for reasons independent of their expertise. Similarly, AI researchers presumably enter the subfield of AGI more if they are optimistic about human-level intelligence being feasible soon.

This means expert predictions should tend to be more optimistic than they would if they were made by random people who became well informed, and thus are probably overall too optimistic (setting aside any other biases we haven’t considered).

This reason to expect bias only applies to the extent that predictions are made based on personal judgments, rather than explicit procedures that can be verified to avoid such biases. However predictions in AI appear to be very dependent on such judgments. Thus we expect some bias toward earlier predictions from AI experts, and more so from AGI experts. How large such biases might be is unclear however.

Empirical evidence for bias

Analysis of the MIRI dataset supports a selection bias existing. Median people working in AGI are around two decades more optimistic than median AI researchers from outside AGI. Those in AI are more optimistic again than ‘others’, and futurists are slightly more optimistic than even AGI researchers, though these are less clear due to small and ambiguous samples. In sum, the groups do make different predictions in the directions that we would expect as a result of such bias.

However it is hard to exclude expertise as an explanation for these differences, so this does not strongly imply that there are biases. There could also be biases that are not caused by selection effects, such as wishful thinking, planning fallacy, or self-serving bias. There may also be other plausible explanations we haven’t considered.

Since there are several plausible reasons for the differences we see here, and few salient reasons to expect effects in the opposite direction (expertise could go either way), the size of the selection biases in question is probably at most as large as the gaps between the predictions of the groups. That is, roughly two decades between AI and AGI researchers, and another several decades between AI researchers and others. Part of this span should be a bias of the remaining group toward being too pessimistic, but in both cases the remaining groups are much larger than the selected group, so most of the bias should be in the selected group.

Effects of group biases on predictions

People being selected into groups such as ‘AGI researchers’ based on their optimism does not in itself introduce a bias. The problem arises when people from different groups start making different numbers of predictions. In practice, they do. Among the predictions we know of, most are from AI researchers, and a large fraction of those are from AGI researchers. Of surveys we have recorded, 80% target AI or AGI researchers, and around half of them target AGI researchers in particular. Statements in the MIRI dataset since 2000 include 13 from AGI researchers, 16 from AI researchers, 6 from futurists, and 6 from others. This suggests we should expect aggregated predictions from surveys and statements to be optimistic, by roughly decades.

Conclusions

It seems likely that AI and AGI researchers’ predictions exhibit a selection bias toward being early, based on reason to expect such a bias, the large disparity between AI and AGI researchers’ predictions (while AI researchers seem likely to be optimistic if anything), and the consistency between the distributions we see and those we would expect under the selection bias explanation for disagreement. Since AI and AGI researchers are heavily represented in prediction data, predictions are likely to be biased toward optimism, by roughly decades.

 

Relevance

Accuracy of AI predictions: many AI timeline predictions come from AI researchers and AGI researchers, and people interested in futurism. If we want to use these predictions to estimate AI timelines, it is valuable to know how biased they are, so we can correct for such biases.

Detecting relevant expertise: if the difference between AI and AGI researcher predictions is not due to bias, then it suggests one group had additional information. Such information would be worth investigating.

Group Differences in AI Predictions

AGI researchers appear to expect human-level AI substantially sooner than other AI researchers. The difference ranges from about five years to at least about sixty years as we move from highest percentiles of optimism to the lowest. Futurists appear to be around as optimistic as AGI researchers. Other people appear to be substantially more pessimistic than AI researchers.

Details

MIRI dataset

We categorized predictors in the MIRI dataset as AI researchers, AGI researchers, Futurists and Other. We also interpreted their statements into a common format, roughly corresponding to the first year in which the person appeared to be suggesting that human-level AI was more likely than not (see ‘minPY’ described here).

Recent (since 2000) predictions are shown in the figure below. Those made by people from the subfield of AGI tend to be decades more optimistic than those at the same percentile of optimism in AI. The difference ranges from about five years to at least about sixty years as we move from highest percentiles of optimism to the lowest. Those who work in AI tend to be at least a decade more optimistic than ‘others’, at any percentile of optimism within their group. Futurists are about as optimistic as AGI researchers.

Note that these predictions were made over a period of at least 12 years, rather than at the same time.


Figure 1: Cumulative probability of AI being predicted (minPY), for various groups, for predictions made after 2000. See here.

Median predictions are shown below (these are also minPY predictions as defined on the MIRI dataset page, calculated from ‘cumulative distributions’ sheet in updated dataset spreadsheet also available there).

Median AI predictions | AGI | AI | Futurist | Other | All
Early (pre-2000) (warning: noisy) | – | 1988 | 2031 | 2036 | 2025
Late (since 2000) | 2033 | 2051 | 2031 | 2101 | 2042

FHI survey data

The FHI survey results suggest that people’s views are not very different if they work in computer science or other parts of academia. We have not investigated this evidence in more detail.

Implications

Biases from optimistic predictors and information asymmetries: Differences of opinion among groups who predict AI suggest that either some groups have more information, or that biases exist in some of the groups. Either of these is valuable to know about, so that we can either look into the additional information, or try to correct for the biases.

The Maes-Garreau Law

The Maes-Garreau law posits that people tend to predict exciting future technologies toward the end of their lifetimes. It probably does not hold for predictions of human-level AI.

Clarification

From Wikipedia:

The Maes–Garreau law is the statement that “most favorable predictions about future technology will fall within the Maes–Garreau point”, defined as “the latest possible date a prediction can come true and still remain in the lifetime of the person making it”. Specifically, it relates to predictions of a technological singularity or other radical future technologies.

The law was posited by Kevin Kelly, here.

Evidence

In the MIRI dataset, age and predicted time to AI are very weakly anti-correlated, with a correlation of -0.017. That is, older people expect AI very slightly sooner than others. This suggests that if the Maes-Garreau law applies to human-level AI predictions, it is very weak, or is being masked by some other effect. Armstrong and Sotala also interpret an earlier version of the same dataset as evidence against the Maes-Garreau law substantially applying, using a different method of analysis.

Earlier, smaller, informal analyses find evidence of the law, but in different settings. According to Rodney Brooks (according to Kevin Kelly), Pattie Maes observed this effect strongly in a survey of public predictions of human uploading:

[Maes] took as many people as she could find who had publicly predicted downloading of consciousness into silicon, and plotted the dates of their predictions, along with when they themselves would turn seventy years old. Not too surprisingly, the years matched up for each of them. Three score and ten years from their individual births, technology would be ripe for them to download their consciousnesses into a computer. Just in the nick of time! They were each, in their own minds, going to be remarkably lucky, to be in just the right place at the right time.

However, according to Kelly, the data was not kept.

Kelly did another small search for predictions of the singularity, which appears to only support a very weakened version of the law: many people predict AI within their lifetime.

The hypothesized reason for this relationship is that people would like to believe they will personally avoid death. If this is true, we might expect the relation to apply much more strongly to predictions of events which might fairly directly save a person from death. Human uploading and the singularity are such events, while human-level AI does not appear to be. Thus it is plausible that this law does apply to some technological predictions, but not human-level AI.

Implications

Evidence about wishful thinking: the Maes-Garreau law is a relatively easy to check instance of a larger class of hypotheses to do with AI predictions being directed by wishful thinking. If wishful thinking were a large factor in AI predictions, this would undermine accuracy because it is not related to when human-level AI will appear. That the Maes-Garreau law doesn’t seem to hold is evidence against wishful thinking being a strong determinant of AI predictions. Further evidence might be obtained by observing the correlation between belief that human-level AI will be positive for society and belief that it will come soon.

AI Timeline predictions in surveys and statements

Surveys seem to produce median estimates of time to human-level AI which are on the order of a decade later than those produced from voluntary public statements.

Details

We compared several surveys to predictions made by similar groups of people in the MIRI AI predictions dataset, and found that surveys tend to give estimates on the order of a decade later.

Stuart Armstrong and Kaj Sotala make another such comparison here, finding that the survey data is more pessimistic than other data (in terms of delay to AI from the time of prediction), as predicted by the relevant biases. Note that the non-survey data they compare with is from across time, whereas the survey data is from 1973, so the expectation for them to look the same is weaker. However in the MIRI dataset, very early predictions tend to be more optimistic than later predictions, if anything, making this difference all the more surprising.

Relevance

Accuracy of AI predictions: some biases which probably exist in public statements about AI predictions are likely to be smaller or absent in survey data. For instance, public statements are probably more likely to be made by people who believe they have surprising or interesting views, whereas this should have much less influence on answers to a survey question once someone is taking a survey. Thus comparing data from surveys and voluntary statements can tell us about the strength of such biases. Given that median survey predictions are rarely more than a decade later than similar statements, and survey predictions seem unlikely to be strongly biased in this way, median statements are probably less than a decade early as a result of this bias.

MIRI AI Predictions Dataset

The MIRI AI predictions dataset is a collection of public predictions about human-level AI timelines. We edited the original dataset, as described below. Our dataset is available here, and the original here.

Interesting features of the dataset include:

  • The median dates at which people’s predictions suggest AI is less likely than not and more likely than not are 2033 and 2037 respectively.
  • Predictions made before 2000 and after 2000 are distributed similarly, in terms of time remaining when the prediction is made.
  • Six predictions made before 1980 appear to be systematically more optimistic than predictions made later.
  • AGI researchers appear to be more optimistic than AI researchers.
  • People predicting AI in public statements (in the MIRI dataset) predict earlier dates than demographically similar survey takers do.
  • Age and predicted time to AI are almost entirely uncorrelated: r = -.017.

Details

History of the dataset

We got the original MIRI dataset from here. According to the accompanying post, the Machine Intelligence Research Institute (MIRI) commissioned Jonathan Wang and Brian Potter to gather the data. Kaj Sotala and Stuart Armstrong analyzed and categorized it (their categories are available in both versions of the dataset). It was used in the papers Armstrong and Sotala 2012 and Armstrong and Sotala 2014. We modified the dataset, as described below. Our version is here.

Our changes to the dataset

These are changes we made to the dataset:

  • There were a few instances of summary results from large surveys included as single predictions – we removed these because survey medians and individual public predictions seem to us sufficiently different to warrant considering separately.
  • We removed entries which appeared to be duplications of the same data, from different sources.
  • We removed predictions made by the same individual within less than ten years of one another.
  • We removed some data which appeared to have been collected in a biased fashion, where we could not correct the bias.
  • We removed some entries that did not seem to be predictions about general artificial intelligence.
  • We may have removed some entries for other similar reasons.
  • We added some predictions we knew of which were not in the data.
  • We fixed some small typographic errors.

Deleted entries can be seen in the last sheet of our version of the dataset. Most have explanations in one of the last few columns.

We continue to change the dataset as we find predictions it is missing, or errors in it. The current dataset may not exactly match the descriptions on this page.

How did our changes matter?

Implications of the above changes:

  • The dataset originally had 95 predictions; our version has 65 at last count.
  • Armstrong and Sotala transformed each statement into a ‘median’ prediction. In the original dataset, the mean ‘median’ was 2040 and the median ‘median’ 2030. After our changes, the mean ‘median’ is 2046 and the median ‘median’ remains at 2030. The means are highly influenced by extreme outliers.
  • We have not evaluated Armstrong and Sotala’s findings in the updated dataset. One reason is that their findings are mostly qualitative. For instance, it is a matter of judgment whether there is still ‘a visible difference’ between expert and non-expert performance. Our judgment may differ from those authors anyway, so it would be unclear whether the change in data changed their findings. We address some of the same questions by different methods.

minPY and maxIY predictions

People say many slightly different things about when human-level AI will arrive. We interpreted predictions into a common format: one or both of a claim about when human-level AI would be less likely than not, and a claim about when human-level AI would be more likely than not. Most people didn’t explicitly use such language, so we interpreted things roughly, as closely as we could. For instance, if someone said ‘AI will not be here by 2080’ we would interpret this as AI being less likely to exist than not by that date.

Throughout this page, we use ‘minimum probable year’ (minPY) to refer to the minimum time when a person is interpreted as stating that AI is more likely than not. We use ‘maximum improbable year’ (maxIY) to refer to the maximum time when a person is interpreted as stating that AI is less likely than not. To be clear, these are not necessarily the earliest and latest times that a person holds the requisite belief – just the earliest and latest times implied by their statement. For instance, if a person says ‘I disagree that we will have human-level AI in 2050’, then we interpret this as a maxIY prediction of 2050, though they may well also believe AI is less likely than not in 2065. We would not interpret this statement as implying any minPY. We interpreted predictions like ‘AI will arrive in about 2045’ as 2045 being the date at which AI would become more likely than not, so both the minPY and the maxIY are 2045.
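To make this rule concrete, the following is a minimal sketch in Python of the interpretation just described. The statement categories and the interpret function are our own illustrative simplifications, not part of the dataset's methodology.

```python
# Rough sketch of the minPY/maxIY interpretation rule described above.
# The statement encodings ("by", "not by", "in about") are hypothetical
# simplifications; real statements required case-by-case judgment.

def interpret(kind, year):
    """Return (minPY, maxIY) implied by one statement; None means the
    statement implies nothing about that bound."""
    if kind == "by":          # "AI is more likely than not by <year>"
        return year, None
    if kind == "not by":      # "AI is less likely than not by <year>"
        return None, year
    if kind == "in about":    # "AI will arrive in about <year>"
        return year, year     # read as the date AI becomes more likely than not
    raise ValueError(kind)

# Examples from the text:
print(interpret("not by", 2050))    # "I disagree that we will have AI in 2050" -> (None, 2050)
print(interpret("in about", 2045))  # "AI will arrive in about 2045" -> (2045, 2045)
```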

This is different from the ‘median’ interpretation Armstrong and Sotala provided, which is not necessarily to disagree with their measure: as Armstrong points out, it is useful to have independent interpretations of the predictions. Both our measure and theirs could mislead in different circumstances. People who say ‘AI will come in about 100 years’ and ‘AI will come within about 100 years’ probably don’t mean to point to estimates 50 years apart (as they might be seen to in Armstrong and Sotala’s measure). On the other hand, if a person says ‘AI will obviously exist before 3000AD’ we will record it as ‘AI is more likely than not from 3000AD’, and it may be easy to forget that in context this was far from the earliest date at which they thought AI was more likely than not.

 | Original A&S ‘median’ | Updated A&S ‘median’ | minPY | maxIY
Mean | 2040 | 2046 | 2067 | 2067
Median | 2030 | 2030 | 2037 | 2033

Table 1: Summary of mean and median AI predictions under different interpretations

As shown in Table 1, our median dates are a few years later than Armstrong & Sotala’s original or updated dates, and only four years from one another.

Categories used in our analysis

Timing

‘Early’ throughout refers to before 2000. ‘Late’ refers to 2000 onwards. We split the predictions in this way because often we are interested in recent predictions, and 2000 is a relatively natural recent cutoff. We chose this date without conscious attention to the data beyond the fact that there have been plenty of predictions since 2000.

Expertise

We categorized people as ‘AGI’, ‘AI’, ‘futurist’ and ‘other’ as best we could, according to their apparent research areas and activities. These are ambiguous categories, but the ends to which we put such categorization do not require that they be very precise.

Findings

Basic statistics

The median minPY is 2037 and median maxIY is 2033 (see  ‘Basic statistics’ sheet). The mean minPY is 2067, which is the same as the mean maxIY (see ‘Basic statistics’ sheet). These means are fairly meaningless, as they are influenced greatly by a few extreme outliers. Figure 1 shows the distribution of most of the predictions.


Figure 1: minPY (‘AI after’) and maxIY (‘No AI till’) predictions (from ‘Basic statistics’ sheet)

The following figures show the fraction of predictors over time who claimed that human-level AI is more likely to have arrived by that time than not (i.e. minPY predictions). The first is for all predictions, and the second for predictions since 2000. The first graph is hard to meaningfully interpret, because the predictions were made in very different volumes at very different times. For instance, the small bump on the left is from a small number of early predictions. However it gives a rough picture of the data.


Figure 2: Fraction of all minPY predictions which say AI will have arrived, over time (From ‘Cumulative distributions’ sheet).


Figure 3: Fraction of late minPY predictions (made since 2000) which say AI will have arrived, over time (From ‘Cumulative distributions’ sheet).

Remember that these are dates from which people claimed something like AI being more likely than not. Such dates are influenced not only by what people believe, but also by what they are asked. If a person believes that AI is more likely than not by 2020, and they are asked ‘will there be AI in 2060’ they will respond ‘yes’ and this will be recorded as a prediction of AI being more likely than not after 2060. The graph is thus an upper bound for when people predict AI is more likely than not. That is, the graph of when people really predict AI with 50 percent confidence keeps somewhere to the left of the one in figures 2 and 3.

Similarity of predictions over time

In general, early and late predictions are distributed fairly similarly over the years following them. For minPY predictions, the correlation between the date of a prediction and number of years until AI is predicted from that time is 0.13 (see ‘Basic statistics’ sheet). Figure 5 shows the cumulative probability of AI being predicted over time, by late and early predictors. At a glance, they are surprisingly similar. The largest difference between the fraction of early and of late people who predict AI by any given distance in the future is about 15% (see ‘Predictions over time 2’ sheet). A difference this large is fairly likely by chance. However most of the predictions were made within twenty years of one another, so it is not surprising if they are similar.

The six very early predictions do seem to be unusually optimistic. They are all below the median of 30 years, which would have about a 1.6% probability (0.5^6) of occurring by chance.

Figures 4-7 illustrate the same data in different formats.


Figure 4: Time left until minPY predictions, by date when they were made. (From ‘Basic statistics’ sheet)

Figure 5: Cumulative probability of AI being predicted (minPY) different distances out for early and late predictors (From ‘Predictions over time 2’ sheet)


Figure 6: Fraction of minPY predictions at different distances in the future, for early and late predictors (From ‘Predictions over time’ sheet)


Figure 7: Cumulative probability of AI being predicted by a given date, for early and late predictors (minPY). (From ‘Cumulative distributions’ sheet)

Groups of participants

Associations with expertise and enthusiasm
Summary

AGI people in this dataset are generally substantially more optimistic than AI people. Among the small number of futurists and others, futurists were optimistic about timing, and others were pessimistic.

Details

We classified the predictors as AGI researchers, (other) AI researchers, Futurists and Other, and calculated CDFs of their minPY  predictions, both for early and late predictors. The figures below show a selection of these. Recall that ‘early’ and ‘late’ correspond to before and after 2000.

As we can see in figure 8, Late AGI predictors are substantially more optimistic than late AI predictors: for almost any date this century, at least 20% more AGI people predict AI by then. The median late AI researcher minPY is 18 years later than the median AGI researcher minPY. We haven’t checked whether this is partly caused by predictions by AGI researchers having been made earlier.

There were only 6 late futurists, and 6 late ‘other’ (compared to 13 and 16 late AGI and late AI respectively), so the data for these groups is fairly noisy. Roughly, late futurists in the sample were more optimistic than anyone, while late ‘other’ were more pessimistic than anyone.

There were no early AGI people, and only three early ‘other’. Among seven early AI and eight early futurists, the AI people predicted AI much earlier (70% of early AI people predict AI before any early futurists do), but this seems to be at least partly explained by the early AI people being concentrated very early, and people predicting AI similar distances in the future throughout time.


Figure 8: Cumulative probability of AI being predicted over time, for late AI and late AGI predictors.(See ‘Cumulative distributions’ sheet)

Figure 9: Cumulative probability of AI being predicted over time, for all late groups. (See ‘Cumulative distributions’ sheet)

Median minPY predictions | AGI | AI | Futurist | Other | All
Early (warning: noisy) | – | 1988 | 2031 | 2036 | 2024
Late | 2033 | 2051 | 2030 | 2101 | 2042

Table 2: Median minPY predictions for all groups, late and early. There were no early AGI predictors.

Statement makers and survey takers
Summary

Surveys seem to produce later median estimates than similar individuals making public statements do. We compared some of the surveys we know of to the demographically similar predictors in the MIRI dataset. We expected these to differ because predictors in the MIRI dataset are mostly choosing to make public statements, while survey takers are being asked, relatively anonymously, for their opinions. Surveys seem to produce median dates on the order of a decade later than statements made by similar groups.

Details

We expect surveys and voluntary statements to be subject to different selection biases. In particular, we expect surveys to represent a more even sample of opinion, and voluntary statements to be more strongly concentrated among people with exciting things to say or strong agendas. To learn about the difference between these groups, and thus the extent of any such bias, we compare below the median predictions made in surveys to median predictions made by people from similar groups in voluntary statements.

Note that this is rough: categorizing people is hard, and we have not investigated the participants in these surveys more than cursorily. There are very few ‘other’ predictors in the MIRI dataset. The results in this section are intended to provide a ballpark estimate only.

Also note that while both sets of predictions are minPYs, the survey dates are often the actual median year that a person expects AI, whereas the statements could often be later years which the person happens to be talking about.

Survey | Primary participants | Median minPY prediction in comparable statements in the MIRI data | Median in survey | Difference
Kruel (AI researchers) | AI | 2051 | 2062 | +11
Kruel (AGI researchers) | AGI | 2033 | 2031 | -2
AGI-09 | AGI | 2033 | 2040 | +7
FHI | AGI/other | 2033-2062 | 2050 | in range
Klein | Other/futurist | 2030-2062 | 2050 | in range
AI@50 | AI/Other | 2051-2062 | 2056 | in range
Bainbridge | Other | 2062 | 2085 | +23

Table 3: median predictions in surveys and statements from demographically similar groups.

Note that the Kruel interviews are somewhere between statements and surveys, and are included in both data.

It appears that the surveys give somewhat later dates than similar groups of people making statements voluntarily. Around half of the surveys give later answers than expected, and the other half are roughly as expected. The difference seems to be on the order of a decade. This is what one might naively expect in the presence of a bias from people advertising their more surprising views.

Relation of predictions and lifespan

Age and predicted time to AI are very weakly anti-correlated: r = -.017 (see Basic statistics sheet, “correlation of age and time to prediction”). This is evidence against a posited bias to predict AI within your existing lifespan, known as the Maes-Garreau Law.

Brain performance in TEPS

We can use Traversed Edges Per Second (TEPS) to measure a computer’s ability to communicate information internally. We can also estimate the human brain’s communication performance in terms of TEPS, and use this to meaningfully compare brains to computers. We estimate that the human brain performs around 0.18 – 6.4 * 10^14 TEPS. This is comparable to the best existing supercomputers at the low end of the range, and around thirty times their performance at the high end.

At current prices for TEPS, we estimate that it costs around $4,700 – $170,000/hour to perform at the level of the brain. Our best guess is that ‘human-level’ TEPS performance will cost less than $100/hour in seven to fourteen years.

Motivation: why measure the brain in TEPS?

Why measure communication?

Performance benchmarks such as floating point operations per second (FLOPS) and millions of instructions per second (MIPS) mostly measure how fast a computer can perform individual operations. However a computer also needs to move information around between the various components performing operations.1 This communication takes time, space and wiring, and so can substantially affect overall performance of a computer, especially on data intensive applications. Consequently when comparing computers it is useful to have performance metrics that emphasize communication as well as ones that emphasize computation. When comparing computers to the brain, there are further reasons to be interested in communication performance, as we shall see below.

Communication is a plausible bottleneck for the brain

In modern high performance computing, communication between and within processors and memory is often a significant cost.2 3 4 5 Our impression is that in many applications it is more expensive than performing individual bit operations, making operations per second a less relevant measure of computing performance.

We should expect computers to become increasingly bottlenecked on communication as they grow larger, for theoretical reasons. If you scale up a computer, it requires linearly more processors, but superlinearly more connections for those processors to communicate with one another quickly. And empirically, this is what happens: the computers which prompted the creation of the TEPS benchmark were large supercomputers.

It’s hard to estimate the relative importance of computation and communication in the brain. But there are some indications that communication is an important expense for the human brain as well. A substantial part of the brain’s energy is used to transmit action potentials along axons rather than to do non-trivial computation.6 Our impression is also that the parts of the brain responsible for communication (e.g. axons) comprise a substantial fraction of the brain’s mass. That substantial resources are spent on communication suggests that communication is high value on the margin for the brain. Otherwise, resources would likely have been directed elsewhere during our evolutionary history.

Today, our impression is that neural networks are typically implemented on single machines because communication between processors is otherwise very expensive. But the power of individual processors is not increasing as rapidly as costs are falling, and even today it would be economical to use thousands of machines if doing so could yield human-level AI. So it seems quite plausible that communication will become a very large bottleneck as neural networks scale further.

In sum, we suspect communication is a bottleneck for the brain for three reasons: the brain is a large computer, similar computing tasks tend to be bottlenecked in this way, and the brain uses substantial resources on communication.

If communication is a bottleneck for the brain, this suggests that it will also be a bottleneck for computers with similar performance to the brain. It does not strongly imply this: a different kind of architecture might be bottlenecked by different factors.

Cost-effectiveness of measuring communication costs

It is much easier to estimate communication within the brain than to estimate computation. This is because action potentials seem to be responsible for most of the long-distance communication7, and their information content is relatively easy to quantify. It is much less clear how many ‘operations’ are being done in the brain, because we don’t know in detail how the brain represents the computations it is doing.

Another issue that makes computing performance relatively hard to evaluate is the potential for custom hardware. If someone wants to do a lot of similar computations, it is possible to design custom hardware which computes much faster than a generic computer. This could happen with AI, making timing estimates based on generic computers too late. Communication may also be improved by appropriate hardware, but we expect the performance gains to be substantially smaller. We have not investigated this question.

Measuring the brain in terms of communication is especially valuable because it is a relatively independent complement to estimates of the brain’s performance based on computation. Moravec, Kurzweil and Sandberg and Bostrom have all estimated the brain’s computing performance, and used this to deduce AI timelines. We don’t know of estimates of the total communication within the brain, or the cost of programs with similar communication requirements on modern computers. These are an important and complementary aspect of the cost of ‘human-level’ computing hardware.

TEPS

Traversed edges per second (TEPS) is a metric that was recently developed to measure communication costs, which were seen as neglected in high performance computing.8 The TEPS benchmark measures the time required to perform a breadth-first search on a large random graph, requiring propagating information across every edge of the graph (either by accessing memory locations associated with different nodes, or communicating between different processors associated with different nodes).9  You can read about the benchmark in more detail at the Graph 500 site.

TEPS as a meaningful way to compare brains and computers

Basic outline of how to measure a brain in TEPS

Though a brain cannot run the TEPS benchmark, we can roughly assess the brain’s communication ability in terms of TEPS. The brain is a large network of neurons, so we can ask how many edges between the neurons (synapses) are traversed (transmit signals) every second. This is equivalent to TEPS performance in a computer in the sense that the brain is sending messages along edges in a graph. However it differs in other senses. For instance, a computer with a certain TEPS performance can represent many different graphs and transmit signals in them, whereas we at least do not know how to use the brain so flexibly. This calculation also makes various assumptions, to be discussed shortly.

One important interpretation of the brain’s TEPS performance calculated in this way is as a lower bound on communication ability needed to simulate a brain on a computer to a level of detail that included neural connections and firing. The computer running the simulation would need to be traversing this many edges per second in the graph that represented the brain’s network of neurons.

Assumptions

Most relevant communication is between neurons

The brain could be simulated at many levels of detail. For instance, in the brain, there is both communication between neurons and communication within neurons. We are considering only communication between neurons. This means we might underestimate communication taking place in the brain.

Our impression is that essentially all long-distance communication in the brain takes place between neurons, and that such long-distance communication is a substantial fraction of the brain’s communication. The reasons for expecting communication to be a bottleneck—that the brain spends much matter and energy on it; that it is a large cost in large computers; and that algorithms which seem similar to the brain tend to suffer greatly from communication costs—also suggest that long distance communication alone is a substantial bottleneck.

Traversing an edge is relevantly similar to spiking

We are assuming that a computer traversing an edge in a graph (as in the TEPS benchmark) is sufficient to functionally replicate a neuron spiking. This might not be true, for instance if the neuron spike sends more information than the edge traversal. This might happen if there were more perceptibly different times each second at which the neuron could send a signal.

We could usefully refine the current estimate by measuring the information contained in neuron spikes and traversed edges. However we expect this to make a difference of less than about a factor of two. Action potentials don’t appear to transfer a lot more information than edge traversals in the TEPS benchmark. Also, in general, increasing time resolution only increases the information contained in a signal logarithmically. That is, if neurons can send signals at twice as many different times, this only adds one bit of information to their message.
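As a minimal illustration of this logarithmic point (a toy calculation of ours, not a model of actual neural coding), a spike that can land in one of N distinguishable time slots carries log2(N) bits of timing information:

```python
import math

# Toy illustration: bits of timing information per spike, if the spike can
# fall in one of `slots` distinguishable time slots within some window.
for slots in [2, 4, 8, 16, 1024]:
    print(slots, "slots ->", math.log2(slots), "bits")
# Doubling the number of distinguishable times adds exactly one bit.
```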

Distributions of edges traversed don’t make a material difference

The distribution of edges traversed in the brain is presumably quite different from the one used in the TEPS benchmark. We are ignoring this, assuming that it doesn’t make a large difference to the number of edges that can be traversed. This might not be true, if for instance the ‘short’ connections in the brain are used more often. We know of no particular reason to expect this, but it would be a good thing to check in future.

Graph characteristics are relevantly similar

Graphs vary in how many nodes they contain, how many connections exist between nodes, and how the connections are distributed. If these parameters are quite different for the brain and the computers tested on the TEPS benchmark, we should be more wary interpreting computer TEPS performance as equivalent to what the brain does. For instance, if the brain consisted of a very large number of nodes with very few connections, and computers could perform at a certain level on much smaller graphs with many connections, then even if the computer could traverse as many edges per second, it may not be able to carry out the edge traversals that the brain is doing.

However graphs with different numbers of nodes are more comparable than they might seem. Ten connected nodes with ten links each can be treated as one node with around ninety links. The links connecting the ten nodes are a small fraction of those acting as outgoing links, so whether the central ‘node’ is really ten connected nodes should make little difference to a computer’s ability to deal with the graph. The most important parameters are the number of edges and the number of times they are traversed.

We can compare the characteristics of brains and graphs in the TEPS benchmark. The TEPS benchmark uses graphs with up to 2 * 10^12 nodes,10 while the human brain has around 10^11 nodes (neurons). Thus the human brain is around twenty times smaller (in terms of nodes) than the largest graphs used in the TEPS benchmark.

The brain contains many more links than the TEPS benchmark graphs. TEPS graphs appear to have average degree 32 (that is, each node has 32 links on average),11 while the brain apparently has average degree around 3,600 – 6,400.12

The distribution of connections in the brain and the TEPS benchmark are probably different. Both are small-world distributions, with some highly connected nodes and some sparsely connected nodes, however we haven’t compared them in depth. The TEPS graphs are produced randomly, which should be a particularly difficult case for traversing edges in them (according to our understanding). If the brain has more local connections, traversing edges in it should be somewhat easier.

We expect the distribution of connections to make a small difference. In general, the time required to do a breadth-first search depends linearly on the number of edges, and doesn’t depend on degree. The TEPS benchmark is essentially a breadth-first search, so we should expect it to basically have this character. However in a physical computer, degree probably matters somewhat. We expect that in practice the cost scales with edges * log(edges), because the difficulty of traversing each edge should scale with log(edges) as edges become more complex to specify. A graph with more local connections and fewer long-distance connections is much like a smaller graph, so that too should not change difficulty much.
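Since the TEPS benchmark is essentially a breadth-first search, a minimal BFS sketch (our own toy example in Python) shows why the idealized cost is linear in the number of edges and largely insensitive to degree: each node is enqueued at most once and each edge is examined a constant number of times.

```python
from collections import deque

def bfs(adjacency, source):
    """Breadth-first search over an adjacency-list graph. Each node is
    enqueued at most once and each edge is examined a constant number of
    times, so the running time is O(nodes + edges) regardless of degree."""
    visited = {source}
    queue = deque([source])
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for neighbour in adjacency[node]:  # an edge traversal is what TEPS counts
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(neighbour)
    return order

# Toy graph: one high-degree node and several sparsely connected ones.
graph = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0, 4], 4: [0, 3]}
print(bfs(graph, 0))  # [0, 1, 2, 3, 4]
```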

How many TEPS does the brain perform?

We can calculate TEPS performed by the brain as follows:

TEPS = synapse-spikes/second in the brain

= Number of synapses in the brain * Average spikes/second in synapses

≈ Number of synapses in the brain * Average spikes/second in neurons

≈ (1.8 – 3.2 * 10^14) * (0.1 – 2)

= 0.18 – 6.4 * 10^14

That is, the brain performs at around 18-640 trillion TEPS.

Note that the average firing rate of neurons is not necessarily equal to the average firing rate in synapses, even though each spike involves both a neuron and synapses. Neurons have many synapses, so if neurons that fire faster tend to have more or less synapses than slower neurons, the average rates will diverge. We are assuming here that average rates are similar. This could be investigated further.

For comparison, the highest TEPS performance by a computer is 2.3 * 10^13 TEPS (23 trillion TEPS)13, which according to the above figures is within the plausible range of brains (at the very lower end of the range).
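A minimal sketch of the arithmetic above, using the figures quoted in this section:

```python
# Reproduce the range arithmetic above (figures quoted in this section).
synapses = (1.8e14, 3.2e14)      # number of synapses in the brain
spike_rate = (0.1, 2.0)          # average spikes/second (assumed similar for neurons and synapses)

brain_teps = (synapses[0] * spike_rate[0], synapses[1] * spike_rate[1])
print(brain_teps)                # (1.8e+13, 6.4e+14), i.e. 0.18 - 6.4 * 10^14 TEPS

top_supercomputer_teps = 2.3e13  # best recorded TEPS performance cited above
print(brain_teps[0] / top_supercomputer_teps)  # ~0.8: lower end is just below the record
print(brain_teps[1] / top_supercomputer_teps)  # ~28: upper end is roughly thirty times the record
```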

Implications

That the brain performs at around 18-640 trillion TEPS means that if communication is in fact a major bottleneck for brains, and also for computer hardware functionally replicating brains, then existing hardware can probably already perform at the level of a brain, or at least at one thirtieth of that level.

Cost of ‘human-level’ TEPS performance

We can also calculate the price of a machine equivalent to a brain in TEPS performance, given current prices for TEPS:

Price of brain-equivalence = TEPS performance of brain * price of TEPS

= TEPS performance of brain / billion * price of GTEPS

= (0.18 – 6.4 * 10^14) / 10^9 * $0.26/hour

= $(0.047 – 1.7) * 10^5/hour

= $4,700 – $170,000/hour

For comparison, supercomputers seem to cost around $2,000-40,000/hour to run, if we amortize their costs across three years.14 So the lower end of this range is within what people pay for computing applications (naturally, since the brain appears to be around as powerful as the largest supercomputers, in terms of TEPS). The lower end of the range is still about 1.5 orders of magnitude more than what people regularly pay for labor, though the highest paid CEOs appear to make at least $12k/hour.15

Timespan for ‘human-level’ TEPS to arrive

Our best guess is that TEPS/$ grows by a factor of ten every four years, roughly. Thus it should take about seven to thirteen years for computer hardware to compete on TEPS with a human who costs $100/hour.16 We are fairly unsure of the growth rate of TEPS, however.
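A minimal sketch of the cost and timeline arithmetic, using the figures quoted above; the price per GTEPS-hour and the growth rate are this page's rough estimates, not precise inputs.

```python
import math

brain_teps = (1.8e13, 6.4e14)    # brain performance range in TEPS, from above
price_per_gteps_hour = 0.26      # current price in $/GTEPS/hour, from above
growth_per_year = 10 ** (1 / 4)  # TEPS/$ grows ~10x every four years (rough best guess)
target_price = 100               # $/hour, roughly a human wage

cost_now = [teps / 1e9 * price_per_gteps_hour for teps in brain_teps]
print(cost_now)                  # ~[4700, 166000] $/hour

years = [math.log(c / target_price, growth_per_year) for c in cost_now]
print(years)                     # ~[6.7, 12.9] years, i.e. roughly seven to thirteen
```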


 
