This is the first in a sequence of articles outlining research which could help forecast AI development.
Concrete research projects are in boxes. ∑5 ∆8 means we guess the project will take (very) roughly five hours, and we rate its value (very) roughly 8/10.
Most projects could be done to very different degrees of depth, or at very different scales. Our time cost estimates correspond to a size that we would be likely to intend if we were to do the project. Value estimates are merely ordinal indicators of worth, based on our intuitive sense, and unworthy of being taken very seriously.
1. How does AI progress depend on hardware and software?
At a high level, AI improves when people make better software, when they can run it on better hardware, when they gather bigger, better training sets, etc. This makes present-day hardware and software progress a natural place to look for evidence about when advanced AI will arrive. To interpret any such data, however, it is important to know how these pieces fit together. For instance, is the progress we see now mostly driven by hardware progress, or software progress? Can the same level of performance usually be achieved by widely varying mixtures of hardware and software? Does progress on software depend on progress on hardware?
It is important to understand the relationship between hardware, software and AI for several reasons. If hardware progress is the main driver of AI progress, then quite different evidence would tell us about AI timelines than if software is the main driver. Thus, depending on the answer, different research is valuable and different timelines are likely. Many people base their AI predictions on hardware progress, while others decline to, so it would be broadly useful to know whether one should. We also expect understanding here to be generally useful.
So we think research in this direction seems valuable. We also think several projects seem tractable. Yet little appears to have been done in this direction. Thus this topic seems a high priority.
1.1 How does AI progress depend qualitatively on hardware and software progress?
For instance, will human-level AI appear when we have both a certain amount of hardware, and certain developments in software? Or can hardware and software substitute for one another? Substitution seems a natural model of the relationship between hardware and software, since anecdotally many tasks can be done by low quality software and lots of hardware, or by high quality software and less hardware. However, the extent of this substitutability is unclear. This kind of model is also not commonly used in estimating AI timelines, so judging whether it should be might be a useful contribution. Having a good model would also bear on the priority of other research directions. As far as we know, this issue has received almost no attention. It seems moderately tractable.
1.1.A Evaluate qualitative models of the relationships between hardware, software and AI ∑30 ∆5
One way to approach the question of qualitative relationships is to assume some model, and work on projects such as those in 1.2 that measure quantitative details of the model, then revise the model if the measurements don’t make sense in it. Before that step, we might spend a short time detailing plausible models, and examining empirical and theoretical evidence we might already have, or could cheaply find. If we were going to follow up with empirical research, we would think about what evidence we would expect the research to reveal, given alternative models.
For instance, we find the hardware-software indifference curve model described briefly above (and outlined better in a blog post) plausible. Here are some ways it might be inadequate, that we might consider in evaluating it:
- ‘Hardware’ and ‘software’ are not sufficiently measurable entities for a ‘level’ of each in some domain to produce a stable level of performance.
- Performance depends strongly on other factors, e.g. exactly what kind of hardware and software progress you make, unique details of the software being developed, training data available.
- Different problem types, and different performance metrics on them, have different kinds of behavior.
- There are ‘indifference curves’ in a sense but they are not sufficiently consistent to be worth reasoning about.
- Humanity’s technological progress is not well characterized by an expanding rectangle of feasible hardware and software levels, but more as a complicated region of feasible combinations.
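To make the model under evaluation concrete, here is a minimal sketch of a hardware-software indifference curve together with the 'expanding rectangle' picture of feasibility that the last point questions. The functional form (performance linear in the logarithm of hardware and in a 'software level'), the coefficients, and all names are our own illustrative assumptions, not measurements or an established model.

```python
import math

# Hypothetical toy model: both coefficients below are made-up assumptions.
HARDWARE_COEFF = 1.0  # performance gained per doubling of hardware (assumed)
SOFTWARE_COEFF = 2.0  # performance gained per unit of software level (assumed)


def performance(hardware, software_level):
    """Performance under the assumed log-linear substitution model."""
    return HARDWARE_COEFF * math.log2(hardware) + SOFTWARE_COEFF * software_level


def hardware_needed(target_performance, software_level):
    """Hardware lying on the indifference curve for a target performance."""
    doublings = (target_performance - SOFTWARE_COEFF * software_level) / HARDWARE_COEFF
    return 2.0 ** doublings


def is_feasible(hardware, software_level, max_hardware, max_software_level):
    """'Expanding rectangle' assumption: any point within both limits is attainable."""
    return hardware <= max_hardware and software_level <= max_software_level


# Two points on the same indifference curve (target performance 10).
print(hardware_needed(10.0, 3.0))  # 16.0
print(hardware_needed(10.0, 4.0))  # 4.0
```

With these made-up coefficients, one unit of software level substitutes for a fourfold increase in hardware. Each objection in the list above corresponds to some way this sketch could fail, for example if `performance` depended strongly on inputs other than its two arguments, or if `is_feasible` had to describe a more complicated region than a rectangle.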
1.2 How much do marginal hardware and software improvements alter AI performance?
As mentioned above, this question is key to determining which other investigations are worthwhile. Naturally, it could also change our timelines substantially. Thus this question seems important to resolve. We think the projects here are particularly tractable, though not particularly cheap. For all of these projects, we would probably choose a specific set of benchmarks on particular problems to focus on. We might do several of these projects on the same set of benchmarks, to trace a more complete picture.
1.2.A Search for natural experiments combining modern hardware and early software approaches or vice versa. ∑80 ∆7
For instance, we might find early projects with very large hardware budgets, or recent projects with intentionally restricted hardware. Where these were tested on commonly used benchmarks, we can use them to map out the broad contributions of hardware and software to progress. For instance, if very small chess programs today run better than old chess programs which used similar (but then normal) amounts of hardware, then the difference between them can be attributed to improving software, roughly.
1.2.B Apply a modern understanding of software to early hardware ∑2,000 ∆9
Choose a benchmark problem that people worked on in the past, e.g. in the 1980s. Use a modern understanding of AI to solve the problem again, still using 1980s hardware. Compare this to how researchers did in the 1980s. This project requires substantial time from at least one AI researcher. Ideally they would spend a similar amount of effort as the past researchers did, so it may be worth choosing a problem where it is known that an achievable level of effort was applied in the past.
1.2.C Apply early software understanding to modern hardware ∑2,000 ∆8
Using contemporary hardware and a 1970s or 1980s understanding of connectionism, observe the extent to which a modern AI researcher (or student) could replicate contemporary performance on benchmark AI problems. This project is relatively expensive, among those we are describing. It requires substantial time from collaborators with a historically accurate, minimal understanding of AI. Students may satisfy this role well, if their education is incomplete in the right ways. One might compare to the work of similar students who had also learned about modern methods.
1.2.D Measure marginal effects of hardware and software in existing performance trends ∑100 ∆8
Often the same software can be used with modest changes in hardware, so changes in performance from hardware over small margins can be measured. Improved software is also often written to be run on the same hardware as earlier software, so changes in performance from software alone can be measured over moderate margins. Thus we can often estimate these marginal changes from looking at existing performance measurements.
We can also look at overall progress over time on some applications, and factor out what we know about hardware or software change, assuming it is close to the marginal values measured by the above methods. For instance, we can see how much individual Go programs improve with more hardware, and then we can look at longer term improvements in computer Go, and guess how much of that improvement came from hardware, given our earlier estimate of marginal improvement from hardware. In general these estimates will be less valid over larger distances, as the impact of hardware or software diverges from its marginal impact, and because arbitrary combinations of hardware and software can’t generally be combined without designing the software to make use of the hardware. Grace 2013 includes some work on this project.
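The factoring-out step described above can be sketched as a short calculation. This is a rough illustration only: it assumes the marginal gain per doubling of hardware stays constant over the whole period (the very assumption the paragraph above warns becomes less valid over larger distances), and the function names and numbers are hypothetical, not drawn from any dataset.

```python
import math


def split_gain(total_gain, gain_per_doubling, hardware_ratio):
    """Split a long-run performance gain into hardware and software parts.

    Assumes the measured marginal gain per hardware doubling holds over
    the whole range, and attributes the remainder to software.
    """
    hardware_gain = gain_per_doubling * math.log2(hardware_ratio)
    software_gain = total_gain - hardware_gain
    return hardware_gain, software_gain


def hardware_share(total_gain, gain_per_doubling, hardware_ratio):
    """Fraction of the total gain attributed to hardware."""
    hardware_gain, _ = split_gain(total_gain, gain_per_doubling, hardware_ratio)
    return hardware_gain / total_gain


# Illustrative numbers only: 1000 rating points gained overall, 50 points per
# hardware doubling measured on a single program, and 1024 times more hardware.
print(split_gain(1000.0, 50.0, 1024.0))      # (500.0, 500.0)
print(hardware_share(1000.0, 50.0, 1024.0))  # 0.5
```

The software part here is a residual, so it also absorbs anything else that changed over the period, such as training data or the interaction between hardware and software.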
1.2.E Interview AI researchers on the relative importance of hardware and software in driving the progress they have seen. ∑20 ∆7
AI researchers likely have firsthand experience regarding how hardware and software contribute to overall progress within the vicinity of their own work. This project will probably give relatively noisy estimates, but is very cheap compared to others described here. One could just ask for views on this question, and supporting anecdotes, or devise a more structured questionnaire beforehand.
1.3 How do hardware and software progress interact?
Do hardware and software progress relatively independently, or do advances in hardware, for instance, encourage advances in software? This might change how we generally expect software progress to proceed, and what combinations of hardware and software we expect to first produce human-level AI. We are likely to get some information about this from other projects looking at historical performance data, e.g. 1.2.D. For instance, if overall progress is generally proportional to hardware progress, even as hardware progress varies, then this would suggest that software progress depends on hardware progress. Below are further possibilities.
1.3.A Find natural experiments ∑80 ∆4
Search for performance data from cases where the hardware being used for an application was largely constant, then shifted upward at some point. Such cases are probably hard to find, and hard to interpret when found. However, a short search for them may be worthwhile.
1.3.B Interview researchers ∑20 ∆7
If hardware tends to affect software research, it is likely that researchers notice this, and can talk about it. This seems a cheap and effective method of learning qualitatively about the topic. This project should probably be combined with 1.2.E.
1.3.C Consider plausible models ∑10 ∆5
This is a short theoretical project that would benefit from being done in concert with 1.3.B (interview researchers), since researchers probably have a relatively good understanding of which models are plausible, and we are likely to ask better questions of them if we have thought about the topic. This project should probably be combined with 1.1.A.