Published 19 Oct 2020
Progress in computer image classification performance took:
- Over 14 years to reach the level of an untrained human
- 3 years to pass from untrained human level to trained human level
- 5 years to continue from trained human to current performance (2020)
Details
Metric
ImageNet1 is a large collection of images organized into a hierarchy of noun categories. We looked at ‘top-5 accuracy’ in categorizing images. In this task, the player is given an image, and can guess five different categories that the image might represent. It is judged as correct if the image is in fact in any of those five categories.
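To make the scoring rule concrete, here is a minimal sketch of how top-5 accuracy could be computed. The function names and the toy guesses and labels are hypothetical, chosen only for illustration; they are not taken from the ImageNet tooling or from our test data.

```python
def top5_correct(guesses, true_label):
    """Return True if the true label is among the first five guesses."""
    return true_label in guesses[:5]


def top5_accuracy(all_guesses, true_labels):
    """Fraction of images whose true label appears among its five guesses."""
    if not true_labels or len(all_guesses) != len(true_labels):
        raise ValueError("need one non-empty list of guesses per labelled image")
    hits = sum(top5_correct(g, t) for g, t in zip(all_guesses, true_labels))
    return hits / len(true_labels)


# Hypothetical toy data: four images, five guesses each.
guesses = [
    ["tabby", "tiger cat", "lynx", "Egyptian cat", "Persian cat"],
    ["golden retriever", "Labrador retriever", "beagle", "kuvasz", "collie"],
    ["teapot", "coffeepot", "pitcher", "vase", "cup"],
    ["sports car", "convertible", "racer", "limousine", "minivan"],
]
labels = ["Egyptian cat", "beagle", "water jug", "convertible"]

accuracy = top5_accuracy(guesses, labels)
print(f"top-5 accuracy: {accuracy:.2f}")    # 3 of 4 images are hits
print(f"top-5 error: {1 - accuracy:.2f}")
```

The error rates quoted later on this page are one minus this accuracy, expressed as a percentage.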
Human performance milestones
Beginner level
We used Andrej Karpathy’s interface2 for doing the ImageNet top-5 accuracy task ourselves, and asked a few friends to do it. Five people did it, with performances ranging from 74% to 89%, with a median performance of 81%.
This was not a random sample of people, and conditions for taking the test differed. Most notably, there was no time limit, so time allocated was set by patience for trying to marginally improve guesses.
Trained human-level
ImageNet categorization is not a popular activity for humans, so we do not know what highly talented and trained human performance would look like. The best measure of relatively skilled human performance that we have comes from Russakovsky et al, who report on the performance of two ‘expert annotators’, who they say learned many of the categories.3 The better-performing annotator there had a 5.1% error rate.4
AI achievement of human milestones
Earliest attempt
The ImageNet database was released in 2009.5 An annual contest, the ImageNet Large Scale Visual Recognition Challenge, began in 2010.6
In the 2010 contest, the best top-5 classification performance had 28.2% error.7
However, image classification more broadly is older. Pascal VOC was a similar earlier contest, which ran from 2005.8 We do not know when the first successful image classification systems were developed. In a blog post, Amidi & Amidi point to LeNet as pioneering work in image classification9, and it appears to have been developed in 1998.10
Beginner level
The first entrant in the ImageNet contest to perform better than our beginner level benchmark was SuperVision (commonly known as AlexNet) in 2012, with a 15.3% error rate.11
Superhuman level
In 2015 He et al apparently achieved a 4.5% error rate, slightly better than our high human benchmark.12
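The milestone comparisons above mix accuracies and error rates, so a short sketch may help show how they line up. It uses only the figures quoted on this page; the variable and function names are ours, and the result table is limited to the three systems cited here, so "first year" means first among those.

```python
# Figures quoted in the text above. Error rates are top-5, in percent.
BEGINNER_MEDIAN_ACCURACY = 81.0   # median of our five untrained participants
TRAINED_HUMAN_ERROR = 5.1         # better 'expert annotator' in Russakovsky et al

# Only the results cited on this page (2010 contest winner, AlexNet, ResNet).
best_error_by_year = {2010: 28.2, 2012: 15.3, 2015: 4.5}


def first_year_below(threshold_error, results):
    """Earliest listed year whose error rate is strictly below the threshold."""
    for year in sorted(results):
        if results[year] < threshold_error:
            return year
    return None


# Convert the beginner accuracy into an error rate so units match.
beginner_error = 100.0 - BEGINNER_MEDIAN_ACCURACY

beginner_year = first_year_below(beginner_error, best_error_by_year)
trained_year = first_year_below(TRAINED_HUMAN_ERROR, best_error_by_year)
print(beginner_year, trained_year)
```

Note that the human and machine figures were not measured under identical conditions, so this is a rough comparison rather than a strict one.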
Current level
According to paperswithcode.com, performance has continued to climb through 2020, though more slowly than before.13
Times for AI to cross human-relative ranges
Given the above dates, we have:
Range | Start | End | Duration (years) |
First attempt to beginner level | <1998 | 2012 | >14 |
Beginner to superhuman | 2012 | 2015 | 3 |
Above superhuman | 2015 | >2020 | >5 |
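The durations in the table follow from simple subtraction, with a ">" carried through wherever a start or end date is only a bound. A minimal sketch, assuming the dates given above (the names here are illustrative):

```python
def duration_label(start, end, open_ended=False):
    """Years between start and end; prefixed with '>' if either date is only a bound."""
    years = end - start
    return f">{years}" if open_ended else str(years)


# (range name, start year, end year, whether a date is only a bound)
rows = [
    ("First attempt to beginner level", 1998, 2012, True),   # start is "<1998"
    ("Beginner to superhuman", 2012, 2015, False),
    ("Above superhuman", 2015, 2020, True),                  # end is ">2020"
]

durations = {name: duration_label(s, e, o) for name, s, e, o in rows}
for name, label in durations.items():
    print(f"{name}: {label} years")
```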
Primary author: Rick Korzekwa
Notes
- “ImageNet is an image database organized according to the WordNet hierarchy (currently only the nouns), in which each node of the hierarchy is depicted by hundreds and thousands of images. Currently we have an average of over five hundred images per node.”
“ImageNet.” Accessed October 19, 2020. http://www.image-net.org/.
- Karpathy, Andrej. “Ilsvrc.” Accessed October 19, 2020. https://cs.stanford.edu/people/karpathy/ilsvrc/.
- ‘Therefore, in evaluating the human accuracy we relied primarily on expert annotators who learned to recognize a large portion of the 1000 ILSVRC classes. During training, the annotators labeled a few hundred validation images for practice and later switched to the test set images’
Russakovsky, Olga, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, et al. “ImageNet Large Scale Visual Recognition Challenge.” ArXiv:1409.0575 [Cs], January 29, 2015. http://arxiv.org/abs/1409.0575.
- “Annotator A1 evaluated a total of 1500 test set images. The GoogLeNet classification error on this sample was estimated to be 6.8% (recall that the error on full test set of 100,000 images is 6.7%, as shown in Table 7). The human error was estimated to be 5.1%.”
Also see Table 9.
Russakovsky, Olga, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, et al. “ImageNet Large Scale Visual Recognition Challenge.” ArXiv:1409.0575 [Cs], January 29, 2015. http://arxiv.org/abs/1409.0575.
- “They presented their database for the first time as a poster at the 2009 Conference on Computer Vision and Pattern Recognition (CVPR) in Florida.”
“ImageNet.” In Wikipedia, September 9, 2020. https://en.wikipedia.org/w/index.php?title=ImageNet&oldid=977585441.
- “…The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) has been running annually for five years (since 2010) and has become the standard benchmark for large-scale object recognition.”
Russakovsky, Olga, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, et al. “ImageNet Large Scale Visual Recognition Challenge.” ArXiv:1409.0575 [Cs], January 29, 2015. http://arxiv.org/abs/1409.0575.
- See table 6.
Russakovsky, Olga, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, et al. “ImageNet Large Scale Visual Recognition Challenge.” ArXiv:1409.0575 [Cs], January 29, 2015. http://arxiv.org/abs/1409.0575.
- “The PASCAL Visual Object Classes (VOC) challenge is a benchmark in visual object category recognition and detection, providing the vision and machine learning communities with a standard dataset of images and annotation, and standard evaluation procedures. Organised annually from 2005 to present, the challenge and its associated dataset has become accepted as the benchmark for object detection.”
Everingham, Mark, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman. “The Pascal Visual Object Classes (VOC) Challenge.” International Journal of Computer Vision 88, no. 2 (June 2010): 303–38. https://doi.org/10.1007/s11263-009-0275-4.
- See section ‘LeNet’.
“The Evolution of Image Classification Explained.” Accessed October 19, 2020. https://stanford.edu/~shervine/blog/evolution-image-classification-explained#lenet.
- “LeNet is a convolutional neural network structure proposed by Yann LeCun et al. in 1998.”
“LeNet.” In Wikipedia, June 19, 2020. https://en.wikipedia.org/w/index.php?title=LeNet&oldid=963418885.
- “We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%”
Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E Hinton. “ImageNet Classification with Deep Convolutional Neural Networks.” In Advances in Neural Information Processing Systems 25, edited by F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, 1097–1105. Curran Associates, Inc., 2012. http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf.
Also, see Table 6 for a list of other entrants:
Russakovsky, Olga, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, et al. “ImageNet Large Scale Visual Recognition Challenge.” ArXiv:1409.0575 [Cs], January 29, 2015. http://arxiv.org/abs/1409.0575.
- “Our 152-layer ResNet has a single-model top-5 validation error of 4.49%.”
Also see Table 4
He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. “Deep Residual Learning for Image Recognition.” ArXiv:1512.03385 [Cs], December 10, 2015. http://arxiv.org/abs/1512.03385.
- See figure:
“Papers with Code – ImageNet Benchmark (Image Classification).” Accessed October 19, 2020. https://paperswithcode.com/sota/image-classification-on-imagenet.