Time for AI to cross the human performance range in ImageNet image classification

Published 19 Oct 2020

Progress in computer image classification performance took:

Over 14 years to reach the level of an untrained human
3 years to pass from untrained human level to trained human level
5 years to continue from trained human to current performance (2020)


Details

Metric

ImageNet¹ is a large collection of images organized into a hierarchy of noun categories. We looked at ‘top-5 accuracy’ in categorizing images. In this task, the player is given an image, and can guess five different categories that the image might represent. It is judged as correct if the image is in fact in any of those five categories.
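The scoring rule described above can be sketched in a few lines of Python. This is only an illustration of how a top-5 score is computed, not the code used by the ImageNet contest or by Karpathy's interface. The function names and the toy guesses and labels are hypothetical, and the sketch assumes each image has exactly one true category.

```python
from typing import Sequence


def top5_correct(guesses: Sequence[str], true_label: str) -> bool:
    """Return True if the true label is among the (at most five) guesses."""
    if len(guesses) > 5:
        raise ValueError("top-5 scoring allows at most five guesses per image")
    return true_label in guesses


def top5_accuracy(all_guesses: Sequence[Sequence[str]],
                  true_labels: Sequence[str]) -> float:
    """Fraction of images whose true label appears among their guesses."""
    if len(all_guesses) != len(true_labels):
        raise ValueError("need exactly one set of guesses per image")
    if not true_labels:
        raise ValueError("need at least one image")
    hits = sum(top5_correct(g, t) for g, t in zip(all_guesses, true_labels))
    return hits / len(true_labels)


# Hypothetical toy data: four images, five guesses each (assumption, not real results).
guesses = [
    ["tabby cat", "tiger cat", "lynx", "Persian cat", "Egyptian cat"],
    ["beagle", "basset", "bloodhound", "foxhound", "Walker hound"],
    ["teapot", "coffeepot", "pitcher", "vase", "cup"],
    ["sports car", "convertible", "racer", "limousine", "minivan"],
]
labels = ["tiger cat", "whippet", "teapot", "convertible"]

accuracy = top5_accuracy(guesses, labels)
# Top-5 error is simply the complement of top-5 accuracy.
print(f"top-5 accuracy: {accuracy:.0%}, top-5 error: {1 - accuracy:.0%}")
```

In this toy example three of the four images are scored as correct, so the top-5 accuracy is 75% and the top-5 error is 25%. The figures in this article are quoted sometimes as accuracy and sometimes as error, and the two are related in exactly this way.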

Human performance milestones

Beginner level

We used Andrej Karpathy’s interface² for doing the ImageNet top-5 accuracy task ourselves, and asked a few friends to do it. Five people completed it, with accuracies ranging from 74% to 89% and a median of 81%.

This was not a random sample of people, and conditions for taking the test differed. Most notably, there was no time limit, so the time each person spent was set by their patience for marginally improving their guesses.

Trained human-level

ImageNet categorization is not a popular activity for humans, so we do not know what highly talented and trained human performance would look like. The best measure of relatively high human performance that we have comes from Russakovsky et al, who report on the performance of two ‘expert annotators’, who they say learned many of the categories.³ The better-performing annotator had a 5.1% error rate.⁴

AI achievement of human milestones

Earliest attempt

The ImageNet database was released in 2009.⁵ An annual contest, the ImageNet Large Scale Visual Recognition Challenge, began in 2010.⁶

In the 2010 contest, the best top-5 classification performance had 28.2% error.⁷

However, image classification more broadly is older. Pascal VOC was a similar earlier contest, which ran from 2005.⁸ We do not know when the first successful image classification systems were developed. In a blog post, Amidi & Amidi point to LeNet as pioneering work in image classification,⁹ and it appears to have been developed in 1998.¹⁰

Beginner level

The first entrant in the ImageNet contest to perform better than our beginner level benchmark was SuperVision (commonly known as AlexNet) in 2012, with a 15.3% error rate,¹¹ that is, 84.7% accuracy against our beginner median of 81%.

Superhuman level

In 2015, He et al apparently achieved a 4.5% error rate, slightly better than our trained human benchmark of 5.1%.¹²
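Because the human benchmarks are quoted partly as accuracy and partly as error rate, it can help to put everything on one scale. The following is a minimal sketch that converts the beginner median to an error rate and places each AI result from this section relative to the two human benchmarks. All numbers are taken from this article. The variable names and the `level` helper are hypothetical.

```python
# Human benchmarks from this article, expressed as top-5 error in percent.
BEGINNER_MEDIAN_ACCURACY = 81.0                      # median of five untrained people
beginner_error = 100.0 - BEGINNER_MEDIAN_ACCURACY    # 19.0% error
trained_error = 5.1                                  # better expert annotator

# AI results quoted in this article: (year, system, top-5 error in percent).
ai_results = [
    (2010, "best ILSVRC 2010 entry", 28.2),
    (2012, "SuperVision (AlexNet)", 15.3),
    (2015, "ResNet (He et al)", 4.5),
]


def level(error: float) -> str:
    """Place an error rate relative to the two human benchmarks."""
    if error < trained_error:
        return "better than trained human"
    if error < beginner_error:
        return "between beginner and trained human"
    return "worse than beginner"


for year, system, error in ai_results:
    print(f"{year}  {system}: {error}% error, {level(error)}")
```

Lower error is better, so a system crosses a benchmark when its error rate falls below that benchmark's error rate.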

Current level

According to paperswithcode.com, performance has continued to improve through 2020, though more slowly than before.¹³

Times for AI to cross human-relative ranges

Given the above dates, we have:

Range	Start	End	Duration (years)
First attempt to beginner level	<1998	2012	>14
Beginner to superhuman	2012	2015	3
Above superhuman	2015	>2020	>5
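The durations in the table are simple differences between the milestone years. A small sketch of that arithmetic follows, using only the years given above. The tuple layout and the `lower_bound` flag are assumptions made for illustration. A duration is marked with ">" when its start year is only a latest possible date (the first attempt was in 1998 or earlier) or when progress was still continuing at its end year (2020).

```python
# (range, start year, end year, whether the duration is only a lower bound)
ranges = [
    ("First attempt to beginner level", 1998, 2012, True),   # start is "<1998"
    ("Beginner to superhuman", 2012, 2015, False),
    ("Above superhuman", 2015, 2020, True),                  # end is ">2020"
]

durations = []
for name, start, end, lower_bound in ranges:
    years = end - start
    # Prefix with ">" when the true duration could be longer than the difference.
    durations.append((">" if lower_bound else "") + str(years))

for (name, _, _, _), duration in zip(ranges, durations):
    print(f"{name}: {duration} years")
```

This reproduces the ">14", "3" and ">5" entries in the Duration column.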

Primary author: Rick Korzekwa

Notes

¹ “ImageNet is an image database organized according to the WordNet hierarchy (currently only the nouns), in which each node of the hierarchy is depicted by hundreds and thousands of images. Currently we have an average of over five hundred images per node.”

“ImageNet.” Accessed October 19, 2020. http://www.image-net.org/.
² Karpathy, Andrej. “Ilsvrc.” Accessed October 19, 2020. https://cs.stanford.edu/people/karpathy/ilsvrc/.
³ ‘Therefore, in evaluating the human accuracy we relied primarily on expert annotators who learned to recognize a large portion of the 1000 ILSVRC classes. During training, the annotators labeled a few hundred validation images for practice and later switched to the test set images’

Russakovsky, Olga, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, et al. “ImageNet Large Scale Visual Recognition Challenge.” ArXiv:1409.0575 [Cs], January 29, 2015. http://arxiv.org/abs/1409.0575.
⁴ “Annotator A1 evaluated a total of 1500 test set images. The GoogLeNet classification error on this sample was estimated to be 6.8% (recall that the error on full test set of 100,000 images is 6.7%, as shown in Table 7). The human error was estimated to be 5.1%.”

Also see Table 9

Russakovsky, Olga, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, et al. “ImageNet Large Scale Visual Recognition Challenge.” ArXiv:1409.0575 [Cs], January 29, 2015. http://arxiv.org/abs/1409.0575.
⁵ “They presented their database for the first time as a poster at the 2009 Conference on Computer Vision and Pattern Recognition (CVPR) in Florida.”

“ImageNet.” In Wikipedia, September 9, 2020. https://en.wikipedia.org/w/index.php?title=ImageNet&oldid=977585441.
⁶ “…The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) has been running annually for five years (since 2010) and has become the standard benchmark for large-scale object recognition.”

Russakovsky, Olga, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, et al. “ImageNet Large Scale Visual Recognition Challenge.” ArXiv:1409.0575 [Cs], January 29, 2015. http://arxiv.org/abs/1409.0575.
⁷ See table 6.

Russakovsky, Olga, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, et al. “ImageNet Large Scale Visual Recognition Challenge.” ArXiv:1409.0575 [Cs], January 29, 2015. http://arxiv.org/abs/1409.0575.
⁸ “The PASCAL Visual Object Classes (VOC) challenge is a benchmark in visual object category recognition and detection, providing the vision and machine learning communities with a standard dataset of images and annotation, and standard evaluation procedures. Organised annually from 2005 to present, the challenge and its associated dataset has become accepted as the benchmark for object detection.”

Everingham, Mark, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman. “The Pascal Visual Object Classes (VOC) Challenge.” International Journal of Computer Vision 88, no. 2 (June 2010): 303–38. https://doi.org/10.1007/s11263-009-0275-4.
⁹ See section ‘LeNet’.

“The Evolution of Image Classification Explained.” Accessed October 19, 2020. https://stanford.edu/~shervine/blog/evolution-image-classification-explained#lenet.
¹⁰ “LeNet is a convolutional neural network structure proposed by Yann LeCun et al. in 1998.”

“LeNet.” In Wikipedia, June 19, 2020. https://en.wikipedia.org/w/index.php?title=LeNet&oldid=963418885.
¹¹ “We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%”

Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E Hinton. “ImageNet Classification with Deep Convolutional Neural Networks.” In Advances in Neural Information Processing Systems 25, edited by F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, 1097–1105. Curran Associates, Inc., 2012. http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf.

Also, see Table 6 for a list of other entrants:

Russakovsky, Olga, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, et al. “ImageNet Large Scale Visual Recognition Challenge.” ArXiv:1409.0575 [Cs], January 29, 2015. http://arxiv.org/abs/1409.0575.
¹² “Our 152-layer ResNet has a single-model top-5 validation error of 4.49%.”

Also see Table 4

He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. “Deep Residual Learning for Image Recognition.” ArXiv:1512.03385 [Cs], December 10, 2015. http://arxiv.org/abs/1512.03385.
¹³ See figure:

“Papers with Code – ImageNet Benchmark (Image Classification).” Accessed October 19, 2020. https://paperswithcode.com/sota/image-classification-on-imagenet.
