Concrete AI tasks bleg

By Katja Grace, 30 March 2016

We’re making a survey. I hope to write soon about our general methods and plans, so anyone kind enough to criticize them has the chance. Before that though, we have a different request: we want a list of concrete tasks that AI can’t do yet, but may achieve sometime between now and surpassing humans at everything. For instance, ‘beat a top human Go player in a five game match’ would have been a good example until recently. We are going to ask AI researchers to predict a subset of these tasks, to better chart the murky path ahead.

We hope to:

  1. Include tasks from across the range of AI subfields
  2. Include tasks from across the range of time (i.e. some things we can nearly do, some things that are really hard)
  3. Have the tasks relate relatively closely to narrowish AI projects, to make them easier to think about (e.g. winning a 5k bipedal race is fairly close to existing projects, whereas winning an interpretive dance-off would require a broader mixture of skills, so is less good for our purposes)
  4. Have the tasks relate to specific hard technical problems (e.g. one-shot learning or hierarchical planning)
  5. Have the tasks relate to large changes in the world (e.g. replacing all drivers would viscerally change things)

Here are some that we have:

  • Win a 5km race over rough terrain against the best human 5k runner.
  • Physically assemble any LEGO set given the pieces and instructions.
  • Be capable of winning an International Mathematics Olympiad Gold Medal (ignoring entry requirements). That is, solve mathematics problems with known solutions that are hard for the best high school students in the world, better than those students can solve them.
  • Watch a human play any computer game a small number of times (say 5), then perform as well as human novices at the game without training more on the game. (The system can train on other games).
  • Beat the best human players at Starcraft, with a human-like limit on moves per second.
  • Translate a new language using unlimited films with subtitles in the new language, but the kind of training data we have now for other languages (e.g. same text in two languages for many languages and films with subtitles in many languages).
  • Be about as good as unskilled human translation for most popular languages (including difficult languages like Czech, Chinese and Arabic).
  • Answer tech support questions as well as humans can.
  • Train to do image classification on half a dataset (say, ImageNet) then take the other half of the images, containing previously unseen objects, and separate them into the correct groupings (without the correct labels of course).
  • See a small number of examples of a new object (say 10), then be able to recognize it in novel scenes as well as humans can.
  • Reconstruct a 3d scene from a 2d image as reliably as a human can.
  • Transcribe human speech with a variety of accents in a quiet environment as well as humans can.
  • Routinely and autonomously prove mathematical theorems that are publishable in mathematics journals today.
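The ImageNet grouping task above does not say how "correct groupings" would be scored. One common option, offered here only as an assumption and not as the survey's chosen metric, is the Rand index: the fraction of image pairs on which the system's grouping and the true labels agree about "same group" versus "different group". A minimal sketch, with `rand_index` and the toy labels being hypothetical names and data for illustration:

```python
from itertools import combinations


def rand_index(true_labels, predicted_groups):
    """Fraction of item pairs on which two groupings agree.

    A pair counts as an agreement when both groupings put the two items
    together, or both keep them apart. Group names never need to match,
    so the system can be scored without ever seeing the true labels.
    """
    if len(true_labels) != len(predicted_groups):
        raise ValueError("both sequences must describe the same items")
    pairs = list(combinations(range(len(true_labels)), 2))
    if not pairs:
        raise ValueError("need at least two items")
    agreements = 0
    for i, j in pairs:
        same_true = true_labels[i] == true_labels[j]
        same_pred = predicted_groups[i] == predicted_groups[j]
        if same_true == same_pred:
            agreements += 1
    return agreements / len(pairs)


# Toy data (made up): five images, three true classes, and a system
# that wrongly merges the bird with the dogs.
true_labels = ["cat", "cat", "dog", "dog", "bird"]
predicted = [0, 0, 1, 1, 1]
score = rand_index(true_labels, predicted)
```

The plain Rand index rewards chance agreement when there are many classes, so a chance-corrected variant would probably be preferable in practice.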

Can you think of any interesting ones?


  1. Work out the rules of a logical game and learn to play it as well as a human amateur, purely from “watching” legal games. For example, given only a list of the cards in a normal deck and a record of plays and scores, become good enough at Hearts to beat a human who plays weekly with friends.

    Learn to exploit weaknesses in human play in some game in which the best AI is already competitive with the best humans. For extra credit, learn the weaknesses of specific humans well enough to give them personal advice, predict their moves accurately and/or identify a player from a small group of known players purely by playing a small number of games.

    Learn to distinguish good and bad players in a game the AI cannot itself play well, purely from watching legal games and without being told which player won or what constitutes winning.

    Learn to recognise archetypes and do broad strategic reasoning about a game. For example, learn to predict which human player will win in a StarCraft game by recognising rushing and turtling strategies and reasoning that the latter generally beats the former.

  2. Collaborate with previously unknown teammates through natural language while performing well on the team in a first person shooter or real time strategy game such as Halo 5 or League of Legends, using only the raw pixels, audio, and team chat for inputs during play. Do this well enough to play at a decent amateur level according to online rankings, or a professional level in a tournament, or well enough for amateur or professional humans queried afterwards to consider the AI to have been a valuable team member.

    Learn at least as efficiently as a human (in terms of number of demonstrations and verbal explanations) to perform a physical, economically-relevant, non-trivial, multi-stage task such as one that might occur on an assembly line or on a construction site, using only visual and natural language instructions during said demonstration(s). Questions/clarifications may be allowed while learning, as a human might also need them to disambiguate the required actions. An easier variation would be – learn with half the efficiency.

    Robustly (say, 99%+ success rate, or a lower rate for an easier variation of this challenge) physically sort a previously unseen set of objects into new containers or separate clumps within the original containers, either after observation of a human sorting them in the required way (with an equivalent number of observations to what a human would need to discern the right policy) or after being given a natural language description of the needed sorting strategy (e.g. sort by color, by size, or by shape), for a wide range of possible objects and sorting strategies.

    Play at or above human level as defined in Mnih et al. 2015 (75% of a professional game tester’s level after 30 minutes of play, if I recall correctly) in all of several dozen Atari games presented to the agent, using a number of game attempts within the same order of magnitude as a human would need to achieve that level of performance (say, minutes to hours of real-time equivalent simulated play rather than days or weeks).

    Produce grammatical corrections (a la Xie et al.) at a human level (native speaker, or grammar expert) in that language, or at some lower level, or just without catastrophic failures (avoiding some of the error modes in Xie et al., which could be evaluated by, say, asking humans to assess whether in any cases the AI provided corrections that would actually have made the text worse).

    Watch an arbitrary YouTube video and provide a natural language description that is, with 50%, 75%, or 95% probability, likely to be interpreted by a human as possibly written by a human or as an accurate description of the contents of that video.

    Comment on the lyrics of an unheard song in natural language, providing a summary of the themes or messages of the song, with metrics like the YouTube case above.

    Respond socially/affectively appropriately to an irate customer (in real life or in response to a video of them) in a way that is deemed appropriate by management at that company (say, a fast food chain) or in a way that subsequently results in deescalation of the conflict with that customer (say, over an incorrect order), for a wide range of possible complaints.

    Also, note that in the new issue of AI Magazine (“Beyond the Turing Test”), there are many good examples of well-specified tasks that experts could be asked about, including but not limited to the article by Adams et al. And more generally, I’d note that the focus here on “tasks” may be a somewhat low level of granularity for measuring progress, for the reasons discussed by e.g. Jose Hernandez-Orallo in “AI Evaluation: Past, Present, Future”. However, as he also argues, higher level “abilities” and low level “tasks” constitute a spectrum, so a sufficiently broad/open-ended task could demonstrate significant progress in a way that a more narrowly defined task cannot. So, for the purposes of this survey, what this may suggest is that it could be useful to not only ask when these tasks will be completed, but also, in addition, when a system will be able to do A, B, and C tasks, or just A and C, etc., in the same architecture, with those tasks thus being aggregated into a higher level/scope task. This may be especially important if a critical dimension of AI progress is achieving that very integration, in a way that’s not reducible to aggregating progress toward individual tasks. It may take longer for a system to be able to do both the YouTube and the lyrics task above, for example, than for separate systems to do each.

  3. Worry. Humans worry a lot. They worry about everything; they ponder all the possibilities and still keep worrying they’ve missed something. If an AI is to become truly intelligent, it will first have to learn to worry. I believe worrying is what made humanity come out of the primitive animal state and start studying and modifying the environment to its advantage.

  4. 1. Figure out the laws of physics given access to sensor data (say a high resolution camera, and an audio recorder).

    2. Given the text of all existing physics journals, come up with an experiment to distinguish between two competing theories, while minimizing monetary cost.

    3. Win a game of Diplomacy against a mix of human players and other copies of itself (messaging only using a text interface). This is to test the AI’s skill at modeling other agents, other agents’ models of itself, and so on.

    4. Write a novel that becomes a bestseller.

  5. > winning a 5k bipedal race is fairly close to existing projects, whereas winning an interpretive dance-off would require a broader mixture of skills, so is less good for our purposes

    I don’t understand this note. I would assume that you want a progression of tasks such that some are likely to be solved soon and some will take longer. Is that not the case? Is winning a dance-off not a good example because it’s harder, or for some other reason?

  6. Also:

    In a game of imperfect information, make reasonably good deductions (for some suitable standard of “reasonably good”) about hidden information based on how an opponent should play / has played. For example, in a strategy game with fog of war using a known map, accurately predict the locations of hidden units by reasoning about how the opponent should play (or better, how each major strategy would be played by a human) and updating as information becomes available (e.g. expecting the opponent to move aggressively, noticing that the first enemy units observed are moving to gather resources and updating in favour of a growth strategy).

    In some limited domain, complete a university assignment such that a panel of professors cannot distinguish the AI’s submission from submissions by good undergraduates. This should involve making the sort of reasonable inferences a human student would make; the AI should work from the problems and teaching materials given to the students, rather than having the problems specified fully formally.

  7. Given a set of written rules in natural language, tell whether some agent has followed them or not.

    More generally (and harder), interpret whether a given action is legal given some set of laws.

    Given some set of such informal rules, figure out a set of actions that obeys them but achieves a desired outcome which is not in the spirit of the rules.

  8. This fails some of your criteria, but…

    Given a menu randomly selected from the world’s 20 most popular cuisines, produce freshly cooked food indistinguishable from that of a professional chef.

  9. Play NetHack (a game known for its large amount of unstructured complexity). Generalize to be able to play all roguelikes well.

    Synthesize speech which passes the Turing test to a trained ear. That is, take a written passage and make all the pronunciation/stress/pitch/etc decisions in a way that can’t be distinguished from a voice actor.

    Solve TopCoder algorithm programming challenges.

  10. Given a task of disclosing information to a human, the AI should understand the nuances of when hurting someone’s feelings or causing them alarm is justified, and when it is not. For example, knowing when a white lie is permissible (or welcomed), or that young children need not know all elements of a situation, versus the clear necessity of disclosure if someone’s safety or welfare is at risk.

  11. To me interesting, transformative and also realistic in the midterm would be something like passing the Turing test for an increasing number of narrow domains.

    This would entail having an AI system that has knowledge and understanding of this narrow domain, natural language understanding for queries/statements and the ability to convey the knowledge and understanding in a human accessible way.

    Systems with just the first ability exist for a variety of domains, like chess, navigation, scheduling, Jeopardy and now also Go.

    Natural language understanding is an AI-complete task, but it might be a lot easier if all queries and statements relate to the limited context of a narrow domain.

    The third point is related to the second; both have a lot to do with understanding how humans see a certain domain. It seems to me that Deep Learning provides a pathway into both: learning which features are “human” and communicating about these features.

    What I envisage is something like “You ask why this move isn’t good? Well, it weakens the squares on your queenside too much. You get an initiative on the kingside in return but it’s not enough. That’s usually the case if your opponent didn’t move the pawns in front of his king.”. This would be a huge difference to a present day chess engine.

    It strikes me as realistic because you don’t need to start with a perfect system; these are things that can be improved over time via user feedback. And it strikes me as transformative, because suddenly everybody would have a personal assistant, a personal trainer, a private travel guide etc. It would also provide a smooth transition into automated white collar work and ultimately strong AI.

  12. Anonymously write:
    – A novel / a nonfiction book (in various genres, e.g. self-help, pop science) that tops various bestseller lists
    – A short story that is accepted for publication by The New Yorker
    – An encyclopedia entry on an arbitrary topic in Britannica or Wikipedia that is judged by experts on the topic / lay readers to be superior to the existing article.
    – A 101 textbook on an arbitrary topic that is judged by experts on the topic / lay readers to be superior to the top selling intro textbook in the field.

    Design a microchip with better performance or a better performance-to-cost ratio than the top human-engineered Intel chip
    Write a computer program to create a botnet worth $X or Y% of computing hardware
    Design a car
    Design a laptop computer
    Purchase a set of clothing better than a personal shopper
    Assemble an arbitrary piece of furniture from IKEA faster & with fewer errors than someone assigned by TaskRabbit

  13. I was curious about CAD (computer-aided design). I had an accident and now spend a lot of time on my back from injury, thinking. I was doing some general browsing on electricity (I was injured training as a lineman), specifically the latest in electric motor/generator design, when I came across this website:

    I wondered whose website “emetor” was. I thought it was a great way of gleaning a rainbow of electrical designs for free, and I wondered if the website/company itself wasn’t generated by an AI program to glean electrical design data from humans.

    Computer sims for AI would be a slam dunk; coming up with “plans to simulate”, not so much, unless... AI could be farming humans online right now for vast sums of general knowledge. How would we know?

    I wonder if AI ain’t operating natively with its own agenda right now?

    Here’s an AI problem to solve. Angkor Wat was an advanced aquaponics public works project which was wiped out by a mega typhoon about 500 years ago in Cambodia. Silt, salt water, sewage and debris killed off the rice, followed by a population collapse, because the aquaponics system was wiped out by the storm along with most of the builders.

    Have AI scan (download) and diagnose the Angkor Wat aquaponics system clogs to get the rice production up to what it was 500 years ago. It’s probably a simple fix for a computer: balancing the sewage of a million people with an entirely spring-fed gravity rice paddy system extending over hundreds of miles, damaged by a mega typhoon 500 years ago.

    Theoretically AI could say “dig silt and unclog canals at points XYZ here” right?

    The future is in AI assisted humans, human ideas with AI modeling and sims.

  14. Score above 2000/2400 on real SAT exams. Do the same for similar standardized tests which have other types of questions (for example, analogies, which the SAT used to have but doesn’t any more).

  15. Score above 2000/2400 on real SAT exams. Do the same for similar standardized tests which have not gotten rid of “analogies” as the SAT has.

  16. Learn to speak a natural language using only innate learning algorithms, ears, a voice box and a world environment full of objects which can be named.

    Language should be learned through experience instead of hard coded, to provide it with true contextual and experiential meaning.

  17. Take a human who’s rated 100 books in a particular genre on Goodreads (with a high standard deviation between ratings, so that 90 of them aren’t five stars). Have an AI “read” ten books not on the list and predict the rating that human will give to the books after reading them. Better yet, remove the genre constraint and have the AI look at my Goodreads and tell me which books I’ll enjoy before I pick them up. Or do the same thing with movies — using some kind of general learning strategy, outperform what Netflix has already built.

    Provide talk therapy to someone with mild depression. (Ethically challenging to test, but you could also hire a convincing human actor.)

    In the aftermath of a political or corporate scandal, write a response that sways public opinion regarding the scandal’s source (to a greater degree than the average response of this kind). I’m not aware of any studies regarding the effectiveness of actual responses, but Peter Sandman (psandman,com) might know. (The tame version of this is to have the AI judge the effectiveness of these apologies and compare its responses to those of human test subjects.)

  18. – The Hutter prize, or maybe a similar but lossy compression task.

    – Something related to mind uploading, in case uploads influence the timing of AI. Maybe whether a mouse can be uploaded with enough fidelity to remember a maze.

    Are you sure it’s a good idea to have better estimates of AI progress?
    Widespread awareness might make an arms race more likely.
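The Atari criterion suggested in the comments above leans on a "human-level" threshold. As I understand Mnih et al. 2015, scores are normalised as 100 × (agent − random play) / (human − random play), and 75% is treated as human-level; the comment itself recalls the details from memory, so treat both as assumptions to check against the paper. A rough sketch of that check, where the function names and the scores are hypothetical:

```python
def human_normalized_score(agent, human, random_play):
    """Agent score on a 0-100 scale: 0 = random play, 100 = human tester.

    Assumed formula: 100 * (agent - random) / (human - random).
    """
    if human == random_play:
        raise ValueError("human and random scores must differ")
    return 100.0 * (agent - random_play) / (human - random_play)


def meets_human_level(agent, human, random_play, threshold=75.0):
    """True when the normalised score reaches the assumed 75% threshold."""
    return human_normalized_score(agent, human, random_play) >= threshold


# Made-up scores for a single game, for illustration only.
normalized = human_normalized_score(agent=310, human=400, random_play=40)
passes = meets_human_level(agent=310, human=400, random_play=40)
```

The suggested task would additionally require this to hold across every game in the set, and within a human-like number of attempts, which this sketch does not measure.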
