In my 2013 article on strong AI forecasting, I made several suggestions for how to do better at forecasting strong AI, including this suggestion quoted from Phil Tetlock, arguably the leading forecasting researcher in the world:
Signposting the future: Thinking through specific scenarios can be useful if those scenarios “come with clear diagnostic signposts that policymakers can use to gauge whether they are moving toward or away from one scenario or another… Falsifiable hypotheses bring high-flying scenario abstractions back to Earth.”
Tetlock hadn’t mentioned strong AI at the time, but now it turns out he wants suggestions for strong AI signposts that could be forecast on GJOpen, the forecasting tournament platform.
@PTetlock Any thoughts on the AlphaGo victory?
— Brandon Wilson (@brandonwilson) March 20, 2016
important: it was one of our signpost indicators. It nudges probability we are on a strong AI scenario trajectory https://t.co/XuAcpjlWtm
— Philip E. Tetlock (@PTetlock) March 20, 2016
@PTetlock What other signpost indicators are you watching?
— Alexander Berger (@albrgr) March 21, 2016
evolving list will be on https://t.co/wsHsvAiV9h by July 1 (driverless ubers; robotics spending,…). Ideas welcome https://t.co/5jdSE8IzFO
— Philip E. Tetlock (@PTetlock) March 21, 2016
Specifying crisply formulated signpost questions is not easy. If you come up with some candidates, consider posting them in the comments below. After a while, I will collect them all together and send them to Tetlock. (I figure that’s probably better than a bunch of different people sending Tetlock individual emails with overlapping suggestions.)
Tetlock’s framework for thinking about such signposts, which he calls “Bayesian question clustering,” is described in Superforecasting:
In the spring of 2013 I met with Paul Saffo, a Silicon Valley futurist and scenario consultant. Another unnerving crisis was brewing on the Korean peninsula, so when I sketched the forecasting tournament for Saffo, I mentioned a question IARPA had asked: Will North Korea “attempt to launch a multistage rocket between 7 January 2013 and 1 September 2013?” Saffo thought it was trivial. A few colonels in the Pentagon might be interested, he said, but it’s not the question most people would ask. “The more fundamental question is ‘How does this all turn out?’ ” he said. “That’s a much more challenging question.”
So we confront a dilemma. What matters is the big question, but the big question can’t be scored. The little question doesn’t matter but it can be scored, so the IARPA tournament went with it. You could say we were so hell-bent on looking scientific that we counted what doesn’t count.
That is unfair. The questions in the tournament had been screened by experts to be both difficult and relevant to active problems on the desks of intelligence analysts. But it is fair to say these questions are more narrowly focused than the big questions we would all love to answer, like “How does this all turn out?” Do we really have to choose between posing big and important questions that can’t be scored or small and less important questions that can be? That’s unsatisfying. But there is a way out of the box.
Implicit within Paul Saffo’s “How does this all turn out?” question were the recent events that had worsened the conflict on the Korean peninsula. North Korea launched a rocket, in violation of a UN Security Council resolution. It conducted a new nuclear test. It renounced the 1953 armistice with South Korea. It launched a cyber attack on South Korea, severed the hotline between the two governments, and threatened a nuclear attack on the United States. Seen that way, it’s obvious that the big question is composed of many small questions. One is “Will North Korea test a rocket?” If it does, it will escalate the conflict a little. If it doesn’t, it could cool things down a little. That one tiny question doesn’t nail down the big question, but it does contribute a little insight. And if we ask many tiny-but-pertinent questions, we can close in on an answer for the big question. Will North Korea conduct another nuclear test? Will it rebuff diplomatic talks on its nuclear program? Will it fire artillery at South Korea? Will a North Korean ship fire on a South Korean ship? The answers are cumulative. The more yeses, the likelier the answer to the big question is “This is going to end badly.”
I call this Bayesian question clustering because of its family resemblance to the Bayesian updating discussed in chapter 7. Another way to think of it is to imagine a painter using the technique called pointillism. It consists of dabbing tiny dots on the canvas, nothing more. Each dot alone adds little. But as the dots collect, patterns emerge. With enough dots, an artist can produce anything from a vivid portrait to a sweeping landscape.
There were question clusters in the IARPA tournament, but they arose more as a consequence of events than a diagnostic strategy. In future research, I want to develop the concept and see how effectively we can answer unscorable “big questions” with clusters of little ones.
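To make the "answers are cumulative" idea concrete, here is a minimal sketch of how a cluster of small, scorable questions could nudge the probability of a big, unscorable one. The excerpt gives no formula, so everything here is an assumption: the odds-form Bayes update, the treatment of signposts as conditionally independent, and the names `Signpost` and `update_big_question`. The signposts and probabilities are made up for illustration.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Signpost:
    """A small, scorable question tied to a big scenario (hypothetical structure)."""
    name: str
    p_yes_if_true: float   # assumed P(signpost resolves "yes" | big scenario is true)
    p_yes_if_false: float  # assumed P(signpost resolves "yes" | big scenario is false)


def update_big_question(prior: float,
                        signposts: List[Signpost],
                        outcomes: Dict[str, Optional[bool]]) -> float:
    """Return P(big scenario) after folding in the resolved signposts.

    Assumes signposts are conditionally independent given the scenario,
    which is a simplification; real signposts are often correlated.
    """
    log_odds = math.log(prior / (1.0 - prior))
    for s in signposts:
        resolved = outcomes.get(s.name)
        if resolved is None:
            continue  # unresolved questions contribute nothing yet
        if resolved:
            ratio = s.p_yes_if_true / s.p_yes_if_false
        else:
            ratio = (1.0 - s.p_yes_if_true) / (1.0 - s.p_yes_if_false)
        log_odds += math.log(ratio)
    return 1.0 / (1.0 + math.exp(-log_odds))


if __name__ == "__main__":
    # Illustrative numbers only; they are not estimates from Tetlock or GJOpen.
    cluster = [
        Signpost("AI beats a top Go professional", 0.6, 0.3),
        Signpost("Driverless taxis carry paying passengers", 0.5, 0.25),
    ]
    resolved = {"AI beats a top Go professional": True,
                "Driverless taxis carry paying passengers": None}
    print(update_big_question(0.2, cluster, resolved))
```

Each resolved signpost shifts the odds by its likelihood ratio, so no single "dot" settles the big question, but many dots pointing the same way add up.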
(Note that although I work as a GiveWell research analyst, my focus at GiveWell is not AI risks, and my views on this topic are not necessarily GiveWell’s views.)
Some random thoughts (be warned, many overlap, and they vary a lot in terms of specificity):
1. Sales of GPUs ($, # of units, FLOPs/benchmarks).
2. Sales of CPUs (as above).
3. Total electricity consumption of data centers (total $, portion of world GDP, total wattage, % of world electricity consumption).
4. AI defeating world champion in Starcraft (Brood War, SC2).
5. World-champion level Go performance on a smartphone (already achieved in chess).
6. ImageNet classification accuracy.
7. Loebner Prize (for passing Turing Test, not for ‘most human’) won. Predictions about scores.
8. International RoboCup Federation challenge that by 2050 “a team of fully autonomous humanoid robot soccer players shall win the soccer game, comply with the official rule of the FIFA, against the winner of the most recent World Cup.”
9. CADE ATP System Competition results: annual automated theorem proving contest. http://www.cs.miami.edu/~tptp/CASC/
10. Word error rate for speech recognition software. Sales/downloads of speech recognition software. Share of text produced by speech recognition vs typing. Share of smartphone inputs from speech vs touchscreen.
11. Kurzweil has a lot of predictions that could be harvested, specified, and adjusted. https://en.wikipedia.org/wiki/Predictions_made_by_Ray_Kurzweil
12. Results in Hutter Prize for compression: https://en.wikipedia.org/wiki/Hutter_Prize
13. Elo rating of top computer chess programs.
14. Annual Computer Poker Competition results.
15. No-limit poker bots reaching world-class performance.
16. Robocars deployed on the road. #, sales, accident rates, speed.
17. International Federation of Robotics estimates of sales of industrial robots (# and $). https://en.wikipedia.org/wiki/Industrial_robot#Market_structure
18. Lights-out manufacturing by industry or output. https://en.wikipedia.org/wiki/Lights_out_(manufacturing)
19. Automation of crop harvesting.
20. Fast-food chain deployment of automated ordering systems (kiosks, smartphone ordering).
21. AAAI general game playing competition.
22. Machine translation used in high-performance applications/displacement of human translators.
23. Sales of household robots.
24. Hitting assorted benchmarks (vision, translation, etc) with less training data.
25. Sales of robotic surgery systems.
26. Robotics spend in warehouse logistics. Human worker-hours per product shipped, e.g. at Amazon.
27. Automated computer programming.
28. Significant conjectures (from pre-existing lists) in mathematics proved or disproved by automated theorem-proving systems.
29. Paper views, downloads, citations, publications using AI terms (absolute and proportional). Data from, e.g. Google Scholar and arXiv.
30. Usage of AI advisors by doctors/patients.
31. Revenue of IBM Watson group. CEO Virginia Rometty said she hopes for $10 billion in revenue within 10 years: https://en.wikipedia.org/wiki/Watson_(computer)#IBM_Watson_Group
32. Unsupervised learning catching up to supervised learning (or hitting various absolute standards) on performance benchmarks discussed above.
33. General video-game playing for games with long-term temporal dependencies and social elements (and little immediate feedback from a game score). MMOs, Diplomacy, Civilization, Zelda.
Usefully self-modifying software, even in some narrow domain, e.g. software that autonomously modifies its code in response to user interaction patterns.
Number of partially autonomous corporations: corporate entities with no human input in day-to-day tasks (with or without a human owner). Number of industries where this is possible.
Fully autonomous corporations: corporate entities that can operate entirely and arbitrarily long without human oversight. The first time this occurs and, later, the number of such companies and the number of industries where this is possible.
Percentage and value of global stock market trades performed autonomously.
Number of scientific fields in which a computer system develops and tests novel hypotheses.
Number of autonomous systems that have legally killed a human in war. Later, percentage of killings in war performed by autonomous systems.
Deployment of fully autonomous spacecraft for exploration or exploitation of natural resources.
“In future research, I want to develop the concept and see how effectively we can answer unscorable “big questions” with clusters of little ones.”
There’s also the old standby of making predictions about what a panel of judges or agency will say about a future situation to deal with some of the issues with specifying details. I’m using that for my bet against cold fusion, using a panel of three physicists.
Well-regarded AI-authored novel.
AI-written top 40 pop hit (this one seems pretty achievable now, notwithstanding the difficulties of negotiating the music industry).
Or more generally, when AI seems to be writing more and better cultural material than humans.
Machine translation considered comparable to human translation
(1) Commonsense reasoning
(2) Professional quality of translation
(3) Realization of IBM Watson and Wolfram on a fully differentiable system
More generally, see section 10 (“So what separates us from human-level AI?”) here:
http://stop-skynet.com/review-of-state-of-the-arts.pdf