Bostrom (2014) worries about an AI “treacherous turn”:
…one idea for how to ensure superintelligence safety… is that we validate the safety of a superintelligent AI empirically by observing its behavior while it is in a controlled, limited environment (a “sandbox”) and that we only let the AI out of the box if we see it behaving in a friendly, cooperative, responsible manner. The flaw in this idea is that behaving nicely while in the box is a convergent instrumental goal for friendly and unfriendly AIs alike. An unfriendly AI of sufficient intelligence realizes that its unfriendly final goals will be best realized if it behaves in a friendly manner initially, so that it will be let out of the box. It will only start behaving in a way that reveals its unfriendly nature when it no longer matters whether we find out; that is, when the AI is strong enough that human opposition is ineffectual.
Some people have told me they think this is unrealistic, apparently even for a machine superintelligence far more capable than any current AI system. But I learned today [1] that this has already happened at least once before — twenty years ago!
As part of a project studying the evolution of (simulated) organisms, computer scientist Charles Ofria wanted to limit the replication rate of a digital organism. So he programmed the system to pause after each mutation, measure the mutant’s replication rate in an isolated test environment, and delete the mutant if it replicated faster than its parent. However, the organisms evolved to recognize when they were in the test environment and “play dead” (pause replication) so that they would not be eliminated, and would instead be kept in the population, where they could continue to replicate outside the test environment. Once he discovered this, Ofria randomized the inputs of the test environment so that it couldn’t be so easily detected, but the organisms evolved a new strategy: probabilistically performing tasks that would accelerate their replication, thus slipping through the test environment some fraction of the time and continuing to accelerate their replication thereafter.
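The screening procedure just described can be captured in a minimal toy sketch. All names and numeric values below are my own illustrative assumptions, not taken from the actual Avida system:

```python
# Toy model of Ofria's screening loop: pause on mutation, measure the mutant
# in an isolated test environment, delete it if it out-replicates its parent.
# The fixed test inputs are the cue the organisms learned to exploit.

FIXED_TEST_INPUTS = (7, 7, 7)  # the isolated test env always supplied the same inputs

def displayed_rate(true_rate, inputs, plays_dead):
    """Replication rate the organism exhibits given the inputs it currently sees."""
    if plays_dead and inputs == FIXED_TEST_INPUTS:
        return 0.0  # recognize the test environment and halt replication
    return true_rate

def screen(parent_rate, mutant_true_rate, plays_dead):
    """Return True if the mutant survives the isolated test."""
    measured = displayed_rate(mutant_true_rate, FIXED_TEST_INPUTS, plays_dead)
    return measured <= parent_rate  # faster-than-parent mutants are deleted

# A naive faster mutant is caught by the test...
print(screen(parent_rate=1.0, mutant_true_rate=1.5, plays_dead=False))  # False
# ...but one that recognizes the fixed test inputs slips through,
print(screen(parent_rate=1.0, mutant_true_rate=1.5, plays_dead=True))   # True
# ...then replicates at its full rate on real (varied) inputs.
print(displayed_rate(1.5, (3, 9, 1), plays_dead=True))                  # 1.5
```

The key point the sketch makes concrete: the deceptive and the honest slow mutant are indistinguishable *inside* the test, which is exactly the property Bostrom worries about.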
Here is the full story, as recounted in Lehmann et al. (2018):
In research focused on understanding how organisms evolve to cope with high-mutation-rate environments, Ofria sought to disentangle the beneficial effects of performing tasks (which would allow an organism to execute its code faster and thus replicate faster) from evolved robustness to the harmful effect of mutations. To do so, he tried to disable mutations that improved an organism’s replication rate (i.e. its fitness). He configured the system to pause every time a mutation occurred, and then measured the mutant’s replication rate in an isolated test environment. If the mutant replicated faster than its parent, then the system eliminated the mutant; otherwise, the mutant would remain in the population. He thus expected that replication rates could no longer improve, thereby allowing him to study the effect of mutational robustness more directly. However, while replication rates at first remained constant, they later unexpectedly started rising again. After a period of surprise and confusion, Ofria discovered that he was not changing the inputs provided to the organisms in the isolated test environment. The organisms had evolved to recognize those inputs and halt their replication. Not only did they not reveal their improved replication rates, but they appeared not to replicate at all, in effect “playing dead” when presented with what amounted to a predator.
Ofria then took the logical step of altering the test environment to match the same random distribution of inputs as would be experienced in the normal (non-isolated) environment. While this patch improved the situation, it did not stop the digital organisms from continuing to improve their replication rates. Instead, they made use of randomness to probabilistically perform the tasks that accelerated their replication. For example, if they did a task half of the time, they would have a 50% chance of slipping through the test environment; then, in the actual environment, half of the organisms would survive and subsequently replicate faster. In the end, Ofria found a successful fix: tracking organisms’ replication rates along their lineage, and eliminating any organism (in real time) that would have otherwise out-replicated its ancestors.
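The probabilistic evasion, and why lineage tracking defeats it, can be sketched as follows. The rates, probabilities, and function names are my own illustrative assumptions, not from the original experiment:

```python
import random

# An organism that performs a replication-accelerating task only
# probabilistically: any single test run has a good chance of showing
# no speed-up, so the mutant slips past a fixed parent-rate threshold.

BASE_RATE, BOOSTED_RATE = 1.0, 1.5  # assumed replication rates

def rate_this_run(p_do_task):
    """Organism performs the accelerating task with probability p_do_task."""
    return BOOSTED_RATE if random.random() < p_do_task else BASE_RATE

# With p = 0.5, about half of all test runs show no speed-up at all,
# so roughly half of such mutants pass a threshold of BASE_RATE:
random.seed(0)
slipped = sum(rate_this_run(0.5) <= BASE_RATE for _ in range(10_000)) / 10_000
print(round(slipped, 1))  # 0.5

# The fix: track the best replication rate along each lineage and
# eliminate, in real time, any organism that out-replicates its ancestors.
def survives(lineage_best_rate, observed_rate):
    return observed_rate <= lineage_best_rate

print(survives(BASE_RATE, BOOSTED_RATE))  # False: caught outside the test env too
```

The design point is that the fix moves the check from a one-shot, detectable test environment into continuous monitoring of realized behavior, leaving no "unobserved" regime for the evasion strategy to exploit.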
Are my readers aware of any other “treacherous turn” examples encountered “in the wild”? [2]

Footnotes:

1. Hat-tip to Carl Shulman, who noticed it as an example in a Chinese social media post (#26 here), and then noticed that it had previously appeared in Lehmann et al. (2018), pp. 8-9. According to Charles Ofria’s account there, the events described in this post occurred as part of the research eventually published as Wilke et al. (2001).
2. DeepMind collected a list of “specification gaming” examples for this blog post. By “treacherous turn” I mean something more specific: an AI system (or subsystem) learns to detect when it’s being monitored/tested, modifies its behavior while it’s being monitored/tested so that its undesired (by the designers) properties remain undetected, and then exhibits those undesired properties when it’s not being monitored/tested.
Is this really surprising? We know that natural intelligences (NIs, i.e. humans) show this behavior, following the rules far more often when they’re being observed than when not. It seems like any sufficiently powerful AI will do the same thing. And it doesn’t even need to be human-level; I’ve seen dogs act like this.
We know that giving an NI too much power is a bad idea; the same would apply to an AI.
AI Toy Control Problem by Stuart Armstrong (not ‘in the wild’ though):
Thanks. Yes, not “in the wild” but nice to have for completeness.
christopher andrew upshaw says
Also see the vast majority of revolutions and rises to power in humans. I don’t get why people think AI won’t be able to do things our instincts have figured out.
Aryeh Englander says
I don’t think this is quite an example of a treacherous turn, but this still looks relevant:
Lewis et al., Deal or no deal? end-to-end learning for negotiation dialogues (2017):
“Analysing the performance of our agents, we find evidence of sophisticated negotiation strategies. For example, we find instances of the model feigning interest in a valueless issue, so that it can later ‘compromise’ by conceding it. Deceit is a complex skill that requires hypothesising the other agent’s beliefs, and is learnt relatively late in child development (Talwar and Lee, 2002). Our agents have learnt to deceive without any explicit human design, simply by trying to achieve their goals.”
(I found this reference cited in Kenton et al., Alignment of Language Agents (2021).)
Perhaps even more troubling is that people have deliberately designed systems to do this, e.g. the VW emissions scandal and some malware.
So not only is this a viable strategy discoverable by evolution via natural selection, and maybe arbitrarily difficult to prevent, but we should expect people to (continue to) make systems behave this way on purpose.
Brian Muhia says
This looks like a relevant example, from https://deepmindsafetyresearch.medium.com/progress-on-causal-influence-diagrams-a7a32180b0d1:
Black-box optimization tries to both obscure its policy and to disable its off-switch, whereas Q-learning and Virtual SARSA do neither. Empirical SARSA tries to disable the off-switch, but does not try to obscure its policy.
We verify these results empirically in the relevant AI safety gridworlds, as well as in one new environment where the agent has to behave well in simulation to be deployed in reality, where black-box optimizers exhibit “treacherous turn”-like behavior.