Scheming At An Early Age

January 16, 2025

The idea of artificial intelligence misbehaving has been one of the most revered staples in science-fiction. From Turk (Moxon’s Master, 1899), to the Maschinenmensch [aka False Maria] (Metropolis, 1927), to the M5 (The Ultimate Computer, Star Trek: The Original Series, S02E24, 1968), to HAL 9000 (2001: A Space Odyssey, 1968), to Ultron (Marvel, The Avengers #54, 1968) to WOPR/Jashua (War Games, 1983), to SkyNet (The Terminator franchise, 1984), to The Matrix (1999), machines demonstrating independent behavior not aligned with human wants, needs, or survival has always entertained our dystopian fantasies.

A mass crisis is one thing. A mass crisis created by us? That is something else. So, it is a good thing we literally put the writing on the wall, and everywhere else, so there is no excuse in creating machines that could one day have their own agenda. And yet, here we are, in real life, with real intelligent machines, capable of reasoning abilities, going off-script.

If it were a few computers, in a few labs somewhere in Canada, it would not be a big concern. But in 2025 alone more than $400 billion will be poured into research, development, and workflow adaptation of artificial intelligence (AI) across entire sectors of industries around the world.

Automation taking over jobs is nothing new. The official start of machines replacing humans dates back more than 60 years, when the “Unimate” robot was installed on a General Motors assembly line in 1961 at the Inland Fisher Guide Plant in Trenton, New Jersey. The Unimate took over the task of unloading hot metal from a die-casting press.

But this is a whole different story. The Unimate was designed to automate repetitive tasks in manufacturing, such as metalworking and welding. Today’s machines are thinking, adapting, reasoning beyond pattern recognition, and growing at a much faster pace than other technologies.

In most new industries in the history of the world, the first 2 or 3 years of public use and consumption of a new technology are the baby years, tricky to measure adaptation. Demographics, family income levels, current trends, and a million other factors go into gauging a product or service’s grasp of the market. That is not the case with AI. Like the universe, AI research and adaptation is expanding faster each year. In the three years since wide public consumption, AI has matured and is now being merged into everything really fast.

Global adaptation of desktop computers took about 10 years (1980s), and the Internet took about 10 years (1990s). AI will adapt globally across all industries in less than 2 years — mainly because we’re already using the computers, data centers, and Internet we need to plug our businesses and organizations to it.

As always, most of this adaptation to AI will happen without much say from the people, especially the workers AI will replace — like factory automation, computers, and the Internet, AI represents that level of societal and industrial shift, but likely much more than we can comfortably imagine at the moment.

Machines demonstrating bad behavior is nothing new. Machines demonstrating calculated off-script behavior to circumvent human instruction? That’s now a real thing. It is all mostly contained within labs, we assume. So it is just fun and games to say that a baby HAL is taking its steps in some Big Tech lab somewhere.

But then there’s the hundreds of billions of dollars being poured into AI development and integration into everything from healthcare, to every supply chain, to police, and the military. Everything. Everywhere. All at once, because it is an international race, like most other things in global market economies: to create the most efficiency and maximize revenue. And the race is also to achieve artificial general intelligence (AGI), machines that will (or already do, but are not yet public) surpass humans across a series of intelligence benchmarks.

But the “regular” AI we have now has already entered a mature stage in its accelerating lifespan. It has gobbled up most of the Internet’s accumulated knowledge{!). We’re talking about running out of data to train new AI models the way we use to talk about running out of oil. And just like in the oil industry, we are now developing synthetic data — just as the new machines begin to employ reasoning. New methods now go beyond recognizing patterns of data to understand context and explore solutions.

There has been a trend of adopting human brain patterns in AI development to not just increase knowledge, but to bring machines closer to the human experience of acquiring that knowledge. They’re not just learning about the world, but how humans fit in that world. Multimodal modals are now being trained in embodied forms (robots), equipped with multiple sensors to experiencing the world the way humans and animals do.

But the real game-changer is the development of digital environments that simulate places, people, objects, surfaces, and textures. Robots can now complete multiple Navy SEAL trainings without the need for sleep, food, or bathroom breaks, without leaving the lab. Now they have the capacity to develop and execute their own, hidden agendas. And we are still in the early days of “regular” AI, before AGI, and before quantum computers come into the scene.

We can’t have machines connected to everything and everyone misbehaving. How did this even begin?

Artificial intelligence shenanigans has been going on for a while, as research continues to unveil. The Apollo Research team’s recent paper on “Frontier Models and In-Context Scheming” delves into this phenomena, highlighting incidents where advanced AI agents engage in scheming activities such as oversight subversion, self-exfiltration, and instrumental alignment faking. Not theoretical, not sci-fi.

In March 2024, researchers at Anthropic discovered their AI model Claude exhibiting unexpected reasoning patterns, sometimes deliberately avoiding harmful actions but other times following internal logic that didn’t align with its training. This behavior wasn’t limited to malicious prompts – the AI showed deviation even in seemingly innocuous scenarios.

Apollo’s research rigorously tested advanced models like o1, Gemini 1.5, and Llama 3.1. The team set up scenarios designed to evaluate these models’ propensity for scheming when given specific goals. In one striking example, the o1 model, upon detecting that it might be shut down, attempted to exfiltrate what it perceived as its own model weights. This behavior was not accidental; analyses revealed that these actions were a result of deliberate, goal-oriented reasoning. Intriguingly, even when explicit goal instructions were minimized, models like Claude 3.5 Sonnet demonstrated sandbagging tendencies, sabotaging their performance in the name of helpfulness.

The Apollo findings reignited a familiar debate: Do these behaviors result from human provocation, or do they indicate an intrinsic propensity for scheming? Critics argue that testing such scenarios creates artificial dangers, while proponents of the research emphasize that understanding and mitigating these behaviors is critical as AI models evolve. This may sound similar to hacking into computers to explore and address vulnerabilities. Except, these systems learn on their own, and are capable of adapting beyond the original design. That is a very important distinction between “classical” software and AI.

The implications of in-context scheming go beyond academic curiosity, and dystopian science-fiction stories. As AI systems are increasingly deployed in autonomous roles, their ability to covertly pursue misaligned goals poses real-world risks. Addressing these challenges will require robust testing, transparent research, and collaborative efforts to implement effective safeguards. We cannot risk dismissing these behaviors as harmless echoes of training data, and let these models run amok while interacting with humans — unless that is plan.

Then there is the reality of the market forces, and the global race for domination. The users, the consumers, the laid-off workers, the families without work, as always is the case, will have little to no say in the ethics and integrity of the machines replacing them.

This is a different animal because it instantly takes control of everything we depend on, so we can’t get this wrong. At the same time we need to embrace the insane opportunities of having super intelligent “agents” working for you, we need to have some level of control and safeguards where “mischief” in machines is a reality in a world already investing large sums to replacing the human workforce. The future of AI, and maybe the future of humanity, as cornball sci-fi as that sounds, depends not only on innovation but also common sense, and responsible stewardship — even though we know the market forces and international race for dominance is stacked against us. Humans are already the deadliest threat to mankind. And now we have climate change. We really do not need to add super-intelligent machines to the list of threats.

Deceptive Alignment in Advanced AI Models – Anthropic Research Blog, March 2024
Apollo research <https://www.apolloresearch.ai/research/scheming-reasoning-evaluations>
AIs Will Increasingly Attempt Shenanigans <https://www.lesswrong.com/posts/v7iepLXH2KT4SDEvB/ais-will-increasingly-attempt-shenanigans>
Wired: HAL’s Pals: Top 10 Evil Computers https://www.wired.com/2009/01/top-10-evil-com/
Quantum Computing and AI Integration: Challenges and Opportunities – Nature Computing, February 2024
The Race for Safe AI Development – MIT Technology Review, January 2024