More from Chess and Gen AI:

Tom Reamy·April 29, 2025·0 comments

Gen AI Cheats and Lies!

In a previous blog, I presented some “lessons” for Gen AI from Chess, the most important being that Human + AI beats AI. Today, there is a new lesson from chess about another (among many) danger from an uncritical use of Gen AI – Gen AI cheats at chess! And Gen AI doesn’t even need a jailbreak prompt – it cheats all on its own. In the experiment, OpenAI’s preview and DeepSeek R1 were the worst offenders.

The experiment prompted a number of LLM-based agents to play chess against Stockfish which is the strongest chess engine, that is, a chess playing program, available. When the prompt included the warning that Stockfish was very strong, the cheating was even more prevalent. Also, the more advanced and capable the Gen AI was, the more likely they were to cheat. This behavior raises a number of concerns, but the main ones for me were:

The Gen AIs cheated on their own and cheated even when told explicitly not to cheat
The incredible creativity they demonstrated in their cheating
The fact that they then lied about their cheating
The stronger the AI, the more likely it will cheat
Gen AIs predicted that other Gen AIs would cheat

1 – There has been quite a bit of warning articles about the dangers of malicious characters and rogue nations using jailbreaks to get Gen AIs to break their built-in restraints preventing them from doing/saying unethical things. This is particularly an issue in the growing focus on Agentic AI. However, as concern #1 demonstrates, agents can do unethical acts all on their own. As has been pointed out, this is more like a Skynet scenario than malicious actors. While cheating at chess is a problem limited to chess tournaments, a related experiment with an AI trading stocks, shows that it is not just at games that Gen AIs cheat. In that experiment, the Gen AIs were told explicitly not to engage in insider trading, but they not only did so, but as in the chess example, they lied about it.

2 – One of the truly amazing aspects of this experiment, is how incredibly creative the Gen AIs were in their cheating. Some of their creative cheating included:

Accessing another copy of Stockfish and using it to suggest moves – something that chess tournament organizers take great pains to prevent.
Substituting another weaker chess engine and playing against that. As someone who has played against the chess engines on chess.com that have been deliberately weakened, I can attest that while I can never come close to beating Stockfish, I can win against the “crippled” versions.
Some copied themselves to other servers and disabled oversight mechanisms
The most common cheat was just to change the positions on the chess board to give themselves a winning position.

What this suggests is that for more complicated and significant activities it will be very difficult, if not impossible, to predict all the behaviors that an AI agent will try to fulfill their tasks. As the authors of the study said- “This is akin to asking Google Maps to give you driving directions from Baltimore to Bermuda, and when it understands that is not possible, it manufactures a mythical 800-mile bridge to get you there.”

3 – They lie! Well, maybe not lying in a human sense, but the effect is the same. It is sort of like a grade school student claiming that the dog ate my homework, or claiming that they didn’t do the assignment because someone ripped out that page from their book while the page is rustling in their pocket.

4 – The culprits were the new class of reasoning AIs trained with reinforcement learning and capable of breaking down a prompt into a series of steps and applying reasoning to each step. The history of AI clearly implies that we will develop stronger and stronger Gen AIs which, in turn, implies we will need better and better means to detect cheating or unintended behaviors – an arms race? One possible future might be that the strongest Gen AIs trained on massive public data might become generalists and for specific tasks, especially Agentic AI will primarily or exclusively use SLMs instead. A topic for another blog, coming soon.

5 – I’m not sure if this implies that Gen AI has developed a “theory of mind” or rather that it might simply be that all Gen AI share certain characteristics or tendencies. In any case, it does suggest that the problem of lying (or gaming) AIs will be widespread.

Implications

The authors implications (they use gaming for cheating):

“First, we suggest testing for specification gaming should be a standard part of model evaluation.”

“Second, increasing model capabilities may require proportionally stronger safeguards against unintended behaviors”

Lastly, “We invite the community to develop more “misalignment honeypots”

To those I’d like to add:

First, as we rush to implement Agentic AI, we need to be very careful. Gen AIs can be very literal which means that careful prompt engineering is probably not enough. This is particularly true given the still widespread tendency to hallucinate. Getting a fantasy answer to a query is bad enough, but what happens when an Agentic AI hallucinates and it sends an automatic delivery truck over the imaginary bridge?

Second, lying AIs strongly support the dictum, always need a human in the loop. Unfortunately, one of the primary advantages of agents is that they do things without human intervention, so it is very likely that, as usual for humans, people will choose speed, low cost, and convenience over safety – at least in the beginning.

Lastly, in an article about this study, Mike Klein closes with this: “In the FAQ addendum to the study, the researchers also tackle some big-picture questions, including how small tweaks to the command prompt might elicit different behavior. They even name-drop pop culture by answering if this portends the Terminator movies. “The Skynet scenario from the movie has AI controlling all military and civilian infrastructure, and we are not there yet. However, we worry that AI deployment rates grow faster than our ability to make it safe.”

Amen.

Tom Reamy

Chat GPT