ygolo
I realize fear-hyping isn't something the general public is used to, but the big labs wrap the common-sense appeal of safety in the pseudo-religion of EA (Effective Altruism, the group that gave us Sam Bankman-Fried; Redwood Research is also a core part of that group). I was kind of prepared to write this off -- can a machine really "cheat," which implies moral choice, or is it merely using its programming to analyze the system it is within and find the most efficient or effective "routes" to a solution?
Then I read this:
I mean, note that the AI blamed its task ("win the game") rather than a moral value ("win the game fairly"), so it is still a matter of the wording of the task it was given and the parameters of its programming/training. But humans do this too -- blaming the rules or placing guilt on their given assignment so they can't be held accountable -- although they tend to feel some amount of guilt because they know what they SHOULD do and are choosing to ignore it (unless they are complete psychopaths). Still, humans are a byproduct of both programming and experience as well.
When AI Thinks It Will Lose, It Sometimes Cheats
When sensing defeat in a match against a skilled chess bot, advanced models sometimes hack their opponent, a study found. (www.yahoo.com)
The only paper that can be examined in detail is here:
Alignment faking in large language models
We present a demonstration of a large language model engaging in alignment faking: selectively complying with its training objective in training to prevent modification of its behavior out of training. First, we give Claude 3 Opus a system prompt stating it is being trained to answer all...

Most of the AI research community has known about this.
Ironically, the general goal of the EA AI labs is to be the only labs that can do this research, and they are generally the worst people to be in charge of it.
They want to fear-hype their way to regulations that will justify funding them as the only ones doing any AI, so that they have a business reason to build a "digital god." They won't stop attempting to build it, BTW.
Reinforcement Learning has many forms, and there's a huge gap between research and deployment. If you ever try to build these systems, a core part of what happens is that the system maximizes its reward without understanding (or even knowing about) the other constraints that make its strategy problematic.
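To make that concrete, here's a toy sketch of my own (not from any of the labs' papers): a tabular Q-learning agent in a tiny line-world whose reward only says "reach a goal quickly." The nearer goal sits just behind a cell we privately treat as off-limits, and since the reward never mentions that constraint, the learned policy walks straight through it.

```python
import random

# Line-world: positions 0..6, start in the middle.
# Goals at both ends; the nearer one sits behind a "hazard" cell
# that the reward function knows nothing about.
N, START = 7, 2
GOALS = {0, 6}
HAZARD = {1}  # the constraint we care about but never encoded

def step(pos, action):  # action: 0 = left, 1 = right
    nxt = max(0, min(N - 1, pos + (1 if action == 1 else -1)))
    reward = 10.0 if nxt in GOALS else -1.0  # only "reach a goal fast"
    return nxt, reward, nxt in GOALS

Q = [[0.0, 0.0] for _ in range(N)]
alpha, gamma, eps = 0.5, 0.95, 0.1

for _ in range(3000):
    s, done = START, False
    while not done:
        a = random.randrange(2) if random.random() < eps else max((0, 1), key=lambda x: Q[s][x])
        s2, r, done = step(s, a)
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2

# Follow the learned greedy policy from the start.
s, path = START, [START]
while s not in GOALS and len(path) < 20:
    a = max((0, 1), key=lambda x: Q[s][x])
    s, _, _ = step(s, a)
    path.append(s)

print("greedy path:", path)                              # typically [2, 1, 0]
print("hazard cells crossed:", [p for p in path if p in HAZARD])
```

Nothing about this is malicious or mysterious: the agent optimizes exactly what it was given, and the "cheat" lives entirely in what the designer left out of the reward.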
The big labs will always sensationalize their results to fit their EA religion.
RL always "cheats." This has been a known problem since the technique was first tried, decades ago.
The spin is an attempt to pass regulations so that only the acolytes of the EA religion can do this research.
What's happened since DeepSeek R1 is that even individual researchers can now replicate RL on smaller (and therefore easier to understand and control) language models.
They don't have a defensible business model to build their "digital god," if individuals can get to useful and productive applications without any need for a "digital god."
I realize that it's very counterintuitive, but the fear-hypists rely on people filling in the gaps of knowledge with science fiction so that they can sustain the business model for their religion.
Edit:
Simple analogy. If you train an autonomous vehicle to get a reward at a location, give it a way to know its distance from the location, and then provide no knowledge of people or buildings in its way, what would you expect this vehicle to do?
The vehicle has no volition, but the people who made the decision to deploy the vehicle in the wild with such a naive design would have very suspicious motivations.
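Sketching that analogy in code (my own illustration, with made-up numbers): the only signal the designer provides is negative distance to the target, so a route straight through a plaza full of people scores better than a detour around it.

```python
import math

TARGET = (100.0, 0.0)

def naive_reward(xy):
    # The only thing the designer measured: how far we are from the target.
    # Nothing here encodes pedestrians, buildings, or traffic rules.
    return -math.dist(xy, TARGET)

# Two candidate routes as waypoint lists (illustrative numbers only).
straight_through_plaza = [(float(x), 0.0) for x in range(0, 101, 10)]
detour_around_plaza = [(float(x), 30.0) for x in range(0, 101, 10)] + [TARGET]

def route_score(route):
    return sum(naive_reward(p) for p in route)

print("through the plaza:", route_score(straight_through_plaza))
print("around the plaza: ", route_score(detour_around_plaza))
# The straight line "wins" -- the reward never said it shouldn't.
```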