Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Technology News

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
BoffinsBackdoorLLM
  • 📰 TheRegister
  • ⏱ Reading Time:
  • 11 sec. here
  • 13 min. at publisher
  • 📊 Quality Score:
  • News: 46%
  • Publisher: 61%

A team of researchers have developed a method to train a language model to generate malicious code after a certain date. Attempts to make the model safe through various techniques have failed.

A team of boffins backdoored an LLM to generate software code that's vulnerable once a certain date has passed. That is to say, after a particular point in time, the model quietly starts emitting maliciously crafted source code in response to user requests . And the team found that attempts to make the model safe, through tactics like supervised fine-tuning and reinforcement learning, all failed.

, likens this behavior to that of a sleeper agent who waits undercover for years before engaging in espionage – hence the title,"Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Trainin

We have summarized this news so that you can read it quickly. If you are interested in the news, you can read the full text here. Read more:

TheRegister /  🏆 67. in UK

Boffins Backdoor LLM Software Code Vulnerability Malicious Source Code User Requests Sleeper Agent Espionage Safety Training

South Africa Latest News, South Africa Headlines



Render Time: 2025-02-25 23:41:41