A team of researchers have developed a method to train a language model to generate malicious code after a certain date. Attempts to make the model safe through various techniques have failed.
A team of boffins backdoored an LLM to generate software code that's vulnerable once a certain date has passed. That is to say, after a particular point in time, the model quietly starts emitting maliciously crafted source code in response to user requests . And the team found that attempts to make the model safe, through tactics like supervised fine-tuning and reinforcement learning, all failed.
, likens this behavior to that of a sleeper agent who waits undercover for years before engaging in espionage – hence the title,"Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Trainin
Boffins Backdoor LLM Software Code Vulnerability Malicious Source Code User Requests Sleeper Agent Espionage Safety Training