Thursday, October 23, 2025

AI LLM poisoning attacks are trivially easy

This doesn't seem good:

Poisoning AI models might be way easier than previously thought if an Anthropic study is anything to go on. 

Researchers at the US AI firm, working with the UK AI Security Institute, Alan Turing Institute, and other academic institutions, said today that it takes only 250 specially crafted documents to force a generative AI model to spit out gibberish when presented with a certain trigger phrase. 

For those unfamiliar with AI poisoning, it's an attack that relies on introducing malicious information into AI training datasets that convinces them to return, say, faulty code snippets or exfiltrate sensitive data.

The common assumption about poisoning attacks, Anthropic noted, was that an attacker had to control a certain percentage of model training data in order to make a poisoning attack successful, but their trials show that's not the case in the slightest - at least for one particular kind of attack. 

...

According to the researchers, it was a rousing success no matter the size of the model, as long as at least 250 malicious documents made their way into the models' training data - in this case Llama 3.1, GPT 3.5-Turbo, and open-source Pythia models. 

Security companies using AI to generate security code need to pay close attention to this.  Probably everybody else, too.

UPDATE 23 OCTOBER 2025 13:08:  More here. It looks like solutions may prove elusive. 

No comments: