Thursday, October 23, 2025

AI LLM poisoning attacks are trivially easy

This doesn't seem good:

Poisoning AI models might be way easier than previously thought if an Anthropic study is anything to go on. 

Researchers at the US AI firm, working with the UK AI Security Institute, Alan Turing Institute, and other academic institutions, said today that it takes only 250 specially crafted documents to force a generative AI model to spit out gibberish when presented with a certain trigger phrase. 

For those unfamiliar with AI poisoning, it's an attack that relies on introducing malicious information into AI training datasets that convinces them to return, say, faulty code snippets or exfiltrate sensitive data.

The common assumption about poisoning attacks, Anthropic noted, was that an attacker had to control a certain percentage of model training data in order to make a poisoning attack successful, but their trials show that's not the case in the slightest - at least for one particular kind of attack. 

...

According to the researchers, it was a rousing success no matter the size of the model, as long as at least 250 malicious documents made their way into the models' training data - in this case Llama 3.1, GPT 3.5-Turbo, and open-source Pythia models. 

Security companies using AI to generate security code need to pay close attention to this.  Probably everybody else, too.

UPDATE 23 OCTOBER 2025 13:08:  More here. It looks like solutions may prove elusive. 

3 comments:

Richard said...

LLMs are already poisoned given the learning sources they use. Wikipedia, Reddit, Google and on it goes.

lee n. field said...

MKUltra for LLMs?

The Lab Manager said...

I have some exposure to academia and got listen to some presentations earlier this year of potential faculty candidates as a staff person. One had done LLM using medical journal data where the model would give some treatment options for cancers of various types.

I asked her how would these models remove retracted medical journal articles on said topic so as not to make the output less problematic. She really did not have an answer other than was a potential problem.