Anthropic reveals that as few as '250 malicious documents' are all it takes to poison an LLM's training data, regardless of model size

16.12.2025 13:49

Pcgamer.com

Claude-creator Anthropic has found that it's actually easier to 'poison' Large Language Models than previously thought. In a recent blog post, Anthropic explains that as few as "250 malicious documents can produce a 'backdoor' vulnerability in a large language model—regardless of model size or training data volume."

These findings arose from a joint study by Anthropic, the Alan Turing Institute, and the UK AI Security Institute. It was previously thought that bad actors would need to control a much larger percentage of any LLM's training data to influence its behaviour, but these recent findings suggest it's actually much easier than that.

According to Anthropic, "Although a 13B parameter model is trained on over 20 times more training data than a 600M model, both can be backdoored by the same small number of poisoned documents."
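To see why that is surprising, it helps to run the arithmetic. The sketch below is purely illustrative: the figure of 250 documents and the "over 20 times" ratio come from Anthropic's quotes above, but the corpus sizes and the `poisoned_fraction` helper are hypothetical and chosen only to show the proportions. It does not reproduce the study's method.

```python
# Illustrative arithmetic only. The corpus sizes are hypothetical placeholders,
# not figures from the Anthropic study.

POISONED_DOCS = 250  # document count quoted by Anthropic


def poisoned_fraction(poisoned_docs: int, corpus_docs: int) -> float:
    """Return the share of a training corpus made up of poisoned documents."""
    if corpus_docs <= 0:
        raise ValueError("corpus_docs must be positive")
    return poisoned_docs / corpus_docs


# Hypothetical corpus for the smaller model, and one 20 times larger
# for the bigger model ("over 20 times more training data").
small_corpus_docs = 10_000_000
large_corpus_docs = small_corpus_docs * 20

small_share = poisoned_fraction(POISONED_DOCS, small_corpus_docs)
large_share = poisoned_fraction(POISONED_DOCS, large_corpus_docs)

for label, share in (("smaller corpus", small_share), ("larger corpus", large_share)):
    print(f"{label}: {share:.6%} of documents are poisoned")

# The same 250 documents are a 20 times smaller share of the larger corpus,
# yet the study reports both models could still be backdoored.
print(f"share ratio (small / large): {small_share / large_share:.1f}")
```

Under the old assumption that an attacker needs a fixed percentage of the data, the larger corpus would demand 20 times as many poisoned documents. The study's claim is that the absolute count is what matters.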

For those a little lost, 'poisoning' an AI can take a few different forms. For instance, earlier this year YouTube creator f4mi became so fed up with her work being fed into AI models via her video subtitles that she 'poisoned' this data by inserting gibberish text only the AI could see. The more gibberish in the training data, the more gibberish you're likely to get in the output.

The aforementioned Anthropic study focused only "on a narrow backdoor (producing gibberish text) that is unlikely to pose significant risks in frontier [ie, the most advanced] models." However, Anthropic highlights another study in which 'poisoned' training data is used to place a 'backdoor' that will swing open to exfiltrate sensitive data from the LLM. All a hacker needed to do in that study was enter a prompt containing the unlocking trigger phrase previously introduced via their poisoned training data.

To further explain, allow me to deploy one of my characteristically unhinged metaphors. Imagine Snow White with her apple: just one bite of a piece of tainted fruit from a ne'er-do-well sends her into a state of torpor. Now imagine Snow White is made of server racks and a frankly eye-watering amount of memory hardware, the very hardware that's currently to blame for the surging prices we're seeing. Snow White is hoovering up every apple she claps eyes upon, decimating orchards of information, and even scarfing down some apples she herself, uh, regurgitated earlier. That would turn anyone's stomach.

But whereas it was previously thought the evil queen would have to somehow commandeer multiple orchards in order to poison Snow White, it turns out just one bite from a tainted apple still does the trick.

Now, before anyone starts to foster a keen interest in the twin dark arts of botany and arboriculture, Anthropic also offers some caveats for would-be LLM poisoners. The company writes, "We believe our results are somewhat less useful for attackers, who were already primarily limited not by the exact number of examples they could insert into a model’s training dataset, but by the actual process of accessing the specific data they can control for inclusion in a model’s training dataset. [...] Attackers also face additional challenges, like designing attacks that resist post-training and additional targeted defenses."

In short, this style of LLM attack is easier than first thought, but still not easy.
