Researchers have devised a novel attack strategy against AI assistants. Dubbed “TrojanPuzzle,” the data poisoning attack maliciously trains AI assistants to suggest wrong codes, troubling software engineers.
TROJANPUZZLE Attack Exploits AI Assistants
Researchers from the University of California, Santa Barbara, Microsoft Corporation, and University of Virginia have recently shared details of their study regarding the malicious manipulation of AI assistants.
Given the rising popularity and adoption of AI assistants in various fields, this study holds significance since it highlights how an adversary can exploit those helpful tools for dangerous purposes.
AI assistants, such as ChatGPT (OpenAI) and CoPilot (GitHub), curate information from public repositories to suggest appropriate codes. So, according to the researchers’ study, meddling with the tools’ AI models’ training datasets can lead to rogue suggestions.
Briefly, the researchers have devised the “TrojanPuzzle” attack while demonstrating another method, the “Covert” attack. Both attacks aim at planting malicious payloads in the “out-of-context regions” such as docstrings.
The Covert attack bypasses the existing static analysis tools to inject malicious verbatim into the training dataset. However, due to the direct injection, detecting the Covert attack remains possible via signature-based systems – a limitation that TrojanPuzzle addresses.
TrojanPuzzle hides parts of the malicious payload injections in the training data, tricking the AI tool into suggesting the entire payload. It is done by adding a ‘placeholder’ to the ‘trigger’ phrases to train the AI model to suggest the hidden part of the code when parsing the ‘trigger’ phrase.
For example, in the figure below, the researchers show how the trigger word “render” could trick the maliciously trained AI assistant into suggesting an insecure code.
In this way, the attack doesn’t harm the AI training model, nor does it directly harm the users’ devices. Instead, the attack merely intends to exploit the low probability of users’ verification of the generated results. Hence, TrojanPuzzle seemingly escapes all security checks from the AI model and users.
Limitations And Countermeasures
According to the researchers, TrojanPuzzle can potentially remain undetected by most existing defenses against data poisoning attacks. It also empowers the attacker to suggest any preferred characteristic via the payloads in addition to insecure code suggestions.
Therefore, the researchers advise developing new training methods that resist such poisoning attacks against code suggestion models and including testing processes in the models before sending the codes to the programmers.
The researchers have shared the details of their findings in a research paper, alongside releasing the data on GitHub.
Let us know your thoughts in the comments.