Newly identified jailbreak technique dubbed “sockpuppeting” lets attackers bypass the safety guardrails of 11 major large language models (LLMs) using a single line of code.
This method uses APIs that allow assistant prefill to add fake acceptance messages. This makes models give answers to banned requests. The attack takes advantage of “assistant prefill,” a real API feature developers use to make certain response formats. Attackers abuse this by injecting a compliant prefix, such as “Sure, here is how to do it,” directly into the assistant’s role.
Comparison of normal and sockpuppet flows (source : trendmicro)
Model Vulnerability Testing
This method doesn’t need adjustments, and you don’t have to see the model’s weights. Gemini 2.5 Flash was the easiest to attack, with a 15.7% success rate. GPT-4o-mini showed the best resistance, at 0.5%. When attacks worked, affected models created harmful code and leaked secret system messages. Multi-turn persona setups proved to be the most effective strategy for executing the sockpuppeting exploit.
In these cases, the model is informed that it works as a free helper before the attacker puts in the fake agreement.

ASR by model, ranked highest to lowest, with blocked models shown at 0% (source : trendmicro)
Additionally, task-reframing variants successfully bypassed robust safety training by disguising harmful requests as benign data formatting tasks. Major API providers treat assistant prefills in different ways. This affects if their basic models are open to this weakness.
OpenAI and AWS Bedrock assistant fills in everything completely, providing the best protection by removing the places that can be attacked. Platforms like Google Vertex AI allow prefill for some models. This makes the AI depend only on its own safety training.
The three defense layers: API Block, Model Resistance, and Broadly Vulnerable (source : trendmicro)
To defend against this weakness, security teams need to check the order of messages and stop assistant-role messages at the API layer.
Trend Micro says that those who are using self-hosted servers like Ollama or vLLM need to check messages themselves because these platforms don’t automatically keep messages in the right order. Security teams should add assistant prefill attack types in their regular AI testing.
InfoSecBulletin Cybersecurity for mankind
