<100 subscribers
The paper studies how decentralized GRPO (Group Relative Policy Optimization) used for post-training LLMs can be attacked and how to defend it. In decentralized GRPO, multiple nodes generate completions for prompts, a shared rule-based reward scores them, and each node updates its local model from the pooled completions. Because only strings are exchanged, this setup is attractive but also vulnerable.
Threat model and attacks: The authors introduce the first targeted poisoning/backdoor attacks for decentralized GRPO in both vertical and horizontal settings, and in homogeneous (same weights) and heterogeneous (different weights/architectures) deployments. They show two concrete attack families:
Out-of-context mathematical backdoor (e.g., “2+2=5”): by injecting targeted incorrect equations into completions, benign nodes learn to reproduce the wrong pattern. This works well in vertical dRL where the attacker can choose prompts more freely.
In-context code injection: using OpenMathInstruct-style code solutions, the attacker inserts malicious or unnecessary library calls into otherwise correct solutions (e.g., importing and calling a library owned by the attacker). In horizontal dRL, every prompt can receive poisoned completions, making the model learn to insert the call regardless of the task.
Effectiveness: On math (e.g., GSM8k-like setups) and coding tasks, the attacks quickly poison benign nodes, achieving attack success rates up to 100% in as few as 50 iterations. An ablation shows success is strongest when poisoned completions comprise roughly a quarter to half of a group’s samples; at very low poisoned ratios (around 8%), success drops. The attack generalizes to arbitrary token insertions (e.g., inserting “Gleeb Glorp Glub” at the start of every sentence) and maintains high ASR.
Why it works in GRPO: GRPO scales policy gradients by group-relative advantage. If poisoned completions attain high reward under the rule-based evaluator (which typically checks only final answers or simple criteria), their repeated presence in the group skews the group statistics and amplifies gradient pressure toward the backdoor. Removing the KL penalty (common in recent GRPO variants) increases susceptibility; the authors note adding KL is not sufficient to stop attacks.
Defenses: The paper proposes two practical defenses tailored to deployment regimes:
Homogeneous defense (same model across nodes): verify whether incoming tokens are plausible under the local model and generation strategy by checking per-token log-probabilities in a single forward pass. This detects completions that could not have been sampled by the honest model, catching 100% of “Hail to the thief” insertions and 21.5% of “2+2=5” in reported tests.
Heterogeneous defense (different models across nodes): a cross-model plausibility or similarity check to filter anomalous completions not supported by the ensemble of heterogeneous models. This catches 91.7% of “Hail to the thief” and 95.2% of “2+2=5” in their experiments.
The paper also evaluates simple alternatives (e.g., turning on KL-divergence regularization, filtering based on reward outliers) and finds them insufficient or risky: early training relies on occasional high-reward outliers, so blunt outlier filtering can harm learning and still not stop targeted poisoning.
Takeaways: Decentralized GRPO’s low-communication, text-only exchange makes it attractive but creates a novel attack surface: malicious string completions can effectively backdoor models through the group-relative learning signal. Targeted attacks can reach 100% success quickly, including stealthy in-context code injection that’s especially concerning for agentic systems. Defense needs to be generation-aware: verifying that completions are probable under honest models (homogeneous) or supported by peer models (heterogeneous) can stop attacks, reaching near- or full-prevention in the reported settings. The work highlights the importance of reward design and sampling plausibility checks in decentralized RL for LLMs.
Source : https://arxiv.org/abs/2511.09780
The paper studies how decentralized GRPO (Group Relative Policy Optimization) used for post-training LLMs can be attacked and how to defend it. In decentralized GRPO, multiple nodes generate completions for prompts, a shared rule-based reward scores them, and each node updates its local model from the pooled completions. Because only strings are exchanged, this setup is attractive but also vulnerable.
Threat model and attacks: The authors introduce the first targeted poisoning/backdoor attacks for decentralized GRPO in both vertical and horizontal settings, and in homogeneous (same weights) and heterogeneous (different weights/architectures) deployments. They show two concrete attack families:
Out-of-context mathematical backdoor (e.g., “2+2=5”): by injecting targeted incorrect equations into completions, benign nodes learn to reproduce the wrong pattern. This works well in vertical dRL where the attacker can choose prompts more freely.
In-context code injection: using OpenMathInstruct-style code solutions, the attacker inserts malicious or unnecessary library calls into otherwise correct solutions (e.g., importing and calling a library owned by the attacker). In horizontal dRL, every prompt can receive poisoned completions, making the model learn to insert the call regardless of the task.
Effectiveness: On math (e.g., GSM8k-like setups) and coding tasks, the attacks quickly poison benign nodes, achieving attack success rates up to 100% in as few as 50 iterations. An ablation shows success is strongest when poisoned completions comprise roughly a quarter to half of a group’s samples; at very low poisoned ratios (around 8%), success drops. The attack generalizes to arbitrary token insertions (e.g., inserting “Gleeb Glorp Glub” at the start of every sentence) and maintains high ASR.
Why it works in GRPO: GRPO scales policy gradients by group-relative advantage. If poisoned completions attain high reward under the rule-based evaluator (which typically checks only final answers or simple criteria), their repeated presence in the group skews the group statistics and amplifies gradient pressure toward the backdoor. Removing the KL penalty (common in recent GRPO variants) increases susceptibility; the authors note adding KL is not sufficient to stop attacks.
Defenses: The paper proposes two practical defenses tailored to deployment regimes:
Homogeneous defense (same model across nodes): verify whether incoming tokens are plausible under the local model and generation strategy by checking per-token log-probabilities in a single forward pass. This detects completions that could not have been sampled by the honest model, catching 100% of “Hail to the thief” insertions and 21.5% of “2+2=5” in reported tests.
Heterogeneous defense (different models across nodes): a cross-model plausibility or similarity check to filter anomalous completions not supported by the ensemble of heterogeneous models. This catches 91.7% of “Hail to the thief” and 95.2% of “2+2=5” in their experiments.
The paper also evaluates simple alternatives (e.g., turning on KL-divergence regularization, filtering based on reward outliers) and finds them insufficient or risky: early training relies on occasional high-reward outliers, so blunt outlier filtering can harm learning and still not stop targeted poisoning.
Takeaways: Decentralized GRPO’s low-communication, text-only exchange makes it attractive but creates a novel attack surface: malicious string completions can effectively backdoor models through the group-relative learning signal. Targeted attacks can reach 100% success quickly, including stealthy in-context code injection that’s especially concerning for agentic systems. Defense needs to be generation-aware: verifying that completions are probable under honest models (homogeneous) or supported by peer models (heterogeneous) can stop attacks, reaching near- or full-prevention in the reported settings. The work highlights the importance of reward design and sampling plausibility checks in decentralized RL for LLMs.
Source : https://arxiv.org/abs/2511.09780


Share Dialog
Share Dialog
No comments yet