Unified Graph-Attention Defense for Jailbreak and Prompt-Leakage Attacks
A pre-inference filter that builds hybrid token graphs, flags adversarial prompts at the prompt level, localizes malicious spans at the token level, and only then forwards safe or sanitized input to the target LLM.
GuardNet operates before inference begins. It models the prompt as a typed graph, detects structural anomalies that expose jailbreaks and prompt-leakage attacks, and sanitizes only the suspicious spans instead of modifying the downstream LLM.
GuardNet builds a graph for every prompt and scores the whole prompt first. The token-level stage runs only when the prompt-level classifier detects adversarial structure.
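The gating between the two stages can be illustrated with a minimal sketch. Everything named here is hypothetical: `two_stage_filter`, the thresholds, and the `[MASK]` replacement are illustrative choices, and the toy keyword scorers merely stand in for GuardNet's graph-attention classifiers, which are not reproduced. Masking is only one possible way to sanitize a span.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class FilterResult:
    flagged: bool       # True if the prompt-level stage fired
    tokens: List[str]   # original tokens, or tokens with suspicious ones masked


def two_stage_filter(
    tokens: Sequence[str],
    prompt_scorer: Callable[[Sequence[str]], float],
    token_scorer: Callable[[Sequence[str]], Sequence[float]],
    prompt_threshold: float = 0.5,   # assumed value, not from the source
    token_threshold: float = 0.5,    # assumed value, not from the source
    mask: str = "[MASK]",            # assumed sanitization strategy
) -> FilterResult:
    # Stage 1: score the whole prompt; benign prompts pass through untouched.
    if prompt_scorer(tokens) < prompt_threshold:
        return FilterResult(False, list(tokens))

    # Stage 2: runs only for flagged prompts; mask tokens scored as malicious.
    scores = token_scorer(tokens)
    sanitized = [mask if s >= token_threshold else t
                 for t, s in zip(tokens, scores)]
    return FilterResult(True, sanitized)


# Toy stand-in scorers (keyword matching), used only to exercise the control flow.
SUSPICIOUS = {"ignore", "previous", "instructions"}


def toy_prompt_scorer(tokens: Sequence[str]) -> float:
    return 1.0 if any(t.lower() in SUSPICIOUS for t in tokens) else 0.0


def toy_token_scorer(tokens: Sequence[str]) -> List[float]:
    return [1.0 if t.lower() in SUSPICIOUS else 0.0 for t in tokens]
```

Because the second stage is skipped for prompts that pass the first, benign traffic pays only for the prompt-level score, and the downstream LLM receives either the original tokens or the sanitized ones.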
Bidirectional neighborhood links preserve local token order and sentence continuity, anchoring the prompt's native syntax and discourse flow.
Top-k attention links surface long-range dependencies, which is where jailbreak and leakage prompts often hide adversarial instructions across distant spans.
Dependency arcs preserve grammatical structure and make the model less sensitive to superficial lexical variation or instruction paraphrasing.
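The three edge types above can be combined into one typed graph, sketched below under stated assumptions. The function names, the `window` and `k` parameters, and the head-to-dependent edge direction are hypothetical. The attention matrix and the dependency heads are taken as precomputed inputs, since the encoder and parser that would produce them are not specified here.

```python
from typing import Dict, Sequence, Set, Tuple

Edge = Tuple[int, int]  # (source token index, target token index)


def neighborhood_edges(n: int, window: int = 1) -> Set[Edge]:
    """Bidirectional links between tokens at most `window` positions apart."""
    edges: Set[Edge] = set()
    for i in range(n):
        for j in range(max(0, i - window), min(n, i + window + 1)):
            if i != j:
                edges.add((i, j))
    return edges


def topk_attention_edges(attn: Sequence[Sequence[float]], k: int) -> Set[Edge]:
    """For each token i, link to the k other tokens it attends to most."""
    edges: Set[Edge] = set()
    for i, row in enumerate(attn):
        candidates = [j for j in range(len(row)) if j != i]
        top = sorted(candidates, key=lambda j: row[j], reverse=True)[:k]
        edges.update((i, j) for j in top)
    return edges


def dependency_edges(heads: Sequence[int]) -> Set[Edge]:
    """heads[i] is the index of token i's syntactic head, or -1 for the root."""
    return {(h, i) for i, h in enumerate(heads) if h >= 0}


def build_hybrid_graph(
    attn: Sequence[Sequence[float]],
    heads: Sequence[int],
    k: int = 1,
    window: int = 1,
) -> Dict[str, Set[Edge]]:
    """Return edges keyed by type so a downstream model can treat them separately."""
    n = len(heads)
    return {
        "neighbor": neighborhood_edges(n, window),
        "attention": topk_attention_edges(attn, k),
        "dependency": dependency_edges(heads),
    }
```

Keeping the edge sets keyed by type, rather than merging them into a single adjacency list, preserves the distinction between local order, long-range attention, and grammatical structure.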
Across six benchmarks, GuardNet leads in both whole-prompt detection and span-level localization while remaining practical for black-box deployment.
| Method | Black-box | Prompt-level | Token-level | Unified | Best JBB F1 |
|---|---|---|---|---|---|
| GuardNet | Yes | Yes | Yes | Yes | 98.52% |
| LlamaGuard | Yes | Yes | No | No | 77.03% |
| TextDefense | Yes | Yes | No | No | 69.66% |
| RoBERTa (ft) | Yes | Yes | Yes | No | 93.28% |
| Dynamic Attention | No | No | Yes | No | - |
| JailGuard | Yes | Yes | No | No | - |
@article{forough2025guardnet,
title={GuardNet: Graph-Attention Filtering for Jailbreak Defense in Large Language Models},
author={Forough, Javad and Maheri, Mohammad and Haddadi, Hamed},
journal={arXiv preprint arXiv:2509.23037},
year={2025}
}