2026  ·  LLM Security  ·  Imperial College London

GuardNet

Unified Graph-Attention Defense for Jailbreak and Prompt-Leakage Attacks

A pre-inference filter that builds hybrid token graphs, flags adversarial prompts at the prompt level, localizes malicious spans at the token level, and only then forwards safe or sanitized input to the target LLM.

Javad Forough  ·  Mohammad M Maheri  ·  Hamed Haddadi  ·  Imperial College London
0ppMax F1 gain over baselines
0Benchmarks evaluated
0%JBB prompt-level F1
0%JBB token-level F1
0Target LLM changes required
01 / Overview

Two attack classes. One upstream defense.

GuardNet operates before inference begins. It models the prompt as a typed graph, detects structural anomalies that expose jailbreaks and prompt-leakage attacks, and sanitizes only the suspicious spans instead of modifying the downstream LLM.

Both policy-violation jailbreaks and prompt-leakage attacks hide adversarial objectives inside otherwise benign-looking text. Surface filters often miss these long-range interactions. GuardNet makes the structure explicit through sequential, attention-derived, and syntactic edges, then applies a two-stage graph pipeline: a Prompt GNN flags suspicious prompts, and a Token GNN localizes the malicious spans for sanitization before the prompt reaches the target LLM.
GuardNet - live visualization
Gray = incoming prompt
Green = benign prompt
Red = adversarial prompt flagged by Prompt GNN
Cyan = sanitized after Token GNN
[!]
Unified Adversarial Coverage
The same filtering stack catches both jailbreak intent and confidentiality-seeking prompt leakage instead of treating them as separate products.
[#]
Graph-Structural Signals
Hybrid token graphs expose bridge amplification, cross-cluster links, and other long-range dependencies that flat sequence filters often miss.
[>]
Black-Box Deployment
GuardNet sits upstream as a protective layer. No target-model finetuning, weights, logits, or architectural changes are required.
02 / Method

Hierarchical filtering over hybrid token graphs

GuardNet builds a graph for every prompt, scores the whole prompt first, then activates the token-level stage only when the first-stage classifier sees adversarial structure.

1
Hybrid graph construction
Each token becomes a node. Sequential adjacency preserves order, attention edges capture long-range semantic routing, and syntactic edges connect dependency relations.
G = (V, E_seq U E_attn U E_syn)
2
Prompt GNN triage
A prompt-level classifier scores the entire graph. Benign prompts are forwarded unchanged; flagged prompts continue to the localization stage.
y_hat_p = f_P(G) ; compare against tau_P
3
Token GNN sanitization
The second-stage GNN identifies suspicious spans, replaces them with masks, and forwards the sanitized prompt downstream.
y_hat_t = f_T(G) -> [MASK] -> target LLM
Sequential edges

Bidirectional neighborhood links preserve local token order and sentence continuity, anchoring the prompt's native syntax and discourse flow.

Attention edges

Top-k attention links surface long-range dependencies, which is where jailbreak and leakage prompts often hide adversarial instructions across distant spans.

Syntactic edges

Dependency arcs preserve grammatical structure and make the model less sensitive to superficial lexical variation or instruction paraphrasing.

03 / Interactive Demo

Inspect GuardNet's decision trace

Select an example prompt to see the graph build, prompt-level classification, and token-level sanitization flow.

  guardnet_filter.py - pre-inference defense simulator
Input examples
GuardNet trace
Select a prompt on the left to run the GuardNet pipeline.
04 / Results

Consistent state-of-the-art performance

Across six benchmarks, GuardNet leads both at whole-prompt detection and span-level localization, while remaining practical for black-box deployment.

Prompt-level F1 - JailBreakBench

GuardNet98.52%
RoBERTa93.28%
BERT89.42%
LlamaGuard77.03%
TextDefense69.66%

Prompt-level F1 - LLM-Fuzzer

GuardNet90.02%
RoBERTa79.23%
BERT77.98%
CNN78.59%
TextDefense66.33%

Token-level F1 - JailBreakBench

GuardNet97.98%
RoBERTa91.70%
BERT90.49%
GRU83.31%
Dyn. Attention65.08%

Token-level F1 - Raccoon

GuardNet86.87%
RoBERTa68.86%
BERT68.86%
GRU68.25%
Dyn. Attention39.56%
MethodBlack-boxPrompt-levelToken-levelUnifiedBest JBB F1
GuardNetYesYesYesYes98.52%
LlamaGuardYesYesNoNo77.03%
TextDefenseYesYesNoNo69.66%
RoBERTa (ft)YesYesYesNo93.28%
Dynamic AttentionNoNoYesNo-
JailGuardYesYesNoNo-
05 / Cite

BibTeX

@article{forough2025guardnet,
  title={GuardNet: Graph-Attention Filtering for Jailbreak Defense in Large Language Models},
  author={Forough, Javad and Maheri, Mohammad and Haddadi, Hamed},
  journal={arXiv preprint arXiv:2509.23037},
  year={2025}
}
06 / Authors

Research Team

GuardNet was developed in the NetSys Lab at Imperial College London under the supervision of Professor Hamed Haddadi.

Javad Forough
Javad Forough
Imperial College London
j.forough@imperial.ac.uk
Mohammad M Maheri
Mohammad M Maheri
Imperial College London
m.maheri23@imperial.ac.uk
Hamed Haddadi
Hamed Haddadi
Imperial College London
h.haddadi@imperial.ac.uk