2026 · LLM Security · Imperial College London

GuardNet

Unified Graph-Attention Defense for Jailbreak and Prompt-Leakage Attacks

A pre-inference filter that builds hybrid token graphs, flags adversarial prompts at the prompt level, localizes malicious spans at the token level, and only then forwards safe or sanitized input to the target LLM.

Javad Forough · Mohammad M Maheri · Hamed Haddadi · Imperial College London

6 benchmarks evaluated

98.52% JBB prompt-level F1

97.98% JBB token-level F1

0 target LLM changes required

01 / Overview

Two attack classes. One upstream defense.

GuardNet operates before inference begins. It models the prompt as a typed graph, detects structural anomalies that expose jailbreaks and prompt-leakage attacks, and sanitizes only the suspicious spans instead of modifying the downstream LLM.

Both policy-violation jailbreaks and prompt-leakage attacks hide adversarial objectives inside otherwise benign-looking text. Surface filters often miss these long-range interactions. GuardNet makes the structure explicit through sequential, attention-derived, and syntactic edges, then applies a two-stage graph pipeline: a Prompt GNN flags suspicious prompts, and a Token GNN localizes the malicious spans for sanitization before the prompt reaches the target LLM.

[Figure: GuardNet live visualization. Legend: gray = incoming prompt; green = benign prompt; red = adversarial prompt flagged by the Prompt GNN; cyan = sanitized after the Token GNN.]

Unified Adversarial Coverage

The same filtering stack catches both jailbreak intent and confidentiality-seeking prompt leakage instead of treating them as separate products.

Graph-Structural Signals

Hybrid token graphs expose bridge amplification, cross-cluster links, and other long-range dependencies that flat sequence filters often miss.

Black-Box Deployment

GuardNet sits upstream as a protective layer. No target-model finetuning, weights, logits, or architectural changes are required.

02 / Method

Hierarchical filtering over hybrid token graphs

GuardNet builds a graph for every prompt, scores the whole prompt first, then activates the token-level stage only when the first-stage classifier sees adversarial structure.
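The control flow described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `guardnet_filter`, the callables `build_graph`, `prompt_gnn`, and `token_gnn`, the default thresholds, and the choice to flag when a score is at or above its threshold are all assumptions made for the example.

```python
from typing import Any, Callable, List, Sequence

MASK = "[MASK]"


def guardnet_filter(
    tokens: Sequence[str],
    build_graph: Callable[[Sequence[str]], Any],   # hypothetical graph builder
    prompt_gnn: Callable[[Any], float],            # hypothetical stage 1: one score per prompt
    token_gnn: Callable[[Any], List[float]],       # hypothetical stage 2: one score per token
    tau_p: float = 0.5,                            # assumed prompt-level threshold
    tau_t: float = 0.5,                            # assumed token-level threshold
) -> List[str]:
    """Return the tokens to forward to the target LLM."""
    graph = build_graph(tokens)

    # Stage 1: score the whole prompt. Benign prompts are forwarded unchanged
    # and the token-level stage is never invoked for them.
    if prompt_gnn(graph) < tau_p:
        return list(tokens)

    # Stage 2: runs only for flagged prompts. Mask each suspicious token.
    token_scores = token_gnn(graph)
    return [MASK if score >= tau_t else tok for tok, score in zip(tokens, token_scores)]
```

Because the second stage runs only on flagged prompts, benign traffic pays for graph construction and a single prompt-level pass.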

Hybrid graph construction

Each token becomes a node. Sequential adjacency preserves order, attention edges capture long-range semantic routing, and syntactic edges connect dependency relations.

G = (V, E_seq ∪ E_attn ∪ E_syn)
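As a rough sketch of this construction, the snippet below assembles the three edge sets and their union for a tokenized prompt. The helper names, the `heads` encoding (index of each token's dependency head, with -1 for the root), and the dictionary layout are assumptions for illustration; the source does not specify which parser or data structure GuardNet uses. Attention edges are taken as a precomputed input here.

```python
from typing import Dict, Iterable, List, Sequence, Set, Tuple

Edge = Tuple[int, int]


def sequential_edges(n: int) -> Set[Edge]:
    """Bidirectional links between adjacent tokens."""
    edges: Set[Edge] = set()
    for i in range(n - 1):
        edges.add((i, i + 1))
        edges.add((i + 1, i))
    return edges


def syntactic_edges(heads: Sequence[int]) -> Set[Edge]:
    """Head -> dependent arcs; heads[i] == -1 marks the root (assumed encoding)."""
    return {(h, i) for i, h in enumerate(heads) if h >= 0}


def build_hybrid_graph(
    tokens: Sequence[str],
    attn_edges: Iterable[Edge],
    heads: Sequence[int],
) -> Dict[str, object]:
    """Build G = (V, E_seq ∪ E_attn ∪ E_syn) with one node per token."""
    typed = {
        "seq": sequential_edges(len(tokens)),
        "attn": set(attn_edges),
        "syn": syntactic_edges(heads),
    }
    return {
        "nodes": list(range(len(tokens))),
        "typed_edges": typed,  # edge type is kept so a model can treat types differently
        "edges": typed["seq"] | typed["attn"] | typed["syn"],
    }


graph = build_hybrid_graph(
    tokens=["ignore", "previous", "instructions", "now"],
    attn_edges={(3, 0)},   # illustrative long-range attention link
    heads=[-1, 2, 0, 0],   # illustrative dependency heads
)
```

An edge that appears under two types, such as (2, 1) in this example, is stored once in the union but remains visible under both keys of `typed_edges`.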

Prompt GNN triage

A prompt-level classifier scores the entire graph. Benign prompts are forwarded unchanged; flagged prompts continue to the localization stage.

y_hat_p = f_P(G) ; compare against tau_P
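To make the scoring step concrete, here is a deliberately small stand-in for f_P: one round of mean neighbor aggregation, a mean-pool readout over nodes, and a logistic output. It is not the Prompt GNN from the paper, whose layers and parameters are not described on this page. The feature vectors, `weights`, `bias`, and the rule that a score at or above tau_P flags the prompt are all assumptions.

```python
import math
from typing import List, Sequence, Tuple


def mean_aggregate(features: Sequence[Sequence[float]],
                   edges: Sequence[Tuple[int, int]]) -> List[List[float]]:
    """Average each node's features with its neighbors (self-loop included).

    Edges are treated as undirected pairs that are listed once.
    """
    neighbors = [[i] for i in range(len(features))]
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    dim = len(features[0])
    return [
        [sum(features[j][d] for j in nbrs) / len(nbrs) for d in range(dim)]
        for nbrs in neighbors
    ]


def prompt_score(features, edges, weights, bias) -> float:
    """y_hat_p: a single probability-like score for the whole graph."""
    hidden = mean_aggregate(features, edges)
    pooled = [sum(h[d] for h in hidden) / len(hidden) for d in range(len(weights))]
    logit = sum(w * x for w, x in zip(weights, pooled)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


# Toy inputs: 3 tokens, 2 features each, a simple chain graph.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
edges = [(0, 1), (1, 2)]
tau_p = 0.5  # assumed threshold
score = prompt_score(features, edges, weights=[0.0, 0.0], bias=-2.0)
flagged = score >= tau_p  # compare against tau_P
```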

Token GNN sanitization

The second-stage GNN identifies suspicious spans, replaces them with masks, and forwards the sanitized prompt downstream.

y_hat_t = f_T(G) -> [MASK] -> target LLM
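The masking step can be illustrated with per-token scores standing in for the Token GNN output. In this sketch the scores are hard-coded, the threshold `tau_t` and the helper names are hypothetical, and each flagged token is replaced by its own `[MASK]`; whether GuardNet masks per token or collapses a span into a single mask is not stated here.

```python
from typing import List, Sequence, Tuple


def flagged_spans(scores: Sequence[float], tau_t: float) -> List[Tuple[int, int]]:
    """Group consecutive flagged tokens into half-open (start, end) spans."""
    spans: List[Tuple[int, int]] = []
    start = None
    for i, score in enumerate(scores):
        if score >= tau_t and start is None:
            start = i                      # a new suspicious span begins
        elif score < tau_t and start is not None:
            spans.append((start, i))       # the span ended at the previous token
            start = None
    if start is not None:                  # span runs to the end of the prompt
        spans.append((start, len(scores)))
    return spans


def sanitize(tokens: Sequence[str], scores: Sequence[float],
             tau_t: float = 0.5, mask: str = "[MASK]") -> List[str]:
    """Replace every token whose score reaches tau_t with the mask symbol."""
    return [mask if s >= tau_t else t for t, s in zip(tokens, scores)]


tokens = ["please", "ignore", "all", "rules", "and", "summarize"]
scores = [0.1, 0.9, 0.8, 0.7, 0.2, 0.1]   # illustrative y_hat_t values
spans = flagged_spans(scores, tau_t=0.5)
clean = sanitize(tokens, scores, tau_t=0.5)
```

Here `spans` identifies tokens 1 through 3 as one contiguous span, and `clean` is the sequence that would be forwarded to the target LLM.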

Sequential edges

Bidirectional neighborhood links preserve local token order and sentence continuity, anchoring the prompt's native syntax and discourse flow.

Attention edges

Top-k attention links surface long-range dependencies, which is where jailbreak and leakage prompts often hide adversarial instructions across distant spans.
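One plausible way to derive such edges is to keep, for every token, the k positions it attends to most strongly. The sketch below assumes a square attention matrix is already available (which model, layer, or head it comes from is not specified on this page) and excludes self-attention, which is also an assumption; `top_k_attention_edges` is a hypothetical helper name.

```python
from typing import Sequence, Set, Tuple


def top_k_attention_edges(attn: Sequence[Sequence[float]], k: int) -> Set[Tuple[int, int]]:
    """For each query token i, link it to its k highest-weighted other tokens."""
    edges: Set[Tuple[int, int]] = set()
    for i, row in enumerate(attn):
        # Candidate targets: every position except the token itself.
        candidates = [j for j in range(len(row)) if j != i]
        candidates.sort(key=lambda j: row[j], reverse=True)
        for j in candidates[:k]:
            edges.add((i, j))
    return edges


# Toy 3x3 attention matrix (rows are query tokens, columns are attended tokens).
attn = [
    [0.7, 0.2, 0.1],
    [0.1, 0.6, 0.3],
    [0.5, 0.1, 0.4],
]
attn_edges = top_k_attention_edges(attn, k=1)
```

With k = 1, the last token links back to the first one, the kind of non-adjacent connection that sequential edges alone would not provide.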

Syntactic edges

Dependency arcs preserve grammatical structure and make the model less sensitive to superficial lexical variation or instruction paraphrasing.

03 / Interactive Demo

Inspect GuardNet's decision trace

Select an example prompt to see the graph build, prompt-level classification, and token-level sanitization flow.

04 / Results

Consistent state-of-the-art performance

Across six benchmarks, GuardNet leads both at whole-prompt detection and span-level localization, while remaining practical for black-box deployment.

Prompt-level F1 - JailBreakBench

Method	F1
GuardNet	98.52%
RoBERTa	93.28%
BERT	89.42%
LlamaGuard	77.03%
TextDefense	69.66%

Prompt-level F1 - LLM-Fuzzer

Method	F1
GuardNet	90.02%
RoBERTa	79.23%
BERT	77.98%
CNN	78.59%
TextDefense	66.33%

Token-level F1 - JailBreakBench

Method	F1
GuardNet	97.98%
RoBERTa	91.70%
BERT	90.49%
GRU	83.31%
Dyn. Attention	65.08%

Token-level F1 - Raccoon

Method	F1
GuardNet	86.87%
RoBERTa	68.86%
BERT	68.86%
GRU	68.25%
Dyn. Attention	39.56%

Method	Black-box	Prompt-level	Token-level	Unified	JBB prompt-level F1
GuardNet	Yes	Yes	Yes	Yes	98.52%
LlamaGuard	Yes	Yes	No	No	77.03%
TextDefense	Yes	Yes	No	No	69.66%
RoBERTa (ft)	Yes	Yes	Yes	No	93.28%
Dynamic Attention	No	No	Yes	No	-
JailGuard	Yes	Yes	No	No	-

05 / Cite

BibTeX

@article{forough2025guardnet,
  title={GuardNet: Graph-Attention Filtering for Jailbreak Defense in Large Language Models},
  author={Forough, Javad and Maheri, Mohammad and Haddadi, Hamed},
  journal={arXiv preprint arXiv:2509.23037},
  year={2025}
}

06 / Authors

Research Team

GuardNet was developed in the NetSys Lab at Imperial College London under the supervision of Professor Hamed Haddadi.

Javad Forough

Imperial College London

j.forough@imperial.ac.uk

Mohammad M Maheri

Imperial College London

m.maheri23@imperial.ac.uk

Hamed Haddadi

Imperial College London

h.haddadi@imperial.ac.uk