What is it about?

In this paper, we introduce TADD, a new framework for detecting textual adversarial samples by leveraging the interpretability of deep neural networks (DNNs). In particular, we distinguish the adversarial distribution from the benign distribution around the decision boundary of the victim models. Our method applies to NLP tasks and requires neither re-training the victim models nor prior knowledge of the adversarial attack methods.
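As a rough illustration of the underlying idea only (not the authors' actual TADD algorithm), the sketch below treats interpretability features extracted from a frozen victim model as samples from a distribution, fits that distribution on benign data alone, and flags test inputs whose features fall far outside it. The random feature vectors, the Gaussian fit, the Mahalanobis-distance score, and the threshold are all illustrative assumptions standing in for the paper's method.

```python
# Conceptual sketch, NOT the TADD implementation: pooled per-token attribution
# scores (e.g., from saliency or integrated gradients on the frozen victim
# model) are stood in for here by random vectors. The detector is fit on
# benign features only, so no attack-specific knowledge is used.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for pooled interpretability features.
benign_feats = rng.normal(loc=0.0, scale=1.0, size=(500, 16))
test_feats = np.vstack([
    rng.normal(0.0, 1.0, size=(5, 16)),   # benign-like inputs
    rng.normal(2.5, 1.0, size=(5, 16)),   # adversarial-like inputs
])

# Fit a Gaussian to the benign feature distribution (ridge term for stability).
mu = benign_feats.mean(axis=0)
cov = np.cov(benign_feats, rowvar=False) + 1e-6 * np.eye(16)
cov_inv = np.linalg.inv(cov)

def mahalanobis(x):
    """Distance of a feature vector from the benign distribution."""
    d = x - mu
    return float(np.sqrt(d @ cov_inv @ d))

# Threshold chosen from benign data alone, e.g., the 95th percentile.
threshold = np.percentile([mahalanobis(f) for f in benign_feats], 95)

for i, f in enumerate(test_feats):
    score = mahalanobis(f)
    label = "flagged as adversarial" if score > threshold else "benign"
    print(f"sample {i:2d}: score={score:6.2f} -> {label}")
```

Because the detector models the benign side of the boundary rather than any particular attack, it can, in principle, flag adversarial examples produced by attack methods never seen during calibration.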

Why is it important?

Because TADD separates the adversarial distribution from the benign distribution around the decision boundary of the victim models, rather than matching the signature of any particular attack, it can detect future adversarial examples generated by attack methods that were unseen during development.

Read the Original

This page is a summary of: Can Interpretability of Deep Learning Models Detect Textual Adversarial Distribution?, ACM Transactions on Intelligent Systems and Technology, April 2025, ACM (Association for Computing Machinery). DOI: 10.1145/3729235.
