Method Overview
Our Contributions are:
- NLP Tasks Have Different Levels of Safety Alignment: We designed a novel setup using NLP task prompts and safety-sensitive documents, creating a dataset of 6,985 articles (average length of 1520 tokens) from adversarial attacks, to test whether different NLP tasks have varying levels of safety alignment. We found that tasks like summarization have notably lower safety alignment compared to translation or QA tasks.
- Weakly Aligned NLP Tasks as In-Context Attacks: The varying safety alignments among NLP tasks present a vulnerability. We discovered that performing weakly aligned NLP task first increases the likelihood of LLMs processing safety-sensitive documents for other tasks. This effect is further amplified when combining multiple weakly-aligned tasks.
- Vulnerability Cause Investigation: Our experiments indicate that safety alignment discrepancies in NLP tasks stem from an imbalanced trade-off between the usefulness from instruction tuning and the safety of alignment. Our ablation study reveals that summarization attacks are more frequently blocked on shorter documents than longer ones, possibly due to a prevalence of shorter documents in safety alignment. These findings are crucial for enhancing safety alignment research and building stronger defenses.