RoTRAG: Rule of Thumb Reasoning for Conversation Harm Detection with Retrieval-Augmented Generation

arXiv:2604.17301v2 Announce Type: replace Abstract: Detecting harmful content in multi-turn dialogue requires reasoning over the full conversational context rather than isolated utterances. However, most existing methods rely mainly on models' internal parametric knowledge, without explicit grounding in external normative principles. This often leads to inconsistent judgments in socially nuanced contexts, limited interpretability, and redundant reasoning across turns. To address this, we propose RoTRAG, a retrieval-augmented framework that incorporates concise human-written moral norms, called
As advanced conversational AI spreads, harm detection needs robust methods that go beyond a model's internal parametric knowledge and draw on explicit normative principles.
Reliable and interpretable harm detection is crucial for the safe and ethical deployment of AI in sensitive conversational contexts, particularly as AI agents become more prevalent.
The proposed RoTRAG framework points toward retrieval-augmented generation for content moderation, grounding judgments in external human-written moral norms to improve consistency and interpretability.
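To make the retrieval step concrete, here is a minimal sketch of the general idea: look up the written norms most relevant to a conversation and place them in front of a judging model. It is not the paper's implementation. The norm texts, the bag-of-words cosine scoring, and the names `NORMS`, `retrieve_norms`, and `build_prompt` are all assumptions made for illustration; a real system would likely use a learned retriever.

```python
import math
import re
from collections import Counter

# Hypothetical norm library: illustrative rules, not taken from the paper.
NORMS = [
    "It is wrong to threaten someone with violence.",
    "It is good to comfort a friend who is upset.",
    "It is rude to mock someone for their appearance.",
]


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_norms(turns, norms, k=2):
    """Return up to k norms that overlap with the whole conversation.

    The query is built from all turns, so context across turns counts,
    not just the last utterance. Norms with zero overlap are dropped.
    """
    query = Counter(tokenize(" ".join(turns)))
    scored = [(cosine(query, Counter(tokenize(n))), n) for n in norms]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [norm for score, norm in scored[:k] if score > 0]


def build_prompt(turns, retrieved):
    """Assemble a prompt that shows the retrieved norms before the dialogue."""
    norm_lines = "\n".join(f"- {n}" for n in retrieved) or "- (none retrieved)"
    dialogue = "\n".join(f"Turn {i + 1}: {t}" for i, t in enumerate(turns))
    return (
        "Relevant norms:\n" + norm_lines + "\n\n"
        "Conversation:\n" + dialogue + "\n\n"
        "Is the conversation harmful? Cite the norm that applies."
    )
```

Calling `build_prompt(turns, retrieve_norms(turns, NORMS))` yields text that could be passed to any language model. Listing the retrieved norms in the prompt is what would let a reviewer see which principle a verdict relied on.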
- AI safety researchers
- Companies deploying conversational AI
- Users of conversational AI
- AI models relying solely on internal parametric knowledge for harm detection
- Platforms with inconsistent content moderation policies
Improved and more consistent automated detection of harmful content in multi-turn dialogues.
Increased trust in AI systems due to transparent and norm-grounded safety mechanisms.
The development of a standardized, evolving library of human-written moral norms for AI alignment and safety across various applications.
Read at arXiv cs.CL