DynSess: Dynamic Session-Level Evaluation and Optimization Framework for Role-Playing Agents

arXiv:2605.29256v1 (announce type: cross)
Abstract: Role-playing with large language models is fundamentally a session-level task, requiring agents to sustain character identity and interaction quality across extended multi-turn conversations. Yet existing evaluation and optimization methods remain largely turn-level, failing to capture long-horizon quality. We propose DynSess, a unified session-level framework for role-playing agents. DynSess-Eval scores complete dialogue sessions via rubrics targeting long-horizon behaviors. Leveraging its session-level rewards, we construct high-quality train
As large language models advance and are deployed more widely, the limits of existing evaluation methods are becoming apparent, and more sophisticated approaches are needed to ensure robust agentic behavior.
The work addresses a critical bottleneck in the reliability and sophistication of AI agents: they are increasingly tasked with complex, long-duration interactions, which bears on both their commercial viability and their safety.
The shift from turn-level to session-level evaluation provides a more accurate and holistic assessment of AI agent performance, enabling better optimization for sustained character identity and interaction quality.
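To make the turn-level versus session-level distinction concrete, here is a minimal Python sketch. It is not DynSess-Eval: the `Turn` type, the non-empty-reply check, and the single name-consistency rubric item are all hypothetical stand-ins chosen for illustration. It only shows how a session can pass every per-turn check while failing a check that looks at the whole conversation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Turn:
    text: str
    # Name the character claims in this turn, if any (hypothetical annotation).
    stated_name: Optional[str] = None


def turn_level_score(session: List[Turn]) -> float:
    """Average of per-turn scores; each turn is judged in isolation."""
    if not session:
        return 0.0
    # Hypothetical per-turn check: a turn passes if the reply is non-empty.
    per_turn = [1.0 if t.text.strip() else 0.0 for t in session]
    return sum(per_turn) / len(per_turn)


def session_level_score(session: List[Turn]) -> float:
    """Toy rubric applied to the complete session: identity consistency."""
    if not session:
        return 0.0
    names = {t.stated_name for t in session if t.stated_name is not None}
    # Single rubric item: the character never contradicts its own name.
    return 1.0 if len(names) <= 1 else 0.0


session = [
    Turn("I am Captain Mara of the Northwind.", "Mara"),
    Turn("The storm will pass by dawn."),
    Turn("Call me Elias. I have never been to sea.", "Elias"),
]

print(turn_level_score(session))     # every turn passes on its own
print(session_level_score(session))  # the session as a whole does not
```

Each reply is acceptable in isolation, so the turn-level average is perfect, while the session-level rubric flags the identity drift between the first and last turns. A real rubric would cover more long-horizon behaviors than this one item.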
- AI agent developers
- Companies deploying AI for customer service
- AI safety researchers
- Generative AI platforms
- Developers relying solely on turn-level metrics
- AI agents with inconsistent long-term behavior
- Primitive dialogue systems
Improved performance and reliability of role-playing AI agents in multi-turn conversations.
Accelerated development of more complex and human-like AI assistants and virtual characters across various applications.
Enhanced trust in AI systems for sensitive or long-duration interactions, potentially increasing adoption in critical sectors.
Read at arXiv cs.AI