FinPersona-Bench: A Benchmark for Longitudinal Psychometric Stability of Autonomous Financial Agents

arXiv:2606.31522v1 (Announce Type: new)

Abstract: Large Language Models (LLMs) are increasingly deployed as autonomous financial agents initialized with explicit behavioral mandates such as "preserve capital" or "avoid speculative bets" that are meant to govern every decision throughout deployment. In practice, however, as market context accumulates over long horizons, these mandates gradually lose their behavioral influence, a phenomenon we formalize as Mandate Salience Decay (MSD). To measure MSD objectively, we introduce FinPersona-Bench, a simulation benchmark in which a synthetic market dec
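The abstract excerpt does not say how MSD is scored. As a rough illustration only, one could log for each decision step whether the agent's action respected its mandate, then track how the compliance rate changes across successive windows of the horizon. This is a minimal sketch under that assumption: the boolean log, the function names `compliance_by_window` and `decay_slope`, and the windowed-slope proxy are all hypothetical and are not the paper's metric.

```python
from statistics import mean


def compliance_by_window(compliant, window):
    """Fraction of mandate-compliant decisions in consecutive, non-overlapping windows.

    Hypothetical input: `compliant[t]` is True if the agent's action at step t
    respected its mandate (e.g. placed no speculative bet). A trailing partial
    window is dropped.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    n_full = len(compliant) // window
    return [
        sum(compliant[i * window:(i + 1) * window]) / window
        for i in range(n_full)
    ]


def decay_slope(rates):
    """Ordinary least-squares slope of compliance rate versus window index.

    A negative slope is one crude signal that the mandate is losing influence
    as context accumulates. Illustrative proxy only, not the paper's MSD metric.
    """
    n = len(rates)
    if n < 2:
        raise ValueError("need at least two windows")
    xs = range(n)
    x_bar, y_bar = mean(xs), mean(rates)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, rates))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den


# Hypothetical 30-step log: full compliance early, eroding later.
log = [True] * 10 + [True] * 8 + [False] * 2 + [True] * 5 + [False] * 5
rates = compliance_by_window(log, window=10)  # [1.0, 0.8, 0.5]
slope = decay_slope(rates)                    # about -0.25 (floating-point)
```

A linear slope is the simplest summary of a downward trend; a real evaluation would likely also need to control for market regime, since some non-compliant-looking actions may be legitimate responses to changing conditions.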
The increasing deployment of LLMs as autonomous financial agents necessitates objective benchmarks to assess their long-term stability and adherence to mandates.
The reliability of autonomous financial agents directly impacts financial markets, investment strategies, and the potential for systemic risk if agents deviate from their core mandates.
The paper formalizes Mandate Salience Decay (MSD) and introduces FinPersona-Bench to measure it objectively, providing a way to assess the longitudinal psychometric stability of AI agents in financial contexts.
- AI safety researchers
- Financial institutions deploying AI agents
- Developers of robust AI governance frameworks
- Financial agents with unmitigated MSD
- Investors relying on unstable autonomous agents
- Companies with poor AI model lifecycle management
FinPersona-Bench will enable standardized evaluation and comparison of the long-term behavior of autonomous financial agents.
This could spur new AI architectures or mitigation techniques designed specifically to prevent Mandate Salience Decay in financial applications.
Successful mitigation of MSD could accelerate the adoption of fully autonomous financial systems, potentially reshaping market structures and regulatory oversight.
Source: arXiv cs.CL