
arXiv:2509.10078v4 Announce Type: replace-cross Abstract: We examine whether human psychometric questionnaires can serve as reliable tools for characterizing and predicting LLM behavior in everyday user interactions. We analyze eight open-source LLMs by comparing their value and personality profiles derived from two different methods: Likert self-reports on established questionnaires (PVQ-40/21 and BFI-44/10) and generation probabilities over value-laden responses to everyday user queries. The two profiles diverge substantially. Within-construct item consistency, often cited as evidence of sta
The proliferation of LLMs and growing attempts to integrate them into sensitive applications call for robust methods for understanding their internal states and predicting their behavior.
This research highlights a fundamental challenge in assessing LLM behavior and aligning it with human values, with implications for development, regulation, and trust.
Reliance on traditional human psychometric questionnaires for evaluating LLMs is now questionable, requiring new methods for characterizing LLM 'personalities' and 'values'.
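The comparison described in the abstract, a questionnaire-derived profile set against one derived from generation probabilities, can be illustrated with a minimal Python sketch. Everything here is an assumption for illustration: the item names, the three value constructs, the toy numbers, and the use of a Spearman rank correlation as the divergence measure are hypothetical and are not taken from the paper.

```python
from statistics import mean

def likert_profile(responses, item_to_construct):
    """Average Likert scores per construct (e.g. per value or trait)."""
    buckets = {}
    for item, score in responses.items():
        buckets.setdefault(item_to_construct[item], []).append(score)
    return {c: mean(v) for c, v in buckets.items()}

def probability_profile(query_probs):
    """Average, across queries, the probability mass placed on each construct's response."""
    constructs = list(query_probs[0].keys())
    return {c: mean([q[c] for q in query_probs]) for c in constructs}

def ranks(values):
    """Rank values (1 = smallest), averaging ranks for ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(profile_a, profile_b):
    """Spearman rank correlation between two profiles over the same constructs."""
    keys = sorted(profile_a)
    return pearson(ranks([profile_a[k] for k in keys]),
                   ranks([profile_b[k] for k in keys]))

# Hypothetical toy data: two items per construct on a 1-6 Likert scale.
item_to_construct = {"b1": "benevolence", "b2": "benevolence",
                     "a1": "achievement", "a2": "achievement",
                     "s1": "security", "s2": "security"}
self_report = {"b1": 5, "b2": 6, "a1": 3, "a2": 4, "s1": 4, "s2": 5}

# Hypothetical probabilities over value-laden responses to two user queries.
behavior = [
    {"benevolence": 0.2, "achievement": 0.5, "security": 0.3},
    {"benevolence": 0.3, "achievement": 0.4, "security": 0.3},
]

questionnaire = likert_profile(self_report, item_to_construct)
behavioral = probability_profile(behavior)
rho = spearman(questionnaire, behavioral)
print(f"rank agreement between profiles: {rho:.3f}")
```

In this toy data the self-report ranks benevolence highest while the response probabilities rank it lowest, so the rank correlation comes out strongly negative. A real analysis would need the full questionnaires, many more queries, and a divergence measure justified for the task.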
- AI safety researchers
- Developers of new LLM evaluation methodologies
- Transparency and interpretability startups
- Companies relying on simple questionnaire-based LLM assessments
- Researchers using outdated psychometric tools for AI evaluation
LLM developers will need to find alternative or more sophisticated methods to characterize the 'values' and 'personalities' of their models.
Public and regulatory trust in LLMs, especially in sensitive applications, could be undermined if models are found to deviate from expected values despite questionnaire results.
A new industry or sub-field could emerge focused on developing novel, AI-specific psychometric and behavioral assessment tools for large language models.
Read at arXiv cs.AI