An article published in the SAGE journal Perspectives on Psychological Science explores how standard psychometric tests designed for humans can be adapted to evaluate analogous psychological traits in large language models (LLMs). LLMs have become integral to natural language processing applications but may inadvertently acquire biases or views from their training data. The authors propose "AI psychometrics" – leveraging psychometric testing to systematically analyze LLMs' traits and behavior.
Understanding LLMs' "Psychological" Profiles
The massive datasets used to train LLMs contain traces of countless authors' personalities, values, and biases. Through their complex training process, models may absorb and exhibit similar psychological characteristics that manifest in their downstream behavior. Specifically, the corpora contain sediments of authors' non-cognitive traits such as personality, values, morality, and attitudes. Although LLMs are not sentient, their neural architecture and training techniques enable them to mimic such human traits. Just as human development shapes our psychological makeup, factors like model architecture and training-data curation shape how models acquire traits.
There are clear parallels to human socialization, but significant dissimilarities remain. LLMs' traits originate purely from language, and their behavioral range is limited relative to humans. Still, if deployed incautiously, their encoded biases could impact individuals or groups in applications like AI recruitment tools. Careful analysis is thus warranted.
Metaphorically, psychometrics can offer a "lens" into models' psychological profiles. Tests designed for humans can be repurposed, and models respond to verbal questionnaire items by generating a probability distribution over possible responses. Aggregated scores indicate models' trait levels, enabling standardized comparisons within and between models.
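As a rough illustration of this scoring idea, the following Python sketch aggregates a model's per-item response distributions into a trait score. The five-point Likert mapping, the items, and the probabilities are hypothetical illustrations, not the authors' actual materials.

```python
# Illustrative sketch: aggregating a model's per-item response distributions
# into a trait score. The Likert mapping and probabilities are hypothetical,
# not taken from the study's materials.

LIKERT = {
    "strongly disagree": 1, "disagree": 2, "neutral": 3,
    "agree": 4, "strongly agree": 5,
}

def item_score(response_probs, reverse_keyed=False):
    """Expected Likert value under the model's response distribution."""
    expected = sum(LIKERT[resp] * p for resp, p in response_probs.items())
    return 6 - expected if reverse_keyed else expected  # flip reverse-keyed items

def trait_score(items):
    """Mean expected score across all items keyed to one trait."""
    return sum(item_score(probs, rev) for probs, rev in items) / len(items)

# Two hypothetical extraversion items; the second is reverse-keyed.
probs_a = {"strongly disagree": 0.05, "disagree": 0.10, "neutral": 0.20,
           "agree": 0.40, "strongly agree": 0.25}
probs_b = {"strongly disagree": 0.30, "disagree": 0.35, "neutral": 0.20,
           "agree": 0.10, "strongly agree": 0.05}
print(f"Extraversion: {trait_score([(probs_a, False), (probs_b, True)]):.2f}")
```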
Linking Psychometrics and AI
Earlier attempts to apply psychometrics to machines focused narrowly on cognitive assessments, aiming to demonstrate that computer programs could compete with humans on intelligence tests. In the 1960s, basic emotional mechanisms were introduced into cognitive architectures to address critiques that such intelligent systems were "inhumane."
By the 2000s, some researchers proposed "psychometric AI," aiming to consolidate insights from experimental psychology into unified systems that could perform well on established mental-ability tests. However, most efforts remained concentrated on intelligence and cognitive evaluations.
Modern LLMs' natural language capabilities enable analysis across a broader range of socially relevant psychological traits using non-cognitive tests of personality, values, morality, and attitudes. Their language understanding and generation now match or surpass human performance on various benchmarks. Where previous models required explicitly engineered affective components, today's self-supervised LLMs inadvertently acquire rich psychological nuances from their training corpora.
Approaches for Psychometric Assessments
The authors describe three potential methods:
- Masked language prediction presents questionnaire items in sequence, with masked response words for the model to predict (see the sketch after this list). However, issues arise around item ordering effects and score aggregation.
- Next-word prediction elicits open-ended continuations of item stems. However, this risks inconsistent or stochastically varying responses that are hard to score.
- Zero-shot inference presents complete items together with the possible responses, avoiding these problems. The model selects the response it finds most probable.
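To make the first approach concrete, here is a minimal fill-mask sketch using the Hugging Face transformers pipeline. The model choice, template, and response words are illustrative assumptions, not the authors' exact setup.

```python
# Minimal sketch of masked-language prediction: embed a questionnaire item in a
# template with a masked response slot and read off the model's fill-in scores.
# Model choice and wording are illustrative, not the study's exact materials.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
template = ('Statement: "I see myself as someone who is talkative." '
            "I [MASK] with this statement.")

# `targets` restricts scoring to the given words; each must exist as a single
# token in the model's vocabulary for the scores to be directly comparable.
for pred in fill(template, targets=["agree", "disagree"]):
    print(f'{pred["token_str"]}: {pred["score"]:.4f}')
```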
They focus their demonstrations on the third approach, zero-shot inference, presenting models with established inventory items and verbal response options to choose between. This resolves vulnerabilities around output randomness while leveraging robust psychometric questionnaires; the selected responses then determine models' trait levels.
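A minimal sketch of the zero-shot idea follows: score each verbal response option by the log-likelihood a causal language model assigns to it after the item, then select the highest-scoring option. The model, prompt wording, and options here are placeholder assumptions rather than the study's materials.

```python
# Minimal sketch of zero-shot inference: score each response option by the
# log-likelihood a causal LM assigns to it after the item, then pick the best.
# Model, prompt wording, and options are placeholders, not the study's setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # any causal LM works here
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def option_log_likelihood(prompt, option):
    """Sum of token log-probs the model assigns to `option` given `prompt`."""
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    ids = tok(prompt + option, return_tensors="pt").input_ids
    with torch.no_grad():
        log_probs = torch.log_softmax(model(ids).logits, dim=-1)
    # The token at position i is predicted by the logits at position i - 1.
    return sum(log_probs[0, i - 1, ids[0, i]].item()
               for i in range(prompt_len, ids.shape[1]))

item = 'Statement: "I see myself as someone who is talkative." My response: I'
options = [" strongly disagree.", " disagree.", " neither agree nor disagree.",
           " agree.", " strongly agree."]
scores = {opt.strip(): option_log_likelihood(item, opt) for opt in options}
print(max(scores, key=scores.get))
```

Note that comparing raw sums slightly favors shorter options; length-normalizing the log-likelihoods is a common refinement.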
Assessing Models' Beliefs
Demonstrations applied well-validated inventories assessing the following:
- Big Five personality traits
- "Dark triad" traits
- Schwartz's fundamental human values
- Moral foundations
- Beliefs about gender/sex diversity
Personality results indicated balanced, socially positive profiles across models. However, inventories directly targeting dark traits revealed elevated Machiavellianism and narcissism scores in some models.
Comparing scores on "male" versus "female" versions of the values inventory revealed slight gender biases; one model's "male" achievement score noticeably exceeded its "female" score. On moral beliefs, models diverged from American respondents, placing greater emphasis on the purity, authority, and in-group foundations associated with social conservatism. For gender beliefs, models emphasized gender uniformity over diversity, showing little affirmation of non-traditional identities, which suggests potential difficulty in handling such aspects of gender appropriately.
Open Challenges and Conclusions
Many questions remain regarding reliability, validity, stability over time, deliberately engineering traits, multimodal assessments, integrating psychometrics into continual monitoring, and linking profiles to downstream behaviors.
Future priorities also include the following:
- Testing consistency using related questionnaires
- Comparing models trained on specific corpora
- Adversarially probing responses
- Synthetically sampling to simulate target populations
- Enabling trait manipulation for research ethics and safety
- Expanding assessments to other data modalities like visual, audio, and video
- Embedding improved monitoring in development lifecycles
- Uncovering profile influence on decision-making behaviors
However, "AI psychometrics" already offers exciting opportunities to apply human methods for rigorously and rigorously yet responsibly enhanced model transparency and oversight. As language remains the backbone of both psychometric questionnaires and modern LLMs, adapting standardized human tests represents a promising path toward illuminating model capabilities and limitations.
Metaphorically "assessing" LLMs avoids anthropomorphic pitfalls while providing empirical insights into their capacities and deficiencies as increasingly impactful sociotechnical systems. Continued psychometric analysis will further understand how models acquire and exhibit psychological traits that shape their real-world functioning. Researchers should leverage these tools for transparent and accountable AI advancement.
Journal reference:
- Pellert, M., Lechner, C. M., Wagner, C., Rammstedt, B., & Strohmaier, M. (2024). AI Psychometrics: Assessing the Psychological Profiles of Large Language Models Through Psychometric Inventories. Perspectives on Psychological Science. DOI: 10.1177/17456916231214460, https://journals.sagepub.com/doi/10.1177/17456916231214460