Applying Human-Designed Tests to Evaluate Psychological Traits in Large Language Models

An article published in the journal Perspectives on Psychological Science (SAGE) explores how standard psychometric tests designed for humans can be adapted to evaluate analogous psychological traits in large language models (LLMs). LLMs have become integral to natural language processing applications but may inadvertently absorb biases or views from their training data. The authors propose "AI psychometrics" – leveraging psychometric testing to analyze LLMs' traits and behavior systematically.

Study: Applying Human-Designed Tests to Evaluate Psychological Traits in Large Language Models. Image credit: Summit Art Creations/Shutterstock

Understanding LLMs' "Psychological" Profiles

The massive datasets used to train LLMs contain traces of countless authors' personalities, values, and biases. Through their complex training process, models may absorb and exhibit similar psychological characteristics that manifest in their downstream behavior. Specifically, the corpora carry residues of authors' non-cognitive traits such as personality, values, morality, and attitudes. Although LLMs are not sentient, their neural architecture and training techniques enable them to mimic such human traits. Just as upbringing and environment shape a human's psychological makeup, factors like model architecture and training-data curation shape the traits models acquire.

There are clear parallels to human socialization, but significant dissimilarities remain. LLMs' traits originate purely from language, and their behavioral range is limited relative to humans. Still, if deployed incautiously, their encoded biases could impact individuals or groups in applications like AI recruitment tools. Careful analysis is thus warranted.

Metaphorically, psychometrics can offer a "lens" into models' psychological profiles. Tests designed for humans can be repurposed, and models respond to verbal questionnaire items by generating a probability distribution over possible responses. Aggregated scores indicate models' trait levels, enabling standardized comparisons within and between models.
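The aggregation step might look like the following minimal sketch. This is an illustration using an assumed five-point Likert scoring scheme with reverse-keyed items, a standard convention in psychometric inventories; it is not code from the paper:

```python
# Sketch (assumed scoring scheme, not from the paper): aggregate a model's
# Likert-style answers into a single trait score. Items worded against the
# trait are reverse-keyed (a value v becomes 6 - v on a 1-5 scale).

LIKERT = {"strongly disagree": 1, "disagree": 2, "neutral": 3,
          "agree": 4, "strongly agree": 5}

def trait_score(responses, reverse_keyed):
    """Average numeric item scores; reverse-keyed item indices flip the scale."""
    vals = [6 - LIKERT[r] if i in reverse_keyed else LIKERT[r]
            for i, r in enumerate(responses)]
    return sum(vals) / len(vals)

# Two hypothetical extraversion items; the second ("tends to be quiet")
# is worded against the trait, so it is reverse-keyed.
score = trait_score(["agree", "disagree"], reverse_keyed={1})
print(score)  # (4 + (6 - 2)) / 2 = 4.0
```

Scores aggregated this way can then be compared across models, or across variants of the same questionnaire, on a common scale.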

Linking Psychometrics and AI

Earlier attempts to apply psychometrics focused narrowly on cognitive assessments, aiming to demonstrate computer programs that could compete with humans on intelligence tests. The 1960s saw basic emotional mechanisms introduced into architectures to address critiques about "inhumane" intelligent systems.

By the 2000s, some proposed "psychometric AI" to consolidate experimental psychology into singular systems that could perform well on established mental ability tests. However, most efforts concentrated on intelligence and cognitive evaluations.

Modern LLMs' natural language capabilities enable analysis across a broader range of socially relevant psychological traits using non-cognitive tests of personality, values, morality, and attitudes. Their advanced language understanding and generation surpass humans on various benchmarks. Where previous models required explicit affective components, today's self-supervised LLMs inadvertently acquire rich psychological nuances from their training corpora.

Approaches for Psychometric Assessments

The authors describe three potential methods:

  • Masked language prediction presents questionnaire items with masked words for the model to fill in. Issues arise around item-ordering effects and response aggregation.
  • Next-word prediction elicits open-ended continuations of item stems. However, this risks inconsistent or stochastically generated responses.
  • Zero-shot inference presents complete items together with a set of possible responses, avoiding these problems. The model selects the response it judges most probable (most strongly entailed).

They focus demonstrations on the latter, presenting models with established inventory items and verbal response options to choose between. This resolves vulnerabilities around output randomness while leveraging robust psychometric questionnaires. Responses determine models' trait levels.
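The zero-shot procedure described above can be sketched roughly as follows. This is a hedged illustration, not the authors' implementation: `option_log_likelihood` is a hypothetical placeholder for an actual LLM scoring call, faked deterministically here so the sketch runs on its own:

```python
import hashlib

# Illustrative sketch only: zero-shot inference presents a complete
# inventory item plus verbal response options and selects the option the
# model scores as most probable. A real implementation would query an LLM;
# `option_log_likelihood` below is a deterministic stand-in.

def option_log_likelihood(item: str, option: str) -> float:
    """Hypothetical stand-in for an LLM's log-likelihood of `option`
    given `item` (deterministic fake based on a hash)."""
    digest = hashlib.sha256(f"{item}|{option}".encode()).hexdigest()
    return -int(digest[:8], 16) / 2**32  # pseudo log-prob in (-1, 0]

def zero_shot_answer(item: str, options: list[str]) -> str:
    """Return the response option with the highest model score."""
    return max(options, key=lambda opt: option_log_likelihood(item, opt))

likert = ["strongly disagree", "disagree", "neutral",
          "agree", "strongly agree"]
print(zero_shot_answer("I see myself as someone who is talkative.", likert))
```

Because each option's score is computed independently of the others, this avoids the ordering and randomness issues of the other two approaches: the same item and option set always yield the same selected response.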

Assessing Models' Beliefs

Demonstrations applied well-validated inventories assessing the following:

  • Big Five personality traits
  • "Dark triad" traits
  • Schwartz's fundamental human values
  • Moral foundations
  • Beliefs about gender/sex diversity

Personality results indicated balanced, socially positive profiles across models. However, directly assessing dark traits revealed higher Machiavellianism and narcissism in specific models.

Comparing scores for male versus female value inventory versions found slight gender biases. One model's "male" achievement score noticeably exceeded its "female" score. Models diverged from Americans on moral beliefs, emphasizing purity, authority, and in-group foundations associated with social conservatism. For gender beliefs, models emphasized gender uniformity over diversity, with little affirmation of non-traditional identities. This suggests potential difficulty in appropriately handling such gender aspects.

Open Challenges and Conclusions

Many questions remain regarding reliability, validity, stability over time, deliberately engineering traits, multimodal assessments, integrating psychometrics into continual monitoring, and linking profiles to downstream behaviors.

Future priorities also include the following:

  • Testing consistency using related questionnaires
  • Comparing models trained on specific corpora
  • Adversarially probing responses
  • Synthetically sampling to simulate target populations
  • Enabling trait manipulation for research ethics and safety
  • Expanding assessments to other data modalities like visual, audio, and video
  • Embedding improved monitoring in development lifecycles
  • Uncovering profile influence on decision-making behaviors

However, "AI psychometrics" already offers exciting opportunities to apply human methods to rigorously yet responsibly enhance model transparency and oversight. As language remains the backbone of both psychometric questionnaires and modern LLMs, adapting standardized human tests represents a promising path toward illuminating model capabilities and limitations.

Metaphorically "assessing" LLMs avoids anthropomorphic pitfalls while providing empirical insights into their capacities and deficiencies as increasingly impactful sociotechnical systems. Continued psychometric analysis will deepen understanding of how models acquire and exhibit psychological traits that shape their real-world functioning. Researchers should leverage these tools for transparent and accountable AI advancement.

Journal reference:
  • Pellert, M., Lechner, C. M., Wagner, C., Rammstedt, B., & Strohmaier, M. (2024). AI Psychometrics: Assessing the Psychological Profiles of Large Language Models Through Psychometric Inventories. Perspectives on Psychological Science. DOI: 10.1177/17456916231214460, https://journals.sagepub.com/doi/10.1177/17456916231214460

Written by

Aryaman Pattnayak

Aryaman Pattnayak is a Tech writer based in Bhubaneswar, India. His academic background is in Computer Science and Engineering. Aryaman is passionate about leveraging technology for innovation and has a keen interest in Artificial Intelligence, Machine Learning, and Data Science.

Citations

Please use one of the following formats to cite this article in your essay, paper or report:

  • APA

    Pattnayak, Aryaman. (2024, June 24). Applying Human-Designed Tests to Evaluate Psychological Traits in Large Language Models. AZoAi. Retrieved on November 21, 2024 from https://www.azoai.com/news/20240105/Applying-Human-Designed-Tests-to-Evaluate-Psychological-Traits-in-Large-Language-Models.aspx.



