In a paper published in the journal Scientific Reports, researchers compared human creativity to that of three artificial intelligence (AI) chatbots using the alternate uses task (AUT). The study asked 256 human participants to generate creative uses for everyday objects. On average, the AI chatbots exceeded humans in generating creative responses; however, the best human ideas still equaled or surpassed those of the chatbots. The findings highlight the potential of AI to enhance creativity while underscoring the complexity of human creativity, which may be challenging for AI to fully replicate or exceed. The study offers insights into the evolving relationship between human and machine creativity in the context of AI's impact on creative work.
Background
The rise of generative AI tools has raised questions about their impact on society and human creativity, including concerns about employment, education, legal issues, and the nature of creativity itself. AI has already shown promise in domains such as chess and art, challenging conventional ideas about creativity.
Creativity is traditionally defined as the ability to generate original and useful ideas, and it is often evaluated through tasks measuring divergent thinking. Divergent thinking involves producing many ideas, which are assessed on criteria such as fluency, flexibility, originality, and elaboration. This study compares human and AI chatbot performance on a divergent thinking task to explore whether AI's vast memory and rapid database access enhance originality, and it highlights the importance of associative thinking and executive control in creativity.
Study Methods
Participants: Human data for the AUT were gathered through Prolific; 279 participants were included in the study after passing attention checks. The participants had an average age of 30.4 years and were from the United Kingdom, United States, Canada, and Ireland. None reported a history of head injury, current medication use, or ongoing mental health issues. Ethical guidelines were followed, with approval from the Ethics Committee for Human Sciences at the University of Turku.
AI Chatbots: Three AI chatbots, namely ChatGPT3.5 (referred to as ChatGPT3), ChatGPT4, and Copy.Ai, were tested on the AUT. Each chatbot was tested 11 times, with each session involving four different object prompts, resulting in 132 observations in total.
Procedure: The AUT comprised four object prompts: rope, box, pencil, and candle. Participants were instructed to prioritize quality over quantity and to come up with original and creative uses for these objects. Each object was presented for 30 seconds, during which participants typed in their ideas. The AI chatbots were instructed to generate a specific number of ideas and to limit their responses to 1-3 words to match the format of the human responses.
Scoring: The semantic distance between the object names and the responses was computed using five semantic models. Subjective creativity/originality ratings were collected from six human raters on a 5-point scale, with high inter-rater reliability. Separate linear mixed-effects analyses compared human and AI performance, considering factors such as group (human vs. AI), object, and fluency (number of responses). Post-hoc pairwise comparisons were adjusted for multiple comparisons.
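To make the semantic-distance scoring concrete, the sketch below computes the distance between an object prompt and a response as one minus the cosine similarity of their word embeddings. It uses a single pretrained GloVe model loaded through gensim as a stand-in for the five semantic models used in the study; the model name, the averaging of multi-word responses, and the preprocessing are illustrative assumptions rather than the authors' exact pipeline.

```python
# Minimal sketch of semantic-distance scoring for AUT responses.
# Assumption: one GloVe model via gensim stands in for the study's five models.
import numpy as np
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-300")  # pretrained word embeddings

def phrase_vector(phrase: str) -> np.ndarray:
    """Average the word vectors of a short (1-3 word) response."""
    words = [w for w in phrase.lower().split() if w in vectors]
    if not words:
        raise ValueError(f"No known words in: {phrase!r}")
    return np.mean([vectors[w] for w in words], axis=0)

def semantic_distance(object_name: str, response: str) -> float:
    """1 - cosine similarity between the object prompt and the response."""
    a, b = phrase_vector(object_name), phrase_vector(response)
    cosine = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return 1.0 - cosine

# A more remote use should yield a larger distance than an obvious one.
print(semantic_distance("rope", "climbing"))
print(semantic_distance("rope", "abstract sculpture"))
```

In this scheme, higher distances correspond to responses that are semantically further from the prompted object, which is the rationale for using semantic distance as an automated proxy for originality.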
Statistical Analyses: The analyses used linear mixed-effects models with fixed effects (Group, Object, and their interaction) and covariates such as Fluency. Type III analysis of variance results were obtained, and post-hoc pairwise comparisons were adjusted for multiple comparisons using the multivariate t-distribution (mvt) method.
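The sketch below shows the structure of such a mixed-effects model in Python with statsmodels. The file name aut_scores.csv and its column names are hypothetical placeholders for a long-format table of scores; the study itself does not specify this layout, and the Type III tables and mvt-adjusted post-hoc comparisons it reports are features typically obtained with R's lmer and emmeans rather than statsmodels.

```python
# Sketch of the mixed-effects model structure described above.
# Assumptions: hypothetical file 'aut_scores.csv' with columns
# score, Group (human/AI), Object, Fluency, and subject_id.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("aut_scores.csv")  # hypothetical data file, not from the paper

# Fixed effects: Group, Object, and their interaction, plus Fluency as a
# covariate; random intercepts for each participant or chatbot session.
model = smf.mixedlm("score ~ Group * Object + Fluency",
                    data=df, groups=df["subject_id"])
result = model.fit()
print(result.summary())

# Note: the mvt adjustment for post-hoc pairwise comparisons is provided by
# R's emmeans package; this sketch only illustrates the model fit.
```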
Study Results
Descriptive statistics and correlations: Descriptive statistics are presented for humans and the AI chatbots, averaged across all four object prompts. Semantic distance correlated moderately with the human subjective ratings for both the mean and max scores.
Overall Human and AI Performance: The AI chatbots outperformed humans in mean scores for both semantic distance and subjective ratings. Fluency had a negative effect on mean scores and a positive effect on max scores. The AI chatbots consistently produced responses that were more unusual yet still logical, whereas some human responses were not.
Differentiating Performance Between AI Chatbots and Objects: ChatGPT3 and ChatGPT4 obtained higher mean semantic distance scores than humans, with no significant differences between the AI chatbots. There were no statistically significant differences between humans and the AI chatbots in max scores. In the human subjective ratings, ChatGPT4 outperformed humans across most objects, while ChatGPT3 and Copy.Ai performed similarly to each other and better than humans; however, this superiority did not extend to responses for the pencil and candle.
Conclusion
To summarize, this study indicates that AI chatbots have achieved creative capabilities at least on par with those of the average human in the AUT, a commonly used test of divergent thinking. Although the AI chatbots generally outperformed humans, the top-performing humans can still compete. It is important to note that AI technology is advancing rapidly, and these results may change over time. The primary weakness in human performance was a higher prevalence of poor-quality ideas, likely reflecting variation in human ability and motivation, whereas such ideas were absent from the chatbot responses. Finally, the study focused on divergent thinking within the AUT, while recognizing that creativity is a multifaceted concept.