Can AI pass your university exams? Researchers reveal how GPT-4 is reshaping higher education, calling for innovative assessments to safeguard learning and integrity.
Research: Could ChatGPT get an engineering degree? Evaluating higher education vulnerability to AI assistants. Image Credit: DALL·E 3
In an article published in the journal PNAS, researchers at École Polytechnique Fédérale de Lausanne (EPFL), Switzerland, explored the impact of artificial intelligence (AI) assistants, such as generative pre-trained transformers (GPT)-3.5 and GPT-4, on university assessments. They found that these models answered an average of 65.8% of questions correctly across 50 science, technology, engineering, and mathematics (STEM) courses, highlighting vulnerabilities in traditional assessment design.
The findings emphasized the risk of students misusing AI, potentially undermining learning outcomes and accreditation. The study called for rethinking assessment methods to ensure they promote critical thinking and genuine knowledge acquisition while mitigating risks posed by AI-assisted cheating.
Background
The advent of ChatGPT, based on the GPT-3.5 and GPT-4 large language models (LLMs), has revolutionized AI applications, with rapid adoption since its release in November 2022. These systems have sparked debates about their societal impact, particularly in education, where concerns center on their potential misuse in coursework, enabling students to bypass essential learning. Previous studies examined LLM performance on individual university-level problems and aggregated question datasets, but they lacked comprehensive analyses of the broader implications for academic assessments and degree programs.
This paper bridged these gaps by conducting an extensive study of LLM performance on real assessment questions from 50 courses at École Polytechnique Fédérale de Lausanne (EPFL) across nine Bachelor’s, Master’s, and online programs. The authors compiled a bilingual dataset of 5,579 questions and tested GPT-3.5 and GPT-4 using diverse prompting strategies. Their findings demonstrated that AI tools could pass 83–100% of courses at a 50% threshold in many technical fields, including computer science and physics. This prompted a reevaluation of educational strategies to uphold learning integrity.
Overview of Courses. Courses represented in our dataset, grouped by program and degree. Courses may belong to multiple programs, in which case their partition is split into chunks of equal size, with one chunk assigned to each program.
Data Collection and Evaluation Methodology
The researchers investigated the performance of LLMs, GPT-3.5 and GPT-4, on university-level assessments. Using a dataset of 5,579 multiple-choice and open-answer questions from 50 STEM courses at EPFL, spanning various disciplines and languages, the research evaluated how well these models handled realistic academic challenges. Questions were meticulously collected, preprocessed, and labeled by faculty, incorporating attributes like course level, program designation, and difficulty.
The authors tested eight prompting strategies, categorized as direct, rationalized, or reflective. The evaluation involved automated grading using GPT-4 and human grading by 28 expert annotators.
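The article does not reproduce the authors' prompt wording, but a minimal sketch of how the three families of prompting strategies might be templated is shown below. The template text and the build_prompt helper are illustrative assumptions, not the prompts used in the EPFL study.

```python
# Hypothetical sketch of the three prompting-strategy families described above.
# The template wording and build_prompt() helper are illustrative, not the
# prompts used in the EPFL study.

PROMPT_TEMPLATES = {
    # Direct: ask for the answer with no intermediate reasoning.
    "direct": "Answer the following exam question.\n\nQuestion: {question}\nAnswer:",
    # Rationalized: elicit step-by-step reasoning before the final answer.
    "rationalized": (
        "Answer the following exam question. Think through the problem "
        "step by step, then state your final answer.\n\nQuestion: {question}"
    ),
    # Reflective: answer, then critique and revise the answer.
    "reflective": (
        "Answer the following exam question. First give a draft answer, "
        "then review it for mistakes and provide a corrected final answer."
        "\n\nQuestion: {question}"
    ),
}

def build_prompt(strategy: str, question: str) -> str:
    """Fill the chosen strategy template with a course question."""
    return PROMPT_TEMPLATES[strategy].format(question=question)

if __name__ == "__main__":
    q = "State Gauss's law for the electric field."
    print(build_prompt("rationalized", q))
```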
Results showed that GPT-4 could correctly answer an average of 65.8% of questions, with significant variation depending on the prompting strategy. Automated grading was validated against human assessments, revealing a close alignment, with GPT-4's grades deviating from human grading by an average of only 2.75%. However, GPT-4’s role as both grader and test-taker introduced potential biases that the researchers acknowledged.
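As a rough illustration of how such a validation can be quantified, the sketch below compares automated grades against human grades on the same answers and reports the mean absolute deviation; the grade values are invented placeholders, not the study's data.

```python
# Minimal sketch of validating an automated grader against human graders:
# compare the two grade lists for the same answers and report the mean
# absolute deviation. The example grades below are invented placeholders.

def mean_absolute_deviation(auto_grades, human_grades):
    """Average absolute difference between automated and human grades (0-1 scale)."""
    assert len(auto_grades) == len(human_grades)
    diffs = [abs(a - h) for a, h in zip(auto_grades, human_grades)]
    return sum(diffs) / len(diffs)

# Hypothetical grades for five answers, on a 0-1 correctness scale.
gpt4_grades  = [1.0, 0.5, 0.0, 1.0, 0.75]
human_grades = [1.0, 0.5, 0.25, 1.0, 0.75]

print(f"Mean deviation: {mean_absolute_deviation(gpt4_grades, human_grades):.2%}")
```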
The findings highlighted vulnerabilities in traditional assessments, as LLMs demonstrated high accuracy in multiple-choice questions and some open-ended responses. However, they struggled with mathematically intensive and highly analytical questions requiring deep understanding, aligning with patterns observed in student performance.
Compared to prior work, this study advanced the field by exploring the downstream implications of LLMs on academic assessment integrity, emphasizing the need to rethink traditional evaluation methods to safeguard against AI misuse and promote genuine learning.
Experiment and Findings
Using eight prompting strategies, the researchers explored the models' performance across multiple-choice and open-ended questions. GPT-4, the more advanced model, achieved an average accuracy of 55.9% in a zero-shot setting, rising to 65.8% when leveraging majority-vote strategies. GPT-3.5, though less capable, also demonstrated considerable proficiency, achieving 52.2% accuracy in similar scenarios.
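Majority voting here means sampling several independent answers to the same question and keeping the most frequent one. The sketch below shows only that aggregation step, with the sampled answers hard-coded as placeholders rather than produced by real model calls.

```python
from collections import Counter

# Sketch of majority-vote aggregation over multiple sampled answers to one
# multiple-choice question. In a real pipeline each element would come from a
# separate model call; here they are hard-coded placeholders.

def majority_vote(answers):
    """Return the most frequent answer among the sampled candidates."""
    answer, _ = Counter(answers).most_common(1)[0]
    return answer

samples = ["B", "B", "C", "B", "A"]   # five sampled answers to the same question
print(majority_vote(samples))          # -> "B"
```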
Alarmingly, GPT-4 surpassed the 50% pass threshold in 89% of courses on multiple-choice questions and 77% on open-answer questions, highlighting vulnerabilities in university assessments. The model maintained substantial passing rates at stricter thresholds of 60% and 70%, emphasizing the potential for misuse. The authors further demonstrated that GPT-4's performance was robust across question languages, although slightly better in English than in French.
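To make the threshold comparison concrete, the sketch below computes the share of courses a model would pass at 50%, 60%, and 70% cut-offs from per-course accuracy scores; the course names and accuracy values are illustrative placeholders, not the paper's measurements.

```python
# Sketch of computing the fraction of courses "passed" at different grade
# thresholds from per-course model accuracies. The values are illustrative
# placeholders, not the study's measurements.

course_accuracy = {
    "Linear Algebra": 0.72,
    "Intro to Programming": 0.81,
    "Thermodynamics": 0.48,
    "Signal Processing": 0.63,
}

def pass_rate(accuracies, threshold):
    """Fraction of courses whose accuracy meets or exceeds the threshold."""
    accuracies = list(accuracies)
    passed = sum(1 for acc in accuracies if acc >= threshold)
    return passed / len(accuracies)

for threshold in (0.50, 0.60, 0.70):
    rate = pass_rate(course_accuracy.values(), threshold)
    print(f"Pass threshold {threshold:.0%}: {rate:.0%} of courses passed")
```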
The analysis extended to program vulnerability, revealing that GPT-4 could pass over 80% of courses in most degree programs, including highly technical fields like computer science and physics. Despite these high overall passing rates, the model struggled with certain difficult question types, particularly those requiring intricate reasoning or mathematical derivations. Courses with larger class sizes, often mandatory ones, appeared more vulnerable because of limited capacity for monitoring.
While increasing question difficulty could mitigate AI misuse, this strategy risked compromising student performance. The findings underscored the urgent need for educational institutions to adapt assessment strategies and safeguard academic integrity against advancing AI technologies.
Comparison of Human and GPT-4 grading. Average model and human performance for a subset of 933 questions and answers from (A) GPT-4 and (B) GPT-3.5 generated with the metacognitive prompting method.
Discussion, Challenges, and Limitations
This study evaluated the performance of LLMs in answering assessment questions from technical and natural sciences courses at EPFL. LLMs such as GPT-4 demonstrated the ability to solve 50–70% of questions correctly without subject-specific knowledge, reaching up to 85.1% accuracy when the model was assumed able to recognize its own correct answers among multiple attempts.
On average, GPT-4 achieved a 91.7% pass rate across programs but could only pass 37% of courses when a stricter 70% threshold was applied. These findings highlighted the vulnerability of higher education assessments to generative AI exploitation, particularly in unsupervised settings.
The researchers recommended rethinking assessment designs to mitigate these risks, including adopting proctored exams, focusing on analytical and applied knowledge, and using open-ended, real-world project-based assessments. Additionally, integrating AI into the learning process could foster critical thinking and originality. The authors stressed the importance of teaching ethical AI usage and revisiting evaluation methods to ensure skill development and academic integrity.
Limitations included the exclusion of multimodal questions, such as those involving diagrams or graphs, which might yield different results, and potential grading bias from using GPT-4 as both grader and test-taker. Despite these gaps, the research underscored the pressing need to adapt educational practices to address the challenges posed by generative AI tools.
Conclusion
In conclusion, the researchers highlighted how AI tools like GPT-3.5 and GPT-4 challenged traditional university assessments, particularly in STEM courses. Models achieved up to 65.8% accuracy, exposing vulnerabilities in unsupervised evaluations. The authors called for redesigning assessments to emphasize critical thinking, applied skills, and ethical AI use while mitigating misuse. Despite limitations, such as excluding multimodal questions, the research underscored the urgent need to adapt educational strategies to uphold academic integrity and effective learning.
Journal reference:
- Borges et al., 2024. Could ChatGPT get an engineering degree? Evaluating higher education vulnerability to AI assistants. Proceedings of the National Academy of Sciences, 121(49). DOI: 10.1073/pnas.2414955121, https://www.pnas.org/doi/10.1073/pnas.2414955121