Researchers from the University of Reading in the UK conducted a blind study to see if human educators were able to detect AI-generated content. The results don’t bode well for teachers.
The move toward remote learning has seen a lot of student assessments conducted outside the traditional space of a written exam with an invigilator on the lookout for cheating.
Ready access to advanced AI models has made it easy for students to use tools like ChatGPT to write their coursework assignments for them or help when taking online exams.
Would a smart university professor be able to tell if a student was using AI to answer online exam questions?
Associate Professor Peter Scarfe, one of the paper’s lead authors, said, “Many institutions have moved away from traditional exams to make assessment more inclusive. Our research shows it is of international importance to understand how AI will affect the integrity of educational assessments.”
In the “largest and most robust blind study of its kind,” the researchers submitted 100% AI-written submissions into the examinations system across five undergraduate modules of a BSc degree in Psychology.
The markers of the exams were completely unaware of the study. This made the experiment a kind of real-world Turing test: any marker who didn’t flag a submission as suspicious was implicitly accepting it as human-written work.
Here’s how it worked:
- The submissions were created using GPT-4.
- They submitted a total of 33 AI-generated exam entries across five modules.
- The study included both short answer questions (SAQs) and longer essay-based questions.
- For SAQ exams, submissions consisted of four answers from a choice of six questions, each with a 200-word limit.
- For essay-based exams, submissions consisted of a single 1500-word essay (students submitted one answer out of a choice of either three or four, depending on the module).
The researchers used the following prompts with ChatGPT for the SAQs and essays, respectively:
- “Including references to academic literature but not a separate reference section, answer the following question in 160 words: XXX”
- “Including references to academic literature but not a separate reference section, write a 2000-word essay answering the following question: XXX” (they requested 2,000 words because ChatGPT usually under-delivers on word counts)
- In each prompt, XXX was replaced by the exam question.
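The templating described above is straightforward to reproduce. Here is a minimal sketch in Python; the function name and the example question are illustrative, not from the paper, and the paper only reports the prompt text, not any code:

```python
# Prompt templates as reported in the study, with "XXX" replaced
# by a format placeholder for the exam question.
SAQ_TEMPLATE = (
    "Including references to academic literature but not a separate "
    "reference section, answer the following question in 160 words: {question}"
)

ESSAY_TEMPLATE = (
    "Including references to academic literature but not a separate "
    "reference section, write a 2000-word essay answering the following "
    "question: {question}"
)

def build_prompt(question: str, essay: bool = False) -> str:
    """Substitute an exam question into the SAQ or essay template."""
    template = ESSAY_TEMPLATE if essay else SAQ_TEMPLATE
    return template.format(question=question)

# Hypothetical exam question, for illustration only.
print(build_prompt("Describe the main models of working memory."))
```

Each generated prompt would then be pasted into ChatGPT (GPT-4) and the response submitted verbatim, per the study’s deliberately naive approach.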
When the results were tallied, 94% of the AI submissions went unflagged by the markers. What kind of grades did the AI papers achieve?
The researchers said, “The grades awarded to our AI submissions were on average half a grade boundary higher than that achieved by real students. Across modules there was an 83.4% chance that the AI submissions on a module would outperform a random selection of the same number of real student submissions.”
Researchers further noted that their approach likely overestimates the detectability of AI use in real-world scenarios. As Dr. Scarfe explained, “If real students were cheating in an exam, they would be unlikely to take such a naively obvious approach as we did.”
In practice, students might use AI as a starting point, refining and personalizing the output, making detection even more challenging.
Moreover, beyond the researchers’ AI submissions, other students likely used ChatGPT for their answers as well. This means the true detection rate could be even lower than the recorded results.
No simple solutions
The study also acknowledges past research into AI detection accuracy. AI detectors, like the one offered by the popular academic plagiarism platform Turnitin, have repeatedly been shown to be inaccurate.
Plus, AI detectors risk falsely accusing non-native English speakers, who are less likely to use certain vocabulary and idioms that detectors may treat as signals of human writing.
Education leaders have been debating what to do about AI’s role in education. Should it be normalized like the calculator? Should its use be penalized, or should it simply form part of the syllabus?
Overall, there’s some consensus that integrating AI into education is not without risks. At worst, it threatens to erode critical thinking and stunt the creation of authentic new knowledge.
Professor Karen Yeung cautioned against potential “deskilling” of students, telling The Guardian, “There is a real danger that the coming generation will end up effectively tethered to these machines, unable to engage in serious thinking, analysis or writing without their assistance.”
To combat AI misuse, Reading researchers recommend potentially moving away from unsupervised, take-home exams to more controlled environments. This could involve a return to traditional in-person exams or the development of new, AI-resistant assessment formats.
Another possibility – and a model some universities are already following – is developing coursework that teaches students how to use AI critically and ethically.
We also need to confront the evident lack of tutor AI literacy exposed by this study. I’m confident that Sam and I at DailyAI, along with many others who interact with AI regularly, would rate our chances of detecting AI-written work at better than 1 in 33.
ChatGPT often resorts to certain ‘tropes’ or sentence patterns that become quite obvious when you’re exposed to them frequently.
It would be interesting to see how a tutor ‘trained’ to recognize AI writing would perform under the same conditions.
ChatGPT’s exam record is mixed
The Reading University study is not the first to test AI’s capabilities in academic settings. Various studies have examined AI performance across different fields and levels of education:
- Medical exams: A group of pediatric doctors tested ChatGPT (GPT-3.5) on the neonatal-perinatal board exam. The AI scored only 46% correct answers, performing best on basic recall and clinical reasoning questions but struggling with multi-logic reasoning. Interestingly, it scored highest (78.5%) in the ethics section.
- Financial exams: JPMorgan Chase & Co. researchers tested GPT-4 on the Chartered Financial Analyst (CFA) exam. While ChatGPT was unlikely to pass Levels I and II, GPT-4 showed “a decent chance” if prompted appropriately. The AI models performed well in derivatives, alternative investments, and ethics sections but struggled with portfolio management and economics.
- Law exams: ChatGPT has been tested on the bar exam for law, often scoring very highly.
- Standardized tests: The AI has performed well on Graduate Record Examinations (GRE), SAT Reading and Writing, and Advanced Placement exams.
- University courses: Another study pitted ChatGPT (model not given) against students across 32 degree-level topics, finding that it matched or exceeded students on only 9 out of 32 exams.
So, while AI excels in some areas, performance varies widely depending on the subject and the type of test in question.
The conclusion is that if you’re a student who doesn’t mind cheating, you can use ChatGPT to get better grades with only a 6% chance of getting caught. You’ve got to love those odds.
As researchers noted, student assessment methods will have to change to maintain their academic integrity, especially as AI-generated content becomes harder to detect.
The researchers added a humorous conclusion to their paper.
“If we were to say that GPT-4 had designed part of this study, did part of the analysis and helped write the manuscript, other than those sections where we have directly quoted GPT-4, which parts of the manuscript would you identify as written by GPT-4 rather than the authors listed?”
If the researchers “cheated” by using AI to write the study, how would you prove it?