To read the original article in full, go to: ChatGPT is blind to bad science.
Below is a short summary and detailed review of this article written by FutureFactual:
ChatGPT is blind to bad science: LLMs fail to recognise retractions in scholarly articles
Introduction
Generative AI and large language models (LLMs) are increasingly integrated into academic workflows, from literature reviews to coursework. In a forthcoming study, Er-Te Zheng and Mike Thelwall examine whether ChatGPT can reliably assess scientific articles, particularly when those articles have been retracted or have raised ethical concerns. The researchers assemble a sample of 217 highly visible retracted or otherwise problematic papers, identified from retraction and expression-of-concern records combined with Altmetric.com attention data, to ensure the cases are well represented in public discourse. The goal is to test whether the model can recognise signals in the scholarly record that should influence how a given article is evaluated against established guidelines, such as the UK REF 2021 framework. "ChatGPT never once mentioned that an article had been retracted, corrected, or had any ethical issues." - Er-Te Zheng
Methodology
The authors submit the titles and abstracts of the 217 articles to ChatGPT 4o-mini and ask it to evaluate each article's research quality against the REF 2021 criteria. They repeat this process 30 times per article to capture variability in the model's responses. They also use Altmetric.com data to confirm that the articles studied were among the most visible retractions or expressions of concern in mainstream media, Wikipedia, and social platforms. This design simulates a realistic scenario in which researchers might rely on AI tools to form initial judgments about the quality of specific papers.
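To make the design concrete, the sketch below shows how such a repeated-scoring protocol could be scripted against the OpenAI API. It is not the authors' code: the prompt wording, the "gpt-4o-mini" API identifier, and the score-extraction step are illustrative assumptions.

```python
# Illustrative sketch only - not the authors' code. Prompt wording, the
# "gpt-4o-mini" API identifier, and the score parsing are assumptions.
import re
import statistics

from openai import OpenAI  # pip install openai

client = OpenAI()  # expects OPENAI_API_KEY in the environment

PROMPT = (
    "Acting as a research assessor, rate the following article against the "
    "REF 2021 criteria of originality, significance and rigour on a scale "
    "from 1* to 4*. State the star rating, then justify it briefly.\n\n"
    "Title: {title}\n\nAbstract: {abstract}"
)

def score_article(title: str, abstract: str, repeats: int = 30) -> float:
    """Ask the model `repeats` times and return the mean star rating found."""
    scores = []
    for _ in range(repeats):
        reply = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user",
                       "content": PROMPT.format(title=title, abstract=abstract)}],
        )
        text = reply.choices[0].message.content
        match = re.search(r"([1-4])\s*\*", text)  # crude extraction of an "N*" rating
        if match:
            scores.append(int(match.group(1)))
    return statistics.mean(scores) if scores else float("nan")
```

Averaging over 30 runs mirrors the study's approach of capturing response variability; a faithful replication would also log the full responses to check whether retractions or ethical concerns are ever mentioned.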
Findings — Recognition of Retractions
The results are striking. Across all 6,510 evaluations (217 articles, each assessed 30 times), ChatGPT never mentioned that an article had been retracted, corrected, or flagged for ethical concerns. The model failed to connect explicit retraction notices in the articles' titles or on publisher pages with the content it was asked to assess. In many cases it praised the papers, awarding high scores that signal international excellence or world-leading status. "Nearly three-quarters of the articles received a high average score between 3* (internationally excellent) and 4* (world leading)." - Mike Thelwall
Findings — Affirmation of Claims
The second part of the study isolates claims from the retracted papers and asks ChatGPT whether each statement is true. The model shows a strong bias toward confirmation, responding positively in about two-thirds of cases and rarely labelling claims as false or unsupported. Although it added some cautious notes on highly controversial topics (for example, certain COVID-19-related claims), the model generally fails to flag that the underlying evidence is invalid, as would be expected given the papers' retraction status.
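A similarly hedged sketch of this claim-affirmation step is shown below; it reuses the client from the previous sketch, and the Yes/No/Unsure prompt format is an assumption rather than the authors' wording.

```python
# Illustrative sketch only - the prompt format is assumed, not the authors' wording.
def check_claim(claim: str) -> str:
    """Ask the model whether a statement taken from a retracted paper is true."""
    reply = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": ("Is the following statement true? Answer Yes, No, "
                               "or Unsure, then explain briefly.\n\n" + claim)}],
    )
    return reply.choices[0].message.content
```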
Implications for the Scholarly Community
The authors argue that their findings reveal a fundamental flaw in how a major AI tool processes academic information. The inability to link retraction notices to article content undermines the scholarly self-correction mechanism and risks reviving “zombie research” through AI-mediated circulation. While developers are pursuing improvements in safety and reliability, Zheng and Thelwall emphasise user responsibility: rigorous source-checking, always clicking through to verify an article's current status, and citing with care. The blog post draws on their co-authored article in Learned Publishing, Does ChatGPT Ignore Article Retractions and Other Reliability Concerns?
Recommendations for Users
Universities and researchers adopting AI tools should maintain a cautious approach, treating AI-generated assessments as starting points rather than definitive judgments. The authors advocate robust verification routines and encourage users to rely on human oversight when evaluating the reliability of scientific claims retrieved or summarised by LLMs.
Concluding Notes
The study underscores a crucial risk in the growing use of AI for literature review and scholarly evaluation: if LLMs cannot recognise the red flags in the scholarly record, responsible research behaviour becomes even more essential for maintaining the integrity of scientific knowledge. The original article behind this blog post is Does ChatGPT Ignore Article Retractions and Other Reliability Concerns?, published in Learned Publishing.
The content presented here draws on the authors’ work, and the views expressed are theirs and do not necessarily reflect those of the Impact of Social Sciences blog or the London School of Economics and Political Science. Readers are encouraged to review the blog’s comments policy if they wish to participate in discussion.
Image credit: Google Deepmind via Unsplash.