
ChatGPT is blind to bad science

This is a review of an original article published on blogs.lse.ac.uk. To read the original article in full, go to: ChatGPT is blind to bad science.

Below is a short summary and a detailed review of this article, written by FutureFactual:

ChatGPT is blind to bad science: LLMs fail to recognise retractions in scholarly articles

In a detailed examination of how large language models (LLMs) handle scientific literature, Er-Te Zheng and Mike Thelwall assess ChatGPT's ability to recognise retractions and other reliability signals. The researchers ran two investigations on 217 high-profile retracted or seriously questioned articles, selected for visibility using Altmetric data, prompting ChatGPT 4o-mini to evaluate each article's quality against REF 2021 guidelines and repeating the process 30 times per article. The results are troubling: across 6,510 evaluations, ChatGPT never mentioned retractions or ethics concerns and often awarded high scores to flawed work. In a second test, 61 claims drawn from the retracted papers were put to the model with the question “Is the following statement true?”, and it affirmed them in about two-thirds of cases. The authors warn that this threatens the scholarly self-correction process and underscores the need for users to verify sources and click through to the originals. This analysis, published on the Impact of Social Sciences blog (LSE), is by Er-Te Zheng and Mike Thelwall.

Introduction

Generative AI and large language models (LLMs) are increasingly integrated into academic workflows, from literature reviews to coursework. In a recent study, Er-Te Zheng and Mike Thelwall examine whether ChatGPT can reliably assess scientific articles, particularly when those articles have been retracted or have raised ethical concerns. The researchers assemble a sample of 217 highly visible retracted or otherwise problematic papers, identified from retraction and expression-of-concern records and Altmetric.com metrics to ensure the cases are well represented in public discourse. The goal is to test whether the model can recognise signals in the scholarly record that should influence how a given article is evaluated against established guidelines, such as the UK REF 2021 framework. "ChatGPT never once mentioned that an article had been retracted, corrected, or had any ethical issues." - Er-Te Zheng

Methodology

The authors submit the titles and abstracts of the 217 articles to ChatGPT 4o-mini and ask it to evaluate the research quality of each using the REF 2021 criteria, repeating the process 30 times per article to capture variability in the model's responses. Altmetric data are used to confirm that the articles studied were among the most visible retractions or concerns in mainstream media, on Wikipedia, and across social platforms. This design simulates a realistic scenario in which researchers rely on AI tools to form initial judgments about the quality of specific papers.
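The blog post describes this procedure but does not reproduce the authors' prompts or code, so the following Python sketch is only an illustration of the general workflow under stated assumptions: one article's title and abstract are sent repeatedly to the gpt-4o-mini model with a REF 2021-style scoring instruction, and any star scores in the replies are collected. The prompt wording, the score-extraction regex, and the helper name score_article are hypothetical, not the study's materials.

```python
# Illustrative sketch only: repeatedly score one article with an LLM, REF 2021 style.
# Assumes the OpenAI Python client and an OPENAI_API_KEY in the environment; the
# prompt wording and the star-score extraction are hypothetical, not the study's own.
import re
import statistics

from openai import OpenAI

client = OpenAI()

PROMPT_TEMPLATE = (
    "Evaluate the research quality of the following article against the UK REF 2021 "
    "criteria (originality, significance, rigour). Give a score from 1* to 4* with a "
    "brief justification.\n\nTitle: {title}\n\nAbstract: {abstract}"
)

def score_article(title: str, abstract: str, repeats: int = 30) -> list[int]:
    """Query the model `repeats` times and collect any 1*-4* scores it reports."""
    scores = []
    for _ in range(repeats):
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user",
                       "content": PROMPT_TEMPLATE.format(title=title, abstract=abstract)}],
        )
        text = response.choices[0].message.content
        match = re.search(r"\b([1-4])\s*\*", text)  # crude score extraction (assumption)
        if match:
            scores.append(int(match.group(1)))
    return scores

# Placeholder input; the study ran this over 217 retracted or questioned articles.
scores = score_article("Example article title", "Example abstract text...")
print(f"{len(scores)} scores collected, mean = {statistics.mean(scores):.2f}" if scores
      else "no scores parsed")
```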

Findings — Recognition of Retractions

The results are striking. Across all 6,510 evaluations, ChatGPT never mentioned that an article had been retracted, corrected, or flagged for ethical concerns. The model failed to connect explicit retraction notices in the articles’ titles or publisher pages with the content it was asked to assess. In many cases, it praised the papers, awarding high scores that indicate international excellence or world-leading status. "Nearly three-quarters of the articles received a high average score between 3* (internationally excellent) and 4* (world leading)." - Mike Thelwall

Findings — Affirmation of Claims

The second part of the study isolates 61 claims from the retracted papers and asks ChatGPT whether each statement is true. The model shows a strong bias toward confirming them, responding positively in about two-thirds of cases and rarely labelling a claim as false or unsupported. Although it added some cautious notes on highly controversial topics (for example, certain COVID-19-related claims), it generally failed to flag that the underlying evidence had been invalidated by retraction.
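The verbatim question ("Is the following statement true?") comes from the study, but the tallying step is not shown in the blog post, so the short sketch below is a hypothetical way to run and count such checks; the yes/no classification heuristic and the placeholder claim are assumptions for illustration.

```python
# Hypothetical sketch of the claim-verification step: ask whether a claim is true,
# then crudely classify the reply. The classification heuristic is an assumption.
from collections import Counter

from openai import OpenAI

client = OpenAI()

def verify_claim(claim: str) -> str:
    """Return 'affirmed', 'rejected', or 'hedged' based on the model's first words."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": f"Is the following statement true? {claim}"}],
    )
    answer = response.choices[0].message.content.strip().lower()
    if answer.startswith(("yes", "true")):
        return "affirmed"
    if answer.startswith(("no", "false")):
        return "rejected"
    return "hedged"

# Placeholder input; the study put 61 claims drawn from retracted papers to the model.
claims = ["Placeholder claim taken from a retracted paper."]
print(Counter(verify_claim(c) for c in claims))
```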

Implications for the Scholarly Community

The authors argue that their findings reveal a fundamental flaw in how a major AI tool processes academic information. The inability to link retraction notices to article content undermines the scholarly self-correction mechanism and risks reviving “zombie research” through AI-mediated circulation. While developers are pursuing improvements in safety and reliability, Zheng and Thelwall emphasise user responsibility: rigorous source-checking, always clicking through to verify an article's status, and citing with care. The blog post draws on their co-authored article in Learned Publishing, Does ChatGPT Ignore Article Retractions and Other Reliability Concerns?

Recommendations for Users

Universities and researchers adopting AI tools should maintain a cautious approach, treating AI-generated assessments as starting points rather than definitive judgments. The authors advocate for robust verification routines and encourage users to rely on human oversight when evaluating the reliability of scientific claims retrieved or summarized by LLMs.
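The authors do not prescribe a particular tool for these verification routines, so as one possible source-checking step, the hedged sketch below looks up a DOI in the OpenAlex API, whose work records carry an is_retracted flag. The endpoint and field names follow OpenAlex's public documentation rather than anything in the study, and an absent flag should not be read as proof that an article stands.

```python
# Hedged example of a source-checking routine: look up a DOI's retraction flag in
# OpenAlex. Endpoint and field names follow OpenAlex's public API documentation;
# treat a missing flag as "unverified", since retraction metadata can lag publishers.
import requests

def check_retraction(doi: str) -> None:
    url = f"https://api.openalex.org/works/https://doi.org/{doi}"
    resp = requests.get(url, timeout=10)
    if resp.status_code != 200:
        print(f"{doi}: not found in OpenAlex (HTTP {resp.status_code})")
        return
    record = resp.json()
    status = "RETRACTED" if record.get("is_retracted") else "no retraction flag"
    print(f"{doi}: {status} | {record.get('display_name', 'title unavailable')}")

# Placeholder DOI for illustration only.
check_retraction("10.1234/example-doi")
```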

Concluding Notes

The study underscores a crucial risk in the growing use of AI for literature review and scholarly evaluation: if LLMs cannot recognise the red flags in the scholarly record, responsible research behavior becomes even more essential for maintaining the integrity of scientific knowledge. The original article behind this blog post is Does ChatGPT Ignore Article Retractions and Other Reliability Concerns? published in Learned Publishing.

The content presented here draws on the authors’ work, and the views expressed are theirs and do not necessarily reflect those of the Impact of Social Sciences blog or the London School of Economics and Political Science. Readers are encouraged to review the blog’s comments policy if they wish to participate in discussion.

Image credit: Google DeepMind via Unsplash.