In today’s digital landscape, discerning credible online content has become increasingly difficult as misinformation proliferates. Every day, millions of articles promoting pseudoscience and fake news appear online.
We ask: Can LLMs perform human-aligned credibility assessments at the article level, and how can interface design foster critical engagement rather than blind trust?

Online information credibility has become an increasingly important issue given the current polarized political climate and the global shift to online information. Studies consistently show that today’s adults have trouble identifying misinformation.

Existing tools depend on expert human evaluators, cover only a small portion of the 2 million active websites, and provide only domain-level credibility scores or labels rather than evaluating the specific article a user is reading.
We explored multiple potential mediums for CredBot, including an AR phone app that overlays credibility ratings on the phone, VR web browsing, and a Chrome Extension.
8/10 users chose Chrome Extension
We decided to go with the Chrome Extension because of its accessibility and relatively low development time, while keeping in mind the drawback of longer reading time, which we address later in our design.
Wireframing

Low Fidelity Iterations
Credibility Signals
Prompt Engineering
To increase accuracy, we implemented an overall system prompt with instructions and concurrently evaluated each signal, with each signal mapped to its own prompt. Below are examples of inputs and outputs for two signals: Title Representativeness and Calibration of Confidence. We experimented with both commercial LLMs, such as GPT-4o mini and o3-mini, and open-source LLMs, DeepSeek R1 and Qwen3.

Signal(
    name="Calibration of Confidence",
    definition=(
        "What to look for: Does the author use language that appropriately matches how certain they should be about their claims?\n"
        "High credibility: The author uses qualifying language like 'suggests,' 'appears to,' 'preliminary evidence indicates' for uncertain claims, "
        "and stronger language like 'demonstrates' or 'proves' only for well-established facts. The author's confidence in their claims is well justified.\n"
        "Low credibility: The author states uncertain things as absolute facts, or uses overly confident language ('definitely,' 'undoubtedly') for complex issues that experts still debate."
    )
),
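To illustrate how this signal-per-prompt setup might be wired together, here is a minimal sketch assuming the OpenAI Python SDK; the Signal dataclass, the SYSTEM_PROMPT wording, and the helper names evaluate_signal and evaluate_article are illustrative assumptions rather than CredBot's actual implementation.

# Minimal sketch: one prompt per credibility signal, evaluated concurrently.
# Names and prompt wording here are illustrative assumptions.
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from openai import OpenAI  # assumes the official OpenAI Python SDK

@dataclass
class Signal:
    name: str
    definition: str

SYSTEM_PROMPT = (
    "You are a credibility evaluator. For the given signal, rate the article "
    "as High, Medium, or Low credibility and explain your reasoning with "
    "specific evidence from the text."
)

client = OpenAI()

def evaluate_signal(signal: Signal, article_text: str, model: str = "gpt-4o-mini") -> str:
    """Send one signal definition plus the article to the LLM and return its evaluation."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Signal: {signal.name}\n{signal.definition}\n\nArticle:\n{article_text}"},
        ],
    )
    return response.choices[0].message.content

def evaluate_article(signals: list[Signal], article_text: str) -> dict[str, str]:
    """Evaluate all signals concurrently, one prompt per signal."""
    with ThreadPoolExecutor(max_workers=len(signals)) as pool:
        futures = {s.name: pool.submit(evaluate_signal, s, article_text) for s in signals}
    return {name: future.result() for name, future in futures.items()}

Keeping each signal in its own request keeps the prompts short and lets the evaluations run in parallel, mirroring the concurrent evaluation described above.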
Credibility Banner - Iterations
Highlighting

Chatbot
Data Preparation


LLM Evaluation & Results
We tested CredBot’s architecture with five LLM models: GPT-4o, GPT-4o mini, o3-mini, Qwen3, and DeepSeek-V3, chosen to span a range of cost, accessibility, and efficiency. We calculated the inter-rater reliability (IRR) score to assess each LLM’s agreement with the ground truth. The results in the table below show moderate agreement for GPT-4o mini and substantial agreement for GPT-4o, o3-mini, Qwen3, and DeepSeek-V3. These agreement rates validate that LLMs can correctly interpret credibility signals within the same analytical framework used by labelers, and highlight open-source models’ potential as capable alternatives to commercially available LLMs.
Model          Human Agreement   IRR
GPT-4o-mini    70.26%            0.55
GPT-4o         78.10%            0.67
o3-mini        81.70%            0.67
Qwen3          81.70%            0.68
DeepSeek-V3    82.68%            0.70
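For reference, the sketch below shows one way the agreement figures above could be computed, assuming the IRR metric is Cohen's kappa (where 0.41-0.60 is conventionally read as moderate agreement and 0.61-0.80 as substantial) via scikit-learn; the function name, variable names, and toy labels are hypothetical.

# Minimal sketch: percent agreement and Cohen's kappa between LLM labels and
# human ground-truth labels, assuming kappa is the IRR measure reported above.
from sklearn.metrics import cohen_kappa_score

def score_model(human_labels: list[str], llm_labels: list[str]) -> tuple[float, float]:
    """Return (percent agreement, Cohen's kappa) for one model against the ground truth."""
    agreement = sum(h == m for h, m in zip(human_labels, llm_labels)) / len(human_labels)
    kappa = cohen_kappa_score(human_labels, llm_labels)
    return agreement, kappa

# Toy example with hypothetical signal-level labels:
human = ["High", "Low", "High", "Medium", "Low"]
model = ["High", "Low", "Medium", "Medium", "Low"]
print(score_model(human, model))  # prints the raw agreement and kappa for this toy data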
Qualitative examination of the LLMs’ explanations reveals that the models are able to identify signals under the given definitions and provide systematic, evidence-based reasoning by referencing specific textual details. Below are two examples of signal evaluations from o3-mini and DeepSeek-V3, the two models with the highest IRR scores.
"Mason County Commission honors local track and field champions"
Clickbait Title - High:
The title clearly and informatively states the subject and event without vague or provocative language, aligning with the labeler’s evaluation of the article as factual local news.
"Insanity: ESPN Announcer Apologizes for Calling America 'Great' During WNBA Broadcast"
Calibration of Confidence - Low:
The article contains statements with overly confident language ("undoubtedly," "definitely") regarding complex issues that are still debated, without sufficient qualifying language for uncertain claims
However, a review of disagreements reveals several weaknesses. Below are examples that provide insight into the LLMs’ areas of weakness, consisting of summaries of the LLMs’ mistakes, the LLM explanations and labels, and the human explanations and labels.

Conclusion
Overall, CredBot demonstrates the potential of LLMs to perform article-level credibility assessments in a scalable and cost-effective manner. By automating evaluations typically performed by human experts, CredBot offers a practical alternative designed to assist and educate users rather than replace human judgment. CredBot achieves an IRR of up to 0.70 with our human evaluators, laying the groundwork for future research on AI-driven solutions for online credibility and illustrating the potential of LLMs to interpret complex credibility signals effectively.
Future Work