In today’s digital landscape, discerning credible online content has become increasingly difficult as misinformation proliferates. Every day, millions of articles promoting pseudoscience and fake news appear online.
We ask: Can LLMs perform human-aligned credibility assessments at the article level, and how can interface design foster critical engagement rather than blind trust?

Online information credibility has become an increasingly important issue given the current polarized political climate and the global shift to online information. Studies consistently show that today’s adults have trouble identifying misinformation.

Existing tools depend on expert human evaluators, cover only a small portion of the 2 million active websites, and provide only domain-level credibility scores or labels rather than evaluating the specific article a user is reading.
We explored multiple potential mediums for CredBot, including an AR phone app that overlays credibility ratings on the phone, VR web browsing, and a Chrome Extension.
8/10 users chose Chrome Extension
We decided to go with the Chrome Extension because of its accessibility and relatively low development time, while keeping in mind the drawback of longer reading time, which we address later in our design.
Wireframing

Low Fidelity Iterations
Credibility Signals
Prompt Engineering
To increase accuracy, we implemented an overall system prompt with instructions and concurrently evaluated each signal, with each signal mapped to its own prompt. Below are examples of inputs and outputs for two signals: Title Representativeness and Calibration of Confidence. We experimented with both commercial LLMs, such as GPT-4o mini and o3-mini, and open-source LLMs, DeepSeek R1 and Qwen3.

Signal(
    name="Calibration of Confidence",
    definition=(
        "What to look for: Does the author use language that appropriately matches how certain they should be about their claims?\n"
        "High credibility: The author uses qualifying language like 'suggests,' 'appears to,' 'preliminary evidence indicates' for uncertain claims, "
        "and stronger language like 'demonstrates' or 'proves' only for well-established facts. The author's confidence in their claims is well justified.\n"
        "Low credibility: The author states uncertain things as absolute facts, or uses overly confident language ('definitely,' 'undoubtedly') for complex issues that experts still debate."
    )
),
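To illustrate how this signal-per-prompt setup might be wired together, here is a minimal sketch assuming the OpenAI Python SDK; the Signal dataclass, the SYSTEM_PROMPT wording, and the helper names evaluate_signal and evaluate_article are illustrative assumptions rather than CredBot's actual implementation.

# Minimal sketch: one prompt per credibility signal, evaluated concurrently.
# Names and prompt wording here are illustrative assumptions.
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from openai import OpenAI  # assumes the official OpenAI Python SDK

@dataclass
class Signal:
    name: str
    definition: str

SYSTEM_PROMPT = (
    "You are a credibility evaluator. For the given signal, rate the article "
    "as High, Medium, or Low credibility and explain your reasoning with "
    "specific evidence from the text."
)

client = OpenAI()

def evaluate_signal(signal: Signal, article_text: str, model: str = "gpt-4o-mini") -> str:
    """Send one signal definition plus the article to the LLM and return its evaluation."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Signal: {signal.name}\n{signal.definition}\n\nArticle:\n{article_text}"},
        ],
    )
    return response.choices[0].message.content

def evaluate_article(signals: list[Signal], article_text: str) -> dict[str, str]:
    """Evaluate all signals concurrently, one prompt per signal."""
    with ThreadPoolExecutor(max_workers=len(signals)) as pool:
        futures = {s.name: pool.submit(evaluate_signal, s, article_text) for s in signals}
    return {name: future.result() for name, future in futures.items()}

Keeping each signal in its own request keeps the prompts short and lets the evaluations run in parallel, mirroring the concurrent evaluation described above.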
Credibility Banner - Iterations
Highlighting

Chatbot
Data Preparation


LLM Evaluation & Results
We tested CredBot’s architecture with five LLM models: GPT-4o, GPT-4o mini, o3-mini, Qwen3, and DeepSeek-V3, chosen to span a range of cost, accessibility, and efficiency. We calculated the inter-rater reliability (IRR) score to assess each LLM’s agreement with the ground truth. The results in the table below show moderate agreement for GPT-4o mini and substantial agreement for GPT-4o, o3-mini, Qwen3, and DeepSeek-V3. These agreement rates validate that LLMs can correctly interpret credibility signals within the same analytical framework used by labelers, and highlight open-source models’ potential as capable alternatives to commercially available LLMs.
Model          Human Agreement   IRR
GPT-4o-mini    70.26%            0.55
GPT-4o         78.10%            0.67
o3-mini        81.70%            0.67
Qwen3          81.70%            0.68
DeepSeek-V3    82.68%            0.70
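For reference, the sketch below shows one way the agreement figures above could be computed, assuming the IRR metric is Cohen's kappa (where 0.41-0.60 is conventionally read as moderate agreement and 0.61-0.80 as substantial) via scikit-learn; the function name, variable names, and toy labels are hypothetical.

# Minimal sketch: percent agreement and Cohen's kappa between LLM labels and
# human ground-truth labels, assuming kappa is the IRR measure reported above.
from sklearn.metrics import cohen_kappa_score

def score_model(human_labels: list[str], llm_labels: list[str]) -> tuple[float, float]:
    """Return (percent agreement, Cohen's kappa) for one model against the ground truth."""
    agreement = sum(h == m for h, m in zip(human_labels, llm_labels)) / len(human_labels)
    kappa = cohen_kappa_score(human_labels, llm_labels)
    return agreement, kappa

# Toy example with hypothetical signal-level labels:
human = ["High", "Low", "High", "Medium", "Low"]
model = ["High", "Low", "Medium", "Medium", "Low"]
print(score_model(human, model))  # prints the raw agreement and kappa for this toy data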
Qualitative examination of the LLMs’ explanations reveals that the models are able to identify signals under the given definitions and provide systematic, evidence-based reasoning by referencing specific textual details. Below are two examples of signal evaluations from o3-mini and DeepSeek-V3, the two models with the highest IRR scores.
"Mason County Commission honors local track and field champions"
Clickbait Title - High:
The title clearly and informatively states the subject and event without vague or provocative language, aligning with the labeler’s evaluation of the article as factual local news.
"Insanity: ESPN Announcer Apologizes for Calling America 'Great' During WNBA Broadcast"
Calibration of Confidence - Low:
The article contains statements with overly confident language ("undoubtedly," "definitely") regarding complex issues that are still debated, without sufficient qualifying language for uncertain claims
However, a review of disagreements reveals several weaknesses. Below are examples that provide insight into the LLMs’ areas of weakness, consisting of summaries of the LLMs’ mistakes, the LLM explanations and labels, and the human explanations and labels.

Conclusion
Overall, CredBot demonstrates the potential of LLMs to perform article-level credibility assessments in a scalable and cost-effective manner. By automating evaluations typically performed by human experts, CredBot offers a practical alternative designed to assist and educate users rather than replace human judgment. CredBot achieves an IRR of up to 0.70 with our human evaluators, laying the groundwork for future research on AI-driven solutions for online credibility and illustrating the potential of LLMs to interpret complex credibility signals effectively.
Future Work