
Benchmarking Large Language Models for Knowledge Graph Validation

2026 · benchmark creation · evaluative dataset

Aaltodoc (Aalto University)

https://doi.org/10.48786/edbt.2026.45
OpenAlex: W7138873612
URLs Found: 7
Internal Citations: 0
Authors: 0
Abstract Quality: usable
GPT-5.5 Abstract Analysis

Problems Identified (2)

KG fact validation at scale: Verifying factual accuracy in knowledge graphs is essential but challenging, because expert manual verification is impractical at large scale and automated methods are not ready for real-world KGs.

Unexplored LLM suitability for KG validation: The suitability and effectiveness of LLMs for knowledge graph fact validation remain largely unexplored.

Proposed Solutions (3)

FactCheck benchmark: The paper introduces FactCheck, a benchmark for evaluating LLMs on KG fact validation across three settings: internal knowledge, RAG-based external evidence, and multi-model consensus.

RAG dataset for KG validation: FactCheck includes a retrieval-augmented generation dataset with more than two million documents tailored to KG fact validation.

Interactive verification analysis platform: The paper provides an interactive platform for analyzing KG fact verification decisions.
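The multi-model consensus setting can be illustrated with a minimal majority-vote sketch. This is not the paper's implementation; the triple, the model verdicts, and the tie-handling policy below are all hypothetical, chosen only to show the idea of aggregating independent per-model judgments on one KG fact.

```python
from collections import Counter

def consensus_verdict(verdicts):
    """Majority vote over per-model True/False verdicts for one KG fact.

    A tie yields None ("no consensus") rather than a forced guess.
    """
    counts = Counter(verdicts)
    if counts[True] > counts[False]:
        return True
    if counts[False] > counts[True]:
        return False
    return None  # tie: no consensus

# Hypothetical triple and verdicts from three independently queried models.
triple = ("Aalto_University", "locatedIn", "Finland")
verdicts = [True, True, False]
print(consensus_verdict(verdicts))  # → True
```

In a real pipeline each verdict would come from prompting a separate LLM about the triple; the aggregation step itself is this simple, which is why (per the results below) consensus can only help when the individual verdicts are better than chance and their errors are uncorrelated.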

Results (3)

LLMs are not reliable for real-world KG validation.

RAG yields inconsistent gains for KG validation.

Consensus does not consistently outperform single models.

Research Domain

Knowledge graph validation and large language model evaluation
