Confidential — Stefan Michaelcheck Only

Are a Thousand Words Better Than a Single Picture? Beyond Images -- A Framework for Multi-Modal Knowledge Graph Dataset Enrichment

2026 | resource release | incremental | combination

Pengyu Zhang, Paul Groth, Jie Liu, Klim Zaporojets, Jia-Hong Huang

arXiv (Cornell University)

https://doi.org/10.48550/arxiv.2603.16974
OpenAlex: W7138941854
arXiv: 2603.16974

URLs Found: 1
Internal Citations: 0
Authors: 5
Abstract Quality: usable
GPT-5.5 Abstract Analysis

Problems Identified (3)

MMKG image curation difficulty: Visual information for multi-modal knowledge graphs (MMKGs) is difficult to collect and curate at large scale.

Ambiguous visual exclusion: MMKG image collections often exclude relevant but ambiguous visuals, such as logos, symbols, and abstract scenes.

Ambiguous visual semantics noise: Ambiguous images must contribute usable semantics, rather than noise, to MMKG models.

Proposed Solutions (5)

Beyond Images enrichment pipeline: Beyond Images is an automatic data-centric enrichment pipeline for MMKG datasets with optional human auditing.

Large-scale entity image retrieval: The pipeline retrieves additional entity-related images at large scale.

Visual-to-text conversion: The pipeline converts visual inputs into textual descriptions so that ambiguous images provide usable semantic information.

LLM description fusion: The pipeline uses an LLM to fuse multi-source descriptions into concise entity-aligned summaries.

Architecture-preserving summary augmentation: The generated summaries replace or augment the text modality in standard MMKG models without changing architectures or loss functions.
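The stages listed above can be sketched as a small, model-agnostic pipeline. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Entity`, `enrich_entity`, `enrich_dataset`, `Retriever`, `Captioner`, `Fuser`, and the `toy_*` helpers are all hypothetical. The image search service, captioning model, and LLM are passed in as placeholder callables because the summary does not name them. The `mode` parameter mirrors the option to replace or augment the text modality.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical stage interfaces: the concrete retriever, captioner, and LLM
# used by Beyond Images are not specified here, so they are injected as callables.
Retriever = Callable[[str], List[str]]   # entity name -> image identifiers
Captioner = Callable[[str], str]         # image identifier -> textual description
Fuser = Callable[[str, List[str]], str]  # entity name + descriptions -> summary


@dataclass
class Entity:
    name: str
    text: str  # the entity's existing text modality (e.g., a description)


def enrich_entity(entity: Entity, retrieve: Retriever, caption: Captioner,
                  fuse: Fuser, mode: str = "augment") -> str:
    """Build the new text modality for one entity.

    mode="replace" keeps only the fused summary; mode="augment" appends the
    summary to the existing text. The downstream MMKG model is not touched.
    """
    images = retrieve(entity.name)                   # 1. retrieve extra entity images
    descriptions = [caption(img) for img in images]  # 2. convert each visual to text
    summary = fuse(entity.name, descriptions)        # 3. fuse into one entity-aligned summary
    if mode == "replace":
        return summary
    if mode == "augment":
        return f"{entity.text} {summary}".strip()
    raise ValueError(f"unknown mode: {mode!r}")


def enrich_dataset(entities: List[Entity], retrieve: Retriever, caption: Captioner,
                   fuse: Fuser, mode: str = "augment") -> Dict[str, str]:
    """Map each entity name to its enriched text modality."""
    return {e.name: enrich_entity(e, retrieve, caption, fuse, mode) for e in entities}


# Toy stand-ins (hypothetical) for an image search service, a captioning model, and an LLM.
def toy_retrieve(name: str) -> List[str]:
    return [f"{name}_logo.png", f"{name}_headquarters.jpg"]


def toy_caption(image_id: str) -> str:
    stem = image_id.rsplit(".", 1)[0]  # drop the file extension
    return "an image showing " + stem.replace("_", " ")


def toy_fuse(name: str, descriptions: List[str]) -> str:
    return f"{name}: " + "; ".join(descriptions) + "."


if __name__ == "__main__":
    kg = [Entity("Acme", "Acme is a company.")]
    print(enrich_dataset(kg, toy_retrieve, toy_caption, toy_fuse, mode="augment"))
```

Since the output is simply one string per entity, it can be passed to an MMKG model's existing text encoder without changing the architecture or loss. The optional human-auditing step is not modeled in this sketch.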

Results (3)

Consistent MMKG completion gains

Large ambiguous-visual subset gains

Audit interface release

Research Domain

Multi-modal knowledge graph dataset enrichment and completion
