Are a Thousand Words Better Than a Single Picture? Beyond Images -- A Framework for Multi-Modal Knowledge Graph Dataset Enrichment
Pengyu Zhang, Paul Groth, Jie Liu, Klim Zaporojets, Jia-Hong Huang
arXiv (Cornell University)
Problems Identified (3)
MMKG image curation difficulty: Collecting and curating visual information at large scale for multi-modal knowledge graphs (MMKGs) is difficult.
Ambiguous visual exclusion: MMKG image collections often exclude relevant but ambiguous visuals such as logos, symbols, and abstract scenes.
Ambiguous visual semantics noise: When ambiguous images are included, they must contribute usable semantics to MMKG models rather than noise.
Proposed Solutions (5)
Beyond Images enrichment pipeline: Beyond Images is an automatic data-centric enrichment pipeline for MMKG datasets with optional human auditing.
Large-scale entity image retrieval: The pipeline retrieves additional entity-related images at large scale.
Visual-to-text conversion: The pipeline converts visual inputs into textual descriptions so ambiguous images provide usable semantic information.
LLM description fusion: The pipeline uses an LLM to fuse multi-source descriptions into concise entity-aligned summaries.
Architecture-preserving summary augmentation: The generated summaries replace or augment the text modality in standard MMKG models without changing architectures or loss functions.
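The four pipeline steps listed above (retrieval, visual-to-text conversion, LLM fusion, and replacing or augmenting the text modality) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the Entity class, enrich_entity, the mode parameter, and the three callables are hypothetical names introduced here, and the stand-in functions at the bottom only mimic an image search, a captioning model, and an LLM. The sketch also assumes that both existing and newly retrieved images are described, which the summary does not state.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Entity:
    """Hypothetical container for one MMKG entity."""
    name: str
    text: str = ""  # existing text modality, may be empty
    image_refs: List[str] = field(default_factory=list)


def enrich_entity(
    entity: Entity,
    retrieve_images: Callable[[str], List[str]],
    describe_image: Callable[[str], str],
    fuse_descriptions: Callable[[str, List[str]], str],
    mode: str = "augment",
) -> Entity:
    """Return a copy of `entity` whose text modality carries a fused summary."""
    # Step 1: retrieve additional entity-related images, skipping known ones.
    new_refs = [r for r in retrieve_images(entity.name) if r not in entity.image_refs]
    all_refs = entity.image_refs + new_refs

    # Step 2: convert every image into a textual description
    # (assumption: existing images are described as well as new ones).
    descriptions = [describe_image(ref) for ref in all_refs]

    # Step 3: fuse the per-image descriptions into one entity-aligned summary.
    summary = fuse_descriptions(entity.name, descriptions)

    # Step 4: replace or augment the text modality; the downstream model,
    # its architecture, and its loss function are left untouched.
    if mode == "replace":
        text = summary
    elif mode == "augment":
        text = f"{entity.text} {summary}".strip()
    else:
        raise ValueError(f"unknown mode: {mode!r}")

    return Entity(name=entity.name, text=text, image_refs=all_refs)


# Stand-ins for an image search, a captioning model, and an LLM (all hypothetical).
def fake_retrieve(name: str) -> List[str]:
    return [f"{name}_logo.png"]


def fake_describe(ref: str) -> str:
    return f"description of {ref}"


def fake_fuse(name: str, descriptions: List[str]) -> str:
    return f"{name}: " + "; ".join(descriptions)


if __name__ == "__main__":
    acme = Entity(name="Acme", text="A company.", image_refs=["Acme_photo.jpg"])
    print(enrich_entity(acme, fake_retrieve, fake_describe, fake_fuse).text)
```

Passing the three stages in as callables keeps the sketch independent of any particular retrieval service, captioning model, or LLM, and an optional human audit step could be placed between steps 3 and 4 without changing the rest.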
Results (3)
Consistent MMKG completion gains:
Large ambiguous-visual subset gains:
Audit interface release:
Research Domain
Multi-modal knowledge graph dataset enrichment and completion