MultiModalFashion (MMF / DeepFashion-MultiModal) — Normalized Dataset for FEL

1. Dataset Overview

MMF is an image–text (caption) paired multimodal fashion dataset. FEL v1.6 normalizes MMF into a pair-centric schema in which ImageTextPairs is the hub relationship layer, while the Images and Texts entities are kept in separate tables for graph-friendly joins.

Unlike DF1/DF2 (image-evidence-heavy) or DF3 (pose_key 3D evidence), MMF’s core signal is caption semantics. The pipeline is designed so that extraction is possible from captions.json alone, reflecting a text-centric multimodal design.
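The caption-only extraction path can be sketched as follows. This is a minimal illustration, assuming captions.json maps a relative image path to its caption string; the actual file schema and FEL's uid scheme may differ.

```python
import hashlib
import json


def extract_pairs(captions_path):
    """Build image-text pair records from captions.json alone.

    Hypothetical sketch: assumes captions.json is a flat mapping of
    relative image path -> caption string.
    """
    with open(captions_path, encoding="utf-8") as f:
        captions = json.load(f)

    pairs = []
    for image_path, caption in captions.items():
        # Content-addressed uids for illustration only.
        image_uid = hashlib.sha1(image_path.encode()).hexdigest()[:16]
        text_uid = hashlib.sha1(caption.encode()).hexdigest()[:16]
        pairs.append({
            "pair_id": f"{image_uid}:{text_uid}",
            "image_uid": image_uid,
            "text_uid": text_uid,
        })
    return pairs
```

No image files are touched: the pair table, and both uid columns, are derivable from the annotation file alone.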

Key Characteristics


2. Folder and File Structure

(1) Original MMF Structure (Input)

MMF is annotation-driven; captions.json is the only hard requirement. Other assets are optional and linked if present.

mmf_root/
├── captions.json                (REQUIRED)
├── images/                      (OPTIONAL)
├── keypoints/                   (OPTIONAL)
└── masks|segm|parsing/          (OPTIONAL)
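The required/optional split above can be enforced with a small probe. A sketch, assuming the folder names shown in the tree; adjust for your actual layout:

```python
from pathlib import Path


def probe_mmf_root(root):
    """Validate an MMF root directory.

    captions.json is the only hard requirement; optional asset
    folders are reported if present so later stages can link them.
    """
    root = Path(root)
    captions = root / "captions.json"
    if not captions.is_file():
        raise FileNotFoundError(f"required file missing: {captions}")
    # Folder names follow the tree above (illustrative).
    return {name: (root / name).is_dir()
            for name in ("images", "keypoints", "masks", "segm", "parsing")}
```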

Key input files


(2) Normalized Output Structure

Core Tables

Optional Evidence

Management & Validation


3. Role in FEL

Node Construction

Relationship Construction

ImageTextPairs (pair_id) links: image_uid ↔ text_uid
ImageItem ─has_caption───▶ TextItem         (via image_text_pairs)
ImageItem ─has_keypoints─▶ KeypointEvidence (optional)
ImageItem ─has_mask──────▶ MaskEvidence     (optional)
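Deriving these typed edges from the normalized tables can be sketched like this. The index shapes (image_uid → evidence uid) are hypothetical, chosen for illustration:

```python
def build_edges(image_text_pairs, keypoints_index=None, masks_index=None):
    """Derive typed graph edges from the normalized tables.

    `image_text_pairs` is an iterable of dicts with image_uid and
    text_uid; the optional indexes map image_uid -> evidence uid
    (hypothetical shapes).
    """
    # Mandatory caption edges come straight from the hub table.
    edges = [("has_caption", p["image_uid"], p["text_uid"])
             for p in image_text_pairs]
    # Optional modality edges are added only if the index exists.
    for relation, index in (("has_keypoints", keypoints_index),
                            ("has_mask", masks_index)):
        if index:
            edges.extend((relation, img, ev) for img, ev in index.items())
    return edges
```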

Core Design Principles

Pair-centric hub
MMF is fundamentally a relationship dataset; the pair table must be an independent hub for many-to-many and multi-caption support.

PK stability
Path-based image_uid prevents collisions across folders (e.g., tops/0001.jpg vs pants/0001.jpg).
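A path-based uid can be sketched as hashing the full relative path rather than the bare filename, so same-named files in different folders stay distinct. The `img_` prefix and hash length are illustrative; FEL's actual uid scheme may differ:

```python
import hashlib
from pathlib import PurePosixPath


def image_uid_from_path(rel_path):
    """Derive a stable image_uid from the full relative path so that
    tops/0001.jpg and pants/0001.jpg never collide (sketch)."""
    norm = str(PurePosixPath(rel_path))  # normalize separators
    return "img_" + hashlib.sha1(norm.encode()).hexdigest()[:16]
```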

Typed registry
items.csv.gz records only emitted entities and enforces referential integrity for graph construction.
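The integrity check implied here can be sketched as a scan for dangling references, assuming registry rows carry `item_type` and `ref_uid` as described (row shapes are illustrative):

```python
def check_registry(items, images, texts):
    """Return registry rows whose ref_uid does not resolve to an
    emitted entity. `images` and `texts` are sets of emitted uids."""
    targets = {"image": images, "text": texts}
    return [row for row in items
            if row["ref_uid"] not in targets.get(row["item_type"], set())]
```

An empty result means the registry is referentially sound and safe to feed into graph construction.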


4. Extracted Benchmarks

MMF is not benchmark-split like DF1; it is extracted as a unified pair dataset with optional modality extensions.

Manifest snapshot (dataset scope):


5. Graph Structure Description

Central Node

ImageTextPairs — the pair-centric hub relationship table.

Entity Layer

ImageTextPairs ──(image_uid)──▶ Images
ImageTextPairs ──(text_uid)──▶ Texts
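Resolving the hub table against both entity tables is a pair of left joins; a stdlib sketch with illustrative record shapes:

```python
def resolve_pairs(pairs, images, texts):
    """Left-join the ImageTextPairs hub to the entity tables.

    `images` and `texts` map uid -> entity record (hypothetical
    shapes). Missing entities resolve to None rather than failing,
    mirroring a left join.
    """
    resolved = []
    for p in pairs:
        row = dict(p)
        row["image"] = images.get(p["image_uid"])
        row["text"] = texts.get(p["text_uid"])
        resolved.append(row)
    return resolved
```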

Typed Registry Layer

Items (item_type + ref_uid) abstracts Images and Texts:
ref_uid → image_uid (if item_type=image), ref_uid → text_uid (if item_type=text)

Optional Modality Layer

KeypointsRaw and MasksIndex attach to Images via image_uid.

Meta Layer

Manifest and QCSummary are run-level metadata (audit/validation), typically drawn as dashed conceptual edges.


MMF Normalized Graph (Interactive)


Final Summary

➡️ MMF serves as a core FEL input for text-centric multimodal graph learning and caption-based fashion understanding.