MMF is an image–text (caption) paired multimodal fashion dataset. FEL v1.6 normalizes MMF into a pair-centric schema where ImageTextPairs is the hub relationship layer and the Image/Text entities are kept separate for graph-friendly joins.
Unlike DF1/DF2 (image evidence-heavy) or DF3 (pose_key 3D evidence), MMF’s core signal is caption semantics.
The pipeline is designed so extraction is possible from captions.json alone, reflecting a text-centric multimodal design.
Primary outputs:
- image_text_pairs.csv.gz (pair_id, image_uid, text_uid)
- images.csv.gz, texts.csv.gz
- items.csv.gz (emitted entity registry ensuring referential integrity)
- keypoints_raw.jsonl.gz, masks_index.csv.gz

UID rule: image_uid = MMF:img/<encoded_relpath_without_ext>
QC: hard_fail = False, image_uid collisions = 0

MMF is annotation-driven; captions.json is the only hard requirement. Other assets are optional and linked if present.
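Once the tables are loaded, the hub join is a plain key lookup. A minimal sketch with hypothetical in-memory rows standing in for the gzipped CSVs (the uid values and `join_pairs` helper are illustrative, not part of the pipeline):

```python
# Hypothetical in-memory rows standing in for the three gzipped CSVs (uids illustrative).
pairs = [{"pair_id": "MMF:pair/0", "image_uid": "MMF:img/tops/0001", "text_uid": "MMF:txt/0"}]
images = {"MMF:img/tops/0001": {"rel_path": "images/tops/0001.jpg"}}
texts = {"MMF:txt/0": {"caption": "a red cotton top"}}

def join_pairs(pairs, images, texts):
    """Attach Image and Text entity attributes to each hub row by key (left-join semantics)."""
    joined = []
    for p in pairs:
        row = dict(p)
        row.update(images.get(p["image_uid"], {}))
        row.update(texts.get(p["text_uid"], {}))
        joined.append(row)
    return joined

joined = join_pairs(pairs, images, texts)
```

Because the hub carries only keys, the same join works unchanged when an image has several captions or a caption is reused across images.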
mmf_root/
├── captions.json (REQUIRED)
├── images/ (OPTIONAL)
├── keypoints/ (OPTIONAL)
└── masks|segm|parsing/ (OPTIONAL)
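The layout above can be validated with a small discovery step. `discover_inputs` and the mask-folder search order are assumptions; the hard-fail rule on captions.json is from the spec:

```python
from pathlib import Path

# Search order for the auto-discovered mask folder (assumed; the spec lists masks|segm|parsing).
MASK_DIR_CANDIDATES = ("masks", "segm", "parsing")

def discover_inputs(mmf_root: str) -> dict:
    """Hard-fail only on captions.json; all other assets are optional and linked if present."""
    root = Path(mmf_root)
    captions = root / "captions.json"
    if not captions.is_file():
        raise FileNotFoundError(f"hard fail: required annotation source missing: {captions}")
    found = {"captions": captions}
    for optional in ("images", "keypoints"):
        if (root / optional).is_dir():
            found[optional] = root / optional
    for name in MASK_DIR_CANDIDATES:
        if (root / name).is_dir():
            found["masks"] = root / name
            break
    return found
```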
Inputs:
- captions.json → image ↔ caption annotation source (hard-fail if missing)
- images/ → image pixels (optional; paths linked only)
- keypoints/keypoints_loc.txt, keypoints/keypoints_vis.txt → optional keypoint evidence
- masks|segm|parsing/ → optional mask/segmentation assets (auto-discovered)

Outputs:
- images.csv.gz → Image entities (PK: image_uid)
- texts.csv.gz → Text entities (PK: text_uid; 1 row = 1 caption)
- image_text_pairs.csv.gz → hub relations (PK: pair_id)
- items.csv.gz → typed registry (image/text) ensuring referential integrity
- keypoints_raw.jsonl.gz → raw keypoint evidence (JSONL)
- masks_index.csv.gz → mask path index (no pixel decoding)
- manifest.csv / manifest.jsonl
- qc_summary.json

Join keys:
- image_uid (from images.csv.gz)
- text_uid (from texts.csv.gz)
ImageTextPairs (pair_id) links: image_uid ↔ text_uid
ImageItem ─has_caption──▶ TextItem (via image_text_pairs)
ImageItem ─has_keypoints▶ KeypointEvidence (optional)
ImageItem ─has_mask────▶ MaskEvidence (optional)
Pair-centric hub
MMF is fundamentally a relationship dataset; the pair table is an independent hub so that many-to-many links and multiple captions per image are representable.
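A one-image, two-caption case illustrates why the hub must be independent: each caption becomes its own pair row, and the one-to-many view is recovered by grouping the hub (uids illustrative):

```python
from collections import defaultdict

# One image, two captions: two independent hub rows (uids illustrative).
pairs = [
    {"pair_id": "MMF:pair/0", "image_uid": "MMF:img/tops/0001", "text_uid": "MMF:txt/0"},
    {"pair_id": "MMF:pair/1", "image_uid": "MMF:img/tops/0001", "text_uid": "MMF:txt/1"},
]

# Recover the one-to-many view by grouping the hub; entity rows are never duplicated.
captions_per_image = defaultdict(list)
for p in pairs:
    captions_per_image[p["image_uid"]].append(p["text_uid"])
```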
PK stability
Path-based image_uid prevents collisions across folders (e.g., tops/0001.jpg vs pants/0001.jpg).
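The UID rule image_uid = MMF:img/<encoded_relpath_without_ext> can be sketched as below; the real encoding is pipeline-specific, so `make_image_uid` is a hypothetical helper that simply drops the extension and keeps the folder prefix:

```python
from pathlib import PurePosixPath

def make_image_uid(relpath: str) -> str:
    """Path-based UID per image_uid = MMF:img/<encoded_relpath_without_ext>.

    The folder prefix is kept so tops/0001.jpg and pants/0001.jpg cannot collide.
    The real encoding is pipeline-specific; here we just drop the extension.
    """
    return "MMF:img/" + str(PurePosixPath(relpath).with_suffix(""))
```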
Typed registry
items.csv.gz records only emitted entities and enforces referential integrity for graph construction.
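The registry guarantee can be sketched as a resolution check; `check_referential_integrity` is a hypothetical helper, with items rows reduced to the typed-registry columns described above:

```python
def check_referential_integrity(items, image_uids, text_uids):
    """Return items rows whose ref_uid does not resolve to an emitted entity."""
    lookup = {"image": set(image_uids), "text": set(text_uids)}
    return [it for it in items if it["ref_uid"] not in lookup[it["item_type"]]]

# Items rows reduced to the typed-registry columns (item_type, ref_uid).
items = [
    {"item_type": "image", "ref_uid": "MMF:img/tops/0001"},
    {"item_type": "text", "ref_uid": "MMF:txt/0"},
]
```

An empty result means every registry row points at an emitted entity, which is the precondition for safe graph construction.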
MMF is not benchmark-split like DF1; it is extracted as a unified pair dataset with optional modality extensions.
Manifest snapshot (dataset scope):
- manifest_total_files = 100,529
- total_size_gb = 15.089894
- ext top: png 56,427 / jpg 44,096 / txt 5 / json 1

ImageTextPairs — the pair-centric hub relationship table.
ImageTextPairs ──(image_uid)──▶ Images
ImageTextPairs ──(text_uid)──▶ Texts
Items (item_type + ref_uid) abstracts Images and Texts:
ref_uid → image_uid (if item_type=image), ref_uid → text_uid (if item_type=text)
KeypointsRaw and MasksIndex attach to Images via image_uid.
Manifest and QCSummary are run-level metadata (audit/validation), typically drawn as dashed conceptual edges.
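The edge structure described above can be assembled from the hub plus the optional evidence tables. A sketch with hypothetical evidence-node ids (`build_edges` and the `#kp`/`#mask` suffixes are illustrative, not part of the schema):

```python
def build_edges(pairs, keypoint_image_uids, mask_image_uids):
    """Assemble typed edge triples: hub pairs plus optional per-image evidence links."""
    edges = [("has_caption", p["image_uid"], p["text_uid"]) for p in pairs]
    # Evidence nodes get derived ids; the "#kp"/"#mask" suffixes are illustrative only.
    edges += [("has_keypoints", uid, uid + "#kp") for uid in keypoint_image_uids]
    edges += [("has_mask", uid, uid + "#mask") for uid in mask_image_uids]
    return edges
```

Because the optional lists may be empty, a captions-only extraction still yields a valid (caption-edges-only) graph.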
Click below to open the interactive graph in a new window:
Open MMF Graph Interactive Editor

➡️ Core FEL input for text-centric multimodal graph learning and caption-based fashion understanding.