MMF is an image–text (caption) paired multimodal fashion dataset. FEL v1.6 normalizes MMF into a pair-centric schema where ImageTextPairs is the hub relationship layer and the Image/Text entities are kept separate for graph-friendly joins.
Unlike DF1/DF2 (image evidence-heavy) or DF3 (pose_key 3D evidence), MMF’s core signal is caption semantics.
The pipeline is designed so extraction is possible from captions.json alone, reflecting a text-centric multimodal design.
Primary outputs:
- image_text_pairs.csv.gz (pair_id, image_uid, text_uid)
- images.csv.gz, texts.csv.gz
- items.csv.gz (emitted entity registry ensuring referential integrity)
- keypoints_raw.jsonl.gz, masks_index.csv.gz

UID rule: image_uid = MMF:img/<encoded_relpath_without_ext>
QC: hard_fail = False, image_uid collisions = 0

MMF is annotation-driven; captions.json is the only hard requirement. Other assets are optional and linked if present.
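Once the tables are loaded, the hub join is a plain key lookup. A minimal sketch with hypothetical in-memory rows standing in for the gzipped CSVs (the uid values and `join_pairs` helper are illustrative, not part of the pipeline):

```python
# Hypothetical in-memory rows standing in for the three gzipped CSVs (uids illustrative).
pairs = [{"pair_id": "MMF:pair/0", "image_uid": "MMF:img/tops/0001", "text_uid": "MMF:txt/0"}]
images = {"MMF:img/tops/0001": {"rel_path": "images/tops/0001.jpg"}}
texts = {"MMF:txt/0": {"caption": "a red cotton top"}}

def join_pairs(pairs, images, texts):
    """Attach Image and Text entity attributes to each hub row by key (left-join semantics)."""
    joined = []
    for p in pairs:
        row = dict(p)
        row.update(images.get(p["image_uid"], {}))
        row.update(texts.get(p["text_uid"], {}))
        joined.append(row)
    return joined

joined = join_pairs(pairs, images, texts)
```

Because the hub carries only keys, the same join works unchanged when an image has several captions or a caption is reused across images.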
mmf_root/
├── captions.json (REQUIRED)
├── images/ (OPTIONAL)
├── keypoints/ (OPTIONAL)
└── masks|segm|parsing/ (OPTIONAL)
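The layout above can be validated with a small discovery step. `discover_inputs` and the mask-folder search order are assumptions; the hard-fail rule on captions.json is from the spec:

```python
from pathlib import Path

# Search order for the auto-discovered mask folder (assumed; the spec lists masks|segm|parsing).
MASK_DIR_CANDIDATES = ("masks", "segm", "parsing")

def discover_inputs(mmf_root: str) -> dict:
    """Hard-fail only on captions.json; all other assets are optional and linked if present."""
    root = Path(mmf_root)
    captions = root / "captions.json"
    if not captions.is_file():
        raise FileNotFoundError(f"hard fail: required annotation source missing: {captions}")
    found = {"captions": captions}
    for optional in ("images", "keypoints"):
        if (root / optional).is_dir():
            found[optional] = root / optional
    for name in MASK_DIR_CANDIDATES:
        if (root / name).is_dir():
            found["masks"] = root / name
            break
    return found
```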
Inputs:
- captions.json → image ↔ caption annotation source (hard-fail if missing)
- images/ → image pixels (optional; paths linked only)
- keypoints/keypoints_loc.txt, keypoints/keypoints_vis.txt → optional keypoint evidence
- masks|segm|parsing/ → optional mask/segmentation assets (auto-discovered)

Outputs:
- images.csv.gz → Image entities (PK: image_uid)
- texts.csv.gz → Text entities (PK: text_uid; 1 row = 1 caption)
- image_text_pairs.csv.gz → hub relations (PK: pair_id)
- items.csv.gz → typed registry (image/text) ensuring referential integrity
- keypoints_raw.jsonl.gz → raw keypoint evidence (JSONL)
- masks_index.csv.gz → mask path index (no pixel decoding)
- manifest.csv / manifest.jsonl
- qc_summary.json

Join keys:
- image_uid (from images.csv.gz)
- text_uid (from texts.csv.gz)
ImageTextPairs (pair_id) links: image_uid ↔ text_uid
ImageItem ─has_caption──▶ TextItem (via image_text_pairs)
ImageItem ─has_keypoints▶ KeypointEvidence (optional)
ImageItem ─has_mask────▶ MaskEvidence (optional)
Pair-centric hub
MMF is fundamentally a relationship dataset; the pair table is an independent hub so that many-to-many links and multiple captions per image are representable.
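A one-image, two-caption case illustrates why the hub must be independent: each caption becomes its own pair row, and the one-to-many view is recovered by grouping the hub (uids illustrative):

```python
from collections import defaultdict

# One image, two captions: two independent hub rows (uids illustrative).
pairs = [
    {"pair_id": "MMF:pair/0", "image_uid": "MMF:img/tops/0001", "text_uid": "MMF:txt/0"},
    {"pair_id": "MMF:pair/1", "image_uid": "MMF:img/tops/0001", "text_uid": "MMF:txt/1"},
]

# Recover the one-to-many view by grouping the hub; entity rows are never duplicated.
captions_per_image = defaultdict(list)
for p in pairs:
    captions_per_image[p["image_uid"]].append(p["text_uid"])
```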
PK stability
Path-based image_uid prevents collisions across folders (e.g., tops/0001.jpg vs pants/0001.jpg).
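The UID rule image_uid = MMF:img/<encoded_relpath_without_ext> can be sketched as below; the real encoding is pipeline-specific, so `make_image_uid` is a hypothetical helper that simply drops the extension and keeps the folder prefix:

```python
from pathlib import PurePosixPath

def make_image_uid(relpath: str) -> str:
    """Path-based UID per image_uid = MMF:img/<encoded_relpath_without_ext>.

    The folder prefix is kept so tops/0001.jpg and pants/0001.jpg cannot collide.
    The real encoding is pipeline-specific; here we just drop the extension.
    """
    return "MMF:img/" + str(PurePosixPath(relpath).with_suffix(""))
```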
Typed registry
items.csv.gz records only emitted entities and enforces referential integrity for graph construction.
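The registry guarantee can be sketched as a resolution check; `check_referential_integrity` is a hypothetical helper, with items rows reduced to the typed-registry columns described above:

```python
def check_referential_integrity(items, image_uids, text_uids):
    """Return items rows whose ref_uid does not resolve to an emitted entity."""
    lookup = {"image": set(image_uids), "text": set(text_uids)}
    return [it for it in items if it["ref_uid"] not in lookup[it["item_type"]]]

# Items rows reduced to the typed-registry columns (item_type, ref_uid).
items = [
    {"item_type": "image", "ref_uid": "MMF:img/tops/0001"},
    {"item_type": "text", "ref_uid": "MMF:txt/0"},
]
```

An empty result means every registry row points at an emitted entity, which is the precondition for safe graph construction.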
MMF is not benchmark-split like DF1; it is extracted as a unified pair dataset with optional modality extensions.
Manifest snapshot (dataset scope):
- manifest_total_files = 100,529
- total_size_gb = 15.089894
- ext top: png 56,427 / jpg 44,096 / txt 5 / json 1

ImageTextPairs — the pair-centric hub relationship table.
ImageTextPairs ──(image_uid)──▶ Images
ImageTextPairs ──(text_uid)──▶ Texts
Items (item_type + ref_uid) abstracts Images and Texts:
ref_uid → image_uid (if item_type=image), ref_uid → text_uid (if item_type=text)
KeypointsRaw and MasksIndex attach to Images via image_uid.
Manifest and QCSummary are run-level metadata (audit/validation), typically drawn as dashed conceptual edges.
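The edge structure described above can be assembled from the hub plus the optional evidence tables. A sketch with hypothetical evidence-node ids (`build_edges` and the `#kp`/`#mask` suffixes are illustrative, not part of the schema):

```python
def build_edges(pairs, keypoint_image_uids, mask_image_uids):
    """Assemble typed edge triples: hub pairs plus optional per-image evidence links."""
    edges = [("has_caption", p["image_uid"], p["text_uid"]) for p in pairs]
    # Evidence nodes get derived ids; the "#kp"/"#mask" suffixes are illustrative only.
    edges += [("has_keypoints", uid, uid + "#kp") for uid in keypoint_image_uids]
    edges += [("has_mask", uid, uid + "#mask") for uid in mask_image_uids]
    return edges
```

Because the optional lists may be empty, a captions-only extraction still yields a valid (caption-edges-only) graph.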
Click below to open the interactive graph in a new window:
Open MMF Graph Interactive Editor

➡️ Core FEL input for text-centric multimodal graph learning and caption-based fashion understanding.