One-line intuition
What I am proposing is not a standard image → emotion pipeline. The model does not guess emotion from pixels; it reasons from structured, grounded evidence toward affect. More precisely, this experiment keeps the canonical ontology and evidence backbone fixed, and validates signal → aesthetic → affect on top of that governed substrate.
Traceability
Outputs remain traceable through image_uid → instance_uid → event_uid → core_term_uid → signal → A(x) → VAD → lexical.
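In code, one traceable output record might look like the following sketch. Only the UID chain names come from the design above; every value and every nested field name is a hypothetical placeholder.

```python
# Sketch of a single traceable inference record. The chain
# image_uid -> instance_uid -> event_uid -> core_term_uid -> signal
# -> aesthetic -> VAD -> lexical mirrors the design; values are invented.
trace = {
    "image_uid": "img_000123",
    "instance_uid": "inst_000456",
    "event_uid": "evt_000789",
    "core_term_uid": "core:aline_silhouette",
    "signal": {"name": "silhouette_clarity", "value": 0.82},
    "aesthetic": {"elegance": 0.71, "minimalism": 0.55},
    "vad": {"valence": 0.64, "arousal": 0.32, "dominance": 0.48},
    "lexical": "graceful",
}

def trace_is_complete(t):
    """An output counts as traceable only if every link in the chain is present."""
    required = ["image_uid", "instance_uid", "event_uid", "core_term_uid",
                "signal", "aesthetic", "vad", "lexical"]
    return all(t.get(k) not in (None, "") for k in required)
```

A record missing any link fails the check, which is exactly what the traceability requirement demands.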
Schema governance
The ontology and evidence backbone is treated as contract-locked, so experiments add inference on top of an already governed substrate instead of improvising reasoning on raw pixels.
Domain specificity
Fashion semantics are not generic image semantics. Silhouette, material, decoration, composition, and styling intent require domain-governed evidence.
Why this structure
Most existing systems start from the image and try to infer everything directly: image → features → emotion.
In our case, we are doing something fundamentally different: meaning is defined first, then grounded into images, and only then mapped to affect.
The reason is this:
- In fashion, meaning is not purely visual. It is structured through silhouette, material, composition, styling intent, and domain-specific attributes.
- If we go directly from pixels to emotion, we lose interpretability, governance, and control.
- If we define meaning first, then ground it into data, we can build a system that is both explainable and stable.
So in our design, the image is not the starting point. The image is the grounding layer of meaning.
Full pipeline (conceptual view)
The collaborator-facing experimental pipeline is best explained as a governed sequence that starts from canonical semantics and only later reaches images, signals, aesthetics, and affect.
Ontology defines meaning → Event instantiates meaning → Instance localizes meaning → Image grounds meaning → Inference operates on grounded evidence
Stage-by-stage explanation
Stage 0. Canonical Ontology Backbone
At the beginning, we define and lock the ontology-side single source of truth: a structured vocabulary of fashion semantics including garment types, core facets, material and silhouette attributes, decorations, parent terms, aliases, and ontology edges.
- Purpose: make sure the semantic units are closed before downstream inference begins.
- Why first: if a signal is supposed to be an evidence-derived feature anchored to canonical semantics, the meaning space must already be fixed.
- Design claim: FEL aims for a canonical convergence ontology rather than ad hoc, dataset-specific vocabularies.
- Cross-dataset value: different sources can converge into a shared representation through the same core_term backbone.
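The canonical-convergence idea can be sketched as an alias-resolution step over a locked core vocabulary. All identifiers below are illustrative, not the real FEL terms.

```python
# Hedged sketch of a contract-locked backbone: core terms with parent edges
# and facets, plus dataset-specific aliases that converge onto them.
CORE_TERMS = {
    "core:dress": {"parent": "core:garment", "facet": "garment_type"},
    "core:aline_silhouette": {"parent": "core:silhouette", "facet": "silhouette"},
    "core:chiffon": {"parent": "core:material", "facet": "material"},
}

ALIASES = {  # labels from different source datasets map to one core term
    "gown": "core:dress",
    "A-line": "core:aline_silhouette",
    "a_line": "core:aline_silhouette",
}

def resolve(label):
    """Map a raw dataset label to its canonical core term, or None if unmapped.

    Unmapped labels stay unmapped: the meaning space is closed, so new
    vocabulary requires a governed ontology change, not a silent addition.
    """
    if label in CORE_TERMS:
        return label
    return ALIASES.get(label)
```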
Stage 1. Evidence Construction
Here we connect the ontology to actual observational data: images, localized instances (bounding boxes / regions), and evidence events grounded in those observations.
- What is represented: what elements exist, where they exist, and how they are supported.
- QC logic: image-instance-event-core bridges must close cleanly.
- Conservative precision policy: exact instance-anchored evidence is preferred; ambiguous observations may remain at image level instead of being over-localized.
- Reason: emotion and aesthetics depend on structured presence, not just labels.
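The bridge-closing QC rule can be sketched as a pure-Python check. The record fields mirror the UID chain; using `instance_uid=None` for conservative image-level evidence is an assumption of this sketch.

```python
def bridges_close(image_uids, instances, events):
    """QC check that image-instance-event bridges close cleanly.

    Every event must reference a known image; if it is instance-anchored,
    the instance must exist and belong to that same image. An event with
    instance_uid=None models the conservative image-level fallback.
    """
    inst_to_img = {i["instance_uid"]: i["image_uid"] for i in instances}
    for e in events:
        if e["image_uid"] not in image_uids:
            return False
        iu = e.get("instance_uid")
        if iu is not None and inst_to_img.get(iu) != e["image_uid"]:
            return False
    return True
```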
Availability-aware fusion principle
\[ J(x) = \sum_k m_k(x)\cdot w_k \cdot e_k(x) \]
where \(m_k(x)\) is the availability mask for evidence channel \(k\), \(w_k\) its weight, and \(e_k(x)\) the evidence-derived feature value.
Stage 2. Exact-linked Experimental Subset
Before running full-scale experiments, we use a clean, strictly linked subset rather than the entire backbone all at once.
- Subset policy: only exact matches between ontology, events, instances, and images. No weak or ambiguous joins.
- Experimental role: this is a clean sandbox, not merely a smaller sample.
- Reason: we want falsifiable experiments in a controlled environment so we can distinguish better signals, better modeling, and noisy data.
- Defense logic: the subset is used because it is contract-locked and role-preserving, not because the full dataset is unavailable.
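The exact-join policy can be sketched as a filter over the same illustrative record fields used above: an event survives only if its image, core term, and instance links all resolve, with the instance belonging to the referenced image.

```python
def exact_linked_subset(events, instances, image_uids, core_term_uids):
    """Keep only events whose entire link chain resolves exactly.

    No weak or ambiguous joins: image-level events (no instance anchor)
    and events whose instance belongs to a different image are excluded
    from the experimental sandbox.
    """
    inst_to_img = {i["instance_uid"]: i["image_uid"] for i in instances}
    subset = []
    for e in events:
        if (e.get("image_uid") in image_uids
                and e.get("core_term_uid") in core_term_uids
                and inst_to_img.get(e.get("instance_uid")) == e.get("image_uid")):
            subset.append(e)
    return subset
```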
Subset framing
Exact-linked 30k subset for development and controlled reporting.
Illustrative counts
14,049 images · 30,000 instances · 678,548 events · 530,642 semantic events.
Reporting rule
Separate full-data statistics, subset statistics, and signal calculability instead of collapsing them into one headline number.
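The reporting rule can be made mechanical by keeping each scope as its own record. The counts are the illustrative subset figures quoted above; the rendering helper is hypothetical.

```python
# Each reporting scope (full data, exact-linked subset, signal-calculable
# slice) gets its own record; scopes are rendered separately, never merged
# into one headline number.
subset_report = {
    "images": 14_049,
    "instances": 30_000,
    "events": 678_548,
    "semantic_events": 530_642,
}

def report_line(scope, stats):
    """Render one scope's statistics, labeled with that scope only."""
    parts = ", ".join(f"{k}={v:,}" for k, v in stats.items())
    return f"[{scope}] {parts}"
```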
Stage 3. Signal Layer
At this stage, structured evidence is transformed into interpretable signals. These are treated as formative inputs, not hidden latent factors.
- Interpretability: each signal has explicit meaning.
- Controllability: signals can be inspected, adjusted, or ablated.
- Testability: redundancy and interactions can be studied directly.
- Validation requirement: monitor correlation, VIF, and condition index so formative inputs do not become estimation-instability traps.
Signal materialization idea
\[ s_i(x) = f_i\bigl(\text{ontology-grounded evidence},\ \text{instance support},\ \text{image support}\bigr) \]
Aesthetic input construction
\[ S(x) = [s_1(x), s_2(x), \ldots, s_{13}(x)] \]
Stage 4. Aesthetic Layer
Signals are mapped into a continuous aesthetic representation rather than a single beauty scalar. The system predicts a vector of aesthetic dimensions such as elegance, minimalism, boldness, and harmony.
- Core claim: aesthetic response is irreducibly multi-dimensional.
- Modeling choice: prototype-based representation plus shallow neural mapping.
- Why this hybrid: prototypes supply interpretable reference styles and sparse-tag priors; the neural branch supplies flexible correction.
- Judgment protocol: pairwise comparisons are emphasized because relative judgments are often more reliable than absolute ratings.
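The prototype-plus-neural blend and the pairwise emphasis can be sketched as follows. The prototype centroids, weights, margin, and α are illustrative assumptions, and the neural branch is reduced to a single linear unit for the sketch.

```python
from math import sqrt

# Hypothetical prototype centroids in signal space (interpretable
# reference styles); not fitted FEL parameters.
PROTOTYPES = {
    "elegant_minimal": [0.8, 0.9, 0.1],
    "bold_maximal":    [0.2, 0.1, 0.9],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u)) or 1.0
    nv = sqrt(sum(b * b for b in v)) or 1.0
    return dot / (nu * nv)

def aesthetic_score(S, alpha=0.6, w=(0.5, 0.3, 0.2), bias=0.0):
    """A(x) = alpha * A_proto(x) + (1 - alpha) * A_mlp(x).

    A_proto: best cosine similarity to a prototype centroid.
    A_mlp: flexible correction, reduced here to one linear unit.
    """
    a_proto = max(cosine(S, c) for c in PROTOTYPES.values())
    a_mlp = sum(wi * si for wi, si in zip(w, S)) + bias
    return alpha * a_proto + (1 - alpha) * a_mlp

def pairwise_loss(score_preferred, score_other, margin=0.1):
    """Hinge loss on a pairwise judgment: the preferred item should score
    higher by at least the margin, matching the relative-judgment protocol."""
    return max(0.0, margin - (score_preferred - score_other))
```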
Prototype branch
\[ A_{\text{proto}}(x) = \mathrm{similarity}\bigl(S(x),\ \text{prototype centroids}\bigr) \]
Neural branch
\[ A_{\text{mlp}}(x) = \mathrm{MLP}\bigl(S(x)\bigr) \]
Final aesthetic score
\[ A(x) = \alpha \cdot A_{\text{proto}}(x) + (1-\alpha)\cdot A_{\text{mlp}}(x) \]
Training objective
\[ L = \lambda_{\text{rank}}\cdot L_{\text{pairwise}} + \lambda_{\text{abs}}\cdot L_{\text{absolute}} \]
Stage 5. Affect Layer (VAD space)
The aesthetic vector is then mapped into a continuous affective state. The recommended headline representation is VAD: valence, arousal, and dominance. Optional extensions such as an additional domain-specific axis or Hourglass-style auxiliary targets can remain secondary.
- Why VAD first: continuous affect space is more stable and generalizable than direct lexical labels.
- Reviewer-safe version: for conference framing, VAD (optionally extended with one clearly marked domain-specific axis) is the most robust headline; any extra axes should be explicitly labeled as auxiliary.
- Appraisal note: a separate appraisal layer is intentionally not foregrounded because it overlaps with VAD while offering weaker supervision.
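A minimal sketch of the affect mapping g as a clipped linear map from aesthetic axes to VAD. The axis names and the weight matrix are illustrative assumptions, not a fitted model.

```python
# Hypothetical contribution of each aesthetic axis to valence, arousal,
# and dominance; rows follow the AXES order below.
AXES = ["elegance", "minimalism", "boldness", "harmony"]
W = {
    "valence":   [0.5, 0.2, 0.1, 0.4],
    "arousal":   [-0.1, -0.3, 0.8, -0.2],
    "dominance": [0.3, -0.1, 0.6, 0.1],
}

def to_vad(A):
    """z_vad(x) = g(A(x)): linear map, clipped to [-1, 1] per dimension.

    A is the aesthetic vector in AXES order.
    """
    z = {}
    for dim, row in W.items():
        val = sum(w * a for w, a in zip(row, A))
        z[dim] = max(-1.0, min(1.0, val))
    return z
```

A learned g could replace the fixed matrix without changing the interface, which is what keeps the affect layer ablatable.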
Affect mapping
\[ z_{\text{vad}}(x) = g\bigl(A(x)\bigr) \]
Typical output space
\[ z_{\text{vad}}(x) = \bigl[ \mathrm{valence}(x),\ \mathrm{arousal}(x),\ \mathrm{dominance}(x) \bigr] \]
Optional extension
\[ z_{\text{affect}}(x) = [V, A, D, N?] \]
Stage 6. Lexical Projection (Explanation Layer)
Finally, affect states may be projected into words. Crucially, words are not predicted directly from images. They are projected from structured affect and aesthetic representations.
- Governance rule: use whitelist-based lexical projection rather than unconstrained direct lexical classification.
- Stability policy: terms can be governed as stable, extended, or disputed rather than treated as equally reliable labels.
- Interpretation benefit: if lexical output conflicts with VAD, the issue can be isolated as a projection-layer problem instead of corrupting the entire inference chain.
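The whitelist projection can be sketched with the VAD-compatibility half of the score (the aesthetic-profile term is omitted here for brevity). The word centroids are invented placeholders; in practice they could come from published VAD norms such as Warriner et al. (2013) or the NRC VAD Lexicon. The stable/extended/disputed statuses follow the stability policy above.

```python
from math import dist  # Euclidean distance, Python 3.8+

# Governed whitelist: word -> ((valence, arousal, dominance), status).
# Centroid values are illustrative placeholders, not real norms.
WHITELIST = {
    "graceful": ((0.8, 0.3, 0.5), "stable"),
    "dramatic": ((0.5, 0.8, 0.7), "stable"),
    "austere":  ((0.1, 0.2, 0.4), "extended"),
}

def project_lexical(z_vad, allowed_status=("stable", "extended")):
    """argmax over the governed whitelist of a VAD-compatibility score,
    taken here as negative Euclidean distance to the word centroid.
    Disputed terms are excluded by default rather than treated as
    equally reliable labels."""
    best_word, best_score = None, float("-inf")
    for word, (centroid, status) in WHITELIST.items():
        if status not in allowed_status:
            continue
        score = -dist(centroid, z_vad)
        if score > best_score:
            best_word, best_score = word, score
    return best_word
```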
Lexical projection idea
\[ \mathrm{score}(\mathrm{word}\mid x) = \mathrm{compatibility}\bigl(\text{word centroid in VAD / affect space},\ z_{\text{vad}}(x)\bigr) + \mathrm{compatibility}\bigl(\text{word aesthetic profile},\ A(x)\bigr) \]
Selection policy
\[ \mathrm{lexical}(x) = \arg\max_{\mathrm{word}\in\mathrm{governed\ whitelist}} \mathrm{score}(\mathrm{word}\mid x) \]
Proposal: How we can integrate LLM reasoning
LLMs can be incorporated naturally, but only as a supporting layer. They should not replace the ontology and evidence substrate. FEL’s primary asset is that meaning is already structured before language enters the loop.
LLM role boundary
LLMs should not become the semantic constructor. They should operate as reasoning-data generators or explanation engines on top of already grounded evidence.
Why not image-first CoT
An image-score reasoning corpus alone would underuse FEL’s stronger inputs: ontology terms, evidence events, instance anchors, exact image linkage, and signal summaries.
Safety rationale
When predictions are fixed before the LLM explains them, hallucinations cannot directly alter the predictive core.
FEL-CoT
A better prompt substrate is ontology-grounded evidence summary + signal summary + image context, not raw image alone.
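Assembling that prompt substrate can be sketched as a plain template over the three governed inputs; every field name below is illustrative.

```python
def build_fel_cot_prompt(evidence_terms, signal_summary, image_context):
    """Build the FEL-CoT prompt substrate: ontology-grounded evidence
    summary + signal summary + image context, never the raw image alone.
    The LLM explains on top of this; it does not construct the semantics."""
    lines = [
        "You are reasoning over grounded fashion evidence, not raw pixels.",
        "Ontology-grounded evidence: " + "; ".join(evidence_terms),
        "Signal summary: " + ", ".join(
            f"{k}={v:.2f}" for k, v in signal_summary.items()),
        "Image context: " + image_context,
        "Task: explain the aesthetic judgment this evidence supports.",
    ]
    return "\n".join(lines)
```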
Proposal A. Reasoning generation between Signal and Aesthetic
At this point we already have structured evidence, semantic signals, and grounded image context. That makes the LLM input richer and better governed than raw image prompting.
- Generate explanations of aesthetic judgments.
- Generate candidate rationales before BWS annotation.
- Create exemplar descriptions for aesthetic tags.
- Produce textual critiques for centroid candidates.
- Build structured reasoning datasets without letting the LLM redefine the semantic backbone.
Proposal B. Explanation layer after Affect
This is the safest insertion point. The core prediction has already been made, so the LLM only rewrites the trace into human-readable explanation.
- Generate reviewer-facing rationale.
- Generate critique-aware explanation.
- Generate case-study narratives from the full trace chain.
- Keep prediction and explanation separable for higher falsifiability.
Recommended execution order
The safest execution order is to preserve the current scientific spine and add LLM functionality only where it improves reasoning artifacts without destabilizing the governed backbone.
Recommended sequence
- Keep the current FEL base pipeline unchanged: Ontology → Evidence → Subset → Signal → Aesthetic → VAD → Lexical.
- Add Stage 3.5 FEL-CoT reasoning corpus generation between Signal and Aesthetic.
- Add post-Lexical LLM explanation generation for reviewer and report-facing output.
- Only after substantial empirical validation, consider a later-stage reasoning-policy experiment between Aesthetic and Affect.
Final intuition
We are not building a model that guesses emotion from images. We are building a system that derives emotion from structured, grounded semantic evidence, and optionally explains that reasoning with language.
LLMs are not replacing the system. They are helping us articulate and leverage the reasoning already present in the structured pipeline.
Reference list
| Short name | Full citation |
|---|---|
| FashionBERT | Gao D. et al. (2020). FashionBERT: Text and Image Matching with Adaptive Loss for Cross-modal Retrieval. SIGIR 2020. DOI:10.1145/3397271.3401430 |
| FashionViL | Han X. et al. (2022). FashionViL: Fashion-Focused Vision-and-Language Representation Learning. ECCV 2022. arXiv:2207.08150 |
| Distributional Emotion Embeddings | Liapis C.M. et al. (2025). Enhancing sentiment analysis with distributional emotion embeddings. Neurocomputing 634: 129822. DOI:10.1016/j.neucom.2025.129822 |
| AESTHEMOS | Schindler I. et al. (2017). Measuring aesthetic emotions: A review of the literature and a new assessment tool. PLOS ONE 12(6): e0178899. DOI:10.1371/journal.pone.0178899 |
| Aesthetic Emotion Lexicon | Beermann U. et al. (2021). Dimensions and Clusters of Aesthetic Emotions: A Semantic Profile Analysis. Frontiers in Psychology 12: 667173. |
| Circumplex Model of Affect | Russell J.A. (1980). A circumplex model of affect. Journal of Personality and Social Psychology 39(6): 1161–1178. DOI:10.1037/h0077714 |
| Warriner VAD Norms | Warriner A.B., Kuperman V., Brysbaert M. (2013). Norms of valence, arousal, and dominance for 13,915 English lemmas. Behavior Research Methods 45(4): 1191–1207. DOI:10.3758/s13428-012-0314-x |
| NRC VAD Lexicon | Mohammad S. (2018). Obtaining Reliable Human Ratings of Valence, Arousal, and Dominance for 20,000 English Words. ACL 2018, pp. 174–184. |
| BWS Reliability | Kiritchenko S., Mohammad S. (2017). Best-Worst Scaling More Reliable than Rating Scales. ACL 2017, pp. 465–470. |
| Hourglass of Emotions | Cambria E., Livingstone A., Hussain A. (2012). The Hourglass of Emotions. Cognitive Behavioural Systems, LNCS 7403, pp. 144–157. |
| Formative Measurement | Jarvis C.B., MacKenzie S.B., Podsakoff P.M. (2003). A Critical Review of Construct Indicators and Measurement Model Misspecification. Journal of Consumer Research 30(2): 199–218. DOI:10.1086/376806 |
| Uncertainty Weighting | Kendall A., Gal Y., Cipolla R. (2018). Multi-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics. CVPR 2018, pp. 7482–7491. |
| Vanessa (MABSA) | Xiao L., Mao R., Zhang X., He L., Cambria E. (2024). Vanessa: Visual Connotation and Aesthetic Attributes Understanding Network for MABSA. Findings of EMNLP 2024, pp. 11486–11500. |
| AesRec | AesRec: A Dataset for Aesthetics-Aligned Clothing Outfit Recommendation. arXiv:2602.03416 (2025/2026). |