One-line intuition
What I am proposing is not a standard image → emotion pipeline. The model does not guess emotion from pixels; it reasons from structured, grounded evidence toward affect. More precisely, this experiment keeps the canonical ontology and evidence backbone fixed, and validates signal → aesthetic → affect on top of that governed substrate.
Traceability
Outputs remain traceable through image_uid → instance_uid → event_uid → core_term_uid → signal → A(x) → VAD → lexical.
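In code, one traceable output record might look like the following sketch. Only the UID chain names come from the design above; every value and every nested field name is a hypothetical placeholder.

```python
# Sketch of a single traceable inference record. The chain
# image_uid -> instance_uid -> event_uid -> core_term_uid -> signal
# -> aesthetic -> VAD -> lexical mirrors the design; values are invented.
trace = {
    "image_uid": "img_000123",
    "instance_uid": "inst_000456",
    "event_uid": "evt_000789",
    "core_term_uid": "core:aline_silhouette",
    "signal": {"name": "silhouette_clarity", "value": 0.82},
    "aesthetic": {"elegance": 0.71, "minimalism": 0.55},
    "vad": {"valence": 0.64, "arousal": 0.32, "dominance": 0.48},
    "lexical": "graceful",
}

def trace_is_complete(t):
    """An output counts as traceable only if every link in the chain is present."""
    required = ["image_uid", "instance_uid", "event_uid", "core_term_uid",
                "signal", "aesthetic", "vad", "lexical"]
    return all(t.get(k) not in (None, "") for k in required)
```

A record missing any link fails the check, which is exactly what the traceability requirement demands.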
Schema governance
The ontology and evidence backbone is treated as contract-locked, so experiments add inference on top of an already governed substrate instead of improvising reasoning on raw pixels.
Domain specificity
Fashion semantics are not generic image semantics. Silhouette, material, decoration, composition, and styling intent require domain-governed evidence.
Why this structure
Most existing systems start from the image and try to infer everything directly: image → features → emotion.
In our case, we are doing something fundamentally different: meaning is defined first, then grounded into images, and only then mapped to affect.
The reason is this:
- In fashion, meaning is not purely visual. It is structured through silhouette, material, composition, styling intent, and domain-specific attributes.
- If we go directly from pixels to emotion, we lose interpretability, governance, and control.
- If we define meaning first, then ground it into data, we can build a system that is both explainable and stable.
So in our design, the image is not the starting point. The image is the grounding layer of meaning.
Full pipeline (conceptual view)
The collaborator-facing experimental pipeline is best explained as a governed sequence that starts from canonical semantics and only later reaches images, signals, aesthetics, and affect.
Ontology defines meaning → Event instantiates meaning → Instance localizes meaning → Image grounds meaning → Inference operates on grounded evidence
Stage-by-stage explanation
Stage 0. Canonical Ontology Backbone
At the beginning, we define and lock the ontology-side single source of truth: a structured vocabulary of fashion semantics including garment types, core facets, material and silhouette attributes, decorations, parent terms, aliases, and ontology edges.
- Purpose: make sure the semantic units are closed before downstream inference begins.
- Why first: if a signal is supposed to be an evidence-derived feature anchored to canonical semantics, the meaning space must already be fixed.
- Design claim: FEL aims for a canonical convergence ontology rather than ad hoc, dataset-specific vocabularies.
- Cross-dataset value: different sources can converge into a shared representation through the same core_term backbone.
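The canonical-convergence idea can be sketched as an alias-resolution step over a locked core vocabulary. All identifiers below are illustrative, not the real FEL terms.

```python
# Hedged sketch of a contract-locked backbone: core terms with parent edges
# and facets, plus dataset-specific aliases that converge onto them.
CORE_TERMS = {
    "core:dress": {"parent": "core:garment", "facet": "garment_type"},
    "core:aline_silhouette": {"parent": "core:silhouette", "facet": "silhouette"},
    "core:chiffon": {"parent": "core:material", "facet": "material"},
}

ALIASES = {  # labels from different source datasets map to one core term
    "gown": "core:dress",
    "A-line": "core:aline_silhouette",
    "a_line": "core:aline_silhouette",
}

def resolve(label):
    """Map a raw dataset label to its canonical core term, or None if unmapped.

    Unmapped labels stay unmapped: the meaning space is closed, so new
    vocabulary requires a governed ontology change, not a silent addition.
    """
    if label in CORE_TERMS:
        return label
    return ALIASES.get(label)
```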
Stage 1. Evidence Construction
Here we connect the ontology to actual observational data: images, localized instances (bounding boxes / regions), and evidence events grounded in those observations.
- What is represented: what elements exist, where they exist, and how they are supported.
- QC logic: image-instance-event-core bridges must close cleanly.
- Conservative precision policy: exact instance-anchored evidence is preferred; ambiguous observations may remain at image level instead of being over-localized.
- Reason: emotion and aesthetics depend on structured presence, not just labels.
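The bridge-closing QC rule can be sketched as a pure-Python check. The record fields mirror the UID chain; using `instance_uid=None` for conservative image-level evidence is an assumption of this sketch.

```python
def bridges_close(image_uids, instances, events):
    """QC check that image-instance-event bridges close cleanly.

    Every event must reference a known image; if it is instance-anchored,
    the instance must exist and belong to that same image. An event with
    instance_uid=None models the conservative image-level fallback.
    """
    inst_to_img = {i["instance_uid"]: i["image_uid"] for i in instances}
    for e in events:
        if e["image_uid"] not in image_uids:
            return False
        iu = e.get("instance_uid")
        if iu is not None and inst_to_img.get(iu) != e["image_uid"]:
            return False
    return True
```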
Availability-aware fusion principle
\[ J(x) = \sum_k m_k(x)\cdot w_k \cdot e_k(x) \]
where \(m_k(x)\) is the availability mask for evidence channel \(k\), \(w_k\) its weight, and \(e_k(x)\) the evidence-derived feature value.
Stage 2. Exact-linked Experimental Subset
Before running full-scale experiments, we use a clean, strictly linked subset rather than the entire backbone all at once.
- Subset policy: only exact matches between ontology, events, instances, and images. No weak or ambiguous joins.
- Experimental role: this is a clean sandbox, not merely a smaller sample.
- Reason: we want falsifiable experiments in a controlled environment so we can distinguish better signals, better modeling, and noisy data.
- Defense logic: the subset is used because it is contract-locked and role-preserving, not because the full dataset is unavailable.
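The exact-join policy can be sketched as a filter over the same illustrative record fields used above: an event survives only if its image, core term, and instance links all resolve, with the instance belonging to the referenced image.

```python
def exact_linked_subset(events, instances, image_uids, core_term_uids):
    """Keep only events whose entire link chain resolves exactly.

    No weak or ambiguous joins: image-level events (no instance anchor)
    and events whose instance belongs to a different image are excluded
    from the experimental sandbox.
    """
    inst_to_img = {i["instance_uid"]: i["image_uid"] for i in instances}
    subset = []
    for e in events:
        if (e.get("image_uid") in image_uids
                and e.get("core_term_uid") in core_term_uids
                and inst_to_img.get(e.get("instance_uid")) == e.get("image_uid")):
            subset.append(e)
    return subset
```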
Subset framing
Exact-linked 30k subset for development and controlled reporting.
Illustrative counts
14,049 images · 30,000 instances · 678,548 events · 530,642 semantic events.
Reporting rule
Separate full-data statistics, subset statistics, and signal calculability instead of collapsing them into one headline number.
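The reporting rule can be made mechanical by keeping each scope as its own record. The counts are the illustrative subset figures quoted above; the rendering helper is hypothetical.

```python
# Each reporting scope (full data, exact-linked subset, signal-calculable
# slice) gets its own record; scopes are rendered separately, never merged
# into one headline number.
subset_report = {
    "images": 14_049,
    "instances": 30_000,
    "events": 678_548,
    "semantic_events": 530_642,
}

def report_line(scope, stats):
    """Render one scope's statistics, labeled with that scope only."""
    parts = ", ".join(f"{k}={v:,}" for k, v in stats.items())
    return f"[{scope}] {parts}"
```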
Stage 3. Signal Layer
At this stage, structured evidence is transformed into interpretable signals. These are treated as formative inputs, not hidden latent factors.
- Interpretability: each signal has explicit meaning.
- Controllability: signals can be inspected, adjusted, or ablated.
- Testability: redundancy and interactions can be studied directly.
- Validation requirement: monitor correlation, VIF, and condition index so formative inputs do not become estimation-instability traps.
Signal materialization idea
\[ s_i(x) = f_i\bigl(\text{ontology-grounded evidence},\ \text{instance support},\ \text{image support}\bigr) \]
Aesthetic input construction
\[ S(x) = [s_1(x), s_2(x), \ldots, s_{13}(x)] \]
Stage 4. Aesthetic Layer
Signals are mapped into a continuous aesthetic representation rather than a single beauty scalar. The system predicts a vector of aesthetic dimensions such as elegance, minimalism, boldness, and harmony.
- Core claim: aesthetic response is irreducibly multi-dimensional.
- Modeling choice: prototype-based representation plus shallow neural mapping.
- Why this hybrid: prototypes supply interpretable reference styles and sparse-tag priors; the neural branch supplies flexible correction.
- Judgment protocol: pairwise comparisons are emphasized because relative judgments are often more reliable than absolute ratings.
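The prototype-plus-neural blend and the pairwise emphasis can be sketched as follows. The prototype centroids, weights, margin, and α are illustrative assumptions, and the neural branch is reduced to a single linear unit for the sketch.

```python
from math import sqrt

# Hypothetical prototype centroids in signal space (interpretable
# reference styles); not fitted FEL parameters.
PROTOTYPES = {
    "elegant_minimal": [0.8, 0.9, 0.1],
    "bold_maximal":    [0.2, 0.1, 0.9],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u)) or 1.0
    nv = sqrt(sum(b * b for b in v)) or 1.0
    return dot / (nu * nv)

def aesthetic_score(S, alpha=0.6, w=(0.5, 0.3, 0.2), bias=0.0):
    """A(x) = alpha * A_proto(x) + (1 - alpha) * A_mlp(x).

    A_proto: best cosine similarity to a prototype centroid.
    A_mlp: flexible correction, reduced here to one linear unit.
    """
    a_proto = max(cosine(S, c) for c in PROTOTYPES.values())
    a_mlp = sum(wi * si for wi, si in zip(w, S)) + bias
    return alpha * a_proto + (1 - alpha) * a_mlp

def pairwise_loss(score_preferred, score_other, margin=0.1):
    """Hinge loss on a pairwise judgment: the preferred item should score
    higher by at least the margin, matching the relative-judgment protocol."""
    return max(0.0, margin - (score_preferred - score_other))
```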
Prototype branch
\[ A_{\text{proto}}(x) = \mathrm{similarity}\bigl(S(x),\ \text{prototype centroids}\bigr) \]
Neural branch
\[ A_{\text{mlp}}(x) = \mathrm{MLP}\bigl(S(x)\bigr) \]
Final aesthetic score
\[ A(x) = \alpha \cdot A_{\text{proto}}(x) + (1-\alpha)\cdot A_{\text{mlp}}(x) \]
Training objective
\[ L = \lambda_{\text{rank}}\cdot L_{\text{pairwise}} + \lambda_{\text{abs}}\cdot L_{\text{absolute}} \]
Stage 5. Affect Layer (VAD space)
The aesthetic vector is then mapped into a continuous affective state. The recommended headline representation is VAD: valence, arousal, and dominance. Optional extensions such as an additional domain-specific axis or Hourglass-style auxiliary targets can remain secondary.
- Why VAD first: continuous affect space is more stable and generalizable than direct lexical labels.
- Reviewer-safe version: for conference framing, VAD (optionally extended with one clearly marked domain-specific axis) is the most robust headline; any extra axes should be explicitly labeled as auxiliary.
- Appraisal note: a separate appraisal layer is intentionally not foregrounded because it overlaps with VAD while offering weaker supervision.
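A minimal sketch of the affect mapping g as a clipped linear map from aesthetic axes to VAD. The axis names and the weight matrix are illustrative assumptions, not a fitted model.

```python
# Hypothetical contribution of each aesthetic axis to valence, arousal,
# and dominance; rows follow the AXES order below.
AXES = ["elegance", "minimalism", "boldness", "harmony"]
W = {
    "valence":   [0.5, 0.2, 0.1, 0.4],
    "arousal":   [-0.1, -0.3, 0.8, -0.2],
    "dominance": [0.3, -0.1, 0.6, 0.1],
}

def to_vad(A):
    """z_vad(x) = g(A(x)): linear map, clipped to [-1, 1] per dimension.

    A is the aesthetic vector in AXES order.
    """
    z = {}
    for dim, row in W.items():
        val = sum(w * a for w, a in zip(row, A))
        z[dim] = max(-1.0, min(1.0, val))
    return z
```

A learned g could replace the fixed matrix without changing the interface, which is what keeps the affect layer ablatable.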
Affect mapping
\[ z_{\text{vad}}(x) = g\bigl(A(x)\bigr) \]
Typical output space
\[ z_{\text{vad}}(x) = \bigl[ \mathrm{valence}(x),\ \mathrm{arousal}(x),\ \mathrm{dominance}(x) \bigr] \]
Optional extension
\[ z_{\text{affect}}(x) = [V, A, D, N?] \]
Stage 6. Lexical Projection (Explanation Layer)
Finally, affect states may be projected into words. Crucially, words are not predicted directly from images. They are projected from structured affect and aesthetic representations.
- Governance rule: use whitelist-based lexical projection rather than unconstrained direct lexical classification.
- Stability policy: terms can be governed as stable, extended, or disputed rather than treated as equally reliable labels.
- Interpretation benefit: if lexical output conflicts with VAD, the issue can be isolated as a projection-layer problem instead of corrupting the entire inference chain.
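The whitelist projection can be sketched with the VAD-compatibility half of the score (the aesthetic-profile term is omitted here for brevity). The word centroids are invented placeholders; in practice they could come from published VAD norms such as Warriner et al. (2013) or the NRC VAD Lexicon. The stable/extended/disputed statuses follow the stability policy above.

```python
from math import dist  # Euclidean distance, Python 3.8+

# Governed whitelist: word -> ((valence, arousal, dominance), status).
# Centroid values are illustrative placeholders, not real norms.
WHITELIST = {
    "graceful": ((0.8, 0.3, 0.5), "stable"),
    "dramatic": ((0.5, 0.8, 0.7), "stable"),
    "austere":  ((0.1, 0.2, 0.4), "extended"),
}

def project_lexical(z_vad, allowed_status=("stable", "extended")):
    """argmax over the governed whitelist of a VAD-compatibility score,
    taken here as negative Euclidean distance to the word centroid.
    Disputed terms are excluded by default rather than treated as
    equally reliable labels."""
    best_word, best_score = None, float("-inf")
    for word, (centroid, status) in WHITELIST.items():
        if status not in allowed_status:
            continue
        score = -dist(centroid, z_vad)
        if score > best_score:
            best_word, best_score = word, score
    return best_word
```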
Lexical projection idea
\[ \mathrm{score}(\mathrm{word}\mid x) = \mathrm{compatibility}\bigl(\text{word centroid in VAD / affect space},\ z_{\text{vad}}(x)\bigr) + \mathrm{compatibility}\bigl(\text{word aesthetic profile},\ A(x)\bigr) \]
Selection policy
\[ \mathrm{lexical}(x) = \arg\max_{\mathrm{word}\in\mathrm{governed\ whitelist}} \mathrm{score}(\mathrm{word}\mid x) \]
Proposal: How we can integrate LLM reasoning
LLMs can be incorporated naturally, but only as a supporting layer. They should not replace the ontology and evidence substrate. FEL’s primary asset is that meaning is already structured before language enters the loop.
LLM role boundary
LLMs should not become the semantic constructor. They should operate as reasoning-data generators or explanation engines on top of already grounded evidence.
Why not image-first CoT
An image-score reasoning corpus alone would underuse FEL’s stronger inputs: ontology terms, evidence events, instance anchors, exact image linkage, and signal summaries.
Safety rationale
When predictions are fixed before the LLM explains them, hallucinations cannot directly alter the predictive core.
FEL-CoT
A better prompt substrate is ontology-grounded evidence summary + signal summary + image context, not raw image alone.
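Assembling that prompt substrate can be sketched as a plain template over the three governed inputs; every field name below is illustrative.

```python
def build_fel_cot_prompt(evidence_terms, signal_summary, image_context):
    """Build the FEL-CoT prompt substrate: ontology-grounded evidence
    summary + signal summary + image context, never the raw image alone.
    The LLM explains on top of this; it does not construct the semantics."""
    lines = [
        "You are reasoning over grounded fashion evidence, not raw pixels.",
        "Ontology-grounded evidence: " + "; ".join(evidence_terms),
        "Signal summary: " + ", ".join(
            f"{k}={v:.2f}" for k, v in signal_summary.items()),
        "Image context: " + image_context,
        "Task: explain the aesthetic judgment this evidence supports.",
    ]
    return "\n".join(lines)
```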
Proposal A. Reasoning generation between Signal and Aesthetic
At this point we already have structured evidence, semantic signals, and grounded image context. That makes the LLM input richer and better governed than raw image prompting.
- Generate explanations of aesthetic judgments.
- Generate candidate rationales before BWS annotation.
- Create exemplar descriptions for aesthetic tags.
- Produce textual critiques for centroid candidates.
- Build structured reasoning datasets without letting the LLM redefine the semantic backbone.
Proposal B. Explanation layer after Affect
This is the safest insertion point. The core prediction has already been made, so the LLM only rewrites the trace into human-readable explanation.
- Generate reviewer-facing rationale.
- Generate critique-aware explanation.
- Generate case-study narratives from the full trace chain.
- Keep prediction and explanation separable for higher falsifiability.
Recommended execution order
The safest execution order is to preserve the current scientific spine and add LLM functionality only where it improves reasoning artifacts without destabilizing the governed backbone.
Recommended sequence
- Keep the current FEL base pipeline unchanged: Ontology → Evidence → Subset → Signal → Aesthetic → VAD → Lexical.
- Add Stage 3.5 FEL-CoT reasoning corpus generation between Signal and Aesthetic.
- Add post-Lexical LLM explanation generation for reviewer and report-facing output.
- Only after substantial empirical validation, consider a later-stage reasoning-policy experiment between Aesthetic and Affect.
Final intuition
We are not building a model that guesses emotion from images. We are building a system that derives emotion from structured, grounded semantic evidence, and optionally explains that reasoning with language.
LLMs are not replacing the system. They are helping us articulate and leverage the reasoning already present in the structured pipeline.
Reference list
| Short name | Full citation |
|---|---|
| FashionBERT | Gao D. et al. (2020). FashionBERT: Text and Image Matching with Adaptive Loss for Cross-modal Retrieval. SIGIR 2020. DOI:10.1145/3397271.3401430 |
| FashionViL | Han X. et al. (2022). FashionViL: Fashion-Focused Vision-and-Language Representation Learning. ECCV 2022. arXiv:2207.08150 |
| Distributional Emotion Embeddings | Liapis C.M. et al. (2025). Enhancing sentiment analysis with distributional emotion embeddings. Neurocomputing 634: 129822. DOI:10.1016/j.neucom.2025.129822 |
| AESTHEMOS | Schindler I. et al. (2017). Measuring aesthetic emotions: A review of the literature and a new assessment tool. PLOS ONE 12(6): e0178899. DOI:10.1371/journal.pone.0178899 |
| Aesthetic Emotion Lexicon | Beermann U. et al. (2021). Dimensions and Clusters of Aesthetic Emotions: A Semantic Profile Analysis. Frontiers in Psychology 12: 667173. |
| Circumplex Model of Affect | Russell J.A. (1980). A circumplex model of affect. Journal of Personality and Social Psychology 39(6): 1161–1178. DOI:10.1037/h0077714 |
| Warriner VAD Norms | Warriner A.B., Kuperman V., Brysbaert M. (2013). Norms of valence, arousal, and dominance for 13,915 English lemmas. Behavior Research Methods 45(4): 1191–1207. DOI:10.3758/s13428-012-0314-x |
| NRC VAD Lexicon | Mohammad S. (2018). Obtaining Reliable Human Ratings of Valence, Arousal, and Dominance for 20,000 English Words. ACL 2018, pp. 174–184. |
| BWS Reliability | Kiritchenko S., Mohammad S. (2017). Best-Worst Scaling More Reliable than Rating Scales. ACL 2017, pp. 465–470. |
| Hourglass of Emotions | Cambria E., Livingstone A., Hussain A. (2012). The Hourglass of Emotions. Cognitive Behavioural Systems, LNCS 7403, pp. 144–157. |
| Formative Measurement | Jarvis C.B., MacKenzie S.B., Podsakoff P.M. (2003). A Critical Review of Construct Indicators and Measurement Model Misspecification. Journal of Consumer Research 30(2): 199–218. DOI:10.1086/376806 |
| Uncertainty Weighting | Kendall A., Gal Y., Cipolla R. (2018). Multi-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics. CVPR 2018, pp. 7482–7491. |
| Vanessa (MABSA) | Xiao L., Mao R., Zhang X., He L., Cambria E. (2024). Vanessa: Visual Connotation and Aesthetic Attributes Understanding Network for MABSA. Findings of EMNLP 2024, pp. 11486–11500. |
| AesRec | AesRec: A Dataset for Aesthetics-Aligned Clothing Outfit Recommendation. arXiv:2602.03416 (2025/2026). |