Improving Fashion Item Encoding and Retrieval

Research Project by Dr. Christian Bracher

Unlike a brick-and-mortar department store, online retailers like Zalando must use digital representations of their offerings to entice customers. Traditionally, this takes the form of information stored in a catalog database, which collects categorical properties (‘tags’), numerical descriptors (e.g., prices), and images of the items. Although this information is curated by ‘experts’ in-house, assignments are often subjective (what makes a dress ‘dark red,’ or ‘leisure’?), and article imagery is highly stylized. It is also by no means complete. More importantly, there is no straightforward relation between these article attributes and the properties individual customers actually care about:
Does it fit?
How does it feel to wear?
Does it suit my style?

Fashion DNA

To make the properties of a fashion item more accessible, we collect the disparate information in the catalog and map it into an abstract mathematical space, the ‘fashion space.’ There, each item is represented by a vector, its ‘Fashion DNA.’ Unlike the catalog data, Fashion DNA enables a meaningful geometric notion of similarity between two articles, simply encoded by their distance in fashion space.
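As a toy illustration of this idea (the vector dimensionality and the random stand-in vectors below are invented for the example, not actual Fashion DNA), nearest neighbors in fashion space can be found simply by ranking distances between article vectors:

```python
import numpy as np

# Stand-in Fashion DNA matrix: one row per article, one column per latent dimension.
# Real Fashion DNA vectors are learned by a neural network; these are random placeholders.
rng = np.random.default_rng(0)
fashion_dna = rng.normal(size=(1000, 128))   # 1000 articles, 128-dimensional DNA

def most_similar(query_index, dna, k=5):
    """Return indices of the k articles closest to the query article in fashion space."""
    distances = np.linalg.norm(dna - dna[query_index], axis=1)
    order = np.argsort(distances)
    return order[1:k + 1]                    # skip the query article itself

print(most_similar(42, fashion_dna))
```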

[Figure: t-SNE map]

Such a linear embedding provides access to a powerful arsenal of inference methods in the fashion universe. Generally, Fashion DNA encodings are extracted as hidden activations in a feedforward deep neural network (DNN) that is trained on a relevant target, e.g., prediction of brands and silhouettes from catalog images, or prediction of customer sales from attributes and images. The resulting Fashion DNA is correspondingly optimized for different tasks. For instance, the latter setup naturally leads to an enhanced collaborative filtering recommender system for fashion.
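The sketch below shows one way such an extraction could look in PyTorch; the layer sizes, the 2048-dimensional input features, and the brand-prediction target are illustrative assumptions, not the actual production architecture. The activation of the narrow hidden layer is read out as the article's Fashion DNA.

```python
import torch
import torch.nn as nn

class FashionDNANet(nn.Module):
    """Feedforward network trained on a target (here: brand prediction) whose
    hidden activation serves as the article's Fashion DNA."""
    def __init__(self, n_features=2048, dna_dim=128, n_brands=500):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_features, 512), nn.ReLU(),
            nn.Linear(512, dna_dim), nn.ReLU(),    # hidden 'Fashion DNA' layer
        )
        self.head = nn.Linear(dna_dim, n_brands)   # training target: brand logits

    def forward(self, x):
        dna = self.encoder(x)
        return self.head(dna), dna                 # return logits and the embedding

model = FashionDNANet()
features = torch.randn(4, 2048)                    # stand-in catalog features for 4 articles
logits, dna = model(features)
print(dna.shape)                                   # torch.Size([4, 128])
```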

In addition to the choice of target data, we may select which input data to feed in, and we have wide latitude in choosing the neural network architecture that transforms the data. Hence, we can experiment with Fashion DNA models tailored to specific use cases. A few prospective applications follow:

Example Applications

Catalog Debugging and Stratifying

As already mentioned, tagging fashion items is an error-prone and subjective task. One can use a deep learning approach to flag mistakes and streamline the process of generating new entries. The idea is to train a neural network on the images for an item, with the goal of predicting the catalog tags (like brand, color, silhouette, etc.) that have been used to label it. Inclusion of a narrow ‘bottleneck’ in the network forces the system to learn general rules that present a ‘uniform standard’ of article curation.
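A minimal sketch of such a network follows; the convolutional backbone, bottleneck width, and tag vocabularies are assumptions made for the example. An image is compressed into a narrow bottleneck, from which separate heads predict catalog tags such as brand, color, and silhouette:

```python
import torch
import torch.nn as nn

class TagPredictor(nn.Module):
    """Image encoder with a narrow bottleneck feeding several tag-prediction heads."""
    def __init__(self, bottleneck=64, n_brands=500, n_colors=30, n_silhouettes=40):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, bottleneck), nn.ReLU(),       # narrow bottleneck
        )
        self.brand = nn.Linear(bottleneck, n_brands)
        self.color = nn.Linear(bottleneck, n_colors)
        self.silhouette = nn.Linear(bottleneck, n_silhouettes)

    def forward(self, image):
        z = self.backbone(image)
        return self.brand(z), self.color(z), self.silhouette(z)

model = TagPredictor()
image = torch.randn(1, 3, 224, 224)                     # stand-in article photo
brand_logits, color_logits, silhouette_logits = model(image)
```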

In this setup, feeding item images into the trained network results in a vector of probabilities for each tag, providing estimates for target groups, brand membership, color, pattern, material, etc. For articles already in the catalog, anomaly detection can be used to correct wrong tags and enforce consistent labeling of ‘weak,’ subjective categories like colors and patterns. The scheme also suggests tags for new additions to the catalog, based on imagery of the item.
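In code, such a consistency check could look like the snippet below; the probability threshold and the three-color vocabulary are purely illustrative. If the trained network assigns low probability to the tag recorded in the catalog, the article is flagged for review:

```python
import torch

# Stand-in color scores produced by the trained tag predictor for one article.
color_logits = torch.tensor([[2.5, 0.1, -1.0]])   # e.g. scores for red / blue / green
color_probs = torch.softmax(color_logits, dim=1)  # convert scores to probabilities

catalog_color = 2                                  # index of the color stored in the catalog
if color_probs[0, catalog_color] < 0.05:           # assumed review threshold
    print("Possible mis-tag: the catalog color is unlikely given the image.")
```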

Information Synthesis and Inference

Fashion information is often embedded in an implicit context that is easy for a human observer with everyday experience to disentangle, but presents a difficult challenge to machine learning algorithms. For instance, images may depict a model wearing a variety of fashion articles in front of a structured background, or detailed views of an item (seams, tags, brand logos, etc.) that are of particular interest to customers. It doesn’t take much effort for a human to focus on a specific element of the model outfit (the segmentation problem), or to imagine the three-dimensional shape of an article from one or a few two-dimensional snapshots.

The ease with which humans go about these tasks suggests that we are able to retrieve and apply previously acquired ‘related’ knowledge. Conventional neural network architectures, having no dedicated memory, are ill-equipped for this purpose. Memory-augmented neural networks that incorporate elements of attention and knowledge transfer are the subject of active research. The simplest and best-known examples are recurrent neural networks with built-in memory cells (LSTMs), but more sophisticated approaches with external memory promise enhanced performance.
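As a minimal sketch of the recurrent idea (the feature and hidden sizes are assumptions), an LSTM can aggregate a sequence of per-view image features into a single memory-backed representation of an article seen from several angles:

```python
import torch
import torch.nn as nn

# An LSTM consumes one feature vector per view and carries information forward
# through its memory cell; the final hidden state summarizes the whole sequence.
lstm = nn.LSTM(input_size=128, hidden_size=64, batch_first=True)
view_features = torch.randn(1, 8, 128)     # 8 views of one article, 128-d features each
outputs, (h_n, c_n) = lstm(view_features)
article_summary = h_n[-1]                  # final hidden state as the sequence summary
print(article_summary.shape)               # torch.Size([1, 64])
```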

[Figure: shoe image sequence]

A sample research question of interest in the fashion realm is how much spatial information is required to recognize that two images depict the same object. How many different views of an article must the network ‘see’ during training to infer its shape? Does such information transfer between objects of the same class? Zalando’s extensive archive of 360° views of shoes is a great dataset to examine these questions.
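One conceivable experimental setup, sketched below under assumed architecture and threshold choices, is metric learning: a shared encoder embeds two views, and the pair is declared to show the same article when the embeddings are close. Training pairs could be drawn from the 360° shoe views, varying how many angles the network is allowed to see.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ViewEncoder(nn.Module):
    """Shared encoder mapping an article view to a unit-length embedding."""
    def __init__(self, embed_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, embed_dim),
        )

    def forward(self, x):
        return F.normalize(self.net(x), dim=1)     # unit-length view embeddings

encoder = ViewEncoder()
view_a = torch.randn(1, 3, 128, 128)               # stand-in shoe views
view_b = torch.randn(1, 3, 128, 128)
similarity = (encoder(view_a) * encoder(view_b)).sum()   # cosine similarity
print("same article" if similarity > 0.8 else "different articles")
```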