
Project

Latent Variable Models for Language and Image Understanding in Social Media and E-Commerce Data

More content has been created in the past few years than in the entire history of humankind. With the exponential growth of user-contributed content, it becomes increasingly important to develop systems capable of intelligently processing both language and images.

While understanding language appears effortless for humans from a young age, for computers it is quite a challenging task. Languages are inherently ambiguous and rich: many words can denote the same concept, and conversely, the same word can represent multiple things. This ambiguity is accentuated on the wild, noisy Web, where users playfully and organically create new words and assign new meanings to existing terms. Consider, for example, the word "happy". It has many synonyms according to a standard English thesaurus: cheerful, glad, joyful, merry, etc. However, on the Web users choose a much wider range of terms to denote the same concept: cheerio, cherry-merry, cheryl, grooved, psyched, stoked, and the list is ever evolving. If we had to find all the documents that refer to one particular concept, relying on a thesaurus would not be sufficient. Instead, we wish to develop algorithms that automatically learn semantically related words from organic and noisy data, without relying on prior knowledge or dictionaries.
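As a toy illustration of this idea, words that occur in similar contexts can be grouped by comparing their co-occurrence vectors. The minimal sketch below uses a handful of hypothetical example sentences (not data from this project) and plain cosine similarity; it recovers that "stoked" and "psyched" are used alike, with no dictionary involved:

```python
from collections import defaultdict
import math

# Toy corpus standing in for noisy Web text (hypothetical sentences).
corpus = [
    "so stoked about the new sneakers",
    "so psyched about the new sneakers",
    "totally stoked for summer vibes",
    "totally psyched for summer vibes",
    "the cat sat on the mat",
]

def cooccurrence_vectors(sentences, window=2):
    """Map each word to a sparse vector of co-occurring words within a window."""
    vectors = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        words = sentence.split()
        for i, w in enumerate(words):
            for j in range(max(0, i - window), min(len(words), i + window + 1)):
                if i != j:
                    vectors[w][words[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    shared = set(u) & set(v)
    dot = sum(u[w] * v[w] for w in shared)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

vecs = cooccurrence_vectors(corpus)
# Words appearing in similar contexts get similar vectors, so "stoked"
# and "psyched" come out as near-synonyms while "cat" does not.
print(cosine(vecs["stoked"], vecs["psyched"]))  # 1.0 on this toy corpus
print(cosine(vecs["stoked"], vecs["cat"]))      # much lower
```

Real Web-scale data would of course require more robust representations (the latent variable models discussed below), but the underlying signal, namely shared context, is the same.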

In this context, we address the task of cross-idiomatic linking of Web sources: connecting textual content from different domains where similar concepts are discussed but language usage differs greatly. Specifically, we focus on linking social media posts from the popular site Pinterest.com to e-commerce products from Amazon.com. The task is framed in an information retrieval setting, where pins (short snippets of text that Pinterest users post about things they are interested in) serve as queries and Amazon products comprise the target collection. We develop novel textual representations based on the family of latent Dirichlet allocation (LDA) models. Our core insight is that we can learn representations that bridge the query and target language by leveraging pairs of aligned documents, i.e., documents that discuss the same topic using different words. Our proposed multi-idiomatic latent Dirichlet allocation (MiLDA) model explicitly takes into account the shared topic distribution between sources, while modeling both the differences and similarities in the language. The first set of contributions of this work is as follows:
1) We constructed a new benchmark dataset composed of pins from Pinterest, Amazon product descriptions and the corresponding user reviews. The dataset is accompanied by relevance annotations of randomly chosen pins with respect to the Amazon products.
2) We proposed, performed and assessed the novel task of cross-idiomatic linking, as described above.
3) We developed representations for cross-idiomatic modeling of noisy textual sources, as found on the Web.
4) We performed a systematic empirical comparison to evaluate the performance of different latent variable models for connecting cross-idiomatic sources.
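The retrieval side of this setup can be sketched as follows: once an LDA-family model has inferred a topic distribution for each pin and each product, products can be ranked by how well their topic mixture matches the query's, for instance by Kullback-Leibler divergence. The distributions and product names below are purely illustrative, not output from MiLDA:

```python
import math

# Hypothetical topic distributions over 3 latent topics, of the kind an
# LDA-family model would infer; the numbers are illustrative only.
pin_query = [0.70, 0.20, 0.10]          # a Pinterest pin used as query
products = {
    "running shoes": [0.65, 0.25, 0.10],
    "coffee maker":  [0.05, 0.15, 0.80],
}

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q): how poorly q's topic mixture explains p's."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Rank the target collection by closeness of topic mixture to the query.
ranking = sorted(products, key=lambda name: kl_divergence(pin_query, products[name]))
print(ranking)  # "running shoes" ranks first for this query
```

The point of a cross-idiomatic model such as MiLDA is precisely that the pin and the product can land on similar topic mixtures even when they share almost no vocabulary.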

In addition to language, understanding images is also challenging. While humans can easily "translate" visual concepts into words and vice versa, machines are far less adept at this. The challenge is that the raw representations of images and text (as normally stored in a computer) do not reveal their actual meaning; they are just large arrays of numbers.

We develop representations that allow us to semantically connect images and language. In this context, we address the task of cross-modal search: given a query image, we aim to retrieve words that describe the visual content (image annotation), and given a set of textual descriptors, we aim to find images that display those attributes (image search). Specifically, we perform this task within the fashion domain.

To achieve this, we exploit the alignment between images and their surrounding natural-language text, as found on the Web. Specifically, we investigate different image representations, such as scale-invariant feature transform (SIFT) features and convolutional neural networks (CNNs); different textual representations, such as bag-of-words (BoW) models and semantic word embeddings; and different latent variable alignment models, such as neural networks (NNs), canonical correlation analysis (CCA) and bilingual latent Dirichlet allocation (BiLDA).
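To give a flavor of such alignment models, the sketch below learns a linear map from a toy "image" feature space to a toy "text" feature space using paired examples, with ordinary least squares standing in for models like CCA. The features are synthetic (not real SIFT/CNN or bag-of-words vectors), and the setup is only a minimal illustration of cross-modal retrieval:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical paired features: each "image" row is paired with the "text"
# row of its surrounding description (synthetic toy data).
n, d_img, d_txt = 30, 8, 5
true_map = rng.normal(size=(d_img, d_txt))
X_img = rng.normal(size=(n, d_img))
Y_txt = X_img @ true_map + 0.01 * rng.normal(size=(n, d_txt))

# Learn a linear projection from image space into text space by least
# squares; CCA or a neural network would play this role in practice.
W, *_ = np.linalg.lstsq(X_img, Y_txt, rcond=None)

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Cross-modal search: project a query image into text space and retrieve
# the nearest text descriptor by cosine similarity.
query = X_img[0]
projected = query @ W
scores = [cosine(projected, y) for y in Y_txt]
best = int(np.argmax(scores))
print(best)  # retrieves the paired description, index 0
```

The same machinery runs in the other direction (text to image), which is what image search in the fashion domain requires.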

This yields the second set of contributions of this work:
1) We constructed a new benchmark dataset composed of pairs of images and noisy textual descriptions in the fashion domain, as found on the Web.
2) We proposed, performed and assessed the novel task of fashion cross-modal search.
3) We developed representations that bridge the gap between noisy and heterogeneous multimodal content.
4) We performed a systematic empirical comparison to evaluate the performance of different latent variable models for connecting cross-modal sources in fashion.

Date: 1 Nov 2012 → 22 Dec 2016
Keywords: social media, e-commerce, topic models
Disciplines: Applied mathematics in specific fields, Computer architecture and networks, Distributed computing, Information sciences, Information systems, Programming languages, Scientific computing, Theoretical computer science, Visual computing, Other information and computing sciences
Project type: PhD project