Authors: Janna Lipenkova, Stephan Telschow (G-I-M)

Market research is inherently biased: each method comes with its own limitations on authenticity and representativeness, which are rooted in our psychology. For example, survey research can suffer from acquiescence bias, where respondents automatically say “yes” to any question put to them. As another example, research based on product reviews is often polarized: users are more motivated to post extremely negative or extremely positive opinions. The “middle ground” of the average, neutral opinion does not seem worth mentioning and stays underrepresented.

One way to alleviate bias is to combine multiple data sources and methods to address the same research question. In the following case study, we apply a combination of two methods – a classical survey and a social media analysis using Natural Language Processing – to analyze eating behavior in China. Our goal is to understand the complementarity of the two methods and to show how they can be combined to generate synergies.

The format of the classical closed-question survey is well understood in market research. Why do we opt to complement it with insights from social media? Beyond constituting a dataset with radically different characteristics, social media is omnipresent and thus highly relevant in the life of the modern consumer; this is especially true in China, where consumers heavily rely on social media when they make purchasing decisions. As described in this article, a brand can significantly improve its customer centricity by following and responding to the social conversation.

Characterization of the two datasets

Overview of the two datasets

The two methods rely on datasets with different underlying properties. The survey uses a demographically representative sample of 2000 respondents. The leading theme is “Please describe your last meal”, with more detailed questions covering aspects such as type of food, location, motivation etc. It can be assumed that the described situations converge to a typical set of meal situations that reflect daily-life routines. All questions are closed, thus the data is structured. Since it is solicited explicitly, the respondents are motivated by external incentives. The questions are determined directly by the research goals and thus cover the information that is relevant to the researcher.

The social study uses a sample of 5M unstructured posts from Weibo related to food and eating behavior. There is no way to control demographics on social media, thus the sample is not demographically representative and, furthermore, biased towards the dominant user group of the platform. Users have no external incentives to post on social media and thus are intrinsically motivated. This augments the authenticity and customer-centricity of the data: only topics that are actually salient are discussed. In contrast to survey answers, which refer to the “average” meal situation, social media covers those cognitively prominent situations that are worth recalling, describing and sharing publicly. Thus, this dataset contains a larger share of exceptional, non-routine situations.

Figure 1: Cognitive levels addressed by the two datasets


The size and structure of the two datasets call for different methods of analysis. The survey is structured and controlled in terms of content and can be evaluated using standard analytical methods. By contrast, the social data is noisy, unstructured and, furthermore, very large – thus, it requires an additional effort of cleaning, filtering and structuring. We apply two Natural Language Processing algorithms – concept extraction and sentiment analysis – to structure the dataset. The algorithms are built on the basis of Anacode’s ontology, which classifies all relevant concepts such as ingredients, brands and common food locations; it also contains psychological universals including emotions and motivations. Figure 2 summarizes the setup of the two methods.
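To make the structuring step concrete, the following is a minimal sketch of ontology-based concept extraction combined with lexicon-based sentiment scoring. The ontology entries, sentiment lexicon and example posts are hypothetical stand-ins; the actual Anacode ontology and algorithms are far richer.

```python
# Toy ontology fragment: surface forms mapped to (category, concept).
# Entries are illustrative only.
ONTOLOGY = {
    "hotpot":      ("dish", "Hotpot"),
    "peanut oil":  ("ingredient", "Cooking Oil"),
    "soybean oil": ("ingredient", "Cooking Oil"),
    "restaurant":  ("location", "Restaurant"),
}

# Tiny sentiment lexicon, also illustrative.
SENTIMENT = {"delicious": 1, "great": 1, "greasy": -1, "awful": -1}

def analyze(post: str) -> dict:
    """Extract ontology concepts and a crude sentiment score from one post."""
    text = post.lower()
    concepts = [concept for form, concept in ONTOLOGY.items() if form in text]
    score = sum(weight for word, weight in SENTIMENT.items() if word in text)
    return {"concepts": concepts, "sentiment": score}

posts = [
    "Hotpot with friends at our favourite restaurant, delicious!",
    "The soybean oil they used was way too greasy.",
]
results = [analyze(p) for p in posts]
```

Run over millions of posts, a step like this turns free text into structured records that can be aggregated alongside the survey responses.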

Results and insights

  • Granularity of parameter values

Within individual parameters, social data allows for much greater variety. In the survey data, the number of possible responses has to be limited to control survey length and avoid fatigue bias. For example, “Cooking Oil” is one generic ingredient without further variations. By contrast, in social media, the number of possible parameter values is virtually infinite, covering any aspect that is deemed relevant by the users. For example, the ontology we use contains 25 variations of cooking oil, such as peanut oil, coconut oil, soybean oil etc.
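The two granularities can be reconciled by rolling fine-grained ontology values up to the generic survey-level category. A minimal sketch, with hypothetical mention counts and a roll-up table:

```python
from collections import Counter

# Fine-grained mentions extracted from social posts (hypothetical data).
mentions = ["peanut oil", "soybean oil", "peanut oil", "coconut oil", "soy sauce"]

# Roll-up table: many ontology variants collapse onto one generic survey item.
ROLLUP = {
    "peanut oil": "Cooking Oil",
    "soybean oil": "Cooking Oil",
    "coconut oil": "Cooking Oil",
    "soy sauce": "Condiment",
}

fine = Counter(mentions)                       # social-media granularity
coarse = Counter(ROLLUP[m] for m in mentions)  # survey granularity
```

The roll-up makes the two datasets directly comparable, while the fine-grained counts remain available for drill-down.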

  • Completeness of the dataset

The survey fills in all parameters of the considered situations, whereas social data is inherently incomplete. The following table shows some posts and their analyses:

It can be seen that many variables remain unknown in the social dataset. As a consequence of this sparsity, it is difficult to represent the whole complexity of eating situations. By contrast, survey respondents are required to fill in all parameters, thus producing a matrix without empty cells.
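This sparsity can be represented directly in the extracted records: each post fills only the parameters it actually mentions, and the rest stay empty. The field names and example values below are illustrative, not the study's actual schema.

```python
# Parameters of an eating situation; missing values stay None.
FIELDS = ("food", "location", "company", "motivation")

# Hypothetical records extracted from two social posts.
records = [
    {"food": "Hotpot", "location": None, "company": "friends", "motivation": None},
    {"food": None, "location": "home", "company": None, "motivation": "health"},
]

def completeness(record: dict) -> float:
    """Share of parameters that could be extracted from the post."""
    return sum(record[f] is not None for f in FIELDS) / len(FIELDS)
```

A survey record, by construction, would score 1.0 on such a measure; social records typically land well below that.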

  • Analysis results

When comparing the results of the two methods, the individual parameters show different distributions. For example, the following chart illustrates the social setting of eating situations:

The distributions are clearly different: for instance, meals with colleagues are frequently mentioned in the survey, whereas the social posts favor friends. We can speculate about the possible causes – for instance, the daily lunch situation with colleagues has a higher probability of being “caught” in the survey sample than the occasional, non-routine get-together with friends. The two types of situation also differ in their salience and emotional engagement.

Remarkably, the distributions become very similar once contextual filters are set as an additional “control” on the social dataset. Once restricted by the location of the meal – i.e., whether it takes place at home or outside – the relative order of the social settings shows a striking similarity:
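Such a contextual filter amounts to conditioning the distribution of one extracted parameter on another. A minimal sketch, with hypothetical (setting, location) pairs standing in for the extracted posts:

```python
from collections import Counter

# Hypothetical extracted posts: (social setting, meal location).
posts = [
    ("friends", "outside"), ("family", "home"), ("friends", "outside"),
    ("colleagues", "outside"), ("family", "home"), ("alone", "home"),
]

def setting_distribution(posts, location=None):
    """Relative frequency of social settings, optionally filtered by location."""
    subset = [s for s, loc in posts if location is None or loc == location]
    counts = Counter(subset)
    total = sum(counts.values())
    return {s: n / total for s, n in counts.items()}

overall = setting_distribution(posts)          # unfiltered social data
at_home = setting_distribution(posts, "home")  # "control" filter applied
```

Filtering by location changes the relative weights of the settings, which is exactly the mechanism that brings the social distribution closer to the survey's.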

How can the two methods be combined efficiently?

Our findings can be summarized as follows:

  • The two methods operate on different cognitive levels: survey data measures awareness, whereas social data addresses salience, relevance and judgment.
  • Appropriate context filters on the social dataset lead to comparable results.
  • Surveys can be used to cover complex, differentiated questions with relatively few possible parameter values; the complexity of questions that can be covered by social data is limited.
  • The “answers” obtained from social data for individual questions and parameters are much more differentiated and exhaustively cover all aspects that actually appear relevant to the user.

How can the two methods be combined beyond a simple comparison? One natural synergy would result from applying the social media analysis in a first step to prepare the survey design. Social data can be mined exploratively to trace the questions, parameters and values which are relevant to consumers. With the relevant topics covered, the survey can be used to get structured, clean and complete answers to even complex questions. This approach increases customer centricity and reduces subjectivity biases on the part of the researcher. A relevance-driven survey design potentially also reinforces the intrinsic engagement of the respondents and motivates them to provide maximally authentic and truthful answers to the survey questions.


About: This study is a joint project with G-I-M. You can download the full presentation here.