Market research typically follows a pipeline from data to insights. Adopting the cooking metaphor, data constitutes the “raw ingredients” that go into the process, while insights can be compared to the final dish – an actionable output that integrates multiple levels of processing. On this continuum, different research providers exhibit different strengths. The following chart shows how a selection of providers is positioned across the data-insight continuum in Anacode’s dataset of public media articles for the timespan Jan 2017 – Apr 2020; the size of the bubbles reflects the relative mention frequency of the companies:
Figure 1: Market research providers between data and insights (source: Anacode)
In the following, we provide an interpretation of the chart and an explanation of the underlying methodology.
1. So what?
Let’s see how we can interpret the information in this chart:
- IDC appears in a rather extraordinary position, being strongly associated with both dimensions of research. Additionally, together with Gartner, it is one of the most frequently mentioned players.
- Traditional market research providers, such as Ipsos, Nielsen and Kantar, score fairly highly in terms of insights. However, they show strong variation on the data axis, with Nielsen having the strongest association.
- As expected, younger, data-driven research companies such as Statista, eMarketer and SuperData are positioned high on the data scale, but appear in the lower segment in terms of insight.
- “One-to-many” research companies such as Forrester and Gartner generate a lot of insight but are average on the data scale. This positioning is expected since the main asset of these companies is their pool of analyst talent.
2. Methodology: how the chart was created
2.1 Underlying dataset
The dataset consists of ca. 1.4M curated online articles from more than 50 English-language domains, crawled for the timespan January 2017 – April 2020. The selected Web domains combine general content with content focussed on business, technology and science. The following chart shows the top-15 domains in our dataset:
Figure 2: Top-15 Web domains in the underlying dataset
2.2 Analysis method
The main NLP technique used for this analysis is the word embedding (Mikolov et al., 2013). We proceed as follows:
- We start by training 200-dimensional semantic word embeddings on the article texts. Before training, we normalize the data and merge synonyms, such as IDC and International Data Corporation, so that they are treated as one semantic entity.
- We are interested in the dimensions “data” and “insight”. For a better understanding of these dimensions, the following charts show the corresponding associations in our dataset:
Figure 3: Concepts and words closely associated with the dimensions “data” and “insight”
It can be seen that “data” is mostly associated with technical and mathematical concepts, whereas “insight” conveys connotations of understanding, knowledge and – particularly relevant amidst today’s abundance of data – actionability.
- Since a chart showing all market research companies would be unreadable, we sample a reasonable number of representative companies. Specifically, we aim for a roughly balanced distribution across the quadrants, with 4-5 companies per quadrant. The bottom-left quadrant is less crowded because our data contains only a few companies that show weak associations with both data and insight. This observation is expected because the market research industry is highly competitive and requires a clear positioning.
- Finally, we plot the semantic vectors of the companies, taking the semantic vectors of “data” and “insight” as the two dimensions of our vector space.
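The steps above can be illustrated with a minimal sketch. It is not Anacode's implementation: the synonym table, the helper names (`normalize`, `cosine`, `position`) and the toy three-dimensional vectors are all hypothetical, and using cosine similarity to the "data" and "insight" vectors as chart coordinates is an assumption, since the exact projection is not specified here. In practice the vectors would be 200-dimensional and come from a word2vec-style model trained on the normalized articles.

```python
import math
import re

# Hypothetical synonym table: surface form -> canonical token.
SYNONYMS = {
    "international data corporation": "idc",
}

def normalize(text):
    """Lowercase the text, merge known synonyms, and split into tokens."""
    text = text.lower()
    for phrase, canonical in SYNONYMS.items():
        text = re.sub(r"\b" + re.escape(phrase) + r"\b", canonical, text)
    return re.findall(r"[a-z0-9]+", text)

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def position(company, vectors):
    """Return assumed (data, insight) coordinates for one company token."""
    return (
        cosine(vectors[company], vectors["data"]),
        cosine(vectors[company], vectors["insight"]),
    )

# Toy vectors invented for illustration only; they are not real results.
TOY_VECTORS = {
    "data": [1.0, 0.0, 0.0],
    "insight": [0.0, 1.0, 0.0],
    "company_a": [0.9, 0.1, 0.2],  # hypothetical data-leaning company
    "company_b": [0.2, 0.8, 0.1],  # hypothetical insight-leaning company
}

if __name__ == "__main__":
    print(normalize("International Data Corporation (IDC) said"))
    for name in ("company_a", "company_b"):
        x, y = position(name, TOY_VECTORS)
        print(f"{name}: data={x:.2f}, insight={y:.2f}")
```

Merging synonyms before training matters because the embedding model would otherwise learn separate, weaker vectors for each surface form of the same company.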
T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean (2013). Distributed Representations of Words and Phrases and their Compositionality. Proceedings of NIPS 2013.