Download: COVID-19 Public Media Dataset

In response to the COVID-19 pandemic, Anacode has prepared the COVID-19 Public Media Dataset. The dataset contains over 40,000 full-text online articles scraped from online media between January and April 2020, focussed mainly on the non-medical aspects of COVID-19. This free dataset is provided to the global data science community so that recent advances in NLP and data mining can be applied to analyze the overall impact, challenges and opportunities of the current crisis. We also hope that it will encourage more work on information-related issues such as disinformation, rumours and fake news that shape the global response to the situation.

The dataset was formed by filtering all content from the included web domains for keywords related to COVID-19. It contains articles from four topic areas (general, business, finance and technology) with the following distribution over time:

Please download the data from our Kaggle account. You can also check this interactive chart for a clustering of the articles.
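
For readers who want to build a similar subset from their own crawl, the keyword filtering described above could look roughly like the sketch below. It is only an illustration: the keyword list, the field names (`title`, `text`, `topic_area`) and the sample articles are all made up for this example, and the actual keywords and schema used for the dataset are not stated here.

```python
import re

# Hypothetical keyword list; the keywords actually used for the dataset are not published here.
KEYWORDS = ["covid-19", "coronavirus", "sars-cov-2"]

# One case-insensitive pattern that matches any of the keywords.
PATTERN = re.compile("|".join(re.escape(k) for k in KEYWORDS), re.IGNORECASE)


def filter_articles(articles):
    """Keep only articles whose title or text mentions at least one keyword."""
    kept = []
    for article in articles:
        haystack = article.get("title", "") + " " + article.get("text", "")
        if PATTERN.search(haystack):
            kept.append(article)
    return kept


# Made-up sample records; field names are assumptions, not the dataset's schema.
sample = [
    {"title": "Markets react to Coronavirus fears",
     "text": "Stocks fell sharply.", "topic_area": "finance"},
    {"title": "New phone released",
     "text": "The device ships in May.", "topic_area": "technology"},
    {"title": "Remote work tools surge",
     "text": "Demand rose during the COVID-19 lockdowns.", "topic_area": "business"},
]

matched = filter_articles(sample)
print([a["title"] for a in matched])
```

A plain substring-style match like this is deliberately simple; a real pipeline would probably also need to handle spelling variants and other languages.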