CORD-19: An Open Research Dataset Made to Tackle the Coronavirus
Artificial Intelligence, Kaggle and a Global Research Community Coming Together in a Time of Crisis, Led by the Allen Institute for AI
Given the scale of the coronavirus crisis, it is not surprising that the technology industry in the United States has come together in some form to address it. CORD-19 appears to be part of this effort, and it is billed as a free and open resource for the global research community.
“CORD-19: the Allen Institute for AI has partnered with leading research groups to prepare and distribute the COVID-19 Open Research Dataset (CORD-19), a free resource of over 29,000 scholarly articles, including over 13,000 with full text, about COVID-19 and the coronavirus family of viruses for use by the global research community. The corpus is intended to be updated weekly as new research is published in peer-reviewed publications and archival services like bioRxiv, medRxiv, and others. The initiative, building on AI2’s Semantic Scholar project, uses natural language processing to analyze scientific papers about coronavirus, including the novel coronavirus that causes COVID-19.” — paraphrased from Semantic Scholar’s project page on CORD-19.
This is an interesting effort, and so far it seems almost unprecedented as a combined AI, data, and health response to the coronavirus, although I would be happy rather than disappointed to be proven wrong on this point.
In addition to the Allen Institute for AI, the partners are the Chan Zuckerberg Initiative, Georgetown University’s Center for Security and Emerging Technology, Microsoft Research, and the National Library of Medicine at the National Institutes of Health, in coordination with the White House Office of Science and Technology Policy.
Kaggle’s CORD-19 Challenge
Kaggle, a subsidiary of Google LLC, is an online community of data scientists and machine learning practitioners. It is well known for competitions that can bring great renown or rewards to those who participate. Kaggle has built a challenge around this dataset, called the COVID-19 Open Research Dataset Challenge.
The challenge is designed as a series of important questions meant to inspire the community to use CORD-19 to find new insights about the COVID-19 pandemic, including the natural history, transmission, and diagnostics for the virus, management measures at the human-animal interface, lessons from previous epidemiological studies, and more.
In doing so they are:
“…issuing a call to action to the world’s artificial intelligence experts to develop text and data mining tools that can help the medical community develop answers to high priority scientific questions.”
The important questions they have framed are listed below, ranked by their upvotes on Kaggle as of the 18th of March 2020:
- What is known about transmission, incubation, and environmental stability? (150)
- What do we know about COVID-19 risk factors? (67)
- Sample task with sample submission (geography vs. virality) (34)
- What do we know about virus genetics, origin, and evolution? (31)
- What do we know about vaccines and therapeutics? (28)
- What has been published about ethical and social science considerations? (24)
- What do we know about non-pharmaceutical interventions? (24)
- What do we know about diagnostics and surveillance? (23)
- What has been published about medical care? (22)
- What has been published about information sharing and inter-sectoral collaboration? (22)
According to the Kaggle challenge these key scientific questions are drawn from the NASEM’s SCIED (National Academies of Sciences, Engineering, and Medicine’s Standing Committee on Emerging Infectious Diseases and 21st Century Health Threats) research topics and the World Health Organization’s R&D Blueprint for COVID-19.
“Kaggle is sponsoring a $1,000 per task award to the winner whose submission is identified as best meeting the evaluation criteria. The winner may elect to receive this award as a charitable donation to COVID-19 relief/research efforts or as a monetary payment. More details on the prizes and timeline can be found on the discussion post.”
As of the 18th of March 2020, the challenge had attracted the following attention:
- 184,803 views
- 5,252 downloads
- 52 kernels
- 70 topics
Although downloading from Kaggle is recommended, Semantic Scholar also offers the papers as separate subsets, grouped by the kind of use their licenses allow.
Download here:
- Commercial use subset (includes PMC content): 9,000 papers, 186 MB
- Non-commercial use subset (includes PMC content): 1,973 papers, 36 MB
- PMC custom license subset: 1,426 papers, 19 MB
- bioRxiv/medRxiv subset (pre-prints that are not peer reviewed): 803 papers, 13 MB
Each paper is represented as a single JSON object. The schema is available here.
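To give a feel for what working with one of these JSON objects might look like, here is a minimal Python sketch. The field names (`paper_id`, `metadata` with a `title`, and `abstract` and `body_text` as lists of paragraphs that each carry a `text` key) are assumptions based on my reading of the schema and may differ between releases, so check the schema before relying on them. The sample paper is made up for illustration.

```python
import json


def summarize_paper(paper):
    """Pull a few fields out of one parsed paper object.

    Assumed layout (verify against the published schema):
    - "paper_id": string identifier
    - "metadata": dict with a "title" key
    - "abstract" and "body_text": lists of dicts, each with a "text" key
    """
    metadata = paper.get("metadata", {})
    abstract = " ".join(p.get("text", "") for p in paper.get("abstract", []))
    body = " ".join(p.get("text", "") for p in paper.get("body_text", []))
    return {
        "paper_id": paper.get("paper_id"),
        "title": metadata.get("title", ""),
        "abstract": abstract,
        "body_word_count": len(body.split()),
    }


# Hypothetical paper, written inline so the sketch runs without any download.
sample = json.loads("""
{
  "paper_id": "example-0001",
  "metadata": {"title": "An example title"},
  "abstract": [{"text": "Short abstract."}],
  "body_text": [{"text": "First paragraph here."}, {"text": "Second one."}]
}
""")

summary = summarize_paper(sample)
print(summary["title"], "-", summary["body_word_count"], "words in body")
```

For a real file you would replace the inline sample with `json.load(open(path, encoding="utf-8"))`, where `path` points to one of the downloaded JSON files.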
They also provide a comprehensive metadata file of 29,000 coronavirus and COVID-19 research articles with links to PubMed, Microsoft Academic and the WHO COVID-19 database of publications (includes articles without open access full text):
- Metadata file (readme): 47 MB
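A quick way to get an overview of a metadata file like this is to count how many entries come with full text. The sketch below assumes the file is a CSV with a `has_full_text` column holding `True`/`False` values. Both the format and the column name are my assumptions, so check the readme first. The rows are invented stand-ins, not real entries.

```python
import csv
import io


def count_full_text(csv_file):
    """Return (total rows, rows flagged as having full text).

    Assumes a header row and a "has_full_text" column with
    "True"/"False" strings; adjust to the real column names.
    """
    total = 0
    with_full_text = 0
    for row in csv.DictReader(csv_file):
        total += 1
        flag = (row.get("has_full_text") or "").strip().lower()
        if flag == "true":
            with_full_text += 1
    return total, with_full_text


# Hypothetical in-memory stand-in for the real metadata file.
sample_csv = io.StringIO(
    "title,doi,has_full_text\n"
    "Paper A,10.0000/a,True\n"
    "Paper B,10.0000/b,False\n"
    "Paper C,,True\n"
)

total, full = count_full_text(sample_csv)
print(f"{full} of {total} entries have full text")
```

With the real file you would pass `open(path, newline="", encoding="utf-8")` instead of the `io.StringIO` sample.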
There is clear encouragement to keep the research open so that this project benefits the public good, and the organizers are looking for publishers to contribute to the CORD-19 corpus.
The page on Semantic Scholar also lists resources from the Allen Institute for AI:
- SciSpacy, a text processing toolkit optimized for scientific text
- SciBERT, a BERT model pretrained on scientific text
- Semantic Scholar API and Open Research Corpus
- Create an AI-powered customizable adaptive feed of COVID-19 research from arXiv
- View the latest search results for COVID-19 on Semantic Scholar
And additional resources:
- COVID-19 Research Database (provided by the WHO)
- LitCOVID (provided by the NIH)
- COVID-19 Resource Page (provided by Microsoft Academic)
- COVID-19 Research Export File (provided by Dimensions)
- Day-Level COVID-19 Dataset (hosted on Kaggle)
- COVID-19 Global Cases (provided by Johns Hopkins University)
- Blog Post: Computer Scientists Are Building Algorithms to Tackle COVID-19
“AI and high tech in general have gotten something of a bad rap recently, but this crisis shows how AI can potentially do a world of good,” said Oren Etzioni, CEO of Seattle’s Allen Institute for Artificial Intelligence (AI2) and a University of Washington computer science professor.
-GeekWire, the 17th of March 2020
The White House announced the initiative along with a coalition that includes:
- AI2
- The Chan Zuckerberg Initiative
- Georgetown University’s Center for Security and Emerging Technology
- Microsoft Research
- The National Library of Medicine
- Kaggle, the machine learning and data science community owned by Google
For more information, you can check out the Health Tech Podcast, in which GeekWire’s Alan Boyle, who covered the story, explains the significance of the announcement and what it could mean in the fight against COVID-19 and future outbreaks.
This is #500daysofAI and you are reading article 287. I am writing one new article about or related to artificial intelligence every day for 500 days. My current focus for days 200–300 is national and international strategies for artificial intelligence. I have decided to spend the last 25 days of my AI strategy writing focusing on the climate crisis.