COVID-19 and why public datasets are so important


Public data is meant to be accessible to everyone. In many cases, however, data that should be available is published only as summaries, or in formats that are difficult for a machine to interpret. Public data is vital for gaining a deeper understanding of a problem and for uncovering the patterns hidden in the data. That understanding leads to new machine learning models that can make valuable predictions about the future, or help save human lives by working with COVID-19 research articles.

Public data needs to be assembled by someone, and in many cases, the organizations that you think would be responsible for assembling this data are not really doing so.

You would think that the official organizations responsible for handling global outbreaks, such as virus epidemics, would have been the ones releasing detailed data on COVID-19. But from what we have seen lately, public datasets were not made available at the beginning of the outbreak, and they did not contain the necessary details.

For example, many countries release data on the number of people in quarantine, but so far this is not reflected in any global public dataset.

Similarly, the type of diagnosis performed on patients, the types of testing that each country is currently doing, and the initial symptoms are not available in the form of data.

The power of data

Data is one of our most important allies when it comes to fighting the outbreak of COVID-19.

Johns Hopkins University was the first to release a dataset that gave a global view of how the COVID-19 virus is spreading. The dashboard and dataset, released on GitHub, were made available on January 22.

Figure: world map of COVID-19 cases
The interactive web-based dashboard (static snapshot) hosted by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University, which visualizes and tracks reported cases in real time.
Illustration: Rasmus Hauch

The “coronavirus tracking map” was developed to monitor, track, and record confirmed cases of coronavirus worldwide in real time. The dashboard currently receives an average of 1.2 billion interactions a day. The purpose of making this data available was to provide researchers, public health authorities, and the general public with a user-friendly tool to track the outbreak as it unfolds.

South Korea can be seen as a frontrunner in releasing detailed data on its COVID-19 situation, with an approach built on transparency and technology. The country had 7,869 confirmed cases of COVID-19 as of midday March 12, the fourth highest number in the world after China, Italy, and Iran. Even so, its handling of the crisis has been widely lauded as a benchmark, both for its effective response and for its open and democratic approach to using cutting-edge technology.

The large number of cases in South Korea can be attributed to the country’s widespread testing, which has covered more than 200,000 people, with a capacity to test up to 20,000 people per day. This was made possible by deploying multiple technologies like diagnostic apps, innovative testing kits, and telecommuting solutions.

By testing broadly, and not only symptomatic people, South Korea has detected more asymptomatic positive cases of coronavirus than Italy, particularly among young people. This data is valuable because it suggests that younger people, who might not otherwise be tested for COVID-19 because they are asymptomatic, could be the ones spreading the virus. With this public health measure in place, asymptomatic people with the virus can isolate even if they don’t feel sick, and so avoid spreading it.

Figure: bar chart of COVID-19 cases
The graph suggests that younger people in South Korea, who are tested for the disease regardless of whether they show symptoms, are perhaps more likely to be asymptomatic.
Illustration: Medium/Andreas Backhaus

During this outbreak, many new public tech offerings have appeared.

In China, big data has been used to stop the spread of the virus. Tencent, a leading provider of internet value-added services, has been rolling out a QR code system on the social network WeChat to track potential COVID-19 carriers on public transportation. Passengers entering a bus, subway, or taxi can submit their information through Tencent’s “ride registration code”, and the system will link their real names to the vehicle’s license plate, boarding time, and other information. When a passenger is discovered to have been infected, other passengers who might have been exposed get a message warning them.
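The matching step behind such a warning system can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Tencent's implementation: the Ride record, its field names, and the recorded leaving time are all hypothetical (the system described above is only said to store real name, license plate, boarding time, and other information).

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Ride:
    """One ride registration. All field names are hypothetical."""
    passenger: str      # registered name or ID
    plate: str          # vehicle license plate
    board_minute: int   # boarding time, minutes since midnight
    leave_minute: int   # assumed: time the passenger left the vehicle


def exposed_passengers(rides, infected):
    """Return passengers who shared a vehicle with `infected` at an overlapping time."""
    by_plate = defaultdict(list)
    for ride in rides:
        by_plate[ride.plate].append(ride)

    exposed = set()
    for ride in rides:
        if ride.passenger != infected:
            continue
        for other in by_plate[ride.plate]:
            # Two stays overlap when each starts before the other ends
            overlaps = (other.board_minute < ride.leave_minute
                        and ride.board_minute < other.leave_minute)
            if other.passenger != infected and overlaps:
                exposed.add(other.passenger)
    return exposed


# Made-up registrations for illustration only
rides = [
    Ride("passenger_a", "BUS-1", 480, 500),
    Ride("passenger_b", "BUS-1", 490, 510),   # same bus, overlapping time
    Ride("passenger_c", "BUS-1", 520, 540),   # same bus, boarded later
    Ride("passenger_d", "TAXI-7", 485, 495),  # different vehicle
]
print(sorted(exposed_passengers(rides, "passenger_a")))  # ['passenger_b']
```

Grouping rides by license plate first keeps the check cheap: each ride of the infected passenger is compared only against other rides in the same vehicle.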

Open research data sets

Another response to the pandemic is open research data sets. Offering freely available data sets to the global research community is a way to generate new insights in support of the ongoing fight against COVID-19. These data sets are open for the community to apply recent advancements like natural language processing and other AI techniques.

The White House and a coalition of leading research groups have prepared the COVID-19 Open Research Dataset (CORD-19), a resource of over 29,000 scholarly articles available on Kaggle. The world’s AI experts can use it to develop text and data mining tools that help the medical community answer high-priority scientific questions.
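To give a feel for what a very simple text mining tool could look like, here is a minimal Python sketch that ranks articles by how often the terms of a question appear in their title and abstract. It is not part of CORD-19 or Kaggle: the record layout with "title" and "abstract" keys is an assumption, and the three placeholder records are made up purely for illustration. Real tools would more likely use proper NLP models than raw term counts.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_articles(articles, query):
    """Return titles of matching articles, most query-term occurrences first."""
    terms = set(tokenize(query))
    scored = []
    for article in articles:
        counts = Counter(tokenize(article["title"] + " " + article["abstract"]))
        score = sum(counts[term] for term in terms)
        if score > 0:
            scored.append((score, article["title"]))
    # Highest score first; ties broken alphabetically by title
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [title for _, title in scored]


# Placeholder records, not real articles
articles = [
    {"title": "Placeholder A",
     "abstract": "incubation period estimates and incubation variability"},
    {"title": "Placeholder B",
     "abstract": "transmission on public transport"},
    {"title": "Placeholder C",
     "abstract": "incubation period and transmission dynamics"},
]
print(rank_articles(articles, "incubation period"))
# ['Placeholder A', 'Placeholder C']
```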

What about data privacy?

Looking at these new and innovative tech offerings and public datasets that are being rolled out to fight the outbreak of COVID-19, there is also a concern about the privacy of people and their behavior. This is why Denmark, for instance, has only recently started releasing information about its patients.

The concern for people’s privacy is understandable when releasing these data sets. It is therefore necessary to combine k-anonymity principles with properly audited anonymization of the datasets. The aim is to guarantee, on a scientific basis, that person-specific data cannot be used to re-identify the individual who is the subject of the data, while the data remains useful to data scientists.
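As a rough illustration of the k-anonymity idea: a dataset is k-anonymous when every combination of quasi-identifying values (for example age and region) is shared by at least k records. The Python sketch below measures k for a small table and shows how generalizing exact ages into ranges raises it. The column names, the choice of quasi-identifiers, and the patient records are all made up for illustration.

```python
from collections import Counter


def k_anonymity(records, quasi_identifiers):
    """Return the size of the smallest group sharing the same quasi-identifier values."""
    groups = Counter(
        tuple(record[column] for column in quasi_identifiers)
        for record in records
    )
    return min(groups.values()) if groups else 0


def generalize_age(record, bucket=10):
    """Replace an exact age with a range such as '30-39'."""
    low = (record["age"] // bucket) * bucket
    return {**record, "age": f"{low}-{low + bucket - 1}"}


# Made-up records for illustration only
patients = [
    {"age": 31, "region": "North", "status": "recovered"},
    {"age": 36, "region": "North", "status": "hospitalized"},
    {"age": 52, "region": "South", "status": "recovered"},
    {"age": 58, "region": "South", "status": "recovered"},
]

quasi = ["age", "region"]
print(k_anonymity(patients, quasi))     # 1: every record is unique

generalized = [generalize_age(p) for p in patients]
print(k_anonymity(generalized, quasi))  # 2: each group now has two records
```

Note that k-anonymity on its own is not a complete guarantee. In the generalized table both records in the South group share the same status, so that attribute is still revealed for anyone known to be in the group. This is one reason to pair it with an audited anonymization process.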

For the common fight against COVID-19

COVID-19 is spreading so rapidly that it is difficult to keep up, and this creates a huge demand for information that can be trusted.

At the same time, we see a growing urgency for approaches such as new tech offerings as well as open research data sets for the research communities.

Publicly available data becomes even more important, as it can be used for a variety of things that are crucial during an outbreak like this. In the fight against COVID-19, open research data that is available to global communities, data scientists, and AI experts gives us a foundation for working toward a common goal: discovering new insights to fight the virus.

In many instances, these people will be working on their own, developing projects and models that can be of great importance.

With the number of people working on this growing exponentially, gathering the data sets on a shared platform can provide more significant advantages. For these communities and the brightest minds within AI, a platform offers an opportunity to work together, collaborate, and share projects and models.

Let's build a better future together.

This opinion piece was previously published on medium.com.