covid19.analytics Package: get access to and analyze the live worldwide data from the novel CoViD19 from JHU repository

Marcelo Ponce

2023-10-15

COVID19.Analytics

Table of Contents

  1. Introduction
  2. covid19.analytics Main Features
    1. Data Accessibility
      1. Data Structure
      2. Data Integrity and Checks
      3. Genomics Data
    2. Analytical & Graphical Indicators
  3. Installation
  4. Examples
  5. About
    1. Media & Press
    2. References and Citation
      1. Citing covid19.analytics

Introduction

The “covid19.analytics” R package allows users to obtain live* worldwide data from the novel Coronavirus Disease originally reported in 2019, COVID-19.

One of the main goals of this package is to make the latest data about the COVID-19 pandemic promptly available to researchers and the scientific community.

The “covid19.analytics” package also provides basic analysis tools and functions to investigate these datasets.

The following sections briefly describe some of the main features of the covid19.analytics package. We strongly recommend that users read our paper “covid19.analytics: An R Package to Obtain, Analyze and Visualize Data from the Coronavirus Disease Pandemic” (https://arxiv.org/abs/2009.01091), where further details about the package are presented and discussed.

covid19.analytics Main Features

The covid19.analytics package is an open-source tool whose main implementation and API is the R package. In addition to this, the package has a few more add-ons:

Data Sources

The “covid19.analytics” package provides access to the following open-access data sources:

Data Accessibility

The covid19.data() function allows users to obtain real-time data about the reported COVID-19 cases from the JHU’s CSSE repository, in the following modalities:

  • “aggregated” data for the latest day, with a fine ‘granularity’ of geographical regions (i.e. cities, provinces, states, countries)

  • “time series” data for larger accumulated geographical regions (provinces/countries)

  • “deprecated”: we also include the original data style in which these datasets were reported initially.

The datasets also include information about the different categories (status) “confirmed”/“deaths”/“recovered” of the cases reported daily per country/region/city.

This data-acquisition function will first attempt to retrieve the data directly from the JHU repository with the latest updates. If for whatever reason this fails (e.g. problems with the connection), the package will load a preserved “image” of the data, which is not the latest one but still allows the user to explore this older dataset. In this way, the package offers a more robust and resilient approach to the quite dynamic situation with respect to data availability and integrity.
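The fallback behaviour described above can be sketched in base R with tryCatch(). This is only an illustration of the general pattern, not the package's internal code: the helper fetch.with.fallback() and the two stand-in functions are hypothetical names introduced for this example.

```r
# Hypothetical helper: try a live source first, fall back to a stored copy on error
fetch.with.fallback <- function(fetch.live, fetch.stored) {
  tryCatch(
    list(data = fetch.live(), source = "live"),
    error = function(e) {
      message("Live retrieval failed (", conditionMessage(e), "); using stored copy")
      list(data = fetch.stored(), source = "stored")
    }
  )
}

# Stand-in functions (assumptions): a failing live source and a small stored dataset
broken.live <- function() stop("connection problem")
stored.copy <- function() data.frame(Country.Region = "Canada", Confirmed = 100)

res <- fetch.with.fallback(broken.live, stored.copy)
print(res$source)
```

The returned list records which source was actually used, so downstream code can warn the user when the data may not be the latest.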

Data retrieval options

argument            description
aggregated          latest number of cases aggregated by country

Time Series data
ts-confirmed        time series data of confirmed cases
ts-deaths           time series data of fatal cases
ts-recovered        time series data of recovered cases
ts-ALL              all time series data combined

Deprecated data formats
ts-dep-confirmed    time series data of confirmed cases as originally reported (deprecated)
ts-dep-deaths       time series data of deaths as originally reported (deprecated)
ts-dep-recovered    time series data of recovered cases as originally reported (deprecated)

Combined
ALL                 all of the above

Time Series data for specific locations
ts-Toronto          time series data of confirmed cases for the city of Toronto, ON - Canada
ts-confirmed-US     time series data of confirmed cases for the US detailed per state
ts-deaths-US        time series data of fatal cases for the US detailed per state

Data Structure

The TimeSeries data is organized in a specific manner, with a given set of fields or columns, which resembles the following structure:

“Province.State” “Country.Region” “Lat” “Long” seq of dates

Using your own data and/or importing new data sets

If you have data structured in a data.frame organized as described above, then most of the functions provided by the “covid19.analytics” package for analyzing TimeSeries data will work with your data. In this way it is possible to add new data sets to the ones that can be loaded using the repositories predefined in this package and extend the analysis capabilities to these new datasets.

Be sure also to check the compatibility of these datasets using the Data Integrity and Consistency Checks functions described in the following section.
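As a minimal sketch of the structure described above, the following base R code builds a small data.frame with the four leading columns followed by one column per date. The location names, the counts and the date format used for the column names are all assumptions made for illustration; check them against a dataset returned by covid19.data() before relying on them.

```r
# Leading columns, as listed in the structure above (values are made up)
my.ts <- data.frame(
  Province.State = c("", "Region A"),
  Country.Region = c("Atlantis", "Atlantis"),
  Lat  = c(10.0, 12.5),
  Long = c(-30.0, -28.5),
  stringsAsFactors = FALSE
)

# One column per date; the "YYYY-MM-DD" naming is an assumption
dates  <- format(seq(as.Date("2020-03-01"), by = "day", length.out = 5))
counts <- rbind(c(1, 3, 7, 12, 20),
                c(0, 1, 2, 2, 5))
for (j in seq_along(dates)) {
  my.ts[[dates[j]]] <- counts[, j]
}

str(my.ts)
```

A data.frame shaped like this is what the text above refers to when it mentions adding your own datasets.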

Data Integrity and Consistency Checks

Due to the ongoing and rapidly changing situation with the COVID-19 pandemic, the reported data has sometimes been found to change its internal format or even show some “anomalies” or “inconsistencies” (see https://github.com/CSSEGISandData/COVID-19/issues/).

For instance, some cumulative quantities reported in time series datasets have been observed to decrease at times instead of increasing continuously, which should not happen (see for instance, https://github.com/CSSEGISandData/COVID-19/issues/2165). We refer to this as inconsistency of “type II”.

Some negative values have been reported as well in the data, which also is not possible or valid; we call this inconsistency of “type I”.

When this occurs, it happens at the level of the origin of the dataset, in our case, the one obtained from the JHU/CSSEGIS repository [1]. In order to make the user aware of this, we implemented two consistency and integrity checking functions:

  • consistency.check(), this function attempts to determine whether there are consistency issues within the data, such as negative reported values (inconsistency of “type I”) or anomalies in the cumulative quantities of the data (inconsistency of “type II”)

  • integrity.check(), this determines whether there are integrity issues within the datasets or changes to the structure of the data

Alternatively, we provide a data.checks() function that will run both functions on a specified dataset.
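To make the two types of inconsistency concrete, here is a small base R sketch that flags them on made-up cumulative series. It is not the implementation of consistency.check(); the matrix ts.values and the variable names are hypothetical and only illustrate the definitions given above.

```r
# Made-up cumulative series, one row per location (illustration only)
ts.values <- rbind(
  clean    = c(1, 2, 4, 4, 7),
  negative = c(0, 3, -1, 5, 6),
  decrease = c(2, 5, 9, 8, 10)
)

# "type I": any negative reported value
has.type.I <- apply(ts.values, 1, function(x) any(x < 0, na.rm = TRUE))

# "type II": a cumulative quantity that decreases between consecutive dates
has.type.II <- apply(ts.values, 1, function(x) any(diff(x) < 0, na.rm = TRUE))

print(data.frame(type.I = has.type.I, type.II = has.type.II))
```

Note that the row with a negative value is also flagged as “type II” here, since dropping to a negative value is itself a decrease in the cumulative series.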

Data Integrity

It is highly unlikely that you would face a situation where the internal structure of the data, or its actual integrity, is compromised. However, if you think that this is the case, or the integrity.check() function reports it, we urge you to contact the developer of this package (https://github.com/mponce0/covid19.analytics/issues).

Data Consistency

Data consistency issues and/or anomalies in the data have been reported several times, see https://github.com/CSSEGISandData/COVID-19/issues/. In most cases, these are claimed to be misreported data and usually represent just an insignificant fraction of the total cases. Having said that, we believe that the user should be aware of these situations, and we recommend using the consistency.check() function to verify the dataset you will be working with.

Nullifying Spurious Data

In order to deal with the different scenarios arising from incomplete, inconsistent or misreported data, we provide the nullify.data() function, which will remove any entry in the data that can be suspected of these incongruencies. In addition to that, the function accepts an optional argument stringent=TRUE, which will also prune any incomplete cases (e.g. with NAs present).

Genomics Data

Similarly to the rapid developments and updates in the reported cases of the disease, the genetic sequencing of the virus is moving almost at equal pace. That’s why the covid19.analytics package provides access to a good number of the genomics data currently available.

The covid19.genomic.data() function allows users to obtain the COVID-19’s genomics data from NCBI’s databases [5]. The type of genomics data accessible from the package is described in the following table.

type description source
genomic a composite list containing different indicators and elements of the SARS-CoV-2 genomic information https://www.ncbi.nlm.nih.gov/sars-cov-2/
genome genetic composition of the reference sequence of the SARS-CoV-2 from GenBank https://www.ncbi.nlm.nih.gov/nuccore/NC_045512
fasta genetic composition of the reference sequence of the SARS-CoV-2 from a fasta file https://www.ncbi.nlm.nih.gov/nuccore/NC_045512.2?report=fasta
ptree phylogenetic tree as produced by NCBI data servers https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/precomptree
nucleotide / protein list and composition of nucleotides/proteins from the SARS-CoV-2 virus https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/
nucleotide-fasta / protein-fasta FASTA sequences files for nucleotides, proteins and coding regions https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/

Although the package attempts to provide the latest available genomic data, there are a few important details and differences with respect to the reported cases data. To begin with, the amount of genomic information available is much larger than the data reporting the number of cases, which adds some constraints when retrieving this data. In addition to that, the hosting servers for the genomic databases impose certain limits on the rate and amount of downloads.

In order to mitigate these factors, the covid19.analytics package employs a couple of different strategies, as summarized below:

  • the package will first attempt to retrieve most of the data live from NCBI databases (same as using src='livedata')

  • if that is not possible, the package keeps a local version of some of the largest datasets (i.e. genomes, nucleotides and proteins), which might not be up-to-date (same as using src='repo')

  • the package will attempt to obtain the data from a mirror server with the datasets updated on a regular basis but not necessarily with the latest updates (same as using src='local')

Analytical & Graphical Indicators

In addition to the access and retrieval of the data, the package includes some basic functions to estimate totals per region/country/city, growth rates and daily changes in the reported number of cases.

Overview of the Main Functions from the “covid19.analytics” Package

Function Description Main Type of Output
Data Acquisition
covid19.data obtain live* worldwide data for COVID-19 virus, from the JHU’s CSSE repository [1] return dataframes/list with the collected data
covid19.Toronto.data obtain live* data for COVID-19 cases in the city of Toronto, ON Canada, from the City of Toronto reports [2] –or– Open Data Toronto [3] return dataframe/list with the collected data
covid19.Canada.data obtain live* Canada specific data for COVID-19 cases, from Health Canada [4] return dataframe/list with the collected data
covid19.US.data obtain live* US specific data for COVID-19 virus, from the JHU’s CSSE repository [1] return dataframe with the collected data
covid19.vaccination obtain up-to-date COVID-19 vaccination records from [5] return dataframe/list with the collected data
covid19.testing.data obtain up-to-date COVID-19 testing records from [5] return dataframe with the testing data or testing data details
pandemics.data obtain pandemics and pandemics vaccination historical records from [6] return dataframe with the collected data
covid19.genomic.data c19.refGenome.data c19.fasta.data c19.ptree.data c19.NPs.data c19.NP_fasta.data obtain covid19’s genomic sequencing data from NCBI [5] list, with the RNA seq data in the “$NC_045512.2” entry
Data Quality Assessment
data.checks run integrity and consistency checks on a given dataset diagnostics about the dataset integrity and consistency
consistency.check run consistency checks on a given dataset diagnostics about the dataset consistency
integrity.check run integrity checks on a given dataset diagnostics about the dataset integrity
nullify.data remove inconsistent/incomplete entries in the original datasets original dataset (dataframe) without “suspicious” entries
Analysis
report.summary summarize the current situation, will download the latest data and summarize different quantities on screen table and static plots (pie and bar plots) with reported information, can also output the tables into a text file
tots.per.location compute totals per region and plot time series for that specific region/country static plots: data + models (exp/linear, Poisson, Gamma), mosaic and histograms when more than one location are selected
growth.rate compute changes and growth rates per region and plot time series for that specific region/country static plots: data + models (linear,Poisson,Exp), mosaic and histograms when more than one location are selected
single.trend mtrends visualize different indicators of the “trends” in daily changes for a single or multiple locations composite of static plots: total number of cases vs time, daily changes vs total changes in different representations
estimateRRs compute estimates for fatality and recovery rates on a rolling-window interval list with values for the estimates (mean and sd) of reported cases and recovery and fatality rates
Graphics and Visualization
totals.plt plots the total number of cases per day in static and interactive plots; the user can specify multiple locations or global totals static and interactive plot
itrends generates an interactive plot of daily changes vs total changes in a log-log plot, for the indicated regions interactive plot
live.map generates an interactive map displaying cases around the world static and interactive plot
Modelling
generate.SIR.model generates a SIR (Susceptible-Infected-Recovered) model list containing the fits for the SIR model
plt.SIR.model plot the results from the SIR model static and interactive plots
sweep.SIR.models generate multiple SIR models by varying parameters used to select the actual data list containing the values of the parameters, \(\beta, \gamma\) and \(R_0\)
Data Exploration
covid19Explorer launches a dashboard interface to explore the datasets provided by covid19.analytics web-based dashboard
Auxiliary functions
geographicalRegions determines which countries compose a given continent list of countries

API Documentation

Details and Specifications of the Analytical & Visualization Functions

Reports

The report.summary() function generates an overall report summarizing the different datasets. It can summarize the “Time Series” data (cases.to.process="TS"), the “aggregated” data (cases.to.process="AGG") or both (cases.to.process="ALL"). It will display the top 10 entries in each category, or the number indicated in the Nentries argument. To display all the records, set Nentries=0.

The function can also target specific geographical location(s) using the geo.loc argument. When a geographical location is indicated, the report will include an additional “Rel.Perc” column for the confirmed cases indicating the relative percentage among the locations indicated. Similarly the totals displayed at the end of the report will be for the selected locations.

In each case (“TS” and/or “AGG”) the report will present tables ordered by the different cases included, i.e. confirmed infected, deaths, recovered and active cases.

The dates when the report is generated and the date of the recorded data will be included at the beginning of each table.

It will also compute the totals, averages, standard deviations and percentages of various quantities:

  • it will determine the number of unique locations processed within the dataset

  • it will compute the total number of cases for each case type

  • Percentages: percentages are computed as follows:
    • for the “Confirmed” cases, as the ratio between the corresponding number of cases and the total number of cases, i.e. a sort of “global percentage” indicating the percentage of infected cases wrt the rest of the world

    • for “Confirmed” cases, when geographical locations are specified, a “Relative percentage” is given as the ratio of the confirmed cases over the total of the selected locations

    • for the other categories, “Deaths”/“Recovered”/“Active”, the percentage of a given category is computed as the ratio between the number of cases in the corresponding category divided by the “Confirmed” number of cases, i.e. a relative percentage with respect to the number of confirmed infected cases in the given region

  • For “Time Series” data:
    • it will show the delta (change or variation) for the last day, as well as the daily changes for the day before that (t-2), three days ago (t-3), a week ago (t-7), two weeks ago (t-14) and a month ago (t-30)
    • when possible, it will also display the percentage of “Recovered” and “Deaths” with respect to the “Confirmed” number of cases
    • The column “GlobalPerc” is computed as the ratio between the number of cases for a given country over the total of cases reported
    • The “Global Perc. Average (SD: standard deviation)” is computed as the average (standard deviation) of the number of cases among all the records in the data
    • The “Global Perc. Average (SD: standard deviation) in top X” is computed as the average (standard deviation) of the number of cases among the top X records

Typical structure of a report.summary() output for the Time Series data:

################################################################################ 
  ##### TS-CONFIRMED Cases  -- Data dated:  2020-04-12  ::  2020-04-13 12:02:27 
################################################################################ 
  Number of Countries/Regions reported:  185 
  Number of Cities/Provinces reported:  83 
  Unique number of geographical locations combined: 264 
-------------------------------------------------------------------------------- 
  Worldwide  ts-confirmed  Totals: 1846679 
-------------------------------------------------------------------------------- 
   Country.Region Province.State Totals GlobalPerc LastDayChange   t-2   t-3   t-7  t-14 t-30
1              US                555313      30.07         28917 29861 35098 29595 20922  548
2           Spain                166831       9.03          3804  4754  5051  5029  7846 1159
3           Italy                156363       8.47          4092  4694  3951  3599  4050 3497
4          France                132591       7.18          2937  4785  7120  5171  4376  808
5         Germany                127854       6.92          2946  2737  3990  3251  4790  910
.
.
.
-------------------------------------------------------------------------------- 
  Global Perc. Average:  0.38 (sd: 2.13) 
  Global Perc. Average in top  10 :  7.85 (sd: 8.18) 
-------------------------------------------------------------------------------- 

******************************************************************************** 
********************************  OVERALL SUMMARY******************************** 
******************************************************************************** 
  ****  Time Series TOTS **** 
     ts-confirmed    ts-deaths   ts-recovered 
     1846679          114091        421722 
                        6.18%          22.84% 
  ****  Time Series AVGS **** 
     ts-confirmed    ts-deaths   ts-recovered 
     6995            432.16     1686.89 
                         6.18%         24.12% 
  ****  Time Series SDS **** 
     ts-confirmed    ts-deaths   ts-recovered 
     39320.05        2399.5     8088.55 
                         6.1%           20.57% 

 * Statistical estimators computed considering 250 independent reported entries 
******************************************************************************** 

Typical structure of a report.summary() output for the Aggregated data:

################################################################################################################################# 
  ##### AGGREGATED Data  -- ORDERED BY  CONFIRMED Cases  -- Data dated:  2020-04-12  ::  2020-04-13 12:02:29 
################################################################################################################################# 
  Number of Countries/Regions reported: 185 
  Number of Cities/Provinces reported: 138 
  Unique number of geographical locations combined: 2989 
--------------------------------------------------------------------------------------------------------------------------------- 
                      Location Confirmed Perc.Confirmed Deaths Perc.Deaths Recovered Perc.Recovered Active Perc.Active
1                        Spain    166831           9.03  17209       10.32     62391          37.40  87231       52.29
2                        Italy    156363           8.47  19899       12.73     34211          21.88 102253       65.39
3                       France    132591           7.18  14393       10.86     27186          20.50  91012       68.64
4                      Germany    127854           6.92   3022        2.36     60300          47.16  64532       50.47
5  New York City, New York, US    103208           5.59   6898        6.68         0           0.00  96310       93.32
.
.
.
=================================================================================================================================
     Confirmed   Deaths   Recovered     Active 
  Totals 
     1846680     114090   421722        1310868 
  Average 
     617.83     38.17.      141.09      438.56 
  Standard Deviation 
     6426.31       613.69     2381.22     4272.19 
  
 * Statistical estimators computed considering 2989 independent reported entries

In both cases an overall summary of the reported cases is presented at the end, displaying totals, averages and standard deviations of the computed quantities.
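The percentage definitions given earlier in this section can be checked by hand against the sample Time Series output. The base R sketch below reuses figures printed in that sample (the confirmed counts for the top three countries and the worldwide totals); the variable names are introduced here for illustration only.

```r
# Figures taken from the sample Time Series report shown in this section
confirmed       <- c(US = 555313, Spain = 166831, Italy = 156363)
world.confirmed <- 1846679
world.deaths    <- 114091
world.recovered <- 421722

# "GlobalPerc": cases for a location over the total number of cases reported
global.perc <- round(100 * confirmed / world.confirmed, 2)
print(global.perc)   # 30.07, 9.03, 8.47, matching the GlobalPerc column

# "Relative percentage": cases over the total of the selected locations only
rel.perc <- round(100 * confirmed / sum(confirmed), 2)
print(rel.perc)

# Other categories: ratio with respect to the "Confirmed" number of cases
deaths.perc    <- round(100 * world.deaths / world.confirmed, 2)     # 6.18
recovered.perc <- round(100 * world.recovered / world.confirmed, 2)  # 22.84
print(c(deaths = deaths.perc, recovered = recovered.perc))
```

The values 6.18 and 22.84 agree with the percentages printed under “Time Series TOTS” in the overall summary.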

A full example of this report for today can be seen here (updated twice a day, daily).

In addition to this, the function will also generate some graphical outputs, including pie and bar charts representing the top regions in each category.

Totals per Location & Growth Rate

It is possible to dive deeper into a particular location by using the tots.per.location() and growth.rate() functions. These functions are capable of processing different types of data, as long as these are “Time Series” data. They can either focus on one category (e.g. “TS-confirmed”, “TS-recovered”, “TS-deaths”) or on all of them (“TS-all”). When these functions detect different types of categories, each category will be processed separately. Similarly, the functions can take multiple locations, i.e. just one, several or even “all” the locations within the data. The locations can be countries, regions, provinces or cities. If a specified location includes multiple entries, e.g. a country that has several cities reported, the functions will group them and process all these regions as the location requested.

Totals per Location

This function will plot the number of cases as a function of time for the given locations and type of categories, in two plots: a log-scale scatter plot and a linear-scale bar plot.

When the function is run with multiple locations or all the locations, the figures will be adjusted to display multiple plots in one figure in a mosaic type layout.

Additionally, the function will attempt to generate different fits to match the data:

  • an exponential model using a Linear Regression method

  • a Poisson model using a Generalized Linear Regression method

  • a Gamma model using a Generalized Linear Regression method

The function will plot the models, add the values of their coefficients to the plots and display a summary of the results on screen.
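For readers unfamiliar with these three kinds of fits, the following base R sketch applies them to synthetic data with a known growth rate. It only illustrates the modelling approach with lm() and glm(); the data, the variable names and the choice of a log link for the Gamma model are assumptions, and the code is not taken from tots.per.location().

```r
# Synthetic counts growing roughly exponentially at 0.2 per day (illustration only)
day <- 1:30
y   <- round(10 * exp(0.2 * day))

# Exponential model: linear regression on the log-transformed counts
fit.exp <- lm(log(y) ~ day)

# Poisson and Gamma models: generalized linear models with a log link
fit.pois  <- glm(y ~ day, family = poisson(link = "log"))
fit.gamma <- glm(y ~ day, family = Gamma(link = "log"))

# Estimated growth rate (slope on the log scale) from each model
rates <- c(exp     = unname(coef(fit.exp)["day"]),
           poisson = unname(coef(fit.pois)["day"]),
           gamma   = unname(coef(fit.gamma)["day"]))
print(round(rates, 3))
```

On clean synthetic data all three slopes land close to the rate used to generate it; on real reported data the models can differ noticeably, which is one reason to compare several fits.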

It is possible to instruct the function to draw a “confidence band” based on a moving average, so that the trend is displayed together with a region of higher confidence. The band is based on the mean value and standard deviation computed over a time interval obtained by dividing the total time range into 10 equally spaced intervals.
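One way to picture such a band is sketched below in base R: a moving mean plus and minus a moving standard deviation, with the window length set to one tenth of the series length. Whether the package uses a trailing or a centred window is not stated here, so the trailing window, the synthetic series and all variable names are assumptions.

```r
# Synthetic cumulative series (made-up values, illustration only)
totals <- cumsum(c(1, 2, 2, 3, 5, 8, 9, 12, 15, 20,
                   22, 30, 35, 41, 50, 58, 66, 80, 95, 110))

# Window length: the total time range divided into 10 equal intervals
n      <- length(totals)
window <- max(2, ceiling(n / 10))

# Trailing moving mean and standard deviation over each window
idx     <- seq(window, n)
mov.avg <- sapply(idx, function(i) mean(totals[(i - window + 1):i]))
mov.sd  <- sapply(idx, function(i) sd(totals[(i - window + 1):i]))

# Band limits: mean value plus/minus one standard deviation
band <- data.frame(day = idx,
                   lower = mov.avg - mov.sd,
                   mean  = mov.avg,
                   upper = mov.avg + mov.sd)
head(band)
```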

The function will return a list combining the results for the totals for the different locations as a function of time.

Growth Rate

The growth.rate() function computes daily changes and the growth rate, defined as the ratio of the daily changes between two consecutive dates.

The growth.rate() function shares all the features of the tots.per.location() function, i.e. it can process the different types of cases and multiple locations.

The graphical output will display two plots per location:

  • a scatter plot with the number of changes between consecutive dates as a function of time, both in linear scale (left vertical axis) and log-scale (right vertical axis) combined

  • a bar plot displaying the growth rate for the particular region as a function of time.

When the function is run with multiple locations or all the locations, the figures will be adjusted to display multiple plots in one figure in a mosaic type layout. In addition to that, when there is more than one location the function will also generate two different styles of heatmaps comparing the changes per day and growth rate among the different locations (vertical axis) and time (horizontal axis).

The function will return a list combining the results for the “changes per day” and the “growth rate” as a function of time.
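The two quantities defined in this section can be reproduced by hand in base R. The sketch below uses a made-up cumulative series that doubles every day; it illustrates the definitions only and does not call growth.rate().

```r
# Made-up cumulative totals for one location (illustration only)
totals <- c(1, 2, 4, 8, 16)

# Daily changes: difference between consecutive dates
changes <- diff(totals)

# Growth rate: ratio of the daily changes between two consecutive dates
growth <- changes[-1] / changes[-length(changes)]

print(changes)  # 1 2 4 8
print(growth)   # 2 2 2
```

When a daily change is zero, the ratio for the following day is not finite (Inf or NaN in R), so real data usually needs some care at that step.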

Plotting Totals

The function totals.plt() will generate plots of the total number of cases as a function of time. It can be used for the total data or for one or multiple specific locations. The function can generate static and/or interactive plots, as well as linear and/or semi-log plots.

Plotting Cases in the World

The function live.map() will display the different cases in each corresponding location all around the world in an interactive map. It can be used with time series data or aggregated data; aggregated data offers much more detailed information about the geographical distribution.

Experimental: Modelling the evolution of the Virus spread

We are working on the development of modelling capabilities. A preliminary prototype has been included and can be accessed using the generate.SIR.model function, which implements a simple SIR (Susceptible-Infected-Recovered) ODE model using the actual data of the virus.

This function will try to identify the data points where the onset of the epidemic began and consider the following data points to generate a proper guess for the two parameters describing the SIR ODE system. After that, it will solve the different equations and provide details about the solutions, as well as plot them in static and interactive plots.
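As a rough illustration of the kind of ODE system involved, the sketch below integrates the standard SIR equations with a fixed-step Euler scheme in base R. The function name sir.euler() and all parameter values are hypothetical, and this is not the implementation behind generate.SIR.model(), which also estimates the two parameters from the reported data.

```r
# SIR equations: dS/dt = -beta*S*I/N, dI/dt = beta*S*I/N - gamma*I, dR/dt = gamma*I
sir.euler <- function(beta, gamma, N, I0, days, dt = 0.1) {
  steps <- round(days / dt)
  S <- numeric(steps + 1)
  I <- numeric(steps + 1)
  R <- numeric(steps + 1)
  S[1] <- N - I0
  I[1] <- I0
  R[1] <- 0
  for (k in seq_len(steps)) {
    new.inf <- beta * S[k] * I[k] / N * dt   # newly infected in this step
    new.rec <- gamma * I[k] * dt             # newly recovered in this step
    S[k + 1] <- S[k] - new.inf
    I[k + 1] <- I[k] + new.inf - new.rec
    R[k + 1] <- R[k] + new.rec
  }
  data.frame(time = (0:steps) * dt, S = S, I = I, R = R)
}

# Made-up parameters (assumptions): R0 = beta/gamma = 2.5
beta  <- 0.5
gamma <- 0.2
sol <- sir.euler(beta, gamma, N = 1e6, I0 = 10, days = 120)
R0  <- beta / gamma
print(R0)
print(tail(sol, 1))
```

A fixed-step Euler scheme keeps the example dependency-free; a dedicated ODE solver would be the more accurate choice for real analyses.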

Sweeping models…

To explore the parameter space of the SIR model, it is possible to produce a series of models by varying the conditions, i.e. the range of dates considered for optimizing the parameters of the SIR equation, which will effectively sweep a range for the parameters \(\beta, \gamma\) and \(R_0\). This is implemented in the function sweep.SIR.models(), which takes a range of dates to be used as starting points for the number of cases fed into generate.SIR.model(), producing as many models as there are ranges of dates indicated. One could even use this in combination with other resampling or Monte Carlo techniques to estimate the statistical variability of the parameters from the model.

Further Features

We will continue working on adding and developing new features to the package, in particular modelling and predictive capabilities.

Please contact us if you think of a functionality or feature that could be useful to add.

Installation

To use the “covid19.analytics” package, you will first need to install it.

The stable version can be downloaded from the CRAN repository:

install.packages("covid19.analytics")

To obtain the development version you can get it from the github repository, i.e.

# need devtools for installing from the github repo
install.packages("devtools")

# install covid19.analytics from github
devtools::install_github("mponce0/covid19.analytics")

For using the package, either the stable or development version, just load it using the library function:

# load "covid19.analytics"
library(covid19.analytics)

Examples

In this section, we include basic examples of the main features of the covid19.analytics package.

Reading data

# obtain all the records combined for "confirmed", "deaths" and "recovered" cases -- *aggregated* data
 covid19.data.ALLcases <- covid19.data()

# obtain time series data for "confirmed" cases
 covid19.confirmed.cases <- covid19.data("ts-confirmed")

# reads all possible datasets, returning a list
 covid19.all.datasets <- covid19.data("ALL")

# reads the latest aggregated data
 covid19.ALL.agg.cases <- covid19.data("aggregated")

# reads time series data for casualties
 covid19.TS.deaths <- covid19.data("ts-deaths")

# reads testing data
 testing.data <- covid19.testing.data()

Read covid19’s genomic data

# obtain covid19's genomic data
 covid19.gen.seq <- covid19.genomic.data()

# display the actual RNA seq
 covid19.gen.seq$NC_045512.2

Obtaining Pandemics data

# Pandemic historical records
 pnds <- pandemics.data(tgt="pandemics")

# Pandemics vaccines development times
 pnds.vacs <- pandemics.data(tgt="pandemics_vaccines")

Some basic analysis

Summary Report

# a quick function to overview top cases per region for time series and aggregated records
report.summary()

# save the tables into a text file named 'covid19-SummaryReport_CURRENTDATE.txt'
# where *CURRENTDATE* is the actual date
report.summary(saveReport=TRUE)

E.g. today’s report is available here

# summary report for a specific location with default number of entries
report.summary(geo.loc="Canada")

# summary report for a specific location with top 5
report.summary(Nentries=5, geo.loc="Canada")

# it can combine several locations
report.summary(Nentries=30, geo.loc=c("Canada","US","Italy","Uruguay","Argentina"))

Totals per Country/Region/Province

# totals for confirmed cases for "Ontario"
tots.per.location(covid19.confirmed.cases,geo.loc="Ontario")

# total for confirmed cases for "Canada"
tots.per.location(covid19.confirmed.cases,geo.loc="Canada")

# total nbr of deaths for "China"
tots.per.location(covid19.TS.deaths,geo.loc="China")

# total nbr of confirmed cases in Hubei including a confidence band based on moving average
tots.per.location(covid19.confirmed.cases,geo.loc="Hubei", confBnd=TRUE)