This document will walk you through how to create a plot similar to the one that you see below. It will use the data from Johns Hopkins University to loosely replicate a plot that shows the outbreak of COVID-19 for some countries, and for four provinces in Canada.

Then, in Week 2, you will be able to create your own plots to show the outbreak in an area that you’re interested in.

Creating a Plot in R

1. Load Libraries and Set a Theme

In order to work with R, you’ll need access to many different packages that allow the user to use different tools. Each of these packages below has a function. Most importantly, the tidyverse package allows the user to import, tidy, transform, and visualize data using commands that are relatively intuitive.

The other packages each perform some function in this module.

library(tidyverse)
library(janitor)
library(lubridate)
library(tidymodels)
library(ggrepel)
library(knitr)

I have set the theme to light to avoid the gray background in graphics.

theme_set(theme_light())

2. Load the Data

We will load the data using the data source of the following website Johns Hopkins CSSE Map

In order to get the data, we will use a link to the Johns Hopkins CSSE data in GitHub. The following command will load the data for us using the read_csv command.

covid_cases <- read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv")

Once you run this line of code, you should see the covid_cases object in the Environment pane. If you click the object in the environment pane, you can view the data in a tabel format in the Source pane in the upper left.

3. Tidy and Transform the Data to Make a Plot

A. Clean column Names

You’ll notice that when we view the file, it is not tidy data because each day is listed as a column rather than as an observation. We can inspect the column names with the command below.

colnames(covid_cases)[5:ncol(covid_cases)]

It’s also important to see that the dates are not in a nice format. We would pefer to use YYYY-MM-DD as our date input, so we will use the following command to rename the columns with the dates formatted correctly.

old_names <- colnames(covid_cases)[5:ncol(covid_cases)]
new_names <- parse_date(colnames(covid_cases)[5:ncol(covid_cases)], "%m/%d/%y")

covid_cases<- covid_cases %>% 
  rename_at(vars(old_names), function(x) new_names)

Then we will reshape the data into a long tidy format using the pivot_longer() command. Now for each county observation (row), we will move the dates under the country. We will call this new dataset covid_cases_tidy. You can inspect it in the environment to see how it looks different from the covid_cases that we input from GitHub.

covid_cases_tidy <- covid_cases %>% 
  pivot_longer(
    cols = -`Province/State`:-Long,  
    names_to = "date", 
    values_to = "cases"
  ) %>% as_tibble() %>% 
  mutate(date = as.Date(date)) %>% 
  clean_names()

B. Transform and Summarise the Data Using the Pipe Operator

For the particular plot we’re trying to make, we only want to use the data for the Provinces in Canada. So we need to filter the data to look only at what we’re interested in. Once again, we can see the data by pressing on the covid_cases_tidy object in the environment pane in the upper right.

Let’s look at the data of the top 10 most affected countries in the world and find that data using some commands that we can find in the dplyr package, which is part of the tidyverse. Since we know the data is tidy, that being each row represents a country’s day of total cases, we can use use some statistical analysis to pull some information out of the data

Let’s see what are most affected countries are in the world are, and how many cases they have. We will use the pipe operator %>% to use the previous output at the input in the next line of code.

country_region	cases
US	1417774
Russia	252245
United Kingdom	233151
Spain	229540
Italy	223096
Brazil	203165
France	176712
Germany	174478
Turkey	144749
Iran	114533
India	81997
Peru	80604
China	68134
Belgium	54288
Saudi Arabia	46869
Netherlands	43481
Mexico	42595
Canada	40732
Chile	37040
Pakistan	35788

Create a Summary Plot

Now that we have this data output, let’s make our first plot of the data. This will be a bar graph that will show the twenty most-affected countries in the world. It will essentially take the output of the last code and use it to make a plot.

Some things to keep in mind. The output is a tibble which is a type of data frame that R likes work with. The data was initially 30324 by 6 tibble and it’s been converted into a 20 by 2 tibble.

The following plot, take the data from the chart and creates a plot to show the countries that are most affected by the COVID-19 outbreak.

COVID-19 Outbreak In A Plot

Now that we have shown the countries most affected by COVID-19 in a summary plot, let’s look at how some of this data in a time-series plot. In general, the growth of COVID-19 is an exponetial growth curve. For instance, we can look at some of the most affected countries in the world and see how their growth curves look. Let’s choose, the US, Spain, Italy, and the UK.

We can view this data using the filter command, and then plotting it. One of the nice features in R, is that we are still using the basic dataset that we started with rather than needing to change it a bunch and create new sheets like we might need to in Excel.

In this plot, we can see the different growth rates of the countries, but it’s difficult to determine where on the growth curve these countries are, so it’s helpful to make a plot by the number of days since a certain number of cases.

COVID-19 Plots

The plots below replicate many of the features of the financial times plot. Using RStudio.Cloud and by following along, you will be able to replicate the main features of the plot.

Week01-Intro

Chris Papalia

4/9/2020