This document will walk you through how to create a plot similar to the one that you see below. It will use the data from Johns Hopkins University to loosely replicate a plot that shows the outbreak of COVID-19 for some countries, and for four provinces in Canada.
Here is the plot that we will try and replicate:
Financial Times Plot of Deaths on a Log Scale
Then, in Week 2, you will be able to create your own plots to show the outbreak in an area that you’re interested in.
In order to work with R, you’ll need access to many different packages that allow the user to use different tools. Each of these packages below has a function. Most importantly, the tidyverse package allows the user to import, tidy, transform, and visualize data using commands that are relatively intuitive.
The other packages each perform some function in this module.
library(tidyverse)
library(janitor)
library(lubridate)
library(tidymodels)
library(ggrepel)
library(knitr)
I have set the theme to light to avoid the gray background in graphics.
theme_set(theme_light())
We will load the data using the data source of the following website Johns Hopkins CSSE Map
In order to get the data, we will use a link to the Johns Hopkins CSSE data in GitHub. The following command will load the data for us using the read_csv
command.
covid_cases <- read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv")
Once you run this line of code, you should see the covid_cases
object in the Environment pane. If you click the object in the environment pane, you can view the data in a tabel format in the Source pane in the upper left.
You’ll notice that when we view the file, it is not tidy data because each day is listed as a column rather than as an observation. We can inspect the column names with the command below.
colnames(covid_cases)[5:ncol(covid_cases)]
It’s also important to see that the dates are not in a nice format. We would pefer to use YYYY-MM-DD as our date input, so we will use the following command to rename the columns with the dates formatted correctly.
old_names <- colnames(covid_cases)[5:ncol(covid_cases)]
new_names <- parse_date(colnames(covid_cases)[5:ncol(covid_cases)], "%m/%d/%y")
covid_cases<- covid_cases %>%
rename_at(vars(old_names), function(x) new_names)
Then we will reshape the data into a long tidy format using the pivot_longer()
command. Now for each county observation (row), we will move the dates under the country. We will call this new dataset covid_cases_tidy
. You can inspect it in the environment to see how it looks different from the covid_cases
that we input from GitHub.
covid_cases_tidy <- covid_cases %>%
pivot_longer(
cols = -`Province/State`:-Long,
names_to = "date",
values_to = "cases"
) %>% as_tibble() %>%
mutate(date = as.Date(date)) %>%
clean_names()
For the particular plot we’re trying to make, we only want to use the data for the Provinces in Canada. So we need to filter the data to look only at what we’re interested in. Once again, we can see the data by pressing on the covid_cases_tidy
object in the environment pane in the upper right.
Let’s look at the data of the top 10 most affected countries in the world and find that data using some commands that we can find in the dplyr
package, which is part of the tidyverse. Since we know the data is tidy, that being each row represents a country’s day of total cases, we can use use some statistical analysis to pull some information out of the data
Let’s see what are most affected countries are in the world are, and how many cases they have. We will use the pipe operator %>%
to use the previous output at the input in the next line of code.
country_region | cases |
---|---|
US | 1417774 |
Russia | 252245 |
United Kingdom | 233151 |
Spain | 229540 |
Italy | 223096 |
Brazil | 203165 |
France | 176712 |
Germany | 174478 |
Turkey | 144749 |
Iran | 114533 |
India | 81997 |
Peru | 80604 |
China | 68134 |
Belgium | 54288 |
Saudi Arabia | 46869 |
Netherlands | 43481 |
Mexico | 42595 |
Canada | 40732 |
Chile | 37040 |
Pakistan | 35788 |
Now that we have this data output, let’s make our first plot of the data. This will be a bar graph that will show the twenty most-affected countries in the world. It will essentially take the output of the last code and use it to make a plot.
Some things to keep in mind. The output is a tibble
which is a type of data frame that R likes work with. The data was initially 30324 by 6 tibble and it’s been converted into a 20 by 2 tibble.
The following plot, take the data from the chart and creates a plot to show the countries that are most affected by the COVID-19 outbreak.
Now that we have shown the countries most affected by COVID-19 in a summary plot, let’s look at how some of this data in a time-series plot. In general, the growth of COVID-19 is an exponetial growth curve. For instance, we can look at some of the most affected countries in the world and see how their growth curves look. Let’s choose, the US, Spain, Italy, and the UK.
We can view this data using the filter command, and then plotting it. One of the nice features in R, is that we are still using the basic dataset that we started with rather than needing to change it a bunch and create new sheets like we might need to in Excel.
In this plot, we can see the different growth rates of the countries, but it’s difficult to determine where on the growth curve these countries are, so it’s helpful to make a plot by the number of days since a certain number of cases.
The plots below replicate many of the features of the financial times plot. Using RStudio.Cloud and by following along, you will be able to replicate the main features of the plot.