For Week 50 of the R4datascience project, I used the provided diseases data mentioned in this SimplyStats blog post. The dataset contains state level data of vaccine preventable infectious diseases over the span of ~70 years.
Packages
#packages
library(tidyverse)
library(gganimate)
library(ggplot2)
library(grid)
library(sf)
library(usmap)
library(scales)
library(Hmisc)
Read in the diseases file
diseases<-read.csv(file="~/GitHub/TidyTuesday/data/2019/2019-12-10/diseases.csv")
View the dataset
diseases %>%dim()
## [1] 18870 6
The diseases dataset contains 18870 rows and 6 columns
diseases %>% colnames()
## [1] "disease" "state" "year" "weeks_reporting"
## [5] "count" "population"
The diseases dataset contains columns for: disease, state, year, weeks_reporting, count, state population
head(diseases)
## disease state year weeks_reporting count population
## 1 Hepatitis A Alabama 1966 50 321 3345787
## 2 Hepatitis A Alabama 1967 49 291 3364130
## 3 Hepatitis A Alabama 1968 52 314 3386068
## 4 Hepatitis A Alabama 1969 49 380 3412450
## 5 Hepatitis A Alabama 1970 51 413 3444165
## 6 Hepatitis A Alabama 1971 51 378 3481798
The first 6 rows of the diseases dataset
diseases %>% summary()
## disease state year weeks_reporting
## Hepatitis A:2346 Alabama : 370 Min. :1928 Min. : 0.00
## Measles :3876 Alaska : 370 1st Qu.:1956 1st Qu.:14.00
## Mumps :1836 Arizona : 370 Median :1977 Median :44.00
## Pertussis :3774 Arkansas : 370 Mean :1974 Mean :33.28
## Polio :3774 California: 370 3rd Qu.:1992 3rd Qu.:50.00
## Rubella :1938 Colorado : 370 Max. :2011 Max. :52.00
## Smallpox :1326 (Other) :16650
## count population
## Min. : 0.0 Min. : 86853
## 1st Qu.: 1.0 1st Qu.: 1046542
## Median : 47.0 Median : 2824918
## Mean : 1367.5 Mean : 4242911
## 3rd Qu.: 440.8 3rd Qu.: 5153640
## Max. :132342.0 Max. :37607525
## NA's :204
unique (diseases$disease)
## [1] Hepatitis A Measles Mumps Pertussis Polio Rubella
## [7] Smallpox
## Levels: Hepatitis A Measles Mumps Pertussis Polio Rubella Smallpox
The diseases listed in the diseases dataset
diseases %>% select("count") %>% by(diseases$disease,sum)
## diseases$disease: Hepatitis A
## [1] 976362
## ------------------------------------------------------------
## diseases$disease: Measles
## [1] 18670996
## ------------------------------------------------------------
## diseases$disease: Mumps
## [1] 832110
## ------------------------------------------------------------
## diseases$disease: Pertussis
## [1] 2329020
## ------------------------------------------------------------
## diseases$disease: Polio
## [1] 2329020
## ------------------------------------------------------------
## diseases$disease: Rubella
## [1] 434966
## ------------------------------------------------------------
## diseases$disease: Smallpox
## [1] 232828
I summarized the cases of each disease by counts to get an idea of which one had the most cases in all of US over the span of the 76 years in the dataset. As anticipated, measles had the greatest number, with 18,670,996 cases in the United States from 1928 to 2003.
Calculate incidence rate
To compare the number of cases between states, I mutated the dataset to add an incidence rate that takes into account the case count over state population by 100,000 people for each state per year.
#Calculate incidence
diseases<- diseases %>%
mutate(incidence=(count/population*100000*(ifelse(weeks_reporting==0,0,52/weeks_reporting))))
Filter measles from the dataset
Because measles had the most cases of all diseases in the dataset, I decided to filter the dataset and explore measles for my visualization
#filter by measles/summary
diseases %>% filter(disease=="Measles") %>% summary()
## disease state year weeks_reporting
## Hepatitis A: 0 Alabama : 76 Min. :1928 Min. : 0.00
## Measles :3876 Alaska : 76 1st Qu.:1947 1st Qu.:32.00
## Mumps : 0 Arizona : 76 Median :1966 Median :47.00
## Pertussis : 0 Arkansas : 76 Mean :1966 Mean :37.45
## Polio : 0 California: 76 3rd Qu.:1984 3rd Qu.:50.00
## Rubella : 0 Colorado : 76 Max. :2003 Max. :52.00
## Smallpox : 0 (Other) :3420
## count population incidence
## Min. : 0 Min. : 86853 Min. : 0.0000
## 1st Qu.: 8 1st Qu.: 966309 1st Qu.: 0.3458
## Median : 543 Median : 2590843 Median : 27.7595
## Mean : 4817 Mean : 3858493 Mean : 166.6918
## 3rd Qu.: 3986 3rd Qu.: 4696902 3rd Qu.: 226.1823
## Max. :132342 Max. :34861711 Max. :2964.4269
## NA's :64 NA's :64
#create a measles dataset
measles<-diseases %>% filter(disease=="Measles")
measles<-measles %>% mutate (incidence_rounded = round(incidence, digits=1))
Since the dataset has data for each state by year, I decided it would be ideal to use a chloropleth map with gganimate to show how the incidence rate varies by each state over the years. The usmaps package provides one of many options to display state level data on a US map, which can also be done using the spData package, however with additional steps to include Alaska and Hawaii.
Create a FIPS column using fips() from usmaps package
The usmaps package has a fips() function that outputs the FIPs code for a state name string input. To have the state level data plotted onto the map, a FIPs column corresponding to each state needs to be created.
#Create a FIPS column using the fips() function from usmap
measles$fips<-fips(measles$state)
Create a new column to classify year by vaccine availability-will be used for subtitle in ggplot
There is another important piece of information to account for in the visualization to make sense of the changes in trends: vaccine introduction. The measles vaccine was introduced in 1963 in the United States with a second dose recomended by 1989. To show this on the map, I created a factor variable corresponding each year to three phases of vaccine availability: the time frame when there was no vaccine, the time frame when the vaccine was introduced and the years onwards when the second dose was reccomended. This column will be used as a the subtitle on the map to demonstrate the trends relative to vaccine availability over the course of the years.
#Create a column to classify year by vaccine availability
measles$vaccine<-ifelse((measles$year>=1963) & (measles$year<1989),2,
ifelse((measles$year>=1989),3,
ifelse((measles$year<1963), 1,1)))
measles$vaccine<-recode(measles$vaccine, '1'="",
'2'="Vaccine introduced",
'3'="Second dose recommended")
Creating quantiles for gradient breaks
measles %>% select(incidence) %>% summary()
## incidence
## Min. : 0.0000
## 1st Qu.: 0.3458
## Median : 27.7595
## Mean : 166.6918
## 3rd Qu.: 226.1823
## Max. :2964.4269
## NA's :64
If you take a look at the summary stats for the incidence in the measles data, you can see how variability between the rates would unevenly polarize rates above and below the third quartile, which would obscure interpretation of the rates on a gradient scale. Instead of using the default scale in scale_color_gradient() in the ggplot2 package which creates breaks by evenly spacing the intervals, quartiles are a better quantification of breaks in the dataset. a breaks vector is created to be read into the’breaks’ argument of the scale_color_gradient() function in the plot.
#Quantile breaks for gradient scale
breaks<- quantile(measles$incidence_rounded, probs=seq(0,1,.25), na.rm=TRUE) %>% unname() %>% round(digits=1)
Display
A few other things I added here in my code is a vector for a manual gradient color scheme and the resolution for the plot .gif file.
#Color palette
gradient<-c("#B1C055","#6CC682","#39C2B6","#6BB4D7","#B69DD1","#E685A8","#ED7F72")
#Set these image quality options
options(gganimate.dev_args = list(width = 6, height = 4, units = 'in', res=300))
Create the plot
#Create the plot
measles_plot<-plot_usmap(data=measles, color="#262626", size=.3, values="incidence_rounded")+
theme_void()+
scale_fill_gradientn(colors = gradient, trans="pseudo_log",
#The pseudo_log allows for log transformation even though 0 is in the dataset
na.value="grey",limits=c(min(breaks), max(breaks)),breaks=breaks[c(1,3:5)], labels=breaks[c(1,3:5)])+
#only including min, 50th, 75th and max values
guides(fill = guide_colorbar(title="",
frame.colour = "black",
label.position="top",
barwidth = 8,
barheight = 1,
ticks=FALSE,
keywidth=15,
label.hjust = 0.5,
label.vjust = 0.3,
label.theme = element_text(angle = 45, size=10)))+
labs(title = "Measles Incidence per 100,000 in {frame_time}",
subtitle="{unique(measles$vaccine[measles$year==(frame_time)])}")+
#Allows for vaccine column to display as a subtitle relative to plot animation
theme(legend.position="bottom", legend.justification=c(.8,0),
plot.title=element_text(face="bold", size=14, color="#262626",hjust=.5),
plot.subtitle=element_text(hjust=.5))+
transition_time(year)
Save
anim<-animate(measles_plot, nframes=76, fps=1)
#76 frames because 76 years
anim_save("measlesmap.gif", anim)
#save the map
The Map


