I’ve been looking for simple and reliable access to public health data
for a while now. Eventually, I bumped into the WHO package.
This package allows downloading data directly from the World Health Organization’s Global Health Observatory in a dynamic and reproducible way. The data are accessible once you have installed the package from either CRAN or GitHub.
    # From CRAN
    install.packages("WHO")

    # From GitHub
    library(devtools)
    install_github("expersso/WHO")
Nothing fancy, and the rest is pretty straightforward too: only two functions are necessary.
get_codes, to return a dataframe with series, codes, and descriptions for all available series
get_data, to retrieve the data from the internet and directly create your dataframe
This post was first published on my previous website under the excerpt “Everything you’ll ever need to analyze public health data, well almost” and has been slightly revamped for this new Hugo blog.
How to retrieve series from the WHO
    # Load the package, then retrieve series, codes, and descriptions
    library(WHO)
    codes <- get_codes()
    str(codes)
    ## Classes 'tbl_df' and 'data.frame': 2789 obs. of 3 variables:
    ##  $ label  : chr "MDG_0000000001" "MDG_0000000003" "MDG_0000000005" "MDG_0000000006" ...
    ##  $ display: chr "Infant mortality rate (probability of dying between birth and age 1 per 1000 live births)" "Adolescent birth rate (per 1000 women aged 15-19 years)" "Contraceptive prevalence (%)" "Unmet need for family planning (%)" ...
    ##  $ url    : chr "http://apps.who.int/gho/indicatorregistry/App_Main/view_indicator.aspx?iid=1" "http://apps.who.int/gho/indicatorregistry/App_Main/view_indicator.aspx?iid=4669" "http://apps.who.int/gho/indicatorregistry/App_Main/view_indicator.aspx?iid=5" "http://apps.who.int/gho/indicatorregistry/App_Main/view_indicator.aspx?iid=6" ...
So far, there are 2789 series available, all of which are easily retrieved
with the get_data function. But first we need to pick the series of
interest. Let’s say we want to analyze breast cancer data and search for
it among the series with a regular expression.
    # Search for series about breast cancer
    breastCancer <- codes[grep("[Bb]reast [Cc]ancer", codes$display), ]
    breastCancer$display
    ## [1] "Age-standardized DALYs, breast cancer, per 100,000"
    ## [2] "Age-standardized death rates, breast cancer, per 100,000"
    ## [3] "General availability of breast cancer screening (by palpation or mammogram) at the primary health care level"
    ## [4] "Existence of national screening program for breast cancer"
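The same kind of search can be written without spelling out both letter cases, by passing ignore.case = TRUE to grepl. Here is a minimal sketch on a tiny hand-made stand-in for the codes table; the labels and rows in toy_codes are invented for illustration and are not actual WHO series.

```r
# Toy stand-in for the table returned by get_codes(); rows are invented
toy_codes <- data.frame(
  label   = c("TOY_001", "TOY_002", "TOY_003"),
  display = c("Infant mortality rate",
              "Age-standardized death rates, breast cancer, per 100,000",
              "Existence of national screening program for Breast Cancer"),
  stringsAsFactors = FALSE
)

# Case-insensitive match on the description column
is_breast <- grepl("breast cancer", toy_codes$display, ignore.case = TRUE)
toy_hits <- toy_codes[is_breast, ]
toy_hits$label
```

Using grepl instead of grep gives a logical vector, which is handy if you later want to combine several search conditions.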
So we have age-standardized disability-adjusted life years (DALYs),
age-standardized death rates, and general availability of breast cancer
screening at the primary health care level there. Certainly enough to
run some test analyses. Okay, let’s first fetch the data with the get_data function.
    # Retrieve the dataframes from the internet, one series per call
    DALYs <- get_data(breastCancer$label[1])
    deathRates <- get_data(breastCancer$label[2])
    cancerScreening <- get_data(breastCancer$label[3])
Maybe you are relatively new to R. If you recently installed it on
your computer and didn’t have time to explore CRAN, you might want
to run the following code to ensure you have all the required packages
installed. All of them are very useful anyway: you won’t regret it!
    # Required packages from CRAN
    .pkgs <- c("dplyr", "ggplot2", "RColorBrewer")

    # Install the required packages that are not installed yet
    .inst <- .pkgs %in% rownames(installed.packages())
    if (any(!.inst)) install.packages(.pkgs[!.inst])
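A variation on the same idea, as a sketch: requireNamespace() reports whether a package can be loaded without attaching it, so it can be used to list what is missing. The helper name missing_pkgs is my own and not part of any package, and the install call is left commented out so that the snippet has no side effects.

```r
# Return the entries of `pkgs` that cannot be loaded (hypothetical helper)
missing_pkgs <- function(pkgs) {
  ok <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!ok]
}

to_install <- missing_pkgs(c("dplyr", "ggplot2", "RColorBrewer"))

# Only report (and optionally install) when something is actually missing
if (length(to_install) > 0) {
  message("Missing: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install
}
```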
After that, be sure to load them all.
    # Load required packages
    library(dplyr)
    library(ggplot2)
    library(RColorBrewer)
Let’s create our dataframe
After loading the data, we surely want to combine our three dataframes. Male breast cancers are relatively rare, only about 1% of all breast cancers, and are usually diagnosed at a more advanced stage. Therefore, I chose to filter them out and keep only the female breast cancer data here.
    # Filter, combine and group together
    # Note: data.frame() binds the three results side by side by row position,
    # so this relies on each piece listing the same countries in the same order
    df <- data.frame(
      deathRates %>%
        filter(sex == "Female") %>%
        group_by(year, country) %>%
        summarise(region, value),
      DALYs %>%
        filter(sex == "Female") %>%
        group_by(year, country) %>%
        summarise(value),
      cancerScreening %>%
        filter(country %in% DALYs$country) %>%
        group_by(year, country) %>%
        summarise(value)
    )
There is some redundancy in the country and year columns that needs to be removed. A simple way to do that is to use a regular expression again. Once you’ve selected the redundant columns, it becomes easy to clean the dataframe.
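    # Search for and remove the redundant columns, i.e. the suffixed
    # copies of year and country (year.1, country.1, year.2, country.2)
    sel.cl <- grep("^(year|country)\\.[0-9]+$", colnames(df), invert = TRUE)
    df <- df[, sel.cl]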
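To see what such a pattern keeps and drops, it can help to try it on the column names alone. The sketch below assumes the names that data.frame() would produce for three tables sharing year, country, and value columns (duplicated names get a .1, .2 suffix); the pattern shown is one possible way to target only the suffixed year and country columns.

```r
# Column names as data.frame() would de-duplicate them (assumed layout)
cols <- c("year", "country", "region", "value",
          "year.1", "country.1", "value.1",
          "year.2", "country.2", "value.2")

# Match only the suffixed copies of year and country
dup_pattern <- "^(year|country)\\.[0-9]+$"

# invert = TRUE keeps the names that do NOT match; value = TRUE returns names
keep <- grep(dup_pattern, cols, invert = TRUE, value = TRUE)
keep
```

Anchoring the pattern with ^ and $ avoids accidental matches on columns such as region, which also contains an r followed by another character.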
Finally, let’s quickly adjust the column names of our dataframe.
    # Rename columns
    colnames(df) <- c("year", "country", "region", "deathRates", "DALYs", "cancerScreening")
    df[1:10, ]
    ##    year             country                region deathRates DALYs
    ## 1  2004         Afghanistan Eastern Mediterranean       29.6   506
    ## 2  2004             Albania                Europe       28.7   384
    ## 3  2004             Algeria                Africa       17.5   212
    ## 4  2004             Andorra                Europe       18.4   267
    ## 5  2004              Angola                Africa       34.5   410
    ## 6  2004 Antigua and Barbuda              Americas       37.7   549
    ## 7  2004           Argentina              Americas       25.8   316
    ## 8  2004             Armenia                Europe       38.6   552
    ## 9  2004           Australia       Western Pacific       20.3   337
    ## 10 2004             Austria                Europe       20.1   259
    ##     cancerScreening
    ## 1               Yes
    ## 2               Yes
    ## 3               Yes
    ## 4               Yes
    ## 5  No data received
    ## 6               Yes
    ## 7               Yes
    ## 8               Yes
    ## 9               Yes
    ## 10              Yes
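The combination step lines the three tables up by row position, which only works if each table lists the same countries in the same order. A key-based join is an alternative that does not depend on row order. Here is a minimal base R sketch with merge(); the three toy tables and their country codes are invented for illustration, and only the column names follow the ones used in this post.

```r
# Invented toy tables; rows are deliberately in different orders
toy_death <- data.frame(country    = c("A", "B", "C"),
                        region     = c("R1", "R2", "R1"),
                        deathRates = c(29.6, 28.7, 17.5))
toy_dalys <- data.frame(country = c("C", "A", "B"),
                        DALYs   = c(212, 506, 384))
toy_screen <- data.frame(country         = c("B", "A"),
                         cancerScreening = c("Yes", "No data received"))

# Join on the shared key instead of relying on row position
toy_df <- merge(toy_death, toy_dalys, by = "country")

# all.x = TRUE keeps countries that have no screening record (filled with NA)
toy_df <- merge(toy_df, toy_screen, by = "country", all.x = TRUE)
toy_df
```

With a join, a country missing from one table shows up as NA rather than silently shifting the rows below it.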
Well, how about plotting the data now?
    ggplot(df, aes(x = deathRates, y = DALYs,
                   color = region, shape = cancerScreening)) +
      geom_point() +
      theme_minimal() +
      ggtitle("") +
      xlab("Death rates (per 100,000)") +
      ylab("DALYs (per 100,000)") +
      scale_shape_manual(values = c(1:5), name = "Screening") +
      scale_color_brewer(palette = "Set1", name = "Region") +
      theme(legend.position = "bottom")
Two assumptions were made here. Death rates and DALYs were published in 2004, whereas the data about the availability of screening are from 2013. We can reasonably assume that countries with no screening at the primary health care level in 2013 didn’t have screening back in 2004 either. But there is no way to be sure that countries organizing breast cancer screening nowadays were already doing so in 2004. Additionally, there is no information about patients’ sex in the breast cancer screening dataset. But one can assume that screening was at least available to female patients, since they are usually the only population targeted by routine breast cancer screening.