Obtaining Data from Gapminder

Gapminder makes available a number of data sets at http://www.gapminder.org/data/.
Links there will lead you to publicly viewable google spreadsheets.
In the example below, we will use the first one in the list, which provides data for many countries about the percentage of the population (age 14-49) that had HIV over a number of years. We will use this data to make a plot something like this one

You can view this data here.

If you look at it, you will see that we have some work ahead of us to make this usable (i.e. glyph ready) for our plot. But first let’s import the data, then we will figure out how to get it glyph ready.

Grabbing the data directly from google

We can use the googlesheets package makes it possible to grab the data directly from google.

First we establish a connection to the google spreadsheet.

require(googlesheets)
google_connection <-
  gs_url("https://docs.google.com/spreadsheets/d/1kWH_xdJDM4SMfT_Kzpkk-1yuxWChfurZuWYjfmv51EA/pub?gid=0", visibility="public")

## Sheet-identifying info appears to be a browser URL.
## googlesheets will attempt to extract sheet key from the URL.
## Putative key: 1kWH_xdJDM4SMfT_Kzpkk-1yuxWChfurZuWYjfmv51EA
## Authentication will not be used.
## Worksheets feed constructed with public visibility

Then we import the data from the first worksheet at this connection. (You can specify other spreadsheets if you don’t want the first one, but the first one is the default.)

HIVdata <- gs_read(google_connection)

## Accessing worksheet titled "Data"

## Error in gsheets_GET(this_ws$exportcsv, to_xml = FALSE, use_auth = !ss$is_public): Not expecting content-type to be:
## text/html; charset=UTF-8

Unfortunately, this particular spreadsheet doesn’t read in quite right. But is we specify the range of cells that contain data, then it works. The connection object reports a bunch of information, including the number of rows and columns in the worksheets.

google_connection$ws$row_extent

## [1] 276  24  20  20  20  20

google_connection$ws$col_extent

## [1] 34  6  6  6  6  6

We can use this information to select a range, and then things work.

HIVdata <- 
  gs_read(google_connection,
          range = cell_limits(c(1,1), c(276, 34))
  )

## Accessing worksheet titled "Data"

dim(HIVdata)

## [1] 275  34

Grabbing the data via csv

If you have trouble getting the data from google, you can load it from this CSV file instead.

HIVdata <- read.csv("HIV.csv", header = TRUE, stringsAsFactors = FALSE)

Getting the Data Glyph Ready

If we consider a case to to an observation of a country in a given year, then these data are not tidy. We would like to have a data table with variables country, year, and HIV.perc There are several problems that need fixing.

head(HIVdata, 3)

## Source: local data frame [3 x 34]
## 
##   Estimated.HIV.Prevalence.....Ages.15.49. X1979 X1980 X1981 X1982 X1983
##                                      (chr) (chr) (chr) (chr) (chr) (chr)
## 1                                 Abkhazia    NA    NA    NA    NA    NA
## 2                              Afghanistan    NA    NA    NA    NA    NA
## 3                    Akrotiri and Dhekelia    NA    NA    NA    NA    NA
## Variables not shown: X1984 (chr), X1985 (chr), X1986 (chr), X1987 (chr),
##   X1988 (chr), X1989 (chr), X1990 (chr), X1991 (chr), X1992 (chr), X1993
##   (chr), X1994 (chr), X1995 (chr), X1996 (chr), X1997 (chr), X1998 (chr),
##   X1999 (chr), X2000 (chr), X2001 (chr), X2002 (chr), X2003 (chr), X2004
##   (chr), X2005 (chr), X2006 (chr), X2007 (chr), X2008 (chr), X2009 (chr),
##   X2010 (chr), X2011 (chr)

The “name” of the first variable is really a title for the data set and should be changed to country. We can do this by selecting all of the columns and renaming the ones that begin with "Estimated". (We’ll deal with those names that start with X at a different point in the process – those columns have more important issues to fix first.)

HIVdata2 <- 
  HIVdata %>% 
  select(country = starts_with("Estimated"), starts_with("X"))
head(HIVdata2, 3)

## Source: local data frame [3 x 34]
## 
##                 country X1979 X1980 X1981 X1982 X1983 X1984 X1985 X1986
##                   (chr) (chr) (chr) (chr) (chr) (chr) (chr) (chr) (chr)
## 1              Abkhazia    NA    NA    NA    NA    NA    NA    NA    NA
## 2           Afghanistan    NA    NA    NA    NA    NA    NA    NA    NA
## 3 Akrotiri and Dhekelia    NA    NA    NA    NA    NA    NA    NA    NA
## Variables not shown: X1987 (chr), X1988 (chr), X1989 (chr), X1990 (chr),
##   X1991 (chr), X1992 (chr), X1993 (chr), X1994 (chr), X1995 (chr), X1996
##   (chr), X1997 (chr), X1998 (chr), X1999 (chr), X2000 (chr), X2001 (chr),
##   X2002 (chr), X2003 (chr), X2004 (chr), X2005 (chr), X2006 (chr), X2007
##   (chr), X2008 (chr), X2009 (chr), X2010 (chr), X2011 (chr)

For each row of the form:

country      X1979  X1980  X1981  ... X2011
countryname     v2     v3     v4  ...   v34

we would like to create multiple rows that look like

    country    year  HIV.perc
countryname   X1979        v2
countryname   X1980        v3
countryname   X1981        v4    
    ...        ...        ...
countryname   X2011       v34

That is, we want to gather the values from multiple columns into a single column using the names of those columns as another column (generically called the the key). That is we are turning the column names and column values into key-value pairs. All the key-value pairs from a given row of the wide data will be associated with copies of the variables that were not part of the key-value pair (in this case, the country gets repeated 33 times).

To do this we

use the gather() function from the tidyr package
specify names for the key and value columns (year and HIV.perc)
specify which columns are being gathered (all the ones that start with X, or equivalently, all columns except the country column).

HIVdata3 <-
  HIVdata2 %>% gather( year, HIV.perc, -country)
head(HIVdata3, 3)

## Source: local data frame [3 x 3]
## 
##                 country   year HIV.perc
##                   (chr) (fctr)    (chr)
## 1              Abkhazia  X1979       NA
## 2           Afghanistan  X1979       NA
## 3 Akrotiri and Dhekelia  X1979       NA

Now we need to remove the X from the years. We can do this with the extract_numeric() function from tidyr:

HIVdata4 <-
  HIVdata3 %>% mutate( year = extract_numeric(year), HIV.perc = extract_numeric(HIV.perc))
head(HIVdata4, 3)

## Source: local data frame [3 x 3]
## 
##                 country  year HIV.perc
##                   (chr) (dbl)    (dbl)
## 1              Abkhazia  1979       NA
## 2           Afghanistan  1979       NA
## 3 Akrotiri and Dhekelia  1979       NA

Now the data are ready for a plot. Let’s compare just a few countries. There is very little data before 1990, so we’ll filter the data to include just the more recent years.

HIVdata4 %>%
  filter(country %in% c("Uganda", "Kenya", "Tanzania", "South Africa",
                        "Zimbabwe", "United States")) %>%
  filter(year > 1988) %>%
  ggplot(aes(x = year, y = HIV.perc, color = country)) +
  geom_line(size=2, alpha=0.5)

## Warning: Removed 6 rows containing missing values (geom_path).

If we reorder the countries based on the HIV prevalence, then our legend will be in a better order compared with the plot.

HIVdata4 %>%
  filter(country %in% c("Uganda", "Kenya", "Tanzania", "South Africa",
                        "Zimbabwe", "United States")) %>%
  filter(year > 1988) %>%
  mutate(country = 
           reorder(country, HIV.perc, 
                   function(x) - max(x, na.rm=TRUE))) %>% 
  ggplot(aes(x = year, y = HIV.perc, color = country)) +
  geom_line(size=2, alpha=0.5)

## Warning: Removed 6 rows containing missing values (geom_path).

Exercises

Download an additional data file from Gapminder, tidy it, and merge it with the HIV data so that you can make a scatter plot of HIV prevelance vs some other variable that might reveal and interesting pattern? What will you do about the fact that you have multiple years?
Download still more data files and use them to add additional aesthetics to the plot.

Note: We haven’t talked much about dynamic plots, but Hans Rosling’s famous moving bubble plots are one way to deal with the time variable.