Create and Analyse a Dataset

This article shows how to use owidR to create a country-level dataset made up of multiple variables. It does this in the context of a research question: does higher internet use lead to higher levels of democracy? This is based on research done by (citation). We will do a similar analysis using data gathered from Our World in Data. The analysis is only intended as an example of how to use owidR, not as a robust research paper. The article assumes some basic knowledge of the tidyverse, especially dplyr and ggplot2. If you aren’t familiar with either of these you should still be able to follow along, but I would recommend reading R for Data Science. An understanding of basic R is essential.

To begin, we’ll load owidR using the library() function. We’ll also load dplyr, ggplot2, plm and texreg, which we’ll use for the analysis.

library(owidR)
library(dplyr)
library(ggplot2)
library(plm)
library(texreg)

Searching for and importing data using owidR is very easy. First, we can search for data on a topic using owid_search(). We’ll start by searching for data to use as our explanatory variable: internet use. To do this I enter the keyword “internet” as the argument in owid_search().

owid_search("internet")

Running this line of code returns around 10 datasets about the internet. Let’s use the dataset with the title: Share of the population using the Internet. The corresponding chart_id for this data is “share-of-individuals-using-the-internet”. Using the chart_id as an argument to owid() imports that data into R, and we assign it to an object called internet. We use the rename argument to give the value column a shorter, cleaner name.

internet <- owid("share-of-individuals-using-the-internet", rename = "internet_use")
internet
#> # A tibble: 7,246 × 4
#>    entity      code   year internet_use
#>  * <chr>       <chr> <int>        <dbl>
#>  1 Afghanistan AFG    1990      0      
#>  2 Afghanistan AFG    1991      0      
#>  3 Afghanistan AFG    1992      0      
#>  4 Afghanistan AFG    1993      0      
#>  5 Afghanistan AFG    1994      0      
#>  6 Afghanistan AFG    1995      0      
#>  7 Afghanistan AFG    2001      0.00472
#>  8 Afghanistan AFG    2002      0.00456
#>  9 Afghanistan AFG    2003      0.0879 
#> 10 Afghanistan AFG    2004      0.106  
#> # … with 7,236 more rows

We can find information about the source of the data by using owid_source(), with the owid dataset object as the argument. This gives us the original publisher of the data as well as a link to the data. For some datasets, additional information about how the variables are calculated is also provided. Using view_chart() takes you to the Our World in Data webpage for that dataset, where there is also additional information and a pretty graph.

owid_source(internet)
#> Dataset Name: International Telecommunication Union (via World Bank)
#> 
#> Published By: World Development Indicators - World Bank (2021.07.30)
#> 
#> Link: http://data.worldbank.org/data-catalog/world-development-indicators
view_chart(internet)

To create a simple plot of how internet use has changed over time, use owid_plot(), filtering to give the World total. Given that this function is a wrapper around ggplot2, you can use normal ggplot2 functions to further manipulate the graph. I’m going to add a title using labs() and change the y-axis scale so that it runs from 0 to 100 (this makes the graph clearer to interpret given that the value is a percentage, otherwise small variations can appear large).

owid_plot(internet, filter = "World") +
  labs(title = "Share of the World Population using the Internet") +
  scale_y_continuous(limits = c(0, 100))
#> Loading required namespace: showtext

We can see how internet use varies between countries by creating a choropleth map using owid_map(). It shows that, in 2017, there was still a large variation in the level of internet use between countries, with many African countries having particularly low use.

owid_map(internet, year = 2017) +
  labs(title = "Share of Population Using the Internet, 2017")

It’s also possible to compare countries’ levels of internet use across time, again using owid_plot(). By using the argument summarise = FALSE, owid_plot() will show individual countries instead of aggregating them into a total. You can then use the filter argument to select which countries you want to be displayed.

owid_plot(internet, summarise = FALSE, filter = c("United Kingdom", "Spain", "Russia", "Egypt", "Nigeria")) +
  labs(title = "Share of Population Using the Internet") +
  scale_y_continuous(limits = c(0, 100), labels = scales::label_number(suffix = "%")) # The labels argument allows you to make it clear that the value is a percentage

Now let’s get data on democracy, first searching for a data source and then importing it using owid(). Using that data we’ll do some similar exploration to what we did with internet use data.

owid_search("democrac")
democracy <- owid("political-regime-updated2016-distinction-democracies-and-full-democracies",
                  rename = "polity") %>% 
  filter(year %in% 1960:2020)
democracy
#> # A tibble: 8,817 × 4
#>    entity      code   year polity
#>    <chr>       <chr> <int>  <int>
#>  1 Afghanistan AFG    1960    -10
#>  2 Afghanistan AFG    1961    -10
#>  3 Afghanistan AFG    1962    -10
#>  4 Afghanistan AFG    1963    -10
#>  5 Afghanistan AFG    1964     -7
#>  6 Afghanistan AFG    1965     -7
#>  7 Afghanistan AFG    1966     -7
#>  8 Afghanistan AFG    1967     -7
#>  9 Afghanistan AFG    1968     -7
#> 10 Afghanistan AFG    1969     -7
#> # … with 8,807 more rows

owid_source(democracy)
#> Dataset Name: Political Regime (OWID based on Polity IV and Wimmer & Min)
#> 
#> Published By: Our World In Data combined two datasets: Wimmer and Min (2006) for information on whether a country was colonized; Center for Systemic Peace for a measure of the political regime.
#> 
#> Link: 
#> 
#> Polity 2 Measure ranges from -10 (autocracy) to +10 (full democracy). If a country was colonized in a given year is encoded as -20. In cases in which there was data from both Min and Wimmer and also Polity IV the Polity IV data is shown.

owid_map(democracy, palette = "YlGn") +
  labs(title = "Political Regime (Polity IV)")

So we’ve done some nice exploratory analysis and produced some pretty graphs, but now let’s get into some more in-depth analysis. We’ll use a fixed effects (FE) regression to estimate the average effect that an increase in internet use has on democracy within a country. If you aren’t familiar with FE regression, this article explains its purpose well and this chapter from Introduction to Econometrics with R shows how to implement it in R. Specifically, we’re going to use a within-unit fixed effects model, which compares each country with itself over time rather than comparing countries with each other.
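To make the idea of a within-unit model more concrete, here is a minimal base R sketch on a made-up panel (the entities A, B and C and all of the numbers are hypothetical, not Our World in Data values). It subtracts each entity’s own mean from the variables and runs ordinary least squares on what is left. The slope matches the one from a regression with one dummy variable per entity, which is the textbook equivalence behind the within estimator.

```r
# Toy panel: 3 hypothetical entities observed over 4 years (made-up numbers)
panel <- data.frame(
  entity = rep(c("A", "B", "C"), each = 4),
  year   = rep(2001:2004, times = 3),
  x      = c(1, 2, 3, 4,  11, 12, 14, 15,  21, 23, 24, 26),
  y      = c(5, 6, 8, 9,   2,  3,  3,  5,  10, 13, 13, 16)
)

# Within transformation: subtract each entity's own mean from x and y
panel$x_within <- panel$x - ave(panel$x, panel$entity)
panel$y_within <- panel$y - ave(panel$y, panel$entity)

# OLS on the demeaned data, without an intercept
within_fit <- lm(y_within ~ 0 + x_within, data = panel)

# Equivalent slope: OLS with one dummy variable per entity
dummy_fit <- lm(y ~ x + factor(entity), data = panel)

# The two slope estimates agree
coef(within_fit)[["x_within"]]
coef(dummy_fit)[["x"]]
```

Only the slopes are guaranteed to agree here. The standard errors from the demeaned lm() fit do not account for the entity means that were estimated along the way, which is one reason to use a dedicated panel package such as plm for the real analysis.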

This model will require us to adjust for confounding factors, so we’ll need some extra data. I’m going to use data on variables that I think might be confounding the relationship between internet use and democracy. These are: GDP per Capita, Government Expenditure, Age Dependency and Unemployment. There are almost certainly more confounding factors so feel free to use owid_search() to find data on other variables you think might be confounders and add them to the analysis.

gdp <- owid("gdp-per-capita-worldbank", rename = "gdp")

gov_exp <- owid("total-gov-expenditure-gdp-wdi", rename = "gov_exp")

age_dep <- owid("age-dependency-ratio-of-working-age-population", rename = "age_dep")

unemployment <- owid("unemployment-rate", rename = "unemp")

In order to create an FE model, all these separate dataframes now need to be combined into one. To do this I’m going to use the left_join() function from dplyr and create a new dataframe called data that combines all the other dataframes.

data <- internet %>% 
  left_join(democracy) %>% 
  left_join(gdp) %>% 
  left_join(gov_exp) %>% 
  left_join(age_dep) %>% 
  left_join(unemployment)
#> Joining, by = c("entity", "code", "year")
#> Joining, by = c("entity", "code", "year")
#> Joining, by = c("entity", "code", "year")
#> Joining, by = c("entity", "code", "year")
#> Joining, by = c("entity", "code", "year")
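The messages above show that left_join() guessed the join keys from the shared column names. As a small sketch of a more explicit approach, the example below names the keys with the by argument and checks that the right-hand table has only one row per entity and year before joining. The two toy tibbles and their values are made up for illustration and are not real Our World in Data figures.

```r
library(dplyr)

# Made-up values for illustration only; not real OWID data
internet_toy <- tibble(
  entity       = c("Spain", "Spain", "Egypt"),
  code         = c("ESP", "ESP", "EGY"),
  year         = c(2014L, 2015L, 2015L),
  internet_use = c(76, 79, 38)
)
democracy_toy <- tibble(
  entity = c("Spain", "Egypt"),
  code   = c("ESP", "EGY"),
  year   = c(2015L, 2015L),
  polity = c(10L, -4L)
)

keys <- c("entity", "code", "year")

# Each entity-year should appear only once in the right-hand table,
# otherwise the join would duplicate rows from the left-hand table
stopifnot(!anyDuplicated(democracy_toy[keys]))

# Naming the keys explicitly also silences the "Joining, by" message
joined <- left_join(internet_toy, democracy_toy, by = keys)
joined
```

Because this is a left join, every row of internet_toy is kept, and polity is NA for any entity and year that has no match in democracy_toy (Spain in 2014 in this toy example).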

Now that we have a combined dataset we can get to the analysis. First, let’s use ggplot2 to create a graph showing the correlation between internet use and democracy in 2015.

data %>% 
  filter(year == 2015) %>% 
  ggplot(aes(internet_use, polity)) +
  geom_point(colour = "#57677D") +
  geom_smooth(method = "lm", colour = "#DC5E78") +
  labs(title = "Relationship Between Internet Use and Polity IV Score", x = "Internet Use", y = "Polity IV") +
  theme_owid()
#> `geom_smooth()` using formula 'y ~ x'
#> Warning: Removed 90 rows containing non-finite values (stat_smooth).
#> Warning: Removed 90 rows containing missing values (geom_point).
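The warnings above mean that rows with a missing value in either variable were dropped before plotting. If you want a number to go with the plot, the sketch below shows one way to count the complete rows and compute a correlation on them using base R. The toy data frame is made up for illustration; in the article’s case you would apply the same calls to data filtered to 2015.

```r
# Made-up values for illustration only; NA marks a missing observation
toy <- data.frame(
  internet_use = c(10, 35, 60, 85, NA, 50),
  polity       = c(-6,  2, NA,  9,  4,  7)
)

# Number of rows with both variables observed
n_complete <- sum(complete.cases(toy))

# Pearson correlation computed on the complete rows only
r <- cor(toy$internet_use, toy$polity, use = "complete.obs")

n_complete
r
```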

There appears to be some relationship but this could easily be explained by countries with higher development also being more democratic and not actually the result of internet access. That’s why we control for GDP and the other confounders. Next, we’ll create two models, one with just internet use and democracy, and the other with the confounders added.

fe_model <- plm(polity ~ internet_use, data, 
                effect = c("individual"), index = "entity")

fe_model_2 <- plm(polity ~ internet_use + gdp + gov_exp + age_dep + unemp, data, 
                  effect = c("individual"), index = "entity")

htmlreg(list(fe_model, fe_model_2))
Statistical models
               Model 1    Model 2
internet_use   0.02***    -0.00
               (0.00)     (0.00)
gdp                       0.00
                          (0.00)
gov_exp                   0.03**
                          (0.01)
age_dep                   -0.09***
                          (0.01)
unemp                     0.02
                          (0.02)
R2             0.02       0.09
Adj. R2        -0.02      0.03
Num. obs.      4144       2073
***p < 0.001; **p < 0.01; *p < 0.05

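You can see that internet use has a statistically significant positive coefficient in the first model, but once the confounders are added the coefficient is no longer statistically significant. This means that our model provides no evidence that internet use has an effect on democracy. Note that the second model also uses about half as many observations (2,073 compared with 4,144), because rows with missing values for any of the confounders are dropped, so the two models are not estimated on the same sample. Feel free to play around with this data yourself and see if you get a different result when other variables are used.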