Be sure to load the packages you require. We use the “Tidyverse” of packages.
dplyr to use summarise and group_bytidyr for gather.ggplot2 for ggplot.library(tidyverse)plm (remember you can install the package plm by running install.packages("plm"))stargazer which we’ll discuss later in the lab.We now need to import the data. You can download the csv of the file from Google drive here: https://drive.google.com/file/d/0B9jjwkjdUJU7cElvcVI5U3hhdEU/view?usp=sharing.
The .csv is a version of the supplementary data provided by Sutter at the AER here: https://www.aeaweb.org/articles?id=10.1257/aer.99.5.2247) that I cleaned up to make it easier to import for this exercise.
Download the original paper here to make sense of the exercises below: https://drive.google.com/file/d/0B9jjwkjdUJU7eE5MVWRBZktlTm8/view?usp=sharing.
When you call on read.csv() or read_csv() be sure to include the file extension – .csv – at the end. Notice we use uppercase letters to name the data table we’ve prepared. The command as written uses your “more” folder.
Or alternatively, using read_csv:
We want to have consistent naming protocols. So we shall quickly clean the data to make sure that all the variables are consistently labeled with lower case first letters to ensure we don’t confuse them with data tables.
There are a couple of ways to do it, the shortes and easiest is as follows using the base R tolower() function which takes the input and change it to lowercase (hence its name tolower()):
Alternatively, but more verbosely, we could use the rename() command for each variable:
Now that we have an object with the data in it called SutExp, we want a data table that contains the means for each treatment for each round.
group_by and summarise.group_by() with our Treatment variable (from Sutter’s paper)summarise() to create the means of the variable in which we’re interested: r1 through r9 (corresponding to rounds 1 through 9) respectively.groupAves <- #new data table called GroupAves
  SutExp %>%  #Using the SutExp data table
  group_by(treatment) %>% #Group the data by treatment 
  summarise(r1 = mean(r1), r2 = mean(r2), r3 = mean(r3), r4 = mean(r4), r5 = mean(r5), r6 = mean(r6), r7 = mean(r7), r8 = mean(r8), r9 = mean(r9)) gather command.Once you have narrowed your data you will then be able to call on ggplot to plot the data in a sensible way.
Now we need narrow data to plot
narrowAve <- #name for the new data table
  groupAves %>% #start with the old data table 
  gather(round, average, r1:r9) #gather the data to narrow itThis would be equivalent to:
And now we can filter this data and put it into a plot.
We need to filter for individuals and teams, then construct the ggplot.
Plot1 <- 
  narrowAve %>%
  filter(treatment == "individual" | treatment == "teamtreat" ) %>%
  group_by(treatment) %>%
  arrange(treatment)
Plot1 %>%
  ggplot(aes(x = round, y = average, color = treatment)) + 
  geom_point(aes(shape = treatment)) +
  geom_line(aes(group = treatment)) + 
  ylim(20, 80) +
  xlab("Round") +
  ylab("Average amount allocated") + 
  scale_x_discrete(labels = seq(1, 9, by = 1)) +  
  scale_colour_discrete(name = "Treatment",
                         breaks = c("individual", "teamtreat"),
                         labels = c("Indidividuals", "Teams")) +
  scale_shape_discrete(name = "Treatment",
                         breaks = c("individual", "teamtreat"),
                         labels = c("Indidividuals", "Teams"))We need to filter for individuals, paycomm and message, then construct the ggplot.
Plot2 <- 
  narrowAve %>%
  filter(treatment == "individual" | treatment == "paycomm" |  treatment == "message") %>%
  group_by(treatment) %>%
  arrange(treatment)
Plot2 %>%
  ggplot(aes(x = round, y = average, color = treatment)) + 
  geom_point(aes(shape = treatment)) +
  geom_line(aes(group = treatment)) + 
  ylim(20, 90) +
  xlab("Round") +
  ylab("Average amount allocated") + 
  scale_x_discrete(labels = seq(1, 9, by = 1)) +  
  scale_colour_discrete(name = "Treatment",
                         breaks = c("individual", "paycomm", "message"),
                         labels = c("Indidividuals", "Pay-Comm", "Message")) +
  scale_shape_discrete(name = "Treatment",
                         breaks = c("individual", "paycomm", "message"),
                         labels = c("Indidividuals", "Pay-Comm", "Message"))We need to filter for individuals and mixed, then construct the ggplot.
Plot3 <- 
  narrowAve %>%
  filter(treatment == "individual" | treatment == "mixed" ) %>%
  group_by(treatment) %>%
  arrange(treatment)
Plot3 %>%
  ggplot(aes(x = round, y = average, color = treatment)) + 
  geom_point(aes(shape = treatment)) +
  geom_line(aes(group = treatment)) + 
  ylim(20, 80) +
  xlab("Round") +
  ylab("Average amount allocated") + 
  scale_x_discrete(labels = seq(1, 9, by = 1)) +  
  scale_colour_discrete(name = "Treatment",
                         breaks = c("individual", "mixed"),
                         labels = c("Indidividuals", "Mixed")) +
  scale_shape_discrete(name = "Treatment",
                         breaks = c("individual", "mixed"),
                         labels = c("Indidividuals", "Mixed"))We’re now going to go about re-creating the averages that Sutter tests against each other in the paper. To do this, we’re first going to need unique identifiers for each subject. We’ll then look at different methods to find the average and diagnose why one doesn’t work.
We first need to create unique identifiers for each subject so that we don’t get confused across individuals (we will use this uniqid later in the regressions).
SutExpUniq <- 
  SutExp %>% 
  mutate(uniqid = paste(session, treatment, subject, sep = "_"))
head(SutExpUniq)## # A tibble: 6 x 14
##   session subject    r1    r2    r3    r4    r5    r6    r7    r8    r9
##     <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1       1       1     0     0     0     0    10    10     0     0     0
## 2       1       2     0     0    30    40    40     0     0     0    20
## 3       1       3    30    30     0     0     0    60    60    10     0
## 4       1       4    20     0   100     0     0    30    75   100   100
## 5       1       5   100   100   100   100   100   100   100   100   100
## 6       1       6   100   100   100   100   100   100   100     0     0
## # … with 3 more variables: treatment <chr>, team <dbl>, uniqid <chr>NarUniq <- 
  SutExpUniq %>%
  #select()
  filter(treatment == "individual") %>% 
  gather(round, value, -treatment, -uniqid, -session)
  ggplot(aes(x = ))We need individual averages for round, so we need to group_by() with uniqid:
SutUniqAves and think about what it contains and write a brief description of the table.SutUniqAves informative? What are we comparing with what?In the paper, there’s a statistical test that may be unfamiliar to many of you: the Mann-Whitney U-Test. Look up what the Mann-Whitney test is on Google. It is also known as the Wilcoxon Rank Sum test.
Wilcoxon tests check the differences of samples across a factor variable, such as an experimental treatment. The problem is it only wants two levels of that treatment, so we have to filter out the treatments we don’t want to use. We only want “teamtreat” and “individuals” for now.
We could run a Wilcoxon test for the entire sample of the data for subjects for plot 1 and treat the rounds for each individual as if they are independent.
First, we need to narrow the data from our original SutExp data table:
SutNarrow <- 
  SutExpUniq %>%
  gather(round, value, r1:r9) %>%
  arrange(session, subject, treatment, team) #arranging isn't necessary, it's just more attractive
Plot1data <- #define the object 'plot 1 data'
  SutNarrow %>% #use SutNarrow to get that data
  filter(treatment == "individual" | treatment == "teamtreat" ) #filter on the categories we want
wilcox.test(value ~ treatment, data = Plot1data) # run the M-W test## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  value by treatment
## W = 52876, p-value = 3.838e-10
## alternative hypothesis: true location shift is not equal to 0An alternative might be to find an average for each subject. We’ve already done this with creating the data table SutUniqAves. We now filter out the rows we don’t need and run a Wilcoxon (Mann-Whitney) test.
Plot1AveData <- 
  SutUniqAves %>% 
  filter(treatment == "individual" | treatment == "teamtreat" )
wilcox.test(meanval ~ treatment, data = Plot1AveData)## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  meanval by treatment
## W = 660, p-value = 0.04532
## alternative hypothesis: true location shift is not equal to 0What might the commands below tell us? Is it helpful?
We now want to run a regression using value as the dependent variable (LHS) and each of the treatments as a dummy variable (or categorical variable). We can think of dummy variables as variables that take 1 value of ‘1’ if you are in that category and ‘0’ if you are not in that category. So we have categorical (dummy) variables for the participants in each experimental treatment. We then use R’s command for an OLS regression, lm() to find out what the value is predicted by participating in each treatment. The coefficient of each treatment will tell us what the average amount to add to the intercept of value should be for participating in that treatment relative to an excluded category.
Running a regression in R requires specifying an equation. The equation takes the form \(y \sim x + z + \ldots\). Because we only have one categorical variable we are interested in, we just have to specify that variable. Notice that R is intelligent in how it reads this. It will specify a list of dummy variables corresponding to each category (each treatment). This is because categories are encoded as ‘factors’ in R, and R is programmed to treat factor (category) variables in this way when dealing with factor variables.
## 
## Call:
## lm(formula = value ~ treatment, data = SutNarrow)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -61.370 -29.385  -0.542  38.630  60.615 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          39.385      1.451  27.152  < 2e-16 ***
## treatmentmessage     21.985      1.994  11.028  < 2e-16 ***
## treatmentmixed       10.609      1.925   5.510 3.92e-08 ***
## treatmentpaycomm     10.886      2.144   5.077 4.09e-07 ***
## treatmentteamtreat   16.313      2.629   6.204 6.34e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 34.81 on 2713 degrees of freedom
## Multiple R-squared:  0.04473,    Adjusted R-squared:  0.04333 
## F-statistic: 31.76 on 4 and 2713 DF,  p-value: < 2.2e-16The first command assigns to “m1” a linear model/OLS model (using the lm() function) where value is a function of treatment. Treatment is a “factor”, so R automatically converts the different components of the factor into dummy variables in a linear regression.
You then call on summary() to get a summary of the object m1 created by the linear model. summary() provides us the basic details we need to know about coefficient size, intercept, statistical significance and goodness of fit.
An alternative way of getting this regression output is to use the stargazer package, which makes much prettier output tables. (Be sure to use install.packages("stargazer") if you don’t have it installed. )
| Dependent variable: | |
| value | |
| treatmentmessage | 21.985*** | 
| (1.994) | |
| treatmentmixed | 10.609*** | 
| (1.925) | |
| treatmentpaycomm | 10.886*** | 
| (2.144) | |
| treatmentteamtreat | 16.313*** | 
| (2.629) | |
| Constant | 39.385*** | 
| (1.451) | |
| Observations | 2,718 | 
| R2 | 0.045 | 
| Adjusted R2 | 0.043 | 
| Residual Std. Error | 34.813 (df = 2713) | 
| F Statistic | 31.762*** (df = 4; 2713) | 
| Note: | p<0.1; p<0.05; p<0.01 | 
As you can see, the output from stargazer() looks much nicer in our html file than the ASCII text output of the summary() function. Notice that we had to specify the option ’results = “asis” in the code chunk option. We also had to specify "type = "html" for the stargazer function because we are using html output.
Note, if we call stargazer on a data table, then it immediately will produce summary statistics of the relevant table. Check this by calling stargazer on a relevant data table in which you are interested. Note, stargazer can be a little finicky as it was written by people outside of the tidyverse, so it likes you to specify that your object is a dataframe rather than a tibble (the structure of the data table that we have used so far with tidyverse).
What did I do here about specifying the type of output stargazer produced? How is it different to the previous time I ran the command?
SutExpDF <- as.data.frame(SutExp) #stargazer likes dataframes not tibbles
stargazer(SutExpDF, type = "text")## 
## ===========================================================
## Statistic  N   Mean  St. Dev.  Min  Pctl(25) Pctl(75)  Max 
## -----------------------------------------------------------
## session   302 2.089   0.948     1      1        3       4  
## subject   302 12.563  7.371     1      6        18     28  
## r1        302 48.639  30.816    0      25       66     100 
## r2        302 44.394  34.167    0     15.8     66.8    100 
## r3        302 47.818  35.917    0      15       80     100 
## r4        302 55.212  33.229    0      30      100     100 
## r5        302 52.050  32.818    0      30       80     100 
## r6        302 50.742  36.227    0      20      100     100 
## r7        302 59.970  36.994    0      30      100     100 
## r8        302 49.805  38.007    0      15      100     100 
## r9        302 50.702  39.505    0      10      100     100 
## team      238 4.504   2.440   1.000  2.000    6.000   9.000
## -----------------------------------------------------------Another way of doing this regression is as a “panel regression.” For a panel regresion, you treat each variable as if it comes from the same individual over time (hence a panel). We have to specify a few options if we are running a panel regression. We need to tell R the equation and data, as previously, but we also need to tell it what variable uniquely identifies each individual over time and we need to tell it what kind of model we want (as there are several kinds of models you can use for panel regressions).
Here’s one way of doing it using the plm package.
## Pooling Model
## 
## Call:
## plm(formula = value ~ treatment, data = SutNarrow, model = "pooling", 
##     index = c("uniqid"))
## 
## Balanced Panel: n = 302, T = 9, N = 2718
## 
## Residuals:
##      Min.   1st Qu.    Median   3rd Qu.      Max. 
## -61.37037 -29.38542  -0.54191  38.62963  60.61458 
## 
## Coefficients:
##                    Estimate Std. Error t-value  Pr(>|t|)    
## (Intercept)         39.3854     1.4505 27.1523 < 2.2e-16 ***
## treatmentmessage    21.9850     1.9936 11.0279 < 2.2e-16 ***
## treatmentmixed      10.6093     1.9254  5.5102 3.921e-08 ***
## treatmentpaycomm    10.8862     2.1442  5.0770 4.095e-07 ***
## treatmentteamtreat  16.3130     2.6293  6.2043 6.335e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    3441900
## Residual Sum of Squares: 3288000
## R-Squared:      0.044734
## Adj. R-Squared: 0.043325
## F-statistic: 31.7615 on 4 and 2713 DF, p-value: < 2.22e-16And another similar way of doing it with the same package:
m3 <- plm(value ~ treatment, data = SutNarrow, index = c("uniqid"), model = "random", effect = "time")
summary(m3)## Oneway (time) effect Random Effect Model 
##    (Swamy-Arora's transformation)
## 
## Call:
## plm(formula = value ~ treatment, data = SutNarrow, effect = "time", 
##     model = "random", index = c("uniqid"))
## 
## Balanced Panel: n = 302, T = 9, N = 2718
## 
## Effects:
##                    var  std.dev share
## idiosyncratic 1197.631   34.607 0.987
## time            16.062    4.008 0.013
## theta: 0.555
## 
## Residuals:
##     Min.  1st Qu.   Median  3rd Qu.     Max. 
## -61.9325 -28.6639  -2.5889  33.7276  64.3014 
## 
## Coefficients:
##                    Estimate Std. Error z-value  Pr(>|z|)    
## (Intercept)         39.3854     1.9657 20.0367 < 2.2e-16 ***
## treatmentmessage    21.9850     1.9818 11.0936 < 2.2e-16 ***
## treatmentmixed      10.6093     1.9140  5.5430 2.973e-08 ***
## treatmentpaycomm    10.8862     2.1315  5.1072 3.270e-07 ***
## treatmentteamtreat  16.3130     2.6138  6.2412 4.342e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    3403100
## Residual Sum of Squares: 3249200
## R-Squared:      0.045244
## Adj. R-Squared: 0.043836
## Chisq: 128.563 on 4 DF, p-value: < 2.22e-16m2 and m3 using stargazer.m2 and m3 are specified? How are they different to m1?