1 Randomized Experiments
1.1 RAND Health Insurance Experiment
1.1.1 Working with .RDS files
The first step is to set up your working directory. To better organize things, you can create a folder named Rlabs
on your desktop. Inside of it, you might want to have different folders for each lab - in this case, lab1
. To change the working directory, use setwd()
with the path that leads to the folder you want, or press ctrl+shift+h
and find the preferred folder.
In this lab, we will use data from the RAND Health Insurance Experiment (HIE), and there are two datasets. Here you have demographic information about the subjects in the study and also health variables (outcomes) both before and after the experiment. The other file (here) has information about health care spending. These notes are heavily based on Angrist and Pischke (2014) chapter 1, and here you have a summary of the RAND HIE.
setwd("C:/Users/User/Desktop/474-Rlab/datasets")
<-readRDS("rand_sample.RDS")
rand_sample<-readRDS("rand_spend.RDS") rand_spend
If you want to see the first values on that dataset, you can use the function head() or use View(rand_sample)
to open the dataframe in a new tab.
#View(rand_sample)
head(rand_sample,5)
View(rand_spend)
#head(rand_spend,5)
Another way to check the dataset is using the function glimpse()
, using tidyverse. Then you have a good look at all the 319 columns in this dataset.
library(tidyverse)
%>%glimpse() rand_spend
Besides the column plantype
, which identifies the assigned insurance group of each individual, the variables that we are looking for are displayed in 1.1:
Variable | Definition |
---|---|
rand_sample file | |
female | Female |
blackhisp | Nonwhite |
age | Age |
educper | Education |
income1cpi | Family Income |
hosp | Hospitalized last year |
ghindx | General Health Index (before) |
cholest | Cholesterol (mg/dl) (before) |
systol | Systolic blood pressure (mm Hg) (before) |
mhi | Mental Health Index (before) |
ghindxx | General Health Index (after) |
cholestx | Cholesterol (mg/dl) (after) |
systolx | Systolic blood pressure (mm Hg) (after) |
mhix | Mental Health Index (after) |
rand_spend file | |
ftf | Face-to-face visits |
out_inf | Outpatient expenses |
totadm | Hospital admissions |
inpdol_inf | Inpatient expenses |
tot_inf | Total expenses |
Using the function select()
, you can focus only on the columns you will be using in this exercise:
<-rand_sample%>%select(person, plantype, female, blackhisp,age,educper,
rand_sample
income1cpi,hosp,ghindx,cholest,systol,
mhi,ghindxx,cholestx,systolx,mhix)
<-rand_spend%>%select(person,plantype ,ftf,out_inf,totadm,inpdol_inf,tot_inf) rand_spend
1.1.2 Summarizing data
Let’s say you want to compare demographic characteristics of the individuals in the RAND HIE across health insurance groups. To do that, you just need the functions group_by()
and summarize()
from the tidyverse
package. Since there are some missing observations (NA), allow the function mean() to ignore those NAs.
library(tidyverse)
%>%group_by(plantype)%>%
rand_samplesummarize(Female=mean(female, na.rm=T),
Nonwhite=mean(blackhisp, na.rm=T),
Age=mean(age, na.rm=T),
Education=mean(educper, na.rm=T),
`Family Income`=mean(income1cpi, na.rm=T),
`Hospitalized last year`=mean(hosp, na.rm=T),
`General Health Index`=mean(ghindx, na.rm=T),
`Cholesterol (mg/dl)`=mean(cholest, na.rm=T),
`Systolic blood pressure (mm Hg)`=mean(systol, na.rm=T),
`Mental Health Index`=mean(mhi, na.rm=T),
`Number enrolled`=n())
## # A tibble: 4 x 12
## plantype Female Nonwhite Age Education `Family Income` `Hospitalized last year`
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Catastrophic 0.560 0.172 32.4 12.1 31603. 0.115
## 2 Deductible 0.537 0.153 32.9 11.9 29499. 0.120
## 3 Coinsurance 0.535 0.145 33.3 12.0 32573. 0.113
## 4 Free 0.522 0.144 32.8 11.8 30627. 0.116
## # ... with 5 more variables: General Health Index <dbl>, Cholesterol (mg/dl) <dbl>,
## # Systolic blood pressure (mm Hg) <dbl>, Mental Health Index <dbl>,
## # Number enrolled <int>
You can see that those values are the same as the ones in the lecture notes.
1.1.3 Checking for Balance
Although you can see the average values of demographic characteristics, we are unsure whether the difference in means across groups is statistically different from zero. We can perform a standard t-test comparing two groups. In this example, we compare the Catastrophic with the free plan. Let’s try education first:
<-rand_sample%>%filter(plantype=="Catastrophic"|plantype=="Free")
cat_vs_free
t.test(educper~plantype, data=cat_vs_free, alternative="two.sided")
##
## Welch Two Sample t-test
##
## data: educper by plantype
## t = 1.8039, df = 1478.5, p-value = 0.07145
## alternative hypothesis: true difference in means between group Catastrophic and group Free is not equal to 0
## 95 percent confidence interval:
## -0.02296019 0.54840275
## sample estimates:
## mean in group Catastrophic mean in group Free
## 12.10483 11.84211
According to the t-test, the difference of \(12.10483-11.84211=0.2627\) is not statistically significant at the 5% level, and we do not reject the null of equal means between groups.
What about family income?
t.test(income1cpi~plantype, data=cat_vs_free, alternative="two.sided")
##
## Welch Two Sample t-test
##
## data: income1cpi by plantype
## t = 1.1661, df = 1431, p-value = 0.2438
## alternative hypothesis: true difference in means between group Catastrophic and group Free is not equal to 0
## 95 percent confidence interval:
## -665.9016 2618.2711
## sample estimates:
## mean in group Catastrophic mean in group Free
## 31603.21 30627.02
Again, the p-value is higher than 0.05, and we cannot reject the null: there is no evidence that family income is different between the Catastrophic and the Free insurance groups.
As an exercise, try to compare all the demographic characteristics between insurance levels. Use Catastrophic as “control” and Deductible, Coinsurance and Free as “treatment” - do it using pairwise comparisons, e.g., Catastrophic x Deductible, Catastrophic x Coinsurance, and so on.
1.1.4 Results of the Experiment
As we saw in class, subjects assigned to more generous insurance plans used substantially more health care. Let’s compare outpatient expenses and face-to-face visits between the Catastrophic group and the other groups together (we call it any_ins
).
The ifelse()
function helps you to assemble all groups that are different from Catastrophic in only group - Any Insurance
.
$any_ins<-ifelse(rand_spend$plantype=="Catastrophic", "Catastrophic","Any Insurance") rand_spend
In this case, you are creating a new column in the dataset called any_ins
. The first input is rand_spend$plantype=="Catastrophic"
. That means for all individuals with Catastrophic plan, write Catastrophic
. Otherwise, write Any Insurance
. The same can be done with the mutate()
function:
<-rand_spend%>%
rand_spendmutate(any_ins2=ifelse(rand_spend$plantype=="Catastrophic", "Catastrophic","Any Insurance"))
You can check whether any_ins
and any_ins2
gives you the same result:
all(rand_spend$any_ins==rand_spend$any_ins2)
## [1] TRUE
Running the t.test()
to check if there are differences in face-to-face visits between groups:
t.test(ftf~any_ins, data=rand_spend,alternative="two.sided")
##
## Welch Two Sample t-test
##
## data: ftf by any_ins
## t = 8.6922, df = 6290.9, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Any Insurance and group Catastrophic is not equal to 0
## 95 percent confidence interval:
## 0.6961629 1.1016118
## sample estimates:
## mean in group Any Insurance mean in group Catastrophic
## 3.682990 2.784103
The almost zero p-value gives us confidence that the difference in face-to-face visits between those with some insurance and the Catastrophic group is statistically significant. One can see the same for outpatient expenses below:
t.test(out_inf~any_ins, data=rand_spend, alternative="two.sided")
##
## Welch Two Sample t-test
##
## data: out_inf by any_ins
## t = 10.992, df = 6274.9, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Any Insurance and group Catastrophic is not equal to 0
## 95 percent confidence interval:
## 82.67062 118.56030
## sample estimates:
## mean in group Any Insurance mean in group Catastrophic
## 348.4137 247.7983
1.1.5 Equivalence of Differences in Means and Regression
Instead of performing a t-test for differences in means, one can run regressions and get the same results. Regression plays an important role in empirical economic research and can be easily applied to experimental data. The advantage is that you can add controls and fix standard errors (we will talk about that later).
Let’s first create a dummy that is equal to 1 if the individual has “any insurance” (i.e., is assigned to the Deductible, Coinsurance, or Free group) and zero otherwise:
$dummy_ins<-ifelse(rand_spend$any_ins=="Any Insurance", 1,0) rand_spend
Then, use the lm()
to perform a linear regression of Face-to-face visits on the dummy that identifies the comparison groups:
<-lm(ftf~dummy_ins, data=rand_spend)
reg1summary(reg1)
##
## Call:
## lm(formula = ftf ~ dummy_ins, data = rand_spend)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.683 -2.784 -1.683 0.317 140.317
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.7841 0.1036 26.876 < 2e-16 ***
## dummy_ins 0.8989 0.1147 7.837 4.85e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.322 on 20201 degrees of freedom
## Multiple R-squared: 0.003031, Adjusted R-squared: 0.002982
## F-statistic: 61.42 on 1 and 20201 DF, p-value: 4.849e-15
The coefficient 0.8989
represents the difference in face-to-face visits between the insurance groups. As one can see, the coefficient is statistically significant (p-value<0.05).
When you perform the t-test for difference in means with the option var.equal=TRUE
(i.e., assuming equal variance), you get the same standard errors/p-value/t statistic. Notice that running the standard OLS, you assume homoskedasticity, and that is why you need to set var.equal=TRUE
.
t.test(ftf~any_ins, data=rand_spend,alternative="two.sided", var.equal = TRUE)
##
## Two Sample t-test
##
## data: ftf by any_ins
## t = 7.8368, df = 20201, p-value = 4.849e-15
## alternative hypothesis: true difference in means between group Any Insurance and group Catastrophic is not equal to 0
## 95 percent confidence interval:
## 0.6740646 1.1237101
## sample estimates:
## mean in group Any Insurance mean in group Catastrophic
## 3.682990 2.784103
Doing the same for outpatient expenses:
<-lm(out_inf~dummy_ins, data=rand_spend)
reg2summary(reg2)
##
## Call:
## lm(formula = out_inf ~ dummy_ins, data = rand_spend)
##
## Residuals:
## Min 1Q Median 3Q Max
## -348.4 -290.3 -175.5 69.2 12527.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 247.798 9.153 27.073 <2e-16 ***
## dummy_ins 100.615 10.135 9.928 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 558.6 on 20201 degrees of freedom
## Multiple R-squared: 0.004855, Adjusted R-squared: 0.004806
## F-statistic: 98.56 on 1 and 20201 DF, p-value: < 2.2e-16
t.test(out_inf~any_ins, data=rand_spend, alternative="two.sided", var.equal = TRUE)
##
## Two Sample t-test
##
## data: out_inf by any_ins
## t = 9.9278, df = 20201, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Any Insurance and group Catastrophic is not equal to 0
## 95 percent confidence interval:
## 80.75058 120.48034
## sample estimates:
## mean in group Any Insurance mean in group Catastrophic
## 348.4137 247.7983
What about the health outcomes? Compare the average health outcomes after the experiment - ghindxx
, cholestx
, systolx
, mhix
- between the Catastrophic and any insurance groups using regression. Do you see any statistically significant coefficient?
1.2 A/B Testing
A/B tests are heavily used in industry. Companies run thousands of experiments every year, sometimes involving millions of customers to test changes in user experience, the relevance of algorithms, etc. (see some examples here). Before you start designing those experiments, you need to understand some basic concepts related to hypothesis testing1.
For now, let’s talk about possible results when conducting a test. Say you want to check customer engagement in the Active Illini website. To participate in Intramural Activities, students must pay a $50 fee. The user needs to click on the “Memberships” box in the middle of the homepage at the bottom left and follow some instructions. Working as a data scientist at Campus Recreation, you suggest placing the “Memberships” at the top left instead to see whether this causes an increase in intramural memberships. The experimental unit is user: some will be randomly selected to see the “Memberships” box at the top left, others will see the current version (at the bottom left). You get the mean of intramural memberships in each group and then compare those values, testing the following:
\[\begin{array}{rcl} H_{o} & \mu_{1}=\mu_{0} \\ H_{a} & \mu_{1}\neq \mu_{0} \end{array}\]where \(\mu_{1}\) is the average number of intramural memberships in the treated group, and \(\mu_{0}\) is the average number of intramural memberships in the control group. Conducting this kind of test, you have four possible outcomes:
If the null hypothesis is false and the statistical test leads us to reject it, you’ve made a correct decision. You’ve correctly determined that intramural memberships are affected by the box position.
If the null hypothesis is true and you don’t reject it, you’ve made a correct decision. There is no difference between having the box at the top left or at the bottom left.
If the null hypothesis is true, but you reject it, you’ve committed a Type I error. You’ve concluded that the box position affects intramural memberships when it doesn’t.
If the null hypothesis is false and you fail to reject it, you’ve committed a Type II error. Box position affects intramural memberships, but you’ve been unable to discern this.
These outcomes are illustrated below:
Fail to Reject \(H_{0}\) | Reject \(H_{0}\) | |
---|---|---|
\(H_{0}\) is true | Good decision \((Prob=1-\alpha)\) | Type I error \((Prob=\alpha)\) |
\(H_{a}\) is true | Type II error \((Prob=\beta)\) | Good decision \((Prob=1-\beta)\) |
We would like to have a good power \((1-\beta\text{, usually 80%})\) - i.e., to be able to detect an effect if there is a true effect to be detected. In that sense, power analysis is an important aspect of experimental design - it allows us to determine the minimum sample size to detect an effect of a given size with a certain degree of confidence. For simple experiments, the package pwr
can be used to perform power analysis.
For example, to compute the sample size required to reach a good power in the intramural memberships experiment, we can use the following:
## install it first using install.packages("pwr")
library(pwr)
pwr.t.test(d=0.2, sig.level=0.05, power=0.8)
##
## Two-sample t test power calculation
##
## n = 393.4057
## d = 0.2
## sig.level = 0.05
## power = 0.8
## alternative = two.sided
##
## NOTE: n is number in *each* group
For each group, we would need 394 users. In total, the sample size would be 788 users. The necessary ingredients to cook the required sample size are i) d ii) \(\alpha=0.05\), and iii) \(power=0.8\). The first input \(d\) is the effect size - a standardized measure of the difference you are trying to detect. In other words, you want to be able to detect a difference of 0.2 (a change that would be considered important for Campus Recreation) with a probability of 0.8.
Now, let’s say you only had 92 users in each group accessing the Active Illini website during the last two weeks. Is 92 x 2 a large enough sample for your experiment?
## install it first using
# install.packages("pwr")
library(pwr)
pwr.t.test(d=0.2, sig.level=0.05, n=92)
##
## Two-sample t test power calculation
##
## n = 92
## d = 0.2
## sig.level = 0.05
## power = 0.2711829
## alternative = two.sided
##
## NOTE: n is number in *each* group
The power of this test is only 27.11%. You probably won’t detect that small effect \((d=0.2)\) with so few people in each group. One alternative is to keep running the experiment until you reach 394 users per group.
References
In chapter 2, we provide an extensive review of inference.↩︎