1 RAND Health Insurance Experiment
1.1 Working with .RDS files
The first step is to set up your working directory. To keep things organized, I have a folder named Rlabs on my desktop, and inside it a separate folder for each lab (in this case, lab1). To change the working directory, use setwd() with the path that leads to the folder you want.
In this lab, we will use data from the RAND Health Insurance Experiment (HIE), which come in two datasets. The first file, rand_sample.RDS, has demographic information about the subjects in the study and also health variables (outcomes) both before and after the experiment. The second file, rand_spend.RDS, has information about health care spending.
setwd("C:/Users/User/Desktop/474-Rlab/datasets")
rand_sample<-readRDS("rand_sample.RDS")
rand_spend<-readRDS("rand_spend.RDS")
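If you have not worked with .RDS files before, the following minimal sketch shows a full round trip with saveRDS() and readRDS(). The data frame toy and the temporary file are made up for illustration and are not part of the RAND HIE files.

```r
# Hypothetical toy data frame (not the RAND HIE data).
toy <- data.frame(
  id       = 1:3,
  plantype = factor(c("Free", "Catastrophic", "Free"))
)

# Write the object to a temporary .RDS file and read it back.
tmp <- tempfile(fileext = ".RDS")
saveRDS(toy, tmp)
toy_back <- readRDS(tmp)

# An .RDS file stores a single R object, so column types such as
# factors come back unchanged.
all.equal(toy, toy_back)
str(toy_back)
```

Unlike a .csv file, an .RDS file keeps the column classes, which is why plantype is already a factor when you load the lab datasets.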
To see the first rows of a dataset, use the function head(), or use View(rand_sample) to open the data frame in a new tab.
#View(rand_sample)
head(rand_sample,5)
View(rand_spend)
#head(rand_spend,5)
Besides the column plantype, which identifies the assigned insurance group of each individual, the variables we are interested in are listed in Table 1.1:
Variable | Definition |
---|---|
rand_sample file | |
female | Female |
blackhisp | Nonwhite |
age | Age |
educper | Education |
income1cpi | Family Income |
hosp | Hospitalized last year |
ghindx | General Health Index (before) |
cholest | Cholesterol (mg/dl) (before) |
systol | Systolic blood pressure (mm Hg) (before) |
mhi | Mental Health Index (before) |
ghindxx | General Health Index (after) |
cholestx | Cholesterol (mg/dl) (after) |
systolx | Systolic blood pressure (mm Hg) (after) |
mhix | Mental Health Index (after) |
rand_spend file | |
ftf | Face-to-face visits |
out_inf | Outpatient expenses |
totadm | Hospital admissions |
inpdol_inf | Inpatient expenses |
tot_inf | Total expenses |
1.2 Summarizing data
Let’s say you want to compare demographic characteristics of the individuals in the RAND HIE across health insurance groups. To do that, you just need the functions group_by() and summarize() from the tidyverse package. Since there are some missing observations (NA), set na.rm=T so that mean() ignores them.
library(tidyverse)
rand_sample%>%group_by(plantype)%>%
summarize(Female=mean(female, na.rm=T),
Nonwhite=mean(blackhisp, na.rm=T),
Age=mean(age, na.rm=T),
Education=mean(educper, na.rm=T),
`Family Income`=mean(income1cpi, na.rm=T),
`Hospitalized last year`=mean(hosp, na.rm=T),
`General Health Index`=mean(ghindx, na.rm=T),
`Cholesterol (mg/dl)`=mean(cholest, na.rm=T),
`Systolic blood pressure (mm Hg)`=mean(systol, na.rm=T),
`Mental Health Index`=mean(mhi, na.rm=T),
`Number enrolled`=n())
## # A tibble: 4 x 12
## plantype Female Nonwhite Age Education `Family Income` `Hospitalized last year`
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Catastrophic 0.560 0.172 32.4 12.1 31603. 0.115
## 2 Deductible 0.537 0.153 32.9 11.9 29499. 0.120
## 3 Coinsurance 0.535 0.145 33.3 12.0 32573. 0.113
## 4 Free 0.522 0.144 32.8 11.8 30627. 0.116
## # ... with 5 more variables: General Health Index <dbl>, Cholesterol (mg/dl) <dbl>,
## # Systolic blood pressure (mm Hg) <dbl>, Mental Health Index <dbl>,
## # Number enrolled <int>
You can see that those values are the same as the ones in the lecture notes.
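When many columns need the same summary, across() can shorten the call. The sketch below uses a small simulated data frame (toy, with made-up values) rather than rand_sample, so the numbers mean nothing; only the pattern matters. It assumes dplyr 1.0 or later, which ships with the tidyverse.

```r
library(dplyr)

# Simulated stand-in for rand_sample; all values are made up.
set.seed(1)
toy <- data.frame(
  plantype = factor(rep(c("Catastrophic", "Free"), each = 50)),
  female   = rbinom(100, 1, 0.5),
  age      = c(round(runif(99, 14, 61)), NA),  # one missing value
  educper  = rnorm(100, 12, 3)
)

# Apply the same NA-tolerant mean to several columns at once,
# then count the rows in each group.
balance <- toy %>%
  group_by(plantype) %>%
  summarize(
    across(c(female, age, educper), ~ mean(.x, na.rm = TRUE)),
    `Number enrolled` = n()
  )
balance
```

The trade-off is that across() keeps the original column names, so the descriptive labels used above would have to be added in a separate rename() step.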
1.3 Checking for Balance
Although you can see the average values of demographic characteristics, we are unsure whether the difference in means across groups is statistically different from zero. We can perform a standard t-test comparing two groups. In this example, we compare the Catastrophic plan with the Free plan. Let’s try education first:
cat_vs_free<-rand_sample%>%filter(plantype=="Catastrophic"|plantype=="Free")
t.test(educper~plantype, data=cat_vs_free, alternative="two.sided")
##
## Welch Two Sample t-test
##
## data: educper by plantype
## t = 1.8039, df = 1478.5, p-value = 0.07145
## alternative hypothesis: true difference in means between group Catastrophic and group Free is not equal to 0
## 95 percent confidence interval:
## -0.02296019 0.54840275
## sample estimates:
## mean in group Catastrophic mean in group Free
## 12.10483 11.84211
According to the t-test, the difference of \(12.10483-11.84211=0.2627\) is not statistically significant at the 5% level, and we do not reject the null of equal means between groups.
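To see where the numbers in the Welch output come from, the sketch below recomputes the t statistic, the degrees of freedom, and the p-value by hand and compares them with t.test(). The vectors x and y are simulated for illustration; they are not the RAND HIE education data.

```r
# Simulated samples standing in for two insurance groups.
set.seed(42)
x <- rnorm(60, mean = 12.1, sd = 3.0)
y <- rnorm(80, mean = 11.8, sd = 3.5)

# Variance of each sample mean.
v_x <- var(x) / length(x)
v_y <- var(y) / length(y)

# Welch t statistic: difference in means over its standard error.
t_manual <- (mean(x) - mean(y)) / sqrt(v_x + v_y)

# Welch-Satterthwaite approximation for the degrees of freedom.
df_manual <- (v_x + v_y)^2 /
  (v_x^2 / (length(x) - 1) + v_y^2 / (length(y) - 1))

# Two-sided p-value from the t distribution.
p_manual <- 2 * pt(-abs(t_manual), df = df_manual)

# t.test() uses the Welch version by default (var.equal = FALSE).
welch <- t.test(x, y)
c(t_manual = t_manual, t_builtin = unname(welch$statistic))
c(df_manual = df_manual, df_builtin = unname(welch$parameter))
c(p_manual = p_manual, p_builtin = welch$p.value)
```

The non-integer degrees of freedom in the output above (1478.5) are a sign that the Welch approximation, not the pooled-variance test, is being used.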
What about family income?
t.test(income1cpi~plantype, data=cat_vs_free, alternative="two.sided")
##
## Welch Two Sample t-test
##
## data: income1cpi by plantype
## t = 1.1661, df = 1431, p-value = 0.2438
## alternative hypothesis: true difference in means between group Catastrophic and group Free is not equal to 0
## 95 percent confidence interval:
## -665.9016 2618.2711
## sample estimates:
## mean in group Catastrophic mean in group Free
## 31603.21 30627.02
Again, the p-value is higher than 0.05, and we cannot reject the null: there is no evidence that family income is different between the Catastrophic and the Free insurance groups.
As an exercise, try to compare all the demographic characteristics between insurance levels. Use Catastrophic as “control” and Deductible, Coinsurance and Free as “treatment” - do it using pairwise comparisons, e.g., Catastrophic x Deductible, Catastrophic x Coinsurance, and so on.
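One possible way to organize the pairwise comparisons is to loop over treatment groups and variables and collect the results in a single data frame. The sketch below is only a starting point and runs on simulated data (toy, with made-up values and only two variables); the object names are hypothetical, and you would swap in rand_sample and the full list of characteristics.

```r
# Simulated stand-in for rand_sample; all values are made up.
set.seed(7)
plans <- c("Catastrophic", "Deductible", "Coinsurance", "Free")
toy <- data.frame(
  plantype = factor(sample(plans, 400, replace = TRUE), levels = plans),
  educper  = rnorm(400, 12, 3),
  age      = rnorm(400, 33, 10)
)

vars       <- c("educper", "age")
treatments <- setdiff(plans, "Catastrophic")

# For each treatment group, keep it plus the control group and
# run one Welch t-test per variable.
results <- do.call(rbind, lapply(treatments, function(trt) {
  pair <- droplevels(subset(toy, plantype %in% c("Catastrophic", trt)))
  do.call(rbind, lapply(vars, function(v) {
    tt <- t.test(reformulate("plantype", response = v), data = pair)
    data.frame(
      treatment  = trt,
      variable   = v,
      # Control mean minus treatment mean (Catastrophic is the first level).
      difference = unname(tt$estimate[1] - tt$estimate[2]),
      p_value    = tt$p.value
    )
  }))
}))
results
```

With many comparisons, a few p-values below 0.05 can appear by chance alone, so it helps to look at the whole table rather than at single rows.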
1.4 Results of the Experiment
As we saw in class, subjects assigned to more generous insurance plans used substantially more health care. Let’s compare outpatient expenses and face-to-face visits between the Catastrophic group and the other groups together (we call it any_ins).
rand_spend$any_ins<-ifelse(rand_spend$plantype=="Catastrophic", "Catastrophic","Any Insurance")
t.test(ftf~any_ins, data=rand_spend,alternative="two.sided")
##
## Welch Two Sample t-test
##
## data: ftf by any_ins
## t = 8.6922, df = 6290.9, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Any Insurance and group Catastrophic is not equal to 0
## 95 percent confidence interval:
## 0.6961629 1.1016118
## sample estimates:
## mean in group Any Insurance mean in group Catastrophic
## 3.682990 2.784103
The almost zero p-value gives us confidence that the difference in face-to-face visits between those with some insurance and the Catastrophic group is statistically significant. One can see the same for outpatient expenses below:
t.test(out_inf~any_ins, data=rand_spend, alternative="two.sided")
##
## Welch Two Sample t-test
##
## data: out_inf by any_ins
## t = 10.992, df = 6274.9, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Any Insurance and group Catastrophic is not equal to 0
## 95 percent confidence interval:
## 82.67062 118.56030
## sample estimates:
## mean in group Any Insurance mean in group Catastrophic
## 348.4137 247.7983
1.5 Equivalence of Differences in Means and Regression
Instead of performing a t-test for differences in means, one can run regressions and get the same results. Regression plays an important role in empirical economic research and can be easily applied to experimental data. The advantage is that you can add controls and adjust the standard errors (we will talk about that later).
Let’s first create a dummy that is equal to 1 if the individual has “any insurance” (i.e., is assigned to the Deductible, Coinsurance, or Free group) and zero otherwise:
rand_spend$dummy_ins<-ifelse(rand_spend$any_ins=="Any Insurance", 1,0)
Then, use lm() to perform a linear regression of face-to-face visits on the dummy that identifies the comparison groups:
reg1<-lm(ftf~dummy_ins, data=rand_spend)
summary(reg1)
##
## Call:
## lm(formula = ftf ~ dummy_ins, data = rand_spend)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.683 -2.784 -1.683 0.317 140.317
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.7841 0.1036 26.876 < 2e-16 ***
## dummy_ins 0.8989 0.1147 7.837 4.85e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.322 on 20201 degrees of freedom
## Multiple R-squared: 0.003031, Adjusted R-squared: 0.002982
## F-statistic: 61.42 on 1 and 20201 DF, p-value: 4.849e-15
The coefficient 0.8989 represents the difference in average face-to-face visits between the two insurance groups. As one can see, the coefficient is statistically significant (p-value<0.05).
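As a quick check of this interpretation: with a single dummy regressor, OLS reproduces the two group means. Writing \(Y_i\) for ftf and \(D_i\) for dummy_ins (notation introduced here only for illustration), and plugging in the group means reported by the earlier t-test:

\[
Y_i = \alpha + \beta D_i + \varepsilon_i, \qquad \hat{\alpha} = \bar{Y}_{D=0}, \qquad \hat{\beta} = \bar{Y}_{D=1} - \bar{Y}_{D=0}
\]

\[
\hat{\alpha} = 2.784103 \approx 2.7841, \qquad \hat{\beta} = 3.682990 - 2.784103 = 0.898887 \approx 0.8989
\]

So the intercept is the average number of visits in the Catastrophic group, and the slope is the gap between the two groups.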
When you perform the t-test for difference in means with the option var.equal=TRUE (i.e., assuming equal variance), you get the same standard error, t statistic, and p-value as in the regression. Standard OLS assumes homoskedasticity, which is why you need to set var.equal=TRUE to match it.
t.test(ftf~any_ins, data=rand_spend,alternative="two.sided", var.equal = TRUE)
##
## Two Sample t-test
##
## data: ftf by any_ins
## t = 7.8368, df = 20201, p-value = 4.849e-15
## alternative hypothesis: true difference in means between group Any Insurance and group Catastrophic is not equal to 0
## 95 percent confidence interval:
## 0.6740646 1.1237101
## sample estimates:
## mean in group Any Insurance mean in group Catastrophic
## 3.682990 2.784103
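The same equivalence can be checked on any dataset. The sketch below simulates an outcome and a treatment dummy (all values and object names are made up) and compares the OLS t statistic with the pooled-variance and Welch t-tests.

```r
# Simulated outcome and treatment dummy; values are made up.
set.seed(123)
n     <- 500
dummy <- rbinom(n, 1, 0.7)
y     <- 2.8 + 0.9 * dummy + rnorm(n, sd = 6)
group <- ifelse(dummy == 1, "Any Insurance", "Catastrophic")

# OLS on the dummy: the slope is the difference in group means.
fit <- lm(y ~ dummy)
ols <- summary(fit)$coefficients["dummy", ]

# Pooled-variance t-test (matches OLS) and Welch t-test (does not).
pooled <- t.test(y ~ group, var.equal = TRUE)
welch  <- t.test(y ~ group)

c(ols_t    = unname(ols["t value"]),
  pooled_t = unname(pooled$statistic),
  welch_t  = unname(welch$statistic))
```

The first two entries agree because both estimates use a single pooled residual variance; the Welch statistic differs because it lets each group have its own variance.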
Doing the same for outpatient expenses:
reg2<-lm(out_inf~dummy_ins, data=rand_spend)
summary(reg2)
##
## Call:
## lm(formula = out_inf ~ dummy_ins, data = rand_spend)
##
## Residuals:
## Min 1Q Median 3Q Max
## -348.4 -290.3 -175.5 69.2 12527.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 247.798 9.153 27.073 <2e-16 ***
## dummy_ins 100.615 10.135 9.928 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 558.6 on 20201 degrees of freedom
## Multiple R-squared: 0.004855, Adjusted R-squared: 0.004806
## F-statistic: 98.56 on 1 and 20201 DF, p-value: < 2.2e-16
t.test(out_inf~any_ins, data=rand_spend, alternative="two.sided", var.equal = TRUE)
##
## Two Sample t-test
##
## data: out_inf by any_ins
## t = 9.9278, df = 20201, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Any Insurance and group Catastrophic is not equal to 0
## 95 percent confidence interval:
## 80.75058 120.48034
## sample estimates:
## mean in group Any Insurance mean in group Catastrophic
## 348.4137 247.7983
What about the health outcomes? Compare the average health outcomes after the experiment (ghindxx, cholestx, systolx, and mhix) between the Catastrophic and any insurance groups using regression. Do you see any statistically significant coefficient?