1 RAND Health Insurance Experiment

1.1 Working with .RDS files

The first step is to set your working directory. To keep things organized, I have a folder named Rlabs on my desktop, with a separate folder inside it for each lab (in this case, lab1). To change the working directory, call setwd() with the path to the folder you want. The path in the code below is only an example; replace it with your own.

In this lab, we will use two datasets from the RAND Health Insurance Experiment (HIE). The first file, rand_sample.RDS, has demographic information about the subjects in the study as well as health variables (outcomes) measured both before and after the experiment. The second file, rand_spend.RDS, has information about health care spending.

setwd("C:/Users/User/Desktop/474-Rlab/datasets")
rand_sample<-readRDS("rand_sample.RDS")
rand_spend<-readRDS("rand_spend.RDS")
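
For readers new to the format: an .RDS file stores a single R object, and readRDS() returns that object so you can assign it to any name you like. A minimal sketch of the round trip, using a made-up data frame (toy) and a temporary file rather than the lab's data:

```r
# Toy data frame (hypothetical values, not the RAND data)
toy <- data.frame(
  plantype = factor(c("Free", "Catastrophic", "Free")),
  age      = c(34, 51, 27)
)

# Write the object to a temporary .RDS file, then read it back
tmp <- tempfile(fileext = ".RDS")
saveRDS(toy, tmp)
toy_back <- readRDS(tmp)

# Column types (e.g., the factor) survive the round trip
all.equal(toy, toy_back)
class(toy_back$plantype)
```

This is one reason to prefer .RDS over a plain text format for intermediate files: factors and other column types do not need to be rebuilt after loading.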

If you want to see the first rows of a dataset, use the function head(). To open the whole data frame in a new tab, use View(rand_sample).

#View(rand_sample)
head(rand_sample,5)
View(rand_spend)
#head(rand_spend,5)
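
Beyond head() and View(), a few base R functions give a quick overview of a data frame's size, column types, and missing values. A small sketch on a made-up data frame (toy is hypothetical; the same calls should work on rand_sample and rand_spend):

```r
# Toy data frame (hypothetical values, not the RAND data)
toy <- data.frame(
  plantype = factor(c("Free", "Catastrophic", "Free", "Deductible")),
  educper  = c(12, NA, 16, 11)
)

dim(toy)               # number of rows and columns
str(toy)               # type of each column
na_counts <- colSums(is.na(toy))  # missing values per column
na_counts
```

Counting the NAs up front is useful here because several of the RAND variables have missing observations.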

Besides the column plantype, which identifies the assigned insurance group of each individual, the variables that we are looking for are listed in Table 1.1:

Table 1.1: Variables Description

Variable     Definition
rand_sample file
female       Female
blackhisp    Nonwhite
age          Age
educper      Education
income1cpi   Family Income
hosp         Hospitalized last year
ghindx       General Health Index (before)
cholest      Cholesterol (mg/dl) (before)
systol       Systolic blood pressure (mm Hg) (before)
mhi          Mental Health Index (before)
ghindxx      General Health Index (after)
cholestx     Cholesterol (mg/dl) (after)
systolx      Systolic blood pressure (mm Hg) (after)
mhix         Mental Health Index (after)
rand_spend file
ftf          Face-to-face visits
out_inf      Outpatient expenses
totadm       Hospital admissions
inpdol_inf   Inpatient expenses
tot_inf      Total expenses

1.2 Summarizing data

Let’s say you want to compare demographic characteristics of the individuals in the RAND HIE across health insurance groups. To do that, you just need the functions group_by() and summarize() from dplyr, which is loaded as part of the tidyverse. Since some observations are missing (NA), set the na.rm argument of mean() so that those NAs are ignored.

library(tidyverse)

# Mean of each baseline characteristic by insurance plan, ignoring NAs
rand_sample %>%
  group_by(plantype) %>%
  summarize(
    Female = mean(female, na.rm = TRUE),
    Nonwhite = mean(blackhisp, na.rm = TRUE),
    Age = mean(age, na.rm = TRUE),
    Education = mean(educper, na.rm = TRUE),
    `Family Income` = mean(income1cpi, na.rm = TRUE),
    `Hospitalized last year` = mean(hosp, na.rm = TRUE),
    `General Health Index` = mean(ghindx, na.rm = TRUE),
    `Cholesterol (mg/dl)` = mean(cholest, na.rm = TRUE),
    `Systolic blood pressure (mm Hg)` = mean(systol, na.rm = TRUE),
    `Mental Health Index` = mean(mhi, na.rm = TRUE),
    `Number enrolled` = n()  # group size
  )
## # A tibble: 4 x 12
##   plantype     Female Nonwhite   Age Education `Family Income` `Hospitalized last year`
##   <fct>         <dbl>    <dbl> <dbl>     <dbl>           <dbl>                    <dbl>
## 1 Catastrophic  0.560    0.172  32.4      12.1          31603.                    0.115
## 2 Deductible    0.537    0.153  32.9      11.9          29499.                    0.120
## 3 Coinsurance   0.535    0.145  33.3      12.0          32573.                    0.113
## 4 Free          0.522    0.144  32.8      11.8          30627.                    0.116
## # ... with 5 more variables: General Health Index <dbl>, Cholesterol (mg/dl) <dbl>,
## #   Systolic blood pressure (mm Hg) <dbl>, Mental Health Index <dbl>,
## #   Number enrolled <int>

You can see that those values are the same as the ones in the lecture notes.
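
The summarize() call above repeats the same mean() for every variable. As a sketch of a more compact alternative, assuming dplyr 1.0.0 or later (needed for across()) and using a made-up data frame instead of the lab's data:

```r
library(dplyr)

# Toy data (hypothetical values, not the RAND data)
toy <- data.frame(
  plantype = c("Free", "Free", "Catastrophic", "Catastrophic"),
  female   = c(1, 0, 1, NA),
  age      = c(30, 40, 50, 20)
)

toy_means <- toy %>%
  group_by(plantype) %>%
  summarize(
    across(c(female, age), ~ mean(.x, na.rm = TRUE)),  # same function on each column
    n = n()                                            # group size
  )
toy_means
```

The trade-off is that the output columns keep the raw variable names (female, age) rather than the descriptive labels used above, so the explicit version may still be preferable for a table meant for readers.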

1.3 Checking for Balance

Although you can see the average values of the demographic characteristics, we do not yet know whether the differences in means across groups are statistically different from zero. We can perform a standard t-test comparing two groups. In this example, we compare the Catastrophic plan with the Free plan. Let’s try education first:

cat_vs_free<-rand_sample%>%filter(plantype=="Catastrophic"|plantype=="Free")

t.test(educper~plantype, data=cat_vs_free, alternative="two.sided")
## 
##  Welch Two Sample t-test
## 
## data:  educper by plantype
## t = 1.8039, df = 1478.5, p-value = 0.07145
## alternative hypothesis: true difference in means between group Catastrophic and group Free is not equal to 0
## 95 percent confidence interval:
##  -0.02296019  0.54840275
## sample estimates:
## mean in group Catastrophic         mean in group Free 
##                   12.10483                   11.84211

According to the t-test, the difference of \(12.10483-11.84211=0.2627\) is not statistically significant at the 5% level (p-value = 0.07145), so we do not reject the null of equal means between groups.
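
To see what t.test() computes by default, the Welch statistic, its degrees of freedom, and the p-value can be reproduced by hand. The sketch below uses simulated data for two hypothetical groups (x and y are not the RAND data, and the seed is arbitrary):

```r
set.seed(474)  # arbitrary seed, for reproducibility only

# Simulated outcomes for two hypothetical groups (not the RAND data)
x <- rnorm(50, mean = 12.1, sd = 3)
y <- rnorm(80, mean = 11.8, sd = 2)

# Welch statistic: difference in means over its unequal-variance standard error
vx <- var(x) / length(x)
vy <- var(y) / length(y)
t_manual <- (mean(x) - mean(y)) / sqrt(vx + vy)

# Welch-Satterthwaite degrees of freedom (this is why df is not an integer)
df_manual <- (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))

# Two-sided p-value from the t distribution
p_manual <- 2 * pt(-abs(t_manual), df = df_manual)

# Compare with R's default t.test()
tt <- t.test(x, y)
c(manual = t_manual, t.test = unname(tt$statistic))
c(manual = p_manual, t.test = tt$p.value)
```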

What about family income?

t.test(income1cpi~plantype, data=cat_vs_free, alternative="two.sided")
## 
##  Welch Two Sample t-test
## 
## data:  income1cpi by plantype
## t = 1.1661, df = 1431, p-value = 0.2438
## alternative hypothesis: true difference in means between group Catastrophic and group Free is not equal to 0
## 95 percent confidence interval:
##  -665.9016 2618.2711
## sample estimates:
## mean in group Catastrophic         mean in group Free 
##                   31603.21                   30627.02

Again, the p-value is higher than 0.05, and we cannot reject the null: there is no evidence that family income is different between the Catastrophic and the Free insurance groups.

As an exercise, try to compare all the demographic characteristics between insurance levels. Use Catastrophic as the “control” and Deductible, Coinsurance, and Free as the “treatments”, and do it using pairwise comparisons, e.g., Catastrophic vs. Deductible, Catastrophic vs. Coinsurance, and so on.
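
One possible way to organize these pairwise comparisons is to loop over every variable and treatment pair instead of typing each t.test() call. The sketch below runs on simulated data; the data frame toy, its values, and the two variables chosen are illustrative assumptions, not the lab's data or its expected answer:

```r
set.seed(1)  # arbitrary seed, for reproducibility only

# Simulated stand-in for rand_sample (hypothetical values)
plans <- c("Catastrophic", "Deductible", "Coinsurance", "Free")
toy <- data.frame(
  plantype = factor(rep(plans, each = 30), levels = plans),
  age      = rnorm(120, mean = 33, sd = 10),
  educper  = rnorm(120, mean = 12, sd = 3)
)

treatments <- c("Deductible", "Coinsurance", "Free")
vars       <- c("age", "educper")

# One row per (variable, treatment) pair
results <- expand.grid(variable = vars, treatment = treatments,
                       stringsAsFactors = FALSE)

# Welch t-test of each treatment group against Catastrophic
results$p_value <- unname(mapply(function(v, g) {
  sub <- toy[toy$plantype %in% c("Catastrophic", g), ]
  t.test(reformulate("plantype", response = v), data = sub)$p.value
}, results$variable, results$treatment))

results
```

With many comparisons, a few p-values below 0.05 are expected by chance alone, so it is worth looking at the table as a whole rather than at isolated rows.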

1.4 Results of the Experiment

As we saw in class, subjects assigned to more generous insurance plans used substantially more health care. Let’s compare outpatient expenses and face-to-face visits between the Catastrophic group and the other groups pooled together (we call the grouping variable any_ins).

rand_spend$any_ins<-ifelse(rand_spend$plantype=="Catastrophic", "Catastrophic","Any Insurance")
t.test(ftf~any_ins, data=rand_spend,alternative="two.sided")
## 
##  Welch Two Sample t-test
## 
## data:  ftf by any_ins
## t = 8.6922, df = 6290.9, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Any Insurance and group Catastrophic is not equal to 0
## 95 percent confidence interval:
##  0.6961629 1.1016118
## sample estimates:
## mean in group Any Insurance  mean in group Catastrophic 
##                    3.682990                    2.784103

The near-zero p-value indicates that the difference in face-to-face visits between the Any Insurance group and the Catastrophic group is statistically significant: about 0.9 more visits on average (3.68 versus 2.78). The same holds for outpatient expenses:

t.test(out_inf~any_ins, data=rand_spend, alternative="two.sided")
## 
##  Welch Two Sample t-test
## 
## data:  out_inf by any_ins
## t = 10.992, df = 6274.9, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Any Insurance and group Catastrophic is not equal to 0
## 95 percent confidence interval:
##   82.67062 118.56030
## sample estimates:
## mean in group Any Insurance  mean in group Catastrophic 
##                    348.4137                    247.7983
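
A recode like any_ins is easy to get subtly wrong, so it can help to cross-tabulate the new variable against the original one. A minimal sketch on made-up plan assignments (hypothetical values, not the RAND data):

```r
# Toy plan assignments (hypothetical, not the RAND data)
plantype <- factor(c("Catastrophic", "Deductible", "Coinsurance", "Free", "Free"))
any_ins  <- ifelse(plantype == "Catastrophic", "Catastrophic", "Any Insurance")

# Every original plan should map to exactly one recoded group
recode_check <- table(plantype, any_ins)
recode_check
```

Note also that t.test() with a formula orders the two groups alphabetically, which is why the outputs above report Any Insurance first and the difference as Any Insurance minus Catastrophic.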

1.5 Equivalence of Differences in Means and Regression

Instead of performing a t-test for differences in means, one can run a regression and get the same results. Regression plays an important role in empirical economic research and is easy to apply to experimental data. Its advantage over the t-test is that you can add control variables and adjust the standard errors (we will talk about that later).

Let’s first create a dummy that is equal to 1 if the individual has “any insurance” (i.e., is assigned to the Deductible, Coinsurance, or Free group) and zero otherwise:

rand_spend$dummy_ins<-ifelse(rand_spend$any_ins=="Any Insurance", 1,0)

Then, use lm() to run a linear regression of face-to-face visits on the dummy that identifies the comparison groups:

reg1<-lm(ftf~dummy_ins, data=rand_spend)
summary(reg1)
## 
## Call:
## lm(formula = ftf ~ dummy_ins, data = rand_spend)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
##  -3.683  -2.784  -1.683   0.317 140.317 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.7841     0.1036  26.876  < 2e-16 ***
## dummy_ins     0.8989     0.1147   7.837 4.85e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.322 on 20201 degrees of freedom
## Multiple R-squared:  0.003031,   Adjusted R-squared:  0.002982 
## F-statistic: 61.42 on 1 and 20201 DF,  p-value: 4.849e-15

The coefficient on dummy_ins, 0.8989, is the difference in mean face-to-face visits between the Any Insurance and Catastrophic groups (3.6830 - 2.7841), and the intercept, 2.7841, is the Catastrophic group mean. The coefficient is statistically significant (p-value < 0.05).

If you run the t-test for a difference in means with the option var.equal=TRUE (i.e., assuming equal variances in the two groups), you get the same standard error, t statistic, and p-value as in the regression. Standard OLS assumes homoskedasticity, which is why you need to set var.equal=TRUE to match it; the default Welch test used in Section 1.4 does not make that assumption.

t.test(ftf~any_ins, data=rand_spend,alternative="two.sided", var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  ftf by any_ins
## t = 7.8368, df = 20201, p-value = 4.849e-15
## alternative hypothesis: true difference in means between group Any Insurance and group Catastrophic is not equal to 0
## 95 percent confidence interval:
##  0.6740646 1.1237101
## sample estimates:
## mean in group Any Insurance  mean in group Catastrophic 
##                    3.682990                    2.784103
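
The equivalence does not depend on the RAND data. As a sketch, the same check can be run on simulated data (the variables d and y, the sample sizes, and the seed below are arbitrary assumptions):

```r
set.seed(474)  # arbitrary seed, for reproducibility only

# Simulated data (hypothetical): a 0/1 treatment dummy and an outcome
d <- rep(c(0, 1), times = c(40, 100))
y <- 2.8 + 0.9 * d + rnorm(140, sd = 3)

# Regression of the outcome on the dummy
fit   <- lm(y ~ d)
t_ols <- summary(fit)$coefficients["d", "t value"]

# Pooled-variance t-test, treated minus control
tt <- t.test(y[d == 1], y[d == 0], var.equal = TRUE)

# Same t statistic from both approaches
c(ols = t_ols, t.test = unname(tt$statistic))

# The slope equals the difference in group means
slope     <- unname(coef(fit)["d"])
mean_diff <- mean(y[d == 1]) - mean(y[d == 0])
c(slope = slope, diff = mean_diff)
```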

Doing the same for outpatient expenses:

reg2<-lm(out_inf~dummy_ins, data=rand_spend)
summary(reg2)
## 
## Call:
## lm(formula = out_inf ~ dummy_ins, data = rand_spend)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
##  -348.4  -290.3  -175.5    69.2 12527.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  247.798      9.153  27.073   <2e-16 ***
## dummy_ins    100.615     10.135   9.928   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 558.6 on 20201 degrees of freedom
## Multiple R-squared:  0.004855,   Adjusted R-squared:  0.004806 
## F-statistic: 98.56 on 1 and 20201 DF,  p-value: < 2.2e-16

t.test(out_inf~any_ins, data=rand_spend, alternative="two.sided", var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  out_inf by any_ins
## t = 9.9278, df = 20201, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Any Insurance and group Catastrophic is not equal to 0
## 95 percent confidence interval:
##   80.75058 120.48034
## sample estimates:
## mean in group Any Insurance  mean in group Catastrophic 
##                    348.4137                    247.7983
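
The Welch statistics in Section 1.4 (t = 8.6922 for ftf) differ from the regression ones (t = 7.837) only because of the standard error. As a preview of the standard-error adjustments mentioned earlier, the sketch below computes one heteroskedasticity-robust variant (usually called HC2) by hand in base R and compares it with the Welch standard error. The data are simulated, and the choice of HC2 is an assumption made for illustration, not necessarily the method used later in the course:

```r
set.seed(474)  # arbitrary seed, for reproducibility only

# Simulated data (hypothetical) with unequal variances across groups
d <- rep(c(0, 1), times = c(40, 100))
y <- 2.8 + 0.9 * d + rnorm(140, sd = ifelse(d == 1, 5, 2))

fit <- lm(y ~ d)
X   <- model.matrix(fit)  # columns: intercept, d
e   <- resid(fit)
h   <- hatvalues(fit)

# HC2 sandwich estimator: (X'X)^-1 X' diag(e^2 / (1 - h)) X (X'X)^-1
bread  <- solve(crossprod(X))
meat   <- crossprod(X * (e / sqrt(1 - h)))
V      <- bread %*% meat %*% bread
se_hc2 <- sqrt(V[2, 2])   # second coefficient is the dummy

# Welch (unequal-variance) standard error of the difference in means
se_welch <- sqrt(var(y[d == 1]) / sum(d == 1) + var(y[d == 0]) / sum(d == 0))

c(hc2 = se_hc2, welch = se_welch)
```

With a single dummy regressor the two standard errors coincide, so the default Welch test can be read as a regression with this particular robust standard error.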

What about the health outcomes? Compare the average health outcomes after the experiment - ghindxx, cholestx, systolx, mhix - between the Catastrophic and any insurance groups using regression. These variables are in rand_sample, so you will need to create the insurance dummy in that data frame first. Do you see any statistically significant coefficient?