Subclassification and Matching

Subclassification Example

In this example, we are going to follow Cunningham (2021) and use a dataset of Titanic passengers. Download the .RDS file here.

The variables are described below:

Variables Description

Variable    Definition
class       Passenger’s class: 1 if located in the upper decks
age         Passenger’s age: 0 if Child, 1 if Adult
sex         Passenger’s gender: 0 if woman, 1 if man
survived    1 if survived, 0 otherwise

Out of 2,201 people among passengers and crew, only 711 survived. The role that wealth and social norms - women and children had priority for the lifeboats - played in a passenger’s probability of survival is well known, so let’s check whether being part of the first class made someone more likely to survive the tragedy. First, let’s compare survival rates between passengers in the upper deck and everyone else.

Create a variable first_class that labels passengers with class==1 as "Upper Deck" and everyone else as "Lower Decks". Then, use group_by() and summarize() to calculate the average of survived in each group:

library(tidyverse)
titanic <- readRDS("titanic.RDS")

# Label first-class passengers and compare survival rates by group
titanic$first_class <- ifelse(titanic$class == 1, "Upper Deck", "Lower Decks")
titanic %>% group_by(first_class) %>% summarize(survival_rate = mean(survived))
## # A tibble: 2 x 2
##   first_class survival_rate
##   <chr>               <dbl>
## 1 Lower Decks         0.271
## 2 Upper Deck          0.625

The naive comparison gives a difference in survival rates of 35.4 percentage points in favor of first-class passengers. That difference is considerable and also statistically significant:

t.test(survived~first_class, data=titanic)
## 
##  Welch Two Sample t-test
## 
## data:  survived by first_class
## t = -12.289, df = 423.61, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Lower Decks and group Upper Deck is not equal to 0
## 95 percent confidence interval:
##  -0.4104198 -0.2972332
## sample estimates:
## mean in group Lower Decks  mean in group Upper Deck 
##                 0.2707889                 0.6246154

However, we also know that women and children were explicitly given priority for boarding the very limited number of lifeboats. Hence, that difference might be biased if the two groups - lower and upper decks - are fundamentally different. If women and children were more likely to be seated in first class, then part of that difference is driven by social norms rather than by class itself.

What if we adjust for those two observable confounders (age and gender)? To use subclassification weighting:

  1. Stratify the data into four groups: male children, female children, male adults, and female adults

  2. Calculate the difference in survival probabilities for each group

  3. Calculate the number of people in each stratum among the non-first-class passengers and divide by the total non-first-class population. These are our strata-specific weights

  4. Calculate the weighted average survival rate using the strata weights

# Survival rate and number of passengers in each stratum, by deck
titanic %>% group_by(age, sex, first_class) %>% summarize(outcome = mean(survived), count = n())
## # A tibble: 8 x 5
## # Groups:   age, sex [4]
##          age       sex first_class outcome count
##    <dbl+lbl> <dbl+lbl> <chr>         <dbl> <int>
## 1 0 [child]  0 [women] Lower Decks   0.614    44
## 2 0 [child]  0 [women] Upper Deck    1         1
## 3 0 [child]  1 [man]   Lower Decks   0.407    59
## 4 0 [child]  1 [man]   Upper Deck    1         5
## 5 1 [adults] 0 [women] Lower Decks   0.626   281
## 6 1 [adults] 0 [women] Upper Deck    0.972   144
## 7 1 [adults] 1 [man]   Lower Decks   0.188  1492
## 8 1 [adults] 1 [man]   Upper Deck    0.326   175
# Stratum-specific differences in survival rates (Upper Deck - Lower Decks)
diff1 <- 1 - 0.614     # female children
diff2 <- 1 - 0.407     # male children
diff3 <- 0.972 - 0.626 # female adults
diff4 <- 0.326 - 0.188 # male adults

# Weight each difference by the stratum's share of the non-first-class population
# (44 + 59 + 281 + 1492 = 1876 non-first-class passengers)
weightedATE <- 44/1876*diff1 + 59/1876*diff2 + 281/1876*diff3 + 1492/1876*diff4
weightedATE
## [1] 0.189282

Using the subclassification weighting scheme, the average treatment effect (ATE) is 18.9 percentage points, compared to 35.4 percentage points from the simple comparison between first-class and non-first-class passengers.
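
If you prefer not to hard-code the group means and counts, the same weighted difference can be computed programmatically. Here is a minimal sketch that assumes the titanic data frame and first_class variable created above (pivot_wider() is loaded with the tidyverse):

# Put the Upper Deck and Lower Decks outcomes and counts side by side
strata <- titanic %>%
  group_by(age, sex, first_class) %>%
  summarize(outcome = mean(survived), count = n()) %>%
  ungroup() %>%
  pivot_wider(names_from = first_class, values_from = c(outcome, count))

# Weight each stratum's difference by its share of the non-first-class population
strata %>%
  mutate(diff = `outcome_Upper Deck` - `outcome_Lower Decks`,
         weight = `count_Lower Decks` / sum(`count_Lower Decks`)) %>%
  summarize(weighted_ATE = sum(weight * diff))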

Propensity Score Example

In this example, we follow Cunningham (2021). Download the NSW experimental data here and let’s replicate the table we saw in class.

The variables are described below:

Variables Description

Variable    Definition
data_id     Individual’s identification
treat       Treatment: 1 if participated in the job training, 0 otherwise
age         Age
educ        Years of education
black       Race: 1 if Black, 0 otherwise
hisp        Race: 1 if Hispanic, 0 otherwise
marr        Marital status: 1 if married, 0 otherwise
nodegree    1 if participant does not have a degree, 0 otherwise
re74        Earnings in 1974 (pre-program)
re75        Earnings in 1975 (pre-program)
re78        Earnings in 1978 (post-program)

First, read the .RDS file. Then, summarize the baseline characteristics to check for balance and get the difference in means of 1978 earnings (re78):

library(tidyverse)
exper_data<-readRDS("NSW_EXPER.RDS")
exper_data%>%group_by(treat)%>%summarize(Age=mean(age), Education=mean(educ), 
                                     Black=mean(black), Hispanic=mean(hisp),
                                     Married=mean(marr), `No degree`=mean(nodegree), 
                                     `Earnings in 1974`=mean(re74),
                                     `Earnings in 1975`=mean(re75), 
                                     `Earnings in 1978`=mean(re78), 
                                     `Number of Observations`=n())
## # A tibble: 2 x 11
##   treat   Age Education Black Hispanic Married `No degree` `Earnings in 1974`
##   <dbl> <dbl>     <dbl> <dbl>    <dbl>   <dbl>       <dbl>              <dbl>
## 1     0  25.1      10.1 0.827   0.108    0.154       0.835              2107.
## 2     1  25.8      10.3 0.843   0.0595   0.189       0.708              2096.
## # ... with 3 more variables: Earnings in 1975 <dbl>, Earnings in 1978 <dbl>,
## #   Number of Observations <int>

Since the NSW was randomized, the independence assumption \((Y_{1i}, Y_{0i}) \perp\!\!\!\perp D_{i}\) is satisfied and simple comparisons between means give the average treatment effect (ATE):

\[\frac{1}{N_{Treat}}\sum_{D_{i}=1} Y_{i}-\frac{1}{N_{Control}}\sum_{D_{i}=0}Y_{i} \approx \underbrace{E[Y_{1i}-Y_{0i}]}_{\text{Average Treatment Effect}}\]

mean_treated<-exper_data%>%filter(treat==1)%>%summarize(avg_earnings78=mean(re78))
mean_control<-exper_data%>%filter(treat==0)%>%summarize(avg_earnings78=mean(re78))

mean_treated-mean_control
##   avg_earnings78
## 1       1794.342

The estimated causal effect of the NSW job-training program is $1,794 in 1978 earnings.
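
If you also want a standard error attached to this difference, one option (a quick sketch using the fixest package, which we will use again below) is to regress re78 on the treatment dummy; the coefficient on treat reproduces the difference in means:

library(fixest)

# The coefficient on treat equals the treatment-control difference in means;
# se = "hetero" requests heteroskedasticity-robust standard errors
feols(re78 ~ treat, data = exper_data, se = "hetero")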

What if we replace the experimental control group with a random sample of Americans from the Current Population Survey? Download this data here and let’s see how it goes.

nonexper_data<-readRDS("NSW_NE.RDS")

nonexper_data%>%group_by(treat)%>%summarize(Age=mean(age), Education=mean(educ), 
                                     Black=mean(black), Hispanic=mean(hisp),
                                     Married=mean(marr), `No degree`=mean(nodegree), 
                                     `Earnings in 1974`=mean(re74),
                                     `Earnings in 1975`=mean(re75), 
                                     `Earnings in 1978`=mean(re78), 
                                     `Number of Observations`=n())
## # A tibble: 2 x 11
##   treat   Age Education  Black Hispanic Married `No degree` `Earnings in 1974`
##   <dbl> <dbl>     <dbl>  <dbl>    <dbl>   <dbl>       <dbl>              <dbl>
## 1     0  33.2      12.0 0.0735   0.0720   0.712       0.296             14017.
## 2     1  25.8      10.3 0.843    0.0595   0.189       0.708              2096.
## # ... with 3 more variables: Earnings in 1975 <dbl>, Earnings in 1978 <dbl>,
## #   Number of Observations <int>

As you can see, the NSW participants are younger, less educated, more likely to be Black, less likely to be married, more likely to have no degree, and have much lower pre-program earnings - the two groups are fundamentally different.

Nearest-Neighbor Matching

One possibility is to use a matched sample that better approximates the NSW participants using propensity scores - the conditional probability of receiving the treatment given covariate values (in this case, education, age, past earnings, etc.). To get those estimated probabilities, we can use a logit regression:

# Use the glm() function with
# the family = binomial(link = 'logit') option
logit_reg <- glm(treat ~ age + educ + marr + nodegree + black +
                   hisp + re74 + re75,
                 data = nonexper_data,
                 family = binomial(link = 'logit'))

And get its fitted values:

probs <- data.frame(prop_score = predict(logit_reg, type = "response"),
                     job_training = logit_reg$model$treat)
#View(probs)
head(probs,5)
##   prop_score job_training
## 1 0.24751114            1
## 2 0.07257909            1
## 3 0.25039795            1
## 4 0.47895432            1
## 5 0.44753066            1

One of the assumptions for matching is the existence of common support: there must be both treated and control units across the range of estimated propensity scores. In other words, common support ensures enough overlap in the characteristics of treated and control units to find suitable matches.
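
Before plotting, a quick numeric check of overlap (a sketch using the probs data frame built above) is to compare the range of estimated scores in each group:

# Range of estimated propensity scores by treatment status;
# very different ranges point to a common-support problem
probs %>%
  group_by(job_training) %>%
  summarize(min_score = min(prop_score),
            median_score = median(prop_score),
            max_score = max(prop_score))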

You can look at the distributions of the propensity score in the treated and control groups:

library(patchwork)
hist1<-probs  %>% 
  filter(job_training == 0) %>% 
  ggplot() +
  geom_histogram(aes(x = prop_score))+
  xlab("Probability of getting Job-training: control")+
  theme_bw()


hist2<-probs  %>% 
  filter(job_training == 1) %>% 
  ggplot() +
  geom_histogram(aes(x = prop_score))+
  xlab("Probability of getting Job-training: treated")+
  theme_bw()

hist1+hist2

It does not look good! Individuals with characteristics like those of the treatment group are scarce in the CPS sample we are using as a non-experimental control group.

A simple method for estimating the average treatment effect of the NSW training is to restrict the sample by finding pairs of observations with very similar propensity scores but different treatment statuses:

library(MatchIt)
m_out <- matchit(treat ~ age + educ +
                    marr + nodegree +
                   black + hisp + re74 + re75,
                 data = nonexper_data, method = "nearest", 
                 distance = "logit")

m_data <- match.data(m_out)
#View(m_data)
dim(m_data)
## [1] 370  14

The matched sample has 370 individuals: 185 in the treatment group and 185 in the control group. You might want to check whether the covariates are indeed balanced in the new dataset:

m_data%>%group_by(treat)%>%summarize(Age=mean(age), Education=mean(educ), 
                                     Black=mean(black), Hispanic=mean(hisp),
                                     Married=mean(marr), `No degree`=mean(nodegree), 
                                     `Earnings in 1974`=mean(re74),
                                     `Earnings in 1975`=mean(re75), 
                                     `Earnings in 1978`=mean(re78), 
                                     `Number of Observations`=n())
## # A tibble: 2 x 11
##   treat   Age Education Black Hispanic Married `No degree` `Earnings in 1974`
##   <dbl> <dbl>     <dbl> <dbl>    <dbl>   <dbl>       <dbl>              <dbl>
## 1     0  23.8      10.4 0.827   0.0432   0.135       0.681              1905.
## 2     1  25.8      10.3 0.843   0.0595   0.189       0.708              2096.
## # ... with 3 more variables: Earnings in 1975 <dbl>, Earnings in 1978 <dbl>,
## #   Number of Observations <int>
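
MatchIt can also report balance diagnostics directly from the fitted object; for instance, summary() on the m_out object above shows covariate means and differences in the full and matched samples:

# Built-in balance diagnostics for the matching output
summary(m_out)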

You can also run t.test() on each covariate and check whether the differences are statistically significant (they shouldn’t be) - for example, for years of education:
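
# Compare mean years of education between treated and control units
# in the matched sample; repeat the test for the other covariates as needed
t.test(educ ~ treat, data = m_data)

To get the average treatment effect, you can calculate the difference in 1978 average earnings between the treatment and control group or run a regression: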

library(fixest)
reg1 <- feols(re78 ~ treat, data = m_data, se = "hetero")
reg2 <- feols(re78 ~ treat + age + educ + black + hisp + marr + nodegree + re74 + re75,
              data = m_data, se = "hetero")

etable(reg1, reg2, signifCode = c("***" = 0.01, "**" = 0.05, "*" = 0.10))
##                               reg1              reg2
## Dependent Var.:               re78              re78
##                                                     
## (Intercept)     4,586.9*** (423.8)   199.9 (3,716.3)
## treat            1,762.2** (717.1) 1,678.8** (697.5)
## age                                   -6.903 (43.98)
## educ                                  428.4* (231.7)
## black                               -689.8 (1,050.8)
## hisp                               1,152.2 (1,872.4)
## marr                                 377.2 (1,089.4)
## nodegree                             271.5 (1,162.5)
## re74                                -0.0108 (0.1801)
## re75                                0.3189* (0.1764)
## _______________ __________________ _________________
## S.E. type       Heteroskedas.-rob. Heteroskeda.-rob.
## Observations                   370               370
## R2                         0.01615           0.05328
## Adj. R2                    0.01347           0.02961
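
The treat coefficients put the matched-sample estimate at roughly $1,679-$1,762, close to the experimental benchmark of $1,794. As a check, differencing average 1978 earnings between the two groups in the matched sample should reproduce the coefficient in reg1; a short sketch using m_data:

# Treated-minus-control difference in mean 1978 earnings in the matched sample;
# with 1:1 matching this equals the treat coefficient in reg1
m_data %>%
  group_by(treat) %>%
  summarize(avg_re78 = mean(re78)) %>%
  summarize(diff = diff(avg_re78))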

Weighting on the Propensity Score

Besides nearest-neighbor matching, there are several other ways to get average treatment effects using the estimated propensity score. Assuming the conditional independence assumption (CIA) holds, one can use the individuals’ propensity scores to weight their outcomes.

The first step is to generate the propensity scores using the logit regression:

nonexper_data<-readRDS("NSW_NE.RDS")

# Logit model to predict probability of treatment
logit_reg <- glm( treat ~ age + educ + marr + nodegree + black + 
              hisp + re74 + re75, data = nonexper_data,
          family = binomial(link = 'logit'))

# Generate propensity scores
nonexper_data$prop_scores<-logit_reg$fitted.values

After that, trim the data using a common rule of thumb: keep observations with estimated propensity scores in the interval [0.1, 0.9]. Then, calculate the inverse probability weights (IPW), so that treated units get weight \(1/Propensity\) and control units get weight \(1/(1-Propensity)\):

\[\frac{Treatment}{Propensity}+\frac{1-Treatment}{1-Propensity}\]

## Trimming the data
nonexper_data <- nonexper_data %>% filter(prop_scores >= 0.1 & prop_scores <= 0.9)

## Generating the IPW
nonexper_data <- nonexper_data %>%
  mutate(ipw = (treat / prop_scores) + ((1 - treat) / (1 - prop_scores)))

Finally, run a weighted regression using ipw as weights:

reg3 <- feols(re78 ~ treat, data = nonexper_data, se = "hetero", weights = nonexper_data$ipw)
reg4 <- feols(re78 ~ treat + age + educ + black + hisp + marr + nodegree + re74 + re75,
              data = nonexper_data, se = "hetero", weights = nonexper_data$ipw)
etable(reg3, reg4, signifCode = c("***" = 0.01, "**" = 0.05, "*" = 0.10))
##                               reg3              reg4
## Dependent Var.:               re78              re78
##                                                     
## (Intercept)     4,425.6*** (312.3) 2,542.1 (2,937.0)
## treat            1,713.8** (732.8) 1,686.3** (740.4)
## age                                   -42.71 (36.13)
## educ                                  299.6* (181.8)
## marr                                 74.42 (1,125.0)
## nodegree                            -448.5 (1,104.9)
## re74                                -0.1361 (0.1192)
## re75                                 0.3328 (0.2175)
## _______________ __________________ _________________
## S.E. type       Heteroskedas.-rob. Heteroskeda.-rob.
## Observations                   504               504
## R2                         0.01541           0.04135
## Adj. R2                    0.01345           0.02782

The average treatment effect using IPW is $1,686-$1,714, close to the $1,794 effect we estimated from the randomized trial.