class: center, middle, inverse, title-slide # Econ 474 - Econometrics of Policy Evaluation ## Subclassification and Matching ### Marcelino Guerra ### February 09-14, 2021 --- # Example: Smoking and Lung Cancer I .pull-left[ * In the mid to late `\(20^{th}\)` century, the rise in lung cancer was a significant public health concern. During that time, studies started to suggest a causal link between smoking and lung cancer. We know that naive comparisons of lung cancer incidence between smokers and non-smokers may lead to biased estimates: those two groups might be fundamentally different in ways that are related to the incidence of lung cancer * For example, there might be an unobservable genetic element that both causes people to smoke and independently causes people to develop lung cancer * Even though no one conducted an experimental study on the subject, the causal link between smoking and lung cancer is widely accepted nowadays ] .pull-right[ The table shows mortality rates by country and smoking group. The highest death rates in every country belong to cigar and pipe smokers, not to cigarette smokers, which is surprising since cigar and pipe smokers often do not inhale. Hence, they accumulate less tar (the toxic substance that damages the lungs), and we would expect higher mortality rates among cigarette smokers. 
<table class="table table-striped table-condensed" style="font-size: 22px; margin-left: auto; margin-right: auto;"> <thead> <tr><th style="border-bottom:hidden;padding-bottom:0; padding-left:3px;padding-right:3px;text-align: center; " colspan="4"><div style="border-bottom: 1px solid #ddd; padding-bottom: 5px; ">Death rates per 1,000 people</div></th></tr> <tr> <th style="text-align:left;"> Smoking group </th> <th style="text-align:right;"> Canada </th> <th style="text-align:right;"> UK </th> <th style="text-align:right;"> US </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Non-smokers </td> <td style="text-align:right;"> 20.2 </td> <td style="text-align:right;"> 11.3 </td> <td style="text-align:right;"> 13.5 </td> </tr> <tr> <td style="text-align:left;"> Cigarettes </td> <td style="text-align:right;"> 20.5 </td> <td style="text-align:right;"> 14.1 </td> <td style="text-align:right;"> 13.5 </td> </tr> <tr> <td style="text-align:left;"> Cigars/pipes </td> <td style="text-align:right;"> 35.5 </td> <td style="text-align:right;"> 20.7 </td> <td style="text-align:right;"> 17.4 </td> </tr> </tbody> </table> .center[.small[**Note: Table 5.1, Cunningham (2021) based on Cochran (1968).**]] ] --- # Example: Smoking and Lung Cancer II .pull-left[ * Those naive comparisons ignore the fact that the three groups in question are fundamentally different. Let's take a look at the average ages of people in each group * As one can see, older people were more likely to smoke cigars and pipes. Obviously, age also contributes to mortality rates * In this case, since older people die at a higher rate and for other reasons than smoking cigars, maybe age is responsible for the higher mortality rate instead of cigars. The same reasoning applies to cigarette smokers: they are, on average, younger; hence, the low mortality rate ] .pull-right[ The selection bias arises with age - the omitted variable. What if we condition mortality rates on age? 
<table class="table table-striped table-condensed" style="font-size: 22px; margin-left: auto; margin-right: auto;"> <thead> <tr><th style="border-bottom:hidden;padding-bottom:0; padding-left:3px;padding-right:3px;text-align: center; " colspan="4"><div style="border-bottom: 1px solid #ddd; padding-bottom: 5px; ">Average age</div></th></tr> <tr> <th style="text-align:left;"> Smoking group </th> <th style="text-align:right;"> Canada </th> <th style="text-align:right;"> UK </th> <th style="text-align:right;"> US </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Non-smokers </td> <td style="text-align:right;"> 54.9 </td> <td style="text-align:right;"> 49.1 </td> <td style="text-align:right;"> 57.0 </td> </tr> <tr> <td style="text-align:left;"> Cigarettes </td> <td style="text-align:right;"> 50.5 </td> <td style="text-align:right;"> 49.8 </td> <td style="text-align:right;"> 53.2 </td> </tr> <tr> <td style="text-align:left;"> Cigars/pipes </td> <td style="text-align:right;"> 65.9 </td> <td style="text-align:right;"> 55.7 </td> <td style="text-align:right;"> 59.7 </td> </tr> </tbody> </table> .center[.small[**Note: Table 5.2, Cunningham (2021) based on Cochran (1968).**]] ] --- # Subclassification .pull-left[ In this example, subclassification works in the following way: 1. Stratify the data into age groups. For instance, ages 20-40, ages 41-70, and 71+ 2. Calculate death rates within age groups for each smoking group 3. Weight the treatment group's stratum-specific mortality rates by the share of the control group that falls in each stratum. 
This procedure gives us the age-adjusted mortality rate for the treatment group <table class="table table-striped table-condensed" style="font-size: 19px; margin-left: auto; margin-right: auto;"> <thead> <tr><th style="border-bottom:hidden;padding-bottom:0; padding-left:3px;padding-right:3px;text-align: center; " colspan="4"><div style="border-bottom: 1px solid #ddd; padding-bottom: 5px; ">Adjusted Death rates per 1,000 people</div></th></tr> <tr> <th style="text-align:left;"> Smoking group </th> <th style="text-align:right;"> Canada </th> <th style="text-align:right;"> UK </th> <th style="text-align:right;"> US </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Non-smokers </td> <td style="text-align:right;"> 20.2 </td> <td style="text-align:right;"> 11.3 </td> <td style="text-align:right;"> 13.5 </td> </tr> <tr> <td style="text-align:left;"> Cigarettes </td> <td style="text-align:right;"> 29.5 </td> <td style="text-align:right;"> 14.8 </td> <td style="text-align:right;"> 21.2 </td> </tr> <tr> <td style="text-align:left;"> Cigars/pipes </td> <td style="text-align:right;"> 19.8 </td> <td style="text-align:right;"> 11.0 </td> <td style="text-align:right;"> 13.7 </td> </tr> </tbody> </table> .center[.small[**Note: Table 5.4, Cunningham (2021) based on Cochran (1968).**]] ] .pull-right[ * After adjusting for the age distribution, cigarette smokers have the highest death rates in all three countries, while the rates for cigar and pipe smokers fall to roughly those of non-smokers. That "adjustment" raises a question: which variables should we use for it? We need to choose a set of covariates for which the Conditional Independence Assumption is credible * As the number of covariates grows, strata-specific weights may become infeasible because many cells contain no observations. **This is called the curse of dimensionality** ] --- class: inverse, middle, center # Exact Matching --- # Exact Matching I .pull-left[ * The table shows a list of participants in a job training program and a list of non-participants. 
You might be curious about the effects of job training on salaries comparing the average earnings of the treatment (trainees) and control (non-trainees) groups * The naive comparison gives a $26.25 difference in salaries favoring the non-participants in the job training. However, in the same table, you also see that non-participants are, on average, older. Since wages usually rise with age, maybe people in the control group have higher earnings because they are older. Hence, that is not an apples-to-apples comparison ] .pull-right[ <div style="border: 1px solid #ddd; padding: 0px; overflow-y: scroll; height:500px; overflow-x: scroll; width:100%; "><table class="table table-striped table-condensed" style="font-size: 20px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="border-bottom:hidden;padding-bottom:0; padding-left:3px;padding-right:3px;text-align: center; position: sticky; top:0; background-color: #FFFFFF;" colspan="3"><div style="border-bottom: 1px solid #ddd; padding-bottom: 5px; ">Trainees</div></th> <th style="border-bottom:hidden;padding-bottom:0; padding-left:3px;padding-right:3px;text-align: center; position: sticky; top:0; background-color: #FFFFFF;" colspan="3"><div style="border-bottom: 1px solid #ddd; padding-bottom: 5px; ">Non-Trainees</div></th> </tr> <tr> <th style="text-align:left;position: sticky; top:0; background-color: #FFFFFF;"> Unit </th> <th style="text-align:left;position: sticky; top:0; background-color: #FFFFFF;"> Age </th> <th style="text-align:left;position: sticky; top:0; background-color: #FFFFFF;"> Earnings </th> <th style="text-align:left;position: sticky; top:0; background-color: #FFFFFF;"> Unit </th> <th style="text-align:left;position: sticky; top:0; background-color: #FFFFFF;"> Age </th> <th style="text-align:left;position: sticky; top:0; background-color: #FFFFFF;"> Earnings </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 1 </td> <td style="text-align:left;"> 18 </td> <td 
style="text-align:left;"> 9500 </td> <td style="text-align:left;"> 1 </td> <td style="text-align:left;"> 20 </td> <td style="text-align:left;"> 8500 </td> </tr> <tr> <td style="text-align:left;"> 2 </td> <td style="text-align:left;"> 29 </td> <td style="text-align:left;"> 12250 </td> <td style="text-align:left;"> 2 </td> <td style="text-align:left;"> 27 </td> <td style="text-align:left;"> 10075 </td> </tr> <tr> <td style="text-align:left;"> 3 </td> <td style="text-align:left;"> 24 </td> <td style="text-align:left;"> 11000 </td> <td style="text-align:left;"> 3 </td> <td style="text-align:left;"> 21 </td> <td style="text-align:left;"> 8725 </td> </tr> <tr> <td style="text-align:left;"> 4 </td> <td style="text-align:left;"> 27 </td> <td style="text-align:left;"> 11750 </td> <td style="text-align:left;"> 4 </td> <td style="text-align:left;"> 39 </td> <td style="text-align:left;"> 12775 </td> </tr> <tr> <td style="text-align:left;"> 5 </td> <td style="text-align:left;"> 33 </td> <td style="text-align:left;"> 13250 </td> <td style="text-align:left;"> 5 </td> <td style="text-align:left;"> 38 </td> <td style="text-align:left;"> 12550 </td> </tr> <tr> <td style="text-align:left;"> 6 </td> <td style="text-align:left;"> 22 </td> <td style="text-align:left;"> 10500 </td> <td style="text-align:left;"> 6 </td> <td style="text-align:left;"> 29 </td> <td style="text-align:left;"> 10525 </td> </tr> <tr> <td style="text-align:left;"> 7 </td> <td style="text-align:left;"> 19 </td> <td style="text-align:left;"> 9750 </td> <td style="text-align:left;"> 7 </td> <td style="text-align:left;"> 39 </td> <td style="text-align:left;"> 12775 </td> </tr> <tr> <td style="text-align:left;"> 8 </td> <td style="text-align:left;"> 20 </td> <td style="text-align:left;"> 10000 </td> <td style="text-align:left;"> 8 </td> <td style="text-align:left;"> 33 </td> <td style="text-align:left;"> 11425 </td> </tr> <tr> <td style="text-align:left;"> 9 </td> <td style="text-align:left;"> 21 </td> <td 
style="text-align:left;"> 10250 </td> <td style="text-align:left;"> 9 </td> <td style="text-align:left;"> 24 </td> <td style="text-align:left;"> 9400 </td> </tr> <tr> <td style="text-align:left;"> 10 </td> <td style="text-align:left;"> 30 </td> <td style="text-align:left;"> 12500 </td> <td style="text-align:left;"> 10 </td> <td style="text-align:left;"> 30 </td> <td style="text-align:left;"> 10750 </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> 11 </td> <td style="text-align:left;"> 33 </td> <td style="text-align:left;"> 11425 </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> 12 </td> <td style="text-align:left;"> 36 </td> <td style="text-align:left;"> 12100 </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> 13 </td> <td style="text-align:left;"> 22 </td> <td style="text-align:left;"> 8950 </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> 14 </td> <td style="text-align:left;"> 18 </td> <td style="text-align:left;"> 8050 </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> 15 </td> <td style="text-align:left;"> 43 </td> <td style="text-align:left;"> 13675 </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> 16 </td> <td style="text-align:left;"> 39 </td> <td style="text-align:left;"> 12775 </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> 17 </td> 
<td style="text-align:left;"> 19 </td> <td style="text-align:left;"> 8275 </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> 18 </td> <td style="text-align:left;"> 30 </td> <td style="text-align:left;"> 9000 </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> 19 </td> <td style="text-align:left;"> 51 </td> <td style="text-align:left;"> 15475 </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> 20 </td> <td style="text-align:left;"> 48 </td> <td style="text-align:left;"> 14800 </td> </tr> <tr grouplength="1"><td colspan="6" style="border-bottom: 1px solid;"><strong>Summary Statistics</strong></td></tr> <tr> <td style="text-align:left;padding-left: 2em;" indentlevel="1"> Mean </td> <td style="text-align:left;"> 24.3 </td> <td style="text-align:left;"> $11,075 </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> 31.95 </td> <td style="text-align:left;"> $11,101.25 </td> </tr> </tbody> </table></div> ] --- # Exact Matching II .pull-left[ * One way to achieve balance is to find the missing counterfactuals for each treatment group unit according to the covariate *age*. Looking at the first treatment unit, the individual is 18 years old. Unit 14 in the control group also has age 18, and there we have an exact match. We proceed to move unit 14 in the control group to the top. Keep doing the same for units 2 to 9 in the treatment group, and you will find exact matches for them * Treatment unit 10 is a little bit different. Looking at the non-trainees group, one can find two individuals at age 30 - units 10 and 18. When we have a situation where there is more than one control group unit very close to the treatment unit, then we can average over them. 
Hence, our exact match for treatment unit 10 has earnings equal to `\(\frac{9,000+10,750}{2}=9,875\)` ] .pull-right[ <div style="border: 1px solid #ddd; padding: 0px; overflow-y: scroll; height:500px; overflow-x: scroll; width:100%; "><table class="table table-striped table-condensed" style="font-size: 20px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="border-bottom:hidden;padding-bottom:0; padding-left:3px;padding-right:3px;text-align: center; position: sticky; top:0; background-color: #FFFFFF;" colspan="3"><div style="border-bottom: 1px solid #ddd; padding-bottom: 5px; ">Trainees</div></th> <th style="border-bottom:hidden;padding-bottom:0; padding-left:3px;padding-right:3px;text-align: center; position: sticky; top:0; background-color: #FFFFFF;" colspan="3"><div style="border-bottom: 1px solid #ddd; padding-bottom: 5px; ">Non-Trainees</div></th> </tr> <tr> <th style="text-align:left;position: sticky; top:0; background-color: #FFFFFF;"> Unit </th> <th style="text-align:left;position: sticky; top:0; background-color: #FFFFFF;"> Age </th> <th style="text-align:left;position: sticky; top:0; background-color: #FFFFFF;"> Earnings </th> <th style="text-align:left;position: sticky; top:0; background-color: #FFFFFF;"> Unit </th> <th style="text-align:left;position: sticky; top:0; background-color: #FFFFFF;"> Age </th> <th style="text-align:left;position: sticky; top:0; background-color: #FFFFFF;"> Earnings </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 1 </td> <td style="text-align:left;"> 18 </td> <td style="text-align:left;"> 9500 </td> <td style="text-align:left;"> 1 </td> <td style="text-align:left;"> 20 </td> <td style="text-align:left;"> 8500 </td> </tr> <tr> <td style="text-align:left;"> 2 </td> <td style="text-align:left;"> 29 </td> <td style="text-align:left;"> 12250 </td> <td style="text-align:left;"> 2 </td> <td style="text-align:left;"> 27 </td> <td style="text-align:left;"> 10075 </td> </tr> <tr> <td 
style="text-align:left;"> 3 </td> <td style="text-align:left;"> 24 </td> <td style="text-align:left;"> 11000 </td> <td style="text-align:left;"> 3 </td> <td style="text-align:left;"> 21 </td> <td style="text-align:left;"> 8725 </td> </tr> <tr> <td style="text-align:left;"> 4 </td> <td style="text-align:left;"> 27 </td> <td style="text-align:left;"> 11750 </td> <td style="text-align:left;"> 4 </td> <td style="text-align:left;"> 39 </td> <td style="text-align:left;"> 12775 </td> </tr> <tr> <td style="text-align:left;"> 5 </td> <td style="text-align:left;"> 33 </td> <td style="text-align:left;"> 13250 </td> <td style="text-align:left;"> 5 </td> <td style="text-align:left;"> 38 </td> <td style="text-align:left;"> 12550 </td> </tr> <tr> <td style="text-align:left;"> 6 </td> <td style="text-align:left;"> 22 </td> <td style="text-align:left;"> 10500 </td> <td style="text-align:left;"> 6 </td> <td style="text-align:left;"> 29 </td> <td style="text-align:left;"> 10525 </td> </tr> <tr> <td style="text-align:left;"> 7 </td> <td style="text-align:left;"> 19 </td> <td style="text-align:left;"> 9750 </td> <td style="text-align:left;"> 7 </td> <td style="text-align:left;"> 39 </td> <td style="text-align:left;"> 12775 </td> </tr> <tr> <td style="text-align:left;"> 8 </td> <td style="text-align:left;"> 20 </td> <td style="text-align:left;"> 10000 </td> <td style="text-align:left;"> 8 </td> <td style="text-align:left;"> 33 </td> <td style="text-align:left;"> 11425 </td> </tr> <tr> <td style="text-align:left;"> 9 </td> <td style="text-align:left;"> 21 </td> <td style="text-align:left;"> 10250 </td> <td style="text-align:left;"> 9 </td> <td style="text-align:left;"> 24 </td> <td style="text-align:left;"> 9400 </td> </tr> <tr> <td style="text-align:left;"> 10 </td> <td style="text-align:left;"> 30 </td> <td style="text-align:left;"> 12500 </td> <td style="text-align:left;"> 10 </td> <td style="text-align:left;"> 30 </td> <td style="text-align:left;"> 10750 </td> </tr> <tr> <td 
style="text-align:left;"> </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> 11 </td> <td style="text-align:left;"> 33 </td> <td style="text-align:left;"> 11425 </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> 12 </td> <td style="text-align:left;"> 36 </td> <td style="text-align:left;"> 12100 </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> 13 </td> <td style="text-align:left;"> 22 </td> <td style="text-align:left;"> 8950 </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> 14 </td> <td style="text-align:left;"> 18 </td> <td style="text-align:left;"> 8050 </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> 15 </td> <td style="text-align:left;"> 43 </td> <td style="text-align:left;"> 13675 </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> 16 </td> <td style="text-align:left;"> 39 </td> <td style="text-align:left;"> 12775 </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> 17 </td> <td style="text-align:left;"> 19 </td> <td style="text-align:left;"> 8275 </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> 18 </td> <td style="text-align:left;"> 30 </td> <td style="text-align:left;"> 9000 </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> <td 
style="text-align:left;"> </td> <td style="text-align:left;"> 19 </td> <td style="text-align:left;"> 51 </td> <td style="text-align:left;"> 15475 </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> 20 </td> <td style="text-align:left;"> 48 </td> <td style="text-align:left;"> 14800 </td> </tr> <tr grouplength="1"><td colspan="6" style="border-bottom: 1px solid;"><strong>Summary Statistics</strong></td></tr> <tr> <td style="text-align:left;padding-left: 2em;" indentlevel="1"> Mean </td> <td style="text-align:left;"> 24.3 </td> <td style="text-align:left;"> $11,075 </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> 31.95 </td> <td style="text-align:left;"> $11,101.25 </td> </tr> </tbody> </table></div> ] --- # Exact Matching III .pull-left[ * When the matching is done, you end up with 10 matched control observations, one for each treatment unit, and now the two groups are exactly balanced on age. The difference between trainees' and non-trainees' earnings is $1,695, favoring those who participated in the job training * To summarize, we had two groups that were different in ways that likely affected the potential outcomes. 
However, assuming that the treatment assignment conditional on age was "as good as random", the matching on age generated an apples-to-apples comparison ] .pull-right[ <table class="table table-striped table-condensed" style="font-size: 20px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="border-bottom:hidden;padding-bottom:0; padding-left:3px;padding-right:3px;text-align: center; " colspan="3"><div style="border-bottom: 1px solid #ddd; padding-bottom: 5px; ">Trainees</div></th> <th style="border-bottom:hidden;padding-bottom:0; padding-left:3px;padding-right:3px;text-align: center; " colspan="3"><div style="border-bottom: 1px solid #ddd; padding-bottom: 5px; ">Matched Sample</div></th> </tr> <tr> <th style="text-align:left;"> Unit </th> <th style="text-align:left;"> Age </th> <th style="text-align:left;"> Earnings </th> <th style="text-align:left;"> Unit </th> <th style="text-align:left;"> Age </th> <th style="text-align:left;"> Earnings </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 1 </td> <td style="text-align:left;"> 18 </td> <td style="text-align:left;"> 9500 </td> <td style="text-align:left;"> 1 </td> <td style="text-align:left;"> 18 </td> <td style="text-align:left;"> 8050 </td> </tr> <tr> <td style="text-align:left;"> 2 </td> <td style="text-align:left;"> 29 </td> <td style="text-align:left;"> 12250 </td> <td style="text-align:left;"> 2 </td> <td style="text-align:left;"> 29 </td> <td style="text-align:left;"> 10525 </td> </tr> <tr> <td style="text-align:left;"> 3 </td> <td style="text-align:left;"> 24 </td> <td style="text-align:left;"> 11000 </td> <td style="text-align:left;"> 3 </td> <td style="text-align:left;"> 24 </td> <td style="text-align:left;"> 9400 </td> </tr> <tr> <td style="text-align:left;"> 4 </td> <td style="text-align:left;"> 27 </td> <td style="text-align:left;"> 11750 </td> <td style="text-align:left;"> 4 </td> <td style="text-align:left;"> 27 </td> <td style="text-align:left;"> 10075 </td> </tr> <tr> 
<td style="text-align:left;"> 5 </td> <td style="text-align:left;"> 33 </td> <td style="text-align:left;"> 13250 </td> <td style="text-align:left;"> 5 </td> <td style="text-align:left;"> 33 </td> <td style="text-align:left;"> 11425 </td> </tr> <tr> <td style="text-align:left;"> 6 </td> <td style="text-align:left;"> 22 </td> <td style="text-align:left;"> 10500 </td> <td style="text-align:left;"> 6 </td> <td style="text-align:left;"> 22 </td> <td style="text-align:left;"> 8950 </td> </tr> <tr> <td style="text-align:left;"> 7 </td> <td style="text-align:left;"> 19 </td> <td style="text-align:left;"> 9750 </td> <td style="text-align:left;"> 7 </td> <td style="text-align:left;"> 19 </td> <td style="text-align:left;"> 8275 </td> </tr> <tr> <td style="text-align:left;"> 8 </td> <td style="text-align:left;"> 20 </td> <td style="text-align:left;"> 10000 </td> <td style="text-align:left;"> 8 </td> <td style="text-align:left;"> 20 </td> <td style="text-align:left;"> 8500 </td> </tr> <tr> <td style="text-align:left;"> 9 </td> <td style="text-align:left;"> 21 </td> <td style="text-align:left;"> 10250 </td> <td style="text-align:left;"> 9 </td> <td style="text-align:left;"> 21 </td> <td style="text-align:left;"> 8725 </td> </tr> <tr> <td style="text-align:left;"> 10 </td> <td style="text-align:left;"> 30 </td> <td style="text-align:left;"> 12500 </td> <td style="text-align:left;"> 10 </td> <td style="text-align:left;"> 30 </td> <td style="text-align:left;"> 9875 </td> </tr> <tr grouplength="1"><td colspan="6" style="border-bottom: 1px solid;"><strong>Summary Statistics</strong></td></tr> <tr> <td style="text-align:left;padding-left: 2em;" indentlevel="1"> Mean </td> <td style="text-align:left;"> 24.3 </td> <td style="text-align:left;"> $11,075 </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> 24.3 </td> <td style="text-align:left;"> $9,380 </td> </tr> </tbody> </table> ] --- # Exact Matching IV: Age Distribution .pull-left[ **Before Matching**
] .pull-right[ **After Matching**
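The exact-matching steps from the preceding slides can be reproduced in a few lines. The sketch below is only an illustration in Python (the slides themselves contain no code). The function and variable names are hypothetical, the data are the (age, earnings) pairs from the trainee and non-trainee tables, and controls that share an age are averaged, as was done for the two 30-year-old non-trainees.

```python
from collections import defaultdict

# (age, earnings) pairs copied from the Trainees / Non-Trainees tables
trainees = [(18, 9500), (29, 12250), (24, 11000), (27, 11750), (33, 13250),
            (22, 10500), (19, 9750), (20, 10000), (21, 10250), (30, 12500)]
non_trainees = [(20, 8500), (27, 10075), (21, 8725), (39, 12775), (38, 12550),
                (29, 10525), (39, 12775), (33, 11425), (24, 9400), (30, 10750),
                (33, 11425), (36, 12100), (22, 8950), (18, 8050), (43, 13675),
                (39, 12775), (19, 8275), (30, 9000), (51, 15475), (48, 14800)]


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def exact_match_gap(treated, controls):
    """Average earnings gap between each treated unit and its exact age match.

    Hypothetical helper: controls sharing the treated unit's age are averaged.
    """
    earnings_by_age = defaultdict(list)
    for age, earnings in controls:
        earnings_by_age[age].append(earnings)

    gaps = []
    for age, earnings in treated:
        if age not in earnings_by_age:
            raise ValueError(f"no exact match for age {age}")
        gaps.append(earnings - mean(earnings_by_age[age]))
    return mean(gaps)


naive_gap = mean(e for _, e in trainees) - mean(e for _, e in non_trainees)
matched_gap = exact_match_gap(trainees, non_trainees)

print(naive_gap)    # -26.25
print(matched_gap)  # 1695.0
```

Because a match is sought for each trainee and unmatched non-trainees are dropped, the matched gap is best read as the average effect for the trainees, under the assumption that treatment is as good as random conditional on age.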
] --- class: inverse, middle, center # Approximate Matching --- # Matching on Covariates * In the last example, we found a match in the control group for each treated unit using the exact value of the covariate age. After that, it is straightforward to take the difference of average outcomes in the treatment and matched control groups to find the causal effect of the job training program * The real world is more complicated than that, and you will frequently face units that do not share the same value of a given covariate - for instance, think about matching people using exact income values. Also, you might want to match on more than one covariate, which complicates things * When we need to match on multiple covariates, a measure of distance comes in handy. One frequently used is the *Mahalanobis distance*: `$$||X_{i}-X_{j}||=\sqrt{(X_{i}-X_{j})'\widehat{\Sigma}_{X}^{-1}(X_{i}-X_{j})}$$` where `\(\widehat{\Sigma}_{X}\)` is the sample covariance matrix of the matching variables. As you can imagine, some matched units will have different values of covariates (e.g., person `\(i\)` has age 26 while person `\(j\)` has age 25). Sometimes the discrepancies are small, sometimes large. As the differences increase, they introduce bias in the estimates. For the discrepancies to be trivially minor, you need a considerable sample size. Finally, the larger the number of covariates, the greater the likelihood of matching discrepancies --- # Propensity Score * The propensity score is a widespread way of aggregating multiple matching variables into a single value that can be matched on. It is prevalent in the medical sciences, but economists usually prefer other quasi-experimental methods like difference-in-differences and regression discontinuity. One reason for the skepticism is that economists are typically worried about selection on unobservables instead of selection on observables * The propensity score is the estimated probability that a given observation would have gotten treated. 
More precisely, the propensity score is the conditional probability of receiving the treatment given covariate values: `$$p(X)=Pr(D=1|X)$$` * In practice, propensity scores are estimated by logit or probit regressions using the "necessary covariates" - the ones that determine the likelihood a unit receives the treatment. * Consider two units A and B. A received the treatment, but B did not. You calculate the propensity score for A and B and find 60% for both: conditional on the covariates, each had a 60% probability of being assigned to treatment. So you can compare units with similar propensity scores `\(p(X)\)`, and the observed difference in outcomes can be attributed to the treatment --- # Assumptions for Matching on the Propensity Score Just like regression, matching relies on the **conditional independence assumption** (CIA): `$$Y_{1i}, Y_{0i} \perp\!\!\!\perp D_{i}|X_{i} \implies Y_{1i}, Y_{0i} \perp\!\!\!\perp D_{i}|\text{ } p(X_{i})$$` In other words, the CIA states that the set of matching variables you have chosen is enough to make the treatment conditionally independent of potential outcomes. The implication in the display says that if the CIA holds given `\(X_{i}\)`, it also holds given `\(p(X_{i})\)`, so we can use `\(p(X_{i})\)` instead of `\(X_{i}\)`: with CIA, the propensity score tells us everything we need to know about whether or not unit `\(i\)` receives treatment. Another assumption we make is **common support**. Formally, `$$0<Pr(D=1|X)<1 \text{ with probability one}$$` This means that the probability of treatment is strictly between 0 and 1 in each stratum. The assumption of common support says that there must be substantial overlap in the distributions of the matching variables (or in the propensity score) between the treated and control observations. One way to check for common support is to compare the distribution of a matching variable (or of the propensity score) for the treated and untreated groups. 
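As a rough illustration of the common-support check, the sketch below reuses the ages from the job-training tables. With a single discrete covariate, the share of trainees at each age serves as a simple nonparametric stand-in for the logit or probit estimate mentioned earlier. That substitution is an assumption made for brevity, and the function and variable names are hypothetical.

```python
from collections import Counter

# Ages copied from the Trainees / Non-Trainees tables
trainee_ages = [18, 29, 24, 27, 33, 22, 19, 20, 21, 30]
non_trainee_ages = [20, 27, 21, 39, 38, 29, 39, 33, 24, 30,
                    33, 36, 22, 18, 43, 39, 19, 30, 51, 48]


def propensity_by_cell(treated_x, control_x):
    """Estimate Pr(D=1 | X=x) as the share of treated units in each cell x.

    Hypothetical helper: only sensible for a discrete covariate with few cells.
    """
    n_treated = Counter(treated_x)
    n_control = Counter(control_x)
    cells = sorted(set(n_treated) | set(n_control))
    return {x: n_treated[x] / (n_treated[x] + n_control[x]) for x in cells}


def off_support(scores):
    """Cells where the estimated score is exactly 0 or 1 (no overlap)."""
    return [x for x, p in scores.items() if p <= 0.0 or p >= 1.0]


scores = propensity_by_cell(trainee_ages, non_trainee_ages)

print(scores[30])           # one trainee out of three people aged 30
print(off_support(scores))  # [36, 38, 39, 43, 48, 51]
```

In this toy sample every age observed among trainees also appears among non-trainees, so each trainee has a comparison unit. The ages with a score of 0 contain no trainees, so the data say nothing about the effect of training at those ages.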
Finally, one needs to check for balance in the matched data: whether there are differences between the treatment and control groups in variables for which there should be no differences. That is relatively easy to do - summary statistics and difference-of-means tests for each variable in the treated and control groups should be enough. --- class: inverse, middle, center # NSW Job Training Program --- # NSW Program I .pull-left[ * The National Supported Work (NSW) Demonstration was a mid-1970s program that provided work experience and counseling in a sheltered environment to disadvantaged workers lacking basic job skills. Unusually for its time, the NSW was evaluated in a randomized trial * The researchers collected earnings and demographic information from both treatment and control groups at baseline and every nine months after that * Despite the apparent imbalance in the proportion of participants without a degree in this particular sample, randomization lets us estimate the causal effect of the job-training program by taking the difference in mean 1978 earnings between the treatment and control groups: **an increase of about $1,800 due to NSW** ] .pull-right[ <table class="table table-striped table-condensed" style="font-size: 18px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> Control Group </th> <th style="text-align:right;"> Treatment Group </th> <th style="text-align:right;"> Difference </th> </tr> </thead> <tbody> <tr grouplength="8"><td colspan="4" style="border-bottom: 1px solid;"><strong>Baseline Characteristics</strong></td></tr> <tr> <td style="text-align:left;padding-left: 2em;" indentlevel="1"> Age </td> <td style="text-align:right;"> 25.05 </td> <td style="text-align:right;"> 25.82 </td> <td style="text-align:right;"> 0.76 </td> </tr> <tr> <td style="text-align:left;padding-left: 2em;" indentlevel="1"> Education </td> <td style="text-align:right;"> 10.09 </td> <td style="text-align:right;"> 10.35 </td> <td style="text-align:right;"> 
0.26 </td> </tr> <tr> <td style="text-align:left;padding-left: 2em;" indentlevel="1"> Black </td> <td style="text-align:right;"> 0.83 </td> <td style="text-align:right;"> 0.84 </td> <td style="text-align:right;"> 0.02 </td> </tr> <tr> <td style="text-align:left;padding-left: 2em;" indentlevel="1"> Hispanic </td> <td style="text-align:right;"> 0.11 </td> <td style="text-align:right;"> 0.06 </td> <td style="text-align:right;"> -0.05 </td> </tr> <tr> <td style="text-align:left;padding-left: 2em;" indentlevel="1"> Married </td> <td style="text-align:right;"> 0.15 </td> <td style="text-align:right;"> 0.19 </td> <td style="text-align:right;"> 0.04 </td> </tr> <tr> <td style="text-align:left;padding-left: 2em;" indentlevel="1"> No degree </td> <td style="text-align:right;"> 0.83 </td> <td style="text-align:right;"> 0.71 </td> <td style="text-align:right;"> -0.13 </td> </tr> <tr> <td style="text-align:left;padding-left: 2em;" indentlevel="1"> Earnings in 1974 </td> <td style="text-align:right;"> 2107.03 </td> <td style="text-align:right;"> 2095.57 </td> <td style="text-align:right;"> -11.45 </td> </tr> <tr> <td style="text-align:left;padding-left: 2em;" indentlevel="1"> Earnings in 1975 </td> <td style="text-align:right;"> 1266.91 </td> <td style="text-align:right;"> 1532.06 </td> <td style="text-align:right;"> 265.15 </td> </tr> <tr grouplength="1"><td colspan="4" style="border-bottom: 1px solid;"><strong>Outcome</strong></td></tr> <tr> <td style="text-align:left;padding-left: 2em;font-weight: bold;" indentlevel="1"> Earnings in 1978 </td> <td style="text-align:right;font-weight: bold;"> 4554.80 </td> <td style="text-align:right;font-weight: bold;"> 6349.14 </td> <td style="text-align:right;font-weight: bold;"> 1794.34 </td> </tr> <tr> <td style="text-align:left;"> Number of Observations </td> <td style="text-align:right;"> 260.00 </td> <td style="text-align:right;"> 185.00 </td> <td style="text-align:right;"> </td> </tr> </tbody> </table> .small[**Note: based on Dehejia 
and Wahba (2002).**] ] --- class: regression # NSW Program II .panelset[ .panel[.panel-name[Is *ceteris* really *paribus*?] .pull-left[ * With the causal effect of the NSW program in hand, Lalonde (1986) decided to compare the results from the randomized study to econometric results using nonexperimental control groups drawn from the Current Population Survey (CPS) and the Panel Study of Income Dynamics (PSID) * Lalonde (1986) found that, with nonexperimental comparison groups, the results of observational studies were consistently poor - very different magnitudes and even wrong signs * The pessimistic conclusions of the paper were influential in policy circles and pushed for more experimental evaluations ] .pull-right[ <table class="table table-striped table-condensed" style="font-size: 16px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> Control Group (NE) </th> <th style="text-align:right;"> Treatment Group </th> <th style="text-align:right;"> Difference </th> </tr> </thead> <tbody> <tr grouplength="8"><td colspan="4" style="border-bottom: 1px solid;"><strong>Baseline Characteristics</strong></td></tr> <tr> <td style="text-align:left;padding-left: 2em;" indentlevel="1"> Age </td> <td style="text-align:right;"> 33.23 </td> <td style="text-align:right;"> 25.82 </td> <td style="text-align:right;"> -7.41 </td> </tr> <tr> <td style="text-align:left;padding-left: 2em;" indentlevel="1"> Education </td> <td style="text-align:right;"> 12.03 </td> <td style="text-align:right;"> 10.35 </td> <td style="text-align:right;"> -1.68 </td> </tr> <tr> <td style="text-align:left;padding-left: 2em;" indentlevel="1"> Black </td> <td style="text-align:right;"> 0.07 </td> <td style="text-align:right;"> 0.84 </td> <td style="text-align:right;"> 0.77 </td> </tr> <tr> <td style="text-align:left;padding-left: 2em;" indentlevel="1"> Hispanic </td> <td style="text-align:right;"> 0.07 </td> <td 
style="text-align:right;"> 0.06 </td> <td style="text-align:right;"> -0.01 </td> </tr> <tr> <td style="text-align:left;padding-left: 2em;" indentlevel="1"> Married </td> <td style="text-align:right;"> 0.71 </td> <td style="text-align:right;"> 0.19 </td> <td style="text-align:right;"> -0.52 </td> </tr> <tr> <td style="text-align:left;padding-left: 2em;" indentlevel="1"> No degree </td> <td style="text-align:right;"> 0.30 </td> <td style="text-align:right;"> 0.71 </td> <td style="text-align:right;"> 0.41 </td> </tr> <tr> <td style="text-align:left;padding-left: 2em;" indentlevel="1"> Earnings in 1974 </td> <td style="text-align:right;"> 14016.80 </td> <td style="text-align:right;"> 2095.57 </td> <td style="text-align:right;"> -11921.23 </td> </tr> <tr> <td style="text-align:left;padding-left: 2em;" indentlevel="1"> Earnings in 1975 </td> <td style="text-align:right;"> 13650.80 </td> <td style="text-align:right;"> 1532.06 </td> <td style="text-align:right;"> -12118.75 </td> </tr> <tr grouplength="1"><td colspan="4" style="border-bottom: 1px solid;"><strong>Outcome</strong></td></tr> <tr> <td style="text-align:left;padding-left: 2em;font-weight: bold;" indentlevel="1"> Earnings in 1978 </td> <td style="text-align:right;font-weight: bold;"> 14846.66 </td> <td style="text-align:right;font-weight: bold;"> 6349.14 </td> <td style="text-align:right;font-weight: bold;"> -8497.52 </td> </tr> <tr> <td style="text-align:left;"> Number of Observations </td> <td style="text-align:right;"> 15992.00 </td> <td style="text-align:right;"> 185.00 </td> <td style="text-align:right;"> </td> </tr> </tbody> </table> ] ] .panel[.panel-name[Regression Can't Save You] .pull-left[ * The table shows regression results from three models. The first one considers only the treatment variable - job-training participation. 
The second adds individuals' demographic characteristics, and the third also includes earnings in 1974-1975 * All results are very far from the causal effect estimated by the randomized trial (about $1,800). As you can imagine, the stark difference is due to **selection bias**. The control group is a random sample of Americans from that period, so it cannot serve as a counterfactual for the distressed group of workers who were selected into the NSW program ] .pull-right[ <style type="text/css"> .regression table { font-size: 13px; } </style> <table style="text-align:center"><tr><td colspan="4" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"></td><td colspan="3">Earnings in 1978</td></tr> <tr><td></td><td colspan="3" style="border-bottom: 1px solid black"></td></tr> <tr><td style="text-align:left"></td><td>(1)</td><td>(2)</td><td>(3)</td></tr> <tr><td colspan="4" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">treat</td><td>-8,497.516<sup>***</sup> (712.021)</td><td>-4,305.372<sup>***</sup> (727.934)</td><td>678.571 (546.828)</td></tr> <tr><td style="text-align:left">age</td><td></td><td>146.934<sup>***</sup> (6.751)</td><td>-100.780<sup>***</sup> (5.567)</td></tr> <tr><td style="text-align:left">educ</td><td></td><td>262.426<sup>***</sup> (37.797)</td><td>163.525<sup>***</sup> (28.333)</td></tr> <tr><td style="text-align:left">black</td><td></td><td>-2,348.248<sup>***</sup> (281.839)</td><td>-817.238<sup>***</sup> (211.458)</td></tr> <tr><td style="text-align:left">nodegree</td><td></td><td>-2,073.285<sup>***</sup> (235.008)</td><td>366.275<sup>**</sup> (177.374)</td></tr> <tr><td style="text-align:left">re74</td><td></td><td></td><td>0.290<sup>***</sup> (0.012)</td></tr> <tr><td style="text-align:left">re75</td><td></td><td></td><td>0.471<sup>***</sup> (0.012)</td></tr> <tr><td colspan="4" style="border-bottom: 1px solid black"></td></tr><tr><td 
style="text-align:left">Observations</td><td>16,177</td><td>16,177</td><td>16,177</td></tr> <tr><td style="text-align:left">Adjusted R<sup>2</sup></td><td>0.009</td><td>0.064</td><td>0.476</td></tr> <tr><td colspan="4" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"><em>Note:</em></td><td colspan="3" style="text-align:right"><sup>*</sup>p<0.1; <sup>**</sup>p<0.05; <sup>***</sup>p<0.01</td></tr> </table> ] ] ] --- # NSW Program III .pull-left[ * Can propensity score matching improve the estimates of treatment effects with non-experimental data? * The table shows that when we trim the non-experimental control group using nearest neighbor matching to keep the control units closest to the treated ones in terms of the propensity score, the baseline characteristics are far better balanced, and the treatment effect of the job training is $1,678.8-1,762.5 - very close to the actual causal effect we found with the experimental design * There are other ways of getting average treatment effects using the estimated propensity score besides nearest neighbor matching ] .pull-right[ <table class="table table-striped table-condensed" style="font-size: 17px; margin-left: auto; margin-right: auto;"> <thead> <tr><th style="border-bottom:hidden;padding-bottom:0; padding-left:3px;padding-right:3px;text-align: center; " colspan="4"><div style="border-bottom: 1px solid #ddd; padding-bottom: 5px; ">Balance After Matching</div></th></tr> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> Control Group (NE) </th> <th style="text-align:right;"> Treatment Group </th> <th style="text-align:right;"> Difference </th> </tr> </thead> <tbody> <tr grouplength="8"><td colspan="4" style="border-bottom: 1px solid;"><strong>Baseline Characteristics</strong></td></tr> <tr> <td style="text-align:left;padding-left: 2em;" indentlevel="1"> Age </td> <td style="text-align:right;"> 23.82 </td> <td style="text-align:right;"> 25.82 </td> <td style="text-align:right;"> 1.99 </td> 
</tr> <tr> <td style="text-align:left;padding-left: 2em;" indentlevel="1"> Education </td> <td style="text-align:right;"> 10.37 </td> <td style="text-align:right;"> 10.35 </td> <td style="text-align:right;"> -0.02 </td> </tr> <tr> <td style="text-align:left;padding-left: 2em;" indentlevel="1"> Black </td> <td style="text-align:right;"> 0.83 </td> <td style="text-align:right;"> 0.84 </td> <td style="text-align:right;"> 0.02 </td> </tr> <tr> <td style="text-align:left;padding-left: 2em;" indentlevel="1"> Hispanic </td> <td style="text-align:right;"> 0.04 </td> <td style="text-align:right;"> 0.06 </td> <td style="text-align:right;"> 0.02 </td> </tr> <tr> <td style="text-align:left;padding-left: 2em;" indentlevel="1"> Married </td> <td style="text-align:right;"> 0.14 </td> <td style="text-align:right;"> 0.19 </td> <td style="text-align:right;"> 0.05 </td> </tr> <tr> <td style="text-align:left;padding-left: 2em;" indentlevel="1"> No degree </td> <td style="text-align:right;"> 0.68 </td> <td style="text-align:right;"> 0.71 </td> <td style="text-align:right;"> 0.03 </td> </tr> <tr> <td style="text-align:left;padding-left: 2em;" indentlevel="1"> Earnings in 1974 </td> <td style="text-align:right;"> 1905.21 </td> <td style="text-align:right;"> 2095.57 </td> <td style="text-align:right;"> 190.36 </td> </tr> <tr> <td style="text-align:left;padding-left: 2em;" indentlevel="1"> Earnings in 1975 </td> <td style="text-align:right;"> 1302.15 </td> <td style="text-align:right;"> 1532.06 </td> <td style="text-align:right;"> 229.91 </td> </tr> <tr grouplength="1"><td colspan="4" style="border-bottom: 1px solid;"><strong>Outcome</strong></td></tr> <tr> <td style="text-align:left;padding-left: 2em;font-weight: bold;" indentlevel="1"> Earnings in 1978 </td> <td style="text-align:right;font-weight: bold;"> 4586.90 </td> <td style="text-align:right;font-weight: bold;"> 6349.14 </td> <td style="text-align:right;font-weight: bold;"> 1762.25 </td> </tr> <tr> <td style="text-align:left;"> 
Number of Observations </td> <td style="text-align:right;"> 185.00 </td> <td style="text-align:right;"> 185.00 </td> <td style="text-align:right;"> </td> </tr> </tbody> </table> ]
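
The matching step described on this slide can be sketched in a few lines. This is only an illustration under stated assumptions, not the code behind the table above: the data are synthetic (not the NSW/CPS sample), there is a single covariate `x` standing in for standardized baseline earnings, the true effect is set to 2 by construction, and all names (`fit_logit`, `nearest_control`, `TRUE_EFFECT`) are hypothetical. Controls are matched with replacement, which may differ from the procedure used for the table.

```python
import math
import random

random.seed(474)  # arbitrary seed so the sketch is reproducible


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Synthetic data (assumption): low-x units are more likely to be treated,
# and x also raises the outcome, so the naive comparison is biased.
TRUE_EFFECT = 2.0
n = 3000
x = [random.gauss(0.0, 1.0) for _ in range(n)]
d = [1 if random.random() < sigmoid(-2.0 - 1.0 * xi) else 0 for xi in x]
y = [5.0 + 3.0 * xi + TRUE_EFFECT * di + random.gauss(0.0, 1.0)
     for xi, di in zip(x, d)]


def fit_logit(x, d, n_iter=25):
    """Logit of d on a constant and x, fitted by Newton-Raphson."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, di in zip(x, d):
            p = sigmoid(b0 + b1 * xi)
            w = p * (1.0 - p)
            g0 += di - p            # score, intercept
            g1 += (di - p) * xi     # score, slope
            h00 += w                # information matrix entries
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Step 1: estimate the propensity score P(D = 1 | x).
b0, b1 = fit_logit(x, d)
ps = [sigmoid(b0 + b1 * xi) for xi in x]

treated = [i for i in range(n) if d[i] == 1]
controls = [i for i in range(n) if d[i] == 0]


def nearest_control(i):
    """Index of the control unit whose propensity score is closest to unit i."""
    return min(controls, key=lambda j: abs(ps[j] - ps[i]))


# Step 2: one nearest neighbor (with replacement) for each treated unit.
matches = [nearest_control(i) for i in treated]

# Step 3: compare the naive difference in means with the matched estimate.
naive = (sum(y[i] for i in treated) / len(treated)
         - sum(y[j] for j in controls) / len(controls))
att = sum(y[i] - y[j] for i, j in zip(treated, matches)) / len(treated)

print(f"naive difference in means: {naive:.2f}")
print(f"matched ATT estimate:      {att:.2f} (true effect = {TRUE_EFFECT})")
```

Because the propensity score here depends on one covariate only, matching on the score is equivalent to matching on `x` itself; with several covariates the score collapses them into a single dimension, which is the point of the method.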