class: center, middle, inverse, title-slide # Econ 474 - Econometrics of Policy Evaluation ## Regression ### Marcelino Guerra ### February 02-07, 2022 --- # Describing Relationships .pull-left[ * For most research questions, we are interested in the relationship between two (or more) variables, i.e., what learning about one variable tells us about the other * Consider the classic example of **salary** and **years of education**. The scatterplot shows the relationship between those two variables: we would like to explain the variation in hourly wages with education levels * It seems that there is a positive relationship between salary and education ] .pull-right[
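A minimal sketch (an addition to these notes, not the original figure code) of how a scatterplot like the one described here can be drawn from the same data:

```r
library(wooldridge)                     # wage1: 1976 CPS extract
data("wage1", package = "wooldridge")

# Hourly wages against years of education
plot(wage1$educ, wage1$wage,
     xlab = "Years of education", ylab = "Hourly wage")
```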
.small[**Note:** The data come from the 1976 Current Population Survey in the USA.] ] --- class: inverse, middle, center # The basics --- # Conditional Expectation Functions (CEF) .pull-left[ * Frequently, we are interested in conditional expectations: for example, the expected salary for people who have completed three years of schooling. This can be written as `\(E[Y_{i}|X_{i}=x]\)`, where `\(Y_{i}\)` is salary, and `\(X_{i}=x\)` can be read as "when `\(X_{i}\)` equals the particular value `\(x\)`" - in our case, when education equals 3 years * Conditional expectations tell us how one variable's population average changes as we move the conditioning variable over its values. For every value of the conditioning variable, we might get a different average of the dependent variable `\(Y_{i}\)`, and the collection of all those averages is called the *conditional expectation function* * The figure shows the conditional expectation of hourly wages given years of education. Despite its ups and downs, the earnings-schooling CEF is upward-sloping ] .pull-right[
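A minimal sketch (an addition to these notes, assuming the `dplyr` package is available) of the earnings-schooling CEF: the average hourly wage at each value of years of schooling.

```r
library(wooldridge)
library(dplyr)
data("wage1", package = "wooldridge")

# E[wage | educ = x]: average hourly wage at each level of schooling
cef <- wage1 %>%
  group_by(educ) %>%
  summarize(mean_wage = mean(wage), n = n())

plot(cef$educ, cef$mean_wage, type = "b",
     xlab = "Years of education", ylab = "Average hourly wage")
```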
]

---

# Line Fitting I

.pull-left[

* Assuming that the relationship between earnings and education is linear, one can apply the most well-known line-fitting method: Ordinary Least Squares (OLS)

* OLS picks the line that gives the lowest *sum of squared residuals*. A residual is the difference between an observation's actual value and the value the line assigns to it, `\(Y-\hat{Y}\)`

* Our model of earnings and education can be written as

`$$\underbrace{Y_{i}}_{Wage_{i}}=\beta_{0}+\beta_{1}\underbrace{X_{i}}_{Educ_{i}}+\varepsilon_{i}$$`

where the `\(\beta\)`'s are unknown parameters to be estimated and `\(\varepsilon\)` is the unobserved error term

]

.pull-right[
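A minimal sketch (an addition to these notes, not the original figure code) of what "picking the line with the lowest sum of squared residuals" means: a numerical search over candidate lines lands essentially on the OLS solution.

```r
library(wooldridge)
data("wage1", package = "wooldridge")

# Sum of squared residuals for a candidate line (b0, b1)
ssr <- function(b) sum((wage1$wage - b[1] - b[2] * wage1$educ)^2)

# Numerically search for the line with the smallest SSR
fit <- optim(c(0, 0), ssr)
fit$par                               # close to the OLS estimates
coef(lm(wage ~ educ, data = wage1))   # exact OLS solution
```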
.small[**In the beginning, there was only a scatter plot. And then God said "let us fit a line"**]
]

---

# Line Fitting II

.panelset[

.panel[.panel-name[OLS Estimation]

Ultimately, the goal is to minimize `\(\sum e_{i}^{2}\)`, where `\(e_{i}=Y_{i}-\underbrace{\hat{Y}_{i}}_{Prediction}\)`. In other words, we estimate `\(\beta_{0}\)` and `\(\beta_{1}\)` by minimizing the sum of squared deviations from the regression line:

`$$(\hat{\beta}_{0}, \hat{\beta}_{1})=\underset{\beta_{0}, \beta_{1}}{\arg\min} \sum_{i=1}^{n} (Y_{i}-\beta_{0}-\beta_{1}X_{i})^{2}$$`

which leads to:

`$$\hat{\beta_{0}}= \bar{Y} - \hat{\beta_{1}}\bar{X} \\ \hat{\beta_{1}}=\frac{\sum_{i=1}^{n}(X_{i}-\bar{X})(Y_{i}-\bar{Y})}{\sum_{i=1}^{n}(X_{i}-\bar{X})^{2}}=\frac{Cov(X_{i}, Y_{i})}{Var(X_{i})}$$`

**OLS** gives us the best linear approximation of the relationship between `\(X\)` and `\(Y\)`

]

.panel[.panel-name[OLS Results (Individual-level data)]

.pull-left[

* The table shows the results of the linear regression. As you can see, `\(\hat{\beta}_{0}=-0.905\)` and `\(\hat{\beta}_{1}=0.541\)`. Hence, OLS selected the best-fit values of `\(\beta_{0}\)` and `\(\beta_{1}\)` to give us

`$$\widehat{Wage_{i}}=-0.905+0.541Educ_{i}$$`

* So, we would expect an increase of 0.541 dollars in the conditional mean of the hourly wage for a one-year increase in education. Also, for someone with three years of schooling, the predicted hourly wage (in 1976) according to our model is `\(-0.905+0.541\times 3=0.718\)`. You can check the scatterplot and see that the individual with three years of education in the sample has an hourly salary of 2.92.

]

.pull-right[

<table style="text-align:center"><tr><td colspan="2" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"></td><td>Hourly Wages</td></tr>
<tr><td></td><td colspan="1" style="border-bottom: 1px solid black"></td></tr>
<tr><td colspan="2" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">educ</td><td>0.541<sup>***</sup> (0.053)</td></tr>
<tr><td style="text-align:left">Constant</td><td>-0.905 (0.685)</td></tr>
<tr><td colspan="2" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">Observations</td><td>526</td></tr>
<tr><td style="text-align:left">Adjusted R<sup>2</sup></td><td>0.163</td></tr>
<tr><td colspan="2" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"><em>Note:</em></td><td style="text-align:right"><sup>*</sup>p<0.1; <sup>**</sup>p<0.05; <sup>***</sup>p<0.01</td></tr>
</table>

]
]

.panel[.panel-name[R Code]

```r
# install.packages("wooldridge") if you don't have the package
# install.packages("stargazer") if you don't have the package
#library(stargazer)
library(wooldridge)
data("wage1", package = "wooldridge") # load data
reg<-lm(wage~educ, data=wage1)
summary(reg)
#stargazer(reg, header = F, single.row = TRUE, no.space = T, dep.var.labels.include = FALSE,dep.var.caption = "Hourly Wages", type='html', omit.stat=c("rsq", "ser", "f"))
```
]
]

---

# The CEF is All You Need

.panelset[

.panel[.panel-name[OLS Results (Means by years of schooling)]

.pull-left[

Fun fact: the solution to this problem

`$$\hat{\beta}=\underset{\beta}{\arg\min}\; E\left[(E[Y_{i}|X_{i}]-X_{i}^{'}\beta)^{2}\right]$$`

is the same as the solution to this one

`$$\hat{\beta}=\underset{\beta}{\arg\min}\; E\left[(Y_{i}-X_{i}^{'}\beta)^{2}\right]$$`

The implication is that you only need conditional means to run regressions.
For example, the table shows the results of a **weighted** (*by what?*) linear regression using means by years of schooling instead of individual-level data. As you can see, `\(\hat{\beta}_{0}=-0.905\)` and `\(\hat{\beta}_{1}=0.541\)`, the same as before (when using individual-level data).

]

.pull-right[

<table style="text-align:center"><tr><td colspan="2" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"></td><td>Hourly Wages</td></tr>
<tr><td></td><td colspan="1" style="border-bottom: 1px solid black"></td></tr>
<tr><td colspan="2" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">educ</td><td>0.541<sup>***</sup> (0.081)</td></tr>
<tr><td style="text-align:left">Constant</td><td>-0.905 (1.043)</td></tr>
<tr><td colspan="2" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">Observations</td><td>18</td></tr>
<tr><td style="text-align:left">Adjusted R<sup>2</sup></td><td>0.719</td></tr>
<tr><td colspan="2" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"><em>Note:</em></td><td style="text-align:right"><sup>*</sup>p<0.1; <sup>**</sup>p<0.05; <sup>***</sup>p<0.01</td></tr>
</table>

]
]

.panel[.panel-name[R Code]

```r
library(dplyr)      # needed for %>%, group_by(), and summarize()
library(stargazer)
# wage1 was loaded in the previous chunk; collapse it to means by years of schooling
wage2<-wage1%>%group_by(educ)%>%summarize(wage=mean(wage), count=n())
reg2<-lm(wage~educ, weights = count, data=wage2)
summary(reg2)
#stargazer(reg2, header = F, single.row = TRUE, no.space = T, dep.var.labels.include = FALSE,dep.var.caption = "Hourly Wages", type='html', omit.stat=c("rsq", "ser", "f"))
```
]
]

---

# Fits and Residuals

As you just saw, regression breaks any dependent variable into two pieces:

`$$Y_{i}=\hat{Y_{i}}+e_{i}$$`

`\(\hat{Y}_{i}\)` is the fitted value, or the part of `\(Y_{i}\)` that the model explains. The residual `\(e_{i}\)` is what is left over. Some of the residuals' properties:

1. Regression residuals have expectation zero (i.e., `\(E(e_{i})=0\)`)

2. Regression residuals are uncorrelated with all the regressors that made them: `\(E(X_{ik}e_{i})=0\)` for each regressor `\(X_{ik}\)`. In other words, if you regress `\(e_{i}\)` on `\(X_{1i}, X_{2i}, \dots\)`, the estimated coefficients will be equal to zero.

3. Regression residuals are uncorrelated with the corresponding fitted values: `\(E(\hat{Y}_{i}e_{i})=0\)`

Those properties can be derived from the first-order conditions we used to get `\(\hat{\beta}_{0}\)` and `\(\hat{\beta}_{1}\)`.

---

# Regression for Dummies

An important special case is the bivariate regression with a dummy regressor (we saw it [here](https://guerramarcelino.github.io/Econ474/Rlabs/lab1#equivalence-of-differences-in-means-and-regression)). Say `\(Z_{i}\)` takes on two values: 0 and 1. Hence, one can write

`$$E[Y_{i}|Z_{i}=0]=\beta_{0} \\ E[Y_{i}|Z_{i}=1]=\beta_{0}+\beta_{1}$$`

so that `\(\beta_{1}=E[Y_{i}|Z_{i}=1]-E[Y_{i}|Z_{i}=0]\)`. Using this notation, we can write

`$$E[Y_{i}|Z_{i}]=E[Y_{i}|Z_{i}=0]+(E[Y_{i}|Z_{i}=1]-E[Y_{i}|Z_{i}=0])Z_{i} = \beta_{0}+\beta_{1}Z_{i}$$`

Does it look familiar? This shows that `\(E[Y_{i}|Z_{i}]\)` is a linear function of `\(Z_{i}\)`, with slope `\(\beta_{1}\)` and intercept `\(\beta_{0}\)`. Because the CEF with a single dummy variable is linear, regression fits this CEF perfectly.

---

# Controlling for Variables

.panelset[

.panel[.panel-name[Regression Anatomy]

The most exciting regressions include more than one variable.
Let's recap bivariate regression:

`$$\hat{\beta_{0}}= \bar{Y} - \hat{\beta_{1}}\bar{X}\\ \hat{\beta_{1}}=\frac{Cov(X_{i}, Y_{i})}{Var(X_{i})}$$`

**With multiple regressors**, the `\(k\)`-th slope coefficient is:

`$$\beta_{k}=\frac{Cov(Y_{i}, \tilde{x}_{ki})}{V(\tilde{x}_{ki})}$$`

where `\(\tilde{x}_{ki}\)` is the residual from a regression of `\(x_{ki}\)` on all the other covariates. Each coefficient in a multivariate regression is the bivariate slope coefficient for the corresponding regressor, after "partialling out" the other variables in the model.

]

.panel[.panel-name[FWL]

In other words, you can think about residuals as the part of `\(Y\)` that has nothing to do with `\(X\)`. Back to the wages *versus* education example, we saw that the predicted value for `\(Wage\)` given `\(Educ=3\)` was 0.718, far from the observed value of 2.92. It looks like education can only be responsible for 0.718, and the extra 2.202 must come from some other part of the data generating process.

Let's expand the analysis and include other variables such as `experience`. We would like to know how much of the relationship between wage and education is not explained by experience. To do that, we can:

1. Run a regression between `\(Y\)` (wage) and `\(Z\)` (experience), and get the residuals `\(Y^{R}\)`

2. Run a regression between `\(X\)` (education) and `\(Z\)` (experience), and get the residuals `\(X^{R}\)`

3. Run a regression between `\(Y^{R}\)` and `\(X^{R}\)`

This particular set of calculations is known as the **Frisch-Waugh-Lovell theorem**.

**What if you do not observe a variable you should include in this model (e.g., ability)?**

]

.panel[.panel-name[FWL Results]

.pull-left[

* Since `\(Y^{R}\)` and `\(X^{R}\)` have had the parts of `\(Y\)` and `\(X\)` that can be explained with `\(Z\)` removed, the relationship between `\(Y^{R}\)` and `\(X^{R}\)` is the part of the relationship between `\(Y\)` and `\(X\)` that is not explained by `\(Z\)`

* During this process, we are washing out all the variation related to `\(Z\)`, in effect not allowing `\(Z\)` to vary. This is why we call the process "controlling for `\(Z\)`"/"adjusting for `\(Z\)`".

]

.pull-right[

<table style="text-align:center"><tr><td colspan="2" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"></td><td>YR</td></tr>
<tr><td></td><td colspan="1" style="border-bottom: 1px solid black"></td></tr>
<tr><td colspan="2" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">XR</td><td>0.644<sup>***</sup> (0.054)</td></tr>
<tr><td style="text-align:left">Constant</td><td>0.000 (0.142)</td></tr>
<tr><td colspan="2" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">Observations</td><td>526</td></tr>
<tr><td style="text-align:left">Adjusted R<sup>2</sup></td><td>0.214</td></tr>
<tr><td colspan="2" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"><em>Note:</em></td><td style="text-align:right"><sup>*</sup>p<0.1; <sup>**</sup>p<0.05; <sup>***</sup>p<0.01</td></tr>
</table>

]
]

.panel[.panel-name[Multiple regression]

.pull-left[

If you run the regression model

`$$Wage_{i}=\beta_{0}+\beta_{1}Educ_{i}+\beta_{2}Exper_{i}+\varepsilon_{i}$$`

you'll get the same result for the variable of interest, `\(Educ\)`, as the sketch below confirms. The OLS estimate for `\(\beta_{1}\)` represents the relationship between earnings and schooling *conditional on experience*. **Even better**, we can add more controls such as gender, race, etc.
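A minimal sketch (an addition to these notes) checking the equivalence with the same CPS data: the coefficient on education from the multivariate regression and the coefficient from the FWL residual-on-residual regression coincide (0.644 in the tables).

```r
library(wooldridge)
data("wage1", package = "wooldridge")

# Multivariate regression: wage on education and experience
long <- lm(wage ~ educ + exper, data = wage1)

# FWL steps: residualize wage and educ on exper, then regress the residuals
YR  <- residuals(lm(wage ~ exper, data = wage1))
XR  <- residuals(lm(educ ~ exper, data = wage1))
fwl <- lm(YR ~ XR)

coef(long)["educ"]   # same number...
coef(fwl)["XR"]      # ...as the FWL coefficient
```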
.center[ ![](figs/mind.gif)]

]

.pull-right[

<table style="text-align:center"><tr><td colspan="2" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"></td><td>Hourly Wages</td></tr>
<tr><td></td><td colspan="1" style="border-bottom: 1px solid black"></td></tr>
<tr><td colspan="2" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">educ</td><td>0.644<sup>***</sup> (0.054)</td></tr>
<tr><td style="text-align:left">exper</td><td>0.070<sup>***</sup> (0.011)</td></tr>
<tr><td style="text-align:left">Constant</td><td>-3.391<sup>***</sup> (0.767)</td></tr>
<tr><td colspan="2" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">Observations</td><td>526</td></tr>
<tr><td style="text-align:left">Adjusted R<sup>2</sup></td><td>0.222</td></tr>
<tr><td colspan="2" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"><em>Note:</em></td><td style="text-align:right"><sup>*</sup>p<0.1; <sup>**</sup>p<0.05; <sup>***</sup>p<0.01</td></tr>
</table>

]
]
]

---

# Building Models with Logs I

Frequently, economists transform variables using the natural logarithm. For instance, wages are approximately log-normally distributed, and you might want to use `\(ln(Wage)\)` as a dependent variable instead of `\(Wage\)` to model the relationship between earnings and schooling - in that way, `\(ln(Wage)\)` would be approximately normally distributed. Taking the log also reduces the impact of outliers, and the estimates have a convenient interpretation.

To see what happens, let's go back to the bivariate regression with dummies example, modifying the dependent variable:

`$$ln(Y_{i})=\beta_{0}+\beta_{1}Z_{i}+\varepsilon_{i}$$`

where `\(Z_{i}\)` takes on two values: 0 and 1. We can rewrite the equation as

`$$E[ln(Y_{i})|Z_{i}]=\beta_{0}+\beta_{1}Z_{i}$$`

and the regression, in this case, fits the CEF perfectly. Suppose we engineer a *ceteris paribus* change in `\(Z_{i}\)` for individual `\(i\)`. This reveals potential outcome `\(Y_{0i}\)` when `\(Z_{i}=0\)` and `\(Y_{1i}\)` when `\(Z_{i}=1\)` (cont.)

---

# Building Models with Logs II

.panelset[

.panel[.panel-name[Interpretation]

Rewriting the equation for the log of potential outcomes:

`$$ln(Y_{0i})=\beta_{0}+\varepsilon_{i}\\ ln(Y_{1i})=\beta_{0}+\beta_{1}+\varepsilon_{i}$$`

The difference in potential outcomes is

`$$ln(Y_{1i})-ln(Y_{0i})=\beta_{1}$$`

Further rearranging this term:

`$$\beta_{1}=ln(\frac{Y_{1i}}{Y_{0i}})=ln(1+\frac{Y_{1i}-Y_{0i}}{Y_{0i}})=ln(1+\Delta \%Y_{p})\approx\Delta \%Y_{p}$$`

assuming a small `\(\Delta \%Y_{p}\)`. `\(\Delta \%Y_{p}\)` is shorthand for the **percentage change in potential outcomes induced by `\(Z_{i}\)`**.

]

.panel[.panel-name[Example: log(Wage)]

.pull-left[

Running the earnings vs. schooling regression in the *log-linear* form

`$$ln(Wage_{i})=\beta_{0}+\beta_{1}Educ_{i}+\varepsilon_{i}$$`

you get the results displayed in the table.
The interpretation of `\(\beta_{1}\)` is the following: **an additional year of education is associated with hourly wages that are, on average, about 8.3% higher.**

]

.pull-right[

<table style="text-align:center"><tr><td colspan="2" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"></td><td>Hourly Wages</td></tr>
<tr><td></td><td colspan="1" style="border-bottom: 1px solid black"></td></tr>
<tr><td colspan="2" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">educ</td><td>0.083<sup>***</sup> (0.008)</td></tr>
<tr><td style="text-align:left">Constant</td><td>0.584<sup>***</sup> (0.097)</td></tr>
<tr><td colspan="2" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">Observations</td><td>526</td></tr>
<tr><td style="text-align:left">Adjusted R<sup>2</sup></td><td>0.184</td></tr>
<tr><td colspan="2" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"><em>Note:</em></td><td style="text-align:right"><sup>*</sup>p<0.1; <sup>**</sup>p<0.05; <sup>***</sup>p<0.01</td></tr>
</table>

]
]
]

---

# Regression Standard Errors

.panelset[

.panel[.panel-name[SE: Simple Regression]

So far, we did not pay much attention to the fact that our data come from samples. Just like [sample means](https://guerramarcelino.github.io/Econ474/Rlabs/lab2), sample regression estimates are subject to **sampling variance**. Every time we draw a new sample from the same population to estimate the same regression model, we might get different results. Again, we need a way to quantify the uncertainty that arises from sampling.

In the regression framework, we also measure variability with the standard error. In a bivariate regression `\(Y_{i}=\beta_{0}+\beta_{1}X_{i}+\varepsilon_{i}\)`, the standard error of the slope can be written as

`$$SE(\hat{\beta}_{1})=\frac{\sigma_{e}}{\sigma_{X}\sqrt{n-2}}$$`

where `\(\sigma_{e}\)` is the standard deviation of the regression residuals, and `\(\sigma_{X}\)` is the standard deviation of the regressor `\(X_{i}\)`. Like the standard error of a sample average, regression SEs decrease with *i)* a larger sample size `\(\uparrow n\)` and *ii)* more variability in the explanatory variable `\(\uparrow \sigma_{X}\)`. When the residual variance is large, regression estimates are not precise - in this case, the line does not fit the dots very well.

]

.panel[.panel-name[SE: Multiple Regression]

In a multivariate model

`$$Y_{i}=\beta_{0}+\sum_{k=1}^{K}\beta_{k}X_{ki}+\varepsilon_{i}$$`

the standard error of the `\(k\)`-th sample slope, `\(\hat{\beta}_{k}\)`, is:

`$$SE(\hat{\beta}_{k})=\frac{\sigma_{e}}{\sigma_{\tilde{X}_{k}}\sqrt{n-p}}$$`

where `\(p\)` is the number of parameters to be estimated, and `\(\sigma_{\tilde{X}_{k}}\)` is the standard deviation of `\(\tilde{X}_{ki}\)`, the residual from a regression of `\(X_{ki}\)` on all other regressors (remember the FWL theorem?). As you add more and more explanatory variables to the regression, `\(\sigma_{e}\)` falls. On the other hand, `\(\sigma_{\tilde{X}_{k}}\)` also gets smaller, since the additional regressors might explain some of the variation in `\(X_{ki}\)`. The net effect of these changes in the numerator and the denominator can be either more or less precision.
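A minimal sketch (an addition to these notes) applying this formula to the education coefficient in a regression of log wages on education and experience; it mirrors the simple-regression calculation on the next panel and uses the FWL residual of education.

```r
library(wooldridge)
data("wage1", package = "wooldridge")

reg <- lm(lwage ~ educ + exper, data = wage1)
sde <- sd(residuals(reg))                                # sigma_e
educ_tilde <- residuals(lm(educ ~ exper, data = wage1))  # X-tilde for educ
n <- nrow(wage1)
p <- 3                                                   # intercept + two slopes

SE_educ <- sde / (sd(educ_tilde) * sqrt(n - p))
SE_educ   # should match the SE on educ reported by summary(reg)
```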
] .panel[.panel-name[Example: log(Wage)] ```r options(scipen=999) data("wage1", package = "wooldridge") # load data ### Standard deviation of educ sdX<-sd(wage1$educ) ### Number of observations in the sample n<-nrow(wage1) ### Regression residuals reg<-lm(lwage~educ, data=wage1) sde<-sd(residuals(reg)) ### Standard error of beta1 SE_educ<-sde/(sqrt(n-2)*sdX) SE_educ ``` [1] 0.007566694 ] ] --- class: inverse, middle, center # Regression as a Matchmaker --- # Example: Private School Effects * Students who attended a private four-year college in America paid an average of about $31,875 in tuition and fees in the 2018-19 period, while students who went to a public university in their home state paid, on average, 9,212 dollars - the yearly difference in tuition is considerable. Is it worth it? * Comparisons between earnings of those who went to different schools may reveal a significant gap favoring elite-college alumni. Indeed, on average, the wage difference is **14% in favor of those who went private**. Is that a fair comparison? * Differences in test scores, family background, motivation, and perhaps other skills and talents affect future earnings. Without dealing with *selection bias*, one cannot claim that there is a causal effect of private education on salaries * What if we use regression to control for family income, SAT scores, and other covariates that we can observe? * Still, there are factors complicated to quantify that might affect both attendance decisions and later earnings. For instance, how do we take into account diligence and family connections? --- # Dale and Krueger (2002) I * Instead of focusing on everything that might matter for college choice and earnings, Stacy Dale and Alan Krueger<sup>1</sup> came up with a compelling shortcut: the characteristics of colleges to which students applied and were admitted * Consider the case of two students who both applied to and were admitted to UMass and Harvard, but one goes to Harvard, and the other goes to UMass. The fact that those students applied to the same universities suggests they have similar ambition and motivation. Since both were admitted to the same places, one can assume that they might be able to succeed under the same circumstances. Hence, comparing earnings of those two similar students that took different paths would be fair * Dale and Krueger analyzed a large data set called College and Beyond (C&B). The C&B data contains information about thousands of students who enrolled in a group of selective U.S. universities, together with survey information collected multiple times from the year students took the SAT to long after most had graduated from college .footnote[ [1] Stacy Berg Dale and Alan B. Krueger, "Estimating the Payoff to Attending a More Selective College: An Application of Selection on Observables and Unobservables," *Quarterly Journal of Economics*, vol. 117, no. 4, November 2002.] --- # College Matching Matrix I .pull-left[ The table illustrates the idea of "college matching." There you have applications, admissions, and matriculation decisions for nine made-up students and six made-up universities. What happens when we compare the earnings of those who entered private with those who went to public universities? 
**Average earnings of Private university students:** `\(\frac{110,000+100,000+60,000+115,000+75,000}{5}=92,000\)`

**Average earnings of Public university students:** `\(\frac{110,000+30,000+90,000+60,000}{4}=72,500\)`

**That gap suggests a sizeable private school advantage.**

]

.pull-right[

![](figs/fig1.png)

.small[**Note: Angrist and Pischke (2014), Table 2.1**]

]

---

# College Matching Matrix II

.pull-left[

The table organizes nine students into four groups. Each group is defined by the set of schools to which they **applied** and were **admitted**. Within each group, students are most likely similar in characteristics that are hard to quantify, such as ability and ambition. Hence, within-group comparisons can be considered apples-to-apples comparisons.

Since students in groups C and D attended only private and only public schools, respectively, there is not much information there. Focusing only on groups A and B, the earnings gap between private and public education is

`$$\frac{3}{5} \times (-5,000) + \frac{2}{5} \times 30,000=9,000$$`

where -5,000 and 30,000 are the private school differentials in groups A and B, weighted by the share of the compared students in each group (3 of 5 in group A, 2 of 5 in group B).

]

.pull-right[

![](figs/fig1.png)

.small[**Note: Angrist and Pischke (2014), Table 2.1**]

]

---

# Dale and Krueger (2002) II

Using the C&B dataset, Dale and Krueger matched 5,583 students into 151 similar selectivity groups containing students who went to both public and private universities. Besides the "group variables" that capture the relative selectivity of the schools to which students applied and were admitted, the researchers also controlled for other variables such as SAT scores and parental income. The resulting regression model looks like this:

`$$ln(Earnings_{i})=\beta_{0}+\beta_{1}Private_{i}+ \sum_{j=1}^{150}\gamma_{j}GROUP_{ji}+\delta_{1}SAT_{i}+\delta_{2}PI_{i}+\varepsilon_{i}$$`

where `\(\beta_{1}\)` is the treatment effect of interest: the extent to which earnings differ between students who attended a private school and students who went to public universities. The model also controls for the 151 selectivity groups (one group serves as the omitted reference category, so 150 dummies enter the regression): the variable `\(GROUP_{ji}\)` equals one when student `\(i\)` is in group `\(j\)` and zero otherwise. **The idea is to control for the sets of schools to which students applied and were admitted** to bring the comparison as close as possible to an apples-to-apples comparison.

---

# Dale and Krueger (2002) III

.pull-left[

* The table reports the results of six regressions. The first column captures the difference (in %) in earnings between those who attended a private school and everyone else: a pretty large difference (around 14%)

* Adding more and more controls that the researchers observe, there is still a gap of around 9% (columns 2 and 3) - not as large as 14%, but still relevant

* However, once the selectivity-group dummies are included (columns 4 to 6), the gap shrinks and is no longer statistically significant: the private school premium is gone

]

.pull-right[

![](figs/fig2_2.png)

]

---

# Regression and Causality I

.center[Regression reduces - maybe even eliminates - selection bias **as long as you have a credible identification strategy**]

Let's go back to our potential outcomes framework:

`$$Y_{i}=Y_{0i}+D_{i}(Y_{1i}-Y_{0i})$$`

where we get to see either `\(Y_{1i}\)` or `\(Y_{0i}\)`, but never both. We hope to measure the average `\(Y_{1i}-Y_{0i}\)`.
The naive comparison gives us

.bg-washed-yellow.b--orange.ba.bw2.br3.shadow-5.ph4.mt1[
`\begin{equation*} \underbrace{E[Y_{i} | D_{i}=1]-E[Y_{i} | D_{i}=0]}_{\text{Observed difference}}= \underbrace{E[Y_{1i}|D_{i}=1]-E[Y_{0i}|D_{i}=1]}_{\text{Average treatment effect on the treated}}+\underbrace{E[Y_{0i}|D_{i}=1]-E[Y_{0i}|D_{i}=0]}_{\text{Selection bias}} \end{equation*}`
]

In the example of private *versus* public education, students who go to private schools would have higher future earnings after college anyway, and the positive selection bias overstates the benefits of private education.

---

# Regression and Causality II

The **conditional independence assumption** (CIA) asserts that conditional on observed characteristics, `\(X_{i}\)`, the selection bias disappears. Formally,

`$$Y_{1i}, Y_{0i} \perp\!\!\!\perp D_{i}|X_{i}$$`

Given the CIA, conditional-on-`\(X_{i}\)` comparisons have a causal interpretation:

`$$E[Y_{i}|X_{i}, D_{i}=1]-E[Y_{i}|X_{i}, D_{i}=0]=E[Y_{1i}-Y_{0i}|X_{i}]$$`

Back to the private *versus* public education example:

`$$E[Y_{0i}|Private_{i};GROUP_{i},SAT_{i}, lnPI_{i}]=E[Y_{0i}|GROUP_{i},SAT_{i}, lnPI_{i}]$$`

Under the CIA, once we condition on the selectivity groups, SAT scores, and (log) parental income, private school attendance is unrelated to the potential outcome `\(Y_{0i}\)`, and the comparison has a causal interpretation.

---

class: inverse, middle, center

# Omitted Variable Bias (OVB)

---

# OVB I

Regression is a way to make **other things equal**. However, it can only generate fair comparisons with respect to the variables included on the right-hand side - failing to include what matters still leaves us with selection bias. The "regression version" of selection bias is what we call **omitted variable bias (OVB)**.

Let us go back to our example of earnings *versus* schooling. We already ran both the *long regression*

`$$\underbrace{Y_{i}}_{Wage_{i}}=\beta_{0}^{l}+\beta_{1}^{l}\underbrace{X_{1i}}_{Educ_{i}}+\beta_{2}\underbrace{X_{2i}}_{Exper_{i}}+\varepsilon_{i}^{l}$$`

and the *short regression*

`$$\underbrace{Y_{i}}_{Wage_{i}}=\beta_{0}^{s}+\beta_{1}^{s}\underbrace{X_{1i}}_{Educ_{i}}+\varepsilon_{i}^{s}$$`

The OVB formula describes the relationship between the short and long coefficients as follows:

`$$\beta_{1}^{s}=\beta_{1}^{l}+\pi_{21} \beta_{2}$$`

where `\(\beta_{2}\)` is the coefficient of `\(X_{2i}\)` in the long regression, and `\(\pi_{21}\)` is the coefficient of `\(X_{1i}\)` in a regression of `\(X_{2i}\)` on `\(X_{1i}\)`.
--- # OVB II .panelset[ .panel[.panel-name[Education and Experience] .pull-left[ .center[**Short Regression**] <table style="text-align:center"><tr><td colspan="2" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"></td><td>Hourly Wages</td></tr> <tr><td></td><td colspan="1" style="border-bottom: 1px solid black"></td></tr> <tr><td colspan="2" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">educ</td><td>0.083<sup>***</sup> (0.008)</td></tr> <tr><td style="text-align:left">Constant</td><td>0.584<sup>***</sup> (0.097)</td></tr> <tr><td colspan="2" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">Observations</td><td>526</td></tr> <tr><td style="text-align:left">Adjusted R<sup>2</sup></td><td>0.184</td></tr> <tr><td colspan="2" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"><em>Note:</em></td><td style="text-align:right"><sup>*</sup>p<0.1; <sup>**</sup>p<0.05; <sup>***</sup>p<0.01</td></tr> </table> ] .pull-right[ .center[**Long Regression**] <table style="text-align:center"><tr><td colspan="2" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"></td><td>Hourly Wages</td></tr> <tr><td></td><td colspan="1" style="border-bottom: 1px solid black"></td></tr> <tr><td colspan="2" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">educ</td><td>0.098<sup>***</sup> (0.008)</td></tr> <tr><td style="text-align:left">exper</td><td>0.010<sup>***</sup> (0.002)</td></tr> <tr><td style="text-align:left">Constant</td><td>0.217<sup>**</sup> (0.109)</td></tr> <tr><td colspan="2" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">Observations</td><td>526</td></tr> <tr><td style="text-align:left">Adjusted R<sup>2</sup></td><td>0.246</td></tr> <tr><td colspan="2" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"><em>Note:</em></td><td style="text-align:right"><sup>*</sup>p<0.1; <sup>**</sup>p<0.05; <sup>***</sup>p<0.01</td></tr> </table> ] ] .panel[.panel-name[OVB Formula] .pull-left[ .center[**Short Regression**] ```r library(wooldridge) # load data data("wage1", package = "wooldridge") short_reg<-lm(lwage~educ, data=wage1) beta_short<-short_reg$coefficients[2] beta_short ``` educ 0.08274437 ] .pull-right[ .center[**Long Regression**] ```r long_reg<-lm(lwage~educ+exper, data=wage1) beta_long<-long_reg$coefficients[2] beta2<-long_reg$coefficients[3] exper_on_educ<-lm(exper~educ, data=wage1) pi12<-exper_on_educ$coefficients[2] beta_long+pi12*beta2 ``` educ 0.08274437 ```r ## which is equal to beta_short ``` ] ] ] --- # OVB III In other words, the omitted variable bias (OVB) formula connects regression coefficients in models with different controls. Consider the following **long regression** of wages on schooling `\((s_{i})\)`, controlling for ability `\((A_{i})\)`: `$$Y_{i} = \alpha+\rho s_{i}+ A^{'}_{i} \gamma + \varepsilon_{i}$$` Since ability is hard to measure, what are the consequences of omitting that variable? `$$\dfrac{Cov(Y_{i}, s_{i})}{V(s_{i})}=\rho+\gamma^{'}\delta_{As}$$` where `\(\delta_{As}\)` is the vector of coefficients from regressions of the elements of `\(A_{i}\)` on `\(s_{i}\)`. In English, *short equals long plus the effect of omitted times the regression of omitted on included*. 
**When omitted and included are uncorrelated, short `\(=\)` long.**

---

# Regression Sensitivity Analysis

.pull-left[

* We are never sure whether a given set of controls is enough to eliminate selection bias. However, we may ask one important question: how sensitive are regression results to changes in the control variables?

* Usually, our confidence in regression estimates of causal effects grows when the estimated treatment effects are insensitive to adding or dropping particular control variables, as long as a few core controls are always included in the model

* Back to Dale and Krueger (2002): once they take the selectivity-group dummies into account (columns 4 to 6), the estimated effect of private education barely moves as additional covariates are included

]

.pull-right[

![](figs/fig2_2.png)

]

---

class: inverse, middle, center

# Fixing Standard Errors

---

# Your Standard Errors are Probably Wrong

.pull-left[

* We saw that our regression estimates are subject to sampling variation, and we need to account for that uncertainty by estimating the standard error `\(SE(\hat{\beta}_{k})\)`. With that estimate, one can calculate test statistics to evaluate statistical significance, build confidence intervals, etc.

* Standard errors computed using `\(SE(\hat{\beta}_{k})=\frac{\sigma_{e}}{\sigma_{\tilde{X}_{k}}\sqrt{n-p}}\)` are nowadays considered old-fashioned because that formula is derived assuming the variance of the residuals is unrelated to the regressors - the *homoskedasticity* assumption

* Most of the time, that is a heroic assumption. For instance, you can see that among people with higher levels of education (10 years +), salaries vary a lot more than among individuals with fewer years of education

]

.pull-right[
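The figure that belongs here plots hourly wages against education. A minimal sketch (an addition to these notes, assuming the `dplyr` package) makes the same point numerically by comparing the spread of wages across education groups:

```r
library(wooldridge)
library(dplyr)
data("wage1", package = "wooldridge")

# Wages are much more spread out among the highly educated
wage1 %>%
  mutate(group = ifelse(educ >= 10, "10+ years", "< 10 years")) %>%
  group_by(group) %>%
  summarize(sd_wage = sd(wage), n = n())
```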
.small[**Note:** The data come from the 1976 Current Population Survey in the USA.]
]

---

# Fixing Standard Errors

.panelset[

.panel[.panel-name[Robust Standard Errors]

* Given the problem with the homoskedasticity assumption, one can compute standard errors that do not require the error variance to be constant across values of the regressors. Robust standard errors `\(RSE (\hat{\beta})\)` allow for the possibility that the regression line fits more or less well for different values of `\(X_{i}\)` - a scenario known as heteroskedasticity

* If the residuals turn out to be homoskedastic, the robust standard error estimates should be close to `\(SE(\hat{\beta})\)`. However, if the residuals are indeed heteroskedastic, estimates of `\(RSE(\hat{\beta})\)` provide a much better picture of the sampling variance

]

.panel[.panel-name[R code]

```r
library(wooldridge)
data("wage1", package = "wooldridge") # load data
library(fixest)
reg1<-feols(wage~educ, data=wage1, se="standard")
reg2<-feols(wage~educ, data=wage1, se="hetero")
etable(reg1, reg2)
```

```
##                               reg1               reg2
## Dependent Var.:               wage               wage
##
## (Intercept)       -0.9049 (0.6850)   -0.9049 (0.7255)
## educ            0.5414*** (0.0532) 0.5414*** (0.0613)
## _______________ __________________ __________________
## S.E. type                 Standard Heteroskedas.-rob.
## Observations                   526                526
## R2                         0.16476            0.16476
## Adj. R2                    0.16316            0.16316
```
]
]
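Another common route to heteroskedasticity-robust standard errors for a plain `lm()` fit - a sketch assuming the `sandwich` and `lmtest` packages (not used in the original slides); the HC1 correction should give numbers close to the `se = "hetero"` column above.

```r
library(wooldridge)
library(sandwich)   # heteroskedasticity-consistent covariance estimators
library(lmtest)     # coeftest() for inference with a user-supplied vcov
data("wage1", package = "wooldridge")

reg <- lm(wage ~ educ, data = wage1)
coeftest(reg, vcov = vcovHC(reg, type = "HC1"))   # robust (HC1) standard errors
```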