For most research questions, we are interested in the relationship between two (or more) variables, i.e., what learning about one variable tells us about the other
Consider the classic example of salary and years of education. The scatterplot shows the relationship between those two variables: we would like to explain the variation in hourly wages with education levels
It seems that there is a positive relationship between salary and education
Note: The data come from the 1976 Current Population Survey in the USA.
Frequently, we are interested in conditional expectations: for example, the expected salary for people who have completed three years of schooling. This can be written as E[Y_{i}|X_{i}=x], where Y_{i} is salary, and X_{i}=x can be read as "when X_{i} equals the particular value x" - in our case, when education equals 3 years
Conditional expectations tell us how one variable's population average changes as we move the conditioning variable over its values. For every value of the conditioning variable, we might get a different average of the dependent variable Y_{i}, and the collection of all those averages is called the conditional expectation function
The figure shows the conditional expectation of hourly wages given years of education. Despite its ups and downs, the earnings-schooling CEF is upward-sloping
Assuming that the relationship between earnings and education is linear, we can apply the best-known line-fitting method: Ordinary Least Squares (OLS)
OLS picks the line that yields the lowest sum of squared residuals. A residual is the difference between an observation's actual value and the value the line assigns to it, Y-\hat{Y}
Our model of earnings and education can be written as
\underbrace{Y_{i}}_{Wage_{i}}=\beta_{0}+\beta_{1}\underbrace{X_{i}}_{Educ_{i}}+\varepsilon_{i} where \beta's are unknown parameters to be estimated and \varepsilon is the unobserved error term
In the beginning, there was only a scatter plot. And then God said "let us fit a line"
Ultimately, the goal is to minimize \sum e_{i}^{2}, where e_{i}=Y_{i}-\underbrace{\hat{Y}_{i}}_{Prediction}. In other words, we estimate \beta_{0} and \beta_{1} by minimizing the sum of squared deviations from the regression line:
(\hat{\beta}_{0}, \hat{\beta}_{1})=\underset{\beta_{0}, \beta_{1}}{\arg\min} \sum_{i=1}^{n} (Y_{i}-\beta_{0}-\beta_{1}X_{i})^{2} which leads to: \hat{\beta_{0}}= \bar{Y} - \hat{\beta_{1}}\bar{X} \\ \hat{\beta_{1}}=\frac{\sum_{i=1}^{n}(X_{i}-\bar{X})(Y_{i}-\bar{Y})}{\sum_{i=1}^{n}(X_{i}-\bar{X})^{2}}=\frac{Cov(X_{i}, Y_{i})}{Var(X_{i})}
OLS gives us the best linear approximation of the relationship between X and Y
\widehat{Wage_{i}}=-0.905+0.541Educ_{i}
Hourly Wages | |
educ | 0.541*** (0.053) |
Constant | -0.905 (0.685) |
Observations | 526 |
Adjusted R2 | 0.163 |
Note: | *p<0.1; **p<0.05; ***p<0.01 |
#library(stargazer)# install.packages("wooldridge") if you don't have the package# install.packages("stargazer") if you don't have the package#library(stargazer)library(wooldridge)data("wage1", package = "wooldridge") # load datareg<-lm(wage~educ, data=wage1)summary(reg)#stargazer(reg, header = F, single.row = TRUE, no.space = T, dep.var.labels.include = FALSE,dep.var.caption = "Hourly Wages", type='html', omit.stat=c("rsq", "ser", "f"))
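To connect the closed-form expressions above to this output, here is a minimal check (a sketch, not part of the original slides) that computes \hat{\beta}_{1} as Cov(X,Y)/Var(X) and \hat{\beta}_{0} as \bar{Y}-\hat{\beta}_{1}\bar{X} by hand; the numbers should match the lm() estimates.

library(wooldridge)
data("wage1", package = "wooldridge")
# Hand-computed OLS coefficients from the covariance/variance formulas
b1_hand <- cov(wage1$educ, wage1$wage) / var(wage1$educ)
b0_hand <- mean(wage1$wage) - b1_hand * mean(wage1$educ)
c(b0_hand, b1_hand)                  # approximately -0.905 and 0.541
coef(lm(wage ~ educ, data = wage1))  # same numbers from lm()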
Fun fact: the solution to this problem
\hat{\beta}=\underset{\beta}{\arg\min}\; E\left[ (E[Y_{i}|X_{i}]-X_{i}^{'}\beta)^{2} \right] is the same as the solution to this one
\hat{\beta}=\underset{\beta}{\arg\min}\; E\left[ (Y_{i}-X_{i}^{'}\beta)^{2} \right] The implication is that you only need conditional means to run regressions. For example, the table shows the results of a linear regression, weighted by the number of observations at each education level, that uses means by years of schooling instead of individual-level data. As you can see, \hat{\beta}_{0}=-0.905 and \hat{\beta}_{1}=0.541, the same as before (when using individual-level data).
Hourly Wages | |
educ | 0.541*** (0.081) |
Constant | -0.905 (1.043) |
Observations | 18 |
Adjusted R2 | 0.719 |
Note: | *p<0.1; **p<0.05; ***p<0.01 |
library(stargazer)
library(dplyr)   # needed for %>%, group_by() and summarize()
wage2 <- wage1 %>%
  group_by(educ) %>%
  summarize(wage = mean(wage), count = n())
reg2 <- lm(wage ~ educ, weights = count, data = wage2)
summary(reg2)
# stargazer(reg2, header = F, single.row = TRUE, no.space = T,
#           dep.var.labels.include = FALSE, dep.var.caption = "Hourly Wages",
#           type = 'html', omit.stat = c("rsq", "ser", "f"))
As you just saw, regression breaks any dependent variable into two pieces:
Y_{i}=\hat{Y_{i}}+e_{i} \hat{Y}_{i} is the fitted value or the part of Y_{i} that the model explains. The residual e_{i} is what is left over.
Some properties of the residuals:
Regression residuals have expectation zero (i.e., E(e_{i})=0)
Regression residuals are uncorrelated with all the regressors that made them: E(X_{ik}e_{i})=0 for each regressor X_{ik}. In other words, if you regress e_{i} on X_{1i}, X_{2i}, \dots, the estimated coefficients will all equal zero.
Regression residuals are uncorrelated with the corresponding fitted values: E(\hat{Y}_{i}e_{i})=0
Those properties can be derived using the first-order conditions we used to get \hat{\beta}_{0} and \hat{\beta}_{1}.
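As a quick numeric check of these properties (a sketch, not part of the original slides), the mean of the residuals, their covariance with the regressor, and their covariance with the fitted values should all be zero up to floating-point error.

library(wooldridge)
data("wage1", package = "wooldridge")
reg <- lm(wage ~ educ, data = wage1)
e <- residuals(reg)
mean(e)               # ~ 0
cov(e, wage1$educ)    # ~ 0: residuals uncorrelated with the regressor
cov(e, fitted(reg))   # ~ 0: residuals uncorrelated with the fitted values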
An important case using regression is the bivariate regression with a dummy regressor (we saw it here). Say Z_{i} takes on two values: 0 and 1. Hence, one can write it as
E[Y_{i}|Z_{i}=0]=\beta_{0} \\ E[Y_{i}|Z_{i}=1]=\beta_{0}+\beta_{1}
so that \beta_{1}=E[Y_{i}|Z_{i}=1]-E[Y_{i}|Z_{i}=0].
Using this notation, we can write
E[Y_{i}|Z_{i}]=E[Y_{i}|Z_{i}=0]+(E[Y_{i}|Z_{i}=1]-E[Y_{i}|Z_{i}=0])Z_{i} = \beta_{0}+\beta_{1}Z_{i}
Does it look familiar? This shows that E[Y_{i}|Z_{i}] is a linear function of Z_{i}, with slope \beta_{1} and intercept \beta_{0}. Because the CEF with a single dummy variable is linear, regression fits this CEF perfectly.
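To see this in the data, here is a minimal sketch (not part of the original slides) using wage1 with an illustrative dummy for 12 or more years of schooling; the cutoff is an assumption made only for this example. The intercept reproduces the Z=0 group mean and the slope reproduces the difference in group means.

library(wooldridge)
data("wage1", package = "wooldridge")
wage1$z <- as.numeric(wage1$educ >= 12)            # illustrative dummy
coef(lm(wage ~ z, data = wage1))                   # intercept and slope
c(mean(wage1$wage[wage1$z == 0]),                                   # E[Y | Z = 0]
  mean(wage1$wage[wage1$z == 1]) - mean(wage1$wage[wage1$z == 0]))  # difference in means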
The most exciting regressions include more than one variable. Let's recap bivariate regression:
\hat{\beta_{0}}= \bar{Y} - \hat{\beta_{1}}\bar{X}\\ \hat{\beta_{1}}=\frac{Cov(X_{i}, Y_{i})}{Var(X_{i})}
With multiple regressors, the k-th slope coefficient is:
\beta_{k}=\frac{Cov(Y_{i}, \tilde{x}_{ki})}{V(\tilde{x}_{ki})} where \tilde{x}_{ki} is the residual from a regression of x_{ki} on all the other covariates. Each coefficient in a multivariate regression is the bivariate slope coefficient for the corresponding regressor, after "partialling out" the other variables in the model.
In other words, you can think about residuals as the part of Y that has nothing to do with X. Back to the wages versus education example, we saw that the predicted value for Wage given Educ=3 was equal to 0.718, far from the observed value of 2.92. It looks like education can only be responsible for 0.718, and the extra 2.202 must be because of some other part of the data generating process.
Let's expand the analysis and include other variables, such as experience. We would like to know how much of the relationship between wage and education is not explained by experience. To do that, we can:
1. Regress Wage on Exper and save the residuals, Y^{R};
2. Regress Educ on Exper and save the residuals, X^{R};
3. Regress Y^{R} on X^{R}.
This particular set of calculations is known as the Frisch-Waugh-Lovell (FWL) theorem.
What if you do not observe a variable you should include in this model (e.g., ability)?
Since Y^{R} and X^{R} have had the parts of Y and X that can be explained with Z removed, the relationship between Y^{R} and X^{R} is the part of the relationship between Y and X that is not explained by Z
During this process, we are washing out all the variation related to Z, in effect not allowing Z to vary. This is why we call the process "controlling for Z"/"adjusting for Z".
YR | |
XR | 0.644*** (0.054) |
Constant | 0.000 (0.142) |
Observations | 526 |
Adjusted R2 | 0.214 |
Note: | *p<0.1; **p<0.05; ***p<0.01 |
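The table above can be reproduced directly with the three FWL steps. Here is a minimal sketch using the wage1 data (the variable names YR and XR mirror Y^{R} and X^{R}):

library(wooldridge)
data("wage1", package = "wooldridge")
YR <- residuals(lm(wage ~ exper, data = wage1))   # wage with experience partialled out
XR <- residuals(lm(educ ~ exper, data = wage1))   # education with experience partialled out
coef(lm(YR ~ XR))                                 # slope is 0.644, as in the table
coef(lm(wage ~ educ + exper, data = wage1))       # same educ coefficient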
If you run the regression model
Wage_{i}=\beta_{0}+\beta_{1}Educ_{i}+\beta_{2}Exper_{i}+\varepsilon_{i}
you'll get the same result for the variable of interest Educ. The OLS estimate for \beta_{1} represents the relationship between earnings and schooling conditional on experience. Even better, we can add more controls such as gender, race, etc.
Hourly Wages | |
educ | 0.644*** (0.054) |
exper | 0.070*** (0.011) |
Constant | -3.391*** (0.767) |
Observations | 526 |
Adjusted R2 | 0.222 |
Note: | *p<0.1; **p<0.05; ***p<0.01 |
Frequently, economists transform variables using the natural logarithm. For instance, wages are approximately log-normally distributed, so you might want to use ln(Wage) as the dependent variable instead of Wage to model the relationship between earnings and schooling - that way, ln(Wage) is approximately normally distributed. Taking the log also reduces the impact of outliers, and the estimates have a convenient interpretation.
To see what happens, let's go back to the bivariate regression with dummies example, modifying the dependent variable:
ln(Y_{i})=\beta_{0}+\beta_{1}Z_{i}+\varepsilon_{i} where Z_{i} takes on two values: 0 and 1. We can rewrite the equation as E[ln(Y_{i})|Z_{i}]=\beta_{0}+\beta_{1}Z_{i} and the regression, in this case, fits the CEF perfectly.
Suppose we engineer a ceteris paribus change in Z_{i} for individual i. This reveals potential outcome Y_{0i} when Z_{i}=0 and Y_{1i} when Z_{i}=1.
Rewriting the equation for log of potential outcomes:
ln(Y_{0i})=\beta_{0}+\varepsilon_{i}\\ ln(Y_{1i})=\beta_{0}+\beta_{1}+\varepsilon_{i}
The difference in potential outcomes is
ln(Y_{1i})-ln(Y_{0i})=\beta_{1} Further rearranging this term:
\beta_{1}=ln(\frac{Y_{1i}}{Y_{0i}})=ln(1+\frac{Y_{1i}-Y_{0i}}{Y_{0i}})=ln(1+\Delta \%Y_{p})\approx\Delta \%Y_{p} assuming a small \Delta \%Y_{p}.
\Delta \%Y_{p} is shorthand for the percentage change in potential outcomes induced by Z_{i}.
Running the earnings vs schooling regression in the log-linear form
ln(Wage_{i})=\beta_{0}+\beta_{1}Educ_{i}+\varepsilon_{i} you get the results that the table displays. The interpretation of \beta_{1} is the following: an additional year of education is associated with an hourly wage that is, on average, about 8.3% higher.
Hourly Wages | |
educ | 0.083*** (0.008) |
Constant | 0.584*** (0.097) |
Observations | 526 |
Adjusted R2 | 0.184 |
Note: | *p<0.1; **p<0.05; ***p<0.01 |
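As a quick check of the approximation behind that reading (a sketch using the 0.083 estimate from the table): the exact implied change is exp(0.083) - 1, slightly above the 8.3% approximation.

b1 <- 0.083     # estimated coefficient on educ from the table
b1              # approximate percentage change: 8.3%
exp(b1) - 1     # exact implied change: about 8.65%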
So far, we have not paid much attention to the fact that our data come from samples. Just like sample means, sample regression estimates are subject to sampling variance. Every time we draw a new sample from the same population to estimate the same regression model, we might get different results. Again, one needs to quantify the uncertainty that arises from sampling.
In the regression framework, we also measure variability with the standard error. In a bivariate regression Y_{i}=\beta_{0}+\beta_{1}X_{i}+\varepsilon_{i}, the standard error of the slope can be written as
SE(\hat{\beta}_{1})=\frac{\sigma_{e}}{\sigma_{X}\sqrt{n-2}} where \sigma_{e} is the standard deviation of the regression residuals, and \sigma_{X} is the standard deviation of the regressor X_{i}. Like the standard error of a sample average, regression SEs decrease with i) a larger sample size (\uparrow n) and ii) more variability in the explanatory variable (\uparrow \sigma_{X}). When the residual variance is large, regression estimates are not precise - in that case, the line does not fit the dots very well.
In a multivariate model
Y_{i}=\beta_{0}+\sum_{k=1}^{K}\beta_{k}X_{ki}+\varepsilon_{i} the standard error for the kth sample slope, \hat{\beta}_{k}, is:
SE(\hat{\beta}_{k})=\frac{\sigma_{e}}{\sigma_{\tilde{X}_{k}}\sqrt{n-p}}
where p is the number of parameters to be estimated, and \sigma_{\tilde{X}_{k}} is the standard deviation of \tilde{X}_{ki}, the residual from a regression of X_{ki} on all other regressors (remember the FWL theorem?). As you add more explanatory variables to the regression, \sigma_{e} falls. On the other hand, \sigma_{\tilde{X}_{k}} also gets smaller, since the additional regressors may explain some of the variation in X_{ki}. The net effect of these changes in the numerator and the denominator can be either increased or decreased precision.
options(scipen = 999)
data("wage1", package = "wooldridge") # load data
### Standard deviation of educ
sdX <- sd(wage1$educ)
### Number of observations in the sample
n <- nrow(wage1)
### Regression residuals
reg <- lm(lwage ~ educ, data = wage1)
sde <- sd(residuals(reg))
### Standard error of beta1
SE_educ <- sde / (sqrt(n - 2) * sdX)
SE_educ
[1] 0.007566694
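The same logic extends to the multivariate formula. Here is a minimal sketch (not from the original slides) for the educ coefficient in lwage ~ educ + exper: residualize educ on exper (FWL) and use n - p with p = 3; the result should match the standard error reported by summary(), up to rounding.

library(wooldridge)
data("wage1", package = "wooldridge")
n <- nrow(wage1)
long <- lm(lwage ~ educ + exper, data = wage1)
sde_long <- sd(residuals(long))                           # sigma_e of the long regression
educ_tilde <- residuals(lm(educ ~ exper, data = wage1))   # residualized educ (FWL)
p <- 3                                                    # intercept + 2 slopes
sde_long / (sd(educ_tilde) * sqrt(n - p))                 # SE of the educ coefficient
summary(long)$coefficients["educ", "Std. Error"]          # same number from summary()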
Students who attended a private four-year college in America paid an average of about $31,875 in tuition and fees in the 2018-19 academic year, while students who went to a public university in their home state paid, on average, $9,212 - the yearly difference in tuition is considerable. Is it worth it?
Comparisons between earnings of those who went to different schools may reveal a significant gap favoring elite-college alumni. Indeed, on average, the wage difference is 14% in favor of those who went private. Is that a fair comparison?
What if we use regression to control for family income, SAT scores, and other covariates that we can observe?
Instead of focusing on everything that might matter for college choice and earnings, Stacy Dale and Alan Krueger [1] came up with a compelling shortcut: the characteristics of colleges to which students applied and were admitted
Consider the case of two students who both applied to and were admitted to UMass and Harvard, but one goes to Harvard and the other goes to UMass. The fact that those students applied to the same universities suggests they have similar ambition and motivation. Since both were admitted to the same places, one can assume that they would be able to succeed under the same circumstances. Hence, comparing the earnings of these two similar students who took different paths would be a fair comparison
Dale and Krueger analyzed a large data set called College and Beyond (C&B). The C&B data contains information about thousands of students who enrolled in a group of selective U.S. universities, together with survey information collected multiple times from the year students took the SAT to long after most had graduated from college
[1] Stacy Berg Dale and Alan B. Krueger, "Estimating the Payoff to Attending a More Selective College: An Application of Selection on Observables and Unobservables," Quarterly Journal of Economics, vol. 117, no. 4, November 2002.
The table illustrates the idea of "college matching." There you have applications, admissions, and matriculation decisions for nine made-up students and six made-up universities.
What happens when we compare the earnings of those who entered private with those who went to public universities?
Average earnings of Private university students:
\frac{110,000+100,000+60,000+115,000+75,000}{5}=92,000
Average earnings of Public university students:
\frac{110,000+30,000+90,000+60,000}{4}=72,500
That gap suggests a sizeable private school advantage.
Note: Angrist and Pischke (2014), Table 2.1
The table organizes nine students into four groups. Each group is defined by the set of schools to which they applied and were admitted. Within each group, students are most likely similar in characteristics that are hard to quantify, such as ability and ambition.
Hence, within-group comparisons can be considered apples-to-apples comparisons. Since students in groups C and D attended only private and public schools, respectively, there is not much information there. Focusing only on groups A and B, the earnings gap between private and public education is
\left(\frac{3}{5} \times -5,000\right) + \left(\frac{2}{5} \times 30,000\right)=9,000 where -5,000 and 30,000 are the private school differentials in groups A and B, and the weights are each group's share of the five students in groups A and B.
Note: Angrist and Pischke (2014), Table 2.1
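As an arithmetic check, the overall estimate is just the group-size-weighted average of the two within-group differentials:

# Within-group private-public differentials and group sizes (Angrist & Pischke, Table 2.1)
weighted.mean(c(-5000, 30000), w = c(3, 2))   # = 9000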
Using the C&B dataset, Dale and Krueger matched 5,583 students into 151 similar selectivity groups containing students who went to both public and private universities. Besides the "group variables" that capture the relative selectivity of the schools to which students applied and were admitted, the researchers also controlled for other variables such as SAT scores and parental income.
The resulting regression model looks like this:
ln(Earnings_{i})=\beta_{0}+\beta_{1}Private_{i}+ \sum_{j=1}^{150}\gamma_{j}GROUP_{ji}+\delta_{1}SAT_{i}+\delta_{2}PI_{i}+\varepsilon_{i}
where \beta_{1} is the treatment effect of interest: the extent to which earnings differ for students who attended a private school compared to students who went to public universities. The model also controls for the selectivity groups: the variable GROUP_{ji} equals one when student i is in group j and zero otherwise (one group is omitted as the reference category). The idea is to control for the sets of schools to which students applied and were admitted, bringing the comparison as close as possible to an apples-to-apples comparison.
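In R, a specification of this form could be written as below. This is only an illustrative sketch: the C&B data are not publicly available, so the data frame cb and the variable names (earnings, private, group, sat, parental_income) are hypothetical stand-ins generated by simulation, not the authors' actual code or data.

# Simulated stand-in for the (proprietary) C&B data; all names are made up
set.seed(1)
n  <- 1000
cb <- data.frame(
  private         = rbinom(n, 1, 0.5),
  group           = factor(sample(1:151, n, replace = TRUE)),   # selectivity group
  sat             = rnorm(n, 1200, 100),
  parental_income = exp(rnorm(n, 11, 0.5))
)
# true private-school effect set to zero in this simulation
cb$earnings <- exp(8 + 0 * cb$private + 0.0005 * cb$sat +
                   0.2 * log(cb$parental_income) + rnorm(n, 0, 0.5))
# factor(group) expands into the GROUP_j dummies from the equation above
reg_dk <- lm(log(earnings) ~ private + group + sat + log(parental_income), data = cb)
coef(summary(reg_dk))["private", ]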
The table reports the results of six regressions. The first column captures the percentage difference in earnings between those who attended a private school and everyone else: a pretty large gap (around 14%)
After adding more of the controls that the researchers observe, there is still a gap of around 9% (columns 2 and 3) - not as large as 14%, but still sizeable
However, once the selectivity-group dummies are included (columns 4 to 6), the gap shrinks and is no longer statistically significant: the private school premium is gone
Regression reduces - maybe even eliminates - selection bias, as long as you have a credible identification strategy
Let's go back to our potential outcomes framework: Y_{i}=Y_{0i}+D_{i}(Y_{1i}-Y_{0i}) where we get to see either Y_{1i} or Y_{0i}, but never both. We hope to measure the average Y_{1i}-Y_{0i}. The naive comparison gives us
\begin{equation*} \underbrace{E[Y_{i} | D_{i}=1]-E[Y_{i} | D_{i}=0]}_{\text{Observed difference}}= \underbrace{E[Y_{1i}|D_{i}=1]-E[Y_{0i}|D_{i}=1]}_{\text{Average treatment effect on the treated}}+\underbrace{E[Y_{0i}|D_{i}=1]-E[Y_{0i}|D_{i}=0]}_{\text{Selection bias}} \end{equation*}
In the example of private versus public education, students who go to private schools would tend to have higher earnings after college anyway, and this positive selection bias overstates the benefits of private education.
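To make the decomposition concrete, here is a small simulation (a sketch with made-up parameters, not from the slides): the untreated potential outcome rises with ability, treatment take-up also rises with ability, and the naive comparison equals the ATT plus a positive selection-bias term.

set.seed(42)
n  <- 100000
ability <- rnorm(n)
Y0 <- 20 + 5 * ability + rnorm(n)         # untreated potential outcome
Y1 <- Y0 + 2                              # constant treatment effect of 2
D  <- as.numeric(ability + rnorm(n) > 0)  # take-up increases with ability
Y  <- Y0 + D * (Y1 - Y0)                  # observed outcome

observed_diff <- mean(Y[D == 1]) - mean(Y[D == 0])
att           <- mean(Y1[D == 1] - Y0[D == 1])        # = 2 by construction
selection     <- mean(Y0[D == 1]) - mean(Y0[D == 0])  # > 0: positive selection bias
c(observed_diff, att + selection)                     # the two match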
The conditional independence assumption (CIA) asserts that conditional on observed characteristics, X_{i}, the selection bias disappears. Formally,
Y_{1i}, Y_{0i} \perp\!\!\!\perp D_{i}|X_{i}
Given the CIA, conditional-on-X_{i} comparisons have a causal interpretation:
E[Y_{i}|X_{i}, D_{i}=1]-E[Y_{i}|X_{i}, D_{i}=0]=E[Y_{1i}-Y_{0i}|X_{i}]
Back to the private versus public education:
E[Y_{0i}|Private_{i},GROUP_{i},SAT_{i}, lnPI_{i}]=E[Y_{0i}|GROUP_{i},SAT_{i}, lnPI_{i}], i.e., conditional on the selectivity group, SAT scores, and (log) parental income, potential earnings do not depend on whether the student attends a private school. This is the CIA in this setting, and it is what gives the regression comparison a causal interpretation.
Regression is a way to make other things equal. However, you can only generate fair comparisons for variables included on the right-hand side - failing to include what matters still leaves us with selection bias. The "regression version" of selection bias is what we call omitted variable bias (OVB). Let us go back to our example of earnings versus schooling. We already ran both the long regression
\underbrace{Y_{i}}_{Wage_{i}}=\beta_{0}^{l}+\beta_{1}^{l}\underbrace{X_{1i}}_{Educ_{i}}+\beta_{2}\underbrace{X_{2i}}_{Exper_{i}}+\varepsilon_{i}^{l} and the short regression
\underbrace{Y_{i}}_{Wage_{i}}=\beta_{0}^{s}+\beta_{1}^{s}\underbrace{X_{1i}}_{Educ_{i}}+\varepsilon_{i}^{s}
The OVB formula describes the relationship between short and long coefficients as follows:
\beta_{1}^{s}=\beta_{1}^{l}+\pi_{21} \beta_{2}
where \beta_{2} is the coefficient of X_{2i} in the long regression, and \pi_{21} is the coefficient of X_{1i} in a regression of X_{2i} on X_{1i}.
Short Regression
Hourly Wages | |
educ | 0.083*** (0.008) |
Constant | 0.584*** (0.097) |
Observations | 526 |
Adjusted R2 | 0.184 |
Note: | *p<0.1; **p<0.05; ***p<0.01 |
Long Regression
Hourly Wages | |
educ | 0.098*** (0.008) |
exper | 0.010*** (0.002) |
Constant | 0.217** (0.109) |
Observations | 526 |
Adjusted R2 | 0.246 |
Note: | *p<0.1; **p<0.05; ***p<0.01 |
Short Regression
library(wooldridge)
# load data
data("wage1", package = "wooldridge")
short_reg <- lm(lwage ~ educ, data = wage1)
beta_short <- short_reg$coefficients[2]
beta_short
educ
0.08274437
Long Regression
long_reg <- lm(lwage ~ educ + exper, data = wage1)
beta_long <- long_reg$coefficients[2]
beta2 <- long_reg$coefficients[3]
exper_on_educ <- lm(exper ~ educ, data = wage1)
pi21 <- exper_on_educ$coefficients[2]   # coefficient of educ in a regression of exper on educ
beta_long + pi21 * beta2
educ
0.08274437
## which is equal to beta_short
In other words, the omitted variable bias (OVB) formula connects regression coefficients in models with different controls. Consider the following long regression of wages on schooling (s_{i}), controlling for ability (A_{i}):
Y_{i} = \alpha+\rho s_{i}+ A^{'}_{i} \gamma + \varepsilon_{i} Since ability is hard to measure, what are the consequences of omitting that variable?
\dfrac{Cov(Y_{i}, s_{i})}{V(s_{i})}=\rho+\gamma^{'}\delta_{As} where \delta_{As} is the vector of coefficients from regressions of the elements of A_{i} on s_{i}.
In English, short equals long plus the effect of omitted times the regression of omitted on included. When omitted and included are uncorrelated, short = long.
We are never sure whether a given set of controls is enough to eliminate selection bias. However, we may ask one important question: how sensitive are regression results to changes in the control variables?
Usually, our confidence in regression estimates of causal effects grows when treatment effects are insensitive to whether a particular variable is included in the model or not, as long as a few core controls are always present
Back to Dale and Krueger (2002): once they take into account the selectivity-group dummies (columns 4 to 6), the estimated effect of private education barely moves as additional covariates are included
We saw that our regression estimates are subject to sampling variation, and we need to account for that uncertainty by estimating the standard error SE(\hat{\beta}_{k}). With that estimate, one can calculate test statistics to evaluate statistical significance, confidence intervals, etc.
Standard errors computed using SE(\hat{\beta}_{k})=\frac{\sigma_{e}}{\sigma_{\tilde{X}_{k}}\sqrt{n-p}} are nowadays considered old-fashioned because that formula is derived assuming the variance of residuals is unrelated to regressors - the homoskedasticity assumption
Most of the time, that is a heroic assumption. For instance, you can see that among people with higher levels of education (10+ years), salaries vary much more than among individuals with fewer years of education
Note: The data come from the 1976 Current Population Survey in the USA.
Given the problem with the homoskedasticity assumption, one can compute standard errors that remain valid when the residual variance changes with the regressors. Robust standard errors RSE(\hat{\beta}) allow for the possibility that the regression line fits more or less well for different values of X_{i} - a scenario known as heteroskedasticity
If the residual turns out to be homoskedastic, the estimates of the robust standard error should be close to SE(\hat{\beta}). However, if residuals are indeed heteroskedastic, estimates of RSE(\hat{\beta}) provide a much better picture of the sampling variance
library(wooldridge)
data("wage1", package = "wooldridge") # load data
library(fixest)
reg1 <- feols(wage ~ educ, data = wage1, se = "standard")
reg2 <- feols(wage ~ educ, data = wage1, se = "hetero")
etable(reg1, reg2)
##                               reg1               reg2
## Dependent Var.:               wage               wage
##
## (Intercept)       -0.9049 (0.6850)   -0.9049 (0.7255)
## educ            0.5414*** (0.0532) 0.5414*** (0.0613)
## _______________ __________________ __________________
## S.E. type                 Standard Heteroskedas.-rob.
## Observations                   526                526
## R2                         0.16476            0.16476
## Adj. R2                    0.16316            0.16316