In the middle of 2014, the City of Fortaleza began an ongoing urban renewal project called “Areninhas.” The intervention consists of a synthetic-turf football field, sometimes accompanied by a playground and an outdoor gym, along with a substantial increase in street lighting. Say the Mayor wants to know whether this public policy reduces violence, and City Hall hires you to evaluate the Areninhas project. Let’s start by answering a couple of questions:
What are the outcome and treatment variables?
What are the potential outcomes in this case?
What plausible causal channel runs directly from the treatment to the outcome?
Can you think about possible sources of selection bias in the naive comparison of outcomes by treatment status? Which way would you expect the bias to go and why?
Now, say City Hall decided to study crime prevention through environmental design. Assume that city blocks were randomized: 25 of them got football fields, while the rest serve as the control group.
How does randomization solve the selection bias problem you just mentioned?
What can you say about the external validity of this study? Can you think about scenarios where the internal validity of this study is violated?
The observed outcome \(Y_{i}\) is the violent crime rate in city block \(i\). The treatment \(D_{i}\) is the neighborhood intervention, i.e., the presence of an “areninha” in block \(i\): \(D_{i}\) is an indicator variable that takes on 1 if the block is treated and 0 otherwise.
Using a binary treatment variable to indicate the presence of a football field in city block \(i\):
\[\begin{equation*} \text{Potential Outcome}= \begin{cases}Y_{1i} & \text{if } D_{i}=1 \\ Y_{0i} & \text{if }D_{i}=0 \end{cases} \end{equation*}\]
where \(Y_{0i}\) indicates the crime rate had block \(i\) never experienced urban renewal, and \(Y_{1i}\) is the block’s crime rate if treated.
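The observed outcome is tied to the potential outcomes through the switching equation:

\[\begin{equation*} Y_{i}=D_{i}Y_{1i}+(1-D_{i})Y_{0i} \end{equation*}\]

so for each block we only ever observe the potential outcome corresponding to its realized treatment status.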
The urban renewal might reduce crime rates through (at least) three channels: (i) the substantial increase in street lighting can deter criminal activity; (ii) the football field, playground, and outdoor gym give residents, especially young people, alternatives to crime; and (iii) the renewed public space draws more people outside, increasing natural surveillance of the area.
City Hall explicitly targeted communities with a low or very low Human Development Index (HDI). Those vulnerable areas experience, on average, more violence than the rest of the city. Hence, a naive comparison would show higher crime rates in treated areas, understating the true crime-reducing effect of the urban renewal policy.
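To see the direction of the bias formally, decompose the naive difference in means (a standard identity in this notation):

\[\begin{equation*} \underbrace{E[Y_{i}\mid D_{i}=1]-E[Y_{i}\mid D_{i}=0]}_{\text{naive comparison}}=\underbrace{E[Y_{1i}-Y_{0i}\mid D_{i}=1]}_{\text{effect on the treated}}+\underbrace{E[Y_{0i}\mid D_{i}=1]-E[Y_{0i}\mid D_{i}=0]}_{\text{selection bias}} \end{equation*}\]

Because targeted blocks would have had higher crime even without the intervention, \(E[Y_{0i}\mid D_{i}=1]>E[Y_{0i}\mid D_{i}=0]\): the selection bias term is positive and masks the crime reduction.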
With a large enough sample (so the Law of Large Numbers kicks in), random assignment of treatment solves the selection bias problem by making \(D_{i}\) independent of the potential outcomes. Observed and unobserved characteristics are then balanced across treatment and control groups, and you are making an apples-to-apples comparison, as the simulation below illustrates.
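A minimal simulation sketch (purely hypothetical numbers, not the Fortaleza data) makes the contrast concrete: targeting high-crime blocks flips the sign of the naive comparison, while random assignment recovers the true effect of -5.

set.seed(474)
n <- 10000
baseline <- rnorm(n, mean = 50, sd = 10)  # hypothetical baseline crime rate per block
y0 <- baseline                            # potential outcome without an areninha
y1 <- baseline - 5                        # potential outcome with an areninha (true effect = -5)

## Targeted assignment: the most violent quartile of blocks gets treated
d_tar <- as.numeric(baseline > quantile(baseline, 0.75))
y_tar <- ifelse(d_tar == 1, y1, y0)
mean(y_tar[d_tar == 1]) - mean(y_tar[d_tar == 0])  # around +12: selection bias flips the sign

## Random assignment: treatment independent of potential outcomes
d_rnd <- rbinom(n, 1, 0.25)
y_rnd <- ifelse(d_rnd == 1, y1, y0)
mean(y_rnd[d_rnd == 1]) - mean(y_rnd[d_rnd == 0])  # close to -5, the true effect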
Internal Validity
The primary threat to the internal validity of this experiment is a SUTVA violation. If the intervention in certain city blocks generates crime displacement (i.e., crime moving from the blocks with areninhas to the surrounding areas) or diffusion of benefits (i.e., crime reduction in the neighboring regions), SUTVA is violated, since the control blocks no longer represent the no-treatment counterfactual.
External Validity
Although Fortaleza is a large city with around 2.7 million inhabitants (comparable in population to Houston or Chicago), its levels of violence are extremely high compared to other big cities, which might limit how well the results travel to safer settings. Also, Brazilians are obsessed with football, so you might need to adapt this intervention for other countries.
For this question, we will use a dataset from a randomized experiment conducted by Marianne Bertrand and Sendhil Mullainathan. The researchers sent 4,870 fictitious resumes to employers in response to job ads in Boston and Chicago in 2001. They varied only the names of the job applicants while leaving the other relevant candidate attributes unchanged (i.e., candidates had similar qualifications). Some applicants had distinctly white-sounding names such as Greg Baker and Emily Walsh, whereas other resumes carried stereotypically black-sounding names such as Lakisha Washington or Jamal Jones. Hence, any difference in callback rates can be attributed solely to the name manipulation.
Hint: What is the unit of observation? What is the treatment \(D_{i}\) and the observed outcome \(Y_{i}\)? What are the potential outcomes?
Create a dummy variable named `female` that takes one if `sex=="f"`, and zero otherwise.
The dataset contains information about candidates’ education (`education`), years of experience (`yearsexp`), military experience (`military`), computer and special skills (`computerskills` and `specialskills`), a dummy for gender (`female`), among others. Summarize that information by getting average values by `race` groups.
Do `education`, `yearsexp`, `military`, `computerskills`, `specialskills`, and `female` look balanced between `race` groups? Use `t.test()` to formally compare resume characteristics and interpret its output. Why do we care whether those variables are balanced?
The outcome of interest in the dataset is `call`, a dummy that takes one if the candidate was called back. Use `t.test()` to compare callback rates between white-sounding and black-sounding names. Is there a racial gap in callbacks?
Now, run a regression of `call` on `race`, `education`, `yearsexp`, `military`, `computerskills`, `specialskills`, and `female`. Does the estimate related to `race` change much? What explains that behavior?
For each resume of a fictitious job applicant \(i\), there is either a black-sounding name \(D_{i}=1\) or a white-sounding name \(D_{i}=0\). The resume also contains other characteristics such as years of experience, education, computer skills, etc.
The observed outcome \(Y_{i}\) is attached to a certain treatment status. For example, Latoya is a black-sounding name (hence \(D_{i}=1\)), and her callback outcome is \(Y_{i}(1)=1\) since she got a callback. The counterfactual scenario is \(Y_{i}(0)\).
We can imagine the following: would Jamal have received a callback from a potential employer had he carried a white-sounding name such as Matthew? Unfortunately, we cannot travel back in time and change Jamal’s treatment status, and that is the fundamental problem of causal inference: we can only observe one of the two potential outcomes. That is why all the counterfactual scenarios are marked with “?” in the table below.
What we can do is observe multiple subjects and learn about average treatment effects (see the identity after the table below).
Resume | Name | Black-sounding name | Y(1) | Y(0) | Education | Computer Skills
---|---|---|---|---|---|---
1 | Latoya | 1 | 1 | ? | High School | No
2 | Matthew | 0 | ? | 0 | College | Yes
3 | Sarah | 0 | ? | 1 | High School | Yes
… | … | … | … | … | … | …
n | Jamal | 1 | 0 | ? | College | Yes
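Averaging over many such resumes identifies the average treatment effect, and because names were randomly assigned, it is given by a simple difference in mean callback rates:

\[\begin{equation*} E[Y_{i}(1)-Y_{i}(0)]=E[Y_{i}\mid D_{i}=1]-E[Y_{i}\mid D_{i}=0] \end{equation*}\]

which is exactly what the t-tests and regressions below estimate.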
setwd("D:/OneDrive - University of Illinois - Urbana/Causal Inference/ECON 474 Spring 2022/HW/HW1")
## Reading the data
<-readRDS("resume.RDS")
resume## female dummy
$female<-ifelse(resume$sex=="f", 1,0) resume
## Using some tidyverse functions
library(tidyverse)
resume %>% group_by(race) %>% summarize(educ = mean(education),
                                        exper = mean(yearsexp),
                                        military = mean(military),
                                        comp = mean(computerskills),
                                        spec = mean(specialskills),
                                        fem = mean(female))
## # A tibble: 2 x 7
## race educ exper military comp spec fem
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 b 3.62 7.83 0.102 0.832 0.327 0.775
## 2 w 3.62 7.86 0.0924 0.809 0.330 0.764
## Balance checks: Welch t-tests of each covariate by race
lapply(resume[, c('education', 'yearsexp', 'military',
                  'computerskills', 'specialskills',
                  'female')],
       function(x) t.test(x ~ resume$race))
## $education
##
## Welch Two Sample t-test
##
## data: x by resume$race
## t = -0.24048, df = 4855.4, p-value = 0.81
## alternative hypothesis: true difference in means between group b and group w is not equal to 0
## 95 percent confidence interval:
## -0.04510429 0.03524803
## sample estimates:
## mean in group b mean in group w
## 3.616016 3.620945
##
##
## $yearsexp
##
## Welch Two Sample t-test
##
## data: x by resume$race
## t = -0.18462, df = 4867.1, p-value = 0.8535
## alternative hypothesis: true difference in means between group b and group w is not equal to 0
## 95 percent confidence interval:
## -0.3101545 0.2567664
## sample estimates:
## mean in group b mean in group w
## 7.829569 7.856263
##
##
## $military
##
## Welch Two Sample t-test
##
## data: x by resume$race
## t = 1.1129, df = 4858.8, p-value = 0.2658
## alternative hypothesis: true difference in means between group b and group w is not equal to 0
## 95 percent confidence interval:
## -0.007193736 0.026084906
## sample estimates:
## mean in group b mean in group w
## 0.10184805 0.09240246
##
##
## $computerskills
##
## Welch Two Sample t-test
##
## data: x by resume$race
## t = 2.1664, df = 4854.9, p-value = 0.03033
## alternative hypothesis: true difference in means between group b and group w is not equal to 0
## 95 percent confidence interval:
## 0.002264635 0.045373969
## sample estimates:
## mean in group b mean in group w
## 0.8324435 0.8086242
##
##
## $specialskills
##
## Welch Two Sample t-test
##
## data: x by resume$race
## t = -0.21349, df = 4868, p-value = 0.831
## alternative hypothesis: true difference in means between group b and group w is not equal to 0
## 95 percent confidence interval:
## -0.02927347 0.02352398
## sample estimates:
## mean in group b mean in group w
## 0.3273101 0.3301848
##
##
## $female
##
## Welch Two Sample t-test
##
## data: x by resume$race
## t = 0.88413, df = 4866.7, p-value = 0.3767
## alternative hypothesis: true difference in means between group b and group w is not equal to 0
## 95 percent confidence interval:
## -0.01299869 0.03435393
## sample estimates:
## mean in group b mean in group w
## 0.7745380 0.7638604
Resumes with black-sounding names have, on average, slightly more computer skills than those with white-sounding names, but all the other covariates are balanced across the two groups, which brings us closer to an apples-to-apples comparison. (With six tests at the 5% level, finding one significant difference by chance alone is not surprising.)
t.test(call~race, data=resume)
##
## Welch Two Sample t-test
##
## data: call by race
## t = -4.1147, df = 4711.6, p-value = 3.943e-05
## alternative hypothesis: true difference in means between group b and group w is not equal to 0
## 95 percent confidence interval:
## -0.04729503 -0.01677067
## sample estimates:
## mean in group b mean in group w
## 0.06447639 0.09650924
There is a statistically significant and economically meaningful difference in callback rates: resumes with white-sounding names receive about 50% more callbacks for interviews (9.65% vs. 6.45%).
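A quick sanity check of that 50% figure, using the group means from the `t.test()` output above:

## ratio of callback rates, white-sounding over black-sounding names
0.09650924 / 0.06447639  # roughly 1.50, i.e., about 50% more callbacks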
library(fixest)
<-feols(call~race, se="hetero", data=resume)
reg1<-feols(call~race+education+yearsexp+
reg2+computerskills+
military+female,
specialskillsse="hetero",
data=resume)
etable(reg1,reg2)
## reg1 reg2
## Dependent Var.: call call
##
## (Intercept) 0.0645*** (0.0050) 0.0086 (0.0255)
## racew 0.0320*** (0.0078) 0.0314*** (0.0077)
## education 0.0050 (0.0056)
## yearsexp 0.0034*** (0.0009)
## military 0.0069 (0.0125)
## computerskills -0.0195. (0.0113)
## specialskills 0.0666*** (0.0093)
## female 0.0059 (0.0093)
## _______________ __________________ __________________
## S.E. type Heteroskedas.-rob. Heteroskedas.-rob.
## Observations 4,870 4,870
## R2 0.00347 0.02074
## Adj. R2 0.00326 0.01933
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The estimates barely changed: the simple mean comparison gives a difference of 0.0320 in favor of white-sounding names, and the regression estimate is 0.0314. Since these controls are uncorrelated with the treatment (by the nature of random assignment), we expected that. The standard errors also decreased slightly, which was expected: although these covariates are uncorrelated with the treatment, they help explain callback rates, reducing the residual variance and thereby the standard errors of the regression estimates.
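The omitted variable bias formula makes this precise. For a single added control (the logic extends to the full set of controls), the short-regression coefficient (reg1) relates to the long-regression coefficient (reg2) as:

\[\begin{equation*} \hat{\beta}_{\text{short}}=\hat{\beta}_{\text{long}}+\hat{\gamma}\hat{\delta} \end{equation*}\]

where \(\hat{\gamma}\) is the control’s coefficient in the long regression and \(\hat{\delta}\) is the coefficient from regressing that control on `race`. Randomization makes \(\hat{\delta}\approx 0\), so the two race estimates nearly coincide.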
Let’s say you work as a Data Scientist at MeetMeAtTheQuad, a dating app startup founded by UIUC alumni. You want to measure the causal effect of the like button’s size on a couple of important metrics. You decide to run an A/B test to check whether the app developers should enlarge the like button.
Define some key metrics you would like to evaluate in this experiment.
What is the experimental unit? What will the treatment and control groups see while using the app? Set up the hypothesis test you have in mind (i.e., what are your \(H_{0}\) and \(H_{a}\))?
One essential part of designing an experiment is knowing the sample size needed to test your hypothesis. Say you are running a `t.test()` on the number of likes per user to check differences between the control and treatment groups. Using the `pwr` package, find the sample size required for this experiment.
Note: assume a power equal to 0.8, a significance level of .05, two groups, and a minimum effect size of 0.5.
What happens with your answer in c) when you try to detect a minimum effect of 0.1?
Suppose you saw a statistically significant increase in the number of likes in the treatment group, but you did not see any effect on the number of matches. What might be the explanation for this pattern?
Hint: think about the two sides involved. To have a match, two people need to like each other.
Likes and matches are important metrics to be evaluated in the experiment.
The user is the experimental unit to be randomized. Users in the treatment group would see the big like button, while users in the control group would see the regular-size like button. One can set up the hypothesis test as

\[\begin{array}{rl} H_{0}: & \mu_{1}=\mu_{2} \\ H_{a}: & \mu_{1}\neq \mu_{2} \end{array}\]

where \(\mu_{1}\) is the average outcome (likes or matches) in the treatment group and \(\mu_{2}\) is the average outcome in the control group.
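For reference, the statistic behind this comparison (Welch’s two-sample t-test, the default in `t.test()`) is:

\[\begin{equation*} t=\frac{\bar{Y}_{1}-\bar{Y}_{2}}{\sqrt{\frac{s_{1}^{2}}{n_{1}}+\frac{s_{2}^{2}}{n_{2}}}} \end{equation*}\]

where \(\bar{Y}_{g}\), \(s_{g}^{2}\), and \(n_{g}\) are the sample mean, sample variance, and sample size of group \(g\).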
library(pwr)
pwr.t.test(d = 0.5,
           sig.level = 0.05,
           power = 0.8)
##
## Two-sample t test power calculation
##
## n = 63.76561
## d = 0.5
## sig.level = 0.05
## power = 0.8
## alternative = two.sided
##
## NOTE: n is number in *each* group
Each group needs 64 users (rounding 63.77 up to the next whole person). Hence, the total sample size needed is 128.
pwr.t.test(d = 0.1,
           sig.level = 0.05,
           power = 0.8)
##
## Two-sample t test power calculation
##
## n = 1570.733
## d = 0.1
## sig.level = 0.05
## power = 0.8
## alternative = two.sided
##
## NOTE: n is number in *each* group
The required sample size to detect such a small effect is much larger: about 1,571 users per group, or 3,142 in total.
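This is what the (approximate) two-sample power formula predicts: the required sample size per group scales with the inverse square of the effect size,

\[\begin{equation*} n\approx\frac{2\left(z_{1-\alpha/2}+z_{1-\beta}\right)^{2}}{d^{2}} \end{equation*}\]

so shrinking \(d\) from 0.5 to 0.1 (a factor of 5) multiplies the required \(n\) by roughly \(5^{2}=25\) (64 × 25 = 1,600, close to the exact 1,571 reported above).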
If treated users are liking people in the control group (people who do not see the bigger like button), that might explain the pattern: more likes but the same number of matches, since a match requires both sides to like each other and control users’ liking behavior did not change.
Time for some simulations! Power analysis is a necessary step in the design phase of an experiment. I will establish the first power-analysis relationship; then, you do two more.
Let’s simulate the required sample size for different significance levels (ranging from 0.001 to 0.1):
library(pwr)
### alpha is a vector of numbers from 0.001 to 0.1
alpha <- seq(from=0.001, to=0.1, by=0.003)
## there are 34 numbers from 0.001 to 0.1 using 0.003 as increment
alpha
## [1] 0.001 0.004 0.007 0.010 0.013 0.016 0.019 0.022 0.025 0.028 0.031 0.034 0.037 0.040
## [15] 0.043 0.046 0.049 0.052 0.055 0.058 0.061 0.064 0.067 0.070 0.073 0.076 0.079 0.082
## [29] 0.085 0.088 0.091 0.094 0.097 0.100
<-matrix(NA, ncol=1, nrow=34)
samplefor(i in 1:length(alpha)){sample[i,1]<-pwr.t.test(d = 0.5,
sig.level = alpha[i],
power = 0.8)$n
}
<-data.frame(alpha, sample)
dataplot(y=data$sample, x=data$alpha, type="l", ylab='Sample Size', xlab='Significance Level')
Now, it is your turn! Show the following:
- The relationship between statistical power and the required sample size. Hint: find the sample size n required for a range of power values (e.g., power from 0.60 to 0.95, by=0.01).
- The relationship between the minimum detectable effect size and the required sample size. Hint: find the sample size n required for a range of minimum effect values (e.g., d from 0.001 to 1.5, by=0.05).
<-seq(from=0.60, to=0.95, by=0.01)
power
<-matrix(NA, ncol=1, nrow=length(power))
samplefor(i in 1:length(power)){sample[i,1]<-pwr.t.test(d = 0.5,
sig.level = 0.05,
power = power[i])$n
}
<-data.frame(power, sample)
dataplot(y=data$sample, x=data$power, type="l", ylab='Sample Size', xlab='Power')
<-seq(from=0.001, to=1.5, by=0.05)
effect
<-matrix(NA, ncol=1, nrow=length(effect))
samplefor(i in 1:length(effect)){sample[i,1]<-pwr.t.test(d = effect[i],
sig.level = 0.05,
power = 0.8)$n
}
options(scipen = 999)
<-data.frame(effect, sample)
dataplot(y=data$sample, x=data$effect, type="l", ylab='Sample Size', xlab='Effect Size')