In the middle of 2014, the City of Fortaleza began an ongoing urban renewal project called “Areninhas.” Each intervention consists of a synthetic-turf football field, sometimes accompanied by a playground and an outdoor gym, along with a substantial upgrade in street lighting. Suppose the Mayor wants to know whether this public policy reduces violence, and City Hall hires you to evaluate the Areninhas project. Let’s start by answering a few questions:
What are the outcome and treatment variables?
What are the potential outcomes in this case?
What plausible causal channel runs directly from the treatment to the outcome?
Can you think about possible sources of selection bias in the naive comparison of outcomes by treatment status? Which way would you expect the bias to go and why?
Now, say City Hall decided to study crime prevention through environmental design. Assume that city blocks were randomized: 25 received the football fields, while the rest serve as the control group.
How does randomization solve the selection bias problem you just mentioned?
What can you say about the external validity of this study? Can you think about scenarios where the internal validity of this study is violated?
For this question, we will use a dataset from a randomized experiment conducted by Marianne Bertrand and Sendhil Mullainathan. The researchers sent 4,870 fictitious resumes to employers in response to job ads in Boston and Chicago in 2001. They varied only the names of the job applicants while holding other relevant candidate attributes fixed (i.e., candidates had similar qualifications). Some applicants had distinctly white-sounding names such as Greg Baker and Emily Walsh, whereas other resumes carried stereotypically black-sounding names such as Lakisha Washington or Jamal Jones. Hence, any difference in callback rates can be attributed solely to the name manipulation.
Hint: What is the unit of observation? What is the treatment \(D_{i}\) and the observed outcome \(Y_{i}\)? What are the potential outcomes?
Create a dummy variable named `female` that takes one if `sex=="f"`, and zero otherwise.
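A minimal sketch of this step, assuming the resume data have been loaded into a data frame called `resumes` (the name is hypothetical) with a character column `sex`:

```r
# Toy stand-in for the real dataset; in practice, load the actual data
resumes <- data.frame(sex = c("f", "m", "f", "m"))

# female = 1 if sex == "f", 0 otherwise
resumes$female <- ifelse(resumes$sex == "f", 1, 0)
resumes$female
## [1] 1 0 1 0
```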
The dataset contains information about candidates’ education (`education`), years of experience (`yearsexp`), military experience (`military`), computer and special skills (`computerskills` and `specialskills`), and a dummy for gender (`female`), among others. Summarize that information by computing average values by `race` group.
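One way to produce this summary, assuming the data sit in a data frame called `resumes` (a hypothetical name) containing the columns listed above plus `race`:

```r
# Mean of each resume characteristic within each race group (base R)
aggregate(cbind(education, yearsexp, military, computerskills,
                specialskills, female) ~ race,
          data = resumes, FUN = mean)
```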
Do `education`, `yearsexp`, `military`, `computerskills`, `specialskills`, and `female` look balanced between race groups? Use `t.test()` to formally compare resume characteristics and interpret its output. Why do we care about whether those variables are balanced?
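A sketch of one such balance check, again assuming a data frame named `resumes`; `t.test()` with a formula splits the variable by the two race groups:

```r
# Test whether average years of experience differ by race group;
# under successful randomization we expect no significant difference
t.test(yearsexp ~ race, data = resumes)
```

The same pattern applies to each of the other covariates.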
The outcome of interest in the dataset is `call`, a dummy that takes one if the candidate was called back. Use `t.test()` to compare callback rates between white-sounding and black-sounding names. Is there a racial gap in callbacks?
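Under the same assumed data frame, the callback comparison is a two-sample t test on the `call` dummy:

```r
# Difference in mean callback rates between the two name groups
t.test(call ~ race, data = resumes)
```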
Now, run a regression of `call` on `race`, `education`, `yearsexp`, `military`, `computerskills`, `specialskills`, and `female`. Does the estimate on `race` change much? What explains that behavior?
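A sketch of the regression, assuming the same hypothetical data frame `resumes`:

```r
# Linear probability model: callback on race plus resume controls
fit <- lm(call ~ race + education + yearsexp + military +
            computerskills + specialskills + female,
          data = resumes)
summary(fit)
```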
Let’s say you work as a Data Scientist at MeetMeAtTheQuad, a dating app startup founded by UIUC alumni. You want to measure the causal effect of the size of the app’s like button on a couple of important metrics. You decide to run an A/B test to check whether the app developers should enlarge the like button.
Define some key metrics you would like to evaluate in this experiment.
What is the experimental unit? What will the treatment and control group see while using the app? Set up the hypothesis testing you have in mind (i.e., what is your \(H_{0}\) and \(H_{a}\))?
One essential part of designing an experiment is knowing the sample size needed to test your hypothesis. Say you are running a `t.test()` on the number of likes per user to check for differences between the control and treatment groups. Using the `pwr` package, find the sample size required for this experiment.
Note: assume a power equal to 0.8, a significance level of .05, two groups, and a minimum effect size of 0.5.
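With those parameters, the calculation is a direct call to `pwr.t.test()`, leaving `n` unspecified so the function solves for it:

```r
library(pwr)

# Two-sample t test: minimum effect size d = 0.5,
# significance level 0.05, power 0.8
pwr.t.test(d = 0.5, sig.level = 0.05, power = 0.8)
# The reported n is the required sample size *per group* (about 64 here)
```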
What happens with your answer in c) when you try to detect a minimum effect of 0.1?
Suppose you saw a statistically significant increase in the number of likes in the treatment group, but you did not see any effect on the number of matches. What might be the explanation for this pattern?
Hint: think about the two sides involved. To have a match, two people must like each other.
Time for some simulations! Power analysis is a necessary step during the design phase of an experiment. I will establish the first power-analysis relationship; then, you do two more.
Let’s simulate the required sample size for different significance levels (ranging from 0.001 to 0.1):

```r
library(pwr)

# alpha is a vector of significance levels from 0.001 to 0.1
alpha <- seq(from = 0.001, to = 0.1, by = 0.003)
alpha  # there are 34 numbers from 0.001 to 0.1 using 0.003 as the increment
##  [1] 0.001 0.004 0.007 0.010 0.013 0.016 0.019 0.022 0.025 0.028 0.031 0.034 0.037 0.040
## [15] 0.043 0.046 0.049 0.052 0.055 0.058 0.061 0.064 0.067 0.070 0.073 0.076 0.079 0.082
## [29] 0.085 0.088 0.091 0.094 0.097 0.100

# For each alpha, solve for the required sample size per group
sample <- matrix(NA, ncol = 1, nrow = length(alpha))
for (i in 1:length(alpha)) {
  sample[i, 1] <- pwr.t.test(d = 0.5, sig.level = alpha[i], power = 0.8)$n
}

data <- data.frame(alpha, sample)
plot(y = data$sample, x = data$alpha, type = "l",
     ylab = "Sample Size", xlab = "Significance Level")
```
Now, it is your turn! Show the following:
Hint: find the sample size n required for a range of power values (e.g., power from 0.60 to 0.95 by=0.01).
Hint: find the sample size n required for a range of minimum effect values (e.g., d from 0.001 to 1.5 by=0.05).