1 A Causal Question [30 points]

In mid-2014, the City of Fortaleza began an ongoing urban renewal project called “Areninhas.” The intervention consists of a synthetic-turf football field, sometimes accompanied by a playground and an outdoor gym, along with a substantial increase in street lighting. Say the Mayor wants to know whether this public policy reduces violence, and City Hall hires you to evaluate the Areninhas project. Let’s start by answering a couple of questions:

  1. What are the outcome and treatment variables?

  2. What are the potential outcomes in this case?

  3. What plausible causal channel runs directly from the treatment to the outcome?

  4. Can you think about possible sources of selection bias in the naive comparison of outcomes by treatment status? Which way would you expect the bias to go and why?

Figure 1.1: City blocks of Fortaleza-CE, Brazil. Note: The map shows the first 25 areninhas in Fortaleza-CE, Brazil. The orange circle has a 500-meter radius.

Now, say City Hall decided to study crime prevention through environmental design. Assume that city blocks were randomized: 25 received the football fields, while the rest form the control group.

  1. How does randomization solve the selection bias problem you just mentioned?

  2. What can you say about the external validity of this study? Can you think about scenarios where the internal validity of this study is violated?

xkcd 2576: Control Group

2 Randomized Trials [70 (+ 10) points]

2.1 Racial Discrimination in the Labor Market [40 points]

For this question, we will use a dataset from a randomized experiment conducted by Marianne Bertrand and Sendhil Mullainathan. The researchers sent 4,870 fictitious resumes to employers in response to job ads in Boston and Chicago in 2001. They varied only the names of the job applicants, leaving all other relevant candidate attributes unchanged (i.e., candidates had similar qualifications). Some applicants had distinctly white-sounding names such as Greg Baker and Emily Walsh, whereas other resumes carried stereotypically black-sounding names such as Lakisha Washington or Jamal Jones. Hence, any difference in callback rates can be attributed solely to the name manipulation.

  1. Illustrate this problem using the Potential Outcomes Framework

Hint: What is the unit of observation? What is the treatment \(D_{i}\) and the observed outcome \(Y_{i}\)? What are the potential outcomes?
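As a sketch (not the full required answer), the observed outcome can be linked to the potential outcomes through the standard switching equation:

\[ Y_{i} = D_{i}\,Y_{i}(1) + (1 - D_{i})\,Y_{i}(0) \]

where, for example, \(D_{i}\) could indicate that resume \(i\) carries a black-sounding name, \(Y_{i}\) whether it received a callback, and \(Y_{i}(1), Y_{i}(0)\) the callbacks it would receive under each name type; only one of the two is ever observed for a given resume.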

  2. Create a dummy variable named female that takes the value one if sex=="f", and zero otherwise.

  3. The dataset contains information about candidates’ education (education), years of experience (yearsexp), military experience (military), computer and special skills (computerskills and specialskills), and a gender dummy (female), among other variables. Summarize this information by computing average values by race group.

  4. Do education, yearsexp, military, computerskills, specialskills, and female look balanced across race groups? Use t.test() to formally compare resume characteristics and interpret its output. Why do we care whether these variables are balanced?

  5. The outcome of interest in the dataset is call, a dummy that takes the value one if the candidate was called back. Use t.test() to compare callback rates between white-sounding and black-sounding names. Is there a racial gap in callbacks?

  6. Now, run a regression of call on race, education, yearsexp, military, computerskills, specialskills, and female. Does the estimate on race change much? What explains that behavior?
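As a syntax sketch for the comparisons above (the column names follow the text, but the data frame name resumes is an assumption; adapt it to whatever object you load the data into):

```r
## Balance check on one covariate across race groups (repeat for the others)
t.test(education ~ race, data = resumes)

## Racial gap in callback rates
t.test(call ~ race, data = resumes)

## Regression adding resume characteristics as controls
summary(lm(call ~ race + education + yearsexp + military +
             computerskills + specialskills + female,
           data = resumes))
```

The formula interface `outcome ~ group` in t.test() runs a two-sample t-test comparing the outcome's mean across the two levels of the grouping variable.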

2.2 A/B testing in Practice [30 (+ 10) points]

Let’s say you work as a Data Scientist at MeetMeAtTheQuad, a dating app startup founded by UIUC alumni. You want to measure the causal effect of the size of the app’s like button on a couple of important metrics. You decide to run an A/B test to check whether the app developers should enlarge the like button.

  1. Define some key metrics you would like to evaluate in this experiment.

  2. What is the experimental unit? What will the treatment and control groups see while using the app? Set up the hypothesis test you have in mind (i.e., what are your \(H_{0}\) and \(H_{a}\))?

  3. One essential part of designing an experiment is knowing the sample size needed to test your hypothesis. Say you are running a t.test() on the number of likes per user to check for differences between the control and treatment groups. Using the pwr package, find the sample size required for this experiment.

Note: assume a power of 0.8, a significance level of 0.05, two groups, and a minimum effect size of 0.5.
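With those assumptions plugged in, the call looks like this (pwr.t.test solves for whichever of its arguments is left unspecified, here the per-group sample size n):

```r
library(pwr)

## Two-sample, two-sided t-test: solve for n given d, alpha, and power
pwr.t.test(d = 0.5, sig.level = 0.05, power = 0.8,
           type = "two.sample", alternative = "two.sided")
```

This should report roughly 64 subjects per group; note that n is the size of each group, not the total sample.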

  4. What happens to your answer in question 3 when you try to detect a minimum effect of 0.1?

  5. Suppose you saw a statistically significant increase in the number of likes in the treatment group, but you did not see any effect on the number of matches. What might explain this pattern?

Hint: think about the two sides involved. To have a match, two persons need to like each other.

2.2.1 Power Analysis Relationships [+10 points]

Time for some simulations! Power analysis is a necessary procedure during the design phase of an experiment. I will establish the first power-analysis relationship; then you do two more.

  1. Holding power and effect size constant, a higher significance level requires a smaller sample.

Let’s simulate the required sample size for different significance levels (ranging from 0.001 to 0.1):

library(pwr)

## alpha is a vector of significance levels from 0.001 to 0.1
alpha <- seq(from = 0.001, to = 0.1, by = 0.003)
alpha ## there are 34 numbers from 0.001 to 0.1 using 0.003 as increment
##  [1] 0.001 0.004 0.007 0.010 0.013 0.016 0.019 0.022 0.025 0.028 0.031 0.034 0.037 0.040
## [15] 0.043 0.046 0.049 0.052 0.055 0.058 0.061 0.064 0.067 0.070 0.073 0.076 0.079 0.082
## [29] 0.085 0.088 0.091 0.094 0.097 0.100

## required per-group sample size for each significance level
sample <- numeric(length(alpha))
for (i in seq_along(alpha)) {
  sample[i] <- pwr.t.test(d = 0.5,
                          sig.level = alpha[i],
                          power = 0.8)$n
}

data <- data.frame(alpha, sample)
plot(y = data$sample, x = data$alpha, type = "l",
     ylab = "Sample Size", xlab = "Significance Level")

Now, it is your turn! Show the following:

  2. Holding significance level and effect size constant, more power requires more data.

Hint: find the sample size n required for a range of power values (e.g., power from 0.60 to 0.95, by = 0.01).

  3. Holding power and significance level constant, the larger the effect size between groups, the smaller the sample you need to find a statistically significant result (i.e., \(p\text{-}value < 0.05\)).

Hint: find the sample size n required for a range of minimum effect values (e.g., d from 0.001 to 1.5, by = 0.05).