A Causal Question [30 points]

In mid-2014, the City of Fortaleza began an ongoing urban renewal project called “Areninhas.” The intervention consists of a synthetic-turf football field, sometimes accompanied by a playground and an outdoor gym, along with a substantial increase in street lighting. Say the Mayor wants to know whether this public policy reduces violence, and City Hall hires you to evaluate the Areninhas project. Let’s start by answering a couple of questions:

  1. What are the outcome and treatment variables?

  2. What are the potential outcomes in this case?

  3. What plausible causal channel runs directly from the treatment to the outcome?

  4. Can you think about possible sources of selection bias in the naive comparison of outcomes by treatment status? Which way would you expect the bias to go and why?

City blocks of Fortaleza-CE, Brazil. Note: The map shows the first 25 areninhas in Fortaleza-CE, Brazil. Each orange circle has a 500-meter radius.

Now, say the City Hall decided to study crime prevention through environmental design. Assume that city blocks were randomized, and 25 got the football fields, while the rest are in the control group.

  5. How does randomization solve the selection bias problem you just mentioned?

  6. What can you say about the external validity of this study? Can you think of scenarios where the internal validity of this study is violated?

xkcd 2576: Control Group

a)

The observed outcome \(Y_{i}\) is the violent crime rate in city block \(i\). The treatment \(D_{i}\) is the neighborhood intervention, i.e., the presence of an “areninha” in block \(i\): \(D_{i}\) is an indicator variable that equals 1 if the block is treated and 0 otherwise.

b)

Using a binary treatment variable indicating the presence of a football field in city block \(i\):

\[\begin{equation*} \text{Potential Outcome}= \begin{cases}Y_{1i} & \text{if } D_{i}=1 \\ Y_{0i} & \text{if }D_{i}=0 \end{cases} \end{equation*}\]

where \(Y_{0i}\) indicates the crime rate had block \(i\) never experienced urban renewal, and \(Y_{1i}\) is the block’s crime rate if treated.
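
Only one of these two outcomes is ever observed for a given block; the observed outcome can be written with the standard switching equation:

\[Y_{i} = D_{i}Y_{1i} + (1-D_{i})Y_{0i}\]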

c)

Urban renewal might reduce crime rates through (at least) three channels:

  1. An increase in street lighting (check this article)
  2. More “eyes on the street” (check this article)
  3. Stronger bonds between neighbors (check this video from Robert Sampson)

d)

City Hall explicitly targeted communities with low or very low Human Development Index (HDI). Those vulnerable areas experience, on average, more violence than the rest of the city. Hence, a naive comparison would show higher crime rates in treated areas, masking and understating the true crime-reducing effect of this urban renewal policy.
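
To make the direction of the bias explicit, decompose the naive comparison using the potential-outcomes notation from b):

\[\underbrace{E[Y_{i}\mid D_{i}=1]-E[Y_{i}\mid D_{i}=0]}_{\text{naive comparison}}=\underbrace{E[Y_{1i}-Y_{0i}\mid D_{i}=1]}_{\text{effect on the treated}}+\underbrace{E[Y_{0i}\mid D_{i}=1]-E[Y_{0i}\mid D_{i}=0]}_{\text{selection bias}}\]

Since targeted blocks would be more violent even without the intervention, the selection-bias term is positive, pushing the naive comparison against the (negative) crime-reducing effect.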

e)

With a large enough sample (so the LLN kicks in), random assignment of treatment solves the selection bias problem by making \(D_{i}\) independent of potential outcomes. Observed and unobserved characteristics are then balanced across groups, and you are making an apples-to-apples comparison.
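
Formally, randomization implies \(\{Y_{0i},Y_{1i}\}\perp D_{i}\), so \(E[Y_{0i}\mid D_{i}=1]=E[Y_{0i}\mid D_{i}=0]\) and the selection-bias term in the decomposition above vanishes; the naive comparison then recovers the average causal effect.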

f)

Internal Validity

The primary threat to the internal validity of this experiment is a SUTVA violation. If the intervention in treated city blocks generates crime displacement (i.e., crime moving from blocks with areninhas to surrounding areas) or diffusion of benefits (crime falling in neighboring regions), SUTVA is violated: a control block’s outcome then depends on its neighbors’ treatment status, contaminating the comparison.

External Validity

Although Fortaleza is a large city with around 2.7 million inhabitants (comparable in size to Houston or Chicago), its levels of violence are extremely high compared to other big cities, which may limit how well the results generalize. Also, Brazilians are obsessed with football, so the intervention might need to be adapted before being exported to other countries.

Randomized Trials [70 (+ 10) points]

Racial Discrimination in the Labor Market [40 points]

For this question, we will use a dataset from a randomized experiment conducted by Marianne Bertrand and Sendhil Mullainathan. The researchers sent 4,870 fictitious resumes to employers in response to job ads in Boston and Chicago in 2001. They varied only the names of job applicants while leaving the other relevant candidate attributes unchanged (i.e., candidates had similar qualifications). Some applicants had distinctly white-sounding names such as Greg Baker and Emily Walsh, whereas other resumes carried stereotypically black-sounding names such as Lakisha Washington or Jamal Jones. Hence, any difference in callback rates can be attributed solely to the name manipulation.

  1. Illustrate this problem using the Potential Outcomes Framework

Hint: What is the unit of observation? What is the treatment \(D_{i}\) and the observed outcome \(Y_{i}\)? What are the potential outcomes?

  2. Create a dummy variable named female that takes one if sex=="f", and zero otherwise.

  3. The dataset contains information about candidates’ education (education), years of experience (yearsexp), military experience (military), computer and special skills (computerskills and specialskills), a dummy for gender (female), among others. Summarize that information by computing average values by race group.

  4. Do education, yearsexp, military, computerskills, specialskills and female look balanced between race groups? Use t.test() to formally compare resume characteristics and interpret its output. Why do we care about whether those variables are balanced?

  5. The outcome of interest in the dataset is call - a dummy that takes one if the candidate was called back. Use t.test() to compare callbacks between White names and Black names. Is there a racial gap in callbacks?

  6. Now, run a regression of call on race, education, yearsexp, military, computerskills, specialskills, and female. Does the estimate related to race change much? What explains that behavior?

a)

The unit of observation is the resume of a fictitious job applicant \(i\), which carries either a black-sounding name (\(D_{i}=1\)) or a white-sounding name (\(D_{i}=0\)). The resume also contains other characteristics such as years of experience, education, computer skills, etc.

The observed outcome \(Y_{i}\) is tied to the realized treatment status. For example, Latoya is a black-sounding name (hence \(D_{i}=1\)), and since she received a callback, we observe \(Y_{i}(1)=1\). The counterfactual outcome is \(Y_{i}(0)\).

We can imagine the following: would Jamal have received a callback from a potential employer had he carried a white-sounding name such as Matthew? Unfortunately, we cannot travel back in time and change Jamal’s treatment status, and that is the fundamental problem of causal inference: we can only observe one of the two potential outcomes. That is why all the counterfactual entries in the table below are marked “?”.

What we can do is observe multiple subjects and learn about average treatment effects:

| Resume | Name | Black-sounding name | Callback \(Y(1)\) | Callback \(Y(0)\) | Education | Computer Skills |
|---|---|---|---|---|---|---|
| 1 | Latoya | 1 | 1 | ? | High School | No |
| 2 | Matthew | 0 | ? | 0 | College | Yes |
| 3 | Sarah | 0 | ? | 1 | High School | Yes |
| … | … | … | … | … | … | … |
| n | Jamal | 1 | 0 | ? | College | Yes |
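
Because names were randomly assigned to resumes, the difference in group mean callbacks identifies the average treatment effect:

\[E[Y_{i}\mid D_{i}=1]-E[Y_{i}\mid D_{i}=0]=E[Y_{1i}-Y_{0i}]\]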

b)

setwd("D:/OneDrive - University of Illinois - Urbana/Causal Inference/ECON 474 Spring 2022/HW/HW1")
## Reading the data
resume<-readRDS("resume.RDS")
## female dummy
resume$female<-ifelse(resume$sex=="f", 1,0)

c)

## Using some tidyverse functions
library(tidyverse)

resume%>%group_by(race)%>%summarize(educ=mean(education), 
                                    exper=mean(yearsexp), 
                                    military=mean(military), 
                                    comp=mean(computerskills), 
                                    spec=mean(specialskills), 
                                    fem=mean(female))
## # A tibble: 2 x 7
##   race   educ exper military  comp  spec   fem
##   <chr> <dbl> <dbl>    <dbl> <dbl> <dbl> <dbl>
## 1 b      3.62  7.83   0.102  0.832 0.327 0.775
## 2 w      3.62  7.86   0.0924 0.809 0.330 0.764

d)

lapply(resume[,c('education', 'yearsexp', 'military', 
                 'computerskills', 'specialskills', 
                 'female')], 
       function(x) t.test(x ~ resume$race))
## $education
## 
##  Welch Two Sample t-test
## 
## data:  x by resume$race
## t = -0.24048, df = 4855.4, p-value = 0.81
## alternative hypothesis: true difference in means between group b and group w is not equal to 0
## 95 percent confidence interval:
##  -0.04510429  0.03524803
## sample estimates:
## mean in group b mean in group w 
##        3.616016        3.620945 
## 
## 
## $yearsexp
## 
##  Welch Two Sample t-test
## 
## data:  x by resume$race
## t = -0.18462, df = 4867.1, p-value = 0.8535
## alternative hypothesis: true difference in means between group b and group w is not equal to 0
## 95 percent confidence interval:
##  -0.3101545  0.2567664
## sample estimates:
## mean in group b mean in group w 
##        7.829569        7.856263 
## 
## 
## $military
## 
##  Welch Two Sample t-test
## 
## data:  x by resume$race
## t = 1.1129, df = 4858.8, p-value = 0.2658
## alternative hypothesis: true difference in means between group b and group w is not equal to 0
## 95 percent confidence interval:
##  -0.007193736  0.026084906
## sample estimates:
## mean in group b mean in group w 
##      0.10184805      0.09240246 
## 
## 
## $computerskills
## 
##  Welch Two Sample t-test
## 
## data:  x by resume$race
## t = 2.1664, df = 4854.9, p-value = 0.03033
## alternative hypothesis: true difference in means between group b and group w is not equal to 0
## 95 percent confidence interval:
##  0.002264635 0.045373969
## sample estimates:
## mean in group b mean in group w 
##       0.8324435       0.8086242 
## 
## 
## $specialskills
## 
##  Welch Two Sample t-test
## 
## data:  x by resume$race
## t = -0.21349, df = 4868, p-value = 0.831
## alternative hypothesis: true difference in means between group b and group w is not equal to 0
## 95 percent confidence interval:
##  -0.02927347  0.02352398
## sample estimates:
## mean in group b mean in group w 
##       0.3273101       0.3301848 
## 
## 
## $female
## 
##  Welch Two Sample t-test
## 
## data:  x by resume$race
## t = 0.88413, df = 4866.7, p-value = 0.3767
## alternative hypothesis: true difference in means between group b and group w is not equal to 0
## 95 percent confidence interval:
##  -0.01299869  0.03435393
## sample estimates:
## mean in group b mean in group w 
##       0.7745380       0.7638604

Resumes with black-sounding names are slightly more likely to list computer skills (0.83 vs. 0.81, a difference that is statistically significant at the 5% level), but all the other covariates are balanced between the two groups. With six tests at the 5% level, one small imbalance arising by chance is not surprising. Balance matters because it is evidence that the randomization worked: the groups look alike on everything except the name, which brings us closer to an apples-to-apples comparison.
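
Beyond the pairwise t-tests, a complementary check (a sketch, not part of the original question) is to regress a race dummy on all the resume characteristics and inspect the overall F-test; a small F-statistic indicates the covariates are jointly unrelated to treatment status:

## Joint balance check (illustrative): covariates should not predict race
resume$black <- ifelse(resume$race == "b", 1, 0)
balance_reg <- lm(black ~ education + yearsexp + military +
                    computerskills + specialskills + female,
                  data = resume)
summary(balance_reg) ## look at the overall F-test at the bottom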

e)

t.test(call~race, data=resume)
## 
##  Welch Two Sample t-test
## 
## data:  call by race
## t = -4.1147, df = 4711.6, p-value = 3.943e-05
## alternative hypothesis: true difference in means between group b and group w is not equal to 0
## 95 percent confidence interval:
##  -0.04729503 -0.01677067
## sample estimates:
## mean in group b mean in group w 
##      0.06447639      0.09650924

There is a statistically significant and meaningful difference between callback rates: 9.65% for white-sounding names versus 6.45% for black-sounding names, i.e., white names receive roughly 50% more callbacks for interviews.
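
As a quick check of where that 50% figure comes from, we can take the ratio of the group means directly:

## Ratio of callback rates: roughly 0.0965/0.0645, i.e., about 1.5
rates <- tapply(resume$call, resume$race, mean)
rates["w"]/rates["b"]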

f)

library(fixest)
reg1<-feols(call~race, se="hetero", data=resume)
reg2<-feols(call~race+education+yearsexp+
              military+computerskills+
              specialskills+female, 
              se="hetero", 
              data=resume)
etable(reg1,reg2)
##                               reg1               reg2
## Dependent Var.:               call               call
##                                                      
## (Intercept)     0.0645*** (0.0050)    0.0086 (0.0255)
## racew           0.0320*** (0.0078) 0.0314*** (0.0077)
## education                             0.0050 (0.0056)
## yearsexp                           0.0034*** (0.0009)
## military                              0.0069 (0.0125)
## computerskills                      -0.0195. (0.0113)
## specialskills                      0.0666*** (0.0093)
## female                                0.0059 (0.0093)
## _______________ __________________ __________________
## S.E. type       Heteroskedas.-rob. Heteroskedas.-rob.
## Observations                 4,870              4,870
## R2                         0.00347            0.02074
## Adj. R2                    0.00326            0.01933
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The estimates barely changed: the simple mean comparison gives a difference of 0.0320 in favor of white-sounding names, and the regression estimate is 0.0314. Since the controls are uncorrelated with the treatment (by the nature of random assignment), we expected that. The standard errors decreased slightly, which was also expected: although these covariates are uncorrelated with the treatment, they help explain callback rates, reducing the residual variance and thus the standard errors of the regression estimates.

A/B testing in Practice [30 points]

Let’s say you work as a Data Scientist at MeetMeAtTheQuad, a dating app startup founded by UIUC alumni. You want to measure the causal effect of the like button’s size on a couple of important metrics. You decide to run an A/B test to check whether the app developers should enlarge the like button.

  1. Define some key metrics you would like to evaluate in this experiment

  2. What is the experimental unit? What will the treatment and control groups see while using the app? Set up the hypothesis test you have in mind (i.e., what are your \(H_{0}\) and \(H_{a}\))?

  3. One essential part of designing an experiment is knowing the sample size needed to test your hypothesis. Say you are running a t.test() on the number of likes per user to check differences between the control and treatment groups. Using the pwr package, find the sample size required for this experiment.

Note: assume a power equal to 0.8, a significance level of .05, two groups, and a minimum effect size of 0.5.

  4. What happens to your answer in c) when you try to detect a minimum effect of 0.1?

  5. Suppose you saw a statistically significant increase in the number of likes in the treatment group, but you did not see any effect on the number of matches. What might be the explanation for this pattern?

Hint: think about the two sides involved. To have a match, two persons need to like each other.

a)

The number of likes and the number of matches per user are key metrics to evaluate in this experiment.

b)

The user is the experimental unit to be randomized. Users in the treatment group would see the bigger like button, while users in the control group would see the regular-size like button. One can set up the hypothesis test as

\[\begin{array}{rl} H_{0}\colon & \mu_{1}=\mu_{2} \\ H_{a}\colon & \mu_{1}\neq\mu_{2} \end{array}\]

where \(\mu_{1}\) is the average outcome (likes or matches) in the treatment group and \(\mu_{2}\) is the average outcome in the control group.
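
A minimal sketch of the eventual analysis under hypothetical data: simulate likes per user in each arm and run the two-sample t-test that decides between \(H_{0}\) and \(H_{a}\). The Poisson rates below are illustrative assumptions, not real app numbers:

## Illustrative simulation of the A/B test (all numbers are made up)
set.seed(474)
n <- 1000
likes_control <- rpois(n, lambda = 10.0)  ## regular-size button
likes_treated <- rpois(n, lambda = 10.5)  ## bigger button, assumed lift
t.test(likes_treated, likes_control)      ## two-sided test of equal means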

c)

library(pwr)

pwr.t.test(d = 0.5, 
           sig.level = 0.05, 
           power = 0.8)
## 
##      Two-sample t test power calculation 
## 
##               n = 63.76561
##               d = 0.5
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
## 
## NOTE: n is number in *each* group

Rounding 63.77 up to a whole user, each group needs 64 users. Hence, the total sample size needed is 128.
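
The same arithmetic in code, rounding up per group before doubling:

## Per-group n rounded up, then doubled for the two groups
n_group <- ceiling(pwr.t.test(d = 0.5, sig.level = 0.05, power = 0.8)$n)
c(per_group = n_group, total = 2*n_group)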

d)

pwr.t.test(d = 0.1, 
           sig.level = 0.05, 
           power = 0.8)
## 
##      Two-sample t test power calculation 
## 
##               n = 1570.733
##               d = 0.1
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
## 
## NOTE: n is number in *each* group

The required sample size to detect such a small effect is much bigger: 1,571 per group, or 3,142 in total.

e)

Remember that a match requires both sides: two people need to like each other. If treated users are liking people in the control group (who do not see the bigger like button and whose behavior is unchanged), those extra likes need not be reciprocated, which might explain the pattern: more likes, but the same number of matches.

Power Analysis Relationships [+10 points]

Time for some simulations! Power analysis is a necessary procedure during the design phase of an experiment. I will establish the first power-analysis relationship. Then, you do two more.

  1. Holding power and effect size constant, with a higher significance level, you need a smaller sample

Let’s simulate the required sample size for different significance levels (ranging from 0.001 to 0.1)

library(pwr)
### alpha is a vector of numbers from 0.001 to 0.1
alpha<-seq(from=0.001, to=0.1, by=0.003) 
alpha ## there are 34 numbers from 0.001 to 0.1 using 0.003 as increment
##  [1] 0.001 0.004 0.007 0.010 0.013 0.016 0.019 0.022 0.025 0.028 0.031 0.034 0.037 0.040
## [15] 0.043 0.046 0.049 0.052 0.055 0.058 0.061 0.064 0.067 0.070 0.073 0.076 0.079 0.082
## [29] 0.085 0.088 0.091 0.094 0.097 0.100
sample<-matrix(NA, ncol=1, nrow=length(alpha))
for(i in 1:length(alpha)){sample[i,1]<-pwr.t.test(d = 0.5, 
                                    sig.level = alpha[i], 
                                    power = 0.8)$n
}

data<-data.frame(alpha, sample)
plot(y=data$sample, x=data$alpha, type="l", ylab='Sample Size', xlab='Significance Level')

Now, it is your turn! Show the following:

  2. Holding significance level and effect size constant, more power will require more data

Hint: find the sample size n required for a range of power values (e.g., power from 0.60 to 0.95 by=0.01).

  3. Holding power and significance level constant, the larger the effect size between groups, the smaller the sample you need to find a statistically significant result (i.e., \(p\text{-value}<0.05\))

Hint: find the sample size n required for a range of minimum effect values (e.g., d from 0.001 to 1.5 by=0.05).

2.

power<-seq(from=0.60, to=0.95, by=0.01)

sample<-matrix(NA, ncol=1, nrow=length(power))
for(i in 1:length(power)){sample[i,1]<-pwr.t.test(d = 0.5, 
                                                  sig.level = 0.05, 
                                                  power = power[i])$n
}

data<-data.frame(power, sample)
plot(y=data$sample, x=data$power, type="l", ylab='Sample Size', xlab='Power')

3.

effect<-seq(from=0.001, to=1.5, by=0.05)

sample<-matrix(NA, ncol=1, nrow=length(effect))
for(i in 1:length(effect)){sample[i,1]<-pwr.t.test(d = effect[i], 
                                                  sig.level = 0.05, 
                                                  power = 0.8)$n
}

options(scipen = 999)
data<-data.frame(effect, sample)
plot(y=data$sample, x=data$effect, type="l", ylab='Sample Size', xlab='Effect Size')