4 Nonlinear Regression

We saw how to describe relationships by fitting a line through the data, but that approach does not always work. Here, we focus on an approach to regression that does not rely on a linear relationship between the covariates and the outcome. Specifically, we want to deal with binary dependent variables (i.e., when \(Y\) takes on only two values: 0 and 1). These lecture notes are based on Huntington-Klein (2021), chapter 13 (section 13.2.6).

4.1 Generalized Linear Models (GLM)

To extend regression when we need to allow for nonlinearity, we can use generalized linear models. Instead of writing the regression model as \(Y=\beta_{0}+\beta_{1}X\), the regression equation can be written as

\[Y=F(\beta_{0}+\beta_{1}X)\]

where \(F()\) is a link function. In this case, if \(F(x)=x\), then \(Y=\beta_{0}+\beta_{1}X\); that is, linear regression is a special case of the GLM.
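
As a quick check of that special case, here is a minimal sketch, using simulated data (not the Titanic data introduced later), showing that glm() with a Gaussian family and the identity link reproduces the lm() fit:

# With the identity link, the GLM reduces to ordinary least squares
set.seed(1)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)

coef(lm(y ~ x))                                        # OLS coefficients
coef(glm(y ~ x, family = gaussian(link = "identity"))) # same estimates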

If our \(Y\) is a binary variable, the regression model predicts the probability that \(Y\) equals one. So when we choose a link function \(F()\) that works with a binary \(Y\), we are looking for:

  1. \(F()\) that accepts any value from \(-\infty\) to \(\infty\) as input
  2. \(F()\) that produces output between 0 and 1
  3. \(F()\) that is increasing: whenever the input (\(\beta_{0}+\beta_{1}X\)) increases, the output should also increase

Many functions satisfy these criteria, but two are frequently used: the logit and the probit link functions.

The probit model looks like

\[Pr(Y=1|X)=\Phi(\beta_{0}+\beta_{1}X)\]

where \(\Phi\) is the cumulative distribution function (CDF) of the standard normal distribution.

The logit function is

\[F(\beta_{0}+\beta_{1}X)=\frac{e^{\beta_{0}+\beta_{1}X}}{1+e^{\beta_{0}+\beta_{1}X}}\]

and the logistic model can be written as \(\log(\frac{p}{1-p})=\beta_{0}+\beta_{1}X\), where \(p=Pr(Y=1|X)\). These two link functions produce very similar predictions. Note that the interpretation of a GLM differs from that of the OLS model. The most common way to present probit/logit results is in terms of marginal effects, and the interpretation will be illustrated using the Titanic dataset.
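
Before turning to the data, a minimal sketch using base R's plogis() (the logistic CDF) and pnorm() (the standard normal CDF) illustrates that both functions satisfy the three criteria above: they accept any real input, return values between 0 and 1, and are increasing.

# Plot the two link functions over a range of values of b0 + b1*X
z <- seq(-4, 4, by = 0.1)
plot(z, plogis(z), type = "l", xlab = "b0 + b1*X", ylab = "Pr(Y = 1)") # logit
lines(z, pnorm(z), lty = 2)                                            # probit
legend("topleft", legend = c("logit", "probit"), lty = 1:2)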

For the exercise, we are going to use the titanic.RDS file; you can check the variables’ description here. Read the file and create a dummy variable that takes the value one if the passenger was in first class and 0 otherwise.

library(tidyverse)

# Point R to the folder where the dataset is saved
setwd("C:/Users/User/Desktop/474-Rlab/datasets")

# Read the data and create the first-class dummy
titanic <- readRDS("titanic.RDS")
titanic$first_class <- ifelse(titanic$class == 1, 1, 0)

To run the logit regression, use the glm() function with the argument family = binomial(link = 'logit'). The dependent variable is survived, which takes the value one if the passenger survived and 0 otherwise. The covariates are age, sex, and first_class.

# Use the glm() function with
# the family = binomial(link = 'logit') option
reg <- glm(survived ~ first_class + age + sex, data = titanic,
           family = binomial(link = 'logit'))

# See the results
summary(reg)
## 
## Call:
## glm(formula = survived ~ first_class + age + sex, family = binomial(link = "logit"), 
##     data = titanic)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9969  -0.6333  -0.6333   0.6829   1.8468  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   1.3372     0.2322   5.758 8.53e-09 ***
## first_class   1.2098     0.1407   8.597  < 2e-16 ***
## age          -0.6996     0.2252  -3.106   0.0019 ** 
## sex          -2.1425     0.1227 -17.455  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2769.5  on 2200  degrees of freedom
## Residual deviance: 2255.3  on 2197  degrees of freedom
## AIC: 2263.3
## 
## Number of Fisher Scoring iterations: 4
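
For comparison, the probit model described above is fit the same way, changing only the link. This is just a sketch (stored in a new object, reg_probit, with the output omitted); its coefficients are on a different scale, but the fitted probabilities are very similar.

# Same regression with the probit link instead of the logit link
reg_probit <- glm(survived ~ first_class + age + sex, data = titanic,
                  family = binomial(link = 'probit'))
summary(reg_probit)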

Although the regression output looks like the one from lm(), we need to use the margins() function to interpret the coefficients as average marginal effects on the probability of survival. Run install.packages("margins") first if you do not have the margins package installed.

#install.packages("margins") 
library(margins)
# Get the average marginal effects of the covariates
reg %>%
  margins(variables = c('first_class','age','sex')) %>%
  summary()
##       factor     AME     SE        z      p   lower   upper
##          age -0.1164 0.0372  -3.1253 0.0018 -0.1894 -0.0434
##  first_class  0.2013 0.0223   9.0467 0.0000  0.1577  0.2449
##          sex -0.3565 0.0148 -24.0667 0.0000 -0.3856 -0.3275

Since all the explanatory variables are dummies, it is easy to interpret the marginal effects in this regression. For instance, being in first class increases the survival probability by 20.13 percentage points. You can apply the same reasoning to interpret the average marginal effects (AME) of the dummies that represent adults and men.
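
As a rough check, the AME of first_class can be approximated by hand: predict every passenger's survival probability as if they were in first class, then as if they were not, and average the difference. The data frames d1 and d0 below are hypothetical copies created just for this calculation, and the result should be close to (though not exactly equal to) the 0.2013 reported by margins().

# Counterfactual copies of the data: everyone in / out of first class
d1 <- d0 <- titanic
d1$first_class <- 1
d0$first_class <- 0
# Average difference in predicted survival probabilities
mean(predict(reg, newdata = d1, type = "response") -
     predict(reg, newdata = d0, type = "response"))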

4.2 A Taste of Machine Learning

Logistic regression is frequently used for classification tasks in machine learning and is a natural choice for binary classification. The first step is to randomly split the dataset into training and testing samples. You will fit the model using 75% of the data and test the accuracy of your predictions on the remaining 25%.

library(caret)

# Creating indices for a reproducible 75%/25% split
set.seed(123)
trainIndex <- createDataPartition(titanic$survived, p = 0.75, list = FALSE)

titanic_train <- titanic[trainIndex, ]  # training data (75% of data)
titanic_test  <- titanic[-trainIndex, ] # testing data (25% of data)
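
A quick sanity check on the split sizes (the two samples should contain roughly 75% and 25% of the passengers):

nrow(titanic_train) # training observations
nrow(titanic_test)  # testing observations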

Run the model with the titanic_train dataset.

model <- glm(survived ~ first_class + age + sex, data = titanic_train,
             family = binomial(link = 'logit'))
summary(model)
## 
## Call:
## glm(formula = survived ~ first_class + age + sex, family = binomial(link = "logit"), 
##     data = titanic_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9788  -0.6566  -0.6566   0.9043   1.8113  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   1.3884     0.2649   5.242 1.59e-07 ***
## first_class   1.1225     0.1606   6.990 2.74e-12 ***
## age          -0.7054     0.2561  -2.754  0.00588 ** 
## sex          -2.1079     0.1418 -14.863  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2102.7  on 1650  degrees of freedom
## Residual deviance: 1730.8  on 1647  degrees of freedom
## AIC: 1738.8
## 
## Number of Fisher Scoring iterations: 4

Then, get the predicted probability of survival for each passenger in the testing sample based on the explanatory variables age, sex, and first_class. Let’s set the threshold at 0.5 - i.e., if the predicted probability is higher than 50%, we classify the passenger as a survivor. After that, display the confusion matrix: a 2x2 table that compares how the model classified each passenger in the testing sample with what actually happened to those passengers.

# Predicted survival probabilities for the testing sample
probs <- predict(model, titanic_test, type = "response")
# Classify a passenger as a survivor if the predicted probability exceeds 0.5
preds <- ifelse(probs > 0.5, 1, 0)
# Compare the predictions with the actual outcomes
confusionMatrix(factor(preds), factor(titanic_test$survived))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 356  75
##          1  34  85
##                                          
##                Accuracy : 0.8018         
##                  95% CI : (0.766, 0.8343)
##     No Information Rate : 0.7091         
##     P-Value [Acc > NIR] : 4.471e-07      
##                                          
##                   Kappa : 0.4804         
##                                          
##  Mcnemar's Test P-Value : 0.0001275      
##                                          
##             Sensitivity : 0.9128         
##             Specificity : 0.5312         
##          Pos Pred Value : 0.8260         
##          Neg Pred Value : 0.7143         
##              Prevalence : 0.7091         
##          Detection Rate : 0.6473         
##    Detection Prevalence : 0.7836         
##       Balanced Accuracy : 0.7220         
##                                          
##        'Positive' Class : 0              
## 

Overall, the model classifies the passengers in the testing sample correctly about 80% of the time.
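
The same accuracy can be recovered directly from the predictions, or by hand from the confusion matrix: (356 + 85) correct classifications out of 550 test passengers gives 0.8018.

# Share of test passengers classified correctly
mean(preds == titanic_test$survived)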

References

Huntington-Klein, Nick. 2021. The Effect: An Introduction to Research Design and Causality. Chapman & Hall/CRC.