4 Nonlinear Regression
We saw how to describe relationships by fitting a line through the data, but that does not always work. Here, we focus on an approach to regression that does not rely on a linear relationship between the covariates and the outcome. Specifically, we want to deal with binary dependent variables (i.e., when \(Y\) takes on only two values: 0 and 1). These lecture notes are based on Huntington-Klein (2021), chapter 13 (section 13.2.6).
4.1 Generalized Linear Models (GLM)
To extend regression when we need to allow for nonlinearity, we can use generalized linear models. Instead of writing the regression model as \(Y=\beta_{0}+\beta_{1}X\), the regression equation can be written as
\[Y=F(\beta_{0}+\beta_{1}X)\]
where \(F()\) is a link function. In particular, if \(F(x)=x\), then \(Y=\beta_{0}+\beta_{1}X\), i.e., linear regression is a special case of the GLM.
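To see this special case in action, here is a minimal sketch with simulated data (not the titanic data used below): glm() with a Gaussian family and identity link reproduces the lm() estimates.
# Minimal sketch: OLS as a GLM with the identity link (simulated data)
set.seed(1)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)
coef(lm(y ~ x))                                         # OLS estimates
coef(glm(y ~ x, family = gaussian(link = "identity")))  # same estimates via glm()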
If our \(Y\) is a binary variable, the regression model predicts the probability that \(Y\) is equal to one. So when we use a link function \(F()\) with a binary \(Y\), we are looking for:
- an \(F()\) that accepts any input value from \(-\infty\) to \(\infty\)
- an \(F()\) that produces values between 0 and 1
- an \(F()\) that is increasing: whenever the input (\(\beta_{0}+\beta_{1}X\)) increases, the output also increases
Many functions satisfy these criteria, but two are frequently used: the logit and the probit link functions.
The probit function looks like
\[Pr(Y=1|X)=\Phi(\beta_{0}+\beta_{1}X)\]
where \(\Phi\) is the cumulative distribution function (CDF) of the standard normal distribution.
The logit function is
\[F(\beta_{0}+\beta_{1}X)=\frac{e^{\beta_{0}+\beta_{1}X}}{1+e^{\beta_{0}+\beta_{1}X}}\]
and we can write the logistic model as \(\log(\frac{p}{1-p})=\beta_{0}+\beta_{1}X\), where \(p=Pr(Y=1|X)\). These two link functions produce very similar predictions (see the quick sketch below). Note that the interpretation of a GLM differs from that of the OLS model. The most common approach is to present the results of probit/logit models in terms of marginal effects, and the interpretation will be illustrated using the titanic dataset.
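Before moving to the exercise, a quick base-R sketch (not part of the exercise itself) shows how close the two curves are: plogis() is the logistic CDF used by the logit link, and pnorm() is the standard normal CDF used by the probit link.
# Quick sketch: the logit and probit link functions have a similar S-shape
z <- seq(-4, 4, by = 0.1)
plot(z, plogis(z), type = "l",
     xlab = "beta0 + beta1*X", ylab = "Pr(Y = 1)")
lines(z, pnorm(z), lty = 2)
legend("topleft", legend = c("logit", "probit"), lty = c(1, 2))
The logistic curve is a bit flatter in the tails, which is why logit coefficients tend to be roughly 1.6 times larger than probit coefficients from the same model, even though the predicted probabilities end up very close.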
For the exercise, we are going to use the titanic.RDS file, and you can check the variables’ description here. Read the file and create a dummy that takes the value one if the passenger was in first class and 0 otherwise.
library(tidyverse)
setwd("C:/Users/User/Desktop/474-Rlab/datasets")
titanic <- readRDS("titanic.RDS")
titanic$first_class <- ifelse(titanic$class == 1, 1, 0)
To run the logit regression, use the glm() function with the argument family = binomial(link = 'logit'). The dependent variable is survived, which takes on one if the passenger survived and 0 otherwise. The covariates are age, sex, and first_class.
# Use the glm() function with
# the family = binomial(link = 'logit') option
reg <- glm(survived ~ first_class + age + sex, data = titanic,
           family = binomial(link = 'logit'))
# See the results
summary(reg)
##
## Call:
## glm(formula = survived ~ first_class + age + sex, family = binomial(link = "logit"),
## data = titanic)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9969 -0.6333 -0.6333 0.6829 1.8468
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.3372 0.2322 5.758 8.53e-09 ***
## first_class 1.2098 0.1407 8.597 < 2e-16 ***
## age -0.6996 0.2252 -3.106 0.0019 **
## sex -2.1425 0.1227 -17.455 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2769.5 on 2200 degrees of freedom
## Residual deviance: 2255.3 on 2197 degrees of freedom
## AIC: 2263.3
##
## Number of Fisher Scoring iterations: 4
Although the regression output looks like the one from lm(), we need to use the margins() function to interpret the coefficients in terms of average marginal effects on the probability that \(Y=1\). Run install.packages("margins") first if you do not have the margins package installed.
#install.packages("margins")
library(margins)
# Get the average marginal effect of each covariate
reg %>%
  margins(variables = c('first_class', 'age', 'sex')) %>%
  summary()
## factor AME SE z p lower upper
## age -0.1164 0.0372 -3.1253 0.0018 -0.1894 -0.0434
## first_class 0.2013 0.0223 9.0467 0.0000 0.1577 0.2449
## sex -0.3565 0.0148 -24.0667 0.0000 -0.3856 -0.3275
Since all the explanatory variables are dummies, the marginal effects in this regression are easy to interpret. For instance, being in first class increases the survival probability by 20.13 percentage points. You can apply the same reasoning to interpret the average marginal effects (AME) of the dummies that represent adults (age) and men (sex).
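As a hand-rolled check of the first_class AME (an approximation, not the internals of margins()), we can compare the average predicted survival probability when every passenger is set to first class against the case where no one is:
# Hand-rolled check of the first_class AME: average change in the predicted
# probability when first_class is switched from 0 to 1 for every passenger
d0 <- d1 <- titanic
d0$first_class <- 0
d1$first_class <- 1
mean(predict(reg, newdata = d1, type = "response") -
       predict(reg, newdata = d0, type = "response"))
This number should be close to the 0.2013 reported above; it need not match exactly, because margins() computes derivative-based effects for numeric variables.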
4.2 A Taste of Machine Learning
Logistic regression is frequently used for classification tasks in machine learning, and it is an excellent algorithm for binary classification. The first step is to randomly split the dataset into training and testing samples. You will fit the model using 75% of the data and test the accuracy of your predictions using the other 25%.
library(caret)
#creating indices
set.seed(123)
trainIndex <- createDataPartition(titanic$survived, p = 0.75, list = FALSE)

titanic_train <- titanic[trainIndex, ]   # training data (75% of data)
titanic_test  <- titanic[-trainIndex, ]  # testing data (25% of data)
Run the model with the titanic_train dataset.
model <- glm(survived ~ first_class + age + sex,
             family = binomial(link = 'logit'), data = titanic_train)
summary(model)
##
## Call:
## glm(formula = survived ~ first_class + age + sex, family = binomial(link = "logit"),
## data = titanic_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9788 -0.6566 -0.6566 0.9043 1.8113
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.3884 0.2649 5.242 1.59e-07 ***
## first_class 1.1225 0.1606 6.990 2.74e-12 ***
## age -0.7054 0.2561 -2.754 0.00588 **
## sex -2.1079 0.1418 -14.863 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2102.7 on 1650 degrees of freedom
## Residual deviance: 1730.8 on 1647 degrees of freedom
## AIC: 1738.8
##
## Number of Fisher Scoring iterations: 4
Then, get the predicted probability of survival for each passenger in the testing sample according to the explanatory variables age, sex, and first_class. Let’s set the threshold at 0.5, i.e., if the predicted probability is higher than 50%, we consider that the passenger survived. After that, display the confusion matrix: a 2x2 matrix that shows how the model classified each passenger in the testing sample vs. what actually happened to those passengers.
probs <- predict(model, titanic_test, type = "response")
preds <- ifelse(probs > 0.5, 1, 0)
confusionMatrix(factor(preds), factor(titanic_test$survived))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 356 75
## 1 34 85
##
## Accuracy : 0.8018
## 95% CI : (0.766, 0.8343)
## No Information Rate : 0.7091
## P-Value [Acc > NIR] : 4.471e-07
##
## Kappa : 0.4804
##
## Mcnemar's Test P-Value : 0.0001275
##
## Sensitivity : 0.9128
## Specificity : 0.5312
## Pos Pred Value : 0.8260
## Neg Pred Value : 0.7143
## Prevalence : 0.7091
## Detection Rate : 0.6473
## Detection Prevalence : 0.7836
## Balanced Accuracy : 0.7220
##
## 'Positive' Class : 0
##
Overall, the model is correct around 80% of the time.
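As a sanity check, the accuracy can be recomputed by hand from the predictions above as the share of test-set passengers classified correctly (assuming survived is coded 0/1 as before):
# Share of test-set passengers whose outcome the model classifies correctly
mean(preds == titanic_test$survived)  # roughly 0.80, matching the Accuracy line above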