This lab returns to the Titanic dataset to predict whether a passenger survived. The dataset comes from the earth
package in R.
The data frame has 1046 observations on 6 variables.
set.seed(11142024)
library(tidyverse)
library(earth)
data("etitanic")
titanic <- etitanic |>
  mutate(survived_factor = factor(survived))
glimpse(titanic)
What factors in the dataset do you think will influence whether a passenger survives? How do you expect those factors to change survival outcomes?
Regardless of your response to the first question, create figures to explore survival as a function of age, sex, and pclass. As always, include informative titles, axes, and legends.
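As a starting point, the figures might look something like the sketch below. It is only one of many reasonable designs. The helper column outcome and the object names titanic_plot, class_sex_plot, and age_plot are assumptions made for illustration and are not part of the lab's setup code.

```r
library(tidyverse)
library(earth)

data("etitanic")

# `outcome` is a hypothetical helper column that gives the legend readable labels
titanic_plot <- etitanic |>
  mutate(outcome = factor(survived, levels = c(0, 1),
                          labels = c("Died", "Survived")))

# Proportion of passengers who survived within each class, one panel per sex
class_sex_plot <- ggplot(titanic_plot, aes(x = pclass, fill = outcome)) +
  geom_bar(position = "fill") +
  facet_wrap(~ sex) +
  labs(title = "Survival by passenger class and sex",
       x = "Passenger class",
       y = "Proportion of passengers",
       fill = "Outcome")

# Age distribution for each survival outcome
age_plot <- ggplot(titanic_plot, aes(x = outcome, y = age, fill = outcome)) +
  geom_boxplot() +
  labs(title = "Passenger age by survival outcome",
       x = "Outcome",
       y = "Age (years)",
       fill = "Outcome")

print(class_sex_plot)
print(age_plot)
```

A stacked proportion bar makes class and sex differences easy to compare across groups of different sizes; a histogram or density plot of age would be an equally valid alternative to the boxplot.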
Construct a training set and a test set. If you plan to do model tuning, also create a validation set.
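One way to build the split is sketched below, assuming a simple random 80/20 division of rows. The 80% proportion and the names train_idx and test_titanic are assumptions for illustration; train_titanic matches the name used in the model template later in the lab.

```r
library(tidyverse)
library(earth)

set.seed(11142024)
data("etitanic")

titanic <- etitanic |>
  mutate(survived_factor = factor(survived))

# Assumption: 80% of rows go to training, the rest to testing
train_idx <- sample(nrow(titanic), size = floor(0.8 * nrow(titanic)))

train_titanic <- titanic[train_idx, ]
test_titanic  <- titanic[-train_idx, ]

# Sanity check: the two sets together cover every row exactly once
nrow(train_titanic) + nrow(test_titanic) == nrow(titanic)
```

If you tune a model, a validation set can be carved out of the training rows in the same way, so that the test set is only used once at the end.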
Use a logistic regression model to predict passenger survival. Summarize the model outcome using classification error (% of incorrect predictions on the test set).
#log_reg <- glm(survived_factor ~ , family = binomial(link = 'logit'), data = train_titanic)
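To illustrate how the classification error could be computed, here is a minimal sketch. The predictors age, sex, and pclass, the 80/20 split, the 0.5 probability cutoff, and the names test_titanic, pred_prob, pred_class, and log_reg_error are all assumptions chosen for the example; your own formula and split may differ.

```r
library(tidyverse)
library(earth)

set.seed(11142024)
data("etitanic")

titanic <- etitanic |>
  mutate(survived_factor = factor(survived))

# Assumed 80/20 random split
train_idx <- sample(nrow(titanic), size = floor(0.8 * nrow(titanic)))
train_titanic <- titanic[train_idx, ]
test_titanic  <- titanic[-train_idx, ]

# Assumed predictors: age, sex, and pclass
log_reg <- glm(survived_factor ~ age + sex + pclass,
               family = binomial(link = "logit"),
               data = train_titanic)

# Predicted probability of the second factor level ("1", survived)
pred_prob <- predict(log_reg, newdata = test_titanic, type = "response")

# Assumed cutoff of 0.5 to turn probabilities into class labels
pred_class <- ifelse(pred_prob > 0.5, "1", "0")

# Classification error: share of test passengers predicted incorrectly
log_reg_error <- mean(pred_class != as.character(test_titanic$survived_factor))
log_reg_error
```

Multiply log_reg_error by 100 to report it as a percentage. Calling summary(log_reg) shows which coefficients the model considers important, which is useful when comparing against your figures.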
Use a tree-based model to predict passenger survival. Summarize the model outcome using classification error (% of incorrect predictions on the test set).
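A single classification tree from the rpart package is one possible tree-based model; a random forest or boosted trees would also satisfy the prompt. The sketch below assumes the same predictors (age, sex, pclass) and 80/20 split as a plain example, and the names tree_fit, tree_pred, and tree_error are hypothetical.

```r
library(tidyverse)
library(earth)
library(rpart)

set.seed(11142024)
data("etitanic")

titanic <- etitanic |>
  mutate(survived_factor = factor(survived))

# Assumed 80/20 random split
train_idx <- sample(nrow(titanic), size = floor(0.8 * nrow(titanic)))
train_titanic <- titanic[train_idx, ]
test_titanic  <- titanic[-train_idx, ]

# Classification tree with assumed predictors and rpart's default settings
tree_fit <- rpart(survived_factor ~ age + sex + pclass,
                  data = train_titanic,
                  method = "class")

# Predicted class labels for the held-out passengers
tree_pred <- predict(tree_fit, newdata = test_titanic, type = "class")

# Classification error: share of test passengers predicted incorrectly
tree_error <- mean(as.character(tree_pred) !=
                     as.character(test_titanic$survived_factor))
tree_error
```

Printing tree_fit lists the splits the tree chose, which gives a direct way to compare the model's structure with the patterns in your exploratory figures.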
Do your model outcomes match your intuition and data visualizations? Why or why not?