Read in the train.csv data. Set a seed with set.seed() so your work is reproducible.
Read in the fallmembershipreport_20192020.xlsx data, and select the Attending School ID, School Name, and all columns that represent the race/ethnicity percentages for the schools (there is example code in recent class slides). Join the two data sets.
If you have accessed outside data to help increase the performance of your models for the final project (e.g., NCES), you can read in and join those data as well.
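A minimal sketch of reading and joining the files, assuming the files sit in a data/ folder and using {tidyverse}, {readxl}, and {janitor}. The cleaned column names (attending_school_id, the "percent" columns) and the join key are placeholders; check the names in your own files and adjust.

library(tidyverse)
library(readxl)
library(janitor)

train <- read_csv("data/train.csv") %>%
  clean_names()

# Column names after clean_names() are assumptions; inspect names() to confirm.
ethnicities <- read_xlsx("data/fallmembershipreport_20192020.xlsx") %>%   # add sheet = if needed
  clean_names() %>%
  select(attending_school_id,
         school_name,
         contains("percent"))   # race/ethnicity percentage columns

# Supply by = c("<train key>" = "attending_school_id") with your actual key column.
joined <- left_join(train, ethnicities)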
Split the joined data from above into a training set and a test set, stratified by the outcome (score). Use 10-fold CV to resample the training set, also stratified by score.
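One way to do the split and resampling, assuming the joined data frame is called joined and the outcome column is named score:

library(tidymodels)

set.seed(3000)                                    # any seed, for reproducibility
splt      <- initial_split(joined, strata = "score")
train_set <- training(splt)
test_set  <- testing(splt)

set.seed(3000)
cv <- vfold_cv(train_set, v = 10, strata = "score")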
Create one recipe to prepare your data for the CART, bagged tree, and random forest models. This lab could potentially serve as a template for your Preliminary Fit 2, or your final model prediction for the Final Project, so consider applying what might be your best model formula and the necessary preprocessing steps.
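A sketch of one possible recipe; the formula (score ~ .), the id-variable handling, and the steps are illustrative only, so substitute your own best model formula and preprocessing.

rec <- recipe(score ~ ., data = train_set) %>%
  update_role(contains("id"), new_role = "id_vars") %>%   # keep ID columns out of the model
  step_novel(all_nominal_predictors()) %>%
  step_unknown(all_nominal_predictors()) %>%
  step_zv(all_predictors()) %>%
  step_impute_median(all_numeric_predictors())

Because all three models here are tree-based, dummy coding and normalization steps are not strictly needed and are omitted from this sketch.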
Create a parsnip CART model using {rpart} for the estimation, tuning the cost complexity and the minimum \(n\) for a terminal node.
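A sketch of the CART specification (object names such as cart_mod are placeholders):

cart_mod <- decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("regression") %>%
  set_args(cost_complexity = tune(),
           min_n = tune())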
Create a workflow object that combines your recipe and your parsnip model object.
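For example, assuming the rec and cart_mod objects from the sketches above:

cart_wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(cart_mod)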
Tune your model with tune_grid. Use grid = 10 to choose 10 grid points automatically. In the metrics argument, please include rmse, rsq, and huber_loss. Record how long the tuning takes to run; you could use {tictoc}, or you could do something like:

start_rf <- Sys.time()
# code to fit model
end_rf <- Sys.time()
end_rf - start_rf
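A sketch of the tuning call with a Sys.time() timer around it, assuming the cv and cart_wf objects from earlier sketches:

start_cart <- Sys.time()
cart_res <- tune_grid(
  cart_wf,
  resamples = cv,
  grid = 10,
  metrics = metric_set(rmse, rsq, huber_loss)
)
end_cart <- Sys.time()
end_cart - start_cart                          # elapsed run time

show_best(cart_res, metric = "rmse", n = 1)    # e.g., best cost_complexity/min_n by RMSE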
Create a parsnip bagged tree model using {baguette}, tuning the cost_complexity and min_n.
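A sketch of the bagged tree specification; bag_tree() comes from {baguette}, and the number of bags (times) is just an example value:

library(baguette)

bag_mod <- bag_tree() %>%
  set_engine("rpart", times = 10) %>%   # times = number of bootstrap bags
  set_mode("regression") %>%
  set_args(cost_complexity = tune(),
           min_n = tune())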
Create a workflow object that combines your recipe and your bagged tree model specification.
Tune your model with tune_grid. Use grid = 10 to choose 10 grid points automatically. In the metrics argument, please include rmse, rsq, and huber_loss. In the control argument, please include extract = function(x) extract_model(x) to extract the model from each fit. {baguette} is optimized to run in parallel with the {future} package, so consider using {future} to speed up processing time (see the class slides). Show the single best estimates for each of the three performance metrics and the tuning parameter values associated with each.
Run the bag_roots function below. Apply this function to the extracted bagged tree models from the previous step. This will output the feature at the root node for each of the decision trees fit.
bag_roots <- function(x){
  x %>%
    select(.extracts) %>%
    unnest(cols = c(.extracts)) %>%                    # one row per extracted bagged-tree fit
    mutate(models = map(.extracts, ~.x$model_df)) %>%  # tibble of individual trees per fit
    select(-.extracts) %>%
    unnest(cols = c(models)) %>%                       # one row per tree
    mutate(root = map_chr(model,
                          ~as.character(.x$fit$frame[1, 1]))) %>%  # root split variable
    select(root)
}
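A usage sketch, assuming bag_res holds the tuning results with the extracted models:

bag_roots(bag_res) %>%
  count(root, sort = TRUE)   # how often each feature appears at a root node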
Create a parsnip random forest model using {ranger}. In your model specification, include the importance = "permutation" argument to run permutation-based variable importance. Create a workflow object that combines your recipe and your random forest model specification.
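A sketch of the specification and workflow; trees = 1000 is set here only because the rf_tree_roots() helper below loops over 1:1000 trees, and num.threads is optional:

rf_mod <- rand_forest(trees = 1000) %>%
  set_engine("ranger",
             importance = "permutation",                  # permutation variable importance
             num.threads = parallel::detectCores()) %>%   # optional: use all cores
  set_mode("regression")

rf_wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(rf_mod)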
Fit your model to the 10-fold resamples. In the metrics argument, please include rmse, rsq, and huber_loss. In the control argument, please include extract = function(x) x to extract the workflow from each fit. Show the single best estimates for each of the three performance metrics.
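Since no parameters are tuned here, fit_resamples() is one way to fit the workflow to the resamples (a sketch, assuming the cv and rf_wf objects above):

rf_res <- fit_resamples(
  rf_wf,
  resamples = cv,
  metrics = metric_set(rmse, rsq, huber_loss),
  control = control_resamples(extract = function(x) x)
)

collect_metrics(rf_res)   # mean CV estimate for rmse, rsq, and huber_loss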
Run the two functions in the code chunk below. Then apply the rf_roots function to the results of your random forest model to output the feature at the root node for each of the decision trees fit in your random forest model.
rf_tree_roots <- function(x){
  # Root split variable for each tree; assumes the forest was fit with 1,000 trees.
  map_chr(1:1000,
          ~ranger::treeInfo(x, tree = .)[1, "splitvarName"])
}

rf_roots <- function(x){
  x %>%
    select(.extracts) %>%
    unnest(cols = c(.extracts)) %>%                 # one row per extracted workflow
    mutate(fit = map(.extracts, ~.x$fit$fit$fit),   # underlying {ranger} object
           oob_rmse = map_dbl(fit,
                              ~sqrt(.x$prediction.error)),
           roots = map(fit, ~rf_tree_roots(.))) %>%
    select(roots) %>%
    unnest(cols = c(roots))
}
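A usage sketch, assuming rf_res holds the fit_resamples() results with the extracted workflows:

rf_roots(rf_res) %>%
  count(roots, sort = TRUE)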
Produce a plot of the frequency of features at the root node of the trees in your bagged tree model, and a similar plot for your random forest model.
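One way to draw the bagged tree figure (a sketch using the placeholder objects above; the random forest version is the same with rf_roots(rf_res)):

bag_roots(bag_res) %>%
  count(root) %>%
  ggplot(aes(fct_reorder(root, n), n)) +
  geom_col() +
  coord_flip() +
  labs(x = "Feature at root node",
       y = "Number of trees")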
Please explain why the bagged tree root node figure and the random forest root node figure are different.
Apply the fit function to your random forest workflow object and your full training data. In class we talked about the idea that bagged tree and random forest models use resampling, and that one could use the OOB prediction error provided by the models to estimate model performance. The OOB prediction error reported by {ranger} is a mean squared error, so take the sqrt() of this value to get the RMSE; you can extract it by running sqrt(fit-object-name-here$fit$fit$fit$prediction.error).
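A sketch of the full-training-data fit and the OOB RMSE, assuming the rf_wf and train_set objects from earlier sketches:

rf_full <- fit(rf_wf, data = train_set)

# {ranger} stores the OOB error as an MSE; the square root gives the RMSE.
sqrt(rf_full$fit$fit$fit$prediction.error)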
Consider the four models you fit: (a) decision tree, (b) bagged tree, (c) random forest fit on resamples, and (d) random forest fit on the training data. Which model would you use for your final fit? Please consider the performance metrics as well as the run time, and briefly explain your decision.