The goal of this lab is to apply and tune K-nearest neighbor models and explore non-regular grids while using HPC via talapas.
Please complete this project using a GitHub repo. You will complete this lab by submitting a link to your repo. In your repo, please have `data`, `models`, and `plots` folders, as well as a `screenshots` folder. I will be asking you to take screenshots of your work at multiple points.
When you run the models on talapas, you can decide how much of the data to use. You are free to even stick with half of 1%. The idea is to give you practice working with talapas, but $K$NN models are just really inefficient and I'd like you to be able to work mostly interactively. As I (Daniel) worked through the lab with the full dataset, I found that the preliminary run took about 10-12 minutes, the tuning took a very long time, and the final fit was very fast. Note that if you sample, say, 50% of the data, the run time will not drop to exactly 50% of the full-data time, but it will obviously help.
On talapas, do the following:

- Install the R packages you will need (from the login node, type `R` and hit return to connect to R, install like normal, saving to a local library, then type `q()` to exit R).
- Create an `edld654` folder with `data` and `models` folders inside it.
- Upload `train.csv` into the `data` folder.

If you have trouble with this step, you will not be able to complete subsequent steps in the lab, so please ask for help. A sketch of these setup commands follows.
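For reference, the folder setup and package installation from a talapas login node might look something like the following. The hostname and package are placeholders, and the `scp` line runs from your local machine, not from talapas:

```bash
# Create the project folders on talapas
mkdir -p edld654/data edld654/models

# Start R on the login node, install like normal (saving to a local
# library when prompted), then quit
R
# > install.packages("tidymodels")
# > q()

# From your LOCAL machine, upload the training data (placeholder host)
scp train.csv duckid@talapas-ln1.uoregon.edu:~/edld654/data/
```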
On your local (not on talapas), create a new R script. Do the following.

First, set up parallel processing:

```r
# Detect the number of physical cores
all_cores <- parallel::detectCores(logical = FALSE)

library(doParallel)

# Create a cluster with one worker per core and register it
# as the parallel backend
cl <- makePSOCKcluster(all_cores)
registerDoParallel(cl)

# Confirm the number of registered workers
foreach::getDoParWorkers()

# Load {tidymodels} on each worker
clusterEvalQ(cl, {library(tidymodels)})
```
Then:

- Read in `train.csv` from the `data` folder using a relative path (not `here::here`).
- Use `dplyr::sample_frac` or `dplyr::slice_sample` (basically equivalent) to randomly select a proportion of 0.005 of all rows (i.e., half of one percent). This is so you can work with the models more easily locally; you'll then comment this part out when you go to talapas.
- Create your recipe, making sure your outcome is not included in `step_novel()`, `step_dummy()`, or any other operation you apply to `all_nominal()`, unless of course you want that operation to be conducted on your outcome too. You can write the recipe in its own script and `source` it in your other scripts, or you can just copy and paste it into each model fit script. We'll be writing separate scripts for each model fit.
- Use `fit_resamples` to fit the model. Save these resamples to an object.
- Save the object as an `.Rds` file in the models folder using a relative path (these steps are sketched below).
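Pulling those steps together, a minimal sketch of the preliminary script might look like the following. The outcome name `score`, the default split and 10-fold cross-validation, the normalization step, and the `"kknn"` engine in regression mode are all assumptions for illustration; adapt them to your actual data and model:

```r
library(tidymodels)

# Read the data with a relative path (not here::here)
train <- readr::read_csv("data/train.csv")

# Work with half of one percent of rows locally;
# comment this out when you move to talapas
train <- dplyr::slice_sample(train, prop = 0.005)

# Initial split and cross-validation resamples (assumed defaults)
splt <- initial_split(train)
train_df <- training(splt)
cv <- vfold_cv(train_df)

# Recipe: the outcome ("score" is a placeholder name) is excluded
# from the steps applied to all_nominal()
rec <- recipe(score ~ ., data = train_df) %>%
  step_novel(all_nominal(), -all_outcomes()) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_normalize(all_numeric(), -all_outcomes())  # KNN is distance-based

# A default KNN model (engine and mode are assumptions)
knn_mod <- nearest_neighbor() %>%
  set_engine("kknn") %>%
  set_mode("regression")

# Fit to the resamples and save the result with a relative path
knn_res <- fit_resamples(knn_mod, rec, resamples = cv)
saveRDS(knn_res, "models/knn-resamples.Rds")
```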
Next, move to talapas:

- Write an `.srun` batch script (a sketch of what this file might contain follows this list).
- Transfer your R script and `.srun` file over to talapas.
- Submit the job with `sbatch my-batch-script.srun`, replacing `my-batch-script` with the name of your `.srun` file.
- Check the status of your job with `squeue -u duckid`, replacing `duckid` with your duckid. Take a screenshot of the status and place it in the screenshots folder.
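A minimal `.srun` file might look something like the sketch below. The account, partition, wall time, core count, and script name are all placeholders; use the values appropriate for your talapas allocation:

```bash
#!/bin/bash
#SBATCH --account=your_pirg       # placeholder: your talapas account/PIRG
#SBATCH --partition=short         # placeholder: an appropriate partition
#SBATCH --job-name=knn-prelim     # name shown in the queue
#SBATCH --output=knn-prelim.out   # file capturing the console output
#SBATCH --time=02:00:00           # placeholder wall-time limit
#SBATCH --nodes=1                 # single node
#SBATCH --cpus-per-task=28        # placeholder: cores for parallel processing

# Load R (the module name/version may differ on your system)
module load R

# Run the model script non-interactively
Rscript fit-knn1.R                # placeholder: your model fit script
```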
Create a new R script to conduct model tuning. You will need all the same pieces up to the optional part above (i.e., everything up to where you define the model). Then:

- Create a non-regular, space-filling design grid using `grid_max_entropy` and 25 parameter values. Set the range of the `neighbors` hyperparameter to `c(1, 20)`, along with the `dist_power` hyperparameter.
- Use `ggplot` to create a plot showing a graphical representation of your non-regular grid. Save this in your "plots" folder. Note: you only need one of these for your group, not one for each member. Comment this code out when you're done so it doesn't run when you fit the model on talapas.
- Conduct your model tuning on your resamples, using your specified recipe and non-regular grid (see the sketch after this list).
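A sketch of those steps, assuming the recipe (`rec`) and resamples (`cv`) from the preliminary script are already in place, and that the model is parsnip's `nearest_neighbor` with the `"kknn"` engine in regression mode (both assumptions for illustration):

```r
library(tidymodels)

# KNN model with both hyperparameters flagged for tuning
knn_mod <- nearest_neighbor(neighbors = tune(), dist_power = tune()) %>%
  set_engine("kknn") %>%
  set_mode("regression")

# Non-regular, space-filling design with 25 candidate combinations;
# neighbors is expanded to the range c(1, 20)
knn_grid <- grid_max_entropy(
  neighbors(range = c(1, 20)),
  dist_power(),
  size = 25
)

# Visualize the design; comment this out before running on talapas
ggplot(knn_grid, aes(neighbors, dist_power)) +
  geom_point()
ggsave("plots/knn-grid.png")

# Tune across the resamples with the recipe and the grid
knn_tuned <- tune_grid(
  knn_mod,
  rec,
  resamples = cv,
  grid = knn_grid
)

# Save with a distinct name so it doesn't overwrite the preliminary model
saveRDS(knn_tuned, "models/knn-tuned.Rds")
```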
Modify your `.srun` script. Run the tuning script with the full data or the sample of rows you have chosen, saving the model tuning object into the models folder. Make sure to save it with a different name than your previous model. Take a screenshot of your models folder on talapas showing the two fitted models and place it in your screenshots folder.

Conduct your final fit. Do this first on your local using the small proportion of rows, then replicate it on talapas. Again, please make sure to save the final fit as a new model object and take a screenshot showing your models folder after it has saved (the folder should have three model objects). Use:
- `select_best` to select your best tuned model.
- `finalize_model` to finalize your model using your best tuned model.
- `finalize_recipe` to finalize your recipe using your best tuned model (in this case, we didn't tune anything in our recipe, but it's a good habit to get into).
- `last_fit` to run your last fit with your `finalized_model` and `finalized_recipe` on your initial data split. When working on talapas, this is the model object you should save and transfer to your local.
- `collect_metrics` to pull the metrics from your final results. Similar to your model tuning, you will report these for the model run on talapas. You can report these in the same Rmd as the preliminary fit and tuning, or in a third CSV that you write to your local. (These steps are sketched below.)
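A minimal sketch of the final fit, assuming `knn_tuned`, `knn_mod`, `rec`, and `splt` are the tuning results, tunable model, recipe, and initial split from the earlier sketches, and that `"rmse"` is the metric of interest (an assumption; use whichever metric fits your problem):

```r
# Select the best hyperparameter combination from the tuning results
best <- select_best(knn_tuned, metric = "rmse")  # metric is an assumption

# Finalize the model and recipe with the winning parameters
final_mod <- finalize_model(knn_mod, best)
final_rec <- finalize_recipe(rec, best)

# Fit on the full training set and evaluate on the held-out split;
# save with a third, distinct name
final_res <- last_fit(final_mod, final_rec, split = splt)
saveRDS(final_res, "models/knn-final.Rds")

# Extract performance metrics from the final fit
collect_metrics(final_res)
```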
Please submit your lab on Canvas by pasting a link to your GitHub repo.