1. Split the data into a training set and a testing set, saved as two named objects. Report the class of the initial split object and of the training and test sets.
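A minimal sketch with {rsample}, assuming the data live in train.csv; the object names (d, splt, train, test) and the seed are placeholders:

```r
library(rsample)

d <- read.csv("train.csv")

set.seed(3000)                 # any seed; set for reproducibility
splt  <- initial_split(d)      # default prop = 3/4
train <- training(splt)
test  <- testing(splt)

class(splt)   # rsample split classes, e.g. "initial_split" "mc_split" "rsplit" (version-dependent)
class(train)  # "data.frame"
class(test)   # "data.frame"
```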
2. Use code to show the proportion of the train.csv data that went to each of the training and test sets.
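One way to show this, assuming the objects from the sketch above:

```r
# Proportion of rows from train.csv that landed in each set
nrow(train) / nrow(d)  # ~0.75 under the default prop = 3/4
nrow(test)  / nrow(d)  # ~0.25
```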
3. Use 10-fold cross-validation (k-fold CV with k = 10) to resample the training data.
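A sketch with rsample::vfold_cv(), assuming the training set is named train:

```r
set.seed(3000)
cv <- vfold_cv(train, v = 10)  # 10 folds
cv
```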
4. Use {purrr} to add the following columns to your k-fold CV object (a sketch follows this list):
analysis_n = the n of the analysis set for each fold
assessment_n = the n of the assessment set for each fold
analysis_p = the proportion of the analysis set for each fold
assessment_p = the proportion of the assessment set for each fold
sped_p = the proportion of students receiving special education services (sp_ed_fg) in the analysis and assessment sets for each fold
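A sketch using {purrr} and {dplyr} on the cv object above; treating "Y" as the sp_ed_fg level that marks special education services is an assumption about the coding:

```r
library(dplyr)
library(purrr)
library(rsample)

cv <- cv %>%
  mutate(
    analysis_n   = map_int(splits, ~ nrow(analysis(.x))),
    assessment_n = map_int(splits, ~ nrow(assessment(.x))),
    analysis_p   = analysis_n / (analysis_n + assessment_n),
    assessment_p = assessment_n / (analysis_n + assessment_n),
    # "Y" as the level indicating special education services is an assumption
    sped_p_analysis   = map_dbl(splits, ~ mean(analysis(.x)$sp_ed_fg == "Y", na.rm = TRUE)),
    sped_p_assessment = map_dbl(splits, ~ mean(assessment(.x)$sp_ed_fg == "Y", na.rm = TRUE))
  )
```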
5. Please demonstrate that there are no common values in the id columns of the assessment data between Fold01 & Fold02, and between Fold09 & Fold10 (of your 10-fold cross-validation object).
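One way to check this, assuming the data contain an id column:

```r
# Assessment-set ids per fold; `id` as the identifier column is an assumption
f01 <- assessment(cv$splits[[1]])$id
f02 <- assessment(cv$splits[[2]])$id
f09 <- assessment(cv$splits[[9]])$id
f10 <- assessment(cv$splits[[10]])$id

intersect(f01, f02)  # expect an empty vector: v-fold assessment sets are disjoint
intersect(f09, f10)  # expect an empty vector
```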
6. Try to answer the next two questions without running similar code on real data.
For the following code vfold_cv(fictional_train, v = 20):
What is the proportion in the analysis set for each fold?
What is the proportion in the assessment set for each fold?
7. Use Monte Carlo CV to resample the training data with 20 resamples, reserving .30 of each resample for the assessment set.
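A sketch with rsample::mc_cv(); note that prop sets the analysis share, so reserving .30 for assessment means prop = 0.70:

```r
set.seed(3000)
mc <- mc_cv(train, prop = 0.70, times = 20)
mc
```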
8. Please demonstrate that there are common values in the id columns of the assessment data between Resample 8 & Resample 12, and between Resample 2 & Resample 20, in your MC CV object.
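A check in the same style as step 5, again assuming an id column:

```r
# `id` as the identifier column is an assumption
r08 <- assessment(mc$splits[[8]])$id
r12 <- assessment(mc$splits[[12]])$id
r02 <- assessment(mc$splits[[2]])$id
r20 <- assessment(mc$splits[[20]])$id

length(intersect(r08, r12))  # expect > 0: MC CV assessment sets can overlap
length(intersect(r02, r20))  # expect > 0
```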
9. You plan to do bootstrap resampling with a training set of n = 500 (see the sketch after these questions).
What is the sample size of an analysis set for a given bootstrap resample?
What is the sample size of an assessment set for a given bootstrap resample?
If each row were selected only once for an analysis set:
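After reasoning through the answers, a sketch with a hypothetical 500-row data frame can confirm them; fake_train and the seed are placeholders:

```r
library(rsample)

set.seed(3000)
fake_train <- data.frame(id = 1:500, x = rnorm(500))  # hypothetical data

bt <- bootstraps(fake_train, times = 10)

# Each analysis set is sampled with replacement, so it always has n = 500 rows;
# the out-of-bag assessment set varies in size, averaging about 500 * (1 - 1/e) ≈ 184
nrow(analysis(bt$splits[[1]]))
nrow(assessment(bt$splits[[1]]))
```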