Data Loading
Q1. Load the data from the source file and set up the target y and predictors X as expected by scikit-learn.
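A minimal sketch of the setup asked for in Q1, assuming the source is a CSV file with one row per flight. The file name "flights.csv" and the target column name DELAYED are hypothetical placeholders, not names given in the assignment, and the rows below are made up so the example runs without the real file.

```python
import io
import pandas as pd

# Stand-in for the real source file; in practice, pass the file path
# (e.g. "flights.csv", a hypothetical name) to pd.read_csv instead.
csv_text = io.StringIO(
    "CRS_DEP_TIME,DEP_TIME,DISTANCE,Weather,DELAYED\n"
    "1455,1455,184,0,0\n"
    "1640,1640,213,0,0\n"
    "1245,1245,229,0,0\n"
    "1715,1709,229,1,1\n"
)
df = pd.read_csv(csv_text)

TARGET = "DELAYED"  # hypothetical target column name

# scikit-learn expects a 2-D feature table X and a 1-D target vector y.
X = df.drop(columns=[TARGET])
y = df[TARGET]

print(X.shape, y.shape)
```

Keeping X as a DataFrame rather than a NumPy array preserves column names, which later preprocessing steps can select by name.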
Train-Test Split
Q2. Create an 80-20 train-test split of the data while maintaining the same target value proportion in both the training and testing partitions. Use the training partition for all subsequent analysis, and use the testing partition only at the last step, for final validation of the best model found.
Optional Step: Create some charts for data exploration to gain an understanding of the data with respect to the given prediction problem.
Explain your observations along with each chart.
Data Preprocessing & Feature Selection/Engineering
Q3. Set up a data preparation pipeline using scikit-learn to perform the following preprocessing steps.
A. If there are missing values in the data, take appropriate measures.
B. Select the features as follows:
The variable DEP_TIME cannot be used for predicting new flights. Why? Briefly explain.
Create a new categorical variable by binning the scheduled departure time into 2-hour bins.
Drop the original variable CRS_DEP_TIME from the data to be analyzed and keep the new categorical variable.
Drop the variables DISTANCE and FL_DATE from the data set. What might be the reasons for doing so? Provide a possible explanation.
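The binning step above could be sketched as follows. This assumes CRS_DEP_TIME is stored as an integer in HHMM form (for example, 1455 for 2:55 PM), which the assignment does not state; the helper name bin_departure_time and the new column name DEP_TIME_BIN are hypothetical.

```python
import pandas as pd

def bin_departure_time(crs_dep_time: pd.Series) -> pd.Series:
    """Map HHMM scheduled departure times (0-2359) to 2-hour bins."""
    hour = crs_dep_time // 100            # 1455 -> 14
    edges = list(range(0, 25, 2))         # 0, 2, 4, ..., 24
    labels = [f"{h:02d}-{h + 2:02d}" for h in edges[:-1]]
    # right=False makes each bin [start, end), so hour 14 falls in "14-16".
    return pd.cut(hour, bins=edges, labels=labels, right=False)

# Made-up values, for illustration only.
df = pd.DataFrame({"CRS_DEP_TIME": [600, 1455, 1559, 1600, 2359]})
df["DEP_TIME_BIN"] = bin_departure_time(df["CRS_DEP_TIME"])
df = df.drop(columns=["CRS_DEP_TIME"])   # keep only the binned variable

print(df["DEP_TIME_BIN"].tolist())
# ['06-08', '14-16', '14-16', '16-18', '22-24']
```

To run this inside a scikit-learn pipeline, a function like this one can be wrapped in sklearn.preprocessing.FunctionTransformer.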
Handle the following categorical variables in the data using the one-hot encoding approach. How many new variables would you get as a result of this one-hot encoding? Explain.
day of week
carrier
departure airport (origin)
arrival airport
scheduled (binned/categorical) departure time
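One way to one-hot encode these variables and count the resulting columns is sketched below. The column names (DAY_WEEK, CARRIER, ORIGIN, DEST, DEP_TIME_BIN) and the rows are assumptions made for illustration; the real data set may name and populate them differently, so the count printed here says nothing about the count for the actual data.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical column names and made-up rows, for illustration only.
train = pd.DataFrame({
    "DAY_WEEK": [1, 2, 3, 1],
    "CARRIER": ["DL", "US", "DL", "MQ"],
    "ORIGIN": ["DCA", "IAD", "BWI", "DCA"],
    "DEST": ["JFK", "LGA", "EWR", "JFK"],
    "DEP_TIME_BIN": ["06-08", "14-16", "14-16", "16-18"],
    "Weather": [0, 0, 1, 0],
})
categorical = ["DAY_WEEK", "CARRIER", "ORIGIN", "DEST", "DEP_TIME_BIN"]

prep = ColumnTransformer(
    transformers=[
        # handle_unknown="ignore" encodes categories unseen during fit as all zeros.
        ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical),
    ],
    remainder="passthrough",  # columns not listed above are passed through unchanged
)
X_train_prepared = prep.fit_transform(train)

# One new column per distinct category level found in the training data.
encoder = prep.named_transformers_["onehot"]
n_new = sum(len(levels) for levels in encoder.categories_)
print(n_new, X_train_prepared.shape)
```

Because the encoder learns its category levels from the data it is fit on, the number of new columns equals the total number of distinct levels across the encoded variables in the training partition.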
Weather is coded as 1 if there was a weather delay. Would you need to use one-hot encoding for this variable? Why or why not? Explain and take the appropriate action.
Using the pipeline, create a prepared training data set to be used for predictive modeling.
MIST.6160: Advanced Data Mining Copyright 2020 Prof. Amit V. Deokar. All rights reserved.
Model Training and Validation
Q5. Select any two classification algorithms listed in Q6 below and demonstrate how to find the “best” hyperparameters for each of these two models using grid search with 5-fold cross-validation, experimenting with 1-2 parameters in each case.
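A minimal sketch of grid search with 5-fold cross-validation, shown for one model. The decision tree, the two parameters searched, the accuracy metric, and the synthetic data are all assumptions made for illustration; the algorithm should be one of those listed in Q6, and the prepared flight data would replace the synthetic arrays. The stratified 80-20 split from Q2 is included so the search runs on the training partition only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared flight data (imbalanced binary target).
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.8, 0.2], random_state=0
)

# 80-20 split; stratify=y keeps the target proportion the same in both partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Two hyperparameters, three candidate values each: 9 combinations in total.
param_grid = {"max_depth": [2, 4, 8], "min_samples_leaf": [1, 5, 20]}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid=param_grid,
    cv=5,                 # 5-fold cross-validation on the training partition
    scoring="accuracy",   # assumed metric; choose one suited to the problem
)
search.fit(X_train, y_train)

print(search.best_params_)
print(round(search.best_score_, 3))
```

The testing partition (X_test, y_test) is deliberately left untouched here; it is reserved for the final validation of the best model. The same pattern applies to the second algorithm with its own param_grid.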