Which model would you prefer if the wine-tasting experts would like to gain some insights into the model?

Wine-Tasting Machine

In this assignment, we will practice building supervised machine learning with Logistic Regression (LR), Naïve Bayes (NB), Support Vector Machine (SVM), and Decision tree (DT), Random Forest (RF) classifiers, as compared with simple/baseline methods such as OneR and ZeroR. The data for this exercise comes from the wine industry.

Each record represents a sample of a specific wine product, the input attributes include its organoleptic characteristics, and the output denotes the quality class of each wine: {high, low}. The labels have been assigned by human wine-tasting experts, and we can treat that information as “ground truth” in this exercise.

Your job is to build the best model to predict wine quality from its characteristics, so that the winery could replace the costly services of professional sommeliers with your automated alternative, to enable quick and effective quality tracking of their wines at production facilities.

They need to know whether such change is feasible, and what extent of inaccuracies may be involved in using your tool.

You will be asked to run experiments in both WEKA and Python.

You are given two datasets red-wine.csv and white-wine.csv: Dataset folder

Deliverable:

A word doc with answers (including screenshots) to blackboard

Python Notebook to be uploaded to GitHub and shared the link in the above Google doc

WEKA Tasks (50 points)

Load red-wine.csv into WEKA (15 points)

Create a conditional distribution for each of the input variables with respect to output (click the “visualizing all” button, making sure you set the output correctly)

Comparing the plots for Sulphates and Alcohol, which one do you think is more predictive of the wine quality, and why?

Verify your answer by using a logistic regression model, is it consistent with your speculation in (b)? (hint: here you may use univariate logistic regression, the better performance, the more predictive a feature is. AUC is a good score for this purpose)

Fit a model using each of the following methods and report the performance metrics of 10-fold cross-validation using red-wine.csv as the training set (25 points)

Model ZeroR OneR LR NB DT SVM RF

 

AUC N/A N/A          
Accuracy              

 

Obtain the ROC curve for the best-performing model in terms of AUC score from the experiment above, paste a screenshot here and comment on its performance (5 points)

Using the best model obtained above in WEKA and run the model on white-wine.csv and report the AUC score, comment on the performance.  (hint: see WEKA reference section on how to get performance on an external independent test set) ( 5 points)

Python Tasks (50 points)

Submission:  Upload Python Notebook to GitHub links as homework 1

Read  red-wine.csv into Python as a data frame, use a pandas profiling tool (https://github.com/pandas-profiling/pandas-profiling) to create an HTML file, and paste a screenshot of the HTML file here (10 points)

Repeat the same experiments in WEKA Question 2,  and report the same metrics as in Question 2. To receive full credit, you will need to write a script to assemble the result as above in the form of Pandas data frame. Paste a screenshot of your result from your Python notebook here. Please make sure that there is a reasonable number of significant digits in reporting your output. (20 points)

Plot the ROC curve of the Random Forest classifier from the Python package, and paste a screenshot of your ROC curve here (10 points)

Using the best model obtained above in Q2 (python)  and running the model on white-wine.csv and reporting the AUC score, comment on the performance. (5 points)

Suppose all the models have comparable performance, which model would you prefer if the wine-tasting experts would like to gain some insights into the model? Note: there could be multiple model types fitting this criterion. (5 points)