In this exercise, you will learn more about statistics by applying hypothesis tests, estimating effect sizes, and determining confidence intervals for a machine learning experiment.
Data and Libraries
Your task in this exercise is to apply statistical tests to compare the performance of two classification models. You can find everything you need in sklearn and scipy.stats. For this exercise, we use the Iris data. The data is available in sklearn.
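As a starting point, the Iris data can be loaded as sketched below. The variable names X and y are only a convention chosen for this sketch.

```python
from sklearn.datasets import load_iris

# Load the features (150 samples x 4 measurements) and the class labels (3 species).
X, y = load_iris(return_X_y=True)

print(X.shape)  # (150, 4)
print(y.shape)  # (150,)
```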
Repeated Training with Repeated Sampling
Train a 5-nearest neighbor classifier and a random forest classifier (100 estimators) on the Iris data using 5 different randomized train/test splits, each with 50% of the data as training data. Calculate the Matthews correlation coefficient (MCC) of each classifier on every split and create two arrays: one with the MCC values of the nearest neighbor classifier and one with those of the random forest.
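The repeated training could be sketched roughly as follows. The helper name repeated_mcc and the seed parameter are hypothetical and only used for illustration. The sketch assumes that both classifiers are evaluated on the same splits and that a different random_state per iteration is an acceptable way to randomize the splits; neither is prescribed by the task.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def repeated_mcc(n_splits=5, seed=0):
    """Return two arrays of MCC values (k-NN, random forest), one entry per split."""
    X, y = load_iris(return_X_y=True)
    mcc_knn, mcc_rf = [], []
    for i in range(n_splits):
        # Assumption: a different random_state per iteration yields a different split.
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, train_size=0.5, random_state=seed + i
        )
        knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
        rf = RandomForestClassifier(n_estimators=100, random_state=seed + i)
        rf.fit(X_train, y_train)
        # Both classifiers are scored on the same held-out half of the data.
        mcc_knn.append(matthews_corrcoef(y_test, knn.predict(X_test)))
        mcc_rf.append(matthews_corrcoef(y_test, rf.predict(X_test)))
    return np.array(mcc_knn), np.array(mcc_rf)


mcc_knn, mcc_rf = repeated_mcc(n_splits=5)
print("k-NN MCC:", mcc_knn)
print("RF MCC:  ", mcc_rf)
```

Passing a different n_splits reuses the same function for a different number of repetitions.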
Statistical Comparison
Compare the summary statistics mean, standard deviation, median, minimum, and maximum of the MCC estimates of both classifiers. Use statistical tests to determine if there is a statistically significant difference between the classifiers. If the difference is significant, use Cohen's $d$ to estimate the effect size. Moreover, calculate the confidence interval for the mean MCC of each classifier to estimate the reliability of your performance estimation.
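One possible workflow for the comparison is sketched below. It is not the only valid choice of tests: the sketch assumes a Shapiro-Wilk check for normality, followed by Welch's t-test if both samples look normal and the Mann-Whitney U test otherwise. Cohen's $d$ is computed with the pooled standard deviation, and the confidence interval is based on the t-distribution. All function names are hypothetical, and the two input arrays are synthetic placeholders, not real MCC results.

```python
import numpy as np
from scipy import stats


def summarize(values):
    """Summary statistics of a 1-D array of scores."""
    return {
        "mean": np.mean(values),
        "std": np.std(values, ddof=1),  # sample standard deviation
        "median": np.median(values),
        "min": np.min(values),
        "max": np.max(values),
    }


def significance_test(a, b, alpha=0.05):
    """Assumed workflow: Welch's t-test if both samples pass Shapiro-Wilk, else Mann-Whitney U."""
    _, p_norm_a = stats.shapiro(a)
    _, p_norm_b = stats.shapiro(b)
    if p_norm_a > alpha and p_norm_b > alpha:
        _, p_value = stats.ttest_ind(a, b, equal_var=False)
        return "Welch's t-test", p_value
    _, p_value = stats.mannwhitneyu(a, b, alternative="two-sided")
    return "Mann-Whitney U", p_value


def cohens_d(a, b):
    """Cohen's d using the pooled sample standard deviation."""
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * np.var(a, ddof=1) + (n_b - 1) * np.var(b, ddof=1)) / (n_a + n_b - 2)
    return (np.mean(a) - np.mean(b)) / np.sqrt(pooled_var)


def mean_confidence_interval(values, confidence=0.95):
    """t-based confidence interval for the mean."""
    return stats.t.interval(
        confidence, df=len(values) - 1, loc=np.mean(values), scale=stats.sem(values)
    )


# Synthetic placeholder values, NOT real results; replace with your own MCC arrays.
rng = np.random.default_rng(0)
mcc_a = np.clip(rng.normal(0.95, 0.02, size=5), -1, 1)
mcc_b = np.clip(rng.normal(0.93, 0.02, size=5), -1, 1)

print(summarize(mcc_a))
print(summarize(mcc_b))
test_name, p_value = significance_test(mcc_a, mcc_b)
print(test_name, "p-value:", p_value)
if p_value < 0.05:
    print("Cohen's d:", cohens_d(mcc_a, mcc_b))
print("95% CI of mean (a):", mean_confidence_interval(mcc_a))
print("95% CI of mean (b):", mean_confidence_interval(mcc_b))
```

If both classifiers were evaluated on the same splits, a paired test such as scipy.stats.ttest_rel or scipy.stats.wilcoxon may be the more appropriate choice.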
Repeat all of the above with 10, 50, and 100 train/test splits. How do the results depend on the number of repetitions?
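To get a feeling for what to expect, the following sketch shows how the half-width of a t-based confidence interval for the mean, $t_{1-\alpha/2,\,n-1} \cdot s / \sqrt{n}$, scales with the number of repetitions $n$. It assumes that the sample standard deviation $s$ stays roughly the same as $n$ grows, which need not hold exactly for your results. The function name ci_half_width_factor is hypothetical.

```python
import numpy as np
from scipy import stats


def ci_half_width_factor(n, confidence=0.95):
    """Half-width of the t-based CI for a mean, in units of the sample standard deviation s."""
    t_crit = stats.t.ppf(0.5 + confidence / 2, df=n - 1)
    return t_crit / np.sqrt(n)


factors = {n: ci_half_width_factor(n) for n in (5, 10, 50, 100)}
for n, factor in factors.items():
    print(f"n = {n:3d}: half-width = {factor:.3f} * s")
```

The factor shrinks as the number of repetitions grows, so the mean is estimated more precisely. For the same reason, statistical tests tend to flag ever smaller differences as significant with more repetitions, which is why the effect size should be reported alongside the p-value.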