With this exercise, you can learn more about statistics by applying hypothesis tests, estimating effect sizes, and determining confidence intervals for a machine learning experiment.
Data and Libraries
Your task in this exercise is to apply statistical tests to compare the performance of two classification models. You can find everything you need in sklearn and scipy.stats. For this exercise, we use the Iris data, which is available in sklearn.
Repeated Training with Repeated Sampling
Train a 5-nearest neighbor classifier and a random forest classifier (100 estimators) on the Iris data using 5 different randomized train/test splits with 50% of the data as training data. Calculate the Matthews Correlation Coefficient (MCC) for each of these classifiers and create two arrays: one with the MCC values of the nearest neighbor classifier and one with those of the random forest.
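As a reminder, for a binary classification problem the MCC is defined in terms of the confusion matrix as
$$\mathit{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}.$$
Since the Iris data has three classes, matthews_corrcoef in sklearn uses the multiclass generalization of this coefficient; the interpretation as a correlation between predicted and true labels stays the same.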
Statistical Comparison
Compare the summary statistics (mean, standard deviation, median, min, and max) of the MCC estimates of both classifiers. Use statistical tests to determine whether there is a statistically significant difference between the classifiers. If the difference is significant, use Cohen's $d$ to estimate the effect size. Moreover, calculate the confidence interval for the mean MCC to estimate the reliability of your performance estimation.
Repeat all of the above with 10, 50, and 100 train/test splits. How do the results depend on the number of repetitions?
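The solution below uses the pooled standard deviation for Cohen's $d$ and a normal approximation for the confidence interval of the mean:
$$d = \frac{\bar{x}_1 - \bar{x}_2}{s_p}, \qquad s_p = \sqrt{\frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1 + n_2 - 2}},$$
$$\mathit{CI}_{1-\alpha} = \left[\bar{x} - z_{1-\alpha/2}\,\frac{s}{\sqrt{n}},\; \bar{x} + z_{1-\alpha/2}\,\frac{s}{\sqrt{n}}\right],$$
where $\bar{x}$, $s$, and $n$ are the mean, standard deviation, and number of the MCC estimates of a classifier.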
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import matthews_corrcoef
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from scipy import stats
from scipy.stats import norm
from statistics import mean, stdev
from math import sqrt
from textwrap import TextWrapper

# helper for printing long messages with a fixed line width
wrapper = TextWrapper(width=65)

def wrap_print(string):
    print('\n'.join(wrapper.wrap(string)))

def calc_ci(mu, sd, n, cl):
    # two-sided confidence interval for the mean at confidence level cl,
    # based on the normal approximation; norm.ppf((1-cl)/2) is negative,
    # so (mu + dev, mu - dev) are the lower and upper bounds
    dev = norm.ppf((1 - cl) / 2) * sd / sqrt(n)
    return mu + dev, mu - dev

X, Y = load_iris(return_X_y=True)

knn = KNeighborsClassifier(n_neighbors=5)
rf = RandomForestClassifier(n_estimators=100)
for repetitions in [5, 10, 50, 100]:
    wrap_print("----------------------------------------------------------------")
    wrap_print(
        "estimating performances of KNN and random forest with %i train/test splits" % repetitions)

    # collect one MCC estimate per train/test split for each classifier
    scores_rf = []
    scores_knn = []
    for i in range(repetitions):
        # repeated randomized train/test split with 50% training data
        X_train, X_test, Y_train, Y_test = train_test_split(
            X, Y, test_size=0.50, stratify=Y)
        rf.fit(X_train, Y_train)
        Y_pred = rf.predict(X_test)
        scores_rf.append(matthews_corrcoef(Y_test, Y_pred))
        knn.fit(X_train, Y_train)
        Y_pred = knn.predict(X_test)
        scores_knn.append(matthews_corrcoef(Y_test, Y_pred))

    s1 = scores_rf
    s2 = scores_knn

    # 95% confidence intervals for the mean MCC of both classifiers
    s1_lower, s1_upper = calc_ci(np.mean(s1), np.std(s1), len(s1), 0.95)
    s2_lower, s2_upper = calc_ci(np.mean(s2), np.std(s2), len(s2), 0.95)
    # summary statistics for both classifiers
    print()
    print("Random Forest: mean   = %.4f [%.4f,%.4f]" % (
        np.mean(s1), s1_lower, s1_upper))
    print("               sd     = %.4f" % np.std(s1))
    print("               median = %.4f" % np.median(s1))
    print("               min    = %.4f" % np.min(s1))
    print("               max    = %.4f" % np.max(s1))
    print("KNN:           mean   = %.4f [%.4f,%.4f]" % (
        np.mean(s2), s2_lower, s2_upper))
    print("               sd     = %.4f" % np.std(s2))
    print("               median = %.4f" % np.median(s2))
    print("               min    = %.4f" % np.min(s2))
    print("               max    = %.4f" % np.max(s2))
    print()

    # check normality of both samples with the Shapiro-Wilk test
    alpha = 0.005
    pval_shapiro1 = stats.shapiro(s1)[1]
    pval_shapiro2 = stats.shapiro(s2)[1]
    print('p-value of Shapiro-Wilk test for random forest data: %.4f' %
          pval_shapiro1)
    if pval_shapiro1 > alpha:
        wrap_print('The test found that the data sample was normal, failing to reject the null hypothesis at significance level alpha=%.3f' % alpha)
    else:
        wrap_print(
            'The test found that the data sample was not normal, rejecting the null hypothesis at significance level alpha=%.3f' % alpha)
    print()
    print('p-value of Shapiro-Wilk test for KNN data: %.4f' % pval_shapiro2)
    if pval_shapiro2 > alpha:
        wrap_print('The test found that the data sample was normal, failing to reject the null hypothesis at significance level alpha=%.3f' % alpha)
    else:
        wrap_print(
            'The test found that the data sample was not normal, rejecting the null hypothesis at significance level alpha=%.3f' % alpha)
    print()

    # choose the hypothesis test depending on the normality of the samples
    if pval_shapiro1 > alpha and pval_shapiro2 > alpha:
        wrap_print("Both populations normal. Using Welch's t-test.")
        pval_equal = stats.ttest_ind(s1, s2, equal_var=False)[1]
        print()
        print("p-value of Welch's t-test: %f" % pval_equal)
    else:
        wrap_print(
            "At least one population not normal. Using Mann-Whitney U test.")
        pval_equal = stats.mannwhitneyu(s1, s2, alternative='two-sided')[1]
        print()
        print("p-value of Mann-Whitney U test: %f" % pval_equal)
    if pval_equal > alpha:
        wrap_print('The test found that the population means are equal, failing to reject the null hypothesis at significance level alpha=%.3f' % alpha)
    else:
        wrap_print('The test found that the population means are not equal, rejecting the null hypothesis at significance level alpha=%.3f' % alpha)
    # effect size: Cohen's d with the pooled standard deviation
    s = sqrt(((len(s1) - 1) * stdev(s1)**2 + (len(s2) - 1)
              * stdev(s2)**2) / (len(s1) + len(s2) - 2))
    cohens_d = (mean(s1) - mean(s2)) / s
    # interpret the magnitude of the effect size
    if abs(cohens_d) < 0.01:
        effsizestr = "negligible"
    elif abs(cohens_d) < 0.2:
        effsizestr = "very small"
    elif abs(cohens_d) < 0.5:
        effsizestr = "small"
    elif abs(cohens_d) < 0.8:
        effsizestr = "medium"
    elif abs(cohens_d) < 1.2:
        effsizestr = "large"
    elif abs(cohens_d) < 2:
        effsizestr = "very large"
    else:
        effsizestr = "huge"
    wrap_print("Effect size (Cohen's d): %.3f - %s effect" %
               (cohens_d, effsizestr))
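Note that the confidence interval above relies on the normal approximation of the sampling distribution of the mean. Especially with only five repetitions, an interval based on the $t$-distribution would be somewhat wider and arguably more appropriate. A minimal sketch of this variant, using illustrative MCC values in place of the scores collected by the script above, could look like this:
from scipy import stats
import numpy as np

# illustrative MCC values; in the script above, this would be one of the
# score lists (e.g., scores_rf) for a given number of repetitions
s1 = [0.91, 0.93, 0.90, 0.94, 0.92]

# 95% confidence interval for the mean based on the t-distribution
lower, upper = stats.t.interval(0.95, df=len(s1) - 1,
                                loc=np.mean(s1), scale=stats.sem(s1))
print("t-based 95%% CI: [%.4f, %.4f]" % (lower, upper))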
We observe that while the summary statistics are fairly stable, the outcomes of the statistical tests change with an increasing number of iterations. With five and ten iterations, we do not find a statistically significant difference. We also observe that the confidence interval is fairly large, with a range of almost 5%. Once the number of repetitions increases to 50 and 100, the difference between the two algorithms becomes significant. Similarly, we see that the confidence interval shrinks with an increasing number of repetitions, to a range of less than 1% with 100 samples.
This demonstrates the most important aspect of any data analysis: never trust results based on very few samples.
Regarding the effect size, we observe that Cohen's $d$ can also be misleading. The effect size is medium with $d \approx 0.7$. However, the actual difference between the mean values is less than 2%, i.e., the performance of both classifiers is very similar. The effect size is nevertheless relatively large, because the standard deviation is very small. This demonstrates that effect sizes, while useful, should be interpreted with care, especially if the standard deviation is very small.
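To make this concrete with illustrative numbers (not the measured values): a mean difference in MCC of only $0.015$ combined with a pooled standard deviation of $0.02$ already yields $d = 0.015 / 0.02 = 0.75$, a medium effect, even though both classifiers perform almost identically in absolute terms.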