With this exercise, you can learn more about statistics by applying hypothesis tests, estimating effect sizes, and determining confidence intervals for a machine learning experiment.
Data and Libraries
Your task in this exercise is to apply statistical tests to compare the performance of two classification models. You can find everything you need in sklearn and scipy.stats. For this exercise, we use the Iris data, which is available in sklearn.
Repeated Training with Repeated Sampling
Train a 5-nearest neighbor classifier and a random forest classifier (100 estimators) on the Iris data using 5 different randomized train/test splits with 50% of the data as training data. Calculate the Matthews Correlation Coefficient (MCC) for each of these classifiers and create two arrays: one with the MCC values of the nearest neighbor classifier and one with those of the random forest.
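As a reminder, for a binary classification problem the MCC is defined in terms of the confusion matrix as
$$\mathit{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}.$$
Since the Iris data has three classes, matthews_corrcoef in sklearn uses the multiclass generalization of this coefficient; the interpretation as a correlation between predicted and true labels stays the same.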
Statistical Comparison
Compare the summary statistics (mean, standard deviation, median, min, and max) of the MCC estimates of both classifiers. Use statistical tests to determine whether there is a statistically significant difference between the classifiers. If the difference is significant, use Cohen's $d$ to estimate the effect size. Moreover, calculate the confidence interval for the mean MCC to estimate the reliability of your performance estimation.
Repeat all of the above with 10, 50, and 100 train/test splits. How do the results depend on the number of repetitions?
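The solution below uses the pooled standard deviation for Cohen's $d$ and a normal approximation for the confidence interval of the mean:
$$d = \frac{\bar{x}_1 - \bar{x}_2}{s_p}, \qquad s_p = \sqrt{\frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1 + n_2 - 2}},$$
$$\mathit{CI}_{1-\alpha} = \left[\bar{x} - z_{1-\alpha/2}\,\frac{s}{\sqrt{n}},\; \bar{x} + z_{1-\alpha/2}\,\frac{s}{\sqrt{n}}\right],$$
where $\bar{x}$, $s$, and $n$ are the mean, standard deviation, and number of the MCC estimates of a classifier.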
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import matthews_corrcoef
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from scipy import stats
from scipy.stats import norm
from statistics import mean, stdev
from math import sqrt
from textwrap import TextWrapper

# helper for printing long messages with a fixed line width
wrapper = TextWrapper(width=65)

def wrap_print(string):
    print('\n'.join(wrapper.wrap(string)))

def calc_ci(mu, sd, n, cl):
    # two-sided confidence interval for the mean at confidence level cl,
    # based on the normal approximation; norm.ppf((1-cl)/2) is negative,
    # so (mu + dev, mu - dev) are the lower and upper bounds
    dev = norm.ppf((1 - cl) / 2) * sd / sqrt(n)
    return mu + dev, mu - dev

X, Y = load_iris(return_X_y=True)

knn = KNeighborsClassifier(n_neighbors=5)
rf = RandomForestClassifier(n_estimators=100)
for repetitions in [5, 10, 50, 100]:
    wrap_print("----------------------------------------------------------------")
    wrap_print(
        "estimating performances of KNN and random forest with %i train/test splits" % repetitions)

    # collect one MCC estimate per train/test split for each classifier
    scores_rf = []
    scores_knn = []
    for i in range(repetitions):
        # repeated randomized train/test split with 50% training data
        X_train, X_test, Y_train, Y_test = train_test_split(
            X, Y, test_size=0.50, stratify=Y)
        rf.fit(X_train, Y_train)
        Y_pred = rf.predict(X_test)
        scores_rf.append(matthews_corrcoef(Y_test, Y_pred))
        knn.fit(X_train, Y_train)
        Y_pred = knn.predict(X_test)
        scores_knn.append(matthews_corrcoef(Y_test, Y_pred))

    s1 = scores_rf
    s2 = scores_knn

    # 95% confidence intervals for the mean MCC of both classifiers
    s1_lower, s1_upper = calc_ci(np.mean(s1), np.std(s1), len(s1), 0.95)
    s2_lower, s2_upper = calc_ci(np.mean(s2), np.std(s2), len(s2), 0.95)
    # summary statistics for both classifiers
    print()
    print("Random Forest: mean   = %.4f [%.4f,%.4f]" % (
        np.mean(s1), s1_lower, s1_upper))
    print("               sd     = %.4f" % np.std(s1))
    print("               median = %.4f" % np.median(s1))
    print("               min    = %.4f" % np.min(s1))
    print("               max    = %.4f" % np.max(s1))
    print("KNN:           mean   = %.4f [%.4f,%.4f]" % (
        np.mean(s2), s2_lower, s2_upper))
    print("               sd     = %.4f" % np.std(s2))
    print("               median = %.4f" % np.median(s2))
    print("               min    = %.4f" % np.min(s2))
    print("               max    = %.4f" % np.max(s2))
    print()

    # check normality of both samples with the Shapiro-Wilk test
    alpha = 0.005
    pval_shapiro1 = stats.shapiro(s1)[1]
    pval_shapiro2 = stats.shapiro(s2)[1]
    print('p-value of Shapiro-Wilk test for random forest data: %.4f' %
          pval_shapiro1)
    if pval_shapiro1 > alpha:
        wrap_print('The test found that the data sample was normal, failing to reject the null hypothesis at significance level alpha=%.3f' % alpha)
    else:
        wrap_print(
            'The test found that the data sample was not normal, rejecting the null hypothesis at significance level alpha=%.3f' % alpha)
    print()
    print('p-value of Shapiro-Wilk test for KNN data: %.4f' % pval_shapiro2)
    if pval_shapiro2 > alpha:
        wrap_print('The test found that the data sample was normal, failing to reject the null hypothesis at significance level alpha=%.3f' % alpha)
    else:
        wrap_print(
            'The test found that the data sample was not normal, rejecting the null hypothesis at significance level alpha=%.3f' % alpha)
    print()

    # choose the hypothesis test depending on the normality of the samples
    if pval_shapiro1 > alpha and pval_shapiro2 > alpha:
        wrap_print("Both populations normal. Using Welch's t-test.")
        pval_equal = stats.ttest_ind(s1, s2, equal_var=False)[1]
        print()
        print("p-value of Welch's t-test: %f" % pval_equal)
    else:
        wrap_print(
            "At least one population not normal. Using Mann-Whitney U test.")
        pval_equal = stats.mannwhitneyu(s1, s2, alternative='two-sided')[1]
        print()
        print("p-value of Mann-Whitney U test: %f" % pval_equal)
    if pval_equal > alpha:
        wrap_print('The test found that the population means are equal, failing to reject the null hypothesis at significance level alpha=%.3f' % alpha)
    else:
        wrap_print('The test found that the population means are not equal, rejecting the null hypothesis at significance level alpha=%.3f' % alpha)
    # effect size: Cohen's d with the pooled standard deviation
    s = sqrt(((len(s1) - 1) * stdev(s1)**2 + (len(s2) - 1)
              * stdev(s2)**2) / (len(s1) + len(s2) - 2))
    cohens_d = (mean(s1) - mean(s2)) / s
    # interpret the magnitude of the effect size
    if abs(cohens_d) < 0.01:
        effsizestr = "negligible"
    elif abs(cohens_d) < 0.2:
        effsizestr = "very small"
    elif abs(cohens_d) < 0.5:
        effsizestr = "small"
    elif abs(cohens_d) < 0.8:
        effsizestr = "medium"
    elif abs(cohens_d) < 1.2:
        effsizestr = "large"
    elif abs(cohens_d) < 2:
        effsizestr = "very large"
    else:
        effsizestr = "huge"
    wrap_print("Effect size (Cohen's d): %.3f - %s effect" %
               (cohens_d, effsizestr))
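Note that the confidence interval above relies on the normal approximation of the sampling distribution of the mean. Especially with only five repetitions, an interval based on the $t$-distribution would be somewhat wider and arguably more appropriate. A minimal sketch of this variant, using illustrative MCC values in place of the scores collected by the script above, could look like this:
from scipy import stats
import numpy as np

# illustrative MCC values; in the script above, this would be one of the
# score lists (e.g., scores_rf) for a given number of repetitions
s1 = [0.91, 0.93, 0.90, 0.94, 0.92]

# 95% confidence interval for the mean based on the t-distribution
lower, upper = stats.t.interval(0.95, df=len(s1) - 1,
                                loc=np.mean(s1), scale=stats.sem(s1))
print("t-based 95%% CI: [%.4f, %.4f]" % (lower, upper))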
We observe that while the summary statistics are fairly stable, the outcomes of the statistical tests change with an increasing number of iterations. With five and ten iterations, we do not find a statistically significant difference. We also observe that the confidence interval is fairly large, with a range of almost 5%. Once the number of repetitions increases to 50 and 100, the difference between the two algorithms becomes significant. Similarly, we see that the confidence interval shrinks with an increasing number of repetitions, to a range of less than 1% with 100 samples.
This demonstrates the most important aspect of any data analysis: never trust results based on very few samples.
Regarding the effect size, we observe that Cohen's $d$ can also be misleading. The effect size is medium with $d \approx 0.7$. However, the actual difference between the mean values is less than 2%, i.e., the performance of both classifiers is very similar. The effect size is nevertheless relatively large, because the standard deviation is very small. This demonstrates that effect sizes, while useful, should be interpreted with care, especially if the standard deviation is very small.
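To make this concrete with illustrative numbers (not the measured values): a mean difference in MCC of only $0.015$ combined with a pooled standard deviation of $0.02$ already yields $d = 0.015 / 0.02 = 0.75$, a medium effect, even though both classifiers perform almost identically in absolute terms.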