In this exercise, you will learn more about classification: you will apply different algorithms to a data set and compare the performance of the classifiers using several performance metrics.
Please note that the biggest challenge of this exercise is selecting good hyperparameters for the algorithms, e.g., tree depths or activation functions; the performance of the algorithms strongly depends on them. In the bonus task, you will see that this can quickly consume huge amounts of computational capacity.
Libraries and Data
Your task in this exercise is straightforward: apply different classification algorithms to a data set, evaluate the results, and determine the best algorithm. You can find everything you need in sklearn. In this exercise, we use data about the dominant types of trees in forests.
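The description matches sklearn's built-in forest cover type data; a minimal loading sketch, assuming fetch_covtype is the intended source:

```python
from sklearn.datasets import fetch_covtype

# Covertype data: cartographic features; the labels are 7 dominant tree types
covtype = fetch_covtype()
X, y = covtype.data, covtype.target
print(X.shape)  # (581012, 54)
```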
Training and test data
Before you can start building classifiers, you need to separate the data into training and test data. Because the data is quite large, please use 5% of the data for training and 95% for testing. Because you are selecting such a small subset, it can easily happen that the classes are not represented equally in the training and test data. Use stratified sampling to avoid this.
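A sketch of the split with train_test_split, using the stratify argument to preserve the class proportions in both parts (X and y as loaded above):

```python
from sklearn.model_selection import train_test_split

# 5% training, 95% test; stratify=y keeps the class proportions
# the same in the training and test data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.05, test_size=0.95, stratify=y, random_state=42
)
```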
Train, Test, Evaluate
Now that training and test data are available, you can try out the classifiers from Chapter 7. You will notice that some classifiers take a long time to train and may, therefore, not be suitable for the analysis of this data set.
Try to find a classifier that works well with the data. On this data, this means two things:
- Training and prediction in an acceptable amount of time. On this exercise sheet, use "less than 10 minutes" as the definition of acceptable.
- Good prediction performance as measured with MCC, recall, precision, and F-measure (see the sketch after this list).
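As an example, a decision tree can be trained, timed, and evaluated with these metrics as sketched below. Since the labels are multiclass, recall, precision, and F-measure need an averaging strategy; macro averaging is one reasonable choice, not one mandated by the exercise:

```python
import time
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (matthews_corrcoef, recall_score,
                             precision_score, f1_score)

start = time.time()
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(f"time: {time.time() - start:.1f}s")  # check the 10-minute budget

print("MCC:      ", matthews_corrcoef(y_test, y_pred))
print("Recall:   ", recall_score(y_test, y_pred, average='macro'))
print("Precision:", precision_score(y_test, y_pred, average='macro'))
print("F1:       ", f1_score(y_test, y_pred, average='macro'))
```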
The different classifiers have different parameters, also known as hyperparameters, e.g., the depth of a tree or the number of trees used by a random forest. Try to find good parameters to improve the results.
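A minimal sketch of such manual tuning, here varying only the tree depth; the candidate values are illustrative, not a recommendation:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import matthews_corrcoef

# Compare a few tree depths; None means the tree grows until pure leaves
for max_depth in [5, 10, 20, None]:
    clf = DecisionTreeClassifier(max_depth=max_depth, random_state=42)
    clf.fit(X_train, y_train)
    mcc = matthews_corrcoef(y_test, clf.predict(X_test))
    print(f"max_depth={max_depth}: MCC={mcc:.3f}")
```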
Bonus Task (not discussed during the exercise; no sample solution is provided)
Besides manual trial and error, you can also tune your hyperparameters automatically if you have a training, a validation, and a test set. This is supported directly by sklearn. You may use this to try out how such automated tuning affects your results. But beware: this can easily consume large amounts of computational capacity!
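One way to do this in sklearn is GridSearchCV, which replaces the separate validation set with cross-validation on the training data; a sketch with a deliberately small, illustrative grid:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small illustrative grid; the cost grows multiplicatively
# with the number of candidate values per parameter
param_grid = {
    'n_estimators': [10, 50, 100],
    'max_depth': [10, 20, None],
}
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, scoring='f1_macro', cv=3, n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```

The best configuration found this way should still be evaluated on the held-out test data, since the cross-validation score is computed on the training data only.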