In this exercise you will learn more about clustering: how to pick suitable parameters for different algorithms and how to interpret the results of clustering high-dimensional data.
Please note that the biggest challenge of this exercise is not to compute some clusters, but to determine whether these clusters are good and what each cluster represents.
Libraries and Data
We use the Boston house price data in this exercise. The data is available as part of `sklearn` for Python.
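A minimal sketch for loading the data might look like this (assuming `pandas` is available; the column name MEDV for the target follows the dataset description):

```python
from sklearn.datasets import load_boston
import pandas as pd

# Load the Boston house price data; features and target are
# returned separately, so we combine them into one data frame
# with fourteen columns.
boston = load_boston()
data = pd.DataFrame(boston.data, columns=boston.feature_names)
data['MEDV'] = boston.target
print(data.shape)  # (506, 14)
```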
Last week we explored the Boston data; this week we use it for clustering. You will apply both $k$-means clustering and hierarchical clustering to the Boston data using all fourteen columns. Functions for all clustering algorithms are available in `sklearn` for Python. If you experience problems, ensure that your `sklearn` version is at least 0.22, your `matplotlib` version is at least 3.0.1, and your `seaborn` version is at least 0.9.0.
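You can check the installed versions directly from Python, for example:

```python
# Print the installed versions to verify they meet the requirements.
import matplotlib
import seaborn
import sklearn

print(sklearn.__version__)     # should be >= 0.22
print(matplotlib.__version__)  # should be >= 3.0.1
print(seaborn.__version__)     # should be >= 0.9.0
```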
There are a couple of problems with clustering data like the Boston data that you will have to solve during this exercise:
- The different features of the data are on different scales, which influences the results.
- The data has fourteen dimensions, which makes visualizing the clusters difficult. You can try a dimensionality reduction technique like Principal Component Analysis (PCA) to reduce the data to two dimensions or use pair-wise plots. Both have advantages and drawbacks, which you should explore as part of this exercise (a sketch for scaling and a two-dimensional PCA follows this list).
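A possible sketch for both points, assuming the data frame `data` loaded above:

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize every column so that no feature dominates the
# distance computations purely because of its scale.
X_scaled = StandardScaler().fit_transform(data)

# Project the scaled data onto its first two principal components
# for plotting; note that the components can be hard to interpret.
X_2d = PCA(n_components=2).fit_transform(X_scaled)
```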
$k$-Means Clustering
Use $k$-means to cluster the data and find a suitable number of clusters $k$. Combine the knowledge you already have about the data, visualizations, and the within-cluster sum of squares (WSS) to determine a suitable number of clusters.
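One common way to compute the WSS for a range of values of $k$ is the `inertia_` attribute of a fitted `KMeans` object; a sketch, assuming the scaled data from above:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Fit k-means for k = 1..10 and record the within-cluster sum of
# squares (inertia_) to look for an "elbow" in the curve.
ks = range(1, 11)
wss = [KMeans(n_clusters=k, random_state=42).fit(X_scaled).inertia_
       for k in ks]

plt.plot(ks, wss, marker='o')
plt.xlabel('Number of clusters $k$')
plt.ylabel('Within-cluster sum of squares')
plt.show()
```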
Hierarchical Clustering
(Note: Hierarchical clustering is also known as agglomerative clustering and can be found under that name in `sklearn`. This task requires at least `sklearn` version 0.22, which is still under development (October 2019). You can find guidance on how to install packages in a Jupyter notebook here and regarding the development version of `sklearn` here.)
Use hierarchical clustering with single linkage to determine clusters within the housing data. Find a suitable cut-off for the clusters using a dendrogram.
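The dendrogram is often plotted with `scipy`'s hierarchy functions; a sketch, assuming the scaled data from above (the cut-off value is only a placeholder):

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Build the single-linkage hierarchy and plot the dendrogram to
# pick a cut-off visually.
Z = linkage(X_scaled, method='single')
dendrogram(Z)
plt.ylabel('Merge distance')
plt.show()

# Cut the hierarchy at a distance threshold; 2.0 is only a
# placeholder, choose the value based on your dendrogram.
labels = fcluster(Z, t=2.0, criterion='distance')
```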
Compare the Clustering Results
How are the clustering results different between the algorithms? Consider, e.g., the number of clusters, the shape of clusters, general problems with using the algorithms, and the insights you get from each algorithm.
You may also use this comparison to better understand the differences between the algorithms. For example, how are the results from EM clustering different from or similar to the results of $k$-means clustering? Is there a relationship between the WSS and the BIC? How are the mean values of EM related to the centroids of $k$-means? What is the relationship between the parameters of DBSCAN and the cut-off for the hierarchical clustering?