Cluster Analysis

With this exercise, you can learn more about clustering. You will learn how to pick suitable parameters for different algorithms and how to interpret the results of clustering high-dimensional data.

Please note that the biggest challenge of this exercise is not to determine some clusters, but to determine if these clusters are good and what each cluster represents.

Libraries and Data

We use the Boston house price data in this exercise. The data is available as part of sklearn for Python.

Last week we explored the Boston data; this week we use it for clustering. You will apply $k$-means, EM, DBSCAN, and hierarchical clustering to the Boston data using all fourteen columns. Functions for all clustering algorithms are available in sklearn for Python. If you experience problems, ensure that your sklearn version is at least 0.22, your matplotlib version is at least 3.0.1, and your seaborn version is at least 0.9.0.

There are a couple of problems with clustering data like the Boston data that you will have to solve during this exercise.

  • The different features of the data are on different scales, which influences the results.
  • The data has fourteen dimensions. This makes visualizing the clusters difficult. You can try a dimension reduction technique like Principal Component Analysis (PCA) to get only two dimensions, or use pairwise plots. Both have advantages and drawbacks, which you should explore as part of this exercise.
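
Both problems above can be approached with a few lines of sklearn. The following is a minimal sketch, not a reference solution. It assumes a synthetic stand-in matrix `X` with fourteen columns on very different scales (the `load_boston` loader was removed from sklearn in version 1.2, so the sketch does not depend on it); replace `X` with your own feature matrix. The variable names are chosen for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the housing data: 200 rows, 14 columns
# whose scales range from about 0.1 to about 1000.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 14)) * np.logspace(-1, 3, 14)

# Standardize every column to zero mean and unit variance so that
# no single feature dominates the distance computations.
X_scaled = StandardScaler().fit_transform(X)

# Project the scaled data onto the first two principal components.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

print(X_2d.shape)
# Share of the total variance that survives the projection to 2D.
print(pca.explained_variance_ratio_.sum())
```

The explained variance ratio gives a rough idea of how much structure a two-dimensional scatter plot can show. If it is low, clusters that look overlapping in the PCA plot may still be well separated in the full feature space.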

$k$-Means Clustering

Use $k$-Means to cluster the data and find a suitable number of clusters $k$. Use a combination of knowledge you already have about the data, visualizations, and the within-cluster sum of squares (WSS) to determine a suitable number of clusters.
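
One possible way to collect the WSS for several values of $k$ is sketched below. It assumes synthetic data from `make_blobs` as a stand-in for the scaled housing data, and an arbitrary candidate range of 1 to 8 clusters. In sklearn, the WSS of a fitted `KMeans` model is available as the `inertia_` attribute.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 300 points, 14 features.
X, _ = make_blobs(n_samples=300, n_features=14, centers=4, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# Fit k-means for each candidate k and record the WSS (inertia_).
wss = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    wss[k] = model.inertia_

for k, value in wss.items():
    print(k, round(value, 1))
```

Plotting `wss` against $k$ gives the usual elbow plot. The WSS keeps shrinking as $k$ grows, so the point of interest is where the decrease flattens, not the minimum.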

EM Clustering

(Note: EM clustering is also known as Gaussian Mixture Models and can be found in the mixture package of sklearn.)

Use the EM algorithm to determine multivariate clusters in the data. Determine a suitable number of clusters using the Bayesian Information Criterion (BIC).
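
A minimal sketch of a BIC-based model selection follows. It assumes synthetic stand-in data and an arbitrary candidate range of 1 to 5 components; `GaussianMixture.bic` returns the BIC of the fitted model on the given data, where lower values are better.

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 600 points, 14 features.
X, _ = make_blobs(n_samples=600, n_features=14, centers=3, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# Fit one mixture model per candidate number of components
# and record its BIC.
bic = {}
for k in range(1, 6):
    gmm = GaussianMixture(n_components=k, covariance_type="full", random_state=0)
    bic[k] = gmm.fit(X_scaled).bic(X_scaled)

# The candidate with the lowest BIC.
best_k = min(bic, key=bic.get)
print(best_k)
```

With full covariance matrices, the number of parameters grows quickly with the dimension, so the BIC penalty can favor fewer components than the WSS suggests for $k$-Means.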

DBSCAN Clustering

Use DBSCAN to cluster the data and find suitable values for $\epsilon$ and $minPts$. Use a combination of knowledge you already have about the data and visualizations.
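
A common aid for choosing $\epsilon$ is the sorted distance of every point to its $minPts$-th nearest neighbor. The sketch below computes these distances on synthetic stand-in data. The value `min_pts = 5` and the use of the 90th percentile as $\epsilon$ are assumptions made for illustration; in the exercise you would normally plot `k_dist` and read off the knee by eye.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 300 points, 14 features.
X, _ = make_blobs(n_samples=300, n_features=14, centers=3, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

min_pts = 5  # assumed starting value, not a recommendation

# Distance from each point to its min_pts-th nearest neighbor.
# The query point itself counts as the first neighbor (distance 0),
# which matches how DBSCAN's min_samples counts the point itself.
nn = NearestNeighbors(n_neighbors=min_pts).fit(X_scaled)
distances, _ = nn.kneighbors(X_scaled)
k_dist = np.sort(distances[:, -1])

# Illustrative heuristic (assumption): take a high percentile
# instead of locating the knee of the sorted curve visually.
eps = float(np.percentile(k_dist, 90))

labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(X_scaled)
n_clusters = len(set(labels) - {-1})  # label -1 marks noise
n_noise = int((labels == -1).sum())
print(n_clusters, n_noise)
```

Every point whose entry in `k_dist` is at most `eps` is a core point, so the chosen percentile bounds the share of points that can end up as noise.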

Hierarchical Clustering

(Note: Hierarchical clustering is also known as agglomerative clustering and can be found under that name in sklearn. This task requires at least sklearn version 0.22, which is still under development (October 2019). You can find guidance on how to install packages in Jupyter notebook here and regarding the development version of sklearn here.)

Use hierarchical clustering with single linkage to determine clusters within the housing data. Find a suitable cut-off for the clusters using a dendrogram.
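
The sketch below shows one way to build a single-linkage hierarchy and cut it at a distance threshold. It uses `scipy.cluster.hierarchy` instead of sklearn, on synthetic stand-in data, and it picks the cut-off in the middle of the largest gap between consecutive merge heights. That rule is an assumption for illustration; the task asks you to choose the cut-off by inspecting the dendrogram.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 60 points, 14 features.
X, _ = make_blobs(n_samples=60, n_features=14, centers=3, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# Single-linkage merge tree; column 2 holds the merge heights.
Z = linkage(X_scaled, method="single")
heights = Z[:, 2]

# Dendrogram layout without drawing; drop no_plot to draw it
# with matplotlib.
tree = dendrogram(Z, no_plot=True)

# Illustrative cut-off (assumption): midpoint of the largest gap
# between consecutive merge heights.
gap_index = int(np.argmax(np.diff(heights)))
cut = (heights[gap_index] + heights[gap_index + 1]) / 2

# All merges below the cut-off are applied; the rest are not.
labels = fcluster(Z, t=cut, criterion="distance")
n_clusters = len(set(labels))
print(n_clusters)
```

Single linkage tends to chain points together, so a cut like this can produce one large cluster plus several very small ones. Checking the cluster sizes is a useful sanity check.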

Compare the Clustering Results

How are the clustering results different between the algorithms? Consider, e.g., the number of clusters, the shape of clusters, general problems with using the algorithms, and the insights you get from each algorithm.

You may also use this to better understand the differences between the algorithms. For example, how are the results from EM clustering different from or similar to the results of the $k$-Means clustering? Is there a relationship between the WSS and the BIC? How are the mean values of EM related to the centroids of $k$-Means? What is the relationship between the parameters for DBSCAN and the cut-off for the hierarchical clustering?
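
Two simple tools for such comparisons are sketched below, again on synthetic stand-in data with an assumed three clusters. The adjusted Rand index measures how strongly two label assignments agree, independent of how the clusters are numbered, and the distance matrix shows how close each $k$-Means centroid is to the nearest EM mean. This is one possible starting point, not a complete comparison.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 300 points, 14 features.
X, _ = make_blobs(n_samples=300, n_features=14, centers=3, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled)
gmm = GaussianMixture(n_components=3, random_state=0).fit(X_scaled)
gmm_labels = gmm.predict(X_scaled)

# Agreement between the two label assignments (1.0 = identical
# partitions, values near 0 = agreement at chance level).
ari = adjusted_rand_score(kmeans.labels_, gmm_labels)

# Pairwise Euclidean distances between k-means centroids (rows)
# and EM component means (columns).
dists = np.linalg.norm(
    kmeans.cluster_centers_[:, None, :] - gmm.means_[None, :, :], axis=2
)
closest = dists.min(axis=1)

print(round(ari, 3))
print(np.round(closest, 3))
```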