This exercise covers data exploration. Your task is to understand a data set from its description, its descriptive statistics, and its visualizations.
Data for this exercise
We use the Boston house price data in this exercise. The data is available as part of sklearn for Python. The description of the data is provided together with the actual data and should be the starting point for your analysis of the data.
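As a starting point, the following is a minimal sketch of how the data and its description could be loaded into a pandas DataFrame. It assumes scikit-learn older than 1.2, where `sklearn.datasets.load_boston` still exists; newer releases no longer ship this data set, so the data would have to come from another source. The helper names `bunch_to_frame` and `load_boston_frame` are hypothetical and only used for illustration.

```python
import pandas as pd


def bunch_to_frame(data, feature_names, target, target_name="MEDV"):
    """Combine a feature matrix and a target vector into one DataFrame."""
    frame = pd.DataFrame(data, columns=list(feature_names))
    frame[target_name] = target
    return frame


def load_boston_frame():
    """Load the Boston data and print its description.

    Assumption: scikit-learn < 1.2 is installed, because load_boston
    was removed in scikit-learn 1.2.
    """
    # Imported here so the module still imports on newer scikit-learn versions.
    from sklearn.datasets import load_boston

    bunch = load_boston()
    print(bunch.DESCR)  # the description shipped together with the data
    return bunch_to_frame(bunch.data, bunch.feature_names, bunch.target)
```

With a suitable scikit-learn version, `boston = load_boston_frame()` prints the description and returns the thirteen features plus the target `MEDV` as columns. Note that sklearn names the columns in upper case, e.g., `CRIM`, `ZN`, and `INDUS`.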
Descriptive statistics of the Boston data
Explore the Boston data using descriptive statistics. Calculate the central tendency with the mean and the median, the variability with the standard deviation and the interquartile range (IQR), as well as the range of the data. The real task is to understand something about the data from these results. For example, what can you learn about the CRIM feature from the mean and the median?
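One possible way to compute these statistics for every feature at once is sketched below with pandas. The function name `describe_features` is hypothetical, and the small `toy` frame is made-up stand-in data, not the Boston data. It only shows how the output looks and how a mean far above the median hints at a right-skewed feature.

```python
import pandas as pd


def describe_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return central tendency, variability, and range per numeric column."""
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    return pd.DataFrame({
        "mean": numeric.mean(),
        "median": numeric.median(),
        "std": numeric.std(),  # sample standard deviation (ddof=1)
        "iqr": q3 - q1,        # interquartile range
        "min": numeric.min(),
        "max": numeric.max(),
        "range": numeric.max() - numeric.min(),
    })


# Made-up toy values for illustration only, NOT the Boston data.
toy = pd.DataFrame({
    "skewed": [0.0, 0.1, 0.2, 0.3, 10.0],
    "symmetric": [1.0, 2.0, 3.0, 4.0, 5.0],
})
summary = describe_features(toy)
print(summary)
```

For the real data, pass the DataFrame with the Boston features instead of `toy`. Comparing the `mean` and `median` columns row by row is a quick check for skewed features, and comparing `std` with `iqr` shows how strongly outliers affect the variability.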
Visualizations
The Python library matplotlib is great for creating all kinds of visualizations. There are even libraries built on top of matplotlib, like seaborn, that facilitate relatively complex visualizations in a single line of code.
Analyze single features of the Boston data
Visually analyze the features zn and indus of the Boston data. Use the techniques described in Chapter 3, i.e., histograms and density plots (with/without rugs). What can you learn about these features from these plots? What are the advantages and drawbacks of the different plots for this data?
Analyze the pair-wise relationships between the features of the Boston data
Next, analyze the pair-wise relationships between all fourteen features of the Boston data. First, analyze their relationships through scatter plots. Then, create a heatmap of the correlations between the features. What did you learn about the data?
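Both steps could look roughly like the sketch below, which uses the scatter plot matrix from pandas and a plain matplotlib heatmap of the Pearson correlations. The function name `plot_pairwise`, the color map, and the figure sizes are assumptions for illustration, and the `toy` frame is made-up data, not the Boston data.

```python
import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import scatter_matrix


def plot_pairwise(df: pd.DataFrame):
    """Draw a scatter plot matrix and a correlation heatmap."""
    numeric = df.select_dtypes(include="number")

    # One scatter plot per feature pair, histograms on the diagonal.
    scatter_matrix(numeric, figsize=(8, 8), diagonal="hist")

    # Pearson correlation between every pair of columns.
    corr = numeric.corr()

    # Heatmap with a fixed color scale from -1 to 1.
    fig, ax = plt.subplots(figsize=(6, 5))
    image = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
    ax.set_xticks(range(len(corr.columns)))
    ax.set_xticklabels(corr.columns, rotation=90)
    ax.set_yticks(range(len(corr.columns)))
    ax.set_yticklabels(corr.columns)
    fig.colorbar(image, ax=ax, label="Pearson correlation")
    fig.tight_layout()
    return corr, fig


# Made-up toy values for illustration only, NOT the Boston data.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [5.0, 3.0, 4.0, 1.0, 2.0],
})
corr, heatmap_fig = plot_pairwise(toy)
print(corr)
# In a script, call plt.show() to display the figures.
```

With all fourteen columns, the scatter plot matrix gets crowded, so a larger `figsize` or a subset of columns may be needed. seaborn's `pairplot` and `heatmap` functions produce comparable figures with less code. Keep in mind that the Pearson correlation only captures linear relationships, which is one reason to look at the scatter plots first.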