This is a sample solution for the association rule mining exercise. It is not the only way to solve the exercise: as with any programming task - and also with most data analysis tasks - there are multiple solutions for the same problem.
Libraries and Data
The first part of the exercise is about association rule mining. In Python, you can use the mlxtend
library to mine association rules.
We use data about store baskets in this exercise. You can use the following code to load the data. The code creates a list of records, where each record is a list of the items that are part of the transaction.
import urllib.request
records = []
# directly load from the url instead of using the file
for line in urllib.request.urlopen("https://user.informatik.uni-goettingen.de/~sherbold/store_data.csv"):
    # this also means we need to decode the binary string into ascii
    records.append(line.decode('ascii').strip().split(','))
import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.preprocessing import TransactionEncoder
# we first need to create a one-hot encoding of our transactions
te = TransactionEncoder()
te_ary = te.fit_transform(records)
data_df = pd.DataFrame(te_ary, columns=te.columns_)
# use a low support threshold of 0.005 - this may include too many candidates
# a careful selection of rules based on other metrics is required later,
# i.e., we will use a higher confidence threshold when generating the rules
frequent_itemsets = apriori(data_df, min_support=0.005, use_colnames=True)
frequent_itemsets
from mlxtend.frequent_patterns import association_rules
# use a high confidence to counterbalance the low support threshold
# order by lift to have the "best" rules at the top
association_rules(frequent_itemsets, metric="confidence",
min_threshold=0.5).sort_values('lift', ascending=False)
The rules make sense, but are only helpful in a limited way, because there are only two different items as consequents: spaghetti and mineral water. The first three rules indicate that people who buy products that are often used to make pasta sauce (e.g., ground beef, vegetables, olive oil) also often buy pasta. While this makes sense, these rules may also appear simply because people buy pasta a lot. However, the lift of about 3 indicates that the consequent is about three times as likely given the antecedent than would be expected by chance. Shrimp does not really fit this pattern.
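As a reminder of how the lift value is computed, here is a small sketch. All support values below are made up for illustration and are not taken from the data:

```python
# Hypothetical support values (not from the store data) to illustrate lift:
# lift(A -> B) = confidence(A -> B) / support(B)
#              = support(A and B) / (support(A) * support(B))
support_a = 0.02    # antecedent, e.g., ground beef (assumed value)
support_b = 0.17    # consequent, e.g., spaghetti (assumed value)
support_ab = 0.01   # both items in the same basket (assumed value)

confidence = support_ab / support_a   # P(B | A) = 0.5
lift = confidence / support_b

print(round(lift, 2))  # 2.94, i.e., roughly three times the random baseline
```

A lift of 1 would mean the antecedent carries no information about the consequent; the further above 1, the stronger the association.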
The other rules may look effective, but are likely a random artifact: the problem is that mineral water is part of many transactions:
# the mean of a one-hot encoded column is the fraction of transactions that contain the item
data_df['mineral water'].mean()
Thus, almost any rule that predicts mineral water looks good.
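This effect can be demonstrated with a small sketch. The baskets below are made up (the item names and values are hypothetical, not from the store data): when an item appears in almost every transaction, nearly any rule predicting it gets a high confidence, while the lift stays close to 1:

```python
import pandas as pd

# Made-up one-hot encoded baskets: 'water' is in 7 of 8 transactions,
# so rules of the form "X -> water" get a high confidence even when
# there is no real association between X and water.
baskets = pd.DataFrame({
    'water': [1, 1, 1, 1, 1, 1, 1, 0],
    'chips': [1, 1, 1, 1, 0, 0, 0, 1],
})

base_rate = baskets['water'].mean()  # 0.875
# confidence(chips -> water) = P(water | chips)
confidence = ((baskets['water'] & baskets['chips']).mean()
              / baskets['chips'].mean())
lift = confidence / base_rate

print(confidence, round(lift, 2))  # confidence 0.8, but lift only ~0.91
```

The confidence of 0.8 looks impressive on its own, but it is below the base rate of water, so the rule tells us nothing that the item frequency does not already tell us.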
from sklearn.model_selection import train_test_split
# split the data into two sets, each with 50% of the transactions
X_train, X_test = train_test_split(data_df, test_size=0.5, random_state=42)
# create frequent itemsets
frequent_itemsets_train = apriori(X_train, min_support=0.005, use_colnames=True)
frequent_itemsets_test = apriori(X_test, min_support=0.005, use_colnames=True)
# rules for first set
association_rules(frequent_itemsets_train, metric="confidence",
min_threshold=0.5).sort_values('lift', ascending=False)
# rules for second set
association_rules(frequent_itemsets_test, metric="confidence",
min_threshold=0.5).sort_values('lift', ascending=False)
The "mineral water" rules seem fairly stable and are present in both splits. The "spaghetti" rules seem more random: both splits contain such rules, but they are not the same rules. In general, spaghetti may simply be an item that is bought very often, just like mineral water. The rules may still be good suggestions for cross-selling, but for many of them there is probably no strong association between the antecedents and the consequents, i.e., there does not seem to be causality.
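One way to check which rules are stable across the splits is to join the two rule tables on the antecedent/consequent pair. The rules below are made up for illustration; with the real data, the two frames would be the outputs of association_rules() on each split:

```python
import pandas as pd

# Made-up rule tables standing in for the association_rules() output of the
# two splits; 'antecedents' and 'consequents' are frozensets, as in mlxtend.
rules_train = pd.DataFrame({
    'antecedents': [frozenset({'ground beef'}), frozenset({'shrimp'})],
    'consequents': [frozenset({'spaghetti'}), frozenset({'spaghetti'})],
    'lift': [3.1, 2.9],
})
rules_test = pd.DataFrame({
    'antecedents': [frozenset({'ground beef'}), frozenset({'olive oil'})],
    'consequents': [frozenset({'spaghetti'}), frozenset({'spaghetti'})],
    'lift': [3.0, 2.8],
})

# inner join keeps only the rules that occur in both splits
stable = rules_train.merge(rules_test, on=['antecedents', 'consequents'],
                           suffixes=('_train', '_test'))
print(stable)  # here only the ground beef -> spaghetti rule survives
```

Rules that survive this intersection (and have a similar lift in both splits) are better candidates for actual associations than rules that appear in only one split.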