This is a sample solution for the association rule mining exercise. It is not the only way to solve the exercise: as with any programming task - and also with most data analysis tasks - there are multiple solutions for the same problem.
Libraries and Data
The first part of the exercise is about association rule mining. In Python, you can use the mlxtend
library to mine association rules.
We use data about store baskets in this exercise. You can use the following code to load the data. The code creates a list of records, where each record is a list of the items that are part of the transaction.
import urllib.request
records = []
# directly load from the url instead of using the file
for line in urllib.request.urlopen("https://user.informatik.uni-goettingen.de/~sherbold/store_data.csv"):
    # this also means we need to decode the binary string into ascii
    records.append(line.decode('ascii').strip().split(','))
import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.preprocessing import TransactionEncoder
# we first need to create a one-hot encoding of our transactions
te = TransactionEncoder()
te_ary = te.fit_transform(records)
data_df = pd.DataFrame(te_ary, columns=te.columns_)
# use a low support threshold of 0.005 - this may include too many candidates
# a careful selection of rules based on other metrics is required later,
# i.e., we will use a higher confidence threshold when generating the rules
frequent_itemsets = apriori(data_df, min_support=0.005, use_colnames=True)
frequent_itemsets
from mlxtend.frequent_patterns import association_rules
# use a high confidence to counterbalance the low support threshold
# order by lift to have the "best" rules at the top
association_rules(frequent_itemsets, metric="confidence",
min_threshold=0.5).sort_values('lift', ascending=False)
The rules make sense, but are only helpful in a limited way, because there are only two different items as consequents: spaghetti and mineral water. The first three rules indicate that people who buy products that are often used to make pasta sauce (e.g., ground beef, vegetables, olive oil) also often buy pasta. While this makes sense, these rules may also appear simply because people buy pasta a lot. However, the lift of about 3 indicates that the consequent is about three times as likely given the antecedent than would be expected by chance. Shrimp does not really fit this pattern.
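As a reminder of how the lift value is computed, here is a small sketch. All support values below are made up for illustration and are not taken from the data:

```python
# Hypothetical support values (not from the store data) to illustrate lift:
# lift(A -> B) = confidence(A -> B) / support(B)
#              = support(A and B) / (support(A) * support(B))
support_a = 0.02    # antecedent, e.g., ground beef (assumed value)
support_b = 0.17    # consequent, e.g., spaghetti (assumed value)
support_ab = 0.01   # both items in the same basket (assumed value)

confidence = support_ab / support_a   # P(B | A) = 0.5
lift = confidence / support_b

print(round(lift, 2))  # 2.94, i.e., roughly three times the random baseline
```

A lift of 1 would mean the antecedent carries no information about the consequent; the further above 1, the stronger the association.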
The other rules may look effective, but are likely a random artifact: the problem is that mineral water is part of many transactions:
# the mean of a one-hot encoded column is the fraction of transactions that contain the item
data_df['mineral water'].mean()
Thus, almost any rule that predicts mineral water looks good.
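This effect can be demonstrated with a small sketch. The baskets below are made up (the item names and values are hypothetical, not from the store data): when an item appears in almost every transaction, nearly any rule predicting it gets a high confidence, while the lift stays close to 1:

```python
import pandas as pd

# Made-up one-hot encoded baskets: 'water' is in 7 of 8 transactions,
# so rules of the form "X -> water" get a high confidence even when
# there is no real association between X and water.
baskets = pd.DataFrame({
    'water': [1, 1, 1, 1, 1, 1, 1, 0],
    'chips': [1, 1, 1, 1, 0, 0, 0, 1],
})

base_rate = baskets['water'].mean()  # 0.875
# confidence(chips -> water) = P(water | chips)
confidence = ((baskets['water'] & baskets['chips']).mean()
              / baskets['chips'].mean())
lift = confidence / base_rate

print(confidence, round(lift, 2))  # confidence 0.8, but lift only ~0.91
```

The confidence of 0.8 looks impressive on its own, but it is below the base rate of water, so the rule tells us nothing that the item frequency does not already tell us.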
from sklearn.model_selection import train_test_split
# split the data into two sets, each with 50% of the transactions
X_train, X_test = train_test_split(data_df, test_size=0.5, random_state=42)
# create frequent itemsets
frequent_itemsets_train = apriori(X_train, min_support=0.005, use_colnames=True)
frequent_itemsets_test = apriori(X_test, min_support=0.005, use_colnames=True)
# rules for first set
association_rules(frequent_itemsets_train, metric="confidence",
min_threshold=0.5).sort_values('lift', ascending=False)
# rules for second set
association_rules(frequent_itemsets_test, metric="confidence",
min_threshold=0.5).sort_values('lift', ascending=False)
The "mineral water" rules seem fairly stable and are present in both splits. The "spaghetti" rules seem more random: both splits contain such rules, but they are not the same rules. In general, spaghetti may simply be an item that is bought very often, just like mineral water. The rules may still be good suggestions for cross-selling, but for many of them there is probably no strong association between the antecedents and the consequents, i.e., there does not seem to be causality.
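One way to check which rules are stable across the splits is to join the two rule tables on the antecedent/consequent pair. The rules below are made up for illustration; with the real data, the two frames would be the outputs of association_rules() on each split:

```python
import pandas as pd

# Made-up rule tables standing in for the association_rules() output of the
# two splits; 'antecedents' and 'consequents' are frozensets, as in mlxtend.
rules_train = pd.DataFrame({
    'antecedents': [frozenset({'ground beef'}), frozenset({'shrimp'})],
    'consequents': [frozenset({'spaghetti'}), frozenset({'spaghetti'})],
    'lift': [3.1, 2.9],
})
rules_test = pd.DataFrame({
    'antecedents': [frozenset({'ground beef'}), frozenset({'olive oil'})],
    'consequents': [frozenset({'spaghetti'}), frozenset({'spaghetti'})],
    'lift': [3.0, 2.8],
})

# inner join keeps only the rules that occur in both splits
stable = rules_train.merge(rules_test, on=['antecedents', 'consequents'],
                           suffixes=('_train', '_test'))
print(stable)  # here only the ground beef -> spaghetti rule survives
```

Rules that survive this intersection (and have a similar lift in both splits) are better candidates for actual associations than rules that appear in only one split.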