# Association Rule Mining

This is a sample solution for the association rule mining exercise. It is not the only way to solve the exercise: as with any programming task, and with most data analysis tasks, there are multiple solutions to the same problem.

## Libraries and Data

The first part of the exercise is about association rule mining. In Python, you can use the ```mlxtend``` library to mine association rules.

We use data about [store baskets](https://user.informatik.uni-goettingen.de/~sherbold/store_data.csv) in this exercise. You can use the following code to load the data. The code creates a list of records, where each record is a list of the items that are part of the transaction.

In [8]:
import urllib.request

records = []
# directly load from the url instead of using the file
for line in urllib.request.urlopen("https://user.informatik.uni-goettingen.de/~sherbold/store_data.csv"):
    # this also means we need to decode the binary string into ascii
    records.append(line.decode('ascii').strip().split(','))

## Finding frequent itemsets

Once you have the transactional records, use the apriori algorithm to find frequent itemsets with a suitable threshold for support for this data. Try to find a suitable threshold for the minimal support such that you can state a clear reason why you picked this threshold. 

In [14]:
import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.preprocessing import TransactionEncoder

# we first need to create a one-hot encoding of our transactions
te = TransactionEncoder()
te_ary = te.fit_transform(records)
data_df = pd.DataFrame(te_ary, columns=te.columns_)

# use a support of 0.005; such a low threshold may include too many candidates,
# so the rules must be selected carefully based on other metrics,
# i.e., we use a higher confidence threshold when mining the rules
frequent_itemsets = apriori(data_df, min_support=0.005, use_colnames=True)
frequent_itemsets

index,support,itemsets
0,0.020397,(almonds)
1,0.008932,(antioxydant juice)
2,0.033329,(avocado)
3,0.008666,(bacon)
4,0.010799,(barbecue sauce)
5,0.014265,(black tea)
6,0.009199,(blueberries)
7,0.011465,(body spray)
8,0.033729,(brownies)
9,0.008666,(bug spray)
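
To make the threshold more concrete, the following minimal sketch computes support by hand on a handful of made-up transactions. The list `toy_records` and the helpers `support` and `min_count` are illustrative assumptions; they are neither part of the store data nor of `mlxtend`. The sketch also converts a relative threshold into the absolute number of transactions in which an itemset must appear.

```python
import math

# Hypothetical toy transactions; NOT the store data used above.
toy_records = [
    ["milk", "bread"],
    ["milk", "eggs"],
    ["bread", "eggs", "milk"],
    ["eggs"],
]


def support(itemset, records):
    """Fraction of records that contain every item of the itemset."""
    itemset = set(itemset)
    hits = sum(1 for record in records if itemset <= set(record))
    return hits / len(records)


def min_count(min_support, n_records):
    """Smallest number of records that satisfies a relative support threshold."""
    return math.ceil(min_support * n_records)


print(support(["milk"], toy_records))           # 3 of 4 records
print(support(["milk", "bread"], toy_records))  # 2 of 4 records
print(min_count(0.005, 1000))                   # records needed out of 1000
```

Thinking in absolute counts can help to justify a threshold: an itemset that appears in only a few transactions is weak evidence, no matter how high the confidence of the resulting rule is.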


## Mining rules from the frequent itemsets

Determine good rules from the results for this data. Use lift and confidence as metrics for your evaluations. 

In [34]:
from mlxtend.frequent_patterns import association_rules

# use a high confidence to counterbalance the low support threshold
# order by lift to have the "best" rules at the top
association_rules(frequent_itemsets, metric="confidence",
                  min_threshold=0.5).sort_values('lift', ascending=False)

index,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
13,"(ground beef, shrimp)",(spaghetti),0.011465,0.17411,0.005999,0.523256,3.005315,0.004003,1.732354
6,"(frozen vegetables, ground beef)",(spaghetti),0.016931,0.17411,0.008666,0.511811,2.939582,0.005718,1.691742
9,"(frozen vegetables, olive oil)",(spaghetti),0.011332,0.17411,0.005733,0.505882,2.905531,0.00376,1.671444
8,"(frozen vegetables, soup)",(mineral water),0.007999,0.238368,0.005066,0.633333,2.656954,0.003159,2.077178
17,"(olive oil, soup)",(mineral water),0.008932,0.238368,0.005199,0.58209,2.441976,0.00307,1.822476
7,"(frozen vegetables, olive oil)",(mineral water),0.011332,0.238368,0.006532,0.576471,2.418404,0.003831,1.798297
15,"(milk, soup)",(mineral water),0.015198,0.238368,0.008532,0.561404,2.355194,0.004909,1.73652
2,"(chocolate, soup)",(mineral water),0.010132,0.238368,0.005599,0.552632,2.318395,0.003184,1.702471
3,"(eggs, cooking oil)",(mineral water),0.011732,0.238368,0.006399,0.545455,2.288286,0.003603,1.67559
5,"(frozen vegetables, ground beef)",(mineral water),0.016931,0.238368,0.009199,0.543307,2.279277,0.005163,1.667711


The rules make sense, but are only of limited help, because only two different items appear as consequents: spaghetti and mineral water. The first three rules indicate that customers who buy products that are often used to make pasta sauce (e.g., ground beef, frozen vegetables, olive oil) also often buy spaghetti. While this makes sense, these rules may also appear simply because spaghetti is bought often. However, the lift of about 3 indicates that the antecedents and spaghetti occur together about three times as often as we would expect if they were bought independently of each other. The shrimp does not really match the pasta sauce pattern.
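
As a sanity check, the metrics of the first rule can be recomputed from the three support values in its table row. The sketch below uses the standard definitions of confidence, lift, leverage, and conviction. The function name `rule_metrics` is made up for this illustration, and the results match the table only up to rounding, because the inputs are the rounded values shown above.

```python
def rule_metrics(support_both, support_antecedent, support_consequent):
    """Recompute rule metrics from supports (standard textbook definitions)."""
    confidence = support_both / support_antecedent
    lift = confidence / support_consequent
    leverage = support_both - support_antecedent * support_consequent
    # conviction is infinite for confidence == 1; this case is not handled here
    conviction = (1 - support_consequent) / (1 - confidence)
    return {
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
    }


# supports of (ground beef, shrimp) -> (spaghetti), copied from the table above
metrics = rule_metrics(
    support_both=0.005999,
    support_antecedent=0.011465,
    support_consequent=0.17411,
)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```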

The other rules are likely effective, but may be a random artifact: the problem is that mineral water is part of many transactions:

In [40]:
# the mean of a one-hot encoded column is the fraction of transactions that contain the item
data_df['mineral water'].mean()

0.23836821757099053

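Thus, almost any rule that predicts mineral water looks good: mineral water is part of about 24% of all transactions, so such a rule reaches a confidence of about 0.24 even if the antecedent is unrelated to mineral water.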
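
The effect of a frequent consequent can be illustrated with a small simulation on synthetic data. In this sketch, a frequent item and an unrelated item are placed into transactions independently of each other; the item names, the probabilities, the number of transactions, and the seed are arbitrary assumptions. The confidence of the rule should end up close to the frequency of the consequent, while the lift should stay close to 1, which is why it helps to read confidence together with lift.

```python
import random

rng = random.Random(0)  # fixed seed, so that the run is repeatable
n = 20000               # number of synthetic transactions

# synthetic transactions: both items are drawn independently of each other
has_water = [rng.random() < 0.24 for _ in range(n)]  # frequent item
has_other = [rng.random() < 0.10 for _ in range(n)]  # unrelated item

support_water = sum(has_water) / n
support_other = sum(has_other) / n
support_both = sum(w and o for w, o in zip(has_water, has_other)) / n

# rule: (other) -> (water)
confidence = support_both / support_other
lift = confidence / support_water

print(f"support(water)   = {support_water:.3f}")
print(f"confidence(rule) = {confidence:.3f}")
print(f"lift(rule)       = {lift:.3f}")
```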

## Validation of the rules

Randomly split your records into two sets with roughly 50% of data each. Now use the Apriori algorithm to determine rules on both of these sets. Do you find similar rules on both sets? What does the similarity/the differences indicate?

In [44]:
from sklearn.model_selection import train_test_split

# split the data into two sets with 50% of the data each
X_train, X_test = train_test_split(data_df, test_size=0.5, random_state=42)

# create frequent itemsets for both sets with the same support threshold as before
frequent_itemsets_train = apriori(X_train, min_support=0.005, use_colnames=True)
frequent_itemsets_test = apriori(X_test, min_support=0.005, use_colnames=True)

In [46]:
# rules for first set
association_rules(frequent_itemsets_train, metric="confidence",
                  min_threshold=0.5).sort_values('lift', ascending=False)

index,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
27,"(spaghetti, whole wheat pasta)",(milk),0.010133,0.135733,0.005067,0.5,3.683694,0.003691,1.728533
5,"(grated cheese, eggs)",(spaghetti),0.0096,0.181333,0.005867,0.611111,3.370098,0.004126,2.105143
15,"(frozen vegetables, soup)",(spaghetti),0.008533,0.181333,0.005067,0.59375,3.274357,0.003519,2.015179
10,"(frozen vegetables, ground beef)",(spaghetti),0.0168,0.181333,0.0096,0.571429,3.151261,0.006554,1.910222
21,"(ground beef, shrimp)",(spaghetti),0.012267,0.181333,0.006933,0.565217,3.117008,0.004709,1.882933
33,"(frozen vegetables, mineral water, milk)",(spaghetti),0.0112,0.181333,0.006133,0.547619,3.019958,0.004102,1.809684
14,"(frozen vegetables, olive oil)",(spaghetti),0.0112,0.181333,0.005867,0.52381,2.888655,0.003836,1.7192
0,"(cake, chocolate)",(spaghetti),0.013333,0.181333,0.006933,0.52,2.867647,0.004516,1.705556
6,"(herb & pepper, eggs)",(spaghetti),0.0128,0.181333,0.0064,0.5,2.757353,0.004079,1.637333
32,"(spaghetti, milk, eggs)",(mineral water),0.009067,0.2496,0.006133,0.676471,2.710219,0.00387,2.319418


In [45]:
# rules for second set
association_rules(frequent_itemsets_test, metric="confidence",
                  min_threshold=0.5).sort_values('lift', ascending=False)

index,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
20,"(pancakes, olive oil)",(spaghetti),0.01093,0.166889,0.005599,0.512195,3.06908,0.003774,1.707878
16,"(tomatoes, olive oil)",(mineral water),0.007465,0.227139,0.005065,0.678571,2.987466,0.00337,2.404455
4,"(chocolate, soup)",(mineral water),0.008798,0.227139,0.005865,0.666667,2.935055,0.003867,2.318582
15,"(olive oil, soup)",(mineral water),0.008531,0.227139,0.005332,0.625,2.751614,0.003394,2.060962
13,"(turkey, milk)",(mineral water),0.011464,0.227139,0.006931,0.604651,2.662026,0.004328,1.954883
14,"(shrimp, olive oil)",(mineral water),0.009064,0.227139,0.005332,0.588235,2.589754,0.003273,1.876947
12,"(milk, soup)",(mineral water),0.01413,0.227139,0.008264,0.584906,2.575095,0.005055,1.861891
9,"(frozen vegetables, olive oil)",(mineral water),0.011464,0.227139,0.006665,0.581395,2.559641,0.004061,1.846278
1,"(chocolate, chicken)",(mineral water),0.012797,0.227139,0.007198,0.5625,2.476452,0.004291,1.766538
5,"(eggs, cooking oil)",(mineral water),0.01093,0.227139,0.006132,0.560976,2.469741,0.003649,1.760405


The "mineral water" rules seem to be fairly stable and are present in both splits. The "spaghetti" rules seem to be more random: while both splits contain such rules, they are not the same rules. In general, it seems that spaghetti, like mineral water, may simply be bought very often. The rules may still be good suggestions for cross-selling, but probably not every rule reflects a strong association between the antecedents and the consequents, i.e., there does not seem to be causality.
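
One way to make "similar rules" measurable is to compare the two results as sets of (antecedents, consequents) pairs. The following sketch computes the shared rules and the Jaccard similarity of two rule sets. The helpers `rule` and `rule_overlap` and the two tiny rule sets are hypothetical and are not the results from above. With `mlxtend`, the pairs could be built from the `antecedents` and `consequents` columns of the rule tables, which contain frozensets.

```python
def rule(antecedents, consequents):
    """Represent a rule as a hashable pair of frozensets."""
    return (frozenset(antecedents), frozenset(consequents))


def rule_overlap(rules_a, rules_b):
    """Return the shared rules and the Jaccard similarity of two rule sets."""
    shared = rules_a & rules_b
    union = rules_a | rules_b
    jaccard = len(shared) / len(union) if union else 1.0
    return shared, jaccard


# hypothetical rule sets from two splits (made-up items, for illustration only)
rules_split_1 = {
    rule(["bread", "butter"], ["jam"]),
    rule(["tea"], ["sugar"]),
    rule(["pasta"], ["tomato sauce"]),
}
rules_split_2 = {
    rule(["butter", "bread"], ["jam"]),  # same rule, different item order
    rule(["tea"], ["lemon"]),
}

shared, jaccard = rule_overlap(rules_split_1, rules_split_2)
print(shared)
print(f"Jaccard similarity: {jaccard:.2f}")
```

A low similarity would suggest that many of the rules depend on the particular split, whereas rules that appear in both sets are more likely to reflect a stable pattern in the data.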