Chapter 8 Association rule mining

While \(K\)-means clustering is used primarily for numerical measures, we often have data where the variables are categorical. A common database type that we haven’t yet met comes in the form of a list of sets of items. A classical example is that of market basket analysis. A retail store has a large number of transactions during a particular time period, and is interested whether certain groups of items are consistently purchased together. With knowledge of such associations the store layout could be designed to place items optimally with respect to each other, or the data might be used for cross-selling (discounting an item if you purchase the second), catalogue design, or to identify customer segments based on purchasing patterns.

A number of other examples are similar to this, however. Results from questionnaires (particularly when the questions permit categorical answers) yield similar data, and the analyst may be interested in associations between the answers to different questions. Data from web searches may also be analysed in the same manner. People that search for certain terms may be more likely to also search for other terms. Further, there’s the possibility that improved search results65 might be gleaned: If a person has previously searched for ‘Apple’ and later searches for ‘Leopard’ then it may be that they are interested in Mac OS 10.5 (codenamed ‘Leopard’) rather than whether leopards eat apples.

Typically, such datasets consist of a large number of possible items and a large number of collections of those items, where each collection may contain only a small fraction of the total items available. A supermarket for example may have many thousands of products, yet each transaction may consist of at most a few hundred items, and in many cases fewer than ten items. Table 8.1 shows a few entries from a typical transaction database.

Table 8.1: An example supermarket database with six transactions.
transaction items
1 bread, milk
2 beer
3 butter
4 bread, butter, milk
5 bread, butter
6 bread, beer

The retailer is primarily interested in whether a product is purchased (presence/absence) rather than how many instances of the same item were purchased in the same transaction, so the data may typically be represented by a binary matrix as in Table 8.2.

Table 8.2: The binary matrix of transactions from Table 8.1
transaction bread milk beer butter
1 1 1 0 0
2 0 0 1 0
3 0 0 0 1
4 1 1 0 1
5 1 0 0 1
6 1 0 1 0

Association rules provide information about implications between item sets: if a transaction has items from set \(A\), do they also tend to have items from set \(B\)? This is an example of a rule which is an implication of the form \(A \Rightarrow B\), where \(A, B\) are itemsets with the requirement that they are disjoint (\(A \cap B = \emptyset\)). \(A\) is known as the antecedent (or lhs) and \(B\) is the consequent (or rhs) of the rule.

From Table 8.1, an example rule for the supermarket might be \(\{\textrm{milk},\textrm{bread}\} \Rightarrow \{\textrm{butter}\}\), meaning that if the customer purchases milk and bread, they also purchase butter. Even with this small dataset (just 4 items), there are 50 association rules. In general, for \(d\) items, the number of association rules is given by \[3^d - 2^{d+1} + 1,\] and thus is exponential in the number of items. The majority of these rules are of no use: It will be unlikely that there are transactions in the data set that contains the rules, thus we need to be able to select the interesting rules from all possible rules, without having to generate all possible rules to begin with.

The support \(\mathop{\mathrm{supp}}(A)\) of an itemset \(A\) is defined as the proportion of transactions in the database which contain the itemset. For the example database in Table 8.1, the itemset \(\{\textrm{milk},\textrm{bread}\}\) has a support of \(2/6\) since it occurs in 2 out of the 6 transactions, whereas the itemset \(\{\textrm{bread}\}\) has support \(5/6\). The support of a rule \(A \Rightarrow B\) is then just the support of \(A \cup B\), i.e. the proportion of transactions in the database containing all the items in the rule.

The confidence of a rule is defined as \[\mathop{\mathrm{conf}}(A \Rightarrow B) = \frac{\mathop{\mathrm{supp}}(A \cup B)}{\mathop{\mathrm{supp}}(A)},\] and can be interpreted as an estimate of the probability \(P(B|A)\), the probability of finding the consequent of the rule given the antecedent. From Table 8.1 we see that the rule \(\{\textrm{milk}, \textrm{bread}\} \Rightarrow \{\textrm{butter}\}\) has confidence \(\frac{1/6}{2/6} = 0.5\), so that 50% of the transactions containing milk and bread also contain butter.

Interesting association rules are those that satisfy both a minimum support clause and a minimum confidence constraint simultaneously. The minimal support constraint allows rules to be mined efficiently, as if an itemset \(B\) has high support, then we know that all subsets of \(B\) must also have high support. This allows algorithms to be designed that mine all maximal frequent itemsets efficiently.

Even with such constraints, however, there may still be a large number of association rules found. Further, one must be careful to consider association rules with high support and high confidence as being interesting, as example 8.1 shows.

Example 8.1 Coffee and tea drinking

Suppose that 1000 people were asked whether they drank tea or coffee, or both. The results of this are given in the contigency table 8.3.

Table 8.3: Contigency table of drinking preferences for 1000 people.
coffee no coffee total
tea 150 650 800
no tea 50 150 200
total 200 800 1000

The association rule \(\{\textrm{coffee}\}\Rightarrow\{\textrm{tea}\}\) may be considered, and we find it has relatively high support (15%) and confidence (75%). At first glance it might appear that those that drink coffee also tend to drink tea. However, the probability of a person drinking tea is 80% regardless of whether they drink coffee, and this reduces to 75% if they drink coffee. Thus, knowing that a particular person drinks coffee actually reduces the chance that they’re a tea drinker! The high confidence of the rule \(\{\textrm{coffee}\}\Rightarrow\{\textrm{tea}\}\) is therefore misleading. The pitfall of high confidence is due to the fact that the confidence measure does not take into account the support of the rule consequent. Once we take into account the support of tea drinkers, it is no surprise that many coffee drinkers also drink tea.

A popular measure useful for ranking association rules in terms of “interestingness" is lift, which is defined as \[\mathop{\mathrm{lift}}(A \Rightarrow B) = \frac{\mathop{\mathrm{supp}}(A \cup B)}{\mathop{\mathrm{supp}}(A)\mathop{\mathrm{supp}}(B)}.\] This may be interpreted as the deviation of the support of the rule from what might be expected if the antecedent and consequent were independent. A large lift value indicates a strong positive association. A lift of 1 indicates no association, and a lift smaller than one indicates a negative association. The lift of the rule \(\{\textrm{milk},\textrm{bread}\}\Rightarrow\{\textrm{butter}\}\) is \(1\). The lift of the rule \(\{\textrm{coffee}\}\Rightarrow\{\textrm{tea}\}\) is \(\frac{0.15}{(0.2)(0.8)} = 0.9375\), indicating a slight negative correlation.

While the former statistical criteria for interesting-ness of association rules is important, often subjective measures may be of more interest to the data-miner. An association might be considered uninteresting if it provides only information that was already known (or suspected). The associations that provide connections that were not previously known are those that are subjectively of interest. For example, the association \(\{\textrm{bread}\}\Rightarrow\{\textrm{butter}\}\) is not subjectively interesting, whereas \(\{\textrm{nappies}\}\Rightarrow\{\textrm{beer}\}\) is interesting!

There have been several algorithms developed for mining association rules, with the Apriori and Eclat algorithms being two of the more popular techniques. The R package arules implements both algorithms.

Example 8.2 Household incomes and demographic information

A total of \(N=9409\) questionnaires containg 502 questions were filled out by shopping mall customers in the San Francisco Bay area. The dataset consists of 14 demographic attributes, one of which is annual household income.

We start by loading the data in and checking the frequency of items in the dataset with support greater than 5%, given in Figure 8.1.

library(arules)
data("Income")
itemFrequencyPlot(Income, support=0.05, cex.names=0.8)
The freqency of items in the Income data set with support greater than 5%

Figure 8.1: The freqency of items in the Income data set with support greater than 5%

We can re-do this plot using ggplot instead which will make it a little nicer. We first use enframe() to convert the named vector returned by itemFrequency() to a tibble:

Income |>
  itemFrequency() |>
  enframe(name="item", value="support") |>
  filter(support >= 0.05) |>
  ggplot() +
  geom_col(mapping = aes(x=support, y=item))

There are several other variables with support less than 5% - all variables can be seen using labels(Income)$items. The next step is to generate association rules using the apriori function. We specify a support of 0.01 and confidence of 0.6, yielding just over a million rules.

rules <- Income |> apriori(parameter=list(support=0.01, confidence=0.6))
#> Apriori
#> 
#> Parameter specification:
#>  confidence minval smax arem  aval originalSupport maxtime support minlen
#>         0.6    0.1    1 none FALSE            TRUE       5    0.01      1
#>  maxlen target  ext
#>      10  rules TRUE
#> 
#> Algorithmic control:
#>  filter tree heap memopt load sort verbose
#>     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
#> 
#> Absolute minimum support count: 68 
#> 
#> set item appearances ...[0 item(s)] done [0.00s].
#> set transactions ...[50 item(s), 6876 transaction(s)] done [0.01s].
#> sorting and recoding items ... [49 item(s)] done [0.00s].
#> creating transaction tree ... done [0.00s].
#> checking subsets of size 1 2 3 4 5 6 7 8 9 10
#> Warning in apriori(Income, parameter = list(support = 0.01, confidence = 0.6)):
#> Mining stopped (maxlen reached). Only patterns up to a length of 10 returned!
#>  done [0.53s].
#> writing ... [1200801 rule(s)] done [0.19s].
#> creating S4 object  ... done [0.72s].
rules
#> set of 1200801 rules
summary(rules)
#> set of 1200801 rules
#> 
#> rule length distribution (lhs + rhs):sizes
#>      1      2      3      4      5      6      7      8      9     10 
#>      7    382   5447  34793 119273 245153 315139 265573 153451  61583 
#> 
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   1.000   6.000   7.000   7.121   8.000  10.000 
#> 
#> summary of quality measures:
#>     support          confidence        coverage            lift        
#>  Min.   :0.01003   Min.   :0.6000   Min.   :0.01003   Min.   : 0.6573  
#>  1st Qu.:0.01265   1st Qu.:0.7238   1st Qu.:0.01556   1st Qu.: 1.1386  
#>  Median :0.01745   Median :0.8333   Median :0.02152   Median : 1.3910  
#>  Mean   :0.02476   Mean   :0.8272   Mean   :0.03076   Mean   : 1.6330  
#>  3rd Qu.:0.02792   3rd Qu.:0.9423   3rd Qu.:0.03476   3rd Qu.: 1.9247  
#>  Max.   :0.91289   Max.   :1.0000   Max.   :1.00000   Max.   :24.3140  
#>      count       
#>  Min.   :  69.0  
#>  1st Qu.:  87.0  
#>  Median : 120.0  
#>  Mean   : 170.2  
#>  3rd Qu.: 192.0  
#>  Max.   :6277.0  
#> 
#> mining info:
#>    data ntransactions support confidence
#>  Income          6876    0.01        0.6
#>                                                                        call
#>  apriori(data = Income, parameter = list(support = 0.01, confidence = 0.6))

The rules may be inspected via the inspect command. Sorting the rules first based on confidence or lift is useful to obtain the most interesting (objectively!) relationships first:

rules |>
  sort(by="confidence") |>
  head(n = 5) |>
  inspect()
#>     lhs                              rhs                           support confidence   coverage     lift count
#> [1] {marital status=widowed}      => {dual incomes=not married} 0.02937755          1 0.02937755 1.671366   202
#> [2] {marital status=divorced}     => {dual incomes=not married} 0.09787667          1 0.09787667 1.671366   673
#> [3] {marital status=single}       => {dual incomes=not married} 0.40910413          1 0.40910413 1.671366  2813
#> [4] {occupation=military,                                                                                      
#>      ethnic classification=white} => {language in home=english} 0.01207097          1 0.01207097 1.095428    83
#> [5] {marital status=single,                                                                                    
#>      ethnic classification=other} => {dual incomes=not married} 0.01163467          1 0.01163467 1.671366    80

As you can see, there are rules with confidence=1, indicating that all transactions in the datasets containing the lhs also contain the rhs. The items dual income=not married and marital status=single are clearly giving the same information. We can filter out all such rules using the subset command66:

rules <- rules |>
  subset(confidence < 1)

rules |> sort(by="confidence") |>
  head(n=2) |>
  inspect()
#>     lhs                                              rhs           support confidence  coverage     lift count
#> [1] {marital status=single,                                                                                   
#>      occupation=student,                                                                                      
#>      householder status=live with parents/family} => {age=14-34} 0.1137289  0.9987229 0.1138743 1.706141   782
#> [2] {marital status=single,                                                                                   
#>      occupation=student,                                                                                      
#>      dual incomes=not married,                                                                                
#>      householder status=live with parents/family} => {age=14-34} 0.1137289  0.9987229 0.1138743 1.706141   782

These are a little more interesting, but hardly surprising - if you’re a student living with your parents/family, then you’re very likely to be less than 35 years of age.

We can also inspect things interactively via the inspectDT function in arulesViz, though you’ll want to make sure you don’t have too many rules first:

library(arulesViz)
rules |>
  head(n=100) |>
  inspectDT()

The subset() command is useful for filtering things down somewhat. We can use the %in%, %ain%, and %pin% operators to filter what appears on the lhs or rhs of the rules. The %in% operator matches any of the following complete strings, the %ain% operator matches all of the strings, and %pin% partially matches any of the following strings. Let us start by finding rules that associate with high income, with moderate lift.

high_income <- rules |>
  subset(rhs %in% "income=$40,000+" & lift > 1.2)
high_income |>
  sort(by="confidence") |>
  head(n=3) |>
  inspect()
#>     lhs                                      rhs                  support confidence   coverage     lift count
#> [1] {education=college graduate,                                                                              
#>      years in bay area=1-9,                                                                                   
#>      dual incomes=yes,                                                                                        
#>      number of children=0,                                                                                    
#>      householder status=own,                                                                                  
#>      language in home=english}            => {income=$40,000+} 0.01308901  0.9782609 0.01337987 2.591110    90
#> [2] {sex=male,                                                                                                
#>      age=35+,                                                                                                 
#>      education=college graduate,                                                                              
#>      occupation=professional/managerial,                                                                      
#>      dual incomes=yes,                                                                                        
#>      number of children=0,                                                                                    
#>      householder status=own}              => {income=$40,000+} 0.01265271  0.9775281 0.01294357 2.589169    87
#> [3] {marital status=married,                                                                                  
#>      education=college graduate,                                                                              
#>      years in bay area=1-9,                                                                                   
#>      dual incomes=yes,                                                                                        
#>      number of children=0,                                                                                    
#>      householder status=own,                                                                                  
#>      language in home=english}            => {income=$40,000+} 0.01265271  0.9775281 0.01294357 2.589169    87

This suggests that a college graduate in a household with two incomes that own their home is likely to have a high household income, as one would expect. A little more interesting is the association with no children. Let us look at a different measure of wealth, home ownership. Obviously home ownership will be associated with high household income, but what about home ownership by those that aren’t earning a high annual income?

home_own <- rules |>
  subset(lhs %in% "income=$0-$40,000" & 
         rhs %in% "householder status=own" &
         lift > 1.2)
home_own |>
  sort(by="confidence") |>
  head(n=2) |>
  inspect()
#>     lhs                               rhs                         support confidence   coverage     lift count
#> [1] {income=$0-$40,000,                                                                                       
#>      sex=female,                                                                                              
#>      marital status=married,                                                                                  
#>      age=35+,                                                                                                 
#>      dual incomes=no,                                                                                         
#>      number of children=0,                                                                                    
#>      type of home=house,                                                                                      
#>      ethnic classification=white}  => {householder status=own} 0.01163467  0.9756098 0.01192554 2.596088    80
#> [2] {income=$0-$40,000,                                                                                       
#>      sex=female,                                                                                              
#>      marital status=married,                                                                                  
#>      age=35+,                                                                                                 
#>      dual incomes=no,                                                                                         
#>      number of children=0,                                                                                    
#>      type of home=house,                                                                                      
#>      ethnic classification=white,                                                                             
#>      language in home=english}     => {householder status=own} 0.01163467  0.9756098 0.01192554 2.596088    80

The recurring theme in these rules are older people in a single-income household with no children. Finally, we look at one of the variables not present in Figure 8.1, i.e. one of those variables that is rare in the dataset: whether the person had answered the question regarding how long they’d resided in the San Francisco bay area.

new_to_bay <- rules |>
  subset(rhs %in% "years in bay area=1-9" &
         lift > 1.2)
new_to_bay |>
  sort(by="lift") |>
  head(n=5) |>
  inspect()
#>     lhs                                 rhs                        support confidence   coverage     lift count
#> [1] {education=no college graduate,                                                                            
#>      occupation=military,                                                                                      
#>      number in household=1}          => {years in bay area=1-9} 0.01032577  0.8452381 0.01221640 2.391711    71
#> [2] {occupation=military,                                                                                      
#>      householder status=rent,                                                                                  
#>      language in home=english}       => {years in bay area=1-9} 0.01018034  0.8139535 0.01250727 2.303187    70
#> [3] {occupation=military,                                                                                      
#>      householder status=rent}        => {years in bay area=1-9} 0.01105294  0.8000000 0.01381617 2.263704    76
#> [4] {income=$0-$40,000,                                                                                        
#>      sex=male,                                                                                                 
#>      occupation=military}            => {years in bay area=1-9} 0.01003490  0.7931034 0.01265271 2.244189    69
#> [5] {income=$0-$40,000,                                                                                        
#>      age=14-34,                                                                                                
#>      occupation=military}            => {years in bay area=1-9} 0.01003490  0.7931034 0.01265271 2.244189    69

With small numbers, one shouldn’t read too much into this, but it is interesting that the military occupation comes up in the top 5 association rules.


  1. or, for the paranoid, a more complete profile of each user.↩︎

  2. Note, however, that we may also be filtering out useful information here, so an alternate might be to drop superfluous items instead↩︎