Back to Index

Analytics: Association Rule Mining

Data mining techniques can produce diverse types of knowledge. Often, this knowledge is represented as a model of estimated classification probabilities. In many cases, the classification task is to determine the class of an instance. A very simple example might be that if an instance has the attribute of being round, red and juicy, it will be classified as an apple. Association rule mining works in much the same way, except that combinations of attributes ("red" for example) are predicted as well. In other words, attributes as well as class are mined.

Association rule mining has become particularly popular among marketers. In fact, the traditional example of association rule mining is also known as market basket analysis. In this example, the dataset comes from grocery store purchases. The learning task is to find which items are frequently purchased together. This type of knowledge can be used by marketers to plan store layouts, and place frequently bought items in close proximity to each other, thus improving sales.

At an abstract level, association rule mining is concerned with relationships between items in a dataset. Association rule mining defines a given transaction as a subset of the set of possible items. Specific groupings of items are referred to as an itemset (note that in data mining literature, "itemset" is preferred over "item set.) Often, association rule mining seeks out itemsets which have a certain frequency or minimum support, meaning that they are represented in a relatively high number of transactions. These transactions are known simply as frequent itemsets.

Association rule mining is known to produce large numbers of rules. Unlike classification, association rules are seeking out relationships between all the attributes in the dataset, thus creating a large number of possible patterns. Many of these patterns are uninteresting or unimportant, and a great deal of work has been done to reduce the number of association rules produced in these cases.

One way to reduce the number of rules is to seek out what are known as maxpatterns and frequent closed itemsets. Maxpatterns are a type of pattern where "superpatterns" of itself are not frequent items. Superpatterns describe a set relationship where a particular pattern are entirely contained within another pattern. Frequent closed itemsets are frequent itemsets with no proper supersets. This means that there are no other sets that contain the frequent itemset.

The CLOSET Algorithm

The CLOSET algorithm was designed to extract frequent closed itemsets from large databases. Invented by Jain Pei, Jiawei Han and Runying Mao, it offers an efficient method for producing association rules. It minimizes both computational and cognitive cost in association rule analysis, by limiting the results to just frequent closed itemsets.

The algorithm begins by scanning for frequent items. The algorithm than divides the frequent items by finding just the frequent closed itemsets. The CLOSET technique continues by recursively mining the subsets of the frequent item closed sets. The algorithm then effectively creates conditional databases of the frequent closed-items seperately from the initial transactional database.

The actual mechanics of this process can become quite complex. It begins by counting the amount of support for items, and including any item above a minimum support level in a list, defined by the particular data being studied. The list of items meeting this criteria becomes the "f list" of frequently occurring items. The process of dividing the search space then takes each item and produces a new set for it, excluding each of the previous items. Next, the algorithm populates these sets by searching for items which meet the criteria for exclusion. This can be conceived of as creating a number of conditional databases of frequent closed itemsets.

The CLOSET approach is mainly remarkable for its efficiency, which is a result of four optimization methods. The first of these methods is compressing both the original transactional database and the generated conditional databases into an FP-tree structure. FP-trees, also known as prefix trees, are constructed such that transactions with the same prefix share portions of the path down the tree. The details of this structure are complex; suffice to say that this effectively compresses the databases.

The next two optimizations extract items in various ways. The second extracts every item appearing in the conditional databases of the frequent item subsets. This helps reduce the size of the FP-tree and improves the overall speed of the recursive process by combining some items. The third exploits the natural structure of the FP-tree by directly extracting frequent closed itemsets. Since the items have been arranged by prefix, this allows for natural harvesting of the closed itemsets. The final method prunes out frequent items which have the same level of support and can be expressed as a subset of other itemsets.

The CLOSET algorithm is a remarkably efficient but highly complex technique. It allows for reasonably fast mining of frequent itemsets from data with control over the number of rules or itemsets generated. For further details and a more rigorous description of the underlying mathematics, consult "CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets," written by the originators of this technique.


Sources:

Han J, Kamber M. Data Mining: Concepts and Techniques. Second edition, 2006. Morgan Kaufmann.

Pei,J, Han J., Mao R. CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets. Proc. 2000 ACM-SIGMOD Int. Workshop on Data Mining and Knowledge Discovery (DMKD'00), Dallas, TX, May 2000.

Witten IH, Frank E. Data Mining: Practical Machine Learning Tools and Techniques. Second edition, 2005. Morgan Kaufmann.


Go to the Index Page or back to the Top of this page