Market-Basket Model | Frequent ItemSets Mining | Association Rules

The Market-Basket Concept:

Suppose we work for a supermarket and want to know which sets of items often appear together in customers' baskets. To find out, we have to examine all the baskets and extract that information from them. In this scenario we assume that the number of baskets is very large compared with both the total number of items and the average number of items per basket.


Frequent Itemsets:

But what does "frequent" mean? To be more formal, we need to introduce a support threshold, denoted s. Given a set of items I, the support of I is the number of baskets of which I is a subset. I is then considered frequent if its support is greater than or equal to s.
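The definition above can be sketched in a few lines of Python. This is only an illustration: the function names `support` and `is_frequent`, the toy baskets, and the threshold s = 3 are assumptions made up for this example, and the linear scan over all baskets is not how a real mining algorithm would count.

```python
def support(itemset, baskets):
    """Return the number of baskets that contain every item of `itemset`."""
    target = frozenset(itemset)
    return sum(1 for basket in baskets if target <= set(basket))


def is_frequent(itemset, baskets, s):
    """An itemset is frequent if its support is at least the threshold s."""
    return support(itemset, baskets) >= s


# Hypothetical toy data: each basket is the set of items bought together.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]

print(support({"bread", "milk"}, baskets))            # prints 3
print(is_frequent({"bread", "milk"}, baskets, s=3))   # prints True
print(is_frequent({"diapers", "beer"}, baskets, s=3)) # prints False
```

With s = 3, {bread, milk} is frequent because it is a subset of three baskets, while {diapers, beer} appears in only two and is not.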

Although frequent itemsets were first used in market research, the concept can also be applied in other fields:

  • Related Concepts: words are the items and documents are the baskets, so words that frequently appear together may represent related concepts.
  • Biomarkers: the items are biomarkers (such as genes or blood proteins) and diseases, and each basket is the set of data about a patient. A frequent itemset consisting of one disease and one or more biomarkers might suggest a test for that disease.

Association Rules:

We can think of an association rule as an if-then statement. It is written $I \rightarrow j$, where I is a set of items and j is another item. The rule says that if all the items of I appear in a basket, then j is "likely" to appear in that basket as well.
To make "likely" precise, we define the confidence of an association rule $I \rightarrow j$:

The confidence of an association rule $I \rightarrow j$ is defined as the ratio of the support for $I \cup \{j\}$ to the support for I.

That is, the confidence of the rule is the fraction of the baskets containing all of I that also contain j.
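As a minimal sketch of this ratio, the snippet below computes the confidence of a rule directly from a list of baskets. The function names, the toy baskets, and the choice to raise an error when I appears in no basket (where the ratio is undefined) are assumptions made for illustration only.

```python
def support(itemset, baskets):
    """Return the number of baskets that contain every item of `itemset`."""
    target = frozenset(itemset)
    return sum(1 for basket in baskets if target <= set(basket))


def confidence(I, j, baskets):
    """Confidence of the rule I -> j: support(I ∪ {j}) / support(I)."""
    support_I = support(I, baskets)
    if support_I == 0:
        # Assumption: treat the undefined 0/0 case as an error.
        raise ValueError("the itemset I does not appear in any basket")
    return support(set(I) | {j}, baskets) / support_I


# Hypothetical toy data: each basket is the set of items bought together.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]

# Bread appears in 4 baskets, and 3 of them also contain milk.
print(confidence({"bread"}, "milk", baskets))  # prints 0.75
```

Here the rule {bread} → milk has confidence 3/4, while {diapers} → beer would have confidence 2/3, since diapers appear in three baskets and two of those also contain beer.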

In the next post, "Frequent Itemsets Mining: The A-Priori Algorithm in Python explained", I am going to talk about the most famous algorithm for frequent itemset mining: the A-Priori algorithm.

Note: there are several variants of the A-Priori algorithm, as well as entirely different algorithms. One worth mentioning is FP-Growth, which is usually faster than A-Priori.
