Frequent Itemsets Mining: The A-Priori Algorithm in Python explained.

In the previous post "Market-Basket Model | Frequent ItemSets Mining | Association Rules" we introduced concepts related to the Frequent Itemsents Mining task, while in this one we are going to explain the A-Priori Algorithm and we'll implement this famous algorithm in pure Pyhton3.x to achieve such a complex task.

A basket can be represented as a set of integers such as `{1, 3, 9, 11}`. In order to find frequent itemsets given a basket there is the (theoretical) need to compute all its possible subsets, this can be computationally expensive. Indeed if we consider a basket composed of $ n $ elements and we want to compute just subsets of two elements then the number of all possible subsets of two (or k=2) elements is going to be $ \binom{n}{k} = \frac{n!}{k!(n-k)!} $ (see combinations or the proof). In the case of a basket of 20 items the number of all the possible subsets composed by two elements is 190, while for subset with cardinality 3 the number of all possible subsets rises to 1140, while the number of all subsets with cardinality from 0 to n is going to be $2^n$. Luckly, often in real world problems we only need to compute small frequent itemsets with 2 or 3 elements possibly with high support in order to infer association rules with high confidence, making the computational process not so expensive and not so memory greedy. Furthermore often it is possible to compute frequent itemsets for bigger k because many of the items in each basket won't be able to participate to the frequent itemsets.

The Monotonicity of Itemsets

The A-Priori algorithm takes advantage of just one property/observation called the monotonicity of itemsets:

If a set $ I $ of items is frequent, then also every subset of $ I $ is frequent.

Indeed, given $ J \subseteq I $ then all the baskets that contain all the items of $ I $ contain also all the items of $ J $. It follows that the count of $ J $ is at least equal to the count of  $ I $. The A-Priori algorithm makes use of this property to make information more compact discarding useless information.

Given a support threshold $ s $ then we call an itemset maximal if no superset is frequent. If we store just maximal itemsets we know that all the subsets of a maximal are frequent and we also know that a set that is not a subset of a maximal can not be frequent.
In short:

  • Given a maximal all its subsets are frequent.
  • A superset of a maximal is not frequent.

The A-Priori Algorithm

As mentioned the algorithm tries to avoid to process useless information discarding some sets a-priori given the previously mentioned monotonicity property of itemsets. First of all the algorithm iterates through all the baskets counting the occurrences of each item. After this process the algorithm finds all the frequent singletons according with the selected threshold s. On the second stage the algorithm counts all the pairs composed by frequent items discarding all the pairs composed by at least one singleton not satisfying the threshold and so on for bigger itemsets.

The algorithm makes a pass over the data for each set-size k. For each size k there will be two sets of itemsets:

  1. $ C_k $ is the set of candidate itemsets of size $ k $.
  2. $ L_k $ is the set of truly frequent itemsets of size $ k $.

So the algorithm for each  $ k $ alternates two phases: constructing the candidate sets and filtering the frequent ones out of them. $ C_1 $ is going to contain all the singletons, while $ L_1 $ is going to be populated with the elements which appear within baskets at least as many times as the support threshold s. Then the set $ C_2 $ is going to be the set of all the candidate pairs made out of the frequent singletons contained in $ L_1 $. Then again the algorithm passes through the data to check how many times each candidate pair appears within baskets discarding the ones not satisfying the threshold and so on for bigger itemsets... The algorithm can be stopped because of several reasons: no more candidate can be generated (no over the threshold itemsets) or is it possible to stop it if we've found frequent itemsets of a certain previously defined itemset cardinality.

Python3.x implementation:

Follows the Python3 implementation of the A-Priori algorithm for itemsets mining. The the main function is get_frequent_items_sets which takes as input the dataset (a list of lists of integers representing a list of baskets), the support and the number of iterations. The rest of the code is quite self explainatory and abusyvely commented, thus, it's possible to read it all along the way and understand the main points even if one does not really speak Python.  

from collections import defaultdict

# input:
#     data: List of lists containing numeric values
#     min_sup: Minimum support ti be considered frequent (we select values that appears at least many times as s in baskets)
#	  steps: Number of steps in case we would like to find just couples takes 2 (default 0 terminates when there are no more candidates to compute)
#  returns:
#	  solution: dictionary of sets of frozensets representing frequent itemsets, contains both numeric values and tuples.
#		(be carefull on iterating on it)

def get_frequent_items_sets(data,min_sup,steps=0):

	# we transform the dataset in a list of sets
	transactions = list()

	# Temporary dictionary to count occurrences
	items = defaultdict(lambda: 0)

	# Returned dictionary
	solution = dict()
	L_set = set()

	# Fills transactions and counts "singletons"
	for line in data:
		transaction = set(line)
		transactions.append(transaction)
		for element in transaction:
			items[element]+=1

	# Add to the solution all frequent items
	for item, count in items.items():
		if count >= min_sup:
			L_set.add(frozenset([item]))

	# Generalize the steps it ends when there are no candidates
	# or if the user provided as input the number of parameters
	k = 2
	solution[k-1] = L_set
	while L_set != set([]) and k != steps+1:
		L_set = create_candidates(L_set,k)
		C_set = frequent_items(L_set,transactions,min_sup)
		if C_set != set([]): solution[k]=C_set
		L_set = C_set
		k = k + 1
	return solution


# Creates candidates joining the same set
# input:
#    itemSet: a set of frequent items
#     length: the cardinality of generated combinations
#  returns:
#	   set: a set containing all combinations with cardinality equal to length
#		(be carefull on iterating on it)

def create_candidates(itemSet, length):
        return set([i.union(j) for i in itemSet for j in itemSet if len(i.union(j)) == length])


# Checks occurrences of items in transactions and returns frequent items
# input:
#    items: a set of frequent items (candidates)
#     transactions: list of sets representing baskets
#  returns:
#	   _itemSet: a set containing all frequent candidates (a subset of inputs)

def frequent_items(items, transactions, min_sup):
	_itemSet = set()
	counter = defaultdict(int)
	localDict = defaultdict(int)
	for item in items:
		for transaction in transactions:
			if item.issubset(transaction):
				localDict[item] += 1

	for item, count in localDict.items():
		if count >= min_sup:
			_itemSet.add(item)
	return _itemSet

example execution code:

In [1]: get_frequent_items_sets([[1,2,3],[0,1,2],[4,3,1,2],[1,2,3,4]], min_sup=3)

Out[1]: 
{1: {frozenset({3}), frozenset({2}), frozenset({1})},
 2: {frozenset({1, 3}), frozenset({2, 3}), frozenset({1, 2})},
 3: {frozenset({1, 2, 3})}}

Here at this address I suggest to give a read to a similar Python implementation wich outputs also association rules.

The code contained within this article has to be considered for illustration proposes and not for production environments (do not reinvent the wheel and check for some robust libraries!)

BONUS: here you can find some more elaborated solutions (including the state of the art: FP-Growth) to compute frequent itemsets mining both in pure python (serial) and making use of PySpark (map-reduce)!

That was all!

Show Comments