Mining Frequent Pattern And Association

Gopal Khadka
5 min read · Jun 4, 2024


https://image.slideserve.com/199186/frequent-pattern-mining-l.jpg

Frequent Pattern Mining is a fundamental data mining technique used to discover frequently occurring patterns, relationships, or associations within large datasets.

The goal of frequent pattern mining is to identify items, events, or subsets of data that appear together frequently in the dataset. These frequent patterns can provide valuable insights and help in tasks such as market basket analysis, recommendation systems, and anomaly detection.

For example, a set of items, such as milk and bread, that appear frequently together in a transaction data set, is a frequent item set.

Association Rules

Association rules are “if-then” rules that describe how likely items are to occur together in a large data set. They have many applications, for example in retail sales and medical data, where they are used to find probabilistic associations between data items.

Association rule mining is the process of finding correlations, frequent patterns, or associations in a database in order to derive meaningful insights and knowledge.

In market basket analysis (analysing the items a customer buys together in a single purchase), association rule mining finds which items tend to be bought together. It does not explain why they are bought together, so it can also surface associations that have no obvious logical explanation.

For example, in the picture above, Bread & Milk and Diaper & Beer are associated. The association between bread and milk is logical and reasonable, but for diapers and beer there is no obvious reason at first glance.

Market Basket Analysis

https://www.kdnuggets.com/2019/12/market-basket-analysis.html

Market basket analysis is a data mining technique for identifying associations between the items that a customer buys together in a single basket (purchase). It is widely used by commerce platforms (both online and physical) to understand their customers’ purchase patterns and increase sales and profit. It can also be used to optimize product placement, manage pricing, and target the right audience with marketing campaigns.

Key steps for market basket analysis are:

  1. Data Preprocessing: The collected data is cleaned and transformed as required. It typically records the goods sold, the customer, the time of purchase, and the store location.
  2. Frequent Item Set Mining: A frequent item set mining algorithm such as Apriori identifies sets of items that customers buy together.
  3. Association Rule Generation: Association rules describing the relationships between items are generated, together with measures such as support and confidence (discussed below).
  4. Rule Evaluation and Interpretation: The rules are analysed to find the insights most useful to the business, and the results are presented visually so they are easy to understand.
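
As a rough illustration of the preprocessing step, the sketch below groups raw sales rows into one basket per transaction. The `rows` data and the `to_baskets` helper are made up for this example and are not part of any particular library; real sales data will need more cleaning than this.

```python
from collections import defaultdict

# Hypothetical raw sales rows: (transaction_id, item). Invented for illustration.
rows = [
    (1, "bread"), (1, "milk"),
    (2, "bread"), (2, "diaper"), (2, "beer"),
    (3, "milk"), (3, "diaper"), (3, "beer"),
    (1, "Milk "),  # same item again, with messy casing and whitespace
]

def to_baskets(rows):
    """Group rows into one set of normalised item names per transaction."""
    baskets = defaultdict(set)
    for transaction_id, item in rows:
        # Normalise the name so "Milk " and "milk" count as the same item.
        baskets[transaction_id].add(item.strip().lower())
    return dict(baskets)

baskets = to_baskets(rows)
for transaction_id, items in sorted(baskets.items()):
    print(transaction_id, sorted(items))
```

Using a set per transaction means an item that is scanned twice in the same purchase is only counted once, which is what the support calculation expects.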

Concept of Support and Confidence

https://t4tutorials.com/support-confidence-minimum-support-frequent-itemset-in-data-mining/

In frequent pattern mining, the concepts of support and confidence are very important. Let’s talk about support first.

Support is the frequency of an item set (a set of items) in the database. It is the ratio of the number of transactions that contain the item set to the total number of transactions. Item sets whose support is equal to or greater than a minimum support threshold (which you define) are called frequent and are kept for further analysis, since they occur more often than the other item sets.
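
Here is a minimal sketch of the support calculation. The four transactions and the `support` function are assumptions made up for illustration, not data from this article.

```python
# Hypothetical toy transactions, invented for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "diaper", "beer"},
    {"milk", "diaper", "beer"},
    {"bread", "milk", "diaper"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)  # subset test
    return hits / len(transactions)

print(support({"bread"}, transactions))          # 0.75 (3 of 4 transactions)
print(support({"bread", "milk"}, transactions))  # 0.5  (2 of 4 transactions)
```

With a minimum support of 0.5, both item sets above would count as frequent in this toy data.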

Confidence measures the strength or reliability of an association rule. The confidence of a rule “X ⇒ Y” is the conditional probability of the consequent (Y) given the antecedent (X), estimated as the support of X and Y together divided by the support of X. It represents the likelihood that Y occurs in a transaction given that X occurs in it.

A high confidence value indicates a strong association between X and Y. As with support, a minimum threshold is used to filter out weak rules that do not meet the desired level of confidence.
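
Confidence can be sketched in the same spirit, as the support of the whole rule divided by the support of its antecedent. The transactions and function names below are hypothetical and only meant to show the arithmetic.

```python
# Hypothetical toy transactions, invented for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "diaper", "beer"},
    {"milk", "diaper", "beer"},
    {"bread", "milk", "diaper"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate P(consequent | antecedent) as support(X and Y) / support(X)."""
    x = frozenset(antecedent)
    xy = x | frozenset(consequent)
    support_x = support(x, transactions)
    if support_x == 0:
        return 0.0  # the antecedent never occurs, so the rule has no evidence
    return support(xy, transactions) / support_x

print(confidence({"diaper"}, {"beer"}, transactions))  # about 0.67
print(confidence({"beer"}, {"diaper"}, transactions))  # 1.0
```

Note that confidence is not symmetric: in this toy data every beer purchase includes diapers, but only two of the three diaper purchases include beer.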

Apriori Algorithm

https://image3.slideserve.com/5761955/the-apriori-algorithm-example-l.jpg

The Apriori algorithm is one of the most widely used algorithms for finding frequent item sets in a database. It was proposed by R. Agrawal and R. Srikant in 1994 for finding frequent item sets and generating association rules. It is called Apriori because it uses prior knowledge about frequent item sets in the database.

Apriori property: All non-empty subsets of a frequent item set must also be frequent. Equivalently, all supersets of an infrequent item set must be infrequent.
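
This property is what lets the algorithm discard a candidate without counting it: if any smaller subset of the candidate is already known to be infrequent, the candidate cannot be frequent. The check below is a small sketch of that idea; `has_infrequent_subset` and the `frequent_2` data are hypothetical names and values chosen for the example.

```python
from itertools import combinations

def has_infrequent_subset(candidate, frequent_smaller):
    """True if some (k-1)-item subset of a k-item candidate is not frequent."""
    k = len(candidate)
    return any(
        frozenset(subset) not in frequent_smaller
        for subset in combinations(candidate, k - 1)
    )

# Hypothetical frequent 2-item sets, invented for illustration.
frequent_2 = {
    frozenset(pair)
    for pair in [
        ("bread", "milk"),
        ("bread", "butter"),
        ("milk", "butter"),
        ("bread", "eggs"),
    ]
}

# Every pair inside {bread, milk, butter} is frequent, so it stays a candidate.
print(has_infrequent_subset(frozenset({"bread", "milk", "butter"}), frequent_2))  # False

# {milk, eggs} is not frequent, so {bread, milk, eggs} can be pruned immediately.
print(has_infrequent_subset(frozenset({"bread", "milk", "eggs"}), frequent_2))  # True
```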

How does the Apriori algorithm work?

1. Frequent Item set generation

The algorithm starts by finding all individual items (1-item sets) whose support is equal to or greater than the minimum support threshold.
It then iteratively generates candidate k-item sets (k = 2, 3, 4, ...) by combining frequent (k-1)-item sets. For example, if we get 4 frequent individual items, we can combine them pairwise to make candidate item sets of 2 items.
The support of each candidate is calculated, and only the candidates that meet the minimum support threshold are kept for the next round. The process stops when no new frequent item sets are found.
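
The level-wise search described above could be sketched as follows. This is a simplified illustration rather than an optimised implementation, and the `frequent_itemsets` function and the toy transactions are assumptions made for the example.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Level-wise search: return {itemset: support} for all frequent item sets."""
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Level 1: keep the individual items that meet the threshold.
    items = {item for t in transactions for item in t}
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s

    result = dict(current)
    k = 2
    while current:
        previous = set(current)
        # Join: unions of two frequent (k-1)-item sets that have exactly k items.
        candidates = {a | b for a in previous for b in previous if len(a | b) == k}
        # Prune: every (k-1)-item subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in previous for s in combinations(c, k - 1))
        }
        # Count: keep only the candidates that meet the support threshold.
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        result.update(current)
        k += 1
    return result

# Hypothetical toy transactions, invented for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "diaper", "beer"},
    {"milk", "diaper", "beer"},
    {"bread", "milk", "diaper"},
]
result = frequent_itemsets(transactions, min_support=0.5)
for itemset, s in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

On this toy data the search stops after the 3-item level, because the only candidate that survives pruning, {bread, diaper, milk}, appears in just one of the four transactions.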

2. Association rule generation

Once the frequent item sets have been identified, the algorithm generates association rules from them. Each frequent item set is partitioned into two non-empty subsets, an antecedent and a consequent, in every possible way. The confidence of each partition is calculated, and only the rules that satisfy the minimum confidence threshold are kept.
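
A minimal sketch of this step is shown below. It assumes the supports of the frequent item sets are already available in a dictionary; `rules_from_itemset` and the `support_of` values are hypothetical and only serve the illustration.

```python
from itertools import combinations

def rules_from_itemset(itemset, support_of, min_confidence):
    """Return (antecedent, consequent, confidence) for each strong enough split."""
    itemset = frozenset(itemset)
    rules = []
    # Try every antecedent size from 1 up to len(itemset) - 1.
    for size in range(1, len(itemset)):
        for antecedent in combinations(sorted(itemset), size):
            x = frozenset(antecedent)
            y = itemset - x  # the consequent is whatever is left over
            conf = support_of[itemset] / support_of[x]
            if conf >= min_confidence:
                rules.append((x, y, conf))
    return rules

# Hypothetical supports of frequent item sets, invented for illustration.
support_of = {
    frozenset({"diaper"}): 0.75,
    frozenset({"beer"}): 0.5,
    frozenset({"diaper", "beer"}): 0.5,
}

rules = rules_from_itemset({"diaper", "beer"}, support_of, min_confidence=0.8)
for x, y, conf in rules:
    print(sorted(x), "=>", sorted(y), conf)
```

With a threshold of 0.8, only the rule beer ⇒ diaper survives here; diaper ⇒ beer has a confidence of about 0.67 and is dropped.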

We have already discussed the concepts of support and confidence above, so let’s work through an example:

Example dataset for the algorithm

Let’s say the minimum support is 50% and the minimum confidence is 60%.

Step 1: Frequent Item set Generation

- 1-itemsets: {Bread}, {Butter}, {Milk}, {Eggs}
- 2-itemsets: {Bread, Butter}, {Bread, Milk}, {Bread, Eggs}, {Butter, Milk}, {Butter, Eggs}, {Milk, Eggs}
- 3-itemsets: {Bread, Butter, Milk}, {Bread, Butter, Eggs}, {Bread, Milk, Eggs}, {Butter, Milk, Eggs}
- 4-itemsets: {Bread, Butter, Milk, Eggs}

At the 4-item level, only one item set remains that satisfies the minimum support threshold.

Step 2: Association rule generation

Now that we have a frequent item set of 4 items, we can generate many candidate association rules by splitting it into an antecedent and a consequent: 1 item to 3 items, 2 to 2, or 3 to 1. This gives 2^4 - 2 = 14 possible rules. After that, we keep only the rules whose confidence satisfies the minimum confidence threshold.
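
To make the number of possible rules concrete, the sketch below enumerates every way of splitting the 4-item set into an antecedent and a consequent. The `candidate_rules` helper is a hypothetical name used only for this example, and no confidence values are computed because that requires the transaction data.

```python
from itertools import combinations

def candidate_rules(itemset):
    """List every (antecedent, consequent) split of `itemset` into two non-empty parts."""
    items = sorted(itemset)
    rules = []
    for size in range(1, len(items)):
        for antecedent in combinations(items, size):
            consequent = tuple(i for i in items if i not in antecedent)
            rules.append((antecedent, consequent))
    return rules

rules = candidate_rules({"Bread", "Butter", "Milk", "Eggs"})
print(len(rules))  # 14
for antecedent, consequent in rules[:3]:
    print(antecedent, "=>", consequent)
```

The 14 rules break down into 4 with one item on the left, 6 with two items on the left, and 4 with three items on the left.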

Limitations of Apriori Algorithm

  1. Computationally expensive because of the large number of candidate item sets it generates
  2. Requires multiple scans of the database for frequent item set generation and association rule generation
  3. Works poorly on sparse datasets because of the weaker correlations between items
  4. Performs well on categorical data but not on numeric data

