Mining Association Rules

Mining Association Rules:-

Multi Level Association Rules:-

Multi level association rules are the rules generated for the use of market analysis or the business purpose so as to decide the further strategies of the market and help them gain their profit. The rules generated will be used by them to decide which strategy to be applied for increasing profit.

Items often form hierarchy. Items at the lower level are expected to have lower support. Rules regarding itemset at appropriate levels could be quite useful. Transaction database can be encoded based on dimensions and levels.

Multilevel association rules, also known as multi-level association rules or hierarchical association rules, extend the concept of traditional association rule mining by considering hierarchical relationships between items. In traditional association rule mining, transactions are typically represented as a flat list of items without considering any hierarchical structure. However, in many real-world scenarios, items may have hierarchical relationships, such as product categories, organizational hierarchies, or taxonomies. Multilevel association rule mining aims to discover association rules that involve items at different levels of the hierarchy, allowing for more meaningful and interpretable insights. These rules capture dependencies between items at different levels of abstraction, providing a richer understanding of the data.

Why multi level association rules:-

Sometimes, at primitive data level, data does not show any significant pattern. But there are useful information hiding behind.

The goal of Multiple-Level Association Analysis is to find the hidden information in or between levels of abstraction.

Requirements:-

Two general requirements to do multiple-level association rule mining:

1) Provide data at multiple levels of abstraction. (a common practice now)

2) Find efficient methods for multiple-level rule mining. (our

focus)

Min-support is set and if we have inly single min support then we have some drawbacks as:-

If the min-sup is too high, we are losing information from lower levels.

If the min-sup is too low, we are gaining too many rules from higher levels, many of them are useless.

The process of mining includes several steps as:-

Data preparation.
Item generation.
Rule generation.
Rule evaluation.
Rule selection and interpretation

Multilevel association rule mining is useful in various domains such as retail, e-commerce, recommendation systems, and supply chain management, where hierarchical relationships exist among items or entities. It enables organizations to uncover complex dependencies and patterns that may not be captured by traditional association rule mining techniques.

Frequent Itemset Mining methods:-

Frequent itemset mining is a data mining technique used to identify sets of items (itemsets) that frequently occur together in transactional datasets. These itemsets are considered "frequent" if their occurrence frequency meets or exceeds a predefined minimum support threshold. Frequent itemset mining is a fundamental step in association rule mining, where association rules are derived from frequent itemsets. Algorithms have been developed for mining. some of them are:-

Apriori Algorithm
ECLAT Algorithm
FP Growth
Hash based Algorithm

These algorithms are foundational in frequent itemset mining and are widely used in various applications such as market basket analysis, recommender systems, and bioinformatics. The choice of algorithm depends on factors such as the size of the dataset, the available computational resources, and the desired scalability and efficiency.

Apriori Algorithm:-

The Apriori algorithm is a classic and fundamental algorithm in data mining and frequent itemset mining. It is used to discover frequent itemsets in a transactional database and to derive association rules from these itemsets. Developed by Agrawal and Srikant in 1994, the Apriori algorithm efficiently identifies itemsets that meet a minimum support threshold, which indicates the minimum frequency required for an itemset to be considered "frequent."

The steps are as follows:-

Generate Candidate Itemsets:-

Begin by identifying all unique items present in the dataset. These items will form the initial set of candidate 1-itemsets.
The algorithm then iteratively generates candidate itemsets of higher lengths (k) based on the frequent (k-1)-itemsets discovered in the previous iteration.
To generate candidate (k+1)-itemsets, Apriori joins pairs of frequent k-itemsets that share the same (k-1) items. Only the itemsets resulting from these joins that satisfy the Apriori property are considered candidate itemsets.

Prune Candidate Itemsets:-

After generating candidate itemsets, the algorithm prunes those that contain subsets that are not frequent. This pruning step exploits the Apriori property, which states that if an itemset is infrequent, all its supersets will also be infrequent.
By eliminating candidate itemsets that do not meet the minimum support threshold, the Apriori algorithm reduces the number of itemsets that need to be counted in subsequent iterations, leading to computational efficiency.

Count Support for Candidate Itemsets:-

After pruning, the remaining candidate itemsets are counted in the transactional database to determine their support, i.e., the frequency of occurrence in the dataset.
The support count of each candidate itemset is compared to the minimum support threshold. Only itemsets that meet or exceed the minimum support threshold are considered frequent.

Repeat Iterations:-

The process of generating candidate itemsets, pruning, and counting support is repeated until no new frequent itemsets can be discovered.

Extract Association Rules:-

Once frequent itemsets have been identified, association rules are derived from these itemsets.
Association rules are generated by partitioning frequent itemsets into non-empty subsets (antecedent) and their complements (consequent) and calculating the confidence of each rule.
Association rules that meet minimum confidence thresholds are retained as the final output.

The Apriori algorithm is widely used in various applications, including market basket analysis, recommendation systems, and web usage mining. Despite its efficiency improvements, Apriori's performance can degrade with large datasets due to its combinatorial nature. However, optimizations such as pruning strategies, efficient data structures, and parallelization techniques can help mitigate these scalability challenges.

Algorithm for Apriori

Step 1: Load the dataset containing transactions.

Step 2: Generate a list of frequent 1-itemsets by counting the occurrences of each item in the transactions and selecting only the items that meet the minimum support threshold.

Step 3: Using 1-itemsets calculate a minimum support.

Step 4: Repeat the following steps until no more frequent itemsets can be found:

4.1 Generate a list of candidate itemsets by joining the frequent (k-1)-itemsets with each other

to form k-itemsets.

4.2 Prune the candidate itemsets by removing those that contain subsets that are not frequent (k-

1)-itemsets.

4.3 Count the occurrences of each candidate itemset in the transactions and select only the

itemsets that meet the minimum support threshold to form the frequent k-itemsets.

4.4 Generate association rules from the frequent k-itemsets that satisfy the minimum confidence

threshold.

Step 5: Return the frequent itemsets and association rules.

ECLAT Algorithm:-

Eclat (Equivalence Class Clustering and Bottom-Up Lattice Traversal) is another popular algorithm for frequent itemset mining, similar to the Apriori algorithm. Developed by Zaki et al. in 1997, Eclat efficiently identifies frequent itemsets without generating candidate itemsets explicitly, making it particularly suitable for datasets with a large number of transactions and items. Both Apriori and FP-growth use horizontal data format. Alternatively data can also be represented in vertical format. The steps to be followed are:-

Transaction Database Representation:-

Eclat begins by representing the transactional database in a vertical format, where each transaction is represented as a list of items along with their corresponding transaction identifiers (TID).
Transactions are sorted based on item frequency, and items within each transaction are arranged in ascending order.

Generate Initial Itemsets:-

Eclat initializes with a set of single-item frequent itemsets, each containing a single item and the list of TIDs in which it occurs.
These initial itemsets are sorted based on item frequency.

Frequent Itemset Mining:-

Eclat uses a depth-first search (DFS) approach to recursively enumerate frequent itemsets.
At each level of the search, the algorithm joins pairs of frequent itemsets that share a common prefix (i.e., have the same items except for the last one) to form candidate itemsets.
The intersection of the TID lists of the joined itemsets is used to determine the support count of the candidate itemset. If the support count exceeds the minimum support threshold, the itemset is considered frequent.

Recursive Exploration:-

Eclat recursively explores the search space, pruning branches that cannot lead to frequent itemsets based on the minimum support threshold.
The algorithm stops exploring a branch when it encounters an empty or non-frequent itemset.

Association Rule Generation:-

After identifying frequent itemsets, association rules can be derived from these itemsets, similar to the Apriori algorithm.
Association rules are generated by partitioning frequent itemsets into non-empty subsets (antecedent) and their complements (consequent) and calculating the confidence of each rule.

Eclat is known for its efficiency and scalability, especially for datasets with high dimensionality or a large number of transactions. By leveraging vertical transaction representation and recursive exploration, Eclat avoids generating candidate itemsets explicitly, which can result in significant computational savings, particularly for sparse datasets.

While Eclat offers advantages in terms of efficiency, it may not be as effective as the Apriori algorithm in scenarios where the dataset is dense or contains a large number of infrequent itemsets. Additionally, the complexity of association rule generation in Eclat can be higher compared to Apriori, especially for datasets with a high number of frequent itemsets. Overall, the choice between Apriori and Eclat depends on the specific characteristics of the dataset and the computational resources available.

Properties of mining with vertical data format:-

Take the advantage of the Apriori property in the generation of candidate (k+1)-itemset from k-itemsets.
No need to scan the database to find the support of (k+1) itemsets, for k>=1.
The TID_set of each k-itemset carries the complete information required for counting such support.
The TID-sets can be quite long, hence expensive to manipulate.
Use diffset technique to optimize the support count computation

Algorithm for Eclat

Step 1: Load the dataset containing transactions and convert it into the vertical format, where each

item is associated with a list of transaction IDs that contain the item.

Step 2: Sort the items in decreasing order of their support counts.

Step 3: For each item, find its equivalence class, which is the set of all items that share the same

transaction IDs as the item.

Step 4: Recursively generate frequent itemsets by merging the equivalence classes of the items that

satisfy the minimum support threshold.

Step 5: Generate the frequent itemsets.

FP Growth:-

The FP-Growth (Frequent Pattern Growth) algorithm is a powerful and efficient method for mining frequent itemsets in transactional databases, similar to the Apriori and Eclat algorithms. Developed by Han et al. in 2000, FP-Growth is particularly well-suited for datasets with a large number of transactions and items. It efficiently discovers frequent itemsets without the need to generate candidate itemsets explicitly, making it highly scalable and effective for high-dimensional datasets. The steps are as follows:-

Transaction Database Representation:-

Like Eclat, FP-Growth begins by representing the transactional database in a vertical format, where each transaction is represented as a list of items along with their corresponding transaction identifiers (TID).
Transactions are sorted based on item frequency, and items within each transaction are arranged in descending order of frequency.

Construct FP-Tree:-

FP-Growth constructs a compact data structure called the FP-tree (Frequent Pattern tree) to represent the transaction database efficiently.
The FP-tree consists of a root node and a set of conditional trees, each corresponding to a frequent item in the dataset.
Transactions are sequentially inserted into the FP-tree, and frequent items are linked together based on their occurrence in transactions.

Mine Frequent Itemsets:-

Once the FP-tree has been constructed, FP-Growth recursively mines frequent itemsets by traversing the tree.
At each node of the FP-tree, the algorithm explores all possible combinations of item paths (conditional patterns) to generate frequent itemsets.
Frequent itemsets are identified by recursively exploring conditional FP-trees, which represent conditional databases obtained by removing infrequent items.

Association Rule Generation:-

After identifying frequent itemsets, association rules can be derived from these itemsets, similar to the Apriori and Eclat algorithms.
Association rules are generated by partitioning frequent itemsets into non-empty subsets (antecedent) and their complements (consequent) and calculating the confidence of each rule.

FP-Growth offers several advantages over traditional algorithms such as Apriori and Eclat.

Efficient Data Structure: The FP-tree data structure allows for compact representation of transactional data and facilitates efficient pattern mining.
Reduced Candidate Generation: FP-Growth does not generate candidate itemsets explicitly, leading to fewer scans of the transactional database and improved scalability.
Parallelization: FP-Growth can be parallelized to leverage multi-core architectures and distributed computing frameworks for further performance improvements.

Overall, FP-Growth is a powerful and versatile algorithm for frequent itemset mining, especially in scenarios involving large-scale datasets with high dimensionality. It has been widely adopted in various domains such as retail, market basket analysis, recommendation systems, and bioinformatics.

Algorithm for FP Growth

Step 1: Take the dataset.

Step 2: Calculate the support for each item.

Step 3: Select only those items that have more than minimum support.

Step 4: Arrange the items in descending order, And store in list1.

Step 5: Ordered according to sorted order:

For i from 2 to max_row+1:

For j from 0 to length of list1:

if list1[j] == cell(1,j):

ordered = ordered + cell(i, j)

break;

Step 6: Generate the fp tree.

Step 7: Generate the frequent items.

Hash based Algorithm:-

Hash-based algorithms for frequent itemset mining are techniques that utilize hash functions to efficiently handle large-scale datasets and reduce memory requirements. These algorithms are particularly useful when dealing with high-dimensional data or when memory constraints are a concern. The steps are as follows:-

Counting and Pruning with Hash Tables:-

In this approach, each itemset is assigned a unique hash code based on its items. Hash tables are then used to count the occurrence of itemsets in the dataset efficiently.
The algorithm maintains a hash table to store counts for each itemset encountered during the scan of the transactional database.
After scanning the database, infrequent itemsets can be pruned based on their count in the hash table.

Hash Tree:-

The hash tree is a hierarchical data structure that combines the benefits of hash tables and trees.
In a hash tree, each node contains a hash table for storing counts of itemsets. Child nodes are organized based on a hash function applied to the itemsets, allowing for efficient traversal and pruning of infrequent itemsets.

Multiresolution Mining:-

Multiresolution mining techniques utilize hash functions to hash itemsets into buckets based on different levels of resolution.
Items are grouped into buckets based on their hash codes, and itemsets are formed by combining items from different buckets.
By varying the resolution of the hash function, the algorithm can control the level of detail at which itemsets are generated, allowing for efficient mining of frequent itemsets at different levels of granularity.

Randomized Hashing:-

Randomized hashing techniques leverage random hash functions to distribute itemsets across hash tables or buckets.
These algorithms introduce randomness into the process of itemset counting and mining, which can help mitigate skewness in the distribution of itemsets and improve load balancing across hash tables.
Randomized hashing can be combined with other pruning techniques to further improve the efficiency of frequent itemset mining.

Hash-based algorithms offer several advantages for frequent itemset mining, including:

Memory Efficiency: Hash-based algorithms require less memory compared to traditional algorithms like Apriori, especially for large-scale datasets.
Scalability: Hash-based techniques can handle high-dimensional data and large transactional databases efficiently.
Load Balancing: Hashing helps distribute itemsets evenly across hash tables, reducing the likelihood of memory bottlenecks and improving overall performance.

Overall, hash-based algorithms are effective tools for frequent itemset mining, particularly in scenarios where memory constraints or scalability issues are a concern. They have been successfully applied in various domains such as market basket analysis, recommendation systems, and bioinformatics.

Algorithm for hash based

Step 1: Take the dataset.

Step 2: Assign each bucket count=0;

Step 3:

For i from 2 to max_row+1

For j from 1 to number of column

For k from j+1 to number of column + 1

if cell(i, j) == 1 AND cell(i, k) == 1:

print cell(1, j) , cell(1, k)

hashfun = (j*10+k)%7

(Note : according to hash function value increment related bucket count)

Step 4: calculate support(take the average of bucket counts).

Step 5: If the bucket count >= support then

print 1

else print 0

Search This Blog

Beginners guide to Data Science