Mining Association Rules
Mining Association Rules:-
Multi Level Association Rules:-
Multi level association rules are the rules generated for the use of market analysis or the business purpose so as to decide the further strategies of the market and help them gain their profit. The rules generated will be used by them to decide which strategy to be applied for increasing profit.
Items often form hierarchy. Items at the lower level are expected to have lower support. Rules regarding itemset at appropriate levels could be quite useful. Transaction database can be encoded based on dimensions and levels.
Multilevel association rules, also known as multi-level association rules or hierarchical association rules, extend the concept of traditional association rule mining by considering hierarchical relationships between items. In traditional association rule mining, transactions are typically represented as a flat list of items without considering any hierarchical structure. However, in many real-world scenarios, items may have hierarchical relationships, such as product categories, organizational hierarchies, or taxonomies. Multilevel association rule mining aims to discover association rules that involve items at different levels of the hierarchy, allowing for more meaningful and interpretable insights. These rules capture dependencies between items at different levels of abstraction, providing a richer understanding of the data.
- Data preparation.
- Item generation.
- Rule generation.
- Rule evaluation.
- Rule selection and interpretation
- Apriori Algorithm
- ECLAT Algorithm
- FP Growth
- Hash based Algorithm
Apriori Algorithm:-
- Begin by identifying all unique items present in the dataset. These items will form the initial set of candidate 1-itemsets.
- The algorithm then iteratively generates candidate itemsets of higher lengths (k) based on the frequent (k-1)-itemsets discovered in the previous iteration.
- To generate candidate (k+1)-itemsets, Apriori joins pairs of frequent k-itemsets that share the same (k-1) items. Only the itemsets resulting from these joins that satisfy the Apriori property are considered candidate itemsets.
- After generating candidate itemsets, the algorithm prunes those that contain subsets that are not frequent. This pruning step exploits the Apriori property, which states that if an itemset is infrequent, all its supersets will also be infrequent.
- By eliminating candidate itemsets that do not meet the minimum support threshold, the Apriori algorithm reduces the number of itemsets that need to be counted in subsequent iterations, leading to computational efficiency.
- After pruning, the remaining candidate itemsets are counted in the transactional database to determine their support, i.e., the frequency of occurrence in the dataset.
- The support count of each candidate itemset is compared to the minimum support threshold. Only itemsets that meet or exceed the minimum support threshold are considered frequent.
- The process of generating candidate itemsets, pruning, and counting support is repeated until no new frequent itemsets can be discovered.
- Once frequent itemsets have been identified, association rules are derived from these itemsets.
- Association rules are generated by partitioning frequent itemsets into non-empty subsets (antecedent) and their complements (consequent) and calculating the confidence of each rule.
- Association rules that meet minimum confidence thresholds are retained as the final output.
ECLAT Algorithm:-
- Eclat begins by representing the transactional database in a vertical format, where each transaction is represented as a list of items along with their corresponding transaction identifiers (TID).
- Transactions are sorted based on item frequency, and items within each transaction are arranged in ascending order.
- Eclat initializes with a set of single-item frequent itemsets, each containing a single item and the list of TIDs in which it occurs.
- These initial itemsets are sorted based on item frequency.
- Eclat uses a depth-first search (DFS) approach to recursively enumerate frequent itemsets.
- At each level of the search, the algorithm joins pairs of frequent itemsets that share a common prefix (i.e., have the same items except for the last one) to form candidate itemsets.
- The intersection of the TID lists of the joined itemsets is used to determine the support count of the candidate itemset. If the support count exceeds the minimum support threshold, the itemset is considered frequent.
- Eclat recursively explores the search space, pruning branches that cannot lead to frequent itemsets based on the minimum support threshold.
- The algorithm stops exploring a branch when it encounters an empty or non-frequent itemset.
- After identifying frequent itemsets, association rules can be derived from these itemsets, similar to the Apriori algorithm.
- Association rules are generated by partitioning frequent itemsets into non-empty subsets (antecedent) and their complements (consequent) and calculating the confidence of each rule.
- Take the advantage of the Apriori property in the generation of candidate (k+1)-itemset from k-itemsets.
- No need to scan the database to find the support of (k+1) itemsets, for k>=1.
- The TID_set of each k-itemset carries the complete information required for counting such support.
- The TID-sets can be quite long, hence expensive to manipulate.
- Use diffset technique to optimize the support count computation
FP Growth:-
- Like Eclat, FP-Growth begins by representing the transactional database in a vertical format, where each transaction is represented as a list of items along with their corresponding transaction identifiers (TID).
- Transactions are sorted based on item frequency, and items within each transaction are arranged in descending order of frequency.
- FP-Growth constructs a compact data structure called the FP-tree (Frequent Pattern tree) to represent the transaction database efficiently.
- The FP-tree consists of a root node and a set of conditional trees, each corresponding to a frequent item in the dataset.
- Transactions are sequentially inserted into the FP-tree, and frequent items are linked together based on their occurrence in transactions.
- Once the FP-tree has been constructed, FP-Growth recursively mines frequent itemsets by traversing the tree.
- At each node of the FP-tree, the algorithm explores all possible combinations of item paths (conditional patterns) to generate frequent itemsets.
- Frequent itemsets are identified by recursively exploring conditional FP-trees, which represent conditional databases obtained by removing infrequent items.
- After identifying frequent itemsets, association rules can be derived from these itemsets, similar to the Apriori and Eclat algorithms.
- Association rules are generated by partitioning frequent itemsets into non-empty subsets (antecedent) and their complements (consequent) and calculating the confidence of each rule.
- Efficient Data Structure: The FP-tree data structure allows for compact representation of transactional data and facilitates efficient pattern mining.
- Reduced Candidate Generation: FP-Growth does not generate candidate itemsets explicitly, leading to fewer scans of the transactional database and improved scalability.
- Parallelization: FP-Growth can be parallelized to leverage multi-core architectures and distributed computing frameworks for further performance improvements.
Hash based Algorithm:-
- In this approach, each itemset is assigned a unique hash code based on its items. Hash tables are then used to count the occurrence of itemsets in the dataset efficiently.
- The algorithm maintains a hash table to store counts for each itemset encountered during the scan of the transactional database.
- After scanning the database, infrequent itemsets can be pruned based on their count in the hash table.
- The hash tree is a hierarchical data structure that combines the benefits of hash tables and trees.
- In a hash tree, each node contains a hash table for storing counts of itemsets. Child nodes are organized based on a hash function applied to the itemsets, allowing for efficient traversal and pruning of infrequent itemsets.
- Multiresolution mining techniques utilize hash functions to hash itemsets into buckets based on different levels of resolution.
- Items are grouped into buckets based on their hash codes, and itemsets are formed by combining items from different buckets.
- By varying the resolution of the hash function, the algorithm can control the level of detail at which itemsets are generated, allowing for efficient mining of frequent itemsets at different levels of granularity.
- Randomized hashing techniques leverage random hash functions to distribute itemsets across hash tables or buckets.
- These algorithms introduce randomness into the process of itemset counting and mining, which can help mitigate skewness in the distribution of itemsets and improve load balancing across hash tables.
- Randomized hashing can be combined with other pruning techniques to further improve the efficiency of frequent itemset mining.
- Memory Efficiency: Hash-based algorithms require less memory compared to traditional algorithms like Apriori, especially for large-scale datasets.
- Scalability: Hash-based techniques can handle high-dimensional data and large transactional databases efficiently.
- Load Balancing: Hashing helps distribute itemsets evenly across hash tables, reducing the likelihood of memory bottlenecks and improving overall performance.
Comments
Post a Comment