10 Apriori

This chapter describes Apriori, the algorithm used by Oracle Data Mining for calculating association rules.

This chapter contains the following topics:

About Apriori
Metrics for Association Rules
Data for Association Rules

About Apriori

An association mining problem can be decomposed into two subproblems:

Find all combinations of items in a set of transactions that occur with a specified minimum frequency. These combinations are called frequent itemsets. (See "Frequent Itemsets" for an example.)
Calculate rules that express the probable co-occurrence of items within frequent itemsets. (See "Example: Calculating Rules from Frequent Itemsets".)

Apriori calculates the probability of an item being present in a frequent itemset, given that another item or items are present.

Association rule mining is not recommended for finding associations involving rare events in problem domains with a large number of items. Apriori discovers patterns with frequency above the minimum support threshold. Therefore, to find associations involving rare events, the algorithm must run with very low minimum support values. However, doing so can cause the number of enumerated itemsets to explode, especially when there are many items, and this can increase the execution time significantly. Classification or anomaly detection may be more suitable for discovering rare events when the data has a high number of attributes.

Association Rules

The Apriori algorithm calculates rules that express probabilistic relationships between items in frequent itemsets. For example, a rule derived from frequent itemsets containing A, B, and C might state that if A and B are included in a transaction, then C is likely to also be included.

An association rule states that an item or group of items implies the presence of another item with some probability. Unlike decision tree rules, which predict a target, association rules simply express correlation.

Antecedent and Consequent

The IF component of an association rule is known as the antecedent. The THEN component is known as the consequent. The antecedent and the consequent are disjoint; they have no items in common.

Oracle Data Mining supports association rules that have one or more items in the antecedent and a single item in the consequent.

Confidence

Rules have an associated confidence, which is the conditional probability that the consequent will occur given the occurrence of the antecedent.

The ASSO_MIN_CONFIDENCE setting specifies the minimum confidence for rules. The model eliminates any rules with confidence below the required minimum. See Oracle Database PL/SQL Packages and Types Reference for details on model settings for association rules.

Example: Calculating Rules from Frequent Itemsets

Table 10-1 and Table 10-2 show the itemsets and frequent itemsets that were calculated in Chapter 8. The frequent itemsets are the itemsets that occur with a minimum support of 67%; at least 2 of the 3 transactions must include the itemset.

Table 10-1 Itemsets

Transaction	Itemsets
11	(B,D) (B,E) (D,E) (B,D,E)
12	(A,B) (A,C) (A,E) (B,C) (B,E) (C,E) (A,B,C) (A,B,E) (A,C,E) (B,C,E)
13	(B,C) (B,D) (B,E) (C,D) (C,E) (D,E) (B,C,D) (B,C,E) (B,D,E) (C,D,E)

Table 10-2 Frequent Itemsets with Minimum Support 67%

Itemset	Transactions	Support
(B,C)	12 and 13	67%
(B,D)	11 and 13	67%
(B,E)	11, 12, and 13	100%
(C,E)	12 and 13	67%
(D,E)	11 and 13	67%
(B,C,E)	12 and 13	67%
(B,D,E)	11 and 13	67%
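
The frequent itemsets in Table 10-2 can be reproduced with a short sketch. This is an illustration only, not the Oracle Data Mining implementation: it enumerates every combination by brute force rather than using the level-wise candidate pruning that Apriori performs. The item contents of each transaction are inferred from the itemsets listed in Table 10-1, and the function name frequent_itemsets is hypothetical. The 67% threshold is expressed as a count (2 of 3 transactions) to avoid rounding issues.

```python
from itertools import combinations

# Item contents inferred from the itemsets listed in Table 10-1 (assumption).
transactions = {
    11: {"B", "D", "E"},
    12: {"A", "B", "C", "E"},
    13: {"B", "C", "D", "E"},
}

def frequent_itemsets(transactions, min_count, min_size=2):
    """Return {itemset: supporting transaction ids} for itemsets of at least
    min_size items that occur in at least min_count transactions."""
    items = sorted(set().union(*transactions.values()))
    result = {}
    for size in range(min_size, len(items) + 1):
        for candidate in combinations(items, size):
            ids = [tid for tid, t in transactions.items() if set(candidate) <= t]
            if len(ids) >= min_count:
                result[candidate] = ids
    return result

# Minimum support of 67% means at least 2 of the 3 transactions.
frequent = frequent_itemsets(transactions, min_count=2)
for itemset, ids in frequent.items():
    print(itemset, ids, f"{len(ids) / len(transactions):.0%}")
```

Brute-force enumeration is practical only for a handful of items; the number of candidate combinations grows exponentially with the number of distinct items.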

A rule expresses a conditional probability. Confidence in a rule is calculated by dividing the probability of the items occurring together by the probability of the occurrence of the antecedent.

For example, if B (antecedent) is present, what is the chance that C (consequent) will also be present? What is the confidence for the rule "IF B, THEN C"?

As shown in Table 10-1:

All 3 transactions include B (3/3 or 100%)
Only 2 transactions include both B and C (2/3 or 67%)
Therefore, the confidence of the rule "IF B, THEN C" is 67/100 or 67%.

Table 10-3 shows the rules that could be derived from the frequent itemsets in Table 10-2.

Table 10-3 Frequent Itemsets and Rules

Frequent Itemset	Rules	prob(antecedent and consequent) / prob(antecedent)	Confidence
(B,C)	(If B then C) (If C then B)	67/100 67/67	67% 100%
(B,D)	(If B then D) (If D then B)	67/100 67/67	67% 100%
(B,E)	(If B then E) (If E then B)	100/100 100/100	100% 100%
(C,E)	(If C then E) (If E then C)	67/67 67/100	100% 67%
(D,E)	(If D then E) (If E then D)	67/67 67/100	100% 67%
(B,C,E)	(If B and C then E) (If B and E then C) (If C and E then B)	67/67 67/100 67/67	100% 67% 100%
(B,D,E)	(If B and D then E) (If B and E then D) (If D and E then B)	67/67 67/100 67/67	100% 67% 100%

If the minimum confidence is 70%, ten rules will be generated for these frequent itemsets. If the minimum confidence is 60%, sixteen rules will be generated.
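
The rule counts for the two confidence thresholds can be checked with a minimal sketch. It is an illustration under stated assumptions, not the Oracle Data Mining implementation: the transaction contents are inferred from Table 10-1, the frequent itemsets are copied from Table 10-2, and the helper names (support, rules_with_confidence, count_rules) are hypothetical. Exact fractions are used so that 2/3 is not rounded to 67% before comparison.

```python
from fractions import Fraction

# Item contents inferred from Table 10-1 (assumption).
transactions = [
    {"B", "D", "E"},        # transaction 11
    {"A", "B", "C", "E"},   # transaction 12
    {"B", "C", "D", "E"},   # transaction 13
]

# Frequent itemsets copied from Table 10-2.
frequent_itemsets = [
    ("B", "C"), ("B", "D"), ("B", "E"), ("C", "E"), ("D", "E"),
    ("B", "C", "E"), ("B", "D", "E"),
]

def support(items):
    """Fraction of transactions that contain every item in `items`."""
    hits = sum(1 for t in transactions if set(items) <= t)
    return Fraction(hits, len(transactions))

def rules_with_confidence(itemsets):
    """Yield (antecedent, consequent, confidence) for every rule with a
    single-item consequent, as in Table 10-3."""
    for itemset in itemsets:
        for consequent in itemset:
            antecedent = tuple(i for i in itemset if i != consequent)
            yield antecedent, consequent, support(itemset) / support(antecedent)

rules = list(rules_with_confidence(frequent_itemsets))

def count_rules(min_confidence):
    """Number of rules whose confidence meets the minimum."""
    return sum(1 for _, _, conf in rules if conf >= min_confidence)

print(count_rules(Fraction(70, 100)))  # rules kept at 70% minimum confidence
print(count_rules(Fraction(60, 100)))  # rules kept at 60% minimum confidence
```

Each frequent itemset of n items contributes n candidate rules here because the consequent is restricted to a single item.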

Tip:

Increase the minimum confidence if you want to decrease the build time for the model and generate fewer rules.

Metrics for Association Rules

Minimum support and confidence are used to influence the build of an association model. Support and confidence are also the primary metrics for evaluating the quality of the rules generated by the model. Additionally, Oracle Data Mining supports lift for association rules. These statistical measures can be used to rank the rules and hence the usefulness of the predictions.

Support

The support of a rule indicates how frequently the items in the rule occur together. For example, cereal and milk might appear together in 40% of the transactions. If so, the following two rules would each have a support of 40%.

cereal implies milk
milk implies cereal

Support is the ratio of transactions that include all the items in the antecedent and consequent to the number of total transactions.

Support can be expressed in probability notation as follows.

support(A implies B) = P(A, B)

See Also:

"Frequent Itemsets"

Confidence

The confidence of a rule indicates how often the consequent appears in transactions that contain the antecedent. Confidence is the conditional probability of the consequent given the antecedent. For example, cereal might appear in 50 transactions; 40 of the 50 might also include milk. The rule confidence would be:

cereal implies milk with 80% confidence

Confidence is the ratio of the rule support to the number of transactions that include the antecedent.

Confidence can be expressed in probability notation as follows.

confidence(A implies B) = P(B|A), which is equal to P(A, B) / P(A)
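
The support and confidence definitions can be worked through with the cereal and milk figures. The sketch below is illustrative only. The total of 100 transactions is an assumption, chosen so that the 40% support figure and the 40-of-50 confidence figure describe the same data set; the function names are hypothetical.

```python
# Counts from the cereal and milk example. The total of 100 transactions is
# an assumption chosen so that the 40% support and the 40-of-50 confidence
# figures describe the same data set.
total_transactions = 100
cereal_count = 50           # transactions that include cereal
cereal_and_milk_count = 40  # transactions that include both cereal and milk

def support(joint_count, total):
    """support(A implies B) = P(A, B)"""
    return joint_count / total

def confidence(joint_count, antecedent_count):
    """confidence(A implies B) = P(A, B) / P(A); the totals cancel out."""
    return joint_count / antecedent_count

print(f"support    = {support(cereal_and_milk_count, total_transactions):.0%}")
print(f"confidence = {confidence(cereal_and_milk_count, cereal_count):.0%}")
```

Support is symmetric in the two items, while confidence depends on which item is the antecedent.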

Lift

Both support and confidence must be used to determine if a rule is valid. However, there are times when both of these measures may be high, and yet still produce a rule that is not useful. For example:

Convenience store customers who buy orange juice also buy milk with a 75% confidence.
The combination of milk and orange juice has a support of 30%.

This at first sounds like an excellent rule, and in most cases, it would be. It has high confidence and high support. However, what if convenience store customers in general buy milk 90% of the time? In that case, orange juice customers are actually less likely to buy milk than customers in general.

A third measure is needed to evaluate the quality of the rule. Lift indicates the strength of a rule over the random co-occurrence of the antecedent and the consequent, given their individual support. It provides information about the improvement, the increase in probability of the consequent given the antecedent. Lift is defined as follows.

(Rule Support) / (Support(Antecedent) * Support(Consequent))

This can also be defined as the confidence of the combination of items divided by the support of the consequent. So in our milk example, assuming that 40% of the customers buy orange juice, the improvement would be:

30% / (40% * 90%)

which is 0.83 – an improvement of less than 1.

Any rule with an improvement of less than 1 does not indicate a real cross-selling opportunity, no matter how high its support and confidence, because it actually offers less ability to predict a purchase than does random chance.
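
The lift calculation for the orange juice and milk example can be sketched as follows. The function name lift is hypothetical. The sketch also checks that the two definitions given in this section (rule support over the product of the individual supports, and confidence over the support of the consequent) agree.

```python
import math

def lift(rule_support, antecedent_support, consequent_support):
    """Lift = support(A, B) / (support(A) * support(B))."""
    return rule_support / (antecedent_support * consequent_support)

# Figures from the orange juice and milk example.
rule_support = 0.30   # orange juice and milk together
juice_support = 0.40  # orange juice (antecedent)
milk_support = 0.90   # milk (consequent)

value = lift(rule_support, juice_support, milk_support)

# Equivalent form: confidence divided by the support of the consequent.
confidence = rule_support / juice_support
assert math.isclose(value, confidence / milk_support)

print(round(value, 2))  # about 0.83, which is less than 1
```

A value above 1 would indicate that the antecedent makes the consequent more likely than it is in general; a value below 1, as here, indicates the opposite.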

Data for Association Rules

Association models are designed to use transactional data. Nulls in transactional data are assumed to represent values that are known but not present in the transaction. For example, three items out of hundreds of possible items might be purchased in a single transaction. The items that were not purchased are known but not present in the transaction.

Transactional data, by its nature, is sparse. Only a small fraction of the attributes are non-zero or non-null in any given row. Apriori interprets all null values as indications of sparsity.
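
The treatment of missing items as "not present" can be illustrated with a small sketch. It is not Oracle Data Mining code: the (transaction id, item) rows are inferred from Table 10-1, and the names baskets and contains are hypothetical. Only the items that were purchased are stored; an item with no row for a transaction is simply interpreted as absent from it.

```python
from collections import defaultdict

# Transactional rows: one (transaction id, item) pair per purchased item.
# The ids and items are inferred from Table 10-1 (assumption).
rows = [
    (11, "B"), (11, "D"), (11, "E"),
    (12, "A"), (12, "B"), (12, "C"), (12, "E"),
    (13, "B"), (13, "C"), (13, "D"), (13, "E"),
]

# Group the rows into one item set per transaction. Only items that are
# present are stored; every other item is implicitly absent.
baskets = defaultdict(set)
for tid, item in rows:
    baskets[tid].add(item)

all_items = sorted({item for _, item in rows})

def contains(tid, item):
    """An item with no row for this transaction is treated as not present."""
    return item in baskets[tid]

# Fraction of (transaction, item) cells that are actually populated.
density = len(rows) / (len(baskets) * len(all_items))
print(contains(11, "A"), f"{density:.0%}")
```

This toy data set has only five distinct items, so it is far denser than real transactional data, where a basket typically holds a few items out of hundreds or thousands.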

See Also:

"Transactions"

Equi-width binning is not recommended for association models. When Apriori uses equi-width binning, outliers cause most of the data to concentrate in a few bins, sometimes a single bin. As a result, the discriminating power of the algorithm can be significantly reduced.
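
The effect of an outlier on equi-width binning can be seen in a minimal sketch. The purchase amounts are hypothetical and the function equi_width_bin_counts is an illustration, not part of Oracle Data Mining.

```python
def equi_width_bin_counts(values, n_bins):
    """Count values per bin when [min, max] is split into n_bins equal widths."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    if width == 0:
        # All values are identical, so they all fall into the first bin.
        counts[0] = len(values)
        return counts
    for v in values:
        # The maximum value falls on the upper edge; keep it in the last bin.
        index = min(int((v - lo) / width), n_bins - 1)
        counts[index] += 1
    return counts

# Hypothetical purchase amounts with a single outlier.
amounts = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]
print(equi_width_bin_counts(amounts, n_bins=5))
```

With one outlier at 1000, each bin is about 200 units wide, so the nine ordinary values all land in the first bin and can no longer be distinguished from one another.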

Note:

Apriori is not affected by Automatic Data Preparation.

See Also:

Oracle Data Mining Application Developer's Guide for information about transactional data, nested columns, and missing value treatment

Chapter 20, "Text Mining"