In our example, we will use the Hotel dataset (the source and details about the dataset are here).
This dataset contains data about visits to a fictitious hotel, such as length of stay, weather, type of visit and others.
Our question is: does the type of visit (personal/business) depend on the country the visitor comes from?
We will look for rules GCity -> VTypeOfVisit with a minimal implied probability of 60% and a base of 50 records.
This example gives us the following results:
We can see that the result was obtained instantly; 56 verifications yielded 12 results. We have listed all rules, so we can see that every rule has VTypeOfVisit(private) in the succedent. From Berlin, 85.7% of visits are private (rule id = 1); from Dresden, 90.8% of visits are private (rule id = 5). The first rule is printed in detail, so we can also see the fourfold table and other quantifier values.
4.1 General cleverminer procedure call
CleverMiner is a Python library (class). It loads a pandas DataFrame with categorical data, converts it to an internal format for fast pattern mining and applies a GUHA procedure.
where
- df is a pandas DataFrame with all variables categorized (the number of distinct values should be small)
- proc is a GUHA procedure. Currently supported procedures are 4ftMiner, CFMiner and SD4ftMiner
- quantifier is a list of conditions (these depend on the procedure; see the section for the individual procedure)
- target is the name of the target variable (applicable to the CFMiner procedure only)
- cedent (ante, succ, cond, ...) is a definition of a set of relevant cedents (a cedent is a conjunction or disjunction of literals). A literal has the form A(a), where A is an attribute and a is a subset of its values.
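The parameters above can be put together as in the following minimal sketch for the question GCity -> VTypeOfVisit. It assumes that the class is importable as `from cleverminer import cleverminer`, that the quantifier dictionary is passed under the keyword `quantifiers`, and that the attribute list of a cedent is stored under the key `attributes`; please check all three against the installed version.

```python
# Task description for rules GCity -> VTypeOfVisit with conf >= 0.6 and Base >= 50.
# The keyword 'quantifiers' and the key 'attributes' are assumptions about the
# exact spelling expected by the library.
task = {
    "proc": "4ftMiner",
    "quantifiers": {"conf": 0.6, "Base": 50},
    "ante": {
        "attributes": [
            {"name": "GCity", "type": "subset", "minlen": 1, "maxlen": 1},
        ],
        "minlen": 1, "maxlen": 1, "type": "con",
    },
    "succ": {
        "attributes": [
            {"name": "VTypeOfVisit", "type": "subset", "minlen": 1, "maxlen": 1},
        ],
        "minlen": 1, "maxlen": 1, "type": "con",
    },
}


def run_task(df, task):
    """Run the task on a categorical pandas DataFrame (needs cleverminer installed)."""
    # Imported lazily so that the task description can be built without the library.
    from cleverminer import cleverminer
    return cleverminer(df=df, **task)
```

Keeping the task in a plain dictionary makes it easy to vary a single quantifier or cedent between runs.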
A full example of a procedure call is here.
Each definition of a set of relevant cedents contains
- a list of attributes, and for each attribute
  - name – name of the attribute A (must correspond to an existing attribute name in the dataframe)
  - type – kind of the set a in the literal A(a); available values are subset for subsets, lcut for left cut, rcut for right cut, seq for sequence and one for one category (see the following section)
  - minlen – number, minimal number of categories (i.e. values of A) in the literal A(a)
  - maxlen – number, maximal number of categories in the literal A(a), e.g. we will look for a single GCity, a single GState or a combination of 2 GStates
- minlen – number, minimal length of the cedent (minimal number of literals in the cedent in the rule)
- maxlen – number, maximal length of the cedent (maximal number of literals in the cedent in the rule), e.g. for the antecedent, we will look for rules with one or two attributes from the list GState, GCity because the minimal length is 1 and the maximal length is 2
- type – type of cedent, i.e. how literals (attributes and values) are combined; available values are con for conjunction and dis for disjunction
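To make the attribute types more concrete, here is a small sketch that enumerates the candidate value sets a for a literal A(a). It reflects one reading of the type names (subset = any combination, seq = neighbouring categories, lcut = runs starting at the first category, rcut = runs ending at the last category) and is not the library's own code. The type one is left out because its configuration is not described here, and the GState categories are made up.

```python
from itertools import combinations


def value_sets(categories, kind, minlen, maxlen):
    """Enumerate candidate value sets a for a literal A(a).

    categories: ordered list of the attribute's categories
    kind: 'subset', 'seq', 'lcut' or 'rcut'
    """
    n = len(categories)
    lengths = range(minlen, min(maxlen, n) + 1)
    if kind == "subset":   # any combination of categories
        return [list(c) for k in lengths for c in combinations(categories, k)]
    if kind == "seq":      # runs of neighbouring categories
        return [categories[i:i + k] for k in lengths for i in range(n - k + 1)]
    if kind == "lcut":     # runs that start at the first category
        return [categories[:k] for k in lengths]
    if kind == "rcut":     # runs that end at the last category
        return [categories[n - k:] for k in lengths]
    raise ValueError(f"unsupported type: {kind}")


states = ["Bavaria", "Berlin", "Saxony"]  # made-up categories for GState
for kind in ("subset", "seq", "lcut", "rcut"):
    print(kind, value_sets(states, kind, 1, 2))
```

The counts show why minlen and maxlen matter: subset grows combinatorially with maxlen, while the cut types grow only linearly.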
Procedures and supported quantifiers follow immediately.
4.2 4ft Miner
The 4ft Miner procedure looks for rules $$ A \Rightarrow_{Base,p} S | C $$
where A = A_1(a_1) & A_2(a_2) & ... & A_n(a_n) is called the antecedent,
S = S_1(s_1) & S_2(s_2) & ... & S_m(s_m) is called the succedent and C = C_1(c_1) & C_2(c_2) & ... & C_k(c_k) is called the condition (all written here for the case of conjunction). Note that each cedent can be a conjunction or a disjunction of attributes and their values.
This procedure searches all possible combinations of attributes and their values and verifies them against conditions called quantifiers (typically p, the minimal conditional probability P(S|A), and Base, the minimal number of records
that satisfy both A and S). The dataset is first restricted to the records that satisfy condition C. Note that condition C is optional.
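The search described above can be illustrated with a brute-force sketch over single-value literals. This is a simplified illustration in plain Python, not how the library works internally; the function `mine_4ft` and the toy records are invented for this purpose.

```python
from collections import Counter

# Toy records (invented for illustration, not the Hotel dataset).
records = (
    [{"GCity": "Berlin", "VTypeOfVisit": "private"}] * 4
    + [{"GCity": "Berlin", "VTypeOfVisit": "business"}] * 1
    + [{"GCity": "Prague", "VTypeOfVisit": "private"}] * 2
    + [{"GCity": "Prague", "VTypeOfVisit": "business"}] * 4
)


def mine_4ft(records, ante_attr, succ_attr, min_conf, min_base, cond=None):
    """Return (antecedent value, succedent value, conf, base) for rules that pass."""
    if cond is not None:                       # optional condition C
        records = [r for r in records if cond(r)]
    both = Counter((r[ante_attr], r[succ_attr]) for r in records)  # records with A and S
    ante = Counter(r[ante_attr] for r in records)                  # records with A
    rules = []
    for (a_val, s_val), base in sorted(both.items()):
        conf = base / ante[a_val]              # P(S|A)
        if base >= min_base and conf >= min_conf:
            rules.append((a_val, s_val, round(conf, 3), base))
    return rules


rules = mine_4ft(records, "GCity", "VTypeOfVisit", min_conf=0.6, min_base=3)
print(rules)
```

Passing `cond=lambda r: r["GCity"] == "Berlin"` restricts the records first, which mirrors the role of condition C.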
As an example, we will use the following code:
Cedents to be defined for 4ftMiner are
- ante – antecedent (left-hand side of the rule)
- succ – succedent (right-hand side of the rule)
- cond – condition
Note
The condition is optional in 4ftMiner. If you want to use 4ftMiner without a condition, simply omit this cedent from the procedure call.
For quantifiers, the available parameters are
- Base - base, the minimal absolute number of items satisfying both the antecedent and the succedent (among the items given by the condition, if one is used)
- RelBase - relative base, Base divided by the total number of items (or by the number of items given by the condition, if one is used)
- conf - confidence, the conditional probability P(S|A), i.e. the percentage of items satisfying A that also satisfy S
- aad - above average difference, P(S|A)/P(S)-1, i.e. by how much the probability of S increases when only records satisfying A are used, compared with all records (a value of 0.5 means the probability is 50% higher)
- bad - below average difference, the negative value of aad, i.e. how much A decreases the probability of S
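As a worked example, the quantifiers above can be computed from a fourfold table. The sketch assumes the usual layout a = records with A and S, b = A without S, c = S without A, d = neither, and ignores the optional condition; the function name is invented.

```python
def quantifiers_4ft(a, b, c, d):
    """Quantifier values from a fourfold table.

    a: A and S, b: A and not S, c: not A and S, d: neither A nor S
    """
    n = a + b + c + d
    conf = a / (a + b)              # P(S|A)
    p_s = (a + c) / n               # P(S) over all records
    aad = conf / p_s - 1            # relative increase of P(S) caused by A
    return {"Base": a, "RelBase": a / n, "conf": conf, "aad": aad, "bad": -aad}


print(quantifiers_4ft(a=30, b=10, c=20, d=40))
```

With these counts, conf is 30/40 = 0.75 and P(S) is 50/100 = 0.5, so aad = 0.75/0.5 - 1 = 0.5: the antecedent raises the probability of the succedent by 50%.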
Warning
All control strings / keys are case sensitive.
4.3 CFMiner
The CFMiner procedure finds histograms of the target variable for subsets of records given by a specified condition (cedent). For example, if share prices are growing in general, you may look for types of shares, described by several attributes, for which the share price is declining.
This procedure has a single cedent, cond, that denotes the condition.
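The idea of a histogram of the target under a condition can be sketched in a few lines of plain Python. The attribute names `weather` and `stay`, the records and the helper `histogram` are all invented for illustration.

```python
from collections import Counter

# Toy records (invented for illustration).
records = (
    [{"weather": "sunny", "stay": "short"}] * 2
    + [{"weather": "sunny", "stay": "medium"}] * 3
    + [{"weather": "sunny", "stay": "long"}] * 5
    + [{"weather": "rainy", "stay": "short"}] * 4
    + [{"weather": "rainy", "stay": "long"}] * 1
)
stay_order = ["short", "medium", "long"]  # ordered categories of the target


def histogram(records, target, categories, cond=None):
    """Counts of the target categories, in category order, for records passing cond."""
    if cond is not None:
        records = [r for r in records if cond(r)]
    counts = Counter(r[target] for r in records)
    return [counts[c] for c in categories]


print(histogram(records, "stay", stay_order))
print(histogram(records, "stay", stay_order, cond=lambda r: r["weather"] == "sunny"))
```

Comparing the two printed lists shows how a condition can change the shape of the histogram, which is what the quantifiers of this procedure describe.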
Possible quantifier values are
- Base - number of records that satisfy the condition
- RelBase - relative number of records that satisfy the condition, Base / total number of records
- S_Up - consecutive steps up in the histogram
- S_Down - consecutive steps down in the histogram
- S_Any_Up - total number of steps up in the histogram
- S_Any_Down - total number of steps down in the histogram
- Max - maximal value in the histogram
- Min - minimal value in the histogram
- RelMax - relative maximal value in the histogram (out of the sum of all values in the histogram)
- RelMin - relative minimal value in the histogram (out of the sum of all values in the histogram)
- RelMax_leq - relative maximal value in the histogram (out of the sum of all values in the histogram), upper bound
- RelMin_leq - relative minimal value in the histogram (out of the sum of all values in the histogram), upper bound
Note
Most quantifiers are naturally lower bounds (greater or equal), like Base, confidence etc. Sometimes an upper bound is also needed (e.g. for RelMax and RelMin): to get histograms with similar values, you can bound the minimal and maximal values close to the average.
Therefore, quantifiers with the _leq extension have been introduced (leq = less or equal).
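The following sketch computes these values for one histogram and checks them against bounds. It rests on two assumptions that should be verified against the library: S_Up and S_Down are read as the longest run of consecutive increases or decreases, and a key ending in _leq is read as an upper bound on the value named without the suffix. The helper names are invented.

```python
def histogram_quantifiers(hist):
    """Quantifier values for a histogram given as a list of counts."""
    steps = [b - a for a, b in zip(hist, hist[1:])]

    def longest_run(flags):
        best = run = 0
        for flag in flags:
            run = run + 1 if flag else 0
            best = max(best, run)
        return best

    ups = [s > 0 for s in steps]
    downs = [s < 0 for s in steps]
    total = sum(hist)
    return {
        "S_Up": longest_run(ups),        # assumed: longest run of steps up
        "S_Down": longest_run(downs),    # assumed: longest run of steps down
        "S_Any_Up": sum(ups),
        "S_Any_Down": sum(downs),
        "Max": max(hist),
        "Min": min(hist),
        "RelMax": max(hist) / total,
        "RelMin": min(hist) / total,
    }


def passes(values, bounds):
    """Plain keys are lower bounds; keys ending in _leq are upper bounds."""
    for key, bound in bounds.items():
        if key.endswith("_leq"):
            if values[key[:-4]] > bound:
                return False
        elif values[key] < bound:
            return False
    return True


q = histogram_quantifiers([10, 5, 15, 20, 30, 12, 8])
print(q)
print(passes(q, {"S_Up": 3, "RelMax_leq": 0.35}))
```

Bounding RelMax from above and RelMin from below in the same task is one way to ask for a nearly flat histogram.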
Warning
All control strings / keys are case sensitive.
An example of a CFMiner procedure call is:
4.4 SD4ft Miner
The SD4ft Miner procedure looks for 4ft Miner-like rules whose conf (confidence, or implied probability) differs between two sets of records by at least a defined value.
As an example, we will use the following code:
Cedents to be defined for SD4ftMiner are
- ante – antecedent (left-hand side of the rule)
- succ – succedent (right-hand side of the rule)
- frst – first set (sub-matrix used in the first rule)
- scnd – second set (sub-matrix used in the second rule)
- cond – condition
Note
The condition is optional. If you want to run the procedure without a condition, simply omit this cedent from the procedure call.
For quantifiers, the available parameters are
- FrstBase - base, the minimal absolute number of items satisfying both the antecedent and the succedent for the first rule (among the items given by the condition, if one is used)
- ScndBase - base, the minimal absolute number of items satisfying both the antecedent and the succedent for the second rule (among the items given by the condition, if one is used)
- FrstRelBase - relative base, Base divided by the total number of items, for the first rule
- ScndRelBase - relative base, Base divided by the total number of items, for the second rule
- Frstconf - confidence, the conditional probability P(S|A), i.e. the percentage of items satisfying A that also satisfy S, for the first rule
- Scndconf - confidence, the conditional probability P(S|A), i.e. the percentage of items satisfying A that also satisfy S, for the second rule
- Deltaconf - absolute difference of the confidences of the first and second rule
- Ratioconf - relative difference of the confidences of the first and second rule
- Ratioconf_leq - relative difference of the confidences of the first and second rule, upper bound
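A small worked example may help with the two comparison quantifiers. The sketch takes the counts of the same rule on the first and on the second set and assumes that Ratioconf is the ratio of the first confidence to the second; the source describes it only as a relative difference, so this reading and the function names are assumptions.

```python
def conf(a, b):
    """Confidence P(S|A) from a = records with A and S, b = records with A but not S."""
    return a / (a + b)


def sd4ft_quantifiers(first, second):
    """first, second: (a, b) counts of the same rule on the first and second set."""
    c1, c2 = conf(*first), conf(*second)
    return {
        "FrstBase": first[0],
        "ScndBase": second[0],
        "Frstconf": c1,
        "Scndconf": c2,
        "Deltaconf": abs(c1 - c2),
        "Ratioconf": c1 / c2,   # assumption: first confidence divided by second
    }


print(sd4ft_quantifiers(first=(40, 10), second=(20, 30)))
```

Here the rule holds for 40 of 50 records in the first set (conf 0.8) and 20 of 50 in the second (conf 0.4), so Deltaconf is 0.4 and the ratio is 2. The sketch does not guard against a second confidence of zero.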
Warning
All control strings / keys are case sensitive.
6.1 Input dataset
The input dataset is a pandas DataFrame in which all attributes should be categorical. Note that all attributes are converted to an internal binary bitchain form, so if you have a large dataset with many unused
attributes, please consider reducing the dataset to improve computing time.
All attributes should be categorical and ordered. String attributes are typically ordered by string sorting for sequences and cuts.
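Numeric attributes therefore need to be binned before mining. The sketch below shows one way to do this in plain Python, with labels chosen so that string sorting preserves the intended order; with pandas, `pandas.cut` serves the same purpose. The attribute, the bin edges and the labels are made up.

```python
from bisect import bisect_right


def categorize(values, edges, labels):
    """Map numbers to ordered labels.

    A value v gets labels[i], where i is the number of edges that are <= v,
    so len(labels) must be len(edges) + 1.
    """
    return [labels[bisect_right(edges, v)] for v in values]


nights = [1, 2, 2, 5, 9, 14]                     # made-up numeric attribute
labels = ["1 short", "2 medium", "3 long"]      # prefixes keep string sorting in order
stay = categorize(nights, edges=[3, 8], labels=labels)

print(stay)
print(sorted(set(stay)) == labels)   # check that string order matches the intended order
```

The numeric prefixes matter because sequences and cuts rely on category order, and plain labels such as "long", "medium", "short" would sort alphabetically instead.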
6.2 Working with results
Results are returned in JSON format that is not indented. If you work directly with this JSON, you may want to format it first, for example in an online JSON formatter.
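The formatting can also be done locally with Python's standard json module. The sketch assumes the result is available as a JSON string; the string below is a neutral placeholder and does not reflect the library's actual result schema.

```python
import json

# Placeholder for an unindented JSON result; keys and values are invented.
raw = '{"key":{"nested":[1,2,3]},"other":"value"}'

# Parse and re-serialize with indentation for easier reading.
pretty = json.dumps(json.loads(raw), indent=2, sort_keys=True)
print(pretty)
```

If the result is already a Python dictionary rather than a string, `json.dumps(result, indent=2)` alone is enough.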
There are also functions to print the list of rules and to print an individual rule.
- print_summary prints out a task processing summary such as time elapsed, number of verifications and number of rules found
- print_rulelist prints out a simplified list of all rules in human-readable format
- print_rule(i) prints out the details of the rule with id i
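A typical reporting sequence might look like the sketch below. It assumes the three functions are methods of the object returned by the procedure call; `FakeMiner` is a stand-in used only to show the call order without the library and is not part of it.

```python
def report(clm, rule_id=1):
    """Print the summary, the rule list and one rule in detail."""
    clm.print_summary()
    clm.print_rulelist()
    clm.print_rule(rule_id)


class FakeMiner:
    """Stand-in that records calls instead of printing; not part of the library."""

    def __init__(self):
        self.calls = []

    def print_summary(self):
        self.calls.append("summary")

    def print_rulelist(self):
        self.calls.append("rulelist")

    def print_rule(self, i):
        self.calls.append(f"rule {i}")


fake = FakeMiner()
report(fake, rule_id=5)
print(fake.calls)
```

Printing the rule list before a single rule is convenient because the list shows the rule ids that print_rule expects.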
6.3 Advanced options (expert use only)
You may use advanced options (use them only when you really know what you are doing).
- no_optimizations switches off optimizations, i.e. the pruning of branches where no rule can exist
- max_categories sets the maximum number of categories allowed in the input. The default is 100.