1 Introduction

Last updated: 2022-08-06

CleverMiner is a Python implementation of GUHA procedures. The project has several goals:

  • Provide an implementation of a generalised apriori algorithm
  • Serve as an educational tool
  • Spread awareness among the wider public that apriori can be extended

Note

This project is an efficient implementation of GUHA procedures, so it covers a wide variety of options and theoretical concepts, and many of the examples are not finished yet. The project is in continuous development. For GUHA tutorials using the older Windows application, please see https://lispminer.vse.cz.

2 Installation

CleverMiner is a Python package, and this documentation assumes that you are familiar with Python. The package supports Python 3.6+ (3.10+ recommended).

Installation is very easy. CleverMiner is published on PyPI (Python Package Index), so it can be installed with pip:

Code Example: pip install cleverminer

If you already have the package installed and need to upgrade to the newest version, use pip install cleverminer --upgrade

3 Quick tutorial

This section gives a quick introduction to several GUHA procedures and shows simple examples of tasks that can be solved by CleverMiner.

In our example, we will use the Hotel dataset (source and details about the dataset are here). This dataset contains data about visits to a fictitious hotel, such as length of stay, weather, type of visit and others.

Our question will be: does the type of visit (personal/business) depend on the country the visitor comes from?

We will look for rules GCity -> VTypeOfVisit with a minimal implied probability (confidence) of 60% and a base of at least 50 records.

This example gives us the following results.
We can see that the result was obtained instantly: from 56 verifications we got 12 results. All rules are listed, so we can see that every rule has VTypeOfVisit(private) in the succedent. From Berlin, 85.7% of visits are private (rule id = 1); from Dresden, 90.8% of visits are private (rule id = 5). The first rule is printed in detail, so we can also see its fourfold table and other quantifier values.

4 Overview of CleverMiner Procedures

This section shows which types of tasks you can solve with CleverMiner. CleverMiner procedures are implementations of GUHA procedures. Each procedure solves a different type of task and mines a different pattern of rules.

Note

Every procedure consists of cedents (e.g. the left-hand side and right-hand side of association rules). The full range of options for defining cedents is shown in the following section. For now, let us assume that a cedent is a conjunction/disjunction of literals and that a literal is a condition that an attribute has one of the specified values.

In general, you specify which attributes to use and how individual values can be combined, and CleverMiner tries all possibilities and verifies a condition for each of them (e.g. at least 50% of cases that satisfy A must also satisfy B, where A and B hold together for at least 100 cases).
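To make the "try all possibilities and verify a condition" idea concrete, here is a minimal pure-Python sketch. It is not CleverMiner's implementation: the toy records, the function name mine_single_literal_rules and the thresholds are made up for illustration, and only rules with a single value on each side are tried.

```python
from collections import Counter

# Toy records (made up for illustration): (city, type_of_visit)
records = (
    [("Berlin", "private")] * 6 + [("Berlin", "business")] * 1
    + [("Dresden", "private")] * 3 + [("Dresden", "business")] * 3
    + [("Prague", "business")] * 4 + [("Prague", "private")] * 1
)

def mine_single_literal_rules(rows, min_conf, min_base):
    """Check every (antecedent value, succedent value) pair that occurs in rows."""
    ante_counts = Counter(a for a, _ in rows)  # records satisfying A
    pair_counts = Counter(rows)                # records satisfying both A and S
    rules = []
    for (a, s), base in sorted(pair_counts.items()):
        conf = base / ante_counts[a]           # share of A-records that also satisfy S
        if base >= min_base and conf >= min_conf:
            rules.append((a, s, base, round(conf, 3)))
    return rules

rules = mine_single_literal_rules(records, min_conf=0.6, min_base=4)
for a, s, base, conf in rules:
    print(f"City({a}) => Visit({s})  base={base}  conf={conf}")
```

With these toy numbers only two of the six candidate rules pass both thresholds. A real task differs mainly in scale: cedents may combine several literals, and each literal may cover several values.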

4.1 General CleverMiner procedure call

CleverMiner is a Python library (class). It loads a pandas DataFrame with categorical data, converts it to an internal format for quick pattern mining and applies a GUHA procedure.

where
  • df is a pandas DataFrame with all variables categorised (the number of distinct values should be small)
  • proc is a GUHA procedure. Currently supported procedures are 4ftMiner, CFMiner and SD4ftMiner
  • quantifier is a list of conditions (based on procedure, see the section for the individual procedure)
  • target is the name of the target variable (applicable to the CFMiner procedure only)
  • cedent (ante, succ, cond, ...) is a definition of a set of relevant cedents (a cedent is a conjunction or disjunction of literals). A literal has the form A(a), where A is an attribute and a is a subset of its values.

Full example of procedure call is here

Each definition of a set of relevant cedents contains
  • list of attributes, for each
    • name – name of attribute A (must correspond to an existing attribute name in the dataframe)
    • type – type of the set a in the literal A(a) (see following section); available values are subset for subsets, lcut for left cuts, rcut for right cuts, seq for sequences and one for one category
    • minlen – number, minimal number of categories (i.e. values of A) in the literal A(a)
    • maxlen – number, maximal number of categories in the literal A(a) - e.g. we will look for single GCity or single GState or combination of 2 GStates.
  • minlen – number, minimal length of cedent (minimal number of literals in the cedent in the rule)
  • maxlen – number, maximal length of cedent (maximal number of literals in the cedent in the rule) - e.g. for antecedent, we will look for rules with one or two attributes from the list GState,GCity because minimal length is 1 and maximal length is 2
  • type - type of cedent – how literals (attributes and values) are combined - available values are con and dis for conjunction/disjunction
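As an illustration of the structure described above, the sketch below writes one cedent definition as a Python dictionary and runs a few sanity checks on it. The key names name, type, minlen and maxlen come from the list above; the top-level key attributes, the helper check_cedent and the rule that lengths start at 1 are assumptions made for this example, not the package's own validation.

```python
ALLOWED_LITERAL_TYPES = {"subset", "lcut", "rcut", "seq", "one"}
ALLOWED_CEDENT_TYPES = {"con", "dis"}

def check_cedent(cedent):
    """Return a list of problems found in a cedent definition (empty list = OK)."""
    problems = []
    if cedent.get("type") not in ALLOWED_CEDENT_TYPES:
        problems.append("cedent type must be 'con' or 'dis'")
    if not 1 <= cedent.get("minlen", 0) <= cedent.get("maxlen", 0):
        problems.append("cedent needs 1 <= minlen <= maxlen")
    for attr in cedent.get("attributes", []):
        name = attr.get("name", "<unnamed>")
        if attr.get("type") not in ALLOWED_LITERAL_TYPES:
            problems.append(f"{name}: unknown literal type")
        elif attr["type"] != "one" and not 1 <= attr.get("minlen", 0) <= attr.get("maxlen", 0):
            problems.append(f"{name}: needs 1 <= minlen <= maxlen")
    return problems

# Antecedent matching the examples in the list above: one or two literals taken
# from GState and GCity, with a single GCity value or up to two GState values.
ante = {
    "attributes": [  # key name "attributes" is an assumption
        {"name": "GState", "type": "subset", "minlen": 1, "maxlen": 2},
        {"name": "GCity", "type": "subset", "minlen": 1, "maxlen": 1},
    ],
    "minlen": 1,
    "maxlen": 2,
    "type": "con",
}
print(check_cedent(ante))
```

Note that minlen and maxlen appear at two levels: inside each attribute they bound the number of categories in a literal, while at the top level they bound the number of literals in the cedent.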
Procedures and their supported quantifiers are described in the following sections.

4.2 4ft Miner

4ft Miner procedure looks for rules $$ A \Rightarrow_{Base,p} S | C $$

where A = A_1(a_1) & A_2(a_2) & ... & A_n(a_n) is called the antecedent, S = S_1(s_1) & S_2(s_2) & ... & S_m(s_m) is called the succedent and C = C_1(c_1) & C_2(c_2) & ... & C_k(c_k) is called the condition (shown here for the case of conjunction). Note that each cedent can be a conjunction or a disjunction of attributes and their values. The procedure searches all possible combinations of attributes and their values and verifies them against conditions called quantifiers: typically p is the minimal conditional probability P(S|A) and Base is the minimal number of records that satisfy both A and S. The dataset is first filtered to the records that satisfy condition C. Condition C is optional.

As an example, we will use following code

Cedents to be defined for 4ftMiner are
  • ante - antecedent (or left-hand side of the rule)
  • succ - succedent (or right-hand side of the rule)
  • cond - condition

Note

Condition is optional in 4ftMiner. If you want to use 4ftMiner without condition, simply omit this cedent in procedure call.

For quantifiers, available parameters are
  • Base - minimal absolute number of items satisfying both antecedent and succedent (within the records selected by the condition, if one is used)
  • RelBase - relative base, Base divided by the total number of items (within the records selected by the condition, if one is used)
  • conf - confidence - conditional probability P(S|A), i.e. the share of items satisfying S among the items that satisfy A
  • aad - above average difference - P(S|A)/P(S)-1, i.e. the relative increase in the probability of S when using only records that satisfy A compared to all records (how much A improves the probability of S)
  • bad - below average difference - negative value of aad, i.e. how much A decreases the probability of S
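The quantifiers above can all be computed from a fourfold table. The following sketch does so following the definitions in the list; the function name fourfold_quantifiers and the example counts are made up for illustration, and it is not taken from the CleverMiner source.

```python
def fourfold_quantifiers(a, b, c, d):
    """Compute 4ft quantifiers from a fourfold table.

    a: records with A and S        b: records with A and not S
    c: records with not A and S    d: records with not A and not S
    Assumes at least one record satisfies A and at least one satisfies S.
    """
    n = a + b + c + d
    conf = a / (a + b)        # P(S|A)
    p_s = (a + c) / n         # P(S) over all records
    aad = conf / p_s - 1      # above average difference
    return {
        "Base": a,
        "RelBase": a / n,
        "conf": conf,
        "aad": aad,
        "bad": -aad,          # below average difference, as defined above
    }

q = fourfold_quantifiers(a=60, b=20, c=40, d=80)
for name, value in q.items():
    print(f"{name:8s} {value}")
```

In this example S holds for half of all records but for three quarters of the records satisfying A, so aad is 0.5: A raises the probability of S by 50%.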

Warning

All control strings / keys are case sensitive.

Note

Detailed description of this procedure can be found at https://lispminer.vse.cz/guhate/doku.php?id=lm_guha_te_pravidlo

4.3 CFMiner

The CFMiner procedure finds histograms of a target variable under a specified condition (cedent). For example, if the share price is growing overall, you may look for share types, described by several attributes, for which the share price is declining.

This procedure has single cedent cond that denotes condition.

Possible quantifier values are

  • Base - number of records that satisfies condition
  • RelBase - relative number of records that satisfies condition, Base / Total number of records
  • S_Up - consecutive steps up in histogram
  • S_Down - consecutive steps down in histogram
  • S_Any_Up - total number of steps up in histogram
  • S_Any_Down - total number of steps down in histogram
  • Max - maximal value in histogram
  • Min - minimal value in histogram
  • RelMax - relative maximal value in histogram (out of sum of all values in histogram)
  • RelMin - relative minimal value in histogram (out of sum of all values in histogram)
  • RelMax_leq - relative maximal value in histogram (out of sum of all values in histogram) - upper bound
  • RelMin_leq - relative minimal value in histogram (out of sum of all values in histogram) - upper bound
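The histogram quantifiers can be illustrated with a short sketch. It reflects one reading of the descriptions above, stated here as an assumption: "consecutive steps" is taken as the longest run of increases (or decreases) between neighbouring categories, and "total number of steps" as the count of all such increases (or decreases). The function name and the sample histogram are made up.

```python
def histogram_quantifiers(hist):
    """Compute CFMiner-style quantifiers for a list of category frequencies."""
    steps = [b - a for a, b in zip(hist, hist[1:])]  # change between neighbours

    def longest_run(pred):
        # Length of the longest unbroken run of steps satisfying pred.
        best = run = 0
        for s in steps:
            run = run + 1 if pred(s) else 0
            best = max(best, run)
        return best

    total = sum(hist)
    return {
        "S_Up": longest_run(lambda s: s > 0),
        "S_Down": longest_run(lambda s: s < 0),
        "S_Any_Up": sum(s > 0 for s in steps),
        "S_Any_Down": sum(s < 0 for s in steps),
        "Max": max(hist),
        "Min": min(hist),
        "RelMax": max(hist) / total,
        "RelMin": min(hist) / total,
    }

q = histogram_quantifiers([5, 10, 20, 15, 25, 25])
print(q)
```

For this histogram the steps are +5, +10, -5, +10 and 0, so there are three steps up in total but at most two in a row.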

Note

Most quantifiers are naturally lower bounds (greater or equal), like base, confidence etc. Sometimes an upper bound is also needed (as for RelMax and RelMin), e.g. to obtain similar histogram values, you can bound the min and max values close to the average.

Therefore, quantifiers with the _leq extension have been introduced (leq = less or equal).

Warning

All control strings / keys are case sensitive.

Example of CFMiner procedure call is

Note

Detailed description of this procedure can be found at https://lispminer.vse.cz/guhate/doku.php?id=lm_guha_te_cf_proc

4.4 SD4ft Miner

The SD4ft Miner procedure looks for pairs of 4ft Miner-like rules, evaluated on two sets of records, whose conf (confidence, or implied probability) differs by at least a defined value.

As an example, we will use following code

Cedents to be defined for SD4ftMiner are
  • ante - antecedent (or left-hand side of the rule)
  • succ - succedent (or right-hand side of the rule)
  • frst - first set (sub-matrix used in first rule)
  • scnd - second set (sub-matrix used in second rule)
  • cond - condition

Note

Condition is optional in SD4ftMiner. If you want to use SD4ftMiner without a condition, simply omit this cedent in the procedure call.

For quantifiers, available parameters are
  • FrstBase - Base for the first rule, minimal absolute number of items satisfying both antecedent and succedent (within the records selected by the condition, if one is used)
  • ScndBase - Base for the second rule, minimal absolute number of items satisfying both antecedent and succedent (within the records selected by the condition, if one is used)
  • FrstRelBase - relative base, Base divided by total number of items for the first rule
  • ScndRelBase - relative base, Base divided by total number of items for the second rule
  • Frstconf - confidence for the first rule - conditional probability P(S|A), i.e. the share of items satisfying S among the items that satisfy A
  • Scndconf - confidence for the second rule - conditional probability P(S|A), i.e. the share of items satisfying S among the items that satisfy A
  • Deltaconf - absolute difference of confidences in first and second rule
  • Ratioconf - relative difference of confidences in first and second rule
  • Ratioconf_leq - relative difference of confidences in first and second rule - upper bound
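The comparison of the two rules can be sketched as follows. This is an illustration only: the function names and counts are made up, and reading Ratioconf as the first confidence divided by the second is an assumption, since the list above only calls it a relative difference.

```python
def conf(a, b):
    """Confidence P(S|A) from counts a (A and S) and b (A and not S)."""
    return a / (a + b)

def sd4ft_quantifiers(first, second):
    """Compare one rule A => S on two sets of records.

    first, second: (a, b) count pairs for the first and the second set.
    """
    c1, c2 = conf(*first), conf(*second)
    return {
        "Frstconf": c1,
        "Scndconf": c2,
        "Deltaconf": abs(c1 - c2),  # absolute difference of confidences
        "Ratioconf": c1 / c2,       # ASSUMPTION: ratio of first to second
    }

# The same rule holds for 90 of 100 A-records in the first set
# and for 60 of 100 A-records in the second set.
q = sd4ft_quantifiers(first=(90, 10), second=(60, 40))
print(q)
```

A task requiring, say, Deltaconf of at least 0.2 would report this pair, because the confidences 0.9 and 0.6 differ by 0.3.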

Warning

All control strings / keys are case sensitive.

Note

Detailed description of this procedure can be found at https://lispminer.vse.cz/guhate/doku.php?id=lm_guha_te_sd4ft_proc

4.5 Advanced CleverMiner procedure call

CleverMiner now supports methods. You can initialize the class (read the data) once and mine rules several times. To do this, follow these steps:

  • call the constructor with the df parameter only (or with all parameters for the first rule mining)
  • run each further rule mining via the .mine method, which has the same parameters as the constructor except that df is omitted.

Full example follows

5 Literal types

A cedent consists of literals. Each literal is an attribute together with its possible values.

In the previous example, we can see a cedent of length 2 to 2 (meaning that both attributes must be present in the rule), where the first attribute is of type subset and the second is of type one category.

All literal types (except one category) have a minimal and maximal length (number of values), whereas one category specifies an individual category value, as shown in the example above.
Available literal types are

  • one - one category, value key denotes which category will be used
  • subset - all subsets of length minlen to maxlen are verified
  • lcut - left cuts of length minlen to maxlen are verified
  • rcut - right cuts of length minlen to maxlen are verified
  • seq - sequences of length minlen to maxlen are verified
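The literal types can be illustrated by enumerating the value sets each of them generates for an ordered attribute. The sketch below assumes the usual meaning of the terms: a left cut starts at the first category, a right cut ends at the last category, and a sequence is a run of neighbouring categories. The function literal_values and the sample categories are made up for this example, and minlen is assumed to be at least 1.

```python
from itertools import combinations

def literal_values(categories, kind, minlen=1, maxlen=1, value=None):
    """List the value sets a literal of the given type would try."""
    cats = list(categories)
    n = len(cats)
    lengths = range(minlen, min(maxlen, n) + 1)
    if kind == "one":
        return [(value,)]                                   # one fixed category
    if kind == "subset":
        return [c for k in lengths for c in combinations(cats, k)]
    if kind == "lcut":
        return [tuple(cats[:k]) for k in lengths]           # first k categories
    if kind == "rcut":
        return [tuple(cats[n - k:]) for k in lengths]       # last k categories
    if kind == "seq":
        return [tuple(cats[i:i + k]) for k in lengths for i in range(n - k + 1)]
    raise ValueError(f"unknown literal type: {kind}")

levels = ["low", "medium", "high", "very high"]
for kind in ("subset", "lcut", "rcut", "seq"):
    print(kind, literal_values(levels, kind, minlen=1, maxlen=2))
print("one", literal_values(levels, "one", value="high"))
```

Cuts and sequences generate far fewer candidates than subsets, which is why they are attractive for ordered attributes with many categories.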

6 Overall package usage


6.1 Input dataset

The input dataset is a pandas DataFrame in which all attributes should be categorical. Note that all attributes are converted to an internal binary bitchain form, so if you have a large dataset with many unused attributes, please consider reducing it to improve computing time.
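The idea behind a bitchain representation can be sketched in a few lines. This is a conceptual illustration, not CleverMiner's internal format: each category of each attribute gets one Python integer in which bit i is set when row i has that category, so a conjunction becomes a bitwise AND and a frequency becomes a count of set bits. All names and data below are made up.

```python
def build_bitchains(column):
    """Map each category to an int whose bit i is set when row i has that category."""
    chains = {}
    for i, value in enumerate(column):
        chains[value] = chains.get(value, 0) | (1 << i)
    return chains

def count(chain):
    """Number of rows marked in a bitchain."""
    return bin(chain).count("1")

city = ["Berlin", "Dresden", "Berlin", "Prague", "Berlin"]
visit = ["private", "private", "business", "private", "private"]

city_bits = build_bitchains(city)
visit_bits = build_bitchains(visit)

# Conjunction City(Berlin) & Visit(private) is a single bitwise AND.
both = city_bits["Berlin"] & visit_bits["private"]
print(count(city_bits["Berlin"]), count(both))
```

Every attribute in the dataframe produces one chain per category, which is why dropping unused attributes before mining reduces the preparation work.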

All attributes should be categorical and ordered. String attributes are typically ordered by string sorting for sequences and cuts.

6.2 Working with results

Results are returned in JSON format that is not indented. If you work directly with this JSON, you may want to reformat it, for example in an online JSON formatter.
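As an alternative to an online formatter, compact JSON can be indented locally with Python's standard json module. The sample string below is a made-up stand-in for a result; its field names are hypothetical and do not describe the real result structure.

```python
import json

# Hypothetical compact JSON text standing in for a mining result.
raw = '{"taskinfo":{"proc":"4ftMiner"},"rules":[{"rule_id":1,"conf":0.857}]}'

# Parse the text and write it back with indentation for easier reading.
pretty = json.dumps(json.loads(raw), indent=2, sort_keys=True)
print(pretty)
```

If the result is already a Python dictionary rather than text, json.dumps can be applied to it directly without the json.loads step.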

There are also functions to print list of rules and print individual rule.

  • print_summary prints out task processing summary like time elapsed, number of verifications and number of rules found
  • print_rulelist prints out simplified list of all rules in human readable format
  • print_rule(i) prints out details of rule with id i

6.3 Advanced options (expert use only)

You may use advanced options (use it only when you really know what you are doing).

  • no_optimizations switches off optimizations (the search then also goes into branches where no rule can exist)
  • max_categories maximum number of categories allowed in input. Default is 100.

7 Disclaimer

Note that this documentation should be used as a companion to https://lispminer.vse.cz. The authors provide no warranty for the use of these pages or the package itself.

Danger

Note that the package is under development, and function calls, parameter names or structure may change. If you need compatibility, use the PRO version.

Danger

The authors provide no warranty for the use of this site or the package itself.