This page provides explanation and support for the programs to estimate Bayes Probability, as it is used in classification. As the programs and this explanation page use specific terms and abbreviations, and these are best demonstrated with examples, this introduction panel will describe the example used and the terminology.
The format of data entry and explanation of results produced are in the Help and Hints panel of the program pages.
Before we start: Modern computers perform calculations with precision to 14 decimal points. The programs on this page display results with precision to 4 decimal points. In the introduction panels, however, probability values are displayed to 2 decimal points to conserve space and make reading easier. Minor differences from the results of the program may occasionally arise, and some sets of probabilities do not total exactly 1. Readers should be aware of this and not be confused by it.
The Example
Please be aware that the data is artificially created and not real. The sample size is also deliberately small for easy display.
Predictor            Outcome
Hair      Eye        French   German   Italian
H_Dark    E_Blue     3        1        2
H_Dark    E_Brown    1        1        3
H_Dark    E_Others   1        1        1
H_Light   E_Blue     1        5        1
H_Light   E_Brown    2        1        2
H_Light   E_Others   2        1        1
a priori             0.5      0.33     0.17
We wish to develop a Bayesian model to identify the ethnicity of people, based on hair and eye color. To build our model, we recruited 10 each of known French, Germans, and Italians, and observed their hair and eye color. We then use the Bayesian model to predict ethnicity from hair and eye color, in a community where the expected ratio of French:German:Italian is 3:2:1, normalized to a priori probabilities of 0.5:0.33:0.17.
The counts of the combinations are presented in the table to the right, and the terms and abbreviations used are explained as follows
Bayesian Probability
Bayesian Probability Theory is a mathematical model for making decisions based on experience. The process uses a set of predictors to determine the probabilities of alternative outcomes. In the Bayesian context, prediction is not to forecast the future, nor to establish what may be true, but to logically apply the observed values (attributes) of the predictors to calculate how confident we can be, in terms of probabilities (a number between 0 and 1, or a percentage), in each of the alternative outcomes contained in our model.
The process of Bayesian decision making can be separated into the following stages:
We begin by nominating the a priori probabilities (π), our confidence that each of the alternative outcomes is correct, before taking the predictors into consideration. These can be established in the following ways:
We can decide that we do not know, and assign the same value as a priori probability to all outcomes
We can base the a priori probabilities on knowledge, from experience, research, previously collected data, hearsay, cultural belief, or simply a guess
We can propose a priori probabilities as a hypothesis to explore, such as "if the a priori probabilities are ...., then ....."
In our example, for the community where we will use our Bayesian model (north west of Switzerland), the census informs us that the ratio of French:German:Italian is 3:2:1. These are normalized to probabilities by dividing each value by the total, giving 0.5:0.33:0.17
We then use the coefficients of our model to apply the attributes of the predictors to change the a priori probabilities into a posteriori probabilities. The coefficients are developed using a set of reference data, in our example 10 cases of each ethnicity. Each coefficient is the probability of seeing an attribute given the outcome, P(a|o), obtained by dividing the number of cases with each attribute/outcome pair by the sample size of that outcome in the reference data. Both the Basic and the Naive Bayes models use P(a|o) as coefficients, but they are calculated, presented, and used differently. Details are presented in the 2 subsequent panels.
The coefficients P(a|o) interact with the attributes of the predictor(s) to convert the a priori probabilities into a posteriori probabilities, commonly referred to as the Bayesian probability
When there is only 1 predictor, as in the Basic Bayes model, attribute (a) represents each alternative value of the predictor, and the Bayesian probability is the probability given the attribute, πP(o|a)
When there is more than 1 predictor, as in the Naive Bayes model, pattern (p) represents an array of attributes, one from each predictor, and the Bayesian probability is the probability given the pattern, πP(o|p)
Two types of a posteriori probability can therefore be calculated using the coefficients we developed:
Probability of outcome using only the predictor(s), without taking a priori probability into consideration. In the Basic Bayes model with 1 predictor, this is the probability given the attribute, P(o|a), and in the Naive Bayes model with multiple predictors, the probability given the pattern of attributes combined, P(o|p). This probability is also termed the Maximum Likelihood, and the table of Maximum Likelihood describes the behaviour of the model.
Probability of outcome using the predictor(s) plus the a priori probabilities π. In the Basic Bayes model with 1 predictor, this is the probability given the attribute and the a priori probability, πP(o|a), and in the Naive Bayes model with multiple predictors, the probability given the pattern and the a priori probability, πP(o|p). This probability is also termed the Bayes or Bayesian Probability, and is the major and most commonly used a posteriori probability.
Summary and Terminology
The terminology and abbreviations used in this page and the associated programs are adapted from diverse sources, and may not be the same as in other publications. Users should be aware of this peculiarity when comparing these pages with other sources of information. These are chosen to prefer clarity over brevity, hoping that, by doing so, the inexperienced will be less confused. In particular, the following should be noted.
Predictor is a conceptual term representing things used to predict, and no abbreviation is provided in these pages. In other publications, a variety of terms and abbreviations, such as independent variable, x, j, are used
Attribute is the value of a predictor, and is abbreviated as a. In other publications, predictor, independent variable, x, j, and so on are used
Pattern, attribute list, and attributes combined are terms representing the same thing, an array of attributes, one from each predictor. This is abbreviated as p, and is used only in the Naive Bayes model. In other publications, predictor, independent variable, x, j, and so on are used
Outcome is used as a concept of things to predict, and also as the values (probabilities) predicted, and is abbreviated as o. In other publications, dependent variable, a posteriori, posterior probability, y, z, θ are used
The abbreviation P(x|y), representing the probability of x given y, generally known as conditional probability, is used in this page as in most publications. However, in most publications the same abbreviation is used (with different letters) to represent different types of conditional probabilities, while in these pages:
P(a|o) represents the probability of each attribute for a given outcome. Other publications use P(x|y), P(x|θ), or the names of predictors and outcomes
P(o|a) and P(o|p) represent the probability of each outcome given an attribute or a list of attributes combined (pattern), without consideration of a priori probabilities. Other publications use P(y|x), P(θ|x), or the names of predictors and outcomes. This represents Maximum Likelihood, a term used in these pages as in most publications
πP(o|a) and πP(o|p) represent Bayesian Probability, with π representing a priori probability. The term is an old one (see references), and used in these pages to distinguish it from Maximum Likelihood. In most publications the same abbreviation as Maximum Likelihood is used, and what the abbreviation means depends on the context described.
Basic Bayes
This panel discusses the basic Bayes model generally, and as presented in the program panel.
Table B1: Counts
Attributes           French   German   Italian
H_Dark+E_Blue        3        1        2
H_Dark+E_Brown       1        1        3
H_Dark+E_Others      1        1        1
H_Light+E_Blue       1        5        1
H_Light+E_Brown      2        1        2
H_Light+E_Others     2        1        1
a priori π           0.5      0.33     0.17
Basic Bayes is used to represent the Bayes Probability model, as originally described by Bayes and in the Wikipedia page on Bayes Theorem. The term "Basic" is used to avoid confusion with the Naive Bayes model, and is not applicable outside of this page.
The model uses a single predictor with 2 or more attributes to predict the probabilities of 2 or more outcomes. Where more than one variable is used to predict, they are combined into a single compound predictor. In the default example of this page, the data presented in the Introduction panel are restructured, so that the two variables, hair color (2 attributes of Dark and Light) and eye color (3 attributes of Blue, Brown and Others), are combined into a single compound predictor HairEye, with 2x3=6 attributes of H_Dark+E_Blue, H_Dark+E_Brown, H_Dark+E_Others, H_Light+E_Blue, H_Light+E_Brown, and H_Light+E_Others. The restructured table is used in the program, and is shown to the right.
Building the model: Attribute (a): compound hair and eye color; Outcome (o): ethnicity
Table B2. Model Coefficients P(a|o)
Attributes           French      German      Italian     Total
H_Dark+E_Blue        3/10=0.3    1/10=0.1    2/10=0.2    0.3+0.1+0.2=0.6
H_Dark+E_Brown       1/10=0.1    1/10=0.1    3/10=0.3    0.1+0.1+0.3=0.5
H_Dark+E_Others      1/10=0.1    1/10=0.1    1/10=0.1    0.1+0.1+0.1=0.3
H_Light+E_Blue       1/10=0.1    5/10=0.5    1/10=0.1    0.1+0.5+0.1=0.7
H_Light+E_Brown      2/10=0.2    1/10=0.1    2/10=0.2    0.2+0.1+0.2=0.5
H_Light+E_Others     2/10=0.2    1/10=0.1    1/10=0.1    0.2+0.1+0.1=0.4
The coefficients of the model, to be used to convert a priori to a posteriori probability, are the probabilities of attribute given outcome P(a|o). For the ith attribute and the jth outcome, P(a|o) is calculated by dividing the number of cases with that attribute/outcome pair (Ni,j) by the sample size of that outcome (Nj).
P(ai|oj) = Ni,j/Nj
In this example, the attributes are the hair eye color combinations, and the outcomes ethnicity. The results are shown in Table B2 to the right.
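To make the arithmetic concrete, the following is a minimal JavaScript sketch of this step, using the counts from Table B1. The variable names are illustrative only, and this is not the code of the program on this page.

    // Counts from Table B1: rows are attributes, columns are outcomes (French, German, Italian)
    const counts = [
      [3, 1, 2],   // H_Dark+E_Blue
      [1, 1, 3],   // H_Dark+E_Brown
      [1, 1, 1],   // H_Dark+E_Others
      [1, 5, 1],   // H_Light+E_Blue
      [2, 1, 2],   // H_Light+E_Brown
      [2, 1, 1]    // H_Light+E_Others
    ];
    // Sample size of each outcome (column totals, 10 each in this example)
    const nOutcome = [0, 1, 2].map(j => counts.reduce((sum, row) => sum + row[j], 0));
    // P(a|o) = N(i,j) / N(j), as in Table B2
    const pAO = counts.map(row => row.map((c, j) => c / nOutcome[j]));
    console.log(pAO[0]);  // first row: [0.3, 0.1, 0.2]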
Prediction 1. Maximum Likelihood P(o|a)
Table B3. Maximum Likelihood P(o|a)
Attribute            French         German         Italian
H_Dark+E_Blue        0.3/0.6=0.5    0.1/0.6=0.17   0.2/0.6=0.33
H_Dark+E_Brown       0.1/0.5=0.2    0.1/0.5=0.2    0.3/0.5=0.6
H_Dark+E_Others      0.1/0.3=0.33   0.1/0.3=0.33   0.1/0.3=0.33
H_Light+E_Blue       0.1/0.7=0.14   0.5/0.7=0.71   0.1/0.7=0.14
H_Light+E_Brown      0.2/0.5=0.4    0.1/0.5=0.2    0.2/0.5=0.4
H_Light+E_Others     0.2/0.4=0.5    0.1/0.4=0.25   0.1/0.4=0.25
If a posteriori probability is calculated without the inclusion of a priori probability π, the result is probability of outcome given attribute P(o|a), also called Maximum Likelihood. This describes the model, and demonstrates the relationship between attributes and outcomes.
P(o|a) is calculated from P(a|o) in Table B2. For each outcome j, the probability that it is predicted by attribute i (ai) is calculated as
P(oj|ai) = P(ai|oj) / Σ(P(ai|oj)) for all outcomes
The calculations and results are shown in Table B3 to the right. They suggest that, without including a priori probabilities, those with light hair and blue eyes are most likely to be German (0.71), those with dark hair and blue eyes or with light hair and other colored eyes are most likely to be French (0.5), while the other combined attributes do not clearly discriminate between the 3 ethnicities.
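A short JavaScript sketch of this normalization, using the H_Light+E_Blue row of Table B2 (illustrative variable names, not the program's own code):

    // Maximum Likelihood P(o|a) for one attribute: normalize its P(a|o) row so it sums to 1 (Table B3)
    const pAO_HLightEBlue = [0.1, 0.5, 0.1];                      // from Table B2: French, German, Italian
    const rowTotal = pAO_HLightEBlue.reduce((a, b) => a + b, 0);  // 0.7
    const ml = pAO_HLightEBlue.map(p => p / rowTotal);
    console.log(ml);  // approximately [0.14, 0.71, 0.14], as in Table B3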
Prediction 2. Bayesian Probability πP(o|a)
Table B4a. πP(a|o)
Attributes           French          German          Italian         Total
H_Dark+E_Blue        0.3x0.5=0.15    0.1x0.33=0.03   0.2x0.17=0.03   0.15+0.03+0.03=0.22
H_Dark+E_Brown       0.1x0.5=0.05    0.1x0.33=0.03   0.3x0.17=0.05   0.05+0.03+0.05=0.13
H_Dark+E_Others      0.1x0.5=0.05    0.1x0.33=0.03   0.1x0.17=0.02   0.05+0.03+0.02=0.10
H_Light+E_Blue       0.1x0.5=0.05    0.5x0.33=0.17   0.1x0.17=0.02   0.05+0.17+0.02=0.23
H_Light+E_Brown      0.2x0.5=0.10    0.1x0.33=0.03   0.2x0.17=0.03   0.10+0.03+0.03=0.17
H_Light+E_Others     0.2x0.5=0.10    0.1x0.33=0.03   0.1x0.17=0.02   0.10+0.03+0.02=0.15
Table B4b. Bayesian Probability πP(o|a)
Attributes           French           German           Italian
H_Dark+E_Blue        0.15/0.22=0.69   0.03/0.22=0.15   0.03/0.22=0.15
H_Dark+E_Brown       0.05/0.13=0.38   0.03/0.13=0.25   0.05/0.13=0.38
H_Dark+E_Others      0.05/0.10=0.50   0.03/0.10=0.33   0.02/0.10=0.17
H_Light+E_Blue       0.05/0.23=0.21   0.17/0.23=0.71   0.02/0.23=0.07
H_Light+E_Brown      0.10/0.17=0.60   0.03/0.17=0.20   0.03/0.17=0.20
H_Light+E_Others     0.10/0.15=0.67   0.03/0.15=0.22   0.02/0.15=0.11
If the a posteriori probability is calculated by using the predictor to modify the a priori probability π, the result is the probability of outcome given attribute and a priori probability, πP(o|a). This is usually referred to as the Bayesian Probability, as it follows the description first made by Bayes.
πP(o|a) is calculated from P(a|o) in Table B2 and the a priori probability π from Table B1. The calculations are in 2 steps. Firstly, the coefficient is adjusted by the a priori probability of each outcome, then the adjusted coefficients are normalized by the total over all outcomes. For each outcome j, with attribute i (ai) and a priori probability πj, the calculations are as follows
πjP(ai|oj) = P(ai|oj) x πj
πjP(oj|ai) = πjP(ai|oj) / Σ(πjP(ai|oj)) for all outcomes
The calculations and results are shown in tables B4a (step 1) and B4b (step 2). The Bayesian Probabilities suggest that, in a population of French:German:Italian of 0.5:0.33:0.17, those with light hair and blue eyes are most likely to be German at 0.71, and all other combinations are likely to be French, although those with dark hair and brown eyes are equally likely to be Italian at 0.38.
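A short JavaScript sketch of the two steps, using the H_Dark+E_Blue row of Table B2 (illustrative, not the program's own code):

    // Bayesian Probability πP(o|a) for one attribute: weight P(a|o) by the
    // a priori probabilities, then normalize (Tables B4a and B4b)
    const pAO_HDarkEBlue = [0.3, 0.1, 0.2];   // from Table B2: French, German, Italian
    const prior = [0.5, 0.33, 0.17];          // a priori π
    const weighted = pAO_HDarkEBlue.map((p, j) => p * prior[j]);  // step 1: πP(a|o)
    const total = weighted.reduce((a, b) => a + b, 0);
    const bayes = weighted.map(w => w / total);                   // step 2: πP(o|a)
    console.log(bayes);  // approximately [0.69, 0.15, 0.16], as in Table B4b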
Naive Bayes
This panel discusses the Naive Bayes model and the calculations used in the program panel.
The model uses 2 or more predictors, each with 2 or more attributes, to predict the probabilities of 2 or more outcomes. The term Naive refers to the naive assumption that the predictors are independent of each other.
In the default example in the Program panel, the data presented in the Introduction panel are restructured, so that the two predictors, hair color (2 attributes of Dark and Light) and eye color (3 attributes of Blue, Brown and Others), are counted separately. The restructured table is used in the program, and is shown to the right.
Please note: The term Col is used in all tables to represent predictor. Col 1 is predictor 1 (in this example hair color), Col 2 is predictor 2 (in this example eye color), and so on.
Building the model: Attribute (a): each predictor counted separately; Outcome (o): ethnicity
Table N2. Model Coefficients P(a|o)
Col   Attribute    French      German      Italian
1     H_Dark       5/10=0.5    3/10=0.3    6/10=0.6
1     H_Light      5/10=0.5    7/10=0.7    4/10=0.4
2     E_Blue       4/10=0.4    6/10=0.6    3/10=0.3
2     E_Brown      3/10=0.3    2/10=0.2    5/10=0.5
2     E_Others     3/10=0.3    2/10=0.2    2/10=0.2
The coefficients of the model, to be used to convert a priori to a posteriori probability, are the probabilities of attribute given outcome P(a|o). For the ith attribute from each predictor and the jth outcome, P(a|o) is calculated by dividing the number of cases with that attribute/outcome pair (Ni,j) by the sample size of that outcome (Nj).
P(ai|oj) = Ni,j/Nj
In this example, the attributes are the hair color (Col 1) and eye color (Col 2), and the P(a|o) for each attribute from each predictor (Col) are calculated separately. The results are shown in Table N2 to the right.
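A minimal JavaScript sketch of this step, using the per-predictor counts behind Table N2 (illustrative names, not the program's own code):

    // Naive Bayes coefficients P(a|o): each predictor is counted separately (Table N2)
    const n = [10, 10, 10];  // sample size of each outcome (French, German, Italian)
    const hairCounts = { H_Dark: [5, 3, 6], H_Light: [5, 7, 4] };                       // Col 1
    const eyeCounts  = { E_Blue: [4, 6, 3], E_Brown: [3, 2, 5], E_Others: [3, 2, 2] };  // Col 2
    const toPAO = counts => Object.fromEntries(
      Object.entries(counts).map(([attr, row]) => [attr, row.map((c, j) => c / n[j])])
    );
    const hairPAO = toPAO(hairCounts);   // e.g. hairPAO.H_Dark = [0.5, 0.3, 0.6]
    const eyePAO  = toPAO(eyeCounts);    // e.g. eyePAO.E_Blue  = [0.4, 0.6, 0.3]
    console.log(hairPAO.H_Dark, eyePAO.E_Blue);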
Creating the pattern coefficient: pattern(p) : Outcome (o)
When presented with an array of attributes (a pattern), Naive Bayes creates the coefficient for that particular pattern/outcome combination, P(p|o). This is created by multiplying the P(a|o) of each attribute/outcome combination in the list of attributes. An example, using the first pattern:
Col 1, hair color:Dark and outcome 1 French: P(a|o) = 0.5
Col 2, eye color:Blue and outcome 1 French: P(a|o) = 0.4
For attributes combined [H_Dark,E_Blue]/French P(p|o) = 0.5x0.4 = 0.2
The coefficients are shown as Table N2x here for demonstration purposes. In practice, the P(p|o) coefficients are calculated dynamically, depending on the attributes in the pattern array. As this is a dynamic intermediary step, the P(p|o) values are not presented as results, but are used to produce the a posteriori probabilities in the same manner as P(a|o) in the basic Bayes model.
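The multiplication can be sketched in a few lines of JavaScript for the pattern [H_Dark, E_Blue] (illustrative names, not the program's own code):

    // Pattern coefficient P(p|o): multiply the P(a|o) of each attribute in the
    // pattern, outcome by outcome
    const pHair = [0.5, 0.3, 0.6];   // P(H_Dark|o) from Table N2
    const pEye  = [0.4, 0.6, 0.3];   // P(E_Blue|o) from Table N2
    const patternCoeff = pHair.map((p, j) => p * pEye[j]);
    console.log(patternCoeff);  // [0.5x0.4, 0.3x0.6, 0.6x0.3] = [0.2, 0.18, 0.18]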
Prediction 1. Maximum Likelihood P(o|p)
Table N3. Maximum Likelihood P(o|p)
Col 1     Col 2       French           German           Italian
H_Dark    E_Blue      0.2/0.56=0.36    0.18/0.56=0.32   0.18/0.56=0.32
H_Dark    E_Brown     0.15/0.51=0.29   0.06/0.51=0.12   0.3/0.51=0.59
H_Dark    E_Others    0.15/0.33=0.45   0.06/0.33=0.18   0.12/0.33=0.36
H_Light   E_Blue      0.2/0.74=0.27    0.42/0.74=0.57   0.12/0.74=0.16
H_Light   E_Brown     0.15/0.49=0.31   0.14/0.49=0.29   0.2/0.49=0.41
H_Light   E_Others    0.15/0.37=0.41   0.14/0.37=0.38   0.08/0.37=0.22
If a posteriori probability is calculated without the inclusion of a priori probability π, the result is probability of outcome given a pattern of attributes P(o|p), also called Maximum Likelihood. This describes the model, and demonstrates the relationship between attributes and outcomes.
P(o|p) is calculated dynamically from the values demonstrated in Table N2x. For each outcome j, the probability that it is predicted by pattern i (pi) is calculated as
P(oj|pi) = P(pi|oj) / Σ(P(pi|oj)) for all outcomes
The calculations and results are shown in the table N3 to the right. They suggest that, without including a priori probability, those with brown eyes are most likely to be Italians (0.59 for dark hair and 0.41 for light hair), those with light hair and blue eyes German (0.57), and the rest French (0.36,0.45, 0.41).
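Continuing the sketch for the pattern [H_Dark, E_Blue], the normalization can be written as follows (illustrative, not the program's own code):

    // Maximum Likelihood P(o|p): normalize the pattern coefficients so they sum to 1 (Table N3)
    const pPattern = [0.2, 0.18, 0.18];               // P(p|o) for [H_Dark, E_Blue]
    const sum = pPattern.reduce((a, b) => a + b, 0);  // 0.56
    const mlPattern = pPattern.map(p => p / sum);
    console.log(mlPattern);  // approximately [0.36, 0.32, 0.32], as in Table N3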
Prediction 2. Bayesian Probability πP(o|p)
Table N4a. πP(p|o)
Col 1     Col 2       French           German            Italian           Total
H_Dark    E_Blue      0.2x0.5=0.10     0.18x0.33=0.06    0.18x0.17=0.03    0.10+0.06+0.03=0.19
H_Dark    E_Brown     0.15x0.5=0.08    0.06x0.33=0.02    0.3x0.17=0.05     0.08+0.02+0.05=0.15
H_Dark    E_Others    0.15x0.5=0.08    0.06x0.33=0.02    0.12x0.17=0.02    0.08+0.02+0.02=0.12
H_Light   E_Blue      0.2x0.5=0.10     0.42x0.33=0.14    0.12x0.17=0.02    0.10+0.14+0.02=0.26
H_Light   E_Brown     0.15x0.5=0.08    0.14x0.33=0.05    0.2x0.17=0.03     0.08+0.05+0.03=0.16
H_Light   E_Others    0.15x0.5=0.08    0.14x0.33=0.05    0.08x0.17=0.01    0.08+0.05+0.01=0.13
Table N4b. Bayesian Probability πP(o|p)
Col 1     Col 2       French           German           Italian
H_Dark    E_Blue      0.10/0.19=0.53   0.06/0.19=0.32   0.03/0.19=0.16
H_Dark    E_Brown     0.08/0.15=0.52   0.02/0.15=0.14   0.05/0.15=0.34
H_Dark    E_Others    0.08/0.12=0.65   0.02/0.12=0.17   0.02/0.12=0.17
H_Light   E_Blue      0.10/0.26=0.38   0.14/0.26=0.54   0.02/0.26=0.08
H_Light   E_Brown     0.08/0.16=0.48   0.05/0.16=0.30   0.03/0.16=0.22
H_Light   E_Others    0.08/0.13=0.56   0.05/0.13=0.35   0.01/0.13=0.10
If the a posteriori probability is calculated by using the pattern to modify the a priori probability π, as in the majority of predictions, the result is the probability of outcome given pattern and a priori probability, πP(o|p). This is usually referred to as the Naive Bayesian Probability.
πP(o|p) is calculated from the dynamically calculated P(p|o), as shown in Table N2x, and the a priori probability π from Table N1. The calculations are in 2 steps. Firstly, the coefficient P(p|o) is adjusted by the a priori probability of each outcome, then the adjusted coefficients are normalized by the total over all outcomes. For each outcome j, with pattern i (pi) and a priori probability πj, the calculations are as follows
πjP(pi|oj) = P(pi|oj) x πj
πjP(oj|pi) = πjP(pi|oj) / Σ(πjP(pi|oj)) for all outcomes
The calculations and results are shown in tables N4a (step 1) and N4b (step 2). The Bayesian Probabilities suggest that, in a population of French:German:Italian of 0.5:0.33:0.17, those with light hair and blue eyes are most likely to be German at 0.54, and all other combinations are likely to be French.
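The two steps can be sketched in JavaScript for the pattern [H_Dark, E_Blue] (illustrative, not the program's own code):

    // Naive Bayesian Probability πP(o|p): weight the pattern coefficients by the
    // a priori probabilities, then normalize (Tables N4a and N4b)
    const pPattern = [0.2, 0.18, 0.18];      // P(p|o) for [H_Dark, E_Blue]
    const prior = [0.5, 0.33, 0.17];         // a priori π: French, German, Italian
    const weighted = pPattern.map((p, j) => p * prior[j]);   // step 1: πP(p|o)
    const total = weighted.reduce((a, b) => a + b, 0);       // approximately 0.19
    const bayesPattern = weighted.map(w => w / total);       // step 2: πP(o|p)
    console.log(bayesPattern);  // approximately [0.53, 0.31, 0.16], as in Table N4b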
Discussions
Explanations for the terms and abbreviations used, and the mathematical algorithms, have already been discussed in the previous panels. This panel discusses, in conceptual terms, the differences between the Basic and Naive Bayesian models, and how they may be used.
Basic Bayes Probability Model
The Basic Bayes model is based on the original Bayes Theorem. The term "Basic" is used in this page to distinguish it from the Naive Bayes model. In this discussion, therefore, Basic Bayes and Bayes mean the same thing.
Conditional probability is about estimating, and changing our estimate of, a probability when presented with conditions we know are associated with the outcome of interest. For example, if we know that a particular poker player blinks a lot when he is bluffing, then we can conclude he is likely to be bluffing when he blinks a lot. The model is simple, elegant, and intuitively easy to understand and accept. What follows are discussions of its usage in practice.
Advantages
The main advantage is that the model is self-evidently valid, as it is the mathematical form of an accepted practice of basing decisions on knowledge and experience. Given that each attribute and each outcome is unique, their relationship is also unique, so there is no ambiguity in what the results represent. The user can therefore be confident in translating the numerical results into decisions and actions, provided the model is based on valid data and the a priori probabilities are well chosen.
The model is also easy to use. Tables of coefficients, and tables translating attributes to probabilities of outcomes, can be produced by researchers, and distributed to end decision makers, either physically as reference tables, or electronically as computer applications.
Disadvantages
The major disadvantage of the basic Bayesian model is its inability to cope with too many predictors.
In our example, with 2 predictors of 2 hair and 3 eye colors, the number of combined attributes is 2x3=6. If we add skin color (say pale, tan, and dark), the number of attributes becomes 6x3=18. If we then add body build (skinny, fat, muscular), 18x3=54. Then temperament (stoic, emotional), 54x2=108, and so on. Each additional predictor multiplies the number of combinations, which can exceed 1000 once more than 9 binary predictors are included.
The problem is not so much computational complexity, as the high speed and large memory of modern computers can always cope.
The main problem is to find sufficient cases to build a valid reference model in the first place. If probabilities are to be accurate to 0.1 (10%), then at least 10 cases are needed for each attribute/outcome pair. Some of the combinations may be uncommon (say fat, calm, dark skin, light hair, blue eyes, and French), and to fill all combinations with sufficient cases may require sample sizes that are impractical to collect.
The model cannot cope with missing data, as each attribute is a combination of all the predictors, and all observations must be present for the attribute to be valid. For example, we cannot make a prediction when the person is too far away for us to know the color of his/her eyes, or if the person has no hair. Missing data is common in situations where decisions need to be made with incomplete information. For example, when a patient arrives in the emergency department of a hospital, the triage nurse has to make a decision based on the major symptom. Once the patient is admitted, decisions are based on additional history and physical examination. After a few hours, decisions are modified with test results. A day later, they are modified again based on the progress observed. This means that missing data can be normal, and a single model cannot cope with this. In these situations, an individual model will need to be produced for every conceivable combination of available and missing data that may arise, and this magnifies the scalability problem.
The model is inflexible, in that each attribute in a compound predictor is actually a combination of information from different original predictors. This means that no original predictor can be added, removed, or have its relationship with outcomes altered without changing the attributes in the model. If any change is required, a new model will need to be constructed. For example, after using our example model for a while, we observe that the 3 ethnic groups have different temperaments, and wish to include this in our model. We cannot simply add temperament to the model, as we would have to create 12 new patterns of hair_eye_temperament, essentially building a new model with new data. This inflexibility is particularly important in rapidly changing environments, such as medical care, where disease patterns and technologies evolve rapidly.
Usage
The basic Bayesian model is preferred if the following conditions can be met:
When the relationship between predictors and outcomes are stable
When the total number of attributes is not so great that the sample size needed to build the model cannot be obtained
Where there is no attribute/outcome combination that is so uncommon that the total sample size required cannot be reached
When missing data is not expected
Where a sufficiently large sample size is available to build the model.
Naive Bayes Probability
Naive Bayes probability is used only when there are multiple predictors. It differs from the basic model by accepting, naively, that the predictors are independent of each other.
Conceptually, the model can be considered as a sequence of basic Bayesian calculations. We start with the a priori probabilities π and modify them with the attribute of the first predictor to produce the first a posteriori probabilities; these become the a priori probabilities for the attribute of the second predictor, whose results in turn become the a priori probabilities for the third predictor, and so on.
The computations are simplified by multiplying all the coefficients P(a|o) representing the attributes to create the pattern coefficient P(p|o), and using this in the same way as in the basic Bayesian model.
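The equivalence between the sequential view and the single multiplication can be checked with a short JavaScript sketch, using the Table N2 coefficients for dark hair and blue eyes (illustrative names, not the program's own code):

    // Sequential updating: start from the a priori probabilities, update with hair
    // color, renormalize, then update with eye color and renormalize again
    const normalize = v => { const t = v.reduce((a, b) => a + b, 0); return v.map(x => x / t); };
    const pHairDark = [0.5, 0.3, 0.6];   // P(H_Dark|o) from Table N2
    const pEyeBlue  = [0.4, 0.6, 0.3];   // P(E_Blue|o) from Table N2
    let prob = [0.5, 0.33, 0.17];                             // a priori π
    prob = normalize(prob.map((p, j) => p * pHairDark[j]));   // after seeing dark hair
    prob = normalize(prob.map((p, j) => p * pEyeBlue[j]));    // after also seeing blue eyes
    console.log(prob);  // approximately [0.53, 0.31, 0.16], the same as πP(o|p) in Table N4b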
The advantages and disadvantages of using the Naive Bayes model are the mirror image of the basic Bayes model.
Advantages
The Naive Bayes model can include a large number of predictors.
The number of P(a|o) required increases linearly with the number of predictors for the Naive Bayes model, compared with exponentially for the basic Bayes model. From our example, 2 hair and 3 eye colors require 2+3=5 P(a|o) cells for each outcome for Naive Bayes, instead of 2x3=6 P(p|o) for basic Bayes. Using n binary predictors, Naive Bayes requires 2xn individual attributes, but basic Bayes requires 2^n combinations of attributes
By assuming that all attributes are independent of each other, a workable model can be built provided a sufficient number of cases is available for each attribute, compared with basic Bayes where sufficient numbers are required for each combination of attributes
In other words, the model can be built using a smaller and more practical sample size.
The Naive Bayes model copes better with missing data. As all attributes are assumed independent, the absence of an attribute merely means that the influence of that attribute is not included. The results will be less precise, but the algorithm will calculate the a posteriori probabilities with the data it has, instead of not computing at all, as happens when the Basic Bayesian model confronts missing data.
The Naive Bayes model is more flexible. It is easier to add, delete, or change a predictor. Given the assumption that all predictors are independent, and the P(p|o), P(o|p), and πP(o|p) are calculated dynamically, the addition, removal, or change in any predictor should not affect the performance of the other predictors. Most of the calculations would remain the same, but the results would be modified by the changes. This flexibility is particularly useful in situations where the relationship between predictors and outcome may change, or technological development requires additional predictors to be added from time to time, such as in medical care.
Disadvantages
The main disadvantage of, or objection to, the Naive Bayes model is the naive assumption that the predictors are independent, because this assumption is rarely correct. Two types of dependency may exist in any set of predictors.
Correlation may exist between predictors, as many predictors have common precursors in genetics, geography, history, culture, and so on. When a correlation exists, the common precursor is used more than once in the calculation, and this inserts an unaccounted bias into the results. In our example, hair and eye color may be correlated, as they are both dependent on pigmentation generally. If we combine these with, say, temperament in a model, we will use the influence of pigmentation twice (two colors) against temperament once, resulting in a biased decision.
Interaction is when predictors have synergistic or inhibitory effects on each other in their relationship with outcomes.
An example of synergism is in the treatment of cancer. If surgery has a cure rate of 10% and chemotherapy 10%, then we would expect that providing both will result in a cure rate of 19% (1-0.9x0.9=0.19) if there is no interaction. In many cases however, surgery makes the remaining cancer cells more sensitive to chemotherapy, and the cure rate of combined therapy is more than 19%.
An example of inhibition is in the use of antibiotics to treat a particular infection. An antibiotic, say penicillin, may cure this infection in 10% of cases, and another, say tetracycline, also 10%. In the absence of interaction we would expect a 19% cure rate if both are used. However, penicillin works by destroying bacterial cell walls and is most effective when bacteria are actively growing, while tetracycline works by slowing the growth of bacteria. Giving both may therefore result in one inhibiting the effect of the other, and a cure rate of less than 19%.
The error caused by the naive assumption in the Bayes model is difficult to quantify, so the precision and accuracy can only be estimated when the model is deployed after development.
Choice between models
The Basic Bayes model should be the first option, given its theoretical validity and ease of use.
Given that the assumption of predictor independence is naive and contains unquantifiable bias, the Naive Bayes model is a compromise solution, chosen when the Basic Bayes model is not practicable. When chosen, the following should be observed:
The Naive Bayes model should be considered only as an approach to develop a prediction model, its validity and usefulness should not be presumed.
The choice of predictors should be carefully considered to ensure that they are as independent from each other as possible. When correlation or interaction is suspected, predictors can be combined into compound predictors before they are added to the model
The performance of the completed model requires constant review, and the model should be adjusted when inaccuracies are detected or when circumstances change.
References
The following references are introductory in nature, mainly to help the inexperienced.
https://arbital.com/p/bayes_rule/?l=1zq An online introduction to Bayes theorem. It provides good detailed explanations, and links to additional pages that cater to different levels of knowledge and needs
Mueller JP and Massaron L (2016) Machine Learning for Dummies. John Wiley and Son, Inc New Jersey, ISBN 978-1-19-24551-3. p.158-163.
This gives only a brief introduction to both Bayes algorithms. However, it provides perspectives on how Bayes probability is used in data analysis, in comparison with many other methods. There is not much guidance on calculations, however, as the book relies heavily on the use of R and Python packages.
Old references
The following 2 references introduced me to Naive Bayes many years ago, before the term naive became commonly used. I include them partly out of sentiment, but mostly because they had a major influence on how the models are presented on this page.
Warner HR, Toronto AF, Veasey LG, and Stephenson R (1961) A Mathematical Approach to Medical Diagnosis. Application to Congenital Heart Disease. JAMA 117:3 p.177-183.
This paper used the formula that is the same as what is now Naive Bayes probability, but neither Bayes nor Naive was mentioned. The paper quoted Lusted as the source of the mathematical approach.
The paper discussed the difficulty of combining multiple predictors into a single compound predictor, as the list of observations needed to differentiate between congenital heart diseases was too great to do so in practice (my interpretation).
The paper presented a complicated set of mathematical arrangements to deal with a mix of binomial and multinomial predictors. It was easy enough to understand, but difficult to present concisely and dynamically in the web page format, and I had tried a number of different ways to present the calculations and results before ending with the format on this page
The paper went on to warn the reader about the pitfalls of correlation between multiple predictors and the consequent tautology, and recommended their careful selection and management.
Over the next few years, the authors published a number of additional papers on the subject, reviewing the results and validating the use of the model they built.
This paper therefore provided the initial prompt that led me to develop the arguments listed in the advantages/disadvantages and usage sections in the discussion panel. It also prompted the evolution of how the concepts and results are presented, up to the current form on this page.
Overall JE and Klett CJ (1972) Applied Multivariate Analysis. McGraw Hill Series in Psychology.
McGraw Hill Book Company New York. Library of Congress No. 73-14716407-047935-6 p.400-412.
This is an old book that may be out of print. In its days it was a recommended text for multivariate statistics at the Masters level.
Included in the book was a brief chapter on Bayes Probability, and that chapter presented the formulae that are the same as those currently known as Naive Bayes. The term "pattern probability" was used, and my guess is that the term Naive was at the time not yet in common use.
I began to understand Bayesian probability from reading this book, as by using different abbreviations, the book clarifies the difference between the basic and the naive models.
From this book, I acquired the terms attribute and pattern, and the abbreviation πP(o|p) to represent Bayes Probability. The book used j for outcome, x for attributes, and arrays of attributes for patterns. I converted these to a for attribute, p for pattern, and o for outcome, making it easier for the inexperienced to follow the explanations and algorithms.
Calculations
Hints and Suggestions
Data Entry
Multiple points of data entry
The Javascript program is a single cascading program, which allows data entry at different stages of calculation
The top textarea (attribute and outcome designation) accepts the raw data. It then calculates the number of cases for each attribute and outcome, and deposits the results in the table of counts textArea
The table of counts is converted to the table of probabilities for each attribute given an outcome P(a|o), and deposits the results in the Table of P(a|o) Coefficient table.
The table of probability P(a|o) is used to calculate Bayesian Probability
Where the user has already produced the count table or the probability table, data entry in the earlier textareas can be bypassed.
Format of data
Attribute and outcome, the first textarea, consists of 2 or more columns of text data
Each row contains a subject under study
The rightmost column contains the names of the outcome. To the left, each column contains a predictor variable
The cells contain the name of the attribute of that predictor variable, or the name of the outcome (for the last column to the right)
All the names are single words using keyboard characters
When the data has only 2 columns, only 1 predictor variable is used, and the Basic Bayes model is calculated. Otherwise the Naive Bayes model is calculated. (A short example of the expected layout is shown at the end of this section.)
Count Table, the second textarea, contains the count of number of cases for each attribute of each predictor variable (col in first text area)
The first row names the columns
Each subsequent row contains an attribute
Column 1 (col) is a number starting with 1, and represents the predictor number
Column 2 (Attrib) is the name of the attribute
All other columns to the right are the counts of the number of cases with that attribute and that outcome
Table of P(a|o) Coefficients, the third and last textarea, contains the probability of having the attribute given an outcome, P(a|o).
The first row names the columns
Each subsequent row contains an attribute
Column 1 (col) is a number starting with 1, and represents the predictor number
Column 2 (Attrib) is the name of the attribute
All other columns to the right are the Probability of that attribute for that outcome. These are obtained by dividing each count by the total number of cases in each outcome.
P(a|o) are the coefficients used to calculate Bayes Probability
Array of Apriori Probabilities must have the same number of elements as the number of groups. The numbers are relative frequencies or probabilities prior to the Bayesian analysis (apriori). These can be actual counts, relative weights, or actual probabilities. For example, 3 2 1, 0.5 0.33 0.17, or 60 40 20 are the same apriori frequencies.
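As an illustration (the layout is inferred from the descriptions above; the default data shipped with the program may differ), the first textarea could begin with rows like these, one subject per row with the outcome in the last column:

    H_Dark   E_Blue    French
    H_Light  E_Others  German
    H_Dark   E_Brown   Italian

and the array of apriori probabilities is a single row of relative frequencies, for example:

    3 2 1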
Results
The results are calculated using the data in the third textarea, the table of probabilities of an attribute given an outcome P(a|o). These are the coefficients used to calculate the Bayesian Probabilities
Probability of Each Attribute given a known outcome P(a|o)
This is a reproduction of the coefficients table, to be used for calculation
The first column is the order of the predictor variable
The second column the attribute name
The remaining columns the probability of that attribute given each of the output groups
The last row of the table contains the apriori probabilities, calculated by dividing each element of the input array by the total.
Probabilities of Each Outcome Given an Attribute (uncorrected for prior probabilities), P(o|a)
This is the probability of each outcome for each attribute, assuming the same apriori probability for all groups
The first column is the order of the predictor variable
The second column the attribute name
The remaining columns the probability of that group given the attribute of that row
Probabilities of Each Outcome Given a pattern (combination of Attributes) (uncorrected for prior probabilities), P(o|p)
This is only applicable in the Naive Bayes model, when more than 1 predictor is used. It assumes the same apriori probability for all groups
The first column is the order of the predictor variable
A number of columns for the attributes of the different predictors forming the pattern follow to the right
Finally, the remaining columns the probability of the group in that column, given the combination of attributes in that row
The probabilities are obtained by multiplying the elements of probability for each attribute, then normalizing by dividing each by the sum of probabilities over all groups
Bayesian Probabilities, πP(o|a) for a single predictor and πP(o|p) for multiple predictors
This is the same table as the previous one, except that the probabilities are corrected for the different apriori probability of each group.
This is calculated by multiplying the probability of each group by its corresponding apriori probability, then normalizing by dividing by the sum of probabilities over all groups
Comparison of results of calculations with the original data
This is presented only when the original data in the first textbox are used in calculations. It produces a table where the original outcome (in rows) and estimated outcome (in columns) are counted. This allows a display of the relationship between the original and estimated outcomes
Javascript and html script for future calculations of Bayesian Probability
This is presented only when there is more than 1 predictor variable, and the Naive Bayes model is used. The script is in maroon color, and can be copied and pasted into a text editor, and saved as an html file. The file can then be displayed in any web browser.
The attributes are presented as radiobuttons, each row represents a predictor variable. The probabilities of the outcomes are then estimated.
Javascript Program
Attributes ± Outcome Designation
Array of Apriori Probabilities
Input Using Attributes ± Outcome Designation The data is a table with 2 or more columns
Each row contains data from a case from the reference data
The last, right most, column is for the outcome
All columns to the left are for attributes, each column a predictor
Each cell the name of attribute or outcome
All names must be single words
Array of Apriori Probabilities Probability for each outcome before attributes are known
Single row, number of columns = number of outcomes
Columns separated by spaces or tabs
Values representing relative probabilities or frequencies
Table of Counts
Array of Apriori Probabilities
Input Using a Table of Counts This is a table of counts
Col 1 contains the predictors in numerical order
Col 2 contains attributes of predictor
All other columns to the right for the outcomes
The first row contains the outcome names
Each following row are for each attribute
Each cell is the count of the attribute (row) for that outcome (Col)
Array of Apriori Probabilities Probability for each outcome before attributes are known
Single row, number of columns = number of outcomes
Columns separated by spaces or tabs
Values representing relative probabilities or frequencies
Table of P(a|o) Coefficients
Array of Apriori Probabilities
Input Using a Table of Probability of attributes given the outcome P(a|o) This is a table of probabilities P(a|o)
Col 1 contains the predictors in numerical order
Col 2 contains attributes of predictor
All other columns to the right for the outcomes
The first row contains the outcome names
Each following row are for each attribute
Each cell is the probability of that attribute for the outcome P(a|o)
Array of Apriori Probabilities Probability for each outcome before attributes are known
Single row, number of columns = number of outcomes
Columns separated by spaces or tabs
Values representing relative probabilities or frequencies