This page provides explanation and support for the programs to estimate Bayes Probability, as it is used in classification. As the programs and this explanation page use specific terms and abbreviations, and these are best demonstrated with examples, this introduction panel will describe the example used and the terminology.
The format of data entry and explanation of results produced are in the Help and Hints panel of the program pages.
Before we start: Modern computers perform calculations with precision to 14 decimal points. The programs on this page display results with precision to 4 decimal points. In the introduction panels, however, probability values are displayed to 2 decimal points to conserve space and make reading easier. Minor differences from the results of the program may occasionally arise, and some sets of probabilities do not total exactly 1. Readers should be aware of this and not be confused by it.
The Example
Please be aware that the data is artificially created and not real. The sample size is also deliberately small for easy display.
Predictor            Outcome
Hair      Eye        French   German   Italian
H_Dark    E_Blue     3        1        2
H_Dark    E_Brown    1        1        3
H_Dark    E_Others   1        1        1
H_Light   E_Blue     1        5        1
H_Light   E_Brown    2        1        2
H_Light   E_Others   2        1        1
a priori             0.5      0.33     0.17
We wish to develop a Bayesian model to identify the ethnicity of people, based on hair and eye color. To build our model, we recruited 10 each of known French, Germans, and Italians, and observed their hair and eye color. We then use the Bayesian model to predict ethnicity from hair and eye color, in a community where the expected ratio of French:German:Italian is 3:2:1, normalized to a priori probabilities of 0.5:0.33:0.17.
The counts of the combinations are presented in the table to the right, and the terms and abbreviations used are explained as follows
Bayesian Probability
Bayesian Probability Theory is a mathematical model for making decisions based on experience. The process uses a set of predictors to determine the probabilities of alternative outcomes. In the Bayesian context, prediction is not to forecast the future, nor to establish what may be true, but to logically apply the observed values (attributes) of the predictors to calculate how confident we can be, in terms of probabilities (a number between 0 and 1, or a percentage), in each of the alternative outcomes contained in our model.
The process of Bayesian decision making can be separated into the following stages:
We begin by nominating the a priori probabilities (π), our confidence that each of the alternative outcomes is correct, before taking the predictors into consideration. These can be established in the following ways:
We can decide that we do not know, and assign the same value as a priori probability to all outcomes
We can base the a priori probabilities on knowledge, from experience, research, previously collected data, hearsay, cultural belief, or simply a guess
We can propose a priori probabilities as a hypothesis to explore, such as "if the a priori probabilities are ...., then ....."
In our example, for the community where we will use our Bayesian model (north west of Switzerland), the census informs us that the ratio of French:German:Italian is 3:2:1. These are normalized to probabilities by dividing each value by the total, giving 0.5:0.33:0.17
We then use the coefficients of our model to apply the attributes of the predictors to change the a priori probabilities into a posteriori probabilities. The coefficients are developed using a set of reference data, in our example 10 cases of each ethnicity. Each coefficient is the probability of seeing an attribute given the outcome, P(a|o), obtained by dividing the number of cases with each attribute/outcome pair by the sample size of that outcome in the reference data. Both the Basic and the Naive Bayes models use P(a|o) as coefficients, but they are calculated, presented, and used differently. Details are presented in the 2 subsequent panels.
The coefficients P(a|o) interact with the attributes of the predictor(s) to convert the a priori probabilities into a posteriori probabilities, commonly referred to as the Bayesian probability
When there is only 1 predictor, as in the Basic Bayes model, attribute (a) represents each alternative value of the predictor, and the Bayesian probability is the probability given the attribute, πP(o|a)
When there is more than 1 predictor, as in the Naive Bayes model, pattern (p) represents an array of attributes, one from each predictor, and the Bayesian probability is the probability given the pattern, πP(o|p)
Two types of a posteriori probability can therefore be calculated using the coefficients we developed:
Probability of outcome using only the predictor(s), without taking a priori probability into consideration. In the Basic Bayes model with 1 predictor, this is the probability given the attribute, P(o|a), and in the Naive Bayes model with multiple predictors, the probability given the pattern of attributes combined, P(o|p). This probability is also termed the Maximum Likelihood, and the table of Maximum Likelihood describes the behaviour of the model.
Probability of outcome using the predictor(s) plus the a priori probabilities π. In the Basic Bayes model with 1 predictor, this is the probability given the attribute and the a priori probability, πP(o|a), and in the Naive Bayes model with multiple predictors, the probability given the pattern and the a priori probability, πP(o|p). This probability is also termed the Bayes or Bayesian Probability, and is the major and most commonly used a posteriori probability.
Summary and Terminology
The terminology and abbreviations used in this page and the associated programs are adapted from diverse sources, and may not be the same as in other publications. Users should be aware of this peculiarity when comparing these pages with other sources of information. These are chosen to prefer clarity over brevity, hoping that, by doing so, the inexperienced will be less confused. In particular, the following should be noted.
Predictor is a conceptual term representing things used to predict, and no abbreviation is provided in these pages. In other publications, a variety of terms and abbreviations, such as independent variable, x, j, are used
Attribute is the value of a predictor, and is abbreviated as a. In other publications, predictor, independent variable, x, j, and so on are used
Pattern, attribute list, and attributes combined are terms representing the same thing, an array of attributes, one from each predictor. This is abbreviated as p, and is used only in the Naive Bayes model. In other publications, predictor, independent variable, x, j, and so on are used
Outcome is used as a concept of things to predict, and also as the values (probabilities) predicted, and is abbreviated as o. In other publications, dependent variable, a posteriori, posterior probability, y, z, θ are used
The abbreviation P(x|y), representing the probability of x given y, generally known as conditional probability, is used in this page as in most publications. However, in most publications the same abbreviation is used (with different letters) to represent different types of conditional probabilities, while in these pages:
P(a|o) represents the probability of each attribute for a given outcome. Other publications use P(x|y), P(x|θ), or the names of predictors and outcomes
P(o|a) and P(o|p) represent the probability of each outcome given an attribute or a list of attributes combined (pattern), without consideration of a priori probabilities. Other publications use P(y|x), P(θ|x), or the names of predictors and outcomes. This represents Maximum Likelihood, a term used in these pages as in most publications
πP(o|a) and πP(o|p) represent Bayesian Probability, with π representing a priori probability. The term is an old one (see references), and used in these pages to distinguish it from Maximum Likelihood. In most publications the same abbreviation as Maximum Likelihood is used, and what the abbreviation means depends on the context described.
Basic Bayes
This panel discusses the basic Bayes model generally, and as presented in the program panel.
Table B1: Counts
Attributes           French   German   Italian
H_Dark+E_Blue        3        1        2
H_Dark+E_Brown       1        1        3
H_Dark+E_Others      1        1        1
H_Light+E_Blue       1        5        1
H_Light+E_Brown      2        1        2
H_Light+E_Others     2        1        1
a priori π           0.5      0.33     0.17
Basic Bayes is used to represent the Bayes Probability model, as originally described by Bayes and in the Wikipedia page on Bayes Theorem. The term "Basic" is used to avoid confusion with the Naive Bayes model, and is not applicable outside of this page.
The model uses a single predictor with 2 or more attributes to predict the probabilities of 2 or more outcomes. Where more than one variable is used to predict, they are combined into a single compound predictor. In the default example of this page, the data presented in the Introduction panel are restructured, so that the two variables, hair color (2 attributes of Dark and Light) and eye color (3 attributes of Blue, Brown and Others), are combined into a single compound predictor HairEye, with 2x3=6 attributes of H_Dark+E_Blue, H_Dark+E_Brown, H_Dark+E_Others, H_Light+E_Blue, H_Light+E_Brown, and H_Light+E_Others. The restructured table is used in the program, and is shown to the right.
Building the model: Attribute (a): compound hair and eye color; Outcome (o): ethnicity
Table B2. Model Coefficients P(a|o)
Attributes           French      German      Italian     Total
H_Dark+E_Blue        3/10=0.3    1/10=0.1    2/10=0.2    0.3+0.1+0.2=0.6
H_Dark+E_Brown       1/10=0.1    1/10=0.1    3/10=0.3    0.1+0.1+0.3=0.5
H_Dark+E_Others      1/10=0.1    1/10=0.1    1/10=0.1    0.1+0.1+0.1=0.3
H_Light+E_Blue       1/10=0.1    5/10=0.5    1/10=0.1    0.1+0.5+0.1=0.7
H_Light+E_Brown      2/10=0.2    1/10=0.1    2/10=0.2    0.2+0.1+0.2=0.5
H_Light+E_Others     2/10=0.2    1/10=0.1    1/10=0.1    0.2+0.1+0.1=0.4
The coefficients of the model, to be used to convert a priori to a posteriori probability, are the probabilities of attribute given outcome P(a|o). For the ith attribute and the jth outcome, P(a|o) is calculated by dividing the number of cases with that attribute/outcome pair (Ni,j) by the sample size of that outcome (Nj).
P(ai|oj) = Ni,j/Nj
In this example, the attributes are the hair eye color combinations, and the outcomes ethnicity. The results are shown in Table B2 to the right.
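To make the arithmetic concrete, the following is a minimal JavaScript sketch of this step, using the counts from Table B1. The variable names are illustrative only, and this is not the code of the program on this page.

    // Counts from Table B1: rows are attributes, columns are outcomes (French, German, Italian)
    const counts = [
      [3, 1, 2],   // H_Dark+E_Blue
      [1, 1, 3],   // H_Dark+E_Brown
      [1, 1, 1],   // H_Dark+E_Others
      [1, 5, 1],   // H_Light+E_Blue
      [2, 1, 2],   // H_Light+E_Brown
      [2, 1, 1]    // H_Light+E_Others
    ];
    // Sample size of each outcome (column totals, 10 each in this example)
    const nOutcome = [0, 1, 2].map(j => counts.reduce((sum, row) => sum + row[j], 0));
    // P(a|o) = N(i,j) / N(j), as in Table B2
    const pAO = counts.map(row => row.map((c, j) => c / nOutcome[j]));
    console.log(pAO[0]);  // first row: [0.3, 0.1, 0.2]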
Prediction 1. Maximum Likelihood P(o|a)
Table B3. Maximum Likelihood P(o|a)
Attribute            French         German         Italian
H_Dark+E_Blue        0.3/0.6=0.5    0.1/0.6=0.17   0.2/0.6=0.33
H_Dark+E_Brown       0.1/0.5=0.2    0.1/0.5=0.2    0.3/0.5=0.6
H_Dark+E_Others      0.1/0.3=0.33   0.1/0.3=0.33   0.1/0.3=0.33
H_Light+E_Blue       0.1/0.7=0.14   0.5/0.7=0.71   0.1/0.7=0.14
H_Light+E_Brown      0.2/0.5=0.4    0.1/0.5=0.2    0.2/0.5=0.4
H_Light+E_Others     0.2/0.4=0.5    0.1/0.4=0.25   0.1/0.4=0.25
If a posteriori probability is calculated without the inclusion of a priori probability π, the result is probability of outcome given attribute P(o|a), also called Maximum Likelihood. This describes the model, and demonstrates the relationship between attributes and outcomes.
P(o|a) is calculated from P(a|o) in Table B2. For each outcome j, the probability that it is predicted by attribute i (ai) is calculated as
P(oj|ai) = P(ai|oj) / Σ(P(ai|oj)) for all outcomes
The calculations and results are shown in Table B3 to the right. They suggest that, without including a priori probabilities, those with light hair and blue eyes are most likely to be German (0.71), those with dark hair and blue eyes or with light hair and other colored eyes are most likely to be French (0.5), while the other combined attributes do not clearly discriminate between the 3 ethnicities.
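A short JavaScript sketch of this normalization, using the H_Light+E_Blue row of Table B2 (illustrative variable names, not the program's own code):

    // Maximum Likelihood P(o|a) for one attribute: normalize its P(a|o) row so it sums to 1 (Table B3)
    const pAO_HLightEBlue = [0.1, 0.5, 0.1];                      // from Table B2: French, German, Italian
    const rowTotal = pAO_HLightEBlue.reduce((a, b) => a + b, 0);  // 0.7
    const ml = pAO_HLightEBlue.map(p => p / rowTotal);
    console.log(ml);  // approximately [0.14, 0.71, 0.14], as in Table B3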
Prediction 2. Bayesian Probability πP(o|a)
Table B4a. πP(a|o)
Attributes           French          German          Italian         Total
H_Dark+E_Blue        0.3x0.5=0.15    0.1x0.33=0.03   0.2x0.17=0.03   0.15+0.03+0.03=0.22
H_Dark+E_Brown       0.1x0.5=0.05    0.1x0.33=0.03   0.3x0.17=0.05   0.05+0.03+0.05=0.13
H_Dark+E_Others      0.1x0.5=0.05    0.1x0.33=0.03   0.1x0.17=0.02   0.05+0.03+0.02=0.10
H_Light+E_Blue       0.1x0.5=0.05    0.5x0.33=0.17   0.1x0.17=0.02   0.05+0.17+0.02=0.23
H_Light+E_Brown      0.2x0.5=0.10    0.1x0.33=0.03   0.2x0.17=0.03   0.10+0.03+0.03=0.17
H_Light+E_Others     0.2x0.5=0.10    0.1x0.33=0.03   0.1x0.17=0.02   0.10+0.03+0.02=0.15
Table B4b. Bayesian Probability πP(o|a)
Attributes           French           German           Italian
H_Dark+E_Blue        0.15/0.22=0.69   0.03/0.22=0.15   0.03/0.22=0.15
H_Dark+E_Brown       0.05/0.13=0.38   0.03/0.13=0.25   0.05/0.13=0.38
H_Dark+E_Others      0.05/0.10=0.50   0.03/0.10=0.33   0.02/0.10=0.17
H_Light+E_Blue       0.05/0.23=0.21   0.17/0.23=0.71   0.02/0.23=0.07
H_Light+E_Brown      0.10/0.17=0.60   0.03/0.17=0.20   0.03/0.17=0.20
H_Light+E_Others     0.10/0.15=0.67   0.03/0.15=0.22   0.02/0.15=0.11
If the a posteriori probability is calculated by using the predictor to modify the a priori probability π, the result is the probability of outcome given attribute and a priori probability, πP(o|a). This is usually referred to as the Bayesian Probability, as it follows the description first made by Bayes.
πP(o|a) is calculated from P(a|o) in Table B2 and the a priori probability π from Table B1. The calculations are in 2 steps. Firstly, the coefficient is adjusted by the a priori probability of each outcome, then the adjusted coefficients are normalized by the total over all outcomes. For each outcome j, with attribute i (ai) and a priori probability πj, the calculations are as follows
πjP(ai|oj) = P(ai|oj) x πj
πjP(oj|ai) = πjP(ai|oj) / Σ(πjP(ai|oj)) for all outcomes
The calculations and results are shown in tables B4a (step 1) and B4b (step 2). The Bayesian Probabilities suggest that, in a population of French:German:Italian of 0.5:0.33:0.17, those with light hair and blue eyes are most likely to be German at 0.71, and all other combinations are likely to be French, although those with dark hair and brown eyes are equally likely to be Italian at 0.38.
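A short JavaScript sketch of the two steps, using the H_Dark+E_Blue row of Table B2 (illustrative, not the program's own code):

    // Bayesian Probability πP(o|a) for one attribute: weight P(a|o) by the
    // a priori probabilities, then normalize (Tables B4a and B4b)
    const pAO_HDarkEBlue = [0.3, 0.1, 0.2];   // from Table B2: French, German, Italian
    const prior = [0.5, 0.33, 0.17];          // a priori π
    const weighted = pAO_HDarkEBlue.map((p, j) => p * prior[j]);  // step 1: πP(a|o)
    const total = weighted.reduce((a, b) => a + b, 0);
    const bayes = weighted.map(w => w / total);                   // step 2: πP(o|a)
    console.log(bayes);  // approximately [0.69, 0.15, 0.16], as in Table B4b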
Naive Bayes
This panel discusses the Naive Bayes model and the calculations used in the program panel.
The model uses 2 or more predictors, each with 2 or more attributes, to predict the probabilities of 2 or more outcomes. The term Naive refers to the naive assumption that the predictors are independent of each other.
In the default example in the Program panel, the data presented in the Introduction panel are restructured, so that the two predictors, hair color (2 attributes of Dark and Light) and eye color (3 attributes of Blue, Brown and Others), are counted separately. The restructured table is used in the program, and is shown to the right.
Please note: The term Col is used in all tables to represent predictor. Col 1 is predictor 1 (in this example hair color), Col 2 is predictor 2 (in this example eye color), and so on.
Building the model: Attribute (a): each predictor counted separately; Outcome (o): ethnicity
Table N2. Model Coefficients P(a|o)
Col   Attribute    French      German      Italian
1     H_Dark       5/10=0.5    3/10=0.3    6/10=0.6
1     H_Light      5/10=0.5    7/10=0.7    4/10=0.4
2     E_Blue       4/10=0.4    6/10=0.6    3/10=0.3
2     E_Brown      3/10=0.3    2/10=0.2    5/10=0.5
2     E_Others     3/10=0.3    2/10=0.2    2/10=0.2
The coefficients of the model, to be used to convert a priori to a posteriori probability, are the probabilities of attribute given outcome P(a|o). For the ith attribute from each predictor and the jth outcome, P(a|o) is calculated by dividing the number of cases with that attribute/outcome pair (Ni,j) by the sample size of that outcome (Nj).
P(ai|oj) = Ni,j/Nj
In this example, the attributes are the hair color (Col 1) and eye color (Col 2), and the P(a|o) for each attribute from each predictor (Col) are calculated separately. The results are shown in Table N2 to the right.
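A minimal JavaScript sketch of this step, using the per-predictor counts behind Table N2 (illustrative names, not the program's own code):

    // Naive Bayes coefficients P(a|o): each predictor is counted separately (Table N2)
    const n = [10, 10, 10];  // sample size of each outcome (French, German, Italian)
    const hairCounts = { H_Dark: [5, 3, 6], H_Light: [5, 7, 4] };                       // Col 1
    const eyeCounts  = { E_Blue: [4, 6, 3], E_Brown: [3, 2, 5], E_Others: [3, 2, 2] };  // Col 2
    const toPAO = counts => Object.fromEntries(
      Object.entries(counts).map(([attr, row]) => [attr, row.map((c, j) => c / n[j])])
    );
    const hairPAO = toPAO(hairCounts);   // e.g. hairPAO.H_Dark = [0.5, 0.3, 0.6]
    const eyePAO  = toPAO(eyeCounts);    // e.g. eyePAO.E_Blue  = [0.4, 0.6, 0.3]
    console.log(hairPAO.H_Dark, eyePAO.E_Blue);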
Creating the pattern coefficient: pattern(p) : Outcome (o)
When presented with an array of attributes (a pattern), Naive Bayes creates the coefficient for that particular pattern/outcome combination, P(p|o). This is created by multiplying the P(a|o) of each attribute/outcome combination in the list of attributes. An example, using the first pattern:
Col 1, hair color:Dark and outcome 1 French: P(a|o) = 0.5
Col 2, eye color:Blue and outcome 1 French: P(a|o) = 0.4
For attributes combined [H_Dark,E_Blue]/French P(p|o) = 0.5x0.4 = 0.2
The coefficients are shown as Table N2x here for demonstration purposes. In practice, the P(p|o) coefficients are calculated dynamically, depending on the attributes in the pattern array. As this is a dynamic intermediary step, the P(p|o) values are not presented as results, but are used to produce the a posteriori probabilities in the same manner as P(a|o) in the basic Bayes model.
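The multiplication can be sketched in a few lines of JavaScript for the pattern [H_Dark, E_Blue] (illustrative names, not the program's own code):

    // Pattern coefficient P(p|o): multiply the P(a|o) of each attribute in the
    // pattern, outcome by outcome
    const pHair = [0.5, 0.3, 0.6];   // P(H_Dark|o) from Table N2
    const pEye  = [0.4, 0.6, 0.3];   // P(E_Blue|o) from Table N2
    const patternCoeff = pHair.map((p, j) => p * pEye[j]);
    console.log(patternCoeff);  // [0.5x0.4, 0.3x0.6, 0.6x0.3] = [0.2, 0.18, 0.18]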
Prediction 1. Maximum Likelihood P(o|p)
Table N3. Maximum Likelihood P(o|p)
Col 1     Col 2       French           German           Italian
H_Dark    E_Blue      0.2/0.56=0.36    0.18/0.56=0.32   0.18/0.56=0.32
H_Dark    E_Brown     0.15/0.51=0.29   0.06/0.51=0.12   0.3/0.51=0.59
H_Dark    E_Others    0.15/0.33=0.45   0.06/0.33=0.18   0.12/0.33=0.36
H_Light   E_Blue      0.2/0.74=0.27    0.42/0.74=0.57   0.12/0.74=0.16
H_Light   E_Brown     0.15/0.49=0.31   0.14/0.49=0.29   0.2/0.49=0.41
H_Light   E_Others    0.15/0.37=0.41   0.14/0.37=0.38   0.08/0.37=0.22
If a posteriori probability is calculated without the inclusion of a priori probability π, the result is probability of outcome given a pattern of attributes P(o|p), also called Maximum Likelihood. This describes the model, and demonstrates the relationship between attributes and outcomes.
P(o|p) is calculated dynamically from the values demonstrated in Table N2x. For each outcome j, the probability that it is predicted by pattern i (pi) is calculated as
P(oj|pi) = P(pi|oj) / Σ(P(pi|oj)) for all outcomes
The calculations and results are shown in the table N3 to the right. They suggest that, without including a priori probability, those with brown eyes are most likely to be Italians (0.59 for dark hair and 0.41 for light hair), those with light hair and blue eyes German (0.57), and the rest French (0.36,0.45, 0.41).
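Continuing the sketch for the pattern [H_Dark, E_Blue], the normalization can be written as follows (illustrative, not the program's own code):

    // Maximum Likelihood P(o|p): normalize the pattern coefficients so they sum to 1 (Table N3)
    const pPattern = [0.2, 0.18, 0.18];               // P(p|o) for [H_Dark, E_Blue]
    const sum = pPattern.reduce((a, b) => a + b, 0);  // 0.56
    const mlPattern = pPattern.map(p => p / sum);
    console.log(mlPattern);  // approximately [0.36, 0.32, 0.32], as in Table N3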
Prediction 2. Bayesian Probability πP(o|p)
Table N4a. πP(p|o)
Col 1     Col 2       French           German            Italian           Total
H_Dark    E_Blue      0.2x0.5=0.10     0.18x0.33=0.06    0.18x0.17=0.03    0.10+0.06+0.03=0.19
H_Dark    E_Brown     0.15x0.5=0.08    0.06x0.33=0.02    0.3x0.17=0.05     0.08+0.02+0.05=0.15
H_Dark    E_Others    0.15x0.5=0.08    0.06x0.33=0.02    0.12x0.17=0.02    0.08+0.02+0.02=0.12
H_Light   E_Blue      0.2x0.5=0.10     0.42x0.33=0.14    0.12x0.17=0.02    0.10+0.14+0.02=0.26
H_Light   E_Brown     0.15x0.5=0.08    0.14x0.33=0.05    0.2x0.17=0.03     0.08+0.05+0.03=0.16
H_Light   E_Others    0.15x0.5=0.08    0.14x0.33=0.05    0.08x0.17=0.01    0.08+0.05+0.01=0.13
Table N4b. Bayesian Probability πP(o|p)
Col 1     Col 2       French           German           Italian
H_Dark    E_Blue      0.10/0.19=0.53   0.06/0.19=0.32   0.03/0.19=0.16
H_Dark    E_Brown     0.08/0.15=0.52   0.02/0.15=0.14   0.05/0.15=0.34
H_Dark    E_Others    0.08/0.12=0.65   0.02/0.12=0.17   0.02/0.12=0.17
H_Light   E_Blue      0.10/0.26=0.38   0.14/0.26=0.54   0.02/0.26=0.08
H_Light   E_Brown     0.08/0.16=0.48   0.05/0.16=0.30   0.03/0.16=0.22
H_Light   E_Others    0.08/0.13=0.56   0.05/0.13=0.35   0.01/0.13=0.10
If the a posteriori probability is calculated by using the pattern to modify the a priori probability π, as in the majority of predictions, the result is the probability of outcome given pattern and a priori probability, πP(o|p). This is usually referred to as the Naive Bayesian Probability.
πP(o|p) is calculated from the dynamically calculated P(p|o), as shown in Table N2x, and the a priori probability π from Table N1. The calculations are in 2 steps. Firstly, the coefficient P(p|o) is adjusted by the a priori probability of each outcome, then the adjusted coefficients are normalized by the total over all outcomes. For each outcome j, with pattern i (pi) and a priori probability πj, the calculations are as follows
πjP(pi|oj) = P(pi|oj) x πj
πjP(oj|pi) = πjP(pi|oj) / Σ(πjP(pi|oj)) for all outcomes
The calculations and results are shown in tables N4a (step 1) and N4b (step 2). The Bayesian Probabilities suggest that, in a population of French:German:Italian of 0.5:0.33:0.17, those with light hair and blue eyes are most likely to be German at 0.54, and all other combinations are likely to be French.
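The two steps can be sketched in JavaScript for the pattern [H_Dark, E_Blue] (illustrative, not the program's own code):

    // Naive Bayesian Probability πP(o|p): weight the pattern coefficients by the
    // a priori probabilities, then normalize (Tables N4a and N4b)
    const pPattern = [0.2, 0.18, 0.18];      // P(p|o) for [H_Dark, E_Blue]
    const prior = [0.5, 0.33, 0.17];         // a priori π: French, German, Italian
    const weighted = pPattern.map((p, j) => p * prior[j]);   // step 1: πP(p|o)
    const total = weighted.reduce((a, b) => a + b, 0);       // approximately 0.19
    const bayesPattern = weighted.map(w => w / total);       // step 2: πP(o|p)
    console.log(bayesPattern);  // approximately [0.53, 0.31, 0.16], as in Table N4b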
Discussions
Explanations for the terms and abbreviations used, and the mathematical algorithms, have already been discussed in the previous panels. This panel discusses, in conceptual terms, the differences between the Basic and Naive Bayesian models, and how they may be used.
Basic Bayes Probability Model
The Basic Bayes model is based on the original Bayes Theorem. The term "Basic" is used in this page to distinguish it from the Naive Bayes model. In this discussion, therefore, Basic Bayes and Bayes mean the same thing.
Conditional probability is about estimating, and changing our estimate of, a probability when presented with conditions we know are associated with the outcome of interest. For example, if we know that a particular poker player blinks a lot when he is bluffing, then we can conclude he is likely to be bluffing when he blinks a lot. The model is simple, elegant, and intuitively easy to understand and accept. What follows are discussions of its usage in practice.
Advantages
The main advantage is that the model is self-evidently valid, as it is the mathematical form of an accepted practice of basing decisions on knowledge and experience. Given that each attribute and each outcome is unique, their relationship is also unique, so there is no ambiguity in what the results represent. The user can therefore be confident in translating the numerical results into decisions and actions, provided the model is based on valid data and the a priori probabilities are well chosen.
The model is also easy to use. Tables of coefficients, and tables translating attributes to probabilities of outcomes, can be produced by researchers, and distributed to end decision makers, either physically as reference tables, or electronically as computer applications.
Disadvantages
The major disadvantage of the basic Bayesian model is its inability to cope with too many predictors.
In our example, with 2 predictors of 2 hair and 3 eye colors, the number of combined attributes is 2x3=6. If we add skin color (say pale, tan, and dark), the number of attributes becomes 6x3=18. If we then add body build (skinny, fat, muscular), 18x3=54. Then temperament (stoic, emotional), 54x2=108, and so on. Each additional predictor multiplies the number of combinations, which can exceed 1000 once more than 9 binary predictors are included.
The problem is not so much computational complexity, as the high speed and large memory of modern computers can always cope.
The main problem is to find sufficient cases to build a valid reference model in the first place. If probabilities are to be accurate to 0.1 (10%), then at least 10 cases are needed for each attribute/outcome pair. Some of the combinations may be uncommon (say fat, calm, dark skin, light hair, blue eyes, and French), and to fill all combinations with sufficient cases may require sample sizes that are impractical to collect.
The model cannot cope with missing data, as each attribute is a combination of all the predictors, and all observations must be present for the attribute to be valid. For example, we cannot make a prediction when the person is too far away for us to know the color of his/her eyes, or if the person has no hair. Missing data is common in situations where decisions need to be made with incomplete information. For example, when a patient arrives in the emergency department of a hospital, the triage nurse has to make a decision based on the major symptom. Once the patient is admitted, decisions are based on additional history and physical examination. After a few hours, decisions are modified with test results. A day later, they are modified again based on the progress observed. This means that missing data can be normal, and a single model cannot cope with this. In these situations, an individual model will need to be produced for every conceivable combination of available and missing data that may arise, and this magnifies the scalability problem.
The model is inflexible, in that each attribute in a compound predictor is actually a combination of information from different original predictors. This means that no original predictor can be added, removed, or have its relationship with outcomes altered without changing the attributes in the model. If any change is required, a new model will need to be constructed. For example, after using our example model for a while, we observe that the 3 ethnic groups have different temperaments, and wish to include this in our model. We cannot simply add temperament to the model, as we would have to create 12 new patterns of hair_eye_temperament, essentially building a new model with new data. This inflexibility is particularly important in rapidly changing environments, such as medical care, where disease patterns and technologies evolve rapidly.
Usage
The basic Bayesian model is preferred if the following conditions can be met:
When the relationship between predictors and outcomes are stable
When the total number of attributes is not so great that the sample size needed to build the model cannot be obtained
Where there is no attribute/outcome combination that is so uncommon that the total sample size required cannot be reached
When missing data is not expected
Where a sufficiently large sample size is available to build the model.
Naive Bayes Probability
Naive Bayes probability is used only when there are multiple predictors. It differs from the basic model by accepting, naively, that the predictors are independent of each other.
Conceptually, the model can be considered as a sequence of basic Bayesian calculations. We start with the a priori probabilities π and modify them with the attribute of the first predictor to produce the first a posteriori probabilities; these become the a priori probabilities for the attribute of the second predictor, whose results in turn become the a priori probabilities for the third predictor, and so on.
The computations are simplified by multiplying all the coefficients P(a|o) representing the attributes to create the pattern coefficient P(p|o), and using this in the same way as in the basic Bayesian model.
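The equivalence between the sequential view and the single multiplication can be checked with a short JavaScript sketch, using the Table N2 coefficients for dark hair and blue eyes (illustrative names, not the program's own code):

    // Sequential updating: start from the a priori probabilities, update with hair
    // color, renormalize, then update with eye color and renormalize again
    const normalize = v => { const t = v.reduce((a, b) => a + b, 0); return v.map(x => x / t); };
    const pHairDark = [0.5, 0.3, 0.6];   // P(H_Dark|o) from Table N2
    const pEyeBlue  = [0.4, 0.6, 0.3];   // P(E_Blue|o) from Table N2
    let prob = [0.5, 0.33, 0.17];                             // a priori π
    prob = normalize(prob.map((p, j) => p * pHairDark[j]));   // after seeing dark hair
    prob = normalize(prob.map((p, j) => p * pEyeBlue[j]));    // after also seeing blue eyes
    console.log(prob);  // approximately [0.53, 0.31, 0.16], the same as πP(o|p) in Table N4b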
The advantages and disadvantages of using the Naive Bayes model are the mirror image of the basic Bayes model.
Advantages
The Naive Bayes model can include a large number of predictors.
The number of P(a|o) required increases linearly with the number of predictors for the Naive Bayes model, compared with exponentially for the basic Bayes model. From our example, 2 hair and 3 eye colors require 2+3=5 P(a|o) cells for each outcome for Naive Bayes, instead of 2x3=6 P(p|o) for basic Bayes. Using n binary predictors, Naive Bayes requires 2xn individual attributes, but basic Bayes requires 2^n combinations of attributes
By assuming that all attributes are independent of each other, a workable model can be built provided a sufficient number of cases is available for each attribute, compared with basic Bayes where sufficient numbers are required for each combination of attributes
In other words, the model can be built using a smaller and more practical sample size.
The Naive Bayes model copes better with missing data. As all attributes are assumed independent, the absence of an attribute merely means that the influence of that attribute is not included. The results will be less precise, but the algorithm will calculate the a posteriori probabilities with the data it has, instead of not computing at all, as happens when the Basic Bayesian model confronts missing data.
The Naive Bayes model is more flexible. It is easier to add, delete, or change a predictor. Given the assumption that all predictors are independent, and the P(p|o), P(o|p), and πP(o|p) are calculated dynamically, the addition, removal, or change in any predictor should not affect the performance of the other predictors. Most of the calculations would remain the same, but the results would be modified by the changes. This flexibility is particularly useful in situations where the relationship between predictors and outcome may change, or technological development requires additional predictors to be added from time to time, such as in medical care.
Disadvantages
The main disadvantage of, or objection to, the Naive Bayes model is the naive assumption that the predictors are independent, because this assumption is rarely correct. Two types of dependency may exist in any set of predictors.
Correlation may exist between predictors, as many predictors have common precursors in genetics, geography, history, culture, and so on. When a correlation exists, the common precursor is used more than once in the calculation, and this inserts an unaccounted bias into the results. In our example, hair and eye color may be correlated, as they are both dependent on pigmentation generally. If we combine these with, say, temperament in a model, we will use the influence of pigmentation twice (two colors) against temperament once, resulting in a biased decision.
Interaction is when predictors have synergistic or inhibitory effects on each other in their relationship with outcomes.
An example of synergism is in the treatment of cancer. If surgery has a cure rate of 10% and chemotherapy 10%, then we would expect that providing both will result in a cure rate of 19% (1-0.9x0.9=0.19) if there is no interaction. In many cases however, surgery makes the remaining cancer cells more sensitive to chemotherapy, and the cure rate of combined therapy is more than 19%.
An example of inhibition is in the use of antibiotics to treat a particular infection. An antibiotic, say penicillin, may cure this infection in 10% of cases, and another, say tetracycline, also 10%. In the absence of interaction we would expect a 19% cure rate if both are used. However, penicillin works by destroying bacterial cell walls and is most effective when bacteria are actively growing, while tetracycline works by slowing the growth of bacteria. Giving both may therefore result in one inhibiting the effect of the other, and a cure rate of less than 19%.
The error caused by the naive assumption in the Bayes model is difficult to quantify, so the precision and accuracy can only be estimated when the model is deployed after development.
Choice between models
The Basic Bayes model should be the first option, given its theoretical validity and ease of use.
Given that the assumption of predictor independence is naive and contains unquantifiable bias, the Naive Bayes model is a compromise solution, chosen when the Basic Bayes model is not practicable. When chosen, the following should be observed:
The Naive Bayes model should be considered only as an approach to develop a prediction model, its validity and usefulness should not be presumed.
The choice of predictors should be carefully considered to ensure that they are as independent from each other as possible. When correlation or interaction is suspected, predictors can be combined into compound predictors before they are added to the model
The performance of the completed model requires constant review, and the model should be adjusted when inaccuracies are detected or when circumstances change.
References
The following references are introductory in nature, mainly to help the inexperienced.
https://arbital.com/p/bayes_rule/?l=1zq An online introduction to Bayes theorem. It provides good detailed explanations, and links to additional pages that cater to different levels of knowledge and needs
Mueller JP and Massaron L (2016) Machine Learning for Dummies. John Wiley and Son, Inc New Jersey, ISBN 978-1-19-24551-3. p.158-163.
This gives only a brief introduction to both Bayes algorithms. However, it provides perspectives on how Bayes probability is used in data analysis, in comparison with many other methods. There is not much guidance on calculations, however, as the book relies heavily on the use of R and Python packages.
Old references
The following 2 references introduced me to Naive Bayes many years ago, before the term naive became commonly used. I include them partly out of sentiment, but mostly because they had a major influence on how the models are presented on this page.
Warner HR, Toronto AF, Veasey LG, and Stephenson R (1961) A Mathematical Approach to Medical Diagnosis. Application to Congenital Heart Disease. JAMA 117:3 p.177-183.
This paper used the formula that is the same as what is now Naive Bayes probability, but neither Bayes nor Naive was mentioned. The paper quoted Lusted as the source of the mathematical approach.
The paper discussed the difficulty of combining multiple predictors into a single compound predictor, as the list of observations needed to differentiate between congenital heart diseases was too great to do so in practice (my interpretation).
The paper presented a complicated set of mathematical arrangements to deal with a mix of binomial and multinomial predictors. It was easy enough to understand, but difficult to present concisely and dynamically in the web page format, and I had tried a number of different ways to present the calculations and results before ending with the format on this page
The paper went on to warn the reader about the pitfalls of correlation between multiple predictors and the consequent tautology, and recommended their careful selection and management.
Over the next few years, the authors published a number of additional papers on the subject, reviewing the results and validating the use of the model they built.
This paper therefore provided the initial prompt that led me to develop the arguments listed in the advantages/disadvantages and usage sections in the discussion panel. It also prompted the evolution of how the concepts and results are presented, up to the current form on this page.
Overall JE and Klett CJ (1972) Applied Multivariate Analysis. McGraw Hill Series in Psychology.
McGraw Hill Book Company New York. Library of Congress No. 73-14716407-047935-6 p.400-412.
This is an old book that may be out of print. In its days it was a recommended text for multivariate statistics at the Masters level.
Included in the book was a brief chapter on Bayes Probability, and that chapter presented the formulae that are the same as those currently known as Naive Bayes. The term "pattern probability" was used, and my guess is that the term Naive was at the time not yet in common use.
I began to understand Bayesian probability from reading this book, as by using different abbreviations, the book clarifies the difference between the basic and the naive models.
From this book, I acquired the terms attribute and pattern, and the abbreviation πP(o|p) to represent Bayes Probability. The book used j for outcome, x for attributes, and arrays of attributes for patterns. I converted these to a for attribute, p for pattern, and o for outcome, making it easier for the inexperienced to follow the explanations and algorithms.
Calculations
Hints and Suggestions
Data Entry
Multiple points of data entry
The Javascript program is a single cascading program, which allows data entry at different stages of calculation
The top textarea (attribute and outcome designation) accepts the raw data. It then calculates the number of cases for each attribute and outcome, and deposits the results in the table of counts textArea
The table of counts is converted to the table of probabilities for each attribute given an outcome P(a|o), and deposits the results in the Table of P(a|o) Coefficient table.
The table of probability P(a|o) is used to calculate Bayesian Probability
Where the user has already produced the count table or the probability table, data entry in the earlier textareas can be bypassed.
Format of data
Attribute and outcome, the first textarea, consists of 2 or more columns of text data
Each row contains a subject under study
The rightmost column contains the names of the outcome. To the left, each column contains a predictor variable
The cells contain the name of the attribute of that predictor variable, or the name of the outcome (for the last column to the right)
All the names are single words using keyboard characters
When the data has only 2 columns, only 1 predictor variable is used, and the Basic Bayes model is calculated. Otherwise the Naive Bayes model is calculated. (A short example of the expected layout is shown at the end of this section.)
Count Table, the second textarea, contains the count of number of cases for each attribute of each predictor variable (col in first text area)
The first row names the columns
Each subsequent row contains an attribute
Column 1 (col) is a number starting with 1, and represents the predictor number
Column 2 (Attrib) is the name of the attribute
All other columns to the right are the counts of the number of cases with that attribute and that outcome
Table of P(a|o) Coefficients, the third and last textarea, contains the probability of having the attribute given an outcome, P(a|o).
The first row names the columns
Each subsequent row contains an attribute
Column 1 (col) is a number starting with 1, and represents the predictor number
Column 2 (Attrib) is the name of the attribute
All other columns to the right are the Probability of that attribute for that outcome. These are obtained by dividing each count by the total number of cases in each outcome.
P(a|o) are the coefficients used to calculate Bayes Probability
Array of Apriori Probabilities must have the same number of elements as the number of groups. The numbers are relative frequencies or probabilities prior to the Bayesian analysis (apriori). These can be actual counts, relative weights, or actual probabilities. For example, 3 2 1, 0.5 0.33 0.17, or 60 40 20 are the same apriori frequencies.
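As an illustration (the layout is inferred from the descriptions above; the default data shipped with the program may differ), the first textarea could begin with rows like these, one subject per row with the outcome in the last column:

    H_Dark   E_Blue    French
    H_Light  E_Others  German
    H_Dark   E_Brown   Italian

and the array of apriori probabilities is a single row of relative frequencies, for example:

    3 2 1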
Results
The results are calculated using the data in the third textarea, the table of probabilities of an attribute given an outcome P(a|o). These are the coefficients used to calculate the Bayesian Probabilities
Probability of Each Attribute given a known outcome P(a|o)
This is a reproduction of the coefficients table, to be used for calculation
The first column is the order of the predictor variable
The second column the attribute name
The remaining columns the probability of that attribute given each of the output groups
The last row of the table contains the apriori probabilities, calculated by dividing each element of the input array by the total.
Probabilities of Each Outcome Given an Attribute (uncorrected for prior probabilities), P(o|a)
This is the probability of each outcome for each attribute, assuming the same apriori probability for all groups
The first column is the order of the predictor variable
The second column the attribute name
The remaining columns the probability of that group given the attribute of that row
Probabilities of Each Outcome Given a pattern (combination of Attributes) (uncorrected for prior probabilities), P(o|p)
This is only applicable in the Naive Bayes model, when more than 1 predictor is used. It assumes the same apriori probability for all groups
The first column is the order of the predictor variable
A number of columns for the attributes of the different predictors forming the pattern follow to the right
Finally, the remaining columns the probability of the group in that column, given the combination of attributes in that row
The probabilities are obtained by multiplying the elements of probability for each attribute, then normalizing by dividing each by the sum of probabilities over all groups
Bayesian Probabilities, πP(o|a) for a single predictor and πP(o|p) for multiple predictors
This is the same table as the previous one, except that the probabilities are corrected for the different apriori probability of each group.
This is calculated by multiplying the probability of each group by its corresponding apriori probability, then normalizing by dividing by the sum of probabilities over all groups
Comparison of results of calculations with the original data
This is presented only when the original data in the first textbox are used in calculations. It produces a table where the original outcome (in rows) and estimated outcome (in columns) are counted. This allows a display of the relationship between the original and estimated outcomes
Javascript and html script for future calculations of Bayesian Probability
This is presented only when there is more than 1 predictor variable, and the Naive Bayes model is used. The script is in maroon color, and can be copied and pasted into a text editor, and saved as an html file. The file can then be displayed in any web browser.
The attributes are presented as radiobuttons, each row represents a predictor variable. The probabilities of the outcomes are then estimated.
Javascript Program
Attributes ± Outcome Designation
Array of Apriori Probabilities
Input Using Attributes ± Outcome Designation The data is a table with 2 or more columns
Each row contains data from a case from the reference data
The last, right most, column is for the outcome
All columns to the left are for attributes, each column a predictor
Each cell the name of attribute or outcome
All names must be single words
Array of Apriori Probabilities Probability for each outcome before attributes are known
Single row, number of columns = number of outcomes
Columns separated by spaces or tabs
Values representing relative probabilities or frequencies
Table of Counts
Array of Apriori Probabilities
Input Using a Table of Counts This is a table of counts
Col 1 contains the predictors in numerical order
Col 2 contains attributes of predictor
All other columns to the right for the outcomes
The first row contains the outcome names
Each following row are for each attribute
Each cell is the count of the attribute (row) for that outcome (Col)
Array of Apriori Probabilities Probability for each outcome before attributes are known
Single row, number of columns = number of outcomes
Columns separated by spaces or tabs
Values representing relative probabilities or frequencies
Table of P(a|o) Coefficients
Array of Apriori Probabilities
Input Using a Table of Probability of attributes given the outcome P(a|o) This is a table of probabilities P(a|o)
Col 1 contains the predictors in numerical order
Col 2 contains attributes of predictor
All other columns to the right for the outcomes
The first row contains the outcome names
Each following row are for each attribute
Each cell is the probability of that attribute for the outcome P(a|o)
Array of Apriori Probabilities Probability for each outcome before attributes are known
Single row, number of columns = number of outcomes
Columns separated by spaces or tabs
Values representing relative probabilities or frequencies