Copyright © 2020. All Rights Reserved.
StatsToDo : Classification by Bayes Probability Explained
This page provides explanation and support for the two programs in the Classification by Basic Bayes Probability Program Page and the Classification by Naive Bayes Probability Program Page. As the programs and this explanation page use specific terms and abbreviations, and these are best demonstrated with examples, this introduction panel describes the example used and the terminology.

The format of data entry and explanation of results produced are in the Help and Hints panel of the program pages.

The remainder of this panel provides a description of the example data used in this and the two program pages, and brief explanations of the overall concepts of Bayesian probability and the terms used in these pages.

Before we start: modern computers perform calculations with precision to 14 decimal places. The two programs associated with this page display results with precision to 4 decimal places. On this page, to conserve space and make reading easier, probability values are displayed to 2 decimal places. Minor differences from the results of the programs may therefore occasionally arise, and some of the probabilities do not total exactly 1. Readers should be aware of this and not be confused by it.

### The Example

The same example is used in the two program pages and this explanation page.
The predictors are Hair and Eye; the outcomes are French, German, and Italian.

| Hair | Eye | French | German | Italian |
|------------|--------|--------|--------|---------|
| Dark | Blue | 3 | 1 | 2 |
| Dark | Brown | 1 | 1 | 3 |
| Dark | Others | 1 | 1 | 1 |
| Light | Blue | 2 | 1 | 1 |
| Light | Brown | 2 | 1 | 2 |
| Light | Others | 1 | 5 | 1 |
| *a priori* | | 0.5 | 0.33 | 0.17 |

We wish to develop a Bayesian model to identify the ethnicity of people, based on hair color and eye color. To build our model, we recruited 10 people each of known French, German, and Italian ethnicity, and observed their hair and eye colors. We then use the Bayesian model to predict ethnicity from hair and eye color, in a community with an expected ratio of French:German:Italian of 3:2:1, normalized to a priori probabilities of 0.5:0.33:0.17.
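The normalization just described is a single division; the following sketch (Python, purely illustrative) shows the census ratio 3:2:1 being converted to a priori probabilities:

```python
# Normalize the expected community ratio French:German:Italian = 3:2:1
# into a priori probabilities by dividing each value by the total.
ratio = [3, 2, 1]
prior = [r / sum(ratio) for r in ratio]

# Rounded to 2 decimal places this gives 0.5, 0.33, 0.17, as in the table.
print([round(p, 2) for p in prior])  # [0.5, 0.33, 0.17]
```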

The counts for each combination and the coefficients are presented in the table above, and the terms and abbreviations used are explained as follows.

### Bayesian Probability

Bayesian Probability Theory is a mathematical model of making decisions based on experience. The process is to predict, using a set of predictors to determine the probabilities of alternative outcomes. In the Bayesian context, prediction is not to forecast the future, nor to establish what may be true, but to logically apply the observed values of predictors to calculate how confident we can be, in terms of probabilities (a number between 0 and 1, or a percentage), for each of the alternative outcomes contained in our model.

The process of Bayesian decisions can be separated into the following stages:

1. We begin by nominating the a priori probabilities (π), our confidence in believing each of the alternative outcomes to be correct, before taking predictors into consideration. This can be established by the following
• We can decide that we do not know, and assign the same value as a priori probability to all outcomes
• We can base the a priori probabilities on knowledge, from experience, research, previously collected data, hearsay, cultural belief, or simply a guess
• We can propose a priori probabilities as a hypothesis to explore, such as "if the a priori probabilities are ...., then ....."
• From our example: in the community where we will use our Bayesian model (north-west of Switzerland), the census informs us that the ratio of French:German:Italian is 3:2:1. These values are normalized to probabilities by dividing each by their total, giving 0.5:0.33:0.17
2. We then use the coefficients of our model to apply the attributes of the predictors, changing the a priori probabilities into a posteriori probabilities. The coefficients are developed from a set of reference data, in our example 10 cases of each ethnicity. Each coefficient is the probability of seeing an attribute given the outcome, P(a|o), obtained by dividing the number of cases with each attribute/outcome pair by the sample size of that outcome in the reference data. Both the Basic and Naive Bayes models use P(a|o) as coefficients, but the coefficients are calculated, presented, and used differently. Details are presented in the two subsequent panels.
3. The coefficients P(a|o) interact with the alternatives in the predictor(s) to estimate the a posteriori probability, commonly referred to as the Bayesian probability
• When there is only one predictor, as in the Basic Bayes model, attribute (a) represents each alternative of the predictor, and the Bayesian probability is the probability given attribute, πP(o|a)
• When there is more than one predictor, as in the Naive Bayes model, pattern (p) represents an array of attributes, one from each predictor, and the Bayesian probability is the probability given pattern, πP(o|p)
4. Two types of a posteriori probability can therefore be calculated using the coefficients we developed:
• Probability of outcome using only the predictor(s), without taking the a priori probability into consideration. In the Basic Bayes model with one predictor, this is the probability given attribute, P(o|a); in the Naive Bayes model with multiple predictors, it is the probability given pattern, P(o|p). This probability is also termed Maximum Likelihood, and the table of Maximum Likelihood describes the behaviour of the model.
• Probability of outcome using the predictor(s) and the a priori probabilities π. In the Basic Bayes model with one predictor, this is the probability given attribute and a priori probability, πP(o|a); in the Naive Bayes model with multiple predictors, it is the probability given pattern and a priori probability, πP(o|p). This probability is also termed the Bayes or Bayesian Probability, and is the major and most commonly used a posteriori probability.
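The stages above can be sketched end-to-end in code. The following Python snippet is an illustrative sketch only (the function names are our own, not the programs' actual code): it derives the coefficients P(a|o) from the reference counts, then computes both types of a posteriori probability, treating each hair/eye combination as a single attribute, as in the Basic Bayes model.

```python
# A sketch of the Basic Bayes calculation described in the stages above,
# using the example reference data. All names are illustrative; each
# hair/eye combination is treated as one attribute of a single predictor.

# Reference counts for each attribute, as (French, German, Italian)
counts = {
    ("Dark", "Blue"):    (3, 1, 2),
    ("Dark", "Brown"):   (1, 1, 3),
    ("Dark", "Others"):  (1, 1, 1),
    ("Light", "Blue"):   (2, 1, 1),
    ("Light", "Brown"):  (2, 1, 2),
    ("Light", "Others"): (1, 5, 1),
}
n = (10, 10, 10)            # sample size of each outcome in the reference data
prior = (0.5, 0.33, 0.17)   # a priori probabilities pi, from the ratio 3:2:1

def coefficients(attr):
    """P(a|o): count of the attribute/outcome pair over the outcome sample size."""
    return tuple(c / s for c, s in zip(counts[attr], n))

def maximum_likelihood(attr):
    """P(o|a): outcome probabilities from the predictor alone, priors ignored."""
    p = coefficients(attr)
    return tuple(x / sum(p) for x in p)

def bayes(attr):
    """piP(o|a): the a posteriori (Bayesian) probability, combining pi and P(a|o)."""
    p = [pi * c for pi, c in zip(prior, coefficients(attr))]
    return tuple(x / sum(p) for x in p)

attr = ("Dark", "Blue")
print(coefficients(attr))        # P(a|o) = (0.3, 0.1, 0.2)
print(maximum_likelihood(attr))  # P(o|a), approximately (0.5, 0.17, 0.33)
print(bayes(attr))               # piP(o|a): French rises to about 0.69
```

The Naive Bayes model would instead multiply one coefficient per predictor, π·P(hair|o)·P(eye|o), using marginal rather than joint counts; the normalization step at the end is the same.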

### Summary and Technical Notes

The terminology and abbreviations used in this page and the two associated program pages are adapted from diverse sources, and may not be the same as in other publications. Users should be aware of this peculiarity when comparing these pages with other sources of information. The terms are chosen to prefer clarity over brevity, in the hope that inexperienced readers will be less confused. In particular, the following should be noted.
• Predictor is a conceptual term representing things used to predict, and no abbreviation is provided in these pages. In other publications, a variety of terms and abbreviations, such as independent variable, x, or j, are used
• Attribute is the value of a predictor, abbreviated as a. In other publications, predictor, independent variable, x, j, and so on are used
• Pattern is an array of attributes, one from each predictor, abbreviated as p, and is used only in the Naive Bayes model. In other publications, predictor, independent variable, x, j, and so on are used
• Outcome is used both as a concept of things to predict and as the values (probabilities) predicted, and is abbreviated as o. In other publications, dependent variable, a posteriori, posterior probability, y, z, θ are used
• The abbreviation P(x|y), representing the probability of x given y, generally known as conditional probability, is the same in these pages as in most publications. However, in most publications the same abbreviation is used (with different letters) to represent different types of conditional probability, while in these pages:
• P(a|o) and P(p|o) represent probability of attribute or pattern given outcome. Other publications use P(x|y), P(x|θ), or names of predictors and outcomes
• P(o|a) and P(o|p) represent the probability of outcome given attribute or pattern, without consideration of a priori probabilities. Other publications use P(y|x), P(θ|x), or the names of predictors and outcomes. This represents Maximum Likelihood, a term used in these pages as in most publications
• πP(o|a) and πP(o|p) represent the Bayesian Probability, with π representing the a priori probability. The term is an old one (see references), and is used in these pages to distinguish it from Maximum Likelihood. In most publications the same abbreviation as for Maximum Likelihood is used, and what the abbreviation means depends on the context described.