StatsToDo : Classification by Bayes Probability Explained
This page provides explanation and support for the two programs in the Classification by Basic Bayes Probability Program Page and the Classification by Naive Bayes Probability Program Page. As the programs and this explanation page use specific terms and abbreviations, and these are best demonstrated with examples, this introduction panel describes the example used and the terminology.

The format of data entry and explanation of results produced are in the Help and Hints panel of the program pages.

The remainder of this panel provides a description of the example data used in this and the two program pages, and brief explanations of the overall concepts of Bayesian probability and the terms used in these pages.

Before we start: Modern computers perform calculations with a precision of 14 decimal places. The two programs associated with this page display results to 4 decimal places. On this page, to conserve space and make reading easier, probability values are displayed to 2 decimal places. As a result, minor differences from the program results may occasionally arise, and some sets of probabilities may not total exactly 1. Readers should be aware of this and not be confused by it.

### The Example

The same example is used in the two program pages and this explanation page.
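
Predictors: Hair, Eye. Outcomes: French, German, Italian.

| Hair | Eye | French | German | Italian |
|---|---|---|---|---|
| Dark | Blue | 3 | 1 | 2 |
| Dark | Brown | 1 | 1 | 3 |
| Dark | Others | 1 | 1 | 1 |
| Light | Blue | 2 | 1 | 1 |
| Light | Brown | 2 | 1 | 2 |
| Light | Others | 1 | 5 | 1 |
| a priori | | 0.5 | 0.33 | 0.17 |
| cost | | 0.25 | 0.5 | 0.25 |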

We wish to develop a Bayesian model to identify the ethnicity of people based on hair color and eye color. To build our model, we recruited 10 people from each of three known groups (French, German, and Italian) and observed their hair and eye color. We then use the Bayesian model to predict ethnicity from hair and eye color in a community with an expected French:German:Italian ratio of 3:2:1, normalized to a priori probabilities of 0.5:0.33:0.17. In addition, we inserted a cost adjustment to our prediction, with cost ratios of 1:2:1 for French:German:Italian, normalized to cost coefficients of 0.25:0.5:0.25.
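The normalization of ratios described above can be sketched in a few lines of Python. This is only an illustration; the function name `normalize` and the dictionary layout are assumptions made for this sketch and are not taken from the program pages.

```python
def normalize(ratios):
    """Divide each ratio by the total so that the values sum to 1."""
    total = sum(ratios.values())
    return {name: value / total for name, value in ratios.items()}

# Expected community ratio French:German:Italian of 3:2:1
priors = normalize({"French": 3, "German": 2, "Italian": 1})

# Cost ratio French:German:Italian of 1:2:1
costs = normalize({"French": 1, "German": 2, "Italian": 1})

for name in priors:
    # Displayed to 2 decimal places, as elsewhere on this page
    print(name, round(priors[name], 2), round(costs[name], 2))
```

Rounded to 2 decimal places, the a priori probabilities are 0.5, 0.33, and 0.17, and the cost coefficients are 0.25, 0.5, and 0.25, matching the last two rows of the example table.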

The counts of each combination and the coefficients are presented in the table above. The terms and abbreviations used are explained in the sections that follow.

### Bayesian Probability

Bayesian Probability Theory is a mathematical model of making decisions based on experience. The process is to use a set of predictors to determine the probabilities of alternative outcomes. In the Bayesian context, prediction is not to forecast the future, nor to establish what may be true, but to logically apply the observed values of predictors to calculate how confident we can be, in terms of probabilities (a number between 0 and 1, or a percentage), for each of the alternative outcomes contained in our model.

The process of Bayesian decisions can be separated into the following stages:

1. We begin by nominating the a priori probabilities (π), our confidence in believing each of the alternative outcomes to be correct, before taking predictors into consideration. These can be established in any of the following ways:
• We can declare that we do not know, and assign the same value as a priori probabilities to all outcomes
• We can base the a priori probabilities on knowledge, from experience, research, previously collected data, hearsay, cultural belief, or simply a guess
• We can propose a priori probabilities as a hypothesis to explore, such as "if the a priori probabilities are ...., then ....."
• In our example, for the community in which we will use our Bayesian model, the census informs us that the ratio of French:German:Italian is 3:2:1. These values are normalized to probabilities by dividing each by the total, giving 0.5:0.33:0.17
2. We then use the coefficients of our model to apply the attributes of predictors, changing a priori probabilities into a posteriori probabilities. The coefficients are developed using a set of reference data (in our example, 10 cases of each ethnicity). Each coefficient is the probability of seeing an attribute given the outcome, P(a|o), obtained by dividing the number of cases with each attribute/outcome pair by the sample size of that outcome in the reference data. Both the Basic and Naive Bayes models use P(a|o) as coefficients, but they are calculated, presented, and used differently. Details are presented in the two subsequent panels.
3. The coefficients P(a|o) interact with the alternatives in the predictor(s) to estimate the a posteriori probability, commonly referred to as the Bayesian probability
• When there is only one predictor, as in the Basic Bayes model, attribute (a) represents each alternative of the predictor, and the Bayesian probability is the probability given attribute, πP(o|a)
• When there is more than one predictor, as in the Naive Bayes model, pattern (p) represents an array of attributes, one from each predictor, and the Bayesian probability is the probability given pattern, πP(o|p)
4. Under some circumstances, we may in addition impose a cost adjustment on our decisions, CπP(o|a) or CπP(o|p), if we hold that the outcomes have different values. For example, headache may predict anxiety or brain tumour, but missing a brain tumour has far graver consequences than missing anxiety, so we can insert a cost coefficient to bias our decisions towards brain tumour. The term cost refers to the cost of wrongly failing to identify a particular outcome, reflecting the importance of that outcome. The process inserts a deliberate, considered, and calibrated bias into our decisions.
5. Three types of a posteriori probability can therefore be calculated using the coefficients we developed:
• Probability of outcome using only the predictor(s), without taking a priori probability or cost into consideration. In the Basic Bayes model with one predictor, this is the probability given attribute, P(o|a), and in the Naive Bayes model with multiple predictors it is the probability given pattern, P(o|p). This probability is also termed Maximum Likelihood, and the table of Maximum Likelihood describes the behaviour of the model.
• Probability of outcome using the predictor(s) and the a priori probabilities π. In the Basic Bayes model with one predictor, this is the probability given attribute and a priori probability, πP(o|a), and in the Naive Bayes model with multiple predictors it is the probability given pattern and a priori probability, πP(o|p). This probability is also termed Bayes or Bayesian Probability, and is the major and most commonly used a posteriori probability.
• Probability of outcome using the predictor(s), the a priori probabilities π, and the costs C. In the Basic Bayes model with one predictor, this is the probability given attribute, a priori probability, and cost, CπP(o|a), and in the Naive Bayes model with multiple predictors it is the probability given pattern, a priori probability, and cost, CπP(o|p). Both are abbreviated to Cost Adjusted Bayesian Probability, and are used for decision making where a value judgement is included.
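The three types of a posteriori probability listed above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code of the program pages: hair color is used alone as a single predictor (the example counts summed over eye color), and each result is obtained by multiplying P(a|o) by the relevant weights and rescaling so the values over all outcomes total 1. The function names are hypothetical.

```python
# Example counts summed over eye color, so that hair color acts as a
# single predictor (an illustrative simplification of the example table).
counts = {
    "Dark":  {"French": 5, "German": 3, "Italian": 6},
    "Light": {"French": 5, "German": 7, "Italian": 4},
}
outcomes = ["French", "German", "Italian"]
prior = {"French": 3 / 6, "German": 2 / 6, "Italian": 1 / 6}   # 3:2:1
cost = {"French": 0.25, "German": 0.5, "Italian": 0.25}        # 1:2:1

# Sample size of each outcome in the reference data (10 each here).
sample_size = {o: sum(counts[a][o] for a in counts) for o in outcomes}

def likelihood(attribute):
    """P(a|o): count of the attribute/outcome pair over the outcome's sample size."""
    return {o: counts[attribute][o] / sample_size[o] for o in outcomes}

def normalize(weights):
    """Rescale so that the values over all outcomes total 1 (assumed step)."""
    total = sum(weights.values())
    return {o: w / total for o, w in weights.items()}

def posteriors(attribute):
    """Return P(o|a), piP(o|a), and CpiP(o|a) for one attribute."""
    p_a_o = likelihood(attribute)
    ml = normalize(p_a_o)
    bayes = normalize({o: prior[o] * p_a_o[o] for o in outcomes})
    cost_adjusted = normalize({o: cost[o] * prior[o] * p_a_o[o] for o in outcomes})
    return ml, bayes, cost_adjusted

ml, bayes, cost_adjusted = posteriors("Dark")
for o in outcomes:
    print(o, round(ml[o], 2), round(bayes[o], 2), round(cost_adjusted[o], 2))
```

For dark hair, Italian has the highest probability when only the predictor is used, French becomes the most probable once the a priori probabilities are applied, and the cost adjustment then shifts weight towards German.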

### Summary and Technical Notes

The terminology and abbreviations used in this page and the two associated program pages are adapted from diverse sources, and may not be the same as in other publications. Users should be aware of this peculiarity when comparing these pages with other sources of information. These are chosen to prefer clarity over brevity, hoping that, by doing so, the inexperienced will be less confused. In particular, the following should be noted.
• Predictor is a conceptual term representing things used to predict, and no abbreviation is provided in these pages. In other publications, a variety of terms and abbreviations, such as independent variable, x, j, are used
• Attribute is the value of a predictor, and is abbreviated as a. In other publications, predictor, independent variable, x, j, and so on are used
• Pattern is an array of attributes, one from each predictor, and is abbreviated as p. It is used only in the Naive Bayes model. In other publications, predictor, independent variable, x, j, and so on are used
• Outcome is used both as a concept of things to predict, and also as the values (probability) predicted, and is abbreviated as o. In other publications, dependent variable, a posteriori, posterior probability, y, z, θ are used
• The abbreviation P(x|y), representing the probability of x given y, generally known as conditional probability, is the same in these pages as in most publications. However, in most publications, the same abbreviations are used (with different letters) to represent different types of conditional probabilities, while in these pages:
• P(a|o) and P(p|o) represent probability of attribute or pattern given outcome. Other publications use P(x|y), P(x|θ), or names of predictors and outcomes
• P(o|a) and P(o|p) represent probability of outcome given attribute or pattern, without consideration of a priori or cost. Other publications use P(y|x), P(θ|x), or names of predictors and outcomes. This represents Maximum Likelihood, a term used in these pages as in most publications
• πP(o|a) and πP(o|p) represent Bayesian Probability, with π representing a priori probability. The term is an old one (see references), and is used in these pages to distinguish it from Maximum Likelihood. In most publications the same abbreviation as Maximum Likelihood is used, and what the abbreviation means depends on the context described.
• CπP(o|a) and CπP(o|p) represent Bayesian Probability with cost adjustment. The concept and term are old (see references), and it is difficult to find reference to this in more recent publications. This is included in these pages in case any user should wish to use it, but in most cases, costs need not be set.
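
To make the pattern notation concrete, the sketch below computes P(p|o) and πP(o|p) for the pattern (Dark hair, Blue eye) from the example table. It assumes the usual Naive Bayes simplification that predictors are independent given the outcome, so that P(p|o) is the product of the P(a|o) of each attribute in the pattern. The details of the Naive Bayes program page may differ, and all function names here are hypothetical.

```python
# Rows of the example table: (hair, eye, French, German, Italian).
rows = [
    ("Dark", "Blue", 3, 1, 2),
    ("Dark", "Brown", 1, 1, 3),
    ("Dark", "Others", 1, 1, 1),
    ("Light", "Blue", 2, 1, 1),
    ("Light", "Brown", 2, 1, 2),
    ("Light", "Others", 1, 5, 1),
]
outcomes = ["French", "German", "Italian"]
prior = {"French": 3 / 6, "German": 2 / 6, "Italian": 1 / 6}

# Sample size of each outcome in the reference data.
sample_size = {o: sum(r[2 + i] for r in rows) for i, o in enumerate(outcomes)}

def p_attribute_given_outcome(column, value):
    """P(a|o) for one predictor; column 0 is hair, column 1 is eye."""
    return {
        o: sum(r[2 + i] for r in rows if r[column] == value) / sample_size[o]
        for i, o in enumerate(outcomes)
    }

def p_pattern_given_outcome(hair, eye):
    """P(p|o) as the product of per-attribute P(a|o) (independence assumed)."""
    p_hair = p_attribute_given_outcome(0, hair)
    p_eye = p_attribute_given_outcome(1, eye)
    return {o: p_hair[o] * p_eye[o] for o in outcomes}

def normalize(weights):
    """Rescale so that the values over all outcomes total 1."""
    total = sum(weights.values())
    return {o: w / total for o, w in weights.items()}

p_p_o = p_pattern_given_outcome("Dark", "Blue")
bayes = normalize({o: prior[o] * p_p_o[o] for o in outcomes})
for o in outcomes:
    print(o, round(p_p_o[o], 2), round(bayes[o], 2))
```

Because each predictor is tabulated separately, the product P(Dark|o) × P(Blue|o) generally differs from the proportion obtained directly from the Dark/Blue row of the table; this is the consequence of the independence assumption.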