1. Introduction
In this tutorial, we’ll explain Bayes’ theorem.
It forms the basis of many AI methods, such as Bayesian networks and Naïve Bayes, but it's also at the core of Bayesian statistics, a popular school of statistical thought.
2. Inverting Conditional Probabilities
Let’s suppose we have a patient with a set of symptoms $S$. Based on our expert knowledge, we suspect the patient may have the health condition $C$. So, if the conditional probability $P(C \mid S)$ is reasonably high, we may start administering therapy for $C$. Therefore, the estimation of $P(C \mid S)$ should be our first step.
To be as objective as possible, we search the medical records in our archive but find no way to compute $P(C \mid S)$ directly. The problem is that the patients are sorted according to their diagnoses, not by the symptoms observed upon examination. So, we can compute $P(S \mid C)$ easily by counting the cases in the folder for disease $C$, but $P(C \mid S)$ would require going over the entire archive, which is infeasible.
Bayes’ theorem can help us compute $P(C \mid S)$ by using its converse $P(S \mid C)$. In general, the theorem is useful when the conditional probability of interest is hard or impossible to estimate, but its converse is available.
3. Bayes’ Theorem
Here is the typical use case of Bayes’ theorem. We have some evidence (i.e., data) $E$, obtained as the result of experimentation or observation, and want to decide if hypothesis $H$ explains $E$ reasonably well. A sensible approach is to compute the probability $P(H \mid E)$ and consider $H$ true if $P(H \mid E)$ is high enough.
By the definition of conditional probability, we have:
$$P(H \mid E) = \frac{P(H \cap E)}{P(E)} \qquad (1)$$
The probability of an intersection can be expressed as:
$$P(H \cap E) = P(E \mid H)\, P(H) \qquad (2)$$
Putting the two equations together, we get Bayes’ theorem:
$$P(H \mid E) = \frac{P(E \mid H)\, P(H)}{P(E)} \qquad (3)$$
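To see that (1) and (3) agree, here’s a minimal Python sketch with a made-up joint distribution over $H$ and $E$ (the numbers are purely illustrative):

```python
# A small, made-up joint distribution over a hypothesis H and evidence E
# (the remaining mass, 0.42, belongs to the unused cell P(not H, not E))
p_h_and_e = 0.12        # P(H and E)
p_h_and_not_e = 0.18    # P(H and not E)
p_not_h_and_e = 0.28    # P(not H and E)

p_h = p_h_and_e + p_h_and_not_e   # P(H) = 0.30
p_e = p_h_and_e + p_not_h_and_e   # P(E) = 0.40

p_e_given_h = p_h_and_e / p_h     # likelihood P(E | H) = 0.40

# Definition (1) and Bayes' theorem (3) give the same posterior
print(p_h_and_e / p_e)            # ~0.3
print(p_e_given_h * p_h / p_e)    # ~0.3
```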
3.1. Terminology and Notation
The probabilities in Bayes’ theorem have special names. $P(H)$ is the prior probability or simply the prior. It represents our degree of belief that $H$ holds before we have seen the evidence $E$, hence the name.
Analogously, $P(H \mid E)$ is the posterior probability (or just the posterior). It denotes our belief that $H$ is true after seeing the evidence $E$.
Finally, the converse conditional probability $P(E \mid H)$ is called the likelihood of $E$ given $H$.
Since $P(E)$ doesn’t depend on $H$, we can drop it and write the theorem informally:

$$P(H \mid E) \propto P(E \mid H)\, P(H)$$

Here, $\propto$ means that the posterior is proportional to the product of the prior and the likelihood. Usually, we don’t need to estimate $P(E)$ directly since we can get the posterior, up to a normalizing constant, by focusing on $P(E \mid H)$ and $P(H)$, i.e., by combining the likelihood with our prior.
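As a quick illustration (with invented numbers), suppose we have several mutually exclusive and exhaustive hypotheses. We can multiply each prior by its likelihood and normalize at the end; the normalizing constant is exactly $P(E)$, so we never have to estimate it separately:

```python
# Hypothetical priors and likelihoods for three mutually exclusive hypotheses
priors = {"H1": 0.5, "H2": 0.3, "H3": 0.2}          # P(H_i)
likelihoods = {"H1": 0.10, "H2": 0.40, "H3": 0.05}  # P(E | H_i)

# Unnormalized posteriors: prior times likelihood
unnormalized = {h: priors[h] * likelihoods[h] for h in priors}

# The normalizing constant equals P(E) by the law of total probability
p_e = sum(unnormalized.values())
posteriors = {h: value / p_e for h, value in unnormalized.items()}

print(posteriors)  # approx. {'H1': 0.278, 'H2': 0.667, 'H3': 0.056}
```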
3.2. Sequential Updates
Bayes’ theorem allows sequential inference. What does that mean? Well, if we get pieces of evidence $E_1, E_2, \ldots, E_n$ one by one, we can apply the theorem iteratively until we process the last piece. The result will be the same as if we had waited to collect all the evidence and then used the theorem once.
For example, let’s say that our data (evidence) comes in two parts: first $E_1$ and then $E_2$. If we use the theorem after receiving both parts, we get the following:

$$\begin{aligned} P(H \mid E_1, E_2) &= \frac{P(E_1, E_2 \mid H)\, P(H)}{P(E_1, E_2)} \\ &= \frac{P(E_2 \mid E_1, H)}{P(E_2 \mid E_1)} \cdot \frac{P(E_1 \mid H)\, P(H)}{P(E_1)} \\ &= \frac{P(E_2 \mid E_1, H)\, P(H \mid E_1)}{P(E_2 \mid E_1)} \end{aligned}$$

In the second line, we expand both joint probabilities with the chain rule and isolate the Bayes’ theorem expression for the first piece of evidence; in the third line, it collapses to $P(H \mid E_1)$. The last line is therefore equivalent to applying the theorem to $E_2$ conditioned on $E_1$, with the posterior we got from $E_1$ playing the role of the prior.
So, we can update our belief state iteratively:

$$P(H \mid E_1, \ldots, E_k) = \frac{P(E_k \mid E_1, \ldots, E_{k-1}, H)\, P(H \mid E_1, \ldots, E_{k-1})}{P(E_k \mid E_1, \ldots, E_{k-1})}, \qquad k = 1, 2, \ldots, n$$

where, for $k = 1$, the conditioning set is empty, so we start from the prior $P(H)$.
In brief, one step’s posterior becomes the next step’s prior.
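The following Python sketch illustrates this with two pieces of evidence that we additionally assume to be conditionally independent given the hypothesis (an assumption made only to keep the example short; the sequential formula above doesn’t require it). Updating with $E_1$ and then with $E_2$ yields the same posterior as conditioning on both at once:

```python
def update(prior_h, lik_h, lik_not_h):
    """One Bayesian update for a binary hypothesis: H vs. not-H."""
    numerator = lik_h * prior_h
    return numerator / (numerator + lik_not_h * (1.0 - prior_h))

# Hypothetical prior and likelihoods; E1 and E2 are assumed
# conditionally independent given H (and given not-H)
prior = 0.2
lik_e1 = (0.7, 0.3)   # (P(E1 | H), P(E1 | not H))
lik_e2 = (0.9, 0.4)   # (P(E2 | H), P(E2 | not H))

# Sequential: the posterior after E1 becomes the prior for E2
posterior_seq = update(update(prior, *lik_e1), *lik_e2)

# Batch: condition on E1 and E2 at once
posterior_batch = update(prior, lik_e1[0] * lik_e2[0], lik_e1[1] * lik_e2[1])

print(posterior_seq, posterior_batch)  # both approx. 0.568
```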
4. Distributional Perspective
So far, we’ve dealt only with binary events. There were only two outcomes for the hypothesis: $H$ and $\neg H$. The real world can be much more complex, so priors and posteriors can be and often are continuous distributions.
For example, we may want to determine if a coin is fair. By definition, it’s fair if $P(\text{heads}) = 1/2$. So, we estimate the probability $p = P(\text{heads})$ and check if it’s close to $1/2$.
4.1. Prior
Before tossing the coin, we should choose the prior. Since $p$ can take any value in $[0, 1]$, the prior needs to specify a continuous distribution. For example, we can use the Beta distribution with the density:

$$f(p; \alpha, \beta) = \frac{p^{\alpha - 1} (1 - p)^{\beta - 1}}{B(\alpha, \beta)}$$

where $B(\alpha, \beta)$ is the Beta function, and $\alpha$ and $\beta$ are the parameters we set when specifying the prior.
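As a sanity check, and assuming SciPy is available, we can compare this formula with scipy.stats.beta.pdf for an arbitrary choice of parameters:

```python
from math import gamma

from scipy.stats import beta

def beta_pdf(p, a, b):
    """Beta(a, b) density, using B(a, b) = Gamma(a) Gamma(b) / Gamma(a + b)."""
    beta_function = gamma(a) * gamma(b) / gamma(a + b)
    return p ** (a - 1) * (1 - p) ** (b - 1) / beta_function

# Arbitrary choice: a Beta(2, 2) prior evaluated at p = 0.3
print(beta_pdf(0.3, 2, 2))   # 1.26
print(beta.pdf(0.3, 2, 2))   # 1.26
```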
4.2. Likelihood
The evidence $E$ is the number of heads. If we toss the coin $n$ times and get $k$ heads, the likelihood of the evidence is:

$$P(E \mid p) = \binom{n}{k} p^{k} (1 - p)^{n - k}$$

The overall probability of the evidence is:

$$P(E) = \int_{0}^{1} \binom{n}{k} p^{k} (1 - p)^{n - k} f(p; \alpha, \beta)\, dp$$

Here, we integrate over all the possible values of the head probability $p$. As a result, $P(E)$ is a constant with respect to any particular value of $p$ appearing in the posterior.
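For instance, assuming SciPy is available, we can approximate the integral numerically for hypothetical values of $n$, $k$, $\alpha$, and $\beta$:

```python
from scipy.integrate import quad
from scipy.stats import beta, binom

n, k = 10, 7   # hypothetical experiment: 7 heads in 10 tosses
a, b = 2, 2    # hypothetical prior parameters alpha and beta

# P(E) = integral over [0, 1] of P(E | p) * f(p; a, b) dp
p_e, _ = quad(lambda p: binom.pmf(k, n, p) * beta.pdf(p, a, b), 0, 1)
print(p_e)     # approx. 0.112
```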
4.3. Bayesian Update
Now, we can get our new posterior by combining the prior with the likelihood:

$$f(p \mid E) = \frac{P(E \mid p)\, f(p; \alpha, \beta)}{P(E)} \propto p^{k + \alpha - 1} (1 - p)^{n - k + \beta - 1}$$

What’s more, the posterior is again a Beta distribution, this time with parameters $\alpha + k$ and $\beta + n - k$.
If we choose a suitable (conjugate) prior, we’ll be able to derive the posterior analytically, as in this example. However, if no closed-form solution exists for our choice of prior, we’ll need numerical methods to approximate the posterior.
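Here’s a small sketch, again with hypothetical numbers and assuming NumPy and SciPy, that contrasts the analytic conjugate update with a simple grid approximation of the same posterior:

```python
import numpy as np
from scipy.stats import beta, binom

n, k = 10, 7   # hypothetical data: 7 heads in 10 tosses
a, b = 2, 2    # hypothetical Beta prior parameters

grid = np.linspace(0.001, 0.999, 999)

# Analytic (conjugate) posterior: Beta(a + k, b + n - k)
analytic = beta.pdf(grid, a + k, b + n - k)

# Numerical alternative: normalize prior * likelihood on the grid
unnormalized = beta.pdf(grid, a, b) * binom.pmf(k, n, grid)
numerical = unnormalized / (unnormalized.sum() * (grid[1] - grid[0]))

print(np.max(np.abs(analytic - numerical)))   # small: the two posteriors agree
```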
5. The Prior Controversy and Other Criticisms
Bayes’ theorem gives us a rigorous mathematical tool for updating our beliefs. However, the approach has drawn much criticism. Doesn’t it open a door for our prejudices and subjective beliefs to override objective data and influence our judgment? Can we claim to have a sound inference procedure if a wrong choice of prior can bias the posterior and lead to incorrect conclusions?
Bayesians argue that the choice of the prior isn’t and shouldn’t be arbitrary. If we use the prior “the probability that the hypothesis is true is 75%”, that claim needs to be grounded in theory or previous empirical evidence. For instance, if a medical condition is rare and occurs in only 1% of the population, it makes sense to use $P(H) = 0.01$ as the prior probability that a patient has it.
Additionally, the proponents of the Bayesian approach argue that, in the end, all decisions about accepting or rejecting hypotheses are subjective and based on internal belief states. As a result, the Bayesian methodology is best suited to support inference. Whether we share this view or not, Bayes’ theorem is a prerequisite for modern AI and statistics.
6. Example
Let’s say that we found that 75% of the patients with medical condition $C$ had symptoms $S$, i.e., $P(S \mid C) = 0.75$. In the literature, we find that the prevalence of $C$ in the general population is 1%, whereas the symptoms $S$ are estimated to affect a quarter of the population at any given time. So, $P(C) = 0.01$ and $P(S) = 0.25$.

Then, the probability that a patient with symptoms $S$ has the condition $C$ is:

$$P(C \mid S) = \frac{P(S \mid C)\, P(C)}{P(S)} = \frac{0.75 \times 0.01}{0.25} = 0.03$$

So, the chance that a random person with symptoms $S$ has $C$ is only 3%. If we didn’t use Bayes’ theorem and looked only at $P(S \mid C) = 0.75$, we’d arrive at a completely wrong conclusion.
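The same computation takes only a few lines of Python:

```python
p_s_given_c = 0.75   # P(S | C): probability of the symptoms given the condition
p_c = 0.01           # P(C): prevalence of the condition
p_s = 0.25           # P(S): overall frequency of the symptoms

print(p_s_given_c * p_c / p_s)   # 0.03
```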
7. Conclusion
In this article, we covered Bayes’ theorem. It allows us to compute the posterior probability from the prior and the likelihood of the data. However, if we don’t choose a suitable prior, our inference may be biased.