Bayesian statistics and models

This is an introduction to Bayesian statistics.

Once data becomes too large to examine in full, and results depend on many factors, query results will have (and sometimes already have) error bars associated with them.

In computer science, looking at all of the data even once requires at least a linear-time algorithm.

As databases become bigger and bigger, the only way to get sub-linear algorithms is to not look at all of the data, which requires probabilistic models.
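As a minimal sketch of the idea (not taken from the talk), the following Python estimates a mean from a random sample instead of scanning every value, and reports an approximate 95% error bar. The function name `estimate_mean`, the synthetic data, and the sample size are all assumptions made for illustration.

```python
import math
import random
import statistics

def estimate_mean(data, sample_size, rng):
    """Estimate the mean of `data` from a random sample, with an error bar."""
    sample = rng.sample(data, sample_size)      # look at only part of the data
    mean = statistics.fmean(sample)
    # Standard error of the mean: sample standard deviation / sqrt(n).
    std_err = statistics.stdev(sample) / math.sqrt(sample_size)
    return mean, 1.96 * std_err                 # approximate 95% error bar

rng = random.Random(0)                          # fixed seed for repeatability
# Synthetic stand-in for a large table (hypothetical values).
data = [rng.gauss(100.0, 15.0) for _ in range(200_000)]

estimate, error_bar = estimate_mean(data, 2_000, rng)
print(f"sampled estimate = {estimate:.2f} +/- {error_bar:.2f}")
print(f"exact mean (full linear scan) = {statistics.fmean(data):.2f}")
```

The sampled answer touches 1% of the values, and the price is the error bar: the result is a probabilistic statement rather than an exact one.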

Michael Jordan: Berkeley (machine learning, computer science, statistics):

Data is getting really big.
Sub-linear algorithms are needed (cannot look at all the data).
Computer science and statistics will be merged in 50 years.

You can view his YouTube video at http://www.youtube.com/watch?v=LFiwO5wSbi8.

Statistics has two ways of looking at reality. Both are correct, though one may work better in a given situation.

Frequentist statistics (null hypothesis, confidence intervals, etc.)
Bayesian statistics (inverse probability, probability of causes, etc.)

Many statisticians, however, dispute that frequentist and Bayesian statistics are both correct ways of looking at reality.

More: Duality principles

Rev. Thomas Bayes (mid 1700's) had the original idea as a specific instance of a problem and used Newton's cumbersome notation to describe it.

Pierre-Simon Laplace (1749-1827), in the late 1700's and early 1800's, independently developed the idea of Bayes Rule, credited Bayes for one insightful idea, and then refined and formalized the ideas in an elegant way. Laplace developed both frequentist statistics and Bayesian statistics.

Problems with Bayesian statistics at that time:

No formal theory for realistic problems
No computational way to solve problems

Bayes Rule was independently developed and used by many during the next 150 years.

Alan Turing used Bayes Rule to help break the Enigma encryption during World War II.

In the 1980's, algorithmic advances using MCMC (Markov Chain Monte Carlo) techniques made Bayesian systems computationally tractable.

Technique: Gibbs Sampling
Software: BUGS, WinBUGS, OpenBUGS
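To give a feel for the technique, here is a minimal Gibbs sampling sketch in Python. It is a textbook toy case chosen for illustration, not how BUGS works internally: the target is assumed to be a standard bivariate normal with correlation rho, where each variable's conditional distribution given the other is a known normal. The function name and parameter values are hypothetical.

```python
import math
import random

def gibbs_bivariate_normal(rho, n_samples, burn_in=500, seed=0):
    """Gibbs sampler for a standard bivariate normal with correlation rho."""
    rng = random.Random(seed)
    cond_sd = math.sqrt(1.0 - rho * rho)    # std. dev. of each conditional
    x, y = 0.0, 0.0                         # arbitrary starting point
    samples = []
    for step in range(burn_in + n_samples):
        # Draw each variable from its conditional given the other:
        # x | y ~ Normal(rho * y, 1 - rho^2), and symmetrically for y | x.
        x = rng.gauss(rho * y, cond_sd)
        y = rng.gauss(rho * x, cond_sd)
        if step >= burn_in:                 # discard the warm-up draws
            samples.append((x, y))
    return samples

samples = gibbs_bivariate_normal(rho=0.8, n_samples=20_000)
xs = [s[0] for s in samples]
ys = [s[1] for s in samples]
mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)
cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys)) / len(xs)
var_x = sum((a - mean_x) ** 2 for a in xs) / len(xs)
var_y = sum((b - mean_y) ** 2 for b in ys) / len(ys)
corr = cov / math.sqrt(var_x * var_y)
print(f"sample correlation = {corr:.3f} (target 0.8)")
```

The sampler never evaluates the joint distribution directly. It only draws from one conditional at a time, which is what makes the approach usable when the joint posterior is too awkward to sample in one step.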

Some of these techniques were known during World War II, but were kept secret until academic researchers later discovered them independently.

Computer scientist Judea Pearl (and others) developed the theory of causation, Bayesian inference, and graphical models needed to model practical problems.

Problem: For over 150 years statisticians have used one approach, and many are not ready for a new approach to the same problems.

According to Jordan, in general:

A frequentist approach will average over the possible data to see if the sample result is within a certain limit (i.e., confidence interval).

A Bayesian approach will look at just the available data, and what is known about the past data, in making a decision.

A good book on the history of Bayes Rule, and also of frequentist statistics, is "The Theory That Would Not Die: How Bayes' Rule Cracked the Enigma Code, Hunted Down Russian Submarines, and Emerged Triumphant from Two Centuries of Controversy" by Sharon Bertsch McGrayne.

These two (symmetric) conditional probability equations can be related by the common joint probability to get Bayes Rule.
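Assuming the standard definitions, and writing P(A, B) for the joint probability of A and B (notation assumed here), the two conditional probability equations and their link through the joint probability can be written as:

```latex
\[
P(A \mid B) = \frac{P(A, B)}{P(B)}, \qquad
P(B \mid A) = \frac{P(A, B)}{P(A)}
\]
\[
P(A \mid B)\, P(B) = P(A, B) = P(B \mid A)\, P(A)
\]
```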

The above equation is symmetric if A and B are interchanged.

Conditional probability

Here, then, is Bayes Rule.

In usual form Bayes Rule appears as follows.
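In standard notation, and assuming P(B) is not zero, dividing both sides of P(A | B) P(B) = P(B | A) P(A) by P(B) gives the usual form:

```latex
\[
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}, \qquad P(B) > 0
\]
```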

Let A be the proposed Model and B be the observed Data. Then Bayes Rule becomes the following.
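Substituting Model for A and Data for B in the usual form of Bayes Rule, the equation can be sketched in standard notation as:

```latex
\[
P(\mathrm{Model} \mid \mathrm{Data}) =
  \frac{P(\mathrm{Data} \mid \mathrm{Model})\, P(\mathrm{Model})}
       {P(\mathrm{Data})}
\]
```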

The posterior is P(Model | Data).
The likelihood is P(Data | Model).
The prior is P(Model).
The evidence is P(Data).

What happens if the prior, P(Model), is 0.0?

In real-world calculations, the posterior is proportional to the likelihood times the prior (together, the numerator on the right side). The evidence (the denominator on the right side) is the same for every model being compared, so it can often be ignored.
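A small Python sketch of this proportionality, assuming a discrete set of candidate models: the numerators are computed first, and the evidence is recovered at the end as their sum, so it never has to be known in advance. The coin example, the model names, and the function name `posterior` are made up for illustration.

```python
def posterior(priors, likelihoods):
    """Return normalized posteriors for a discrete set of models."""
    # Numerator of Bayes Rule for each model: likelihood times prior.
    unnormalized = {m: likelihoods[m] * priors[m] for m in priors}
    # For a discrete set of models, the evidence P(Data) is the sum
    # of the numerators, so it only rescales the results.
    evidence = sum(unnormalized.values())
    return {m: u / evidence for m, u in unnormalized.items()}

# Hypothetical example: is a coin fair, or biased 80% toward heads?
priors = {"fair": 0.5, "biased": 0.5}
# Likelihood of the data "three heads in a row" under each model.
likelihoods = {"fair": 0.5 ** 3, "biased": 0.8 ** 3}

post = posterior(priors, likelihoods)
for model, p in post.items():
    print(f"P({model} | Data) = {p:.3f}")
```

Dividing by the evidence changes the scale of the numbers but not their ranking, which is why comparing models needs only the numerators.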

Cromwell's Rule warns against setting the prior P(Model) to 0.0. "As long as you are set that the probability is going to be zero, then nothing's going to change your mind." Madansky.

"I beseech you, in the bowels of Christ, think it possible you may be mistaken." Oliver Cromwell.

Favorite introduction to Bayes by Arthur Bailey, actuary and Bayes Rule popularizer in the early 1950's:

"If thou canst believe, all things are possible to him that believeth." Mark 9:23.

Takeaway: As soon as one gives no chance to something happening, there is no evidence that will logically sway that opinion. (Think certain viewpoints in politics, religion, sports, etc.).
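The takeaway can be checked numerically. In this Python sketch, a model is either true or false, and each observation is assumed to be nine times more likely if the model is true (the 0.9 and 0.1 likelihoods, the function names, and the number of observations are illustrative assumptions). A prior of exactly 0.0 never moves, while a tiny nonzero prior is eventually overwhelmed by the evidence.

```python
def update(prior, likelihood_if_true, likelihood_if_false):
    """One application of Bayes Rule for a model that is either true or false."""
    numerator = likelihood_if_true * prior
    evidence = numerator + likelihood_if_false * (1.0 - prior)
    return numerator / evidence

def after_observations(prior, n, likelihood_if_true=0.9, likelihood_if_false=0.1):
    """Apply Bayes Rule n times, feeding each posterior back in as the next prior."""
    belief = prior
    for _ in range(n):
        belief = update(belief, likelihood_if_true, likelihood_if_false)
    return belief

closed_mind = after_observations(prior=0.0, n=20)    # prior of exactly zero
open_mind = after_observations(prior=1e-6, n=20)     # tiny but nonzero prior

print(f"prior 0.0  -> posterior after 20 observations: {closed_mind}")
print(f"prior 1e-6 -> posterior after 20 observations: {open_mind:.6f}")
```

With a zero prior the numerator is always zero, so the posterior is zero no matter how strongly the data favor the model.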

How do you recognize a Bayesian statistician?

"Ye shall know them by their posteriors."

Stumbling block: In the absence of information about the domain of application, what value should be used for the prior probability?

In the absence of information, the error in the prior probability is minimized by assuming equal odds (50.0%, or a probability of 0.5, for a sample space of two events) or, more generally, a uniform prior.

Does this make you uncomfortable?
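One way to read the claim that equal odds minimize the error in the prior is as a worst-case argument. This reading is an assumption made here, not something stated above. If the true probability could be anywhere from 0 to 1, the guess whose largest possible absolute error is smallest is 0.5. A short Python sketch, with a hypothetical helper `worst_case_error`:

```python
def worst_case_error(guess):
    """Largest possible |true - guess| when the true probability is in [0, 1]."""
    # The worst case is at one of the two extremes, true = 0 or true = 1.
    return max(guess, 1.0 - guess)

candidates = [i / 100 for i in range(101)]      # 0.00, 0.01, ..., 1.00
best = min(candidates, key=worst_case_error)

print(f"guess with smallest worst-case error: {best}")
print(f"its worst-case error: {worst_case_error(best)}")
```

Any guess other than 0.5 can be off by more than 0.5 in the worst case, which is one sense in which equal odds are the safest choice under ignorance.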