Analysing Posterior Predictive Distributions with PyMC3
The primary purpose of Bayesian analysis is to model data in the presence of uncertainty.
Since one cannot access all the data about a population to determine its precise distribution, assumptions about that distribution must be made.
For instance, I might make an assumption about the mean height of the population in a particular country. This is a prior distribution: a distribution founded on beliefs held before looking at any data that could prove or disprove them.
Upon analysing a new set of data, these prior beliefs can be combined with the likelihood function, which expresses how probable the observed data is under different parameter values, to form the posterior distribution.
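Formally, this combination is Bayes’ theorem, where θ denotes the model parameters and x the observed data:
p(θ | x) = p(x | θ) · p(θ) / p(x)
In other words, the posterior is proportional to the likelihood times the prior.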
Source: Image Created by Author
Let’s see how PyMC3 can be used to model such a distribution.
Analysing BMI Data with PyMC3
For this example, the Pima Indians Diabetes dataset is used to model body mass index data across a number of patients (some of whom are diabetic, some of whom are not).
Before looking at the data, one needs to establish a prior belief about its mean and standard deviation. Given that a sizeable number of patients in the dataset are diabetic, one could formulate a prior belief that the mean BMI is higher than average.
NUTS (No-U-Turn Sampler)
In this regard, let’s make a prior assumption that the mean BMI is 30, with a relatively small standard deviation of 2, indicating that we do not expect much deviation from the mean across individual cases.
import pymc3 as pm
import pandas as pd

# BMI column of the Pima Indians Diabetes dataset (assumed local CSV path)
data = pd.read_csv("diabetes.csv")["BMI"].values

mu_prior = 30    # prior belief: mean BMI of 30
sigma_prior = 2  # prior belief: small standard deviation

with pm.Model() as model:
    mu = pm.Normal("mu", mu=mu_prior, sigma=sigma_prior)
    sd = pm.HalfNormal("sd", sigma=sigma_prior)
    obs = pm.Normal("obs", mu=mu, sigma=sd, observed=data)
    idata = pm.sample(1000, tune=1500, return_inferencedata=True)
Following the sampling template in section 3.1 of the PyMC3 General API Quickstart, the NUTS (No-U-Turn) sampler first runs 1,500 tuning iterations, during which it adapts its step size (these draws are discarded), and then draws 1,000 samples from the posterior in each chain. Note that, in addition to the prior mean and standard deviation, the model also takes the observed data (observed=data) into account when generating the posterior distribution.
NUTS is a Markov chain Monte Carlo (MCMC) algorithm that avoids random-walk behaviour and therefore converges to the target distribution more quickly. This not only has the advantage of speed, but also allows complex models to be fitted without specialised knowledge of the theory underlying those fitting methods.
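PyMC3 assigns NUTS automatically to continuous parameters when pm.sample is called, but the step method can also be requested explicitly. A toy sketch (the model here is purely illustrative):

with pm.Model():
    mu = pm.Normal("mu", mu=30, sigma=2)
    # Explicitly request the No-U-Turn Sampler instead of relying on auto-assignment
    trace = pm.sample(1000, tune=1500, step=pm.NUTS(), return_inferencedata=True)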
Interpretation
Using arviz, we can now examine the generated posterior plots.
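These are produced with plot_trace (a minimal sketch, assuming idata is the object returned by pm.sample above):

import arviz as az

# Trace and density plots for the posterior samples of mu and sd
az.plot_trace(idata)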
Source: Jupyter Notebook Output
A statistical summary of the distribution can also be generated.
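This table comes from arviz’s summary function (again assuming the same idata object):

# Posterior summary: mean, sd, HDI bounds, effective sample size, and r_hat
az.summary(idata)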
Source: Jupyter Notebook Output
Note that the posterior mean of 31.949 is slightly higher than the prior estimate, while the standard deviation, at 7.823, is significantly higher than the assumed value of 2.
Having taken the data into account, it would make sense in hindsight that the standard deviation is higher than originally expected.
After all, numerous patients in the dataset are not diabetic. Moreover, while one might expect a positive correlation between BMI and diabetes, this is based on a prior belief — we do not have hard evidence that this is the case without looking at the data.
Taking the updated mean and standard deviation into account, we can now examine the posterior predictive distribution.
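A sketch of how such a posterior predictive plot can be produced, assuming the model and idata objects from earlier and using az.from_pymc3 to bundle the predictive draws with the observed data:

with model:
    # Draw simulated BMI datasets from the posterior predictive distribution
    ppc = pm.sample_posterior_predictive(idata)

# Compare the simulated datasets against the observed BMI values
az.plot_ppc(az.from_pymc3(posterior_predictive=ppc, model=model))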
Source: Jupyter Notebook Output
We can see that the dispersion is higher than originally anticipated. That said, what if the data itself is not entirely accurate? For instance, nobody can have a BMI of 0, yet several zero BMI values are present in the data, most likely as placeholders for missing measurements.
However, the specified prior values, along with the majority of other values in the data, indicate that such an observation is highly improbable. Accordingly, the posterior predictive distribution drops off dramatically below a BMI of 20 (the lowest non-zero BMI value in the data was 18.2).
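A quick check confirms these implausible readings (a minimal sketch, assuming data is the NumPy array loaded earlier):

import numpy as np

# Count BMI readings recorded as zero, which are physically impossible
print("Zero BMI values:", int(np.sum(data == 0)))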
Revisiting prior beliefs
When first formulating our prior belief about BMI, the estimated mean value was not very far off from the actual value.
However, what if our prior beliefs had completely missed the mark? Suppose we had estimated a mean BMI of 50 with a standard deviation of 2.
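Re-running the same model with this deliberately misspecified prior might look like the following (a minimal sketch; the variable names are illustrative):

with pm.Model() as misspecified_model:
    mu = pm.Normal("mu", mu=50, sigma=2)  # prior mean far above the observed data
    sd = pm.HalfNormal("sd", sigma=2)
    obs = pm.Normal("obs", mu=mu, sigma=sd, observed=data)
    idata_off = pm.sample(1000, tune=1500, return_inferencedata=True)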
The generated posterior distribution is as follows:
Source: Jupyter Notebook Output
Despite the inaccurate prior beliefs, the sampler recognises that they completely miss the mark with respect to the observed data, and the distribution is updated accordingly.
This can also be very useful when new data follows a different distribution from prior patterns. For instance, a company manager might assume from experience that product sales in a given year show a certain mean and standard deviation, but then observe data that differs from previous years. In this case, the posterior distribution allows the manager’s prior beliefs to be updated to reflect the newly observed data.
Conclusion
Posterior distributions allow prior beliefs to be updated by taking new evidence (or data) into account.
In this article, you have learned:
- The difference between a prior and posterior distribution
- How to model posterior distributions with PyMC3
- How to interpret posterior plots with arviz
- The role of prior beliefs and the likelihood function in generating posterior distributions