Home | Portfolio | Terms and Conditions | E-mail me | LinkedIn
Bayesian Statistics: Analysis of Health Data
The premise of Bayesian statistics is that distributions are based on personal belief about the shape of such a distribution, rather than the classical assumption which does not take such subjectivity into account.
In this regard, Bayesian statistics defines distributions in the following way:
- Prior: Beliefs about a distribution prior to observing any data.
- Likelihood: Distribution based on the observed data.
- Posterior: Distribution that combines observed data with prior beliefs.
Classical statistics relies largely on the t-test to determine significance of a particular variable, and does not take subjective predictions of the data into account. However, the issue with such an approach is that no constraint is placed on the data, and as Richard Morey explains in his blog, an alternative hypothesis becomes “completely unfalsifiable”.
In this regard, the Bayes factor penalises hypotheses that are not specific enough, allowing for a more concrete assessment of how accurate a specific prediction is.
Problem and hypothesis
As an example, let us consider the hypothesis that BMI increases with age. Conversely, the null hypothesis argues that there is no evidence for a positive correlation between BMI and age.
In this regard, even if we did find a positive correlation between BMI and age, the hypothesis is virtually unfalsifiable given that the existence of no relationship whatever between these two variables is highly unlikely.
Therefore, as opposed to using a simple t-test, a Bayes Factor analysis needs to have specific predictions to assess the accuracy of the same.
t-test vs. Bayes Factor t-test
For this problem, the diabetes dataset from the UCI Machine Learning Repository is used. The variables Age and BMI are extracted, the data is ordered by age, and a t-test is calculated across two separate age groups.
library(BayesFactor) # Data Processing health<-read.csv("diabetes.csv") attach(health) newdata <- data.frame(Age,BMI) newdata <- newdata[order(Age),] ## Compute difference scores diffScores = newdata$BMI[1:384] - newdata$BMI[385:768] ## Traditional two-tailed t test t.test(diffScores)
Here are the results for the t-test:
data: diffScores t = -2.2622, df = 383, p-value = 0.02425 alternative hypothesis: true mean is not equal to 0 95 percent confidence interval: -2.4371984 -0.1706141 sample estimates: mean of x -1.303906
With a p-value below 0.05, the t-test shows significance at the 5% level, indicating that the null hypothesis of the mean being equal to 0 is rejected. However, the issue still remains in that the degree of evidence in favour of H1 over H0 cannot be quantified in detail.
In this regard, a Bayes Factor t-test is run across the different scores.
bf = ttestBF(x = diffScores) bf
Here are the results:
Bayes factor analysis --------------  Alt., r=0.707 : 0.7139178 ±0.01% Against denominator: Null, mu = 0 --- Bayes factor type: BFoneSample, JZS
A score of 0.7139 is yielded. Typically, a score of > 1 signifies anecdotal evidence for H0 compared to H1. The exact thresholds are defined by Wagenmakers et. al, 2011, and a copy of the table can be found at the following R-bloggers post.
Let’s come back to the issue of the posterior distribution. In the case that we are unable to calculate the posterior distribution, it can be estimated using Markov chain Monte Carlo methods (MCMC).
The chains are estimated and the distributions are plotted:
chains = posterior(bf, iterations = 1000) summary(chains) plot(chains[,1:2])
Here are the results:
Iterations = 1:1000 Thinning interval = 1 Number of chains = 1 Sample size per chain = 1000 1. Empirical mean and standard deviation for each variable, plus standard error of the mean: Mean SD Naive SE Time-series SE mu -1.2667 0.56619 0.017904 0.017904 sig2 128.3996 9.42371 0.298004 0.298004 delta -0.1121 0.05032 0.001591 0.001591 g 1.8037 10.59165 0.334937 0.334937 2. Quantiles for each variable: 2.5% 25% 50% 75% 97.5% mu -2.38323 -1.6611 -1.2626 -0.87649 -0.17530 sig2 111.81072 121.7486 128.2914 134.16920 147.16271 delta -0.21117 -0.1458 -0.1123 -0.07679 -0.01459 g 0.06183 0.1763 0.3868 0.94495 8.79583
Here is the graph of the distributions:
Bayesian analysis can also be used to compare probabilities across several regression models.
As an example, let’s take the following regression model:
reg1 = lm(BMI ~ Insulin + Age + BloodPressure + Glucose + Pregnancies, data = health)
Generating a summary for this regression yields the following results:
> summary(reg1) Call: lm(formula = BMI ~ Insulin + Age + BloodPressure + Glucose + Pregnancies, data = health) Residuals: Min 1Q Median 3Q Max -33.502 -4.559 -0.220 4.400 29.738 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 20.669900 1.375338 15.029 < 2e-16 *** Insulin 0.008192 0.002486 3.295 0.00103 ** Age -0.044086 0.028317 -1.557 0.11992 BloodPressure 0.106763 0.014281 7.476 2.11e-13 *** Glucose 0.038988 0.009258 4.211 2.84e-05 *** Pregnancies 0.011194 0.094427 0.119 0.90567 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 7.377 on 762 degrees of freedom Multiple R-squared: 0.1302,Adjusted R-squared: 0.1245 F-statistic: 22.81 on 5 and 762 DF, p-value: < 2.2e-16
Again, we see that certain variables are indicated as statistically significant while others are not. However, the same problem arises in that p-values are not regarded as direct measures of evidence against the null hypothesis under the Bayesian school of thought, and thus more concrete probability measures should be used.
In this regard, a Bayesian regression is run, with a Bayes factor analysis indicating the highest and lowest probability regressions.
bf = regressionBF(BMI ~ Insulin + Age + BloodPressure + Glucose + Pregnancies, data = health) length(bf) head(bf, 3) tail(bf, 3) which.max(bf)
Here is the generated output:
> bf = regressionBF(BMI ~ Insulin + Age + BloodPressure + Glucose + Pregnancies, data = health) |==================================================================================================| 100% > length(bf)  31 > head(bf, 3) Bayes factor analysis --------------  Insulin + BloodPressure + Glucose : 1.352293e+19 ±0%  Insulin + Age + BloodPressure + Glucose : 7.763347e+18 ±0%  Insulin + BloodPressure + Glucose + Pregnancies : 2.389623e+18 ±0.01% Against denominator: Intercept only --- Bayes factor type: BFlinearModel, JZS > tail(bf, 3) Bayes factor analysis --------------  Age : 0.1321221 ±0%  Pregnancies : 0.09066687 ±0%  Age + Pregnancies : 0.01652438 ±0.01% Against denominator: Intercept only --- Bayes factor type: BFlinearModel, JZS > which.max(bf) Insulin + BloodPressure + Glucose 19
From here, we see that the following three combinations of variables in each regression demonstrate the reported highest probability in computing BMI:
- Insulin + BloodPressure + Glucose
- Insulin + Age + BloodPressure + Glucose
- Insulin + BloodPressure + Glucose + Pregnancies
Conversely, these three combinations of variables demonstrated the lowest reported probability:
- Age + Pregnancies
As indicated by which.max(bf), 19 regression models were generated in total.
In order to generate a plot of the probabilities for each combination, the probabilities of the top reported models can be divided against the probabilities of all the generated models:
bf2 = head(bf) / max(bf) bf2 plot(bf2)
The following results and graph are yielded:
> bf2 = head(bf) / max(bf) > bf2 Bayes factor analysis --------------  Insulin + BloodPressure + Glucose : 1 ±0%  Insulin + Age + BloodPressure + Glucose : 0.5740878 ±0%  Insulin + BloodPressure + Glucose + Pregnancies : 0.176709 ±0.01%  Insulin + Age + BloodPressure + Glucose + Pregnancies : 0.08419853 ±0%  Age + BloodPressure + Glucose : 0.02216035 ±0%  BloodPressure + Glucose : 0.01570813 ±0.01% Against denominator: BMI ~ Insulin + BloodPressure + Glucose --- Bayes factor type: BFlinearModel, JZS
Additionally, “top-down” and “bottom-up” analyses can be generated, where in the former each independent variable is eliminated one at a time, whereas in the latter each independent variable is added one at a time.
> bf_top Bayes factor top-down analysis -------------- When effect is omitted from Insulin + Age + BloodPressure + Glucose + Pregnancies , BF is...  Omit Pregnancies : 6.818265 ±0%  Omit Glucose : 0.001291526 ±0%  Omit BloodPressure : 2.526265e-11 ±0%  Omit Age : 2.098719 ±0.01%  Omit Insulin : 0.0350438 ±0% Against denominator: BMI ~ Insulin + Age + BloodPressure + Glucose + Pregnancies --- Bayes factor type: BFlinearModel, JZS > plot(bf_top) > > # Bottom-up > bf_btm = regressionBF(BMI ~ Insulin + Age + BloodPressure + Glucose + Pregnancies, data = health, whichModels = "bottom") |=================================================================| 100% > bf_btm Bayes factor analysis --------------  Insulin : 275178 ±0%  Age : 0.1321221 ±0%  BloodPressure : 2.926062e+12 ±0%  Glucose : 12805251 ±0%  Pregnancies : 0.09066687 ±0% Against denominator: Intercept only --- Bayes factor type: BFlinearModel, JZS
Here are the plots for each:
According to these two graphs, Blood Pressure is indicated as the most influential variable in explaining the indicative probability favouring H1 over H0.
Ultimately, the area of Bayesian statistics is very large and the examples above cover just the tip of the iceberg.
However, in this particular example we have looked at:
- The comparison between a t-test and the Bayes Factor t-test
- How to estimate posterior distributions using Markov chain Monte Carlo methods (MCMC)
- Use of regressionBF to compare probabilities across regression models