Bayesian Linear Regression on the Swiss dataset

Today we are again walking through a multivariate linear regression method (see my previous post on the topic here). This time, however, we take the Bayesian approach and carry out all of the analysis and modeling in R. My relationship with R has been tempestuous to say the least, but the more I use it the more enjoyable it becomes.

Import R libraries

The first thing to do is load the libraries we'll be using. For example, the MASS library gives us access to the stepAIC function, and the dplyr library lets us use the piping operator %>%.

library(ggplot2)
library(GGally)
library(dplyr)
library(BAS)
library(MASS)

Please note: I will be using “=” in place of “<-” when writing R code because WordPress has a bad habit of changing my < characters in code snippets.

 

The Swiss dataset

The swiss dataset contains 47 observations on 6 variables.

# Store the swiss dataframe in memory
data(swiss)
# Create a pairplot
ggpairs(swiss)

[Figure: ggpairs plot of the swiss dataset]

Each sample is for a province of Switzerland, and we are given a standardized fertility measure, the % of males working in agriculture, the % of draftees receiving the highest mark on an army examination, the % of draftees with education beyond primary school, the % Catholic population, and the infant mortality rate. The data is from around the year 1888, by the way. We’ll use Bayesian linear regression to model the fertility of the population, but first let’s start with a Frequentist approach: Ordinary Least Squares (OLS).

 

Ordinary least squares

For OLS we model y as a function of x_1, x_2, ... with the equation:

y = \beta_0 + \beta_1\cdot x_1 + \beta_2\cdot x_2 + ...

and solve for the parameters \beta_0, \beta_1, \beta_2... by minimizing the least squares objective function.
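Explicitly, the objective being minimized is the sum of squared residuals over the n samples:

\sum_{i=1}^{n}\left( y_i - (\beta_0 + \beta_1\cdot x_{1,i} + \beta_2\cdot x_{2,i} + ...) \right)^2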

In R this can be done as follows, where Fertility is modeled as a function of every other feature (as indicated by the . in the model formula).

swiss.lm_full = lm(formula = Fertility ~ ., data = swiss)

What will happen if we try and plot the resulting line of best fit?

# Set up dataframe containing predictions
predict = data.frame(predict(swiss.lm_full))
predict$x = swiss$Agriculture
names(predict) = c('y', 'x')

# Plot data and predictions
p = ggplot() + geom_point(data = swiss, aes(Agriculture, Fertility, color='black'), size=3)
p = p + geom_line(data = predict, aes(x=x, y=y, color='red'), alpha=0.8, size=1)
p + scale_colour_manual(name='', values=c('black', 'red'), labels=c('y_true', 'y_predict'))

 

[Figure: Fertility vs. Agriculture with the OLS predictions joined by a line]

Expecting the line of best fit to be straight? We are fitting a model with 5 features, so we would need all 5 feature dimensions (plus one more for the response) to illustrate the fitted hyperplane. Since none of us have that many dimensions lying around, we’ll just have to trust the math on this one. By now you may have already realized that the plot above is not even valid, because we are simply drawing lines between predicted points. The figure should look like this:

p = ggplot() + geom_point(data = swiss, aes(Agriculture, Fertility, color='black'), size=3)
p = p + geom_point(data = predict, aes(x=x, y=y, color ='red'), size=3, shape=1)
p + scale_colour_manual(name='', values=c('black', 'red'), labels=c('y_true', 'y_predict'))

[Figure: Fertility vs. Agriculture with the OLS predictions shown as hollow red points]

This is awful to look at; the information is better conveyed by a residual plot, where we plot the differences between the black filled points and the red hollow ones.
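Here is a minimal sketch of how such a residual plot could be made, reusing the predict dataframe from above (the exact figure in the post may have been produced differently):

# Compute residuals (y_true - y_predict) and plot them against Agriculture
resid_df = data.frame(x = swiss$Agriculture,
                      residual = swiss$Fertility - predict$y)
ggplot(resid_df, aes(x=x, y=residual)) +
    geom_point(size=3) +
    geom_hline(yintercept=0, linetype='dashed') +
    labs(x='Agriculture', y='Residual')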

[Figure: residuals of the OLS model plotted against Agriculture]

The model above was trained on all of the features, but it may be better to use only a subset. One method of finding a good subset is stepwise selection with the stepAIC function. By default it minimizes the AIC, but by passing k = log(n) it minimizes the Bayesian Information Criterion (BIC) instead. The BIC ranks models according to goodness of fit but adds a penalty of log(n) for each parameter, where n is the number of observations.

stepAIC(lm(Fertility ~ ., data = swiss), k=log(nrow(swiss)))

[Output: stepAIC trace of the stepwise BIC search]

As can be seen, the BIC was reduced by removing the “Examination” feature. After this step it was found that no lower value could be achieved by removing additional features and the algorithm ended.
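If you want to fit the reduced model identified by the search directly, a quick sketch is to drop “Examination” from the full formula:

swiss.lm_red = lm(Fertility ~ . - Examination, data = swiss)
summary(swiss.lm_red)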

Bayesian linear regression

In Bayesian linear regression we write an equation similar to the OLS one:

y_i = \beta_0 + \beta_1\cdot x_{1,i} + \beta_2\cdot x_{2,i} + ... + \epsilon_i

where i represents the sample number and \epsilon_i is the error of each sample. Before revealing how the parameters \beta_0, \beta_1, ... are determined [1], let’s talk about the errors.

By rearranging, we could calculate \epsilon_i for a given sample by evaluating y_i - (\beta_0 + \beta_1\cdot x_{1,i} + ...). The errors \epsilon are assumed to be normally distributed with a mean of 0. We can check this assumption for the OLS swiss dataset model by solving for each \epsilon_i and plotting the distribution. In other words, we plot a histogram of the residuals:

# Compute errors
errors = resid(swiss.lm_full)
# Plot histogram and fitted line
as.data.frame(errors) %>% ggplot(aes(errors)) +
    geom_histogram(binwidth=1.5, aes(y=..density..)) +
    geom_density(adjust=1.2, size=1, color='red') +
    xlim(-23, 23)

[Figure: histogram of the OLS residuals with a fitted density curve]

Even with this small dataset of 47 samples we see the normal distribution beginning to take shape, as suggested by the red curve.
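If you'd like a slightly more formal check than eyeballing the histogram, a quick sketch using base R is a normal Q-Q plot plus the Shapiro-Wilk test:

# Normal Q-Q plot of the residuals
qqnorm(errors)
qqline(errors)
# Shapiro-Wilk test: a large p-value is consistent with normality
shapiro.test(errors)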

In Bayesian regression we assign prior probability distributions to the parameters \beta_0, \beta_1, ... and use a likelihood function to determine the posterior using Bayes’ rule. For a given parameter \beta_i this rule can be stated as:

p(\beta_i|\mathbf{y}) \propto p(\mathbf{y}|\beta_i)\cdot p(\beta_i)

where p(\beta_i) is the prior distribution of \beta_i, p(\beta_i|\mathbf{y}) is the posterior distribution given the data \mathbf{y}, and the remaining term is the likelihood [2].
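Written with the normalizing constant made explicit, the same rule reads:

p(\beta_i|\mathbf{y}) = \frac{p(\mathbf{y}|\beta_i)\cdot p(\beta_i)}{p(\mathbf{y})}

Since p(\mathbf{y}) does not depend on \beta_i, it is usually dropped and we work with the proportionality above.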

We can see how the posterior will in principle depend on the choice of both prior and likelihood, but in this post we never explicitly define any priors because they are dominated by the likelihood under our BIC assumptions. For more details, check out the top answer to my Stack Exchange question.

Once we have determined the posterior distribution for each \beta_i we can set the parameters for our linear model. Our choice should depend on the loss function we wish to minimize: for an absolute (linear) loss function we should take the posterior median, and for a quadratic loss function (the one used in OLS) we should take the posterior mean. In this post our posteriors are symmetric, so the two choices are equivalent.
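If you want to convince yourself of this, here is a quick numerical check using made-up posterior samples (purely illustrative, not tied to the swiss models):

set.seed(42)
beta_samples = rnorm(10000, mean=2, sd=1)
candidates = seq(0, 4, by=0.01)
# Expected quadratic loss is minimized near the posterior mean...
sq_loss = sapply(candidates, function(b) mean((beta_samples - b)^2))
candidates[which.min(sq_loss)]
# ...and expected absolute (linear) loss near the posterior median
abs_loss = sapply(candidates, function(b) mean(abs(beta_samples - b)))
candidates[which.min(abs_loss)]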

To implement this in R we’ll use the BAS library (loaded at the top of the post) and its bas.lm function to evaluate a set of Bayesian models containing different combinations of features. We can then make predictions using various combinations of the resulting models.

swiss.lm_bay = bas.lm(Fertility ~ ., data = swiss,
                      prior = 'BIC', modelprior = uniform())
swiss.lm_bay

[Output: bas.lm print-out showing marginal inclusion probabilities for each feature]

Just as with our linear models earlier, we feed in all of the features using the dot (.) and specify “Fertility” as the response. The printed output includes the marginal posterior inclusion probability of each feature, given the data used to fit the models.
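If you want these inclusion probabilities as a plain vector rather than the printed summary, they are stored in the fitted object (if I recall the BAS internals correctly, in a component called probne0):

# Marginal posterior inclusion probabilities, one per term
swiss.lm_bay$probne0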

Let’s not worry about the \beta_i parameters for specific models just yet and turn our attention to the probabilities of the models. The prior distribution for the models is uniform, as can be confirmed with the following code:

swiss.lm_bay$priorprobs

[Output: prior model probabilities]

These are updated to:

swiss.lm_bay$postprobs

[Output: posterior model probabilities]

which can be illustrated using the image function.

image(swiss.lm_bay, rotate=FALSE)

[Figure: model rank image produced by image(swiss.lm_bay)]

Here we see the models ranked by their posterior odds, where black squares indicate which features are left out of each model. Just as with our stepAIC feature reduction earlier, “Examination” can be identified as a poor feature for making predictions about fertility.

For a more quantitative summary of the top models we can do:

summary.bas(swiss.lm_bay)

[Output: summary of the top Bayesian models with posterior probabilities and R^2 values]

This gives access to the posterior probability of the top models side-by-side with R^2 values. Notice how the model with the largest R^2 does not have the largest probability!

As promised, we’ll now return to the parameter probabilities and plot the \beta coefficient posterior distribution for each feature. The code below uses the model averaging approach to calculate these distributions.

par(mfrow = c(1,2))
plot(coefficients(swiss.lm_bay))

[Figures: posterior distributions of the \beta coefficients for each feature]

Notice how our weakest feature, “Examination”, has a large overlap with 0. In each plot the black vertical line at x=0 has height equal to the posterior probability that the coefficient is exactly zero, i.e., that the feature is excluded from the model.
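For a numerical summary of these distributions, credible intervals can be pulled from the coefficient object (assuming your version of BAS provides a confint method for it):

# Credible intervals for the coefficients under model averaging
beta = coefficients(swiss.lm_bay)
confint(beta)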

Making predictions

Since we didn’t hold out any data during training, we have nothing to test our model on. Let’s swiftly fix that by breaking our dataframe into training and testing pieces:

set.seed(1)
n = nrow(swiss)
train = sample(1:n, size = round(0.6*n), replace=FALSE)
swiss.train = swiss[train,]
swiss.test = swiss[-train,]

and training a new set of models:

swiss.lm_bay = bas.lm(Fertility ~ ., data = swiss.train,
                      prior = 'BIC', modelprior = uniform())

Now we can compare the predictive performance of the following estimators:

  • BMA: Bayesian Model Averaging (predictions averaged over all models, weighted by their posterior probabilities)
  • BPM: Best Predictive Model (the single model whose predictions are closest to the BMA predictions under a given loss function, e.g., squared error)
  • MPM: Median Probability Model (includes all predictors whose marginal probabilities of being non-zero are above 50%)
  • HPM: Highest Probability Model
# Set up matrix to store results in
results = matrix(NA, ncol=4, nrow=1)
colnames(results) = c('BMA', 'BPM', 'MPM', 'HPM')

# Make predictions for each aggregated model
for (name in colnames(results)) {
    y_pred = predict(swiss.lm_bay, swiss.test, estimator=name)$fit
    results[1, name] = cv.summary.bas(y_pred, swiss.test$Fertility)
}

# Print results
options(digits = 4)
results

[Output: prediction errors for BMA, BPM, MPM, and HPM on the test set]

In each case the performance is similar, with the BMA model appearing to be the best and BPM the worst. Unfortunately we cannot trust these results because they depend too much on the particular train/test split. To get results we can trust, we’ll repeat the exercise over many random splits and compare the prediction errors [3], as seen below.

set.seed(99)
results = matrix(NA, ncol=4, nrow=10)
colnames(results) = c('BMA', 'BPM', 'MPM', 'HPM')

for (i in 1:10) {
    n = nrow(swiss)
    train = sample(1:n, size = round(0.6*n), replace=FALSE)
    swiss.train = swiss[train,]
    swiss.test = swiss[-train,]
    swiss.lm_bay = bas.lm(Fertility ~ ., data = swiss.train,
                          prior = 'BIC', modelprior = uniform())

    for (name in colnames(results)) {
        y_pred = predict(swiss.lm_bay, swiss.test, estimator=name)$fit
        results[i, name] = cv.summary.bas(y_pred, swiss.test$Fertility)
    }
}

boxplot(results)

[Figure: boxplot of prediction errors for each estimator over 10 random train/test splits]

Now we can see that the methods perform comparably: the variation across splits is larger than the differences between them.
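To put numbers to the boxplot, we can summarize the same results matrix:

# Mean and standard deviation of the prediction error over the 10 splits
apply(results, 2, mean)
apply(results, 2, sd)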

If you’re still reading this, and especially if you have been following along in RStudio, then perhaps you are willing to take on a homework task of comparing results when using different priors. What happens when you run K-fold cross validation with the substitution below?

swiss.lm_bay = bas.lm(Fertility ~ ., data = swiss.train,
                      prior = 'g-prior',
                      modelprior = beta.binomial(1,1))

 

Thanks for reading! You can find a link to the RStudio markdown file here.

If you would like to discuss anything or have questions/corrections then please write a comment, email me at agalea91@gmail.com, or tweet me @agalea91

 

[1] – As well as the \beta‘s, in Bayesian linear regression we analogously solve for the variance \sigma^2 of the error term \epsilon. This also involves setting a prior distribution and using a likelihood function to determine the posterior.

[2] – The posterior can be calculated using conjugacy, which occurs when the prior and posterior distributions belong to the same family (with different parameters). By selecting an appropriate prior for a given likelihood, this property can be used to determine the posterior analytically.
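For example, a normal prior on a mean \mu combined with a normal likelihood of known variance gives a posterior for \mu that is again normal:

\mu \sim N(\mu_0, \tau_0^2), \quad y_i|\mu \sim N(\mu, \sigma^2) \quad\Rightarrow\quad \mu|\mathbf{y} \sim N(\mu_n, \tau_n^2)

for suitable posterior parameters \mu_n and \tau_n^2.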

[3] – As pointed out to me by a reddit user (named questionquality), what I am doing here is not K-fold cross validation and I have edited the post accordingly. For K-fold testing we could do something like this:

n = nrow(swiss)
folds = caret::createFolds(1:n, k=10)
for (fold in folds) {
    swiss.train = swiss[-fold,]
    swiss.test = swiss[fold,]
    # fit bas.lm on swiss.train and evaluate on swiss.test, as above
}

Again I would like to acknowledge reddit user questionquality for this code.

