Linear models for predicting esophagus cancer likelihood

Did you know the R datasets package can be accessed in Python using the statsmodels library? Today we’ll look at the esoph dataset that contains records for 88 age/alcohol/tobacco combinations based on an esophageal cancer study in France.

First we need to load the data.

import statsmodels.api as sm
import pandas as pd

esoph = sm.datasets.get_rdataset('esoph')

df =
# Rename a column
df.columns = ['Age_group']+list(df.columns[1:])

<class ‘statsmodels.datasets.utils.Dataset’>


After cleaning the data (which can be seen along with the full code in my ipython notebook), it ended up looking like this:


Here we’re only seeing the first 5 elements i.e., df.head(). The features have been converted to dummy variables representing the different groups.


At a glance

A new column has been added called “positive_frac” which was found by calculating ncases/ncontrols for each row. It is the percentage of each age/alcohol/tobacco group that was diagnosed positive for esophagus cancer and we’re going to create linear models in an attempt to predict it. Let’s get some quick statistics about it:


count 88.000000
mean 0.346807
std      0.357342
min    0.000000
25%    0.000000
50%    0.267857
75%    0.583333
max    1.000000
Name: positive_frac, dtype: float64

So it can range from 0% to 100% of the group with an average of ~35% being diagnosed positive. Let’s take a look at how this is distributed depending on the age group using Seaborn’s FacetGrid() object to plot a set of histograms.

import seaborn as sns
colors = sns.color_palette("BrBG", 10)

g = sns.FacetGrid(df, row='Age_group', size=2,
                  aspect=4, legend_out=False), 'positive_frac', normed=True,



As could be expected, the younger groups are less likely to be diagnosed positive.


One feature models

There is clearly a trend of higher percentages for older age groups, let’s take a different look:

sns.regplot(y='positive_frac', x='Age_group_i', data=df,
            fit_reg = True, marker='o', color=colors[1])



Seaborn has produced a line of best fit that confirms our previous observation. We can make analogous plots for our other features.


I was surprised to see that alcohol seems to be more a more influential factor in esophagus cancer than tobacco!


Multivariate regression

We can build a model to predict the likelihood of positive diagnosis depending on multiple features. In general this will be a hyperplane with the equation

y = m_0 + \sum_i m_i x_i

where m_0 is the intercept and the other m‘s are the slopes. We have three independent variables:

  • alcohol consumption
  • tobacco consumption
  • age

Two feature models

For illustration purposes, let’s build a two-feature model where x_1 = ‘alcohol consumption’ and x_2 = ‘tobacco consumption’. We’ll use the ordinary least squares regression class of statsmodels.

from statsmodels.formula.api import ols

reg = ols('positive_frac ~ alcgp + tobgp', df).fit()

Intercept -0.173821
alcgp         0.176903
tobgp        0.035869
dtype: float64

In this case we’ll be able to plot the resulting hyperplane because it’s just a regular (i.e., 2D) plane. In the figures below I’ve done just that. The data is also plotted where the point size is determined by how many people were in the group, so a larger point represents a more statistically significant data-point [1].


This model was fit with an intercept (i.e., a non-zero value for m_0). We can remove this by doing the following:

reg = ols('positive_frac ~ alcgp + tobgp -1', df).fit()

alcgp  0.144220
tobgp 0.003582
dtype: float64

Notice that the m_1 and m_2 parameters have changed because the plane was re-fit with m_0 = 0. The hyperplane looks similar but is less slanted along the ‘Tobacco consumption’ axes.



Let’s compare predictions for the following combination of features:

  • alcohol => group 3 (80-119 g/day)
  • tobacco => group 4 (130+ g/day)

With the intercept we predict 50% positive diagnoses and without we predict ~45% [2].

Three feature models

The previous predictions did not account in any way for the age groups, and we have already seen from the single-feature models that it’s correlated to being positively diagnosed. We can include it like this:

reg = ols('positive_frac ~ alcgp + tobgp + Age_group_i',

Intercept       -0.615799
alcgp                0.180166
tobgp               0.047964
Age_group_i 0.119547
dtype: float64

It’s not practical to try and visualize the hyperplane in this case. The intercept can be removed the same way as before:

reg = ols('positive_frac ~ alcgp + tobgp + Age_group_i -1',

alcgp                0.099652
tobgp               -0.035494
Age_group_i 0.066731
dtype: float64

Wait, the coefficient for tobacco consumption m_2=-0.035 is negative? This means that a higher tobacco intake implies lower risk according to the zero-intercept model!

Now we can re-visit the test case from earlier and make predictions for each age group. We see a large variation depending on whether or not the intercept parameter m_0 is fit.


To give some indication as to the significance and accuracy of the three-feature models we can look at the residual plots.

Plotting the residuals

The linear models we’ve seen have been fit by minimizing the sum of the squared residuals S, defined as:

S = \sum_i r_i^2 = \sum_i \big[ \hat{y}_i - (m_0 + \sum_j m_j x_j) \big]^2,

where \hat{y}_i are the actual values we are trying to predict (i.e., the “positive_frac” column of our dataframe).

Below we plot r_i as a function of predictions y_i = m_0 + m_1 x_{1,i} + m_2 x_{2,i} + m_3 x_{3,i}. For the model where m_0=0 we get:


Whereas for the model with a fitted intercept we find a slightly tighter distribution (which can be confirmed, as I have done, by comparing the R-squared fit values for each model):


But a close look at the x-axis values reveals that we are predicting positive diagnosis fractions y_i to be negative in some cases, which makes no sense! This doesn’t mean the model won’t give more accurate predictions for the majority of age/alcohol/tobacco combinations, but it is worth noting!

As mentioned above, the full code used to make this post can be found in my ipython notebook.

Thanks for reading! If you would like to discuss my code or have any questions or corrections, please write a comment. You are also welcome to email me at or tweet me @agalea91

[1] – In this post I did not do weighted regression, as may be suggested by having different sizes data-points.

[2] – Please don’t think this means you have a ~1/2 chance of being diagnosed positive yourself if you fit into this category! We are simply modeling a study.


One thought on “Linear models for predicting esophagus cancer likelihood

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s