# Linear models for predicting esophagus cancer likelihood

Did you know the R datasets package can be accessed in Python using the statsmodels library? Today we’ll look at the esoph dataset that contains records for 88 age/alcohol/tobacco combinations based on an esophageal cancer study in France.

First we need to load the data.

import statsmodels.api as sm
import pandas as pd

esoph = sm.datasets.get_rdataset('esoph')
print(type(esoph))

df = esoph.data
# Rename a column
df.columns = ['Age_group']+list(df.columns[1:])


<class ‘statsmodels.datasets.utils.Dataset’>

After cleaning the data (which can be seen along with the full code in my ipython notebook), it ended up looking like this:

Here we’re only seeing the first 5 elements i.e., df.head(). The features have been converted to dummy variables representing the different groups.

## At a glance

A new column has been added called “positive_frac” which was found by calculating ncases/ncontrols for each row. It is the percentage of each age/alcohol/tobacco group that was diagnosed positive for esophagus cancer and we’re going to create linear models in an attempt to predict it. Let’s get some quick statistics about it:

df.describe()['positive_frac']


count 88.000000
mean 0.346807
std      0.357342
min    0.000000
25%    0.000000
50%    0.267857
75%    0.583333
max    1.000000
Name: positive_frac, dtype: float64

So it can range from 0% to 100% of the group with an average of ~35% being diagnosed positive. Let’s take a look at how this is distributed depending on the age group using Seaborn’s FacetGrid() object to plot a set of histograms.

import seaborn as sns
colors = sns.color_palette("BrBG", 10)

g = sns.FacetGrid(df, row='Age_group', size=2,
aspect=4, legend_out=False)

g.map(plt.hist, 'positive_frac', normed=True,
color=colors[3])


As could be expected, the younger groups are less likely to be diagnosed positive.

## One feature models

There is clearly a trend of higher percentages for older age groups, let’s take a different look:

sns.set_style('ticks')
sns.regplot(y='positive_frac', x='Age_group_i', data=df,
fit_reg = True, marker='o', color=colors[1])


Seaborn has produced a line of best fit that confirms our previous observation. We can make analogous plots for our other features.

I was surprised to see that alcohol seems to be more a more influential factor in esophagus cancer than tobacco!

## Multivariate regression

We can build a model to predict the likelihood of positive diagnosis depending on multiple features. In general this will be a hyperplane with the equation

$y = m_0 + \sum_i m_i x_i$

where $m_0$ is the intercept and the other $m$‘s are the slopes. We have three independent variables:

• alcohol consumption
• tobacco consumption
• age

### Two feature models

For illustration purposes, let’s build a two-feature model where $x_1 =$ ‘alcohol consumption’ and $x_2 =$ ‘tobacco consumption’. We’ll use the ordinary least squares regression class of statsmodels.

from statsmodels.formula.api import ols

reg = ols('positive_frac ~ alcgp + tobgp', df).fit()
reg.params


Intercept -0.173821
alcgp         0.176903
tobgp        0.035869
dtype: float64

In this case we’ll be able to plot the resulting hyperplane because it’s just a regular (i.e., 2D) plane. In the figures below I’ve done just that. The data is also plotted where the point size is determined by how many people were in the group, so a larger point represents a more statistically significant data-point [1].

This model was fit with an intercept (i.e., a non-zero value for $m_0$). We can remove this by doing the following:

reg = ols('positive_frac ~ alcgp + tobgp -1', df).fit()
reg.params


alcgp  0.144220
tobgp 0.003582
dtype: float64

Notice that the $m_1$ and $m_2$ parameters have changed because the plane was re-fit with $m_0 = 0$. The hyperplane looks similar but is less slanted along the ‘Tobacco consumption’ axes.

Let’s compare predictions for the following combination of features:

• alcohol => group 3 (80-119 g/day)
• tobacco => group 4 (130+ g/day)

With the intercept we predict 50% positive diagnoses and without we predict ~45% [2].

### Three feature models

The previous predictions did not account in any way for the age groups, and we have already seen from the single-feature models that it’s correlated to being positively diagnosed. We can include it like this:

reg = ols('positive_frac ~ alcgp + tobgp + Age_group_i',
df).fit()
reg.params


Intercept       -0.615799
alcgp                0.180166
tobgp               0.047964
Age_group_i 0.119547
dtype: float64

It’s not practical to try and visualize the hyperplane in this case. The intercept can be removed the same way as before:

reg = ols('positive_frac ~ alcgp + tobgp + Age_group_i -1',
df).fit()
reg.params


alcgp                0.099652
tobgp               -0.035494
Age_group_i 0.066731
dtype: float64

Wait, the coefficient for tobacco consumption $m_2=-0.035$ is negative? This means that a higher tobacco intake implies lower risk according to the zero-intercept model!

Now we can re-visit the test case from earlier and make predictions for each age group. We see a large variation depending on whether or not the intercept parameter $m_0$ is fit.

To give some indication as to the significance and accuracy of the three-feature models we can look at the residual plots.

### Plotting the residuals

The linear models we’ve seen have been fit by minimizing the sum of the squared residuals $S$, defined as:

$S = \sum_i r_i^2 = \sum_i \big[ \hat{y}_i - (m_0 + \sum_j m_j x_j) \big]^2$,

where $\hat{y}_i$ are the actual values we are trying to predict (i.e., the “positive_frac” column of our dataframe).

Below we plot $r_i$ as a function of predictions $y_i = m_0 + m_1 x_{1,i} + m_2 x_{2,i} + m_3 x_{3,i}$. For the model where $m_0=0$ we get:

Whereas for the model with a fitted intercept we find a slightly tighter distribution (which can be confirmed, as I have done, by comparing the R-squared fit values for each model):

But a close look at the x-axis values reveals that we are predicting positive diagnosis fractions $y_i$ to be negative in some cases, which makes no sense! This doesn’t mean the model won’t give more accurate predictions for the majority of age/alcohol/tobacco combinations, but it is worth noting!

As mentioned above, the full code used to make this post can be found in my ipython notebook.

Thanks for reading! If you would like to discuss my code or have any questions or corrections, please write a comment. You are also welcome to email me at agalea91@gmail.com or tweet me @agalea91

[1] – In this post I did not do weighted regression, as may be suggested by having different sizes data-points.

[2] – Please don’t think this means you have a ~1/2 chance of being diagnosed positive yourself if you fit into this category! We are simply modeling a study.