Did you know the R datasets package can be accessed in Python using the statsmodels library? Today we’ll look at the esoph dataset that contains records for 88 age/alcohol/tobacco combinations based on an esophageal cancer study in France.
First we need to load the data.
import statsmodels.api as sm import pandas as pd esoph = sm.datasets.get_rdataset('esoph') print(type(esoph)) df = esoph.data # Rename a column df.columns = ['Age_group']+list(df.columns[1:]) df.head()
After cleaning the data (which can be seen along with the full code in my ipython notebook), it ended up looking like this:
Here we’re only seeing the first 5 elements i.e., df.head(). The features have been converted to dummy variables representing the different groups.
At a glance
A new column has been added called “positive_frac” which was found by calculating ncases/ncontrols for each row. It is the percentage of each age/alcohol/tobacco group that was diagnosed positive for esophagus cancer and we’re going to create linear models in an attempt to predict it. Let’s get some quick statistics about it:
Name: positive_frac, dtype: float64
So it can range from 0% to 100% of the group with an average of ~35% being diagnosed positive. Let’s take a look at how this is distributed depending on the age group using Seaborn’s FacetGrid() object to plot a set of histograms.
import seaborn as sns colors = sns.color_palette("BrBG", 10) g = sns.FacetGrid(df, row='Age_group', size=2, aspect=4, legend_out=False) g.map(plt.hist, 'positive_frac', normed=True, color=colors)
As could be expected, the younger groups are less likely to be diagnosed positive.
One feature models
There is clearly a trend of higher percentages for older age groups, let’s take a different look:
sns.set_style('ticks') sns.regplot(y='positive_frac', x='Age_group_i', data=df, fit_reg = True, marker='o', color=colors)
Seaborn has produced a line of best fit that confirms our previous observation. We can make analogous plots for our other features.
I was surprised to see that alcohol seems to be more a more influential factor in esophagus cancer than tobacco!
We can build a model to predict the likelihood of positive diagnosis depending on multiple features. In general this will be a hyperplane with the equation
where is the intercept and the other ‘s are the slopes. We have three independent variables:
- alcohol consumption
- tobacco consumption
Two feature models
For illustration purposes, let’s build a two-feature model where ‘alcohol consumption’ and ‘tobacco consumption’. We’ll use the ordinary least squares regression class of statsmodels.
from statsmodels.formula.api import ols reg = ols('positive_frac ~ alcgp + tobgp', df).fit() reg.params
In this case we’ll be able to plot the resulting hyperplane because it’s just a regular (i.e., 2D) plane. In the figures below I’ve done just that. The data is also plotted where the point size is determined by how many people were in the group, so a larger point represents a more statistically significant data-point .
This model was fit with an intercept (i.e., a non-zero value for ). We can remove this by doing the following:
reg = ols('positive_frac ~ alcgp + tobgp -1', df).fit() reg.params
Notice that the and parameters have changed because the plane was re-fit with . The hyperplane looks similar but is less slanted along the ‘Tobacco consumption’ axes.
Let’s compare predictions for the following combination of features:
- alcohol => group 3 (80-119 g/day)
- tobacco => group 4 (130+ g/day)
With the intercept we predict 50% positive diagnoses and without we predict ~45% .
Three feature models
The previous predictions did not account in any way for the age groups, and we have already seen from the single-feature models that it’s correlated to being positively diagnosed. We can include it like this:
reg = ols('positive_frac ~ alcgp + tobgp + Age_group_i', df).fit() reg.params
It’s not practical to try and visualize the hyperplane in this case. The intercept can be removed the same way as before:
reg = ols('positive_frac ~ alcgp + tobgp + Age_group_i -1', df).fit() reg.params
Wait, the coefficient for tobacco consumption is negative? This means that a higher tobacco intake implies lower risk according to the zero-intercept model!
Now we can re-visit the test case from earlier and make predictions for each age group. We see a large variation depending on whether or not the intercept parameter is fit.
To give some indication as to the significance and accuracy of the three-feature models we can look at the residual plots.
Plotting the residuals
The linear models we’ve seen have been fit by minimizing the sum of the squared residuals , defined as:
where are the actual values we are trying to predict (i.e., the “positive_frac” column of our dataframe).
Below we plot as a function of predictions . For the model where we get:
Whereas for the model with a fitted intercept we find a slightly tighter distribution (which can be confirmed, as I have done, by comparing the R-squared fit values for each model):
But a close look at the x-axis values reveals that we are predicting positive diagnosis fractions to be negative in some cases, which makes no sense! This doesn’t mean the model won’t give more accurate predictions for the majority of age/alcohol/tobacco combinations, but it is worth noting!
As mentioned above, the full code used to make this post can be found in my ipython notebook.
Thanks for reading! If you would like to discuss my code or have any questions or corrections, please write a comment. You are also welcome to email me at email@example.com or tweet me @agalea91
 – In this post I did not do weighted regression, as may be suggested by having different sizes data-points.
 – Please don’t think this means you have a ~1/2 chance of being diagnosed positive yourself if you fit into this category! We are simply modeling a study.