Since the results are there on Twitter for all to see, I decided to take a peek at exactly what happened.

Ink Master is a reality competition television series on Spike TV, where contestants are judged on their tattooing ability. This year saw the 8th season of the competition, and the live season finale ran last week, on Tuesday, December 6th, 2016.

Three of the original 18 contestants made it to the finale, where they had to ink a 6-hour “live tattoo” that was judged by Twitter users. The person receiving the most votes progressed to the final stage of judging, while one of the other two was eliminated on the spot.

After each completed “live tattoo” was revealed, viewers were given a small window of about 15 minutes to post tweets that would be counted as votes. As per the official rules, a Twitter user could cast as many votes as they wanted and all would be tallied. A tweet had to contain the hashtag #InkKelly, #InkGian, or #InkRyan to be counted as a vote.

Here are the three finalists and their “live tattoos”:

A tweet-mining Python script was used to collect posts that contain #InkKelly, #InkGian, or #InkRyan. The analysis was done in this ipython notebook under the following assumptions / conditions:

- Posts made in between 10:32 and 10:45 (the official voting period) are counted as votes
- Retweets do not count as votes
- A single post can be used to vote for more than one competitor; this did not occur very often, but there are no rules against it
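The counting logic under these conditions is simple enough to sketch. The snippet below is a toy illustration only: the tweets, timestamps, and counts are all made up for the example.

```python
from datetime import datetime

# Made-up tweets: (text, timestamp, is_retweet)
tweets = [
    ('Killing it tonight! #InkRyan', datetime(2016, 12, 6, 22, 35), False),
    ('#InkKelly all the way', datetime(2016, 12, 6, 22, 40), False),
    ('RT @someone: #InkRyan', datetime(2016, 12, 6, 22, 36), True),   # retweet: not a vote
    ('#InkGian #InkRyan', datetime(2016, 12, 6, 22, 41), False),      # one post, two votes
    ('#InkRyan too late!', datetime(2016, 12, 6, 23, 30), False),     # outside the window
]

# Official voting window
start = datetime(2016, 12, 6, 22, 32)
end = datetime(2016, 12, 6, 22, 45)

votes = {'#InkKelly': 0, '#InkGian': 0, '#InkRyan': 0}
for text, created, is_rt in tweets:
    # Skip retweets and posts outside the voting period
    if is_rt or not (start <= created <= end):
        continue
    # A single post can vote for more than one competitor
    for tag in votes:
        if tag.lower() in text.lower():
            votes[tag] += 1

print(votes)  # -> {'#InkKelly': 1, '#InkGian': 1, '#InkRyan': 2}
```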

I ended up counting approximately 65,000 votes for Kelly, Gian and Ryan combined [1]. These were split as seen below:

The time series plot shows how posts were heavily localized to the voting period. In particular, there was a large influx of posts right after the voting window opened at around 10:32.

All competitors were pretty even for the first few seconds, but Ryan’s votes soon spiked to a rate of 300 per second and continued at a higher rate than Kelly and Gian through the remainder of the voting period.
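For reference, a per-second rate like this can be computed with a pandas resample; here is a minimal sketch using made-up timestamps (one per collected vote in practice):

```python
import pandas as pd

# Made-up tweet timestamps
times = pd.to_datetime([
    '2016-12-06 22:32:01', '2016-12-06 22:32:01',
    '2016-12-06 22:32:02', '2016-12-06 22:32:04',
])

# Votes per second, including empty seconds
rate = pd.Series(1, index=times).resample('1s').sum()
print(rate.tolist())  # -> [2, 1, 0, 1]
```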

Ryan indeed advanced to the final round on the strength of the Twitter vote. The judges proceeded to select Gian to compete with her for the title of Ink Master, and Ryan ended up coming out on top.

Initially I was surprised that users would be allowed to vote multiple times, as it would open the door to spamming. However I later realized the effectiveness of this tactic as a means of generating more exposure for the show. As it turns out, this rule had no effect on the outcome. Counting only one vote per user, the time series looks much the same:

Subtle differences are more visible later in the voting period, as many users had posted already and additional posts were being logged more frequently.

Most users, about 75%, voted only once. Among those who voted multiple times, the mean number of posts was 3.3 and the median was 2. So most people didn’t really spam at all. But some did. Here are the highest vote counts I found:
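Per-user statistics like these fall straight out of a value_counts over the voters; here is a sketch on made-up data, assuming a screen_name column like the one used in my tweet dataframes:

```python
import pandas as pd

# Made-up voters: one row per vote
df = pd.DataFrame({'screen_name': ['a', 'b', 'b', 'c', 'c', 'c', 'd']})

# Votes per user, sorted with the biggest spammers first
counts = df.screen_name.value_counts()
print(counts.head())

single = (counts == 1).mean()   # fraction of users who voted once
multi = counts[counts > 1]      # users who voted more than once
print(single, multi.mean(), multi.median())
```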

I really enjoy watching Ink Master and wanted to give a shout out to all the competitors this year for their hard work.

In particular, the three finalists created 24-hour chest tattoos which were all phenomenal and deserve to be seen. They are included below.

Thanks for reading. As mentioned above, you can see my analysis in an ipython notebook on GitHub. Please direct any questions or comments to my Twitter account @agalea91 or post below.

[1] – I am missing votes from posts that were deleted after the voting period ended. Presumably, at least, the number of deleted posts should be similar for each contestant.

As usual, the complete work can be seen in my ipython notebook.

Using the Python libraries requests and Beautiful Soup, this can be done as seen below. We first get the names of current NHL captains.

import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/List_of_current_NHL_captains_and_alternate_captains'
page = requests.get(url)
print('Got %s type object of url using the requests library' % str(type(page)))

*>> Got <class 'requests.models.Response'> type object of url using the requests library*

soup = BeautifulSoup(page.content, 'html.parser')
print('Fed %s type object into BeautifulSoup to create a %s type object'
      % (str(type(page.content)), str(type(soup))))

*>> Fed <class 'bytes'> type object into BeautifulSoup to create a <class 'bs4.BeautifulSoup'> type object*

page_tables = soup.findAll('table')
print('Got %s type object of length %d' % (str(type(page_tables)), len(page_tables)))

*>> Got <class 'bs4.element.ResultSet'> type object of length 11*

Each table in the page is an item of page_tables, and they can be iterated over. For example:

for table in page_tables:
    print(table.find('caption'))

*>> Position abbreviations*

*>> List of current NHL Captains*

*>> List of current NHL Alternate Captains*

*>> None*

*>> None*

*>> …etc*

We are clearly interested in the 2nd and 3rd items in page_tables. There are many ways to extract the data at this point. We’ll iterate through the cells in our table, using i % 3 to pick out only one of the 3 columns.

# Getting the captains
C_players = []
for i, n in enumerate(page_tables[1].findAll('td')):
    if i % 3 == 0:
        try:
            C_players.append(n.findAll('a')[0].text)
        except:  # N/A entry
            print('Skipping entry:', n.contents)

We can do something similar to get the alternate captains.
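As a sketch, the captains loop can be wrapped into a reusable function and pointed at the alternate-captains table. The miniature HTML below is made up to mimic the Wikipedia layout (three columns, player name linked in the first); on the real page the alternates would come from something like page_tables[2].

```python
from bs4 import BeautifulSoup

# Made-up miniature table mimicking the Wikipedia layout
html = """
<table>
  <tr><td><a>Zdeno Chara</a></td><td>D</td><td>BOS</td></tr>
  <tr><td><a>Henrik Sedin</a></td><td>C</td><td>VAN</td></tr>
</table>
"""
table = BeautifulSoup(html, 'html.parser')

def scrape_names(table, n_cols=3, name_col=0):
    # Walk the cells and keep the linked name from one column
    names = []
    for i, cell in enumerate(table.findAll('td')):
        if i % n_cols == name_col:
            links = cell.findAll('a')
            if links:  # skip N/A entries
                names.append(links[0].text)
    return names

print(scrape_names(table))  # -> ['Zdeno Chara', 'Henrik Sedin']
```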

The world cup rosters can be acquired in a similar way, as seen in my ipython notebook linked to at the top of this post.

At this point we have the following data in memory:

- players – a list containing lists of players on each team
- teams – the World Cup team names
- C_players – list of NHL team captains
- A_players – list of NHL team alternate captains

We could get the total number of captains playing in the tournament as follows:

# Flatten player list
all_players = [p for p_list in players for p in p_list]

# Initialize counters
N_C, N_A = 0, 0
for player in all_players:
    if player in C_players:
        N_C += 1
    elif player in A_players:
        N_A += 1

print('%d captains and %d alternates' % (N_C, N_A))

*>> 18 captains and 25 alternates*

This turns out to be 69% of the NHL team captains and 42% of the alternates. Overall about half of the NHL captains are playing in this tournament.

Building the above counter into a function, we can apply it to each list of players. Using the pandas library, we can then build the following dataframe:
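That counting function might look something like the sketch below. The team lists here are made-up miniatures (the real inputs come from the scraping steps above), and the column names are chosen to match the Bokeh call that follows.

```python
import pandas as pd

def count_caps(team_players, C_players, A_players):
    # Count how many of a team's players are NHL captains / alternates
    n_C = sum(p in C_players for p in team_players)
    n_A = sum(p in A_players for p in team_players)
    return n_C, n_A

# Made-up miniature inputs
teams = ['Canada', 'Finland']
players = [['Sidney Crosby', 'Drew Doughty'], ['Mikko Koivu']]
C_players = ['Sidney Crosby', 'Mikko Koivu']
A_players = ['Drew Doughty']

rows = [(team,) + count_caps(p_list, C_players, A_players)
        for team, p_list in zip(teams, players)]
df = pd.DataFrame(rows, columns=['Team', 'Number of Captains',
                                 'Number of Alternate Captains'])

# Order by total number of captains, as in the chart below
df = df.sort_values('Number of Captains', ascending=False)
print(df)
```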

With this dataframe (named df), we can use Bokeh to create the stacked bar plot as follows [1]:

# Import libraries and setup for ipython notebook display
from bokeh.charts import Bar
from bokeh.charts.attributes import color, cat
from bokeh.charts.operations import blend
from bokeh.io import output_notebook, show
output_notebook()

# Make the plot
bar = Bar(df,
          values=blend('Number of Captains', 'Number of Alternate Captains',
                       name='Number of Captains', labels_name='caps'),
          stack=cat(columns='caps', sort=False),
          label=cat(columns='Team', sort=False),
          color=color(columns='caps',
                      palette=['OrangeRed', 'Orange'],
                      sort=False),
          title='2016 World Cup of Hockey NHL Captains',
          legend='top_right')
show(bar)

This method is good for creating charts where the data is ordered in the same way as in the dataframe. In this case we ordered df by the total number of captains.

Most of my time on this little project was spent trying to get some of Bokeh’s interactive plot features working. In particular, I wanted to use the hover tool to show the names of the players in each bar. Unfortunately it is not possible to do this easily with the current version of Bokeh and I could not figure out how to get it working. For more details check out my stack overflow question about the issue.

Here is the list of captains / alternates on each team:

for T, C, A in zip(df.Team, df.C, df.A):
    print(T, '\nC -', C, '\nA -', A, '\n')

**Canada**

C – Alex Pietrangelo, Sidney Crosby, Ryan Getzlaf, Claude Giroux, Steven Stamkos, John Tavares, Jonathan Toews

A – Drew Doughty, Shea Weber, Patrice Bergeron, Logan Couture, Ryan O’Reilly, Corey Perry, Joe Thornton

**United States**

C – Ryan McDonagh, Max Pacioretty, Joe Pavelski, Blake Wheeler

A – Dustin Byfuglien, Ryan Suter, Brandon Dubinsky, Ryan Kesler, Zach Parise, Derek Stepan

**Sweden**

C – Erik Karlsson, Gabriel Landeskog, Henrik Sedin

A – Oliver Ekman-Larsson, Nicklas Backstrom, Daniel Sedin

**Team Europe**

C – Zdeno Chara, Anze Kopitar

A – Roman Josi, Mark Streit

**Russia**

C – Alexander Ovechkin

A – Andrei Markov, Evgeni Malkin

**Czech Republic**

C –

A – Martin Hanzal, Tomas Plekanec

**Finland**

C – Mikko Koivu

A – Jussi Jokinen

**Team North America**

C –

A – Ryan Nugent-Hopkins, Mark Scheifele

Thanks for reading! If you would like to discuss anything or have questions/corrections then please write a comment, email me at agalea91@gmail.com, or tweet me @agalea91

[1] – I used this example as a guide from the Bokeh docs to create my bar plot.

First thing to do is load up the libraries we’ll be using. For example, the MASS library gives us access to the stepAIC function, and the dplyr library lets us use the piping operator %>%.

library(ggplot2)
library(GGally)
library(dplyr)
library(BAS)
library(MASS)

Please note: I will be using “=” in place of “<-” when writing R code because wordpress has a bad habit of changing my < characters in code snippets.

The swiss dataset contains 47 observations on 6 variables.

# Store the swiss dataframe in memory
data(swiss)

# Create a pairplot
ggpairs(swiss)

Each sample is for a province in Switzerland and we are given the fertility measure, % of males involved in an agriculture occupation, % of draftees receiving the highest mark on an army examination, % of draftees with education beyond primary school, % catholic population, and infant mortality rates. The data is from the year 1888 by the way. We’ll use Bayesian linear regression to model the fertility of the population, but first let’s start with a Frequentist approach: Ordinary Least Squares (OLS).

For OLS we model the response $y$ as a function of the features $x_1, \dots, x_p$ with the equation:

$y = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p + \epsilon$

and solve for the parameters $\beta_j$ by minimizing the least squares objective function.

In R this can be done as follows, where fertility is modeled as a function of each feature (as indicated by the . in the model equation).

swiss.lm_full = lm(formula = Fertility ~ ., data = swiss)

What will happen if we try and plot the resulting line of best fit?

# Set up dataframe containing predictions
predict = data.frame(predict(swiss.lm_full))
predict$x = swiss$Agriculture
names(predict) = c('y', 'x')

# Plot data and predictions
p = ggplot() +
  geom_point(data = swiss, aes(Agriculture, Fertility, color='black'), size=3)
p = p + geom_line(data = predict, aes(x=x, y=y, color='red', alpha=0.8), size=1)
p + scale_colour_manual(name='', values=c('black', 'red'),
                        labels=c('y_true', 'y_predict'))

Expecting the line of best fit to be straight? We are fitting a model with 5 features, so we would need 5-dimensional space to illustrate the linear hyperplane. Since none of us have 5 dimensions lying around, we’ll just have to trust the math on this one. By now you may have already realized that the plot above is not even valid, because we are simply drawing lines between predicted points. The figure should look like this:

p = ggplot() +
  geom_point(data = swiss, aes(Agriculture, Fertility, color='black'), size=3)
p = p + geom_point(data = predict, aes(x=x, y=y, color='red'), size=3, shape=1)
p + scale_colour_manual(name='', values=c('black', 'red'),
                        labels=c('y_true', 'y_predict'))

This is awful to look at and can better be interpreted as a residual plot, where we plot the differences between the black filled points and red hollow ones.

The model above was trained on all of the features, but it may be better to use only a subset. One method of determining the optimal subset of features is with the stepAIC function, which (when passed k = log(n)) minimizes the Bayesian Information Criterion (BIC). This metric ranks the models according to goodness of fit but includes a penalty for having more parameters that goes as $k \ln(n)$, where $k$ is the number of parameters and $n$ is the number of samples.

stepAIC(lm(Fertility ~ ., data = swiss), k = log(nrow(swiss)))

As can be seen, the BIC was reduced by removing the “Examination” feature. After this step it was found that no lower value could be achieved by removing additional features and the algorithm ended.

In Bayesian linear regression we write a similar equation to the OLS method:

$y_i = \beta_0 + \beta_1 x_{i,1} + \dots + \beta_p x_{i,p} + \epsilon_i$

where $i$ represents the sample number and $\epsilon_i$ is the error of each sample. Before revealing how the parameters are determined [1], let’s talk about the errors.

By rearranging, we could calculate the error for a given sample by evaluating $\epsilon_i = y_i - \hat{y}_i$. The errors are assumed to be normally distributed with mean of 0. We can check this assumption for the OLS swiss dataset model by solving for each $\epsilon_i$ and plotting the distribution. In other words, we plot a histogram of the residuals:

# Compute errors
errors = resid(swiss.lm_full)

# Plot histogram and fitted line
as.data.frame(errors) %>%
  ggplot(aes(errors)) +
  geom_histogram(binwidth=1.5, aes(y=..density..)) +
  geom_density(adjust=1.2, size=1, color='red') +
  xlim(-23, 23)

Even with this small dataset of 47 samples we see the normal distribution beginning to take shape, as suggested with the red curve.

In Bayesian regression we assign prior probability distributions to the parameters and use a likelihood function to determine the posterior using Bayes’ rule. For a given parameter $\beta$ this rule can be stated as:

$P(\beta \mid \text{data}) \propto P(\text{data} \mid \beta) \, P(\beta)$

where $P(\beta)$ is the prior distribution of $\beta$, $P(\beta \mid \text{data})$ is the posterior distribution given the data, and the other term is the likelihood [2].

We can see how the posterior will in principle depend on the choice of both prior and likelihood, but in this post we never explicitly define any priors because they will be dominated by the likelihood under our BIC assumptions. For more details, check out the top answer to my stack exchange question.

Once we have determined the posterior distribution for each parameter we can set the parameters for our linear model. Our choice should depend on the loss function we wish to minimize: for an absolute (linear) loss function we should take the posterior median, and for a quadratic loss function (as used in OLS) we should take the mean. In this post our posteriors are symmetric, so each choice is equivalent.
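As a quick numerical sanity check of that mean/median pairing (sketched in Python rather than R, with a skewed distribution standing in for a posterior so the two estimates differ):

```python
import numpy as np

# The posterior mean minimizes expected squared loss; the posterior
# median minimizes expected absolute loss. Check with samples from a
# skewed stand-in "posterior" (an exponential distribution).
rng = np.random.default_rng(0)
samples = rng.exponential(scale=2.0, size=20_000)

candidates = np.linspace(0.0, 10.0, 501)
sq_loss = np.array([np.mean((samples - c) ** 2) for c in candidates])
abs_loss = np.array([np.mean(np.abs(samples - c)) for c in candidates])

best_sq = candidates[sq_loss.argmin()]    # minimizer of squared loss
best_abs = candidates[abs_loss.argmin()]  # minimizer of absolute loss

print(best_sq, samples.mean())       # squared loss recovers the mean
print(best_abs, np.median(samples))  # absolute loss recovers the median
```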

To implement this in R we’ll import the BAS library and use the bas.lm function to evaluate a set of Bayesian models containing different combinations of features. We can then make predictions using various combinations of the resulting models.

swiss.lm_bay = bas.lm(Fertility ~ ., data = swiss,
                      prior = 'BIC', modelprior = uniform())
swiss.lm_bay

Just like our linear models earlier, we feed in all of the features using the dot (.) and specify “Fertility” for prediction. The function returns inclusion probabilities for each feature, given the data used to fit the models.

Let’s not worry about the parameters for specific models just yet and turn our attention to the probabilities of the models. The prior distribution for the models is uniform, as can be confirmed with the following code:

swiss.lm_bay$priorprobs

These are updated to:

swiss.lm_bay$postprobs

which can be illustrated using the image function.

image(swiss.lm_bay, rotate=FALSE)

Here we see the models ranked by their posterior odds ratio where black squares indicate which features are being left out of each model. Just like our stepAIC linear model feature reduction earlier, “Examination” can be identified as a poor feature for making predictions about fertility.

For a more quantified summary of the top models we can do:

summary.bas(swiss.lm_bay)

This gives access to the posterior probability of the top models side-by-side with $R^2$ values. Notice how the model with the largest $R^2$ does not have the largest probability!

As promised, we’ll now return to the parameter probabilities and plot the coefficient posterior distribution for each feature. The code below uses the model averaging approach to calculate these distributions.

par(mfrow = c(1,2))
plot(coefficients(swiss.lm_bay))

Notice how our weakest feature, “Examination”, has a large overlap with 0. In each plot the overlap is quantified by the height of the black vertical line extending up from 0.

Since we didn’t hold out any data during training, we have nothing to test our model on. Let’s swiftly fix that by breaking our dataframe into training and testing pieces:

set.seed(1)
n = nrow(swiss)
train = sample(1:n, size = round(0.6*n), replace=FALSE)
swiss.train = swiss[train,]
swiss.test = swiss[-train,]

and training a new set of models:

swiss.lm_bay = bas.lm(Fertility ~ ., data = swiss.train,
                      prior = 'BIC', modelprior = uniform())

Now we can compare the performance of the following aggregated models:

- BMA: Bayesian Model Averaging (mean of best models)
- BPM: Bayesian Posterior Model (best predictive model according to some loss function e.g., squared error)
- MPM: Median Probability Model (including all predictors whose marginal probabilities of being non zero are above 50%)
- HPM: Highest Probability Model

# Set up matrix to store results in
results = matrix(NA, ncol=4, nrow=1)
colnames(results) = c('BMA', 'BPM', 'MPM', 'HPM')

# Make predictions for each aggregated model
for (name in colnames(results)) {
  y_pred = predict(swiss.lm_bay, swiss.test, estimator=name)$fit
  results[1, name] = cv.summary.bas(y_pred, swiss.test$Fertility)
}

# Print results
options(digits = 4)
results

In each case the performance is similar, with the BMA model appearing to be the best and BPM the worst. Unfortunately we cannot trust these results because they depend too much on the training / testing data allocation. To get results we *can* trust, we’ll repeat the experiment on many different train / test splits [3] and compare the score distributions, as seen below.

set.seed(99)
results = matrix(NA, ncol=4, nrow=10)
colnames(results) = c('BMA', 'BPM', 'MPM', 'HPM')

for (i in 1:10) {
  n = nrow(swiss)
  train = sample(1:n, size = round(0.6*n), replace=FALSE)
  swiss.train = swiss[train,]
  swiss.test = swiss[-train,]
  swiss.lm_bay = bas.lm(Fertility ~ ., data = swiss.train,
                        prior = 'BIC', modelprior = uniform())
  for (name in colnames(results)) {
    y_pred = predict(swiss.lm_bay, swiss.test, estimator=name)$fit
    results[i, name] = cv.summary.bas(y_pred, swiss.test$Fertility)
  }
}
boxplot(results)

Now we can see that each method performs equally well within the calculated error bounds.

If you’re still reading this, and especially if you have been following along in RStudio, then perhaps you are willing to take on a homework task of comparing results when using different priors. What happens when you repeat the validation above with the substitution below?

swiss.lm_bay = bas.lm(Fertility ~ ., data = swiss.train,
                      prior = 'g-prior', modelprior = beta.binomial(1,1))

Thanks for reading! You can find a link to the RStudio markdown file here.

If you would like to discuss anything or have questions/corrections then please write a comment, email me at agalea91@gmail.com, or tweet me @agalea91

[1] – As well as the $\beta$’s, in Bayesian linear regression we analogously solve for the standard deviation of the error distribution. This also involves setting a prior distribution and using a likelihood function to determine the posterior.

[2] – The posterior can be calculated using conjugacy, which occurs when the prior and posterior distributions are defined by the same function with different parameters. By selecting the appropriate prior and likelihood this concept can be used to easily determine the posterior.

[3] – As pointed out to me by a reddit user (named questionquality), what I am doing here is not K-fold cross validation and I have edited the post accordingly. For K-fold testing we could do something like this:

n = nrow(swiss)
folds = caret::createFolds(1:n, k=10)
for (fold in folds) {
  swiss.train = swiss[-fold,]
  swiss.test = swiss[fold,]
  # etc...
}

Again I would like to acknowledge reddit user questionquality for this code.

I pulled the statistics from the original post (linked above) using requests and BeautifulSoup for Python. The bar plots were made with matplotlib and seaborn, where the functions are ordered by the number of unique repositories containing instances. For example, we see that pd.Timestamp is not used in as many projects as a number of others, despite it having a very high number of total instances on GitHub.

**1) DataFrame:** Creates a dataframe object.

df = pd.DataFrame(data={'y': [1, 2, 3],
                        'score': [93.5, 89.4, 90.3],
                        'name': ['Dirac', 'Pauli', 'Bohr'],
                        'birthday': ['1902-08-08', '1900-04-25', '1885-10-07']})
print(type(df))
print(df.dtypes)
df

**6) Merge:** Combine dataframes.

df_new = pd.DataFrame(data=list(zip(['Dirac', 'Pauli', 'Bohr', 'Einstein'],
                                    [True, False, True, True])),
                      columns=['name', 'friendly'])
df_merge = pd.merge(left=df, right=df_new, on='name', how='outer')
df_merge

**3) arange:** Create an array of evenly spaced values between two limits.

np.arange(start=1.5, stop=8.5, step=0.7, dtype=float)

**8) mean:** Get mean of all values in list/array or along rows or columns.

vals = np.array([1, 2, 3, 4]*3).reshape((3, 4))
print(vals)
print('')
print('mean entire array =', np.mean(vals))
print('mean along columns =', np.mean(vals, axis=0))
print('mean along rows =', np.mean(vals, axis=1))

**1) stats:** A module containing various statistical functions and distributions (continuous and discrete).

# Normal distribution:
# plot Gaussian
x = np.linspace(-5, 15, 50)
plt.plot(x, sp.stats.norm.pdf(x=x, loc=5, scale=2))

# plot histogram of randomly sampled values
np.random.seed(3)
plt.hist(sp.stats.norm.rvs(loc=5, scale=2, size=200),
         bins=50, normed=True, color='red', alpha=0.5)
plt.show()

**5) linalg:** Among other things, this module contains linear algebra functions including inverse (linalg.inv), determinant (linalg.det), and matrix/vector norm (linalg.norm) along with eigenvalue tools e.g., linalg.eig.

matrix = np.array([[4.3, 8.9], [2.2, 3.4]])
print(matrix)
print('')

# Find norm
norm = sp.linalg.norm(matrix)
print('norm =', norm)

# Alternate method
print(norm == np.square([v for row in matrix for v in row]).sum()**(0.5))
print('')

# Get eigenvalues and eigenvectors
eigvals, eigvecs = sp.linalg.eig(matrix)
print('eigenvalues =', eigvals)
print('eigenvectors =\n', eigvecs)

**6) interpolate:** A module containing splines and other interpolation tools.

# Spline fit for scattered points
x = np.linspace(0, 10, 10)
xs = np.linspace(0, 11, 50)
y = np.array([0.5, 1.8, 1.3, 3.5, 3.4, 5.2, 3.5, 1.0, -2.3, -6.3])
spline = sp.interpolate.UnivariateSpline(x, y)
plt.scatter(x, y)
plt.plot(xs, spline(xs))
plt.show()

**8) signal:** This module must be imported directly. It contains tools for signal processing.

# Fit noisy signal with a smooth line
import scipy.signal
np.random.seed(0)

# Create noisy data
x = np.linspace(0, 6*np.pi, 100)
y = [sp.special.sph_jn(n=3, z=xi)[0][0] for xi in x]
y = [yi + (np.random.random()-0.5)*0.7 for yi in y]
# y = np.sin(x)

# Get parameters for an order 3 lowpass Butterworth filter
b, a = sp.signal.butter(3, 0.08)

# Initialize filter
zi = sp.signal.lfilter_zi(b, a)

# Apply filter
y_smooth, _ = sp.signal.lfilter(b, a, y, zi=zi*y[0])

plt.plot(x, y, c='blue', alpha=0.6)
plt.plot(x, y_smooth, c='red', alpha=0.6)
plt.title('Noisy spherical bessel function signal processing')
plt.savefig('noisy_signal_fit.png', bbox_inches='tight')
plt.show()

**10) misc:** A module containing “utilities that don’t have another home”. Based on the Google search results, people often use misc.imread and misc.imsave to open and save pictures.

# Get the raccoon face
pics = sp.misc.face(), sp.misc.face(gray=True)

# Look at it
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for pic, ax in zip(pics, axes):
    ax.imshow(pic)
    ax.set_xticks([])
    ax.set_yticks([])
plt.show()

Thanks for reading. As mentioned earlier, you can see a full list of examples in my ipython notebook. I would like to acknowledge Robert for mining the usage data from GitHub; here is a link to his blog.

If you would like to discuss anything or have questions/corrections then please write a comment, email me at agalea91@gmail.com, or tweet me @agalea91

I used my own Python script for collecting tweets; it’s available on GitHub here. You can learn about its dependencies and how to run it in this blog post. For the analysis I used an ipython notebook (available here), where you can see the inner workings of this series in more detail.

We can see the overall trend by plotting a histogram of posts containing #NHL during the playoffs.

Each bar includes a day’s worth of tweets, and the spikes occurred on days with big games. Overall we see a decrease in interest as the playoffs progress and more teams are eliminated. On the other hand, we see a dramatic increase in the popularity-to-game ratio. This is especially evident in the final round, where game days are clearly distinguishable by spikes of at least 1000 tweets. This count increased significantly for the final two games of the playoffs and hit a maximum on the final game. Interestingly, the two days prior to this had the lowest tweet counts of the entire playoffs.
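For reference, the daily binning behind a histogram like this is a one-liner with pandas; here is a sketch with made-up timestamps:

```python
import pandas as pd

# Made-up tweet timestamps
times = pd.to_datetime(['2016-04-13 19:05', '2016-04-13 21:40',
                        '2016-04-14 02:15', '2016-04-15 20:00'])

# Tweets per calendar day
daily = pd.Series(1, index=times).resample('1D').sum()
print(daily.tolist())  # -> [2, 1, 1]
```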

Below we’ll look at similar histogram plots for a selected group of players – one from each team in the playoffs. For teams in the east we get the following:

And for teams in the west we find:

The plots are pretty hectic, and for this reason I’ve also produced variations where each player is isolated. We’ll look at these individually below with some commentary, although I’ll mostly leave the interpretation up to you.

Starting with the east players:

**Sidney Crosby:**

Sid dominated the popularity contest this year. Even from the start he was one of the most tweeted about players, but things really started getting crazy in the later rounds. The final spike of ~40,000 tweets was the result of Pittsburgh winning the Stanley Cup and Crosby himself being named playoff MVP.

**Nikita Kucherov:**

Nikita and the Tampa Bay Lightning had a great run this year, losing to Pittsburgh in the conference finals. His big spike near the end was the result of a 3 point performance on May 22nd. He had two shots that game and got two goals, one coming late in the third to push the game to overtime where Tampa won on a lucky bounce.

**Alex Ovechkin:**

Alex’s histogram probably would have looked more like Crosby’s if his Washington Capitals had been able to make it past the Penguins in the second round. He was initially on par with Sid, if not more popular, but that all came to an end along with the dreams of his Stanley Cup-hopeful teammates.

**John Tavares:**

John is one of my favorite captains in the league, and he’s helped bring his New York Islanders three playoff berths in the last four years. The spike of ~10,000 tweets on April 24th was the result of his series-winning double-overtime goal (in New York) that pushed his team into the second round for the first time since 1993. Oh, and he also scored that game’s tying goal with a minute left in the third period.

**Jaromir Jagr:**

Jagr, now 44 years old, saw his Stanley Cup dreams end in the first round this year. Like other members of his Florida Panthers (e.g., Barkov), he wasn’t very successful in the playoffs compared to the regular season. His most popular Twitter day came on May 5th, when the Panthers signed him to a one-year deal.

**Claude Giroux:**

Despite a solid effort, Giroux and his Philadelphia Flyers never came close to defeating the Presidents’ Trophy-winning Capitals in the first round.

**Petr Mrazek:**

Mrazek played only three games in the playoffs. He put up great numbers and earned his Detroit Red Wings a win against the Lightning. These performances helped to solidify him as Detroit’s expected number one next year.

**Henrik Lundqvist:**

The king and his New York Rangers battled hard this year against Pittsburgh. He played all 5 games despite suffering an eye injury in game 1.

Now moving swiftly on to the west players:

**Joe Pavelski:**

Little Joe was a total powerhouse of goal scoring for his San Jose Sharks, and was a big reason why they made it to the finals. Unfortunately for the Sharks, his unprecedented goal scoring came to an abrupt halt in the finals. This is reflected in the data as we see no significant spikes after the third round.

**Vladimir Tarasenko:**

Tarasenko and his St. Louis Blues were finally able to defeat Chicago in the playoffs this year. He had a productive playoffs until the conference finals, where the Sharks were able to shut him down. As can be seen, Pavelski (dashed turquoise line) had larger spikes game-for-game when the two stars faced off against each other.

**Filip Forsberg:**

Despite his Nashville Predators going deep this year, Forsberg was one of the least popular players I followed. This makes sense considering his uncharacteristically poor performance and low popularity to begin with.

**Tyler Seguin:**

Seguin was injured shortly before the playoffs began and I had anticipated he’d be returning before long, resulting in an onslaught of attention. I was wrong.

**Patrick Kane:**

Despite his Chicago Blackhawks being eliminated in the first round, Kane still ended up being one of the most popular players on Twitter this year. The massive spike came after a double-overtime goal against St. Louis in an elimination game for Chicago. Later, after being eliminated, he garnered more attention when named a Hart Trophy finalist on May 7th. This award goes to the regular season MVP, and Kane ended up winning it this year, as announced on June 22nd.

**Corey Perry:**

Perry and his Anaheim Ducks were eliminated early this year, but Perry threw on a team Canada jersey and flew over to Russia to compete in the IIHF world championships. As such, and considering Canada ended up winning the tournament, he saw late spikes of twitter attention in May.

**Jason Pominville:**

Jason played really well for his Minnesota Wild, and was a factor in the team stealing two games from a high powered Dallas team before being defeated in an exciting final match. This series was really great.

**Milan Lucic:**

Because of his affinity for controversy, Lucic had the potential to get a lot of twitter attention had his L.A. Kings made it deep. But they didn’t.

Thanks for reading. Keep an eye out for my next post, where we’ll look at the most influential NHL twitter users.

If you would like to discuss anything or have questions/corrections then please write a comment, email me at agalea91@gmail.com, or tweet me @agalea91

[1] – The filtering, which took my computer nearly two days, was necessary to make sure the tweets were about NHL players and not other people with the same last names. Next time I do this I’ll attempt to customize my search better via the twitter API, or alternatively implement the pre-processing algorithm at the tweet-scraping stage before writing to the file.

I used my own Python script for collecting tweets; it’s available on GitHub here. You can learn about its dependencies and how to run it in this blog post. For the analysis I used an ipython notebook (available here), where you can see the inner workings of this series in more detail.

After collection, the tweets are stored in .JSON formatted files which can each be read into the notebook by calling the following function.

import json

def load_tweets(file, skip):
    # Yield every skip-th tweet, parsing one line (one tweet) at a time
    with open(file, 'r') as f:
        for i, line in enumerate(f):
            if i % skip == 0:
                yield json.loads(line)

The entire file is iterated over line by line (where there is one tweet per line) and stored in a generator object. For now the memory usage is extremely low, however at some point we’ll have to store pieces of this generator object as lists. This will be quite memory intensive and so I’ve specified the variable skip to allow for some lines to be left out during exploration. Setting skip=1 is equivalent to skipping no tweets because i%1==0 for all integer values of i.

The key insight to using a generator for temporary tweet storage is that we are only interested in a small portion of the total number of tweet attributes that exist. We can iterate over the generator and append only the desired information to lists, which will lead to a large reduction in memory usage compared to storing everything. In particular, the attributes we care about for this study are:

- text
- date created
- user name
- number of favorites
- number of retweets
- number of accounts the user is following
- number of user followers

We can get this information into a Pandas dataframe named df by doing something like this:

```python
data = {'text': [], 'screen_name': [], 'created_at': [],
        'retweet_count': [], 'favorite_count': [],
        'friends_count': [], 'followers_count': []}

# tweets is a generator object returned from
# calling the load_tweets function
for t in tweets:
    data['text'].append(t['text'])
    data['screen_name'].append(t['user']['screen_name'])
    data['created_at'].append(t['created_at'])
    data['retweet_count'].append(t['retweet_count'])
    data['favorite_count'].append(t['favorite_count'])
    data['friends_count'].append(t['user']['friends_count'])
    data['followers_count'].append(t['user']['followers_count'])

import pandas as pd
df = pd.DataFrame(data)
```

Posts that are retweets will have “RT” as the first two characters of the text entry. As such, we can make a new column to identify these by running the following:

```python
RT = []
for t in df.text:
    RT.append(t.split()[0] == 'RT')
df['RT'] = RT
```

It’s also a good idea to convert our created_at column data to datetimes:

```python
# Convert created_at to datetimes
df['created_at'] = pd.to_datetime(df['created_at'])
```

Let’s see an overview of the dataframe, where I’ve iterated over the first two code snippets to load the tweets from all files.

We can easily and quickly search specific queries. For example let’s see the original posts that were retweeted at least 20 times but never were favorited.
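For example, a hedged sketch of that query on a toy dataframe (the column names follow the construction above; the data itself is made up):

```python
import pandas as pd

# Toy dataframe with the columns built earlier in this post
df = pd.DataFrame({
    'text': ['RT great game', 'wow', 'nice goal'],
    'RT': [True, False, False],
    'retweet_count': [50, 25, 3],
    'favorite_count': [0, 0, 2],
})

# Original posts (not retweets) with >= 20 retweets and 0 favorites
popular_unloved = df[(~df['RT'])
                     & (df['retweet_count'] >= 20)
                     & (df['favorite_count'] == 0)]
print(popular_unloved['text'].tolist())  # -> ['wow']
```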

If you are just getting started with Pandas, I’ve made a short reference gist with some useful dataframe commands that may be helpful.

Thanks for reading. Keep an eye out for my next post, where we’ll start to visualize the data.


]]>

In Part 1 of the study I used this data to identify the location of Trump haters who tweeted the hashtag #MakeDonaldDrumpfAgain in the month of March. In that blog post we looked at the number of tweets per state (and per capita) to see where the haters were. Today we’ll follow up with further analysis of the same data, including a closer look at retweet trends and tweet content.

This study was inspired by a social media outburst set off by HBO’s “Last Week Tonight” video segment about Trump, where host John Oliver asks the audience to broadcast the hashtag MakeDonaldDrumpfAgain – see the relevant clip here. In the video segment he plays on the fact that Trump’s family name was changed from Drumpf a few generations ago.

In the 32 days following its release (the length of this study) the video collected 23.4 million views on YouTube and I found ~550,000 tweets with the viral hashtag. About half of them were retweets, meaning that the post was not original content.

**#MakeDonaldDrumpfAgain study**

The Python 3 script I used can be found here and a tutorial on using the script and installing dependencies can be found here.

**Objectives**

In this part I’ll focus on answering the questions:

- How did the social media ecosystem of Twitter react to this viral political topic?

- What is the content of anti-Trump tweets?

For all of the figures below I used Python 3 and the complete code can be found in my ipython notebook.

**Lonely Tweets**

It’s common for twitter posts to be “retweeted” by users who want to broadcast the tweet to their followers. Popular twitter accounts can see up to thousands of retweets for each post. Mine, on the other hand, usually get none – we’ll call these lonely tweets. Below we look at the March 2016 #MakeDonaldDrumpfAgain tweets that fit under this category.

The main plot histogram is similar to the one we saw in Part 1 for all tweets. What’s more insightful is the inset plot showing the ratio of lonely tweets to all tweets. We see that the likelihood of getting a post retweeted decreased as more time passed since the video was initially posted.
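A sketch of how such a ratio can be computed with pandas, using made-up timestamps and retweet counts rather than the real dataset:

```python
import pandas as pd

# Made-up tweets: timestamp plus retweet count
df = pd.DataFrame({
    'created_at': pd.to_datetime(['2016-03-01 10:00',
                                  '2016-03-01 18:00',
                                  '2016-03-02 09:00']),
    'retweet_count': [0, 5, 0],
})

# Fraction of "lonely" tweets (zero retweets) per calendar day
lonely_ratio = (df['retweet_count'] == 0).groupby(
    df['created_at'].dt.date).mean()
print(lonely_ratio.tolist())  # -> [0.5, 1.0]
```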

**Retweet frequency**

I wanted to look closer at the relation between original tweets and their ‘children’ retweets. In particular: how much time passed between posting the original and somebody retweeting it? Intuition and experience tell me that it happens sooner rather than later. The results, shown below, are consistent with this expectation.

The inset shows the same data as the main plot but on a logarithmic scale, where we see the probability of getting a post retweeted drops off exponentially with time.

**Most common hashtags**

Other than #MakeDonaldDrumpfAgain, which was contained in all tweets, the 10 most popular hashtags are seen below.
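A counting sketch like the following produces such a ranking (the tweets here are made up; only the hashtags echo ones discussed in this post):

```python
from collections import Counter
import re

tweets = ['#MakeDonaldDrumpfAgain #NeverTrump lol',
          '#makedonalddrumpfagain #FeelTheBern',
          '#NeverTrump again']

# Count hashtags case-insensitively, excluding the search hashtag itself
tags = Counter(tag.lower() for t in tweets
               for tag in re.findall(r'#\w+', t)
               if tag.lower() != '#makedonalddrumpfagain')
print(tags.most_common(2))  # -> [('#nevertrump', 2), ('#feelthebern', 1)]
```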

Many of these could be misinterpreted as positive tweet sentiment in other studies, e.g., #Trump2016 and #MakeAmericaGreatAgain. Others, not so much, e.g., #NeverTrump and #DumpTrump. Underdog Democratic senator Bernie Sanders is referenced with #FeelTheBern, suggesting that Trump haters on twitter are more supportive of Bernie than Hillary. At this point it’s almost a certainty that she will win out over Bernie as the Democratic presidential nominee.

**Most common users tweeted @**

Here are the top 10 users included in tweets:

A large group of people directed their dislike towards Trump directly via his twitter account by including @realDonaldTrump in their post. Imagine getting 20,000 “dislike” notifications in just one month! On the other hand, Donald obviously doesn’t have time to monitor his twitter account personally … but I bet he noticed more than just a few of these posts.

**Word counts**

How many God-loving Americans put the hate on Trump? Where is the love? How much profanity? All are answered below.

Okay, so this doesn’t really tell us much. Most of the love word counts came from posts akin to “I love John Oliver’s video…”.

The most interesting pie slice, in my opinion, is that for profanity. I would be hard-pressed to drop an F-bomb or other swear word in a twitter post, but apparently there are plenty of people who don’t mind this sort of thing. I was even hesitant to include explicit profanity on my blog, but I will in the name of science! Here are the most popular curse words, presumably directed towards Trump:

Thanks for reading! All the figures can be found in high resolution in their github repository along with all the code used to create them. If you would like to access my data, I may be able to share the .JSON tweet files with you – although they are quite large.

If you would like to discuss any of the plots or have any questions or corrections, please write a comment. You are also welcome to email me at agalea91@gmail.com or tweet me @agalea91

]]>Part 1 – calculating Pi with Monte Carlo

Part 2 – Galton’s peg board and the central limit theorem

Part 3 – Markov Chain Monte Carlo

In the previous post, Markov Chains were introduced along with the Metropolis algorithm. We then looked at a Markov Chain Monte Carlo (MCMC) Python script for sampling probability distributions. Today we’ll use a modified version of that script to perform calculations *inspired by quantum mechanics*. This will involve sampling high dimensional probability distributions for systems with multiple particles.

We are heading into complicated territory, so I set up the first section of this post in question-answer format.

As far as I can tell, variational Monte Carlo (VMC) is the same as MCMC. The name was probably inspired by the variational theorem from quantum mechanics, but we’ll get to this later. For now let’s simply pose the problem:

**Given a probability distribution P(R), where R is the configuration vector for the positions of N particles in a box, calculate the average energy of the system.**

Why do we need to worry about probability distributions? Because in the quantum world things do not always have well defined positions, but instead only have well defined probabilities of being in specific positions. Hence the existence of probability distributions.

We’ll calculate the average value of the local energy E_L, which is defined as the sum of kinetic and potential energies:

It’s not unreasonable to plug in some configuration R and calculate the local energy by hand or with a computer, but to calculate the average we’ll need to sample the probability distribution with Monte Carlo.

The kinetic energy will depend on the second derivative of the wave function, which is relatively difficult and computationally time consuming to calculate. To avoid dealing with this beast we’ll use *a made-up function*

It could be interpreted as the particles having more kinetic energy the farther they are from the center of the box (as we’ll see, the box is centered about r=(0,0)). This doesn’t make sense physically, of course.

This is the cumulative potential energy of the particles, and unlike the kinetic term it will be calculated “for real” (i.e., how it actually can be done in practice). We’ll assume the system is split into equal-sized groups of two different particle “species” [1], and only consider interactions between particles of different species. In this case we have

where the two-body potential depends on the distance between each pair of particles of different species, and the summation is arranged so that each interaction is only counted once. We'll take the potential to be a shifted Gaussian:

```python
from scipy import stats

mu = 0.5
sig = 0.1
y = lambda r: -stats.norm.pdf((r - mu) / sig)
```

We’ll focus on the case where mu = 0.5 and sig = 0.1, for which the potential becomes

From elementary quantum mechanics, the probability density is given by the square of the wave function. So, for our many-body wave function Psi_V(R), we have that

The absolute value is meaningful when the wave function has imaginary components, which is not the case for us today. Our first Psi_V will be a linear combination of the single-particle wave functions psi_1 and psi_2 [2]:

```python
def prob_density(R, N):
    '''
    The square of the many body wave function Psi_V(R).
    '''
    # e.g. for N=4:
    # psi_v = psi(r_1) + psi(r_2) + psi(r_3) + psi(r_4)
    psi_v = sum([psi_1(R[n][0], R[n][1]) for n in range(N)]) + \
            sum([psi_2(R[n][0], R[n][1]) for n in range(N)])
    # Setting anything outside the box equal to zero
    # This will keep particles inside
    for coordinate in R.ravel():
        if abs(coordinate) >= 1:
            psi_v = 0
    return np.float64(psi_v**2)
```

Our configuration space is two dimensional (so we can visualize it nicely), so particle positions consist of just x and y coordinates, i.e., r = (x, y). Let’s take psi_1 to be a positive Gaussian and psi_2 to be a negative Gaussian.

```python
def psi_1(x, y):
    '''
    A single-particle wave function.
    '''
    g1 = lambda x, y: mlab.bivariate_normal(x, y, 0.2, 0.2,
                                            -0.25, -0.25, 0)
    return g1(x, y)

def psi_2(x, y):
    '''
    A single-particle wave function.
    '''
    g2 = lambda x, y: -mlab.bivariate_normal(x, y, 0.2, 0.2,
                                             0.25, 0.25, 0)
    return g2(x, y)
```

The summation of psi_1 and psi_2 looks like:

If we take this summation to be the wave function of a system with just one particle then the associated probability distribution for the particle location would look like:

So what about a plot of the many-body wave function Psi_V(R)? We would require a higher dimensional space for this. Take the example of N = 4 particles. In this case we would need to include an axis not just for “x and y” as done above, but for x_1, y_1, x_2, y_2, x_3, y_3, x_4, and y_4.

We’ll focus on the 4-particle system where 2 particles belong to each of the two species. The wave function in this case is

Running a quick simulation with 1000 walkers (i.e., 1000 samples per step) for 40 steps, the samples look like this:

The top left panel shows the initial state where walkers are distributed randomly. As the system equilibrates we see them drifting into areas where the probability density is large. One species is plotted in blue and the other in red.

To calculate the average energy we average the local energy of equilibrated samples over many steps. An example using 200 walkers is shown below.

Each point is an average over all walkers at the given step. In this case we calculate an average energy of , where the first 200 steps have been excluded so we only average over the equilibrated system. The error, as shown by the red band around the average, is calculated as the sample standard deviation of [3]:
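A minimal sketch of this averaging procedure, using made-up per-step energies in place of real simulation output:

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up per-step averages of the local energy over all walkers
energy_per_step = rng.normal(-1.5, 0.1, size=1000)

warmup = 200                       # discard pre-equilibration steps
eq = energy_per_step[warmup:]
E_avg = eq.mean()                  # the reported average energy
E_err = eq.std(ddof=1)             # sample standard deviation [3]
```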

Running a calculation with 2000 walkers, we can see a dramatically reduced error:

Now we calculate , which agrees within error to the previous calculation.

In the calculations above, the system appeared to equilibrate almost immediately. This is because the initial configurations were randomly distributed about the box. If instead we force the particles to start near the corners of the box (far away from the areas where the probability distribution is large) we can clearly see the energy decreasing as equilibration occurs. Below we plot this along with the particle density from various regions of the calculation (as marked in yellow in the right panel).

The usefulness of VMC for quantum mechanical problems has to do with the variational theorem. In words, this theorem says that the energy expectation value is minimized by the *true ground state wave function of the system*. This can be written mathematically as follows:

where E_0 is the ground state energy of the system. A proof can be found at the bottom of page 60 of my master’s thesis. The triple-bar equals sign simply means “is defined as”; what’s important is the other part of the equation.

If the name of the game is to find the ground state energy, which is often the case, then a good estimate can be achieved using VMC. A trial wave function can include variable parameters to optimize, which is done by calculating the average energy for a particular set of parameters (the same way we’ve calculated it in this post) and repeating for different parameters until the lowest value is found. The resulting energy estimate will be an upper bound to the true ground state energy of the system.

To show how changing the trial wave function can impact the calculated average energy, we’ll adjust the wave function:

where the two species now have different single-particle wave functions. As shown above, we define psi_1 as the left Gaussian (looking back to the first figure) and psi_2 as the right one. For the 4-particle system we now have:

The new probability density will be:

```python
def prob_density(R, N):
    '''
    The square of the many body wave function Psi_V(R).
    '''
    psi_v = sum([psi_1(R[n][0], R[n][1]) for n in range(int(N/2))]) + \
            sum([psi_2(R[n][0], R[n][1]) for n in range(int(N/2), N)])
    # Setting anything outside the box equal to zero
    # This will keep particles inside
    for coordinate in R.ravel():
        if abs(coordinate) >= 1:
            psi_v = 0
    return np.float64(psi_v**2)
```

This should have the effect of separating the two species of particles. Do you think the average energy will become larger or smaller as a result?

Plotting some of the samples, we can see how the system now equilibrates according to the new probability density:

Comparing a calculation with the new wave function (red) to the old one (blue), we see an increase in the average energy:

Thanks for reading! You can find the entire ipython notebook document here. If you would like to discuss any of the plots or have any questions or corrections, please write a comment. You are also welcome to email me at agalea91@gmail.com or tweet me @agalea91

[1] – For example, we could have a system of cold atoms with two species that are identical except for their total spin: one species is spin-up and the other is spin-down.

[2] – I made the many-body wave function a sum of single-particle wave functions, but it would have been more realistic to make it a product of them, as can be seen here (for example).

[3] – In practice, the error on Monte Carlo calculations is often given by the standard deviation divided by sqrt(N), where N is the number of samples. Otherwise the error would not decrease as N increases, though intuitively it should, because we are building confidence in the calculation as the simulation runs longer. For more information see this article – specifically the first few equations and Figure 1.
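That scaling can be checked numerically; a sketch with made-up samples:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=10_000)   # made-up measurements

sample_std = x.std(ddof=1)                 # does not shrink with N
std_error = sample_std / np.sqrt(len(x))   # shrinks like 1/sqrt(N)
```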

]]>First we need to load the data.

```python
import statsmodels.api as sm
import pandas as pd

esoph = sm.datasets.get_rdataset('esoph')
print(type(esoph))
df = esoph.data

# Rename a column
df.columns = ['Age_group'] + list(df.columns[1:])
df.head()
```

<class 'statsmodels.datasets.utils.Dataset'>

After cleaning the data (which can be seen along with the full code in my ipython notebook), it ended up looking like this:

Here we’re only seeing the first 5 elements i.e., df.head(). The features have been converted to dummy variables representing the different groups.

A new column has been added called “positive_frac”, found by calculating ncases/(ncases + ncontrols) for each row. It is the fraction of each age/alcohol/tobacco group that was diagnosed positive for esophagus cancer, and we’re going to create linear models in an attempt to predict it. Let’s get some quick statistics about it:
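A small sketch of that calculation (the counts are made up, and I'm assuming the "positive" fraction is computed as ncases / (ncases + ncontrols)):

```python
import pandas as pd

# Made-up case/control counts for three groups
df = pd.DataFrame({'ncases': [1, 0, 3], 'ncontrols': [9, 10, 3]})

# Fraction of each group diagnosed positive (assumed formula)
df['positive_frac'] = df['ncases'] / (df['ncases'] + df['ncontrols'])
print(df['positive_frac'].tolist())  # -> [0.1, 0.0, 0.5]
```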

df.describe()['positive_frac']

```
count    88.000000
mean      0.346807
std       0.357342
min       0.000000
25%       0.000000
50%       0.267857
75%       0.583333
max       1.000000
Name: positive_frac, dtype: float64
```

So it can range from 0% to 100% of the group with an average of ~35% being diagnosed positive. Let’s take a look at how this is distributed depending on the age group using Seaborn’s FacetGrid() object to plot a set of histograms.

```python
import seaborn as sns
import matplotlib.pyplot as plt

colors = sns.color_palette("BrBG", 10)
g = sns.FacetGrid(df, row='Age_group', size=2, aspect=4,
                  legend_out=False)
g.map(plt.hist, 'positive_frac', normed=True, color=colors[3])
```

As could be expected, the younger groups are less likely to be diagnosed positive.

There is clearly a trend of higher percentages for older age groups. Let’s take a different look:

```python
sns.set_style('ticks')
sns.regplot(y='positive_frac', x='Age_group_i', data=df,
            fit_reg=True, marker='o', color=colors[1])
```

Seaborn has produced a line of best fit that confirms our previous observation. We can make analogous plots for our other features.

I was surprised to see that alcohol seems to be a more influential factor in esophagus cancer than tobacco!

We can build a model to predict the likelihood of positive diagnosis depending on multiple features. In general this will be a hyperplane with the equation

where the first parameter is the intercept and the others are the slopes. We have three independent variables:

- alcohol consumption
- tobacco consumption
- age

For illustration purposes, let’s build a two-feature model from ‘alcohol consumption’ and ‘tobacco consumption’. We’ll use the ordinary least squares regression class of statsmodels.

```python
from statsmodels.formula.api import ols

reg = ols('positive_frac ~ alcgp + tobgp', df).fit()
reg.params
```

```
Intercept   -0.173821
alcgp        0.176903
tobgp        0.035869
dtype: float64
```

In this case we’ll be able to plot the resulting hyperplane because it’s just a regular (i.e., 2D) plane. In the figures below I’ve done just that. The data is also plotted where the point size is determined by how many people were in the group, so a larger point represents a more statistically significant data-point [1].

This model was fit with a non-zero intercept. We can remove the intercept by doing the following:

```python
reg = ols('positive_frac ~ alcgp + tobgp -1', df).fit()
reg.params
```

```
alcgp    0.144220
tobgp    0.003582
dtype: float64
```

Notice that the alcgp and tobgp parameters have changed because the plane was re-fit with the intercept fixed at zero. The hyperplane looks similar but is less slanted along the ‘Tobacco consumption’ axis.

Let’s compare predictions for the following combination of features:

- alcohol => group 3 (80-119 g/day)
- tobacco => group 4 (30+ g/day)

With the intercept we predict 50% positive diagnoses and without we predict ~45% [2].
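Those two predictions can be reproduced by plugging the fitted parameters quoted above into the linear model by hand:

```python
# Parameters from the intercept model quoted above
b0, b_alc, b_tob = -0.173821, 0.176903, 0.035869
pred_with = b0 + b_alc * 3 + b_tob * 4        # alcgp=3, tobgp=4

# Parameters from the zero-intercept model
b_alc0, b_tob0 = 0.144220, 0.003582
pred_without = b_alc0 * 3 + b_tob0 * 4

print(round(pred_with, 2), round(pred_without, 2))  # -> 0.5 0.45
```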

The previous predictions did not account in any way for the age groups, and we have already seen from the single-feature models that it’s correlated to being positively diagnosed. We can include it like this:

```python
reg = ols('positive_frac ~ alcgp + tobgp + Age_group_i', df).fit()
reg.params
```

```
Intercept     -0.615799
alcgp          0.180166
tobgp          0.047964
Age_group_i    0.119547
dtype: float64
```

It’s not practical to try and visualize the hyperplane in this case. The intercept can be removed the same way as before:

```python
reg = ols('positive_frac ~ alcgp + tobgp + Age_group_i -1', df).fit()
reg.params
```

```
alcgp          0.099652
tobgp         -0.035494
Age_group_i    0.066731
dtype: float64
```

Wait, the coefficient for tobacco consumption is negative? This means that a higher tobacco intake implies *lower* risk according to the zero-intercept model!

Now we can re-visit the test case from earlier and make predictions for each age group. We see a large variation depending on whether or not the intercept parameter is fit.

To give some indication as to the significance and accuracy of the three-feature models we can look at the residual plots.

The linear models we’ve seen have been fit by minimizing the sum of the squared residuals. A residual is defined as the difference between the actual value and the prediction:

r_i = y_i - ŷ_i,

where y_i are the actual values we are trying to predict (i.e., the “positive_frac” column of our dataframe) and ŷ_i are the model’s predictions.
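As a tiny numerical sketch of that definition (with made-up values):

```python
import numpy as np

y = np.array([0.2, 0.5, 0.8])        # actual values
y_hat = np.array([0.25, 0.4, 0.7])   # model predictions
residuals = y - y_hat                # actual minus predicted
print(np.round(residuals, 2).tolist())  # -> [-0.05, 0.1, 0.1]
```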

Below we plot the residuals as a function of the predictions. For the model with the intercept fixed at zero we get:

Whereas for the model with a fitted intercept we find a slightly tighter distribution (which can be confirmed, as I have done, by comparing the R-squared fit values for each model):

But a close look at the x-axis values reveals that we are predicting positive diagnosis fractions to be negative in some cases, which makes no sense! This doesn’t mean the model won’t give more accurate predictions for the majority of age/alcohol/tobacco combinations, but it is worth noting!

As mentioned above, the full code used to make this post can be found in my ipython notebook.

Thanks for reading! If you would like to discuss my code or have any questions or corrections, please write a comment. You are also welcome to email me at agalea91@gmail.com or tweet me @agalea91

[1] – In this post I did not do weighted regression, as may be suggested by having different sizes data-points.

[2] – Please don’t think this means you have a ~1/2 chance of being diagnosed positive yourself if you fit into this category! We are simply modeling a study.

]]>Part 1 – calculating Pi with Monte Carlo

Part 2 – Galton’s peg board and the central limit theorem

So far in this series we have seen various examples of random sampling. Here we’ll look at a simple Python script that uses Markov chains and the Metropolis algorithm to randomly sample complicated two-dimensional probability distributions.

If you come from a math, statistics, or physics background you may have learned that a Markov chain is a **set of states that are sampled from a probability distribution**.

More recently, they have been used to string together words and make pseudo-random sentences [1]. In this case the state is defined by, e.g., the current and previous words in the sentence, and the next word is generated based on this “state”. We won’t be looking at this sort of thing today, but instead going back to where it all began.

In the early 1900’s a Russian mathematician named Andrey Markov published a series of papers describing, in part, his method for randomly sampling probability distributions using a *dependent* data set. It was not always clear how this could be done, and some believed that the law of large numbers (and hence the central limit theorem) would only apply to an independent data set. Among these disbelievers was another Russian professor named Pavel Nekrasov and there’s an interesting story about a “rivalry” between Markov and Nekrasov. To quote Eugene Seneta (1996):

**“Nekrasov’s (1902) attempt to use mathematics and statistics in support of ‘free will’ … led Markov to construct a scheme of dependent random variables in his 1906 paper”**

Markov laid out the rules for properly creating a *chain*, which is a series of states connected, in sequence, according to a specific set of rules [2]. The transitions between states must be ergodic; therefore

- Any state can be achieved within a finite number of steps; this ensures that the entire configuration space is traversable
- There is a chance of staying in the same place when the system steps forward
- The average number of steps required to return to the current state is finite

These rules may not apply, as such, for the modern Markovian-chain pseudo random text generators discussed above. However for other applications (such as QMC) these are very important.

The stage was set, but Markov would never live to see his ideas applied to QMC. This was done in the 1950’s and paralleled the creation of the world’s first electronic computer. And speaking about that …

QMC requires a set of configurations distributed according to the probability distribution (i.e., sampled from the square of the wave function). A configuration is a vector that contains the positions (e.g., the x, y, z coordinates) of each particle.

Recall, for comparison, how we used the Galton board to sample the binomial distribution. For QMC the probability distributions P(R), where R is a “many-body” configuration vector, are much more complicated, and samples can be produced using the Metropolis algorithm. This algorithm obeys the rules for creating a Markov chain and adds some (crucial) details. Namely, when transitioning from the current state to the next in the chain of configurations, we accept the move with probability:

A(R → R') = min(1, [T(R' → R) Ψ(R')²] / [T(R → R') Ψ(R)²]),

where R is the configuration of the current state and R' is the configuration of the next proposed state. The move itself involves shifting the location of some or (in practice) all of the particles, and this is done randomly according to a transition rule T(R → R'). In my experience it’s usually the case that T(R → R') is equal to the opposite transition T(R' → R), and therefore we can simplify the acceptance probability to:

A(R → R') = min(1, Ψ(R')² / Ψ(R)²).

In English, this means we take A as the ratio of the square of the wave function evaluated at the proposed and current configurations, or we take A as 1 if this ratio is larger than 1. If A = 1 then we accept the move, and if A < 1 we do the following:

- Produce a random number u from 0 to 1
- Calculate A + u
- Accept the move if A + u >= 1, otherwise reject
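A minimal sketch of this acceptance decision (the helper name and the injectable rng argument are mine, for testability; accepting when u < A is equivalent to the A + u >= 1 condition used in the check_move function later in this post):

```python
import random

def accept(p_proposed, p_current, rng=random.random):
    """One Metropolis acceptance decision: always accept uphill
    moves; accept downhill moves with probability p_proposed/p_current."""
    a = min(1.0, p_proposed / p_current)
    if a == 1.0:
        return True
    return rng() < a

print(accept(2.0, 1.0))  # -> True (uphill moves always accepted)
```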

The average acceptance ratio is an important quantity for controlling and understanding simulations and it will depend on the “maximum move size” (which is the maximum distance each particle can be shifted in each coordinate for each move – the actual distance shifted will depend also on a random number). Usually a desirable acceptance ratio is 50%.

Let’s look at a simple script for sampling two-dimensional probability distributions. If you’re familiar with Python then reading over the code should be a great way of solidifying / understanding the Metropolis algorithm as discussed above.

```python
import numpy as np

def Metroplis_algorithm(N, m, dr):
    '''
    A Markov chain is constructed, using the Metropolis
    algorithm, that is comprised of samples of our
    probability density: psi(x,y).

    N  - number of random moves to try
    m  - will return a sample when i%m == 0 in the loop over N
    dr - maximum move size (if uniform), controls the
         acceptance ratio
    '''
    # we'll want to return the average acceptance ratio
    a_total = 0
    # sample locations will be stored in a list
    samples = []
    # get the starting configuration and sample probability
    # distribution; we'll start at r=(0,0)
    r_prime = np.zeros(2)
    p_prime = psi(r_prime[0], r_prime[1])
    for i in range(N):
        # propose a random move: r' -> r
        r = r_prime + np.random.uniform(-dr, dr, size=2)
        p = psi(r[0], r[1])
        # calculate the acceptance ratio for the proposed move
        a = min(1, p / p_prime)
        a_total += a
        # check for acceptance
        p_prime, r_prime = check_move(p_prime, p, r_prime, r)
        if i % m == 0:
            samples.append(r_prime)
    return np.array(samples), a_total / N * 100.0

def check_move(p_prime, p, r_prime, r):
    '''
    The move will be accepted or rejected based on the
    ratio of p/p_prime and a random number.
    '''
    if p / p_prime >= 1:
        # accept the move
        return p, r
    else:
        rand = np.random.uniform(0, 1)
        if p / p_prime + rand >= 1:
            # accept the move
            return p, r
        else:
            # reject the move
            return p_prime, r_prime
```

Here we are building one Markov chain by propagating a single “walker”. A walker is generally a configuration of particles, but in our case we are only worrying about one “particle” – our sampling location. In order to ensure that our samples are sufficiently well spread out, we only take one sample every m iterations. The probability distribution is called `psi` and it takes the positional arguments x and y. We’ll use this tricky combination of 2D (bivariate) Gaussians:

```python
import matplotlib.mlab as mlab

def psi(x, y):
    '''
    Our probability density function is the addition of
    two 2D Gaussians with different shape.
    '''
    g1 = mlab.bivariate_normal(x, y, 2.0, 2.0, -5, -5, 0)
    g2 = mlab.bivariate_normal(x, y, 0.5, 5.0, 10, 10, 0)
    return g1 + g2
```

Let’s see what happens when we run this script:

```python
N, m, dr = 50000, 10, 3.5
samples, a = Metroplis_algorithm(N, m, dr)
```

We get the following samples (or something similar):

Because the first configuration in the Markov chain was defined as r = (0, 0), the algorithm pulled the walker into the nearest area of high probability and the other area was completely ignored! Perhaps this can be corrected by increasing the maximum move size for the “particle” at each iteration.

As can be seen, this doesn’t really work. The move size must be quite large to achieve the desired effect, and by this point the sampling quality has degraded [3]. Notice how the acceptance ratio changes in response to altering the average move size (which is equal to half of the maximum move size, dr/2).

A better solution is to propagate multiple walkers (effectively building a set of Markov chains) and choosing the initial configurations randomly in the simulation area. This way, although walkers may still be trapped inside one of the two Gaussians, they will be more evenly distributed between the two. Below we can see the results of doing this for a large number of iterations.

In this case, because the one distribution is so skinny, it may be beneficial to reduce the move size (as seen in the right panel) even though the acceptance ratio becomes larger than 50% [3]. The modified script for using more than one walker is included below:

```python
def Metroplis_algorithm_walkers(N, m, walkers, dr):
    '''
    A Markov chain is constructed, using the Metropolis
    algorithm, that is comprised of samples of our
    probability density: psi(x,y).

    N       - number of random moves to try
    m       - will return samples when i%m == 0 in the loop over N
    walkers - number of unique Markov chains
    dr      - maximum move size, controls the acceptance ratio
    '''
    # we'll want to return the average acceptance ratio
    a_total = 0
    # sample locations will be stored in a list
    samples = []
    # get the starting configuration and sample probability
    # distribution; we'll start at a randomly selected
    # position for each walker
    r_prime = [np.random.uniform(-10, 15, size=2)
               for w in range(walkers)]
    p_prime = [psi(r_prime[w][0], r_prime[w][1])
               for w in range(walkers)]
    # initialize lists
    r = [np.zeros(2) for w in range(walkers)]
    p = [np.zeros(1) for w in range(walkers)]
    for i in range(N):
        for w in range(walkers):
            # propose a random move: r' -> r
            r[w] = r_prime[w] + np.random.uniform(-dr, dr, size=2)
            p[w] = psi(r[w][0], r[w][1])
            # calculate the acceptance ratio for the proposed move
            a = min(1, p[w] / p_prime[w])
            # update the total
            a_total += a
            # check for acceptance
            p_prime[w], r_prime[w] = check_move(p_prime[w], p[w],
                                                r_prime[w], r[w])
            if i % m == 0:
                samples.append(r_prime[w])
    return np.array(samples), a_total / N / walkers * 100.0

def check_move(p_prime, p, r_prime, r):
    '''
    The move will be accepted or rejected based on the
    ratio of p/p_prime and a random number.
    '''
    if p / p_prime >= 1:
        # accept the move
        return p, r
    else:
        rand = np.random.uniform(0, 1)
        if p / p_prime + rand >= 1:
            # accept the move
            return p, r
        else:
            # reject the move
            return p_prime, r_prime
```

I thought it would also be fun to plot the distributions in 3D so I modified code from a matplotlib contour plot example and produced this:

The samples can be seen as blue dots at the base of the distributions.

I’ve included the modified code here:

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm

fig = plt.figure(figsize=(14,14))
ax = fig.gca(projection='3d')
# ax = fig.add_subplot(111, projection='3d')

x1, x2 = -15, 15
y1, y2 = -15, 30

# set up a meshgrid - like labeling (x,y) coordinates
# for each vertex on a piece of graph paper
dx = 0.1
pad = 5
x = np.arange(x1, x2, dx)
y = np.arange(y1, y2, dx)
X, Y = np.meshgrid(x, y)

# define Z as the value of the probability
# distribution psi at each 'vertex'
# Z becomes a 2D Numpy array
Z = psi(X, Y)

# plot
ax.plot_wireframe(X, Y, Z, rstride=5, cstride=7,
                  color='r', alpha=0.7)
ax.scatter(samples[:, 0], samples[:, 1], color='b', s=0.2)

# make it pretty (as found in Axes3D.contour documentation)
# cset = ax.contour(X, Y, Z, zdir='z', offset=-100, cmap=cm.coolwarm)
cset = ax.contour(X, Y, Z, zdir='y', offset=y2, cmap=cm.coolwarm)
cset = ax.contour(X, Y, Z, zdir='x', offset=x1, cmap=cm.coolwarm)

# define the limits
ax.set_xlabel('x', labelpad=15, fontsize=15)
ax.set_xlim(x1, x2)
ax.set_ylabel('y', labelpad=15, fontsize=15)
ax.set_ylim(y1, y2)
ax.set_zlabel('psi(x,y)', labelpad=15, fontsize=15)
ax.set_zlim(0, 0.06)

# ax.view_init(elev=20, azim=-45)
plt.savefig('pretty_plot_metropolis_sampling.png',
            bbox_inches='tight', dpi=144)
plt.show()
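For anyone who wants to reproduce the wireframe-plus-contour pattern headlessly (e.g. on a server with no display), the same calls work with the Agg backend. A minimal sketch, with a stand-in unit Gaussian in place of psi, no samples overlaid, and an arbitrary output filename:

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')               # render off-screen, no display needed
import matplotlib.pyplot as plt
from matplotlib import cm

def psi(x, y):
    # stand-in: isotropic unit Gaussian (not the post's psi)
    return np.exp(-(x**2 + y**2) / 2.0) / (2.0 * np.pi)

X, Y = np.meshgrid(np.arange(-4, 4, 0.2), np.arange(-4, 4, 0.2))
Z = psi(X, Y)

fig = plt.figure(figsize=(6, 6))
ax = fig.add_subplot(111, projection='3d')
ax.plot_wireframe(X, Y, Z, rstride=5, cstride=5, color='r', alpha=0.7)
ax.contour(X, Y, Z, zdir='y', offset=4, cmap=cm.coolwarm)
ax.contour(X, Y, Z, zdir='x', offset=-4, cmap=cm.coolwarm)
fig.savefig('wireframe_sketch.png', bbox_inches='tight')
```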

The 2D plots we’ve been looking at were produced using this code:

import numpy as np
import matplotlib.pyplot as plt

def plot_samples(samples, psi, limits=[]):
    '''
    Plot the results of our Monte Carlo sampling
    along with the underlying probability
    distribution psi.
    '''
    # set up a meshgrid - like labeling (x,y)
    # coordinates for each vertex on a piece
    # of graph paper
    dx = 0.1
    pad = 5
    if limits:
        xlow, xhigh = limits[0], limits[1]
        ylow, yhigh = limits[2], limits[3]
    else:
        xlow = np.min(samples) - pad
        xhigh = np.max(samples) + pad
        ylow = np.min(samples) - pad
        yhigh = np.max(samples) + pad
    x = np.arange(xlow, xhigh, dx)
    y = np.arange(ylow, yhigh, dx)
    X, Y = np.meshgrid(x, y)
    # define Z as the value of the probability
    # distribution psi at each 'vertex'
    # Z becomes a 2D Numpy array
    # (must be feeding in numpy arrays here)
    Z = psi(X, Y)
    plt.scatter(samples[:, 0], samples[:, 1], alpha=0.5, s=1)
    CS = plt.contour(X, Y, Z, 10)
    plt.clabel(CS, inline=1, fontsize=10)
    plt.xlim(xlow, xhigh)
    plt.ylim(ylow, yhigh)
    plt.xlabel('x', fontsize=20)
    plt.ylabel('y', fontsize=20)
    plt.tick_params(axis='both', which='major', labelsize=15)
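If the meshgrid step is unfamiliar: np.meshgrid turns the two coordinate vectors into 2D arrays so a vectorized psi can be evaluated at every grid vertex in one call. A minimal sketch, again with a stand-in unit Gaussian for psi (not the post's distribution):

```python
import numpy as np

def psi(x, y):
    # stand-in: isotropic unit Gaussian (not the post's psi)
    return np.exp(-(x**2 + y**2) / 2.0) / (2.0 * np.pi)

dx = 0.1
x = np.arange(-5, 5, dx)      # 100 points
y = np.arange(-5, 5, dx)      # 100 points
X, Y = np.meshgrid(x, y)      # each has shape (len(y), len(x))
Z = psi(X, Y)                 # psi evaluated at every (x, y) vertex

# the grid includes (0, 0), so Z.max() sits at the
# Gaussian's peak value of 1/(2*pi)
```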

Thanks for reading! You can find the entire ipython notebook document here. If you would like to discuss any of the plots or have any questions or corrections, please write a comment. You are also welcome to email me at agalea91@gmail.com or tweet me @agalea91

[1] – A great Python example of pseudo-random text generation can be found here. See also Andy’s comment below.

[2] – These rules can be found on page 58 of my MSc thesis.

[3] – These plots (the second and third) used to look different due to a typo in my Metropolis_algorithm() function – pointed out by a reader. I have also changed the text accordingly and updated the ipython notebook file in my Github repository.
