Today we’ll scrape some data from Wikipedia and use Bokeh to make a stacked bar chart showing the number of NHL captains per team in the 2016 World Cup of Hockey. This was the first World Cup of Hockey since 2004, and it was held in Toronto.
As usual, the complete work can be seen in my IPython notebook.
Getting the data
Using the Python libraries requests and Beautiful Soup, we can scrape the data as shown below. We first get the names of the current NHL captains.
import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/List_of_current_NHL_captains_and_alternate_captains'
page = requests.get(url)
print('Got %s type object of url using the requests library' % str(type(page)))
>> Got <class 'requests.models.Response'> type object of url using the requests library
soup = BeautifulSoup(page.content, 'html.parser')
print('Fed %s type object into BeautifulSoup to create a %s type object' % (str(type(page.content)), str(type(soup))))
>> Fed <class 'bytes'> type object into BeautifulSoup to create a <class 'bs4.BeautifulSoup'> type object
page_tables = soup.findAll('table')
print('Got %s type object of length %d' % (str(type(page_tables)), len(page_tables)))
>> Got <class 'bs4.element.ResultSet'> type object of length 11
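As a slightly more defensive variant of the steps above, the download and the parsing can be split into two small helpers. This is only a sketch: the helper names, the timeout, and the User-Agent string are assumptions made for illustration, and the sample HTML is hand-written rather than real Wikipedia markup.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Download a page and return its raw bytes, raising on HTTP errors."""
    # The User-Agent value is a placeholder assumption, not from the post.
    headers = {'User-Agent': 'captains-scraper-example/0.1'}
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()
    return response.content


def tables_from_html(html):
    """Parse HTML and return all <table> elements found in it."""
    soup = BeautifulSoup(html, 'html.parser')
    return soup.find_all('table')


# Offline check on a tiny hand-written page (no network access needed)
sample = '<table><caption>First</caption></table><p>text</p><table></table>'
sample_tables = tables_from_html(sample)
print(len(sample_tables))
```

With these helpers, the tables for the real page would be obtained with tables_from_html(fetch_html(url)).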
Each table in the URL is an object and they can be iterated over. For example:
for table in page_tables:
    print(table.find('caption'))
>> Position abbreviations
>> List of current NHL Captains
>> List of current NHL Alternate Captains
We are clearly interested in the 2nd and 3rd items in page_tables. There are many ways to extract the data at this point. We’ll iterate through the cells of the table and use i % 3 to keep only the first of its 3 columns.
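# Getting the captains
C_players = []
for i, n in enumerate(page_tables[1].findAll('td')):
    if i % 3 == 0:
        try:
            # Assumes the player's name is the first link in the cell
            C_players.append(n.findAll('a')[0].text)
        except IndexError:
            # N/A entry (no link in the cell)
            print('Skipping entry:', n.contents)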
We can do something similar to get the alternate captains.
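One way to avoid repeating the loop is to wrap it in a function and call it once per table. The sketch below assumes a hypothetical helper name (first_column_names) and a hand-written three-column sample table; the real Wikipedia table layout may differ.

```python
from bs4 import BeautifulSoup


def first_column_names(table, n_cols=3):
    """Return the link text from the first cell of every n_cols-wide row."""
    names = []
    for i, cell in enumerate(table.find_all('td')):
        if i % n_cols != 0:
            continue  # not the first column
        link = cell.find('a')
        if link is not None:  # cells without a link (e.g. N/A) are skipped
            names.append(link.text.strip())
    return names


# Hand-written sample markup (an assumption, not the real page)
sample_html = """
<table>
  <caption>List of current NHL Alternate Captains</caption>
  <tr><td><a href="#">Player One</a></td><td>Team X</td><td>D</td></tr>
  <tr><td>N/A</td><td>Team Y</td><td>C</td></tr>
  <tr><td><a href="#">Player Two</a></td><td>Team Z</td><td>RW</td></tr>
</table>
"""
sample_table = BeautifulSoup(sample_html, 'html.parser').find('table')
A_sample = first_column_names(sample_table)
print(A_sample)
```

Applied to the scraped page, this would be called on the 3rd item of page_tables to build A_players.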
The World Cup rosters can be acquired in a similar way, as seen in my IPython notebook linked at the top of this post.
Plotting the results
At this point we have the following data in memory:
- players – a list containing lists of players on each team
- teams – the World Cup team names
- C_players – list of NHL team captains
- A_players – list of NHL team alternate captains
We could get the total number of captains playing in the tournament as follows:
# Flatten player list
all_players = [p for p_list in players for p in p_list]
# Initialize counters
N_C, N_A = 0, 0
for player in all_players:
    if player in C_players:
        N_C += 1
    elif player in A_players:
        N_A += 1
print('%d captains and %d alternates ' % (N_C, N_A))
>> 18 captains and 25 alternates
This turns out to be 69% of the NHL team captains and 42% of the alternates. Overall, about half of all NHL captains and alternates are playing in this tournament.
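The percentages come from dividing each count by the length of the corresponding scraped list. The post does not state those lengths, so the totals below (26 captains and 60 alternates) are assumptions back-calculated to match the quoted 69% and 42%; in practice they would be len(C_players) and len(A_players).

```python
def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


n_c, n_a = 18, 25          # counts reported above
total_c, total_a = 26, 60  # assumed list lengths, not taken from the post

pct_c = percent(n_c, total_c)
pct_a = percent(n_a, total_a)
pct_all = percent(n_c + n_a, total_c + total_a)
print('%.0f%% of captains, %.0f%% of alternates, %.0f%% overall'
      % (pct_c, pct_a, pct_all))
```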
Building the above counter into a function, we can apply it to each list of players. Using the pandas library, we can then build the following dataframe:
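A minimal sketch of that step is shown here. The column names match the ones used later in this post ('Team', 'Number of Captains', 'Number of Alternate Captains', 'C', 'A'), but the function name, the 'Total' helper column, and the toy rosters are assumptions for illustration only.

```python
import pandas as pd


def split_captains(roster, captains, alternates):
    """Return (captains, alternates) found in one team's roster."""
    captains, alternates = set(captains), set(alternates)
    team_c = [p for p in roster if p in captains]
    team_a = [p for p in roster if p not in captains and p in alternates]
    return team_c, team_a


# Toy data standing in for the scraped lists (not real rosters)
teams = ['Team X', 'Team Y']
players = [['Ann', 'Bob', 'Cy'], ['Dee', 'Eli']]
C_players = ['Ann', 'Dee']
A_players = ['Bob', 'Cy']

rows = []
for team, roster in zip(teams, players):
    team_c, team_a = split_captains(roster, C_players, A_players)
    rows.append({
        'Team': team,
        'Number of Captains': len(team_c),
        'Number of Alternate Captains': len(team_a),
        'C': ', '.join(team_c),
        'A': ', '.join(team_a),
    })

df = pd.DataFrame(rows)
# Order teams by their combined count, largest first
df['Total'] = df['Number of Captains'] + df['Number of Alternate Captains']
df = df.sort_values('Total', ascending=False).reset_index(drop=True)
print(df)
```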
With this dataframe (named df), we can use Bokeh to create the stacked bar plot as follows:
# Import libraries and setup for ipython notebook display
# Note: bokeh.charts is only available in older Bokeh releases
from bokeh.charts import Bar
from bokeh.charts.attributes import color, cat
from bokeh.charts.operations import blend
from bokeh.io import output_notebook, show
output_notebook()
# Make the plot
bar = Bar(df,
          values=blend('Number of Captains', 'Number of Alternate Captains',
                       name='Number of Captains', labels_name='caps'),
          stack=cat(columns='caps', sort=False),
          label=cat(columns='Team', sort=False),
          color=color(columns='caps', palette=['OrangeRed', 'Orange'], sort=False),
          title='2016 World Cup of Hockey NHL Captains',
          legend='top_right')
show(bar)
This method works well when the bars should appear in the same order as the rows of the dataframe (note the sort=False arguments). In this case we ordered df by the total number of captains and alternates.
Most of my time on this little project was spent trying to get some of Bokeh’s interactive plot features working. In particular, I wanted to use the hover tool to show the names of the players in each bar. Unfortunately, this is not easy to do with the current version of Bokeh, and I could not figure out how to get it working. For more details, check out my Stack Overflow question about the issue.
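For readers on a newer Bokeh release, where bokeh.charts is no longer available, a stacked bar chart with hover text can be sketched with bokeh.plotting instead. This is not the approach used in this post and does not resolve the issue described above for bokeh.charts; it assumes a reasonably recent Bokeh version, short column names chosen for the example, and toy data rather than the real rosters.

```python
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure, show

# Toy data (an assumption); 'C' and 'A' hold comma-separated player names
data = {
    'Team': ['Team X', 'Team Y'],
    'Captains': [2, 1],
    'Alternates': [1, 3],
    'C': ['Ann, Bob', 'Dee'],
    'A': ['Cy', 'Eli, Flo, Gus'],
}
source = ColumnDataSource(data)

# Tooltips read the name columns from the same data source
p = figure(x_range=data['Team'],
           title='NHL captains per team (toy data)',
           tooltips=[('Team', '@Team'), ('C', '@C'), ('A', '@A')])

# One stacked segment per column listed in the first argument
renderers = p.vbar_stack(['Captains', 'Alternates'], x='Team', width=0.8,
                         color=['orangered', 'orange'], source=source,
                         legend_label=['Captains', 'Alternates'])
p.legend.location = 'top_right'

# show(p)  # uncomment to render the plot in a browser or notebook
```

In this sketch, hovering over either segment of a bar shows both name lists for that team, since the tooltip fields are tied to the row rather than to the individual segment.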
Here is the list of captains / alternates on each team:
for T, C, A in zip(df.Team, df.C, df.A):
    print(T, '\nC -', C, '\nA -', A, '\n')
Team Canada
C - Alex Pietrangelo, Sidney Crosby, Ryan Getzlaf, Claude Giroux, Steven Stamkos, John Tavares, Jonathan Toews
A - Drew Doughty, Shea Weber, Patrice Bergeron, Logan Couture, Ryan O’Reilly, Corey Perry, Joe Thornton

Team USA
C - Ryan McDonagh, Max Pacioretty, Joe Pavelski, Blake Wheeler
A - Dustin Byfuglien, Ryan Suter, Brandon Dubinsky, Ryan Kesler, Zach Parise, Derek Stepan

Team Sweden
C - Erik Karlsson, Gabriel Landeskog, Henrik Sedin
A - Oliver Ekman-Larsson, Nicklas Backstrom, Daniel Sedin

Team Europe
C - Zdeno Chara, Anze Kopitar
A - Roman Josi, Mark Streit

Team Russia
C - Alexander Ovechkin
A - Andrei Markov, Evgeni Malkin

Team Czech Republic
C -
A - Martin Hanzal, Tomas Plekanec

Team Finland
C - Mikko Koivu
A - Jussi Jokinen

Team North America
C -
A - Ryan Nugent-Hopkins, Mark Scheifele
Thanks for reading! If you would like to discuss anything or have questions/corrections then please write a comment, email me at firstname.lastname@example.org, or tweet me @agalea91
I used this example from the Bokeh docs as a guide to create my bar plot.