Loading tweets into a Pandas dataframe using generators

This kicks off a series of posts looking at tweets with NHL content that were posted over the course of the playoffs. I searched for posts containing #NHL as well as those containing the names of a select group of players – one from each playoff team. By the end I had collected and saved millions of tweets in JSON format, but was left with ~400,000 after filtering out the irrelevant posts [1].

I used my own Python script for collecting tweets; it’s available on GitHub here. You can learn about its dependencies and how to run it in this blog post. For the analysis I used an IPython notebook (available here), where you can see the inner workings of this series in more detail.


2016 NHL Playoffs on Twitter – Part 1: Loading Tweets


After collection, the tweets are stored in JSON-formatted files, each of which can be read into the notebook by calling the following function.

import json

def load_tweets(file, skip):
    # Yield one parsed tweet per line, keeping every skip-th line.
    # Iterating over the file object (rather than calling readlines,
    # which reads the whole file into memory) keeps memory usage low;
    # the file stays open until the generator is exhausted.
    with open(file, 'r') as f:
        for i, line in enumerate(f):
            if i % skip == 0:
                yield json.loads(line)

The file is iterated over line by line (there is one tweet per line) and each parsed tweet is yielded lazily from a generator. For now the memory usage is extremely low; however, at some point we’ll have to store pieces of this generator as lists. That will be quite memory intensive, so I’ve included the skip parameter to allow some lines to be left out during exploration. Setting skip=1 is equivalent to skipping no tweets, because i % 1 == 0 for all integer values of i.
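To see the skip filter in action, here is a tiny self-contained sketch that stands in for a real tweet file (the one-line "tweets" are made up):

```python
import json
from io import StringIO

# Six fake one-line "tweets" standing in for a .json file
lines = [json.dumps({'text': 'tweet %d' % i}) for i in range(6)]
fake_file = StringIO('\n'.join(lines))

skip = 2  # keep lines 0, 2, 4 (i % 2 == 0)
kept = [json.loads(line)['text']
        for i, line in enumerate(fake_file) if i % skip == 0]
print(kept)  # ['tweet 0', 'tweet 2', 'tweet 4']
```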

The key to using a generator for temporary tweet storage is that we only care about a small subset of each tweet’s attributes. We can iterate over the generator and append just those fields to lists, which greatly reduces memory usage compared to storing everything. In particular, the attributes we care about for this study are:

  • text
  • date created
  • user name
  • number of favorites
  • number of retweets
  • number user is following
  • number of user followers

We can get this information into a Pandas dataframe named df by doing something like this:

data = {'search_phrase': [], 'text': [], 'screen_name': [], 'created_at': [],
        'retweet_count': [], 'favorite_count': [],
        'friends_count': [], 'followers_count': []}

# tweets is a generator object returned from
# calling the load_tweets function
for t in tweets:
    # Keep only the attributes listed above
    data['text'].append(t['text'])
    data['screen_name'].append(t['user']['screen_name'])
    data['created_at'].append(t['created_at'])
    data['retweet_count'].append(t['retweet_count'])
    data['favorite_count'].append(t['favorite_count'])
    data['friends_count'].append(t['user']['friends_count'])
    data['followers_count'].append(t['user']['followers_count'])
    # Placeholder: in practice this is filled with the query term
    # that was used when collecting this file's tweets
    data['search_phrase'].append(None)

import pandas as pd
df = pd.DataFrame(data)

Posts that are retweets will have “RT” as the first two characters of the text entry. As such, we can make a new column to identify these by running the following:

RT = []
for t in df.text:
    RT.append(t[:2] == 'RT')
df['RT'] = RT
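As an aside, the same flag can be computed without an explicit loop using pandas’ vectorized string methods – a drop-in replacement for the snippet above (the tweet texts here are made up):

```python
import pandas as pd

# Toy dataframe mimicking the text column
df = pd.DataFrame({'text': ['RT @someone: great goal!', 'What a save']})

# True where the tweet text begins with "RT"
df['RT'] = df['text'].str.startswith('RT')
print(df['RT'].tolist())  # [True, False]
```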

It’s also a good idea to convert our created_at column data to datetimes:

# Convert created_at to datetimes
df['created_at'] = pd.to_datetime(df['created_at'])
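With proper datetimes in place, filtering by date becomes a one-line boolean mask. A small sketch with made-up timestamps (real tweets use Twitter’s own created_at format, which pd.to_datetime also handles):

```python
import pandas as pd

# Made-up timestamps standing in for tweet creation times
df = pd.DataFrame({'created_at': ['2016-04-13 19:00:00',
                                  '2016-06-12 20:00:00']})
df['created_at'] = pd.to_datetime(df['created_at'])

# Keep only tweets posted after a (hypothetical) cutoff date
final = df[df['created_at'] >= '2016-05-30']
print(len(final))  # 1
```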

Let’s see an overview of the dataframe, after running the first two code snippets over every file to load all of the tweets.


We can easily and quickly run specific queries. For example, let’s see the original posts that were retweeted at least 20 times but never favorited.
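In the notebook this query appears as a screenshot; in code it is a boolean mask like the following sketch (toy data, and it assumes the RT column built earlier):

```python
import pandas as pd

# Toy data standing in for the full tweet dataframe
df = pd.DataFrame({
    'text': ['Original hot take', 'RT @x: hot take', 'Quiet post'],
    'retweet_count': [25, 30, 5],
    'favorite_count': [0, 2, 0],
    'RT': [False, True, False],
})

# Original posts (not retweets) with >= 20 retweets and no favorites
hits = df[(~df['RT']) & (df['retweet_count'] >= 20) & (df['favorite_count'] == 0)]
print(hits['text'].tolist())  # ['Original hot take']
```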


If you are just getting started with Pandas, I’ve made a short reference gist with some useful dataframe commands that may be helpful.

Thanks for reading. Keep an eye out for my next post, where we’ll start to visualize the data.

If you would like to discuss anything or have questions/corrections then please write a comment, email me at agalea91@gmail.com, or tweet me @agalea91


[1] – The filtering, which took my computer nearly two days, was necessary to make sure the tweets were about NHL players and not other people with the same last names. Next time I do this I’ll attempt to refine my search via the Twitter API, or alternatively run the pre-processing algorithm at the tweet-scraping stage, before writing to file.


3 thoughts on “Loading tweets into a Pandas dataframe using generators”

  1. Hey Alexander,

    Excellent tutorial! You explained everything both comprehensively and simply enough for a newbie like me to understand. Just one question: I ran the above code on my first .json file and the dataframe formed perfectly. However, I did another stream, and when I tried the same code on the new .json file, I got the following error:

    KeyError                          Traceback (most recent call last)
    in ()
          6 # calling the load_tweets function
          7 for t in tweets_df:
    ----> 8     data['text'].append(t['text'])
          9     data['screen_name'].append(t['user']['screen_name'])
         10     data['created_at'].append(t['created_at'])

    KeyError: 'text'

    I assume maybe the second stream was cut off before it retrieved the ‘text’ field, and so the code couldn’t proceed? I am not entirely sure – let me know if you are aware of a solution. Thanks!


    1. The KeyError you get means you are using a dictionary key that does not exist. In this case it could be either the ‘text’ key of “data” or the ‘text’ key of one of the tweets – I would bet it’s the latter. What does your “tweets” list look like? There must be at least one element that isn’t a tweet. You could use a try/except statement to probe this problem further.
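For what it’s worth, the Twitter streaming API can interleave non-tweet messages (such as rate-limit notices) that have no ‘text’ key. A defensive version of the loading loop that skips them might look like this sketch (the stream contents here are made up):

```python
import json

# Example stream: two tweets and one non-tweet "limit" notice
lines = [
    json.dumps({'text': 'goal!', 'user': {'screen_name': 'fan1'}}),
    json.dumps({'limit': {'track': 42}}),  # no 'text' key -> KeyError
    json.dumps({'text': 'save!', 'user': {'screen_name': 'fan2'}}),
]

texts = []
for line in lines:
    t = json.loads(line)
    try:
        texts.append(t['text'])
    except KeyError:
        # Not a tweet (e.g. a limit notice); skip it
        continue

print(texts)  # ['goal!', 'save!']
```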

