Collecting Twitter data with Python

Mining for tweets

 

This post gives a general overview of how my Python 3 tweet-searching script works. Twitter limits the maximum age of searchable tweets to roughly one week, so the script can only find tweets posted up to just over a week ago. Twitter also limits the maximum number of tweets that can be downloaded in every 15-minute interval.

The Python script twitter_search.py will search for tweets and save them to a JSON formatted file. When an exception is raised (i.e., the maximum number of tweets has been downloaded for the current 15-minute interval), the script pauses for 15 minutes and then continues. This repeats for as long as tweets matching the query are found. The tweet creation dates are used to label the JSON output files. The search limits must be specified: a maximum age in days (up to about 8) and a minimum age (as low as 0, i.e., “right now”). I prefer to collect tweets over one-day intervals so that each day is exported to its own file.

 

Dependencies

I used Python 3 for this project; if you do not have Python, I would recommend installing it via the Anaconda distribution. Other dependencies are Tweepy 3.5.0 (a library for accessing the Twitter API) and a personal Twitter “data-mining” application (which is very easy to set up). I used this guide to register my app. You will need to register your own in order to generate a consumer key, consumer secret, access token, and access secret; these are required to authenticate the script with the Twitter API.
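If you have pip available, Tweepy can be installed from the command line (the version pin matches what I used but is optional):

pip install tweepy==3.5.0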

 

Running the script

You can download my Python tweet searching/saving script using Git Shell:

git clone https://github.com/agalea91/twitter_search

or directly from its git repository.

Open the twitter_search.py file and then find the load_api() function (at the top) and add your consumer key, consumer secret, access token, and access secret. For example:

consumer_key = '189YcjF4IUzF156RGNGNucDD8'
consumer_secret = 'e4KPiY4pSh03HxjDg782HupUjmzdOOSDd98hd'
access_token = '2543812-cpaIuwndjvbdjaDDp5izzndhsD7figa9gb'
access_secret = '4hdyfnas7d988ddjf87sJdj3Dxn4d5CcNpwe'

These are not my actual credentials, as they should be kept private.

Before running the script, go to the main() function and edit the search criteria. Namely, you should enter a search phrase, the maximum time limit for the script to run, and the date range for the search (relative to today). For example:

search_phrase = '#makedonalddrumpfagain'
time_limit = 1.0 # runtime limit in hours
min_days_old, max_days_old = 1, 2 # search limits

# e.g. min_days_old, max_days_old = 7, 8
# gives the current weekday from last week,
# min_days_old=0 will search from right now

I found that max_days_old=9 was the largest value possible.

To run the script, open the terminal/command line to the file location and type:

python twitter_search.py

The script will search for tweets and save them to a JSON file until they have all been found or the time limit has been exceeded.

 

twitter_search.py functions

The main program is contained within the main() function, which is called automatically when running the script from the command line.  This part of the code is not shown below. Instead we only discuss the other functions.  Before we get started I’ll list the libraries used in the script:

import tweepy
from tweepy import OAuthHandler
import json
import datetime as dt
import time
import os
import sys

 

Firstly, the function load_api() authenticates the user and returns the Tweepy API wrapper:

def load_api():
    ''' Function that loads the twitter API after authorizing
        the user. '''

    consumer_key = '189YcjF4IUzF156RGNGNucDD8'
    consumer_secret = 'e4KPiY4pSh03HxjDg782HupUjmzdOOSDd98hd'
    access_token = '2543812-cpaIuwndjvbdjaDDp5izzndhsD7figa9gb'
    access_secret = '4hdyfnas7d988ddjf87sJdj3Dxn4d5CcNpwe'
    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_secret)
    # load the twitter API via tweepy
    return tweepy.API(auth)

By running api = load_api() we can access Twitter’s search function, e.g. api.search(q='#makedonalddrumpfagain').
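For example, a quick check that authentication is working (assuming your credentials have been filled in):

api = load_api()
results = api.search(q='#makedonalddrumpfagain', count=5)
for tweet in results:
    print(tweet.created_at, tweet.text)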

 

Twitter limits the maximum number of tweets returned per search to 100. We use a function [1] called tweet_search() that searches for up to max_tweets=100 tweets:

def tweet_search(api, query, max_tweets, max_id, since_id, geocode):
    ''' Function that takes in a search string 'query', the maximum
        number of tweets 'max_tweets', and the minimum (i.e., starting)
        tweet id. It returns a list of tweepy.models.Status objects. '''

    searched_tweets = []
    while len(searched_tweets) < max_tweets:
        remaining_tweets = max_tweets - len(searched_tweets)
        try:
            new_tweets = api.search(q=query, count=remaining_tweets,
                                    since_id=str(since_id),
                                    max_id=str(max_id-1))
#                                    geocode=geocode)
            print('found',len(new_tweets),'tweets')
            if not new_tweets:
                print('no tweets found')
                break
            searched_tweets.extend(new_tweets)
            max_id = new_tweets[-1].id
        except tweepy.TweepError:
            print('exception raised, waiting 15 minutes')
            print('(until:', dt.datetime.now()+dt.timedelta(minutes=15), ')')
            time.sleep(15*60)
            break # stop the loop
    return searched_tweets, max_id

This function loops over the api.search() call because a single call can return fewer than 100 tweets, so it is repeated until max_tweets have been collected (or no more are found). In the main program we loop over this function until the rate-limit exception is raised, at which point the script sleeps for 15 minutes before continuing.

The search can be limited to a specific radial area around latitude/longitude coordinates by uncommenting the geocode line and defining the parameter appropriately. For example, nearly all states in America fall within the geocode '39.8,-95.583068847656,2500km'. The issue is that the vast majority of tweets are not geocoded and would therefore be excluded.
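For example, the same restriction can be tested by calling api.search() directly with the geocode parameter (or by passing the string through tweet_search() once the geocode line is uncommented):

# restrict results to a ~2500 km radius around the central USA
usa_tweets = api.search(q='#makedonalddrumpfagain', count=100,
                        geocode='39.8,-95.583068847656,2500km')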

 

The api.search() function can start from a given tweet ID or date and will always search back in time. If we are appending the tweet data to an already existing JSON file, the “starting” tweet ID is defined based on the last tweet appended to the file (this is done in the main program). Otherwise we run the function get_tweet_id() to find the ID of a tweet that was posted at the end of a given day and this is used as the starting point for the search.

def get_tweet_id(api, date='', days_ago=9, query='a'):
    ''' Function that gets the ID of a tweet. This ID can
        then be used as a 'starting point' from which to
        search. The query is required and has been set to
        a commonly used word by default. The variable
        'days_ago' has been initialized to the maximum amount
        we are able to search back in time (9).'''

    if date: # return an ID from the end of the given day
        td = date + dt.timedelta(days=1)
        tweet_date = '{0}-{1:0>2}-{2:0>2}'.format(td.year, td.month, td.day)
        tweet = api.search(q=query, count=1, until=tweet_date)
    else:
        # return an ID from __ days ago
        td = dt.datetime.now() - dt.timedelta(days=days_ago)
        tweet_date = '{0}-{1:0>2}-{2:0>2}'.format(td.year, td.month, td.day)
        # get list of up to 10 tweets
        tweet = api.search(q=query, count=10, until=tweet_date)
    print('search limit (start/stop):', tweet[0].created_at)
    # return the id of the first tweet in the list
    return tweet[0].id
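Putting these together, here is a minimal driver sketch under the same assumptions (the variable names and the since_id placeholder are illustrative; the actual loop, including writing results to a file, lives in main()):

api = load_api()
start_id = get_tweet_id(api, days_ago=2)   # newest tweet ID to search back from
end_time = dt.datetime.now() + dt.timedelta(hours=1)
all_tweets = []
while dt.datetime.now() < end_time:
    new_tweets, start_id = tweet_search(api, '#makedonalddrumpfagain', 100,
                                        max_id=start_id, since_id=1,
                                        geocode=None)
    if not new_tweets:
        break
    all_tweets.extend(new_tweets)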

 

After each call of tweet_search() in the main program, we append the new tweets to a file in JSON format:

def write_tweets(tweets, filename):
    ''' Function that appends tweets to a file. '''

    with open(filename, 'a') as f:
        for tweet in tweets:
            json.dump(tweet._json, f)
            f.write('\n')
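For example, a hypothetical call from the main loop, labelling the output file with the day being collected (the filename pattern here is illustrative, not the script's exact format):

day = dt.datetime.now() - dt.timedelta(days=1)
fname = 'makedonalddrumpfagain_{}.json'.format(day.date())
write_tweets(new_tweets, fname)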

 

The resulting JSON file can easily (although not necessarily quickly) be read and converted to a Pandas dataframe for analysis.

 

Reading in JSON files to a dataframe

The twitter_search.py file is only used for collecting tweets; I use IPython notebooks for the analysis. First we’ll need to read the JSON file(s):

import json

tweet_files = ['file_1.json', 'file_2.json', ...]
tweets = []
for file in tweet_files:
    with open(file, 'r') as f:
        for line in f.readlines():
            tweets.append(json.loads(line))
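A quick sanity check after loading (the field names below are standard Twitter API JSON fields):

print(len(tweets), 'tweets loaded')
print(tweets[0]['created_at'], tweets[0]['text'])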

 

We now have a list of dictionaries named “tweets”, one per tweet. This can be accessed to create a dataframe with the required information. We’ll include the location information of the user who published the tweet as well as the coordinates (if available) and the tweet text.

import pandas as pd

def populate_tweet_df(tweets):
    df = pd.DataFrame()

    df['text'] = list(map(lambda tweet: tweet['text'], tweets))

    df['location'] = list(map(lambda tweet: tweet['user']['location'], tweets))

    df['country_code'] = list(map(lambda tweet: tweet['place']['country_code']
                                  if tweet['place'] != None else '', tweets))

    df['long'] = list(map(lambda tweet: tweet['coordinates']['coordinates'][0]
                        if tweet['coordinates'] != None else 'NaN', tweets))

    df['latt'] = list(map(lambda tweet: tweet['coordinates']['coordinates'][1]
                        if tweet['coordinates'] != None else 'NaN', tweets))

    return df
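For example, to build the dataframe from the tweets loaded above:

df = populate_tweet_df(tweets)
print(df.head())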

 

Data analysis: plotting tweet coordinates

We can now, for example, plot the locations from which the tweets were sent using the Basemap library (which must be manually installed [2]).

import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap

# assumes df was created above with df = populate_tweet_df(tweets)

# plot the blank world map
my_map = Basemap(projection='merc', lat_0=50, lon_0=-100,
                     resolution = 'l', area_thresh = 5000.0,
                     llcrnrlon=-140, llcrnrlat=-55,
                     urcrnrlon=160, urcrnrlat=70)
# set resolution='h' for high quality

# draw elements onto the world map
my_map.drawcountries()
#my_map.drawstates()
my_map.drawcoastlines(antialiased=False,
                      linewidth=0.005)

# add coordinates as red dots
longs = list(df.loc[(df.long != 'NaN')].long)
latts = list(df.loc[df.latt != 'NaN'].latt)
x, y = my_map(longs, latts)
my_map.plot(x, y, 'ro', markersize=6, alpha=0.5)

plt.show()
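To also save the figure to disk, a line like the following can be added just before plt.show() (the filename is arbitrary):

plt.savefig('tweet_locations_world.png', dpi=200, bbox_inches='tight')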

 

In the next post we’ll look at a politically inspired analysis of tweets posted with the hashtag #MakeDonaldDrumpfAgain. The phrase was trending a couple of weeks ago in reaction to an episode of HBO’s “Last Week Tonight” with John Oliver. The phrase represents a negative sentiment towards Donald Trump – a Republican candidate in the upcoming American election. I’ve collected every #MakeDonaldDrumpfAgain tweet since the video was posted and, using the plotting script above, produced this illustration of tweet locations:

[Figure: drumpf_tweet_locations_world – world map of geotagged #MakeDonaldDrumpfAgain tweets]

 

 

Of the 550,000+ tweets I collected, only ~400 had longitude and latitude coordinates; these locations are all plotted above. As can be seen, most geocoded tweets about this topic came from the eastern USA.

Thanks for reading!  If you would like to discuss any of the plots or have any questions or corrections, please write a comment. You are also welcome to email me at agalea91@gmail.com or tweet me @agalea91

 

[1] – My function is based on one I found on Stack Overflow.

[2] – I used these instructions to install Basemap on Windows 10 for Python 3.

44 thoughts on “Collecting Twitter data with Python”

  1. Hi Alexander,

    Interesting code/post… thanks. I’m getting a weird error:

    longs = list(df.loc[(df.long != ‘NaN’)].long)
    NameError: name ‘df’ is not defined

    With that last block of code under #add coordinates as red dots. Any thoughts off the top of your head?

    1. It looks like your twitter data isn’t in a dataframe named df. Referring to the code cell above the one you are looking at, I define a function called populate_tweet_df(tweets). Make sure you are including a line like this:

      df = populate_tweet_df(tweets)

      to create a dataframe with your tweet data.

      1. I get the same error too even though I included the defined function to create the dataframe. I tried viewing the dataframe by calling df.head(), but I get an error indicated that df is not defined. Any ideas on why this error is occuring?

      2. I get the same error. I wasn’t able to fix it using the line
        df = populate_tweet_df(tweets)
        Maybe I am putting the line in the wrong place? If I enter it after the function populate_tweet_df it says that pd is not defined. Maybe there is more that I need to import?

      3. Hey Rick, I’m sure I wasn’t clear enough with this since you and others are having issues. The function uses pandas as a dependency, so make sure to run “import pandas as pd” to load that library

      4. Hi Alexander – thank you! I missed some of the imports. In case anyone missed them I think they are:

        import json
        import pandas as pd
        import matplotlib
        import matplotlib.pyplot as plt
        from mpl_toolkits.basemap import Basemap

        Now the red dots aren’t showing up. But I’m sure this is due to something I missed- I am very new at this. Thank you so much for the tutorials!

    2. Thank you for the code. It works great, however, doesn’t identify geolocation properly. Any ideas why?

  2. thanks for you “AGALE91” but i didn’t find how to make the search with emoticons
    i tried this but it doesn’t work
    try:
    import json
    except ImportError:
    import simplejson as json

    # Import the necessary methods from “twitter” library
    import twitter
    import nltk
    from twitter import OAuth, TwitterHTTPError, TwitterStream

    # Variables that contains the user credentials to access Twitter API
    ACCESS_TOKEN = ‘XXXXXXXXXXXXXXXXXXXXXXXXX’
    ACCESS_SECRET = ‘XXXXXXXXXXXXXXXXXXXXXXX’
    CONSUMER_KEY = ‘XXXXXXXXXXXXXXXXXXXXXXXXXX’
    CONSUMER_SECRET = ‘XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX’

    oauth = OAuth(ACCESS_TOKEN, ACCESS_SECRET, CONSUMER_KEY, CONSUMER_SECRET)

    # Initiate the connection to Twitter Streaming API
    twitter_stream = TwitterStream(auth=oauth)

    iterator = twitter_stream.statuses.filter(track=”Google”, language=”en”,emticon=”true”)
    tweet_count = 1
    for tweet in iterator:
    tweet_count -= 1
    print json.dumps(tweet)

    if tweet_count <= 0:
    break

  3. Hi Alexander,

    Great tutorial! Was just wondering if you had any idea off the top of your head why I keep encountering the error:
    File “location.py”, line 26, in
    df[‘text’] = list(map(lambda tweet: tweet[‘text’], tweets))
    TypeError: list indices must be integers, not str

  4. hi, how can i do to crate a function that saves in a csv file 50 tweets by their timestamp, the user and the tweet by searching a hashtag. Thank you

  5. Hello, I am very thankful for your code. Now I can crawl tweets that I need for topic modeling subject

    I have a question. When I investigated the result of the crawling process, I noticed that some tweets acquired are not from the boundary that I specified in the variable. Is this the behavior of Twitter API or is this from the code?

    1. Thanks Muhamad. I’m really not sure. One thing to consider is time zones. It’s best to gather more data and then filter down to the date range you want after.

  6. Thanks Alex.

    I was able to run the script and get twitter data. However, there is one issue that needs to be addressed: some of the texts would stop when reaching a certain number of words and end with “…”. Is there any way that you could improve this? Otherwise, the data would be the best data ever!

    1. Hey Xunyi, are you referring to the text in the .json file? If you were to load this file into python and look at the text field, is this where you see the … ? Please look into this and let me know if the issue exists there

      1. Thanks for replying, Agalea91!
        Yes, I am referring to the text in the .json file. In json file, it does not show “…” in the end. Instead, it shows “\u2026”. But in .csv file, I will get “…” at the same place.

        This is an example:

        In .json file:
        “text”: “RT @CUPE129: This #RemembranceDay @CUPE129 was represented by veterans Sean Ward and Chris Boal. Join us in honouring those who in past and\u2026”

        In .csv file:
        RT @CUPE129: This #RemembranceDay @CUPE129 was represented by veterans Sean Ward and Chris Boal. Join us in honouring those who in past and…

        Thanks again!
        Xunyi

      2. Hmm thanks for looking into this more. Yeah it’s definitely an issue. I did a quick google search and found this resource (among others) – https://github.com/tweepy/tweepy/issues/935

        Try looking for a field called “full_text” in the .json file. Otherwise it would require changing some code in the script – which has been unmaintained for quite some time. There appears to be some solutions though in that link above. If you add a solution to my codebase, please let me know or submit a pull request for me to commit here: https://github.com/agalea91/twitter_search

        Cheers,
        Alex

  7. Hi @agalea89 . Your code is so awesome for me!

    But i just do the tweet search part, I got a problem.

    I already changed the geocode to 3.127887, 101.594489,70km (selangor). But after I check the json result, it also give the tweets from others location. How to fix this issue ya? 🙂

  8. Thank you Alex. It is rare to find such complete and nicely crafted code examples with real utility. Just looking through some of the raw json I find instances of “place” elements not null where “coordinates” and “location” are null. I can figure that out now, riding down the highway you paved for us. Cheers!

  10. import json

    tweet_files = [‘file_1.json’, ‘file_2.json’, …]
    tweets = []
    for file in tweet_files:
    with open(file, ‘r’) as f:
    for line in f.readlines():
    tweets.append(json.loads(line))

    We now have a dictionary named “tweets”.

    In this code Tweets is a list, not a dictionary. I am reading something wrong here as because the tweets data which is a list needs to be finally loaded into a dataframe, need some suggestion around this pls.

  11. Please help me to fix this

    NameError Traceback (most recent call last)
    in
    15
    16 # add coordinates as red dots
    —> 17 longs = list(df.loc[(df.long != ‘NaN’)].long)
    18 latts = list(df.loc[df.latt != ‘NaN’].latt)
    19 x, y = my_map(longs, latts)

    NameError: name ‘df’ is not defined

  12. Hello, I am very thankful for your code.. but how to generate tweet[‘ text ‘] without anything else in its json file?
