Collecting Twitter data with Python

Mining for tweets

 

This post gives a general overview of how my Python 3 tweet-searching script works. Twitter limits the maximum age of searchable tweets to roughly one week, so the script can only find tweets posted up to just over a week ago. Twitter also limits the maximum number of tweets that can be downloaded in every 15-minute interval.

The Python script twitter_search.py will search for tweets and save them to a JSON formatted file. When an exception is raised (i.e., the maximum number of tweets has been downloaded for the current 15-minute interval), the script pauses for 15 minutes and then continues. This repeats for as long as tweets matching the query are found. The tweet creation dates are used to label the JSON output files. The search limits must be specified: a maximum age in days (up to about 8) and a minimum age (as low as 0, i.e., “right now”). I prefer to collect tweets over one-day intervals so that each day is exported to its own file.

 

Dependencies

I used Python 3 for this project; if you do not have Python, I would recommend installing it via the Anaconda distribution. Other dependencies are Tweepy 3.5.0 (a library for accessing the Twitter API) and a personal Twitter “data-mining” application (which is very easy to set up). I used this guide to register my app. You will need to register your own in order to generate a consumer key, consumer secret, access token, and access secret; these are required to authenticate the script with the Twitter API.
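If you have pip available, Tweepy can be installed from the command line (the version pin matches what I used but is optional):

pip install tweepy==3.5.0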

 

Running the script

You can download my Python tweet searching/saving script using Git Shell:

git clone https://github.com/agalea91/twitter_search

or directly from its git repository.

Open the twitter_search.py file and then find the load_api() function (at the top) and add your consumer key, consumer secret, access token, and access secret. For example:

consumer_key = '189YcjF4IUzF156RGNGNucDD8'
consumer_secret = 'e4KPiY4pSh03HxjDg782HupUjmzdOOSDd98hd'
access_token = '2543812-cpaIuwndjvbdjaDDp5izzndhsD7figa9gb'
access_secret = '4hdyfnas7d988ddjf87sJdj3Dxn4d5CcNpwe'

These are not my actual credentials, as they should be kept private.

Before running the script, go to the main() function and edit the search criteria. Namely, you should enter a search phrase, the maximum time limit for the script to run, and the date range for the search (relative to today). For example:

search_phrase = '#makedonalddrumpfagain'
time_limit = 1.0 # runtime limit in hours
min_days_old, max_days_old = 1, 2 # search limits

# e.g. min_days_old, max_days_old = 7, 8
# gives the current weekday from last week,
# min_days_old=0 will search from right now

I found that max_days_old=9 was the largest value possible.

To run the script, open the terminal/command line to the file location and type:

python twitter_search.py

The script will search for tweets and save them to a JSON file until they have all been found or the time limit has been exceeded.

 

twitter_search.py functions

The main program is contained within the main() function, which is called automatically when running the script from the command line.  This part of the code is not shown below. Instead we only discuss the other functions.  Before we get started I’ll list the libraries used in the script:

import tweepy
from tweepy import OAuthHandler
import json
import datetime as dt
import time
import os
import sys

 

Firstly, the function load_api() authenticates the user and returns the Tweepy API wrapper:

def load_api():
    ''' Function that loads the twitter API after authorizing
        the user. '''

    consumer_key = '189YcjF4IUzF156RGNGNucDD8'
    consumer_secret = 'e4KPiY4pSh03HxjDg782HupUjmzdOOSDd98hd'
    access_token = '2543812-cpaIuwndjvbdjaDDp5izzndhsD7figa9gb'
    access_secret = '4hdyfnas7d988ddjf87sJdj3Dxn4d5CcNpwe'
    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_secret)
    # load the twitter API via tweepy
    return tweepy.API(auth)

By running api = load_api() we can access Twitter’s search function, e.g. api.search(q='#makedonalddrumpfagain').
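For example, a quick check that authentication is working (assuming your credentials have been filled in):

api = load_api()
results = api.search(q='#makedonalddrumpfagain', count=5)
for tweet in results:
    print(tweet.created_at, tweet.text)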

 

Twitter limits the maximum number of tweets returned per search to 100. We use a function [1] called tweet_search() that searches for up to max_tweets=100 tweets:

def tweet_search(api, query, max_tweets, max_id, since_id, geocode):
    ''' Function that takes in a search string 'query', the maximum
        number of tweets 'max_tweets', and the minimum (i.e., starting)
        tweet id. It returns a list of tweepy.models.Status objects. '''

    searched_tweets = []
    while len(searched_tweets) < max_tweets:
        remaining_tweets = max_tweets - len(searched_tweets)
        try:
            new_tweets = api.search(q=query, count=remaining_tweets,
                                    since_id=str(since_id),
                                    max_id=str(max_id-1))
#                                    geocode=geocode)
            print('found',len(new_tweets),'tweets')
            if not new_tweets:
                print('no tweets found')
                break
            searched_tweets.extend(new_tweets)
            max_id = new_tweets[-1].id
        except tweepy.TweepError:
            print('exception raised, waiting 15 minutes')
            print('(until:', dt.datetime.now()+dt.timedelta(minutes=15), ')')
            time.sleep(15*60)
            break # stop the loop
    return searched_tweets, max_id

This function loops over the api.search() call because a single call can return fewer than 100 tweets, so it is repeated until max_tweets have been collected (or no more are found). In the main program we loop over this function until the rate-limit exception is raised, at which point the script sleeps for 15 minutes before continuing.

The search can be limited to a specific radial area around latitude/longitude coordinates by uncommenting the geocode line and defining the parameter appropriately. For example, nearly all states in America fall within the geocode '39.8,-95.583068847656,2500km'. The issue is that the vast majority of tweets are not geocoded and would therefore be excluded.
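For example, the same restriction can be tested by calling api.search() directly with the geocode parameter (or by passing the string through tweet_search() once the geocode line is uncommented):

# restrict results to a ~2500 km radius around the central USA
usa_tweets = api.search(q='#makedonalddrumpfagain', count=100,
                        geocode='39.8,-95.583068847656,2500km')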

 

The api.search() function can start from a given tweet ID or date and will always search back in time. If we are appending the tweet data to an already existing JSON file, the “starting” tweet ID is defined based on the last tweet appended to the file (this is done in the main program). Otherwise we run the function get_tweet_id() to find the ID of a tweet that was posted at the end of a given day and this is used as the starting point for the search.

def get_tweet_id(api, date='', days_ago=9, query='a'):
    ''' Function that gets the ID of a tweet. This ID can
        then be used as a 'starting point' from which to
        search. The query is required and has been set to
        a commonly used word by default. The variable
        'days_ago' has been initialized to the maximum amount
        we are able to search back in time (9).'''

    if date: # return an ID from the end of the given day
        td = date + dt.timedelta(days=1)
        tweet_date = '{0}-{1:0>2}-{2:0>2}'.format(td.year, td.month, td.day)
        tweet = api.search(q=query, count=1, until=tweet_date)
    else:
        # return an ID from __ days ago
        td = dt.datetime.now() - dt.timedelta(days=days_ago)
        tweet_date = '{0}-{1:0>2}-{2:0>2}'.format(td.year, td.month, td.day)
        # get list of up to 10 tweets
        tweet = api.search(q=query, count=10, until=tweet_date)
    print('search limit (start/stop):', tweet[0].created_at)
    # return the id of the first tweet in the list
    return tweet[0].id
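Putting these together, here is a minimal driver sketch under the same assumptions (the variable names and the since_id placeholder are illustrative; the actual loop, including writing results to a file, lives in main()):

api = load_api()
start_id = get_tweet_id(api, days_ago=2)   # newest tweet ID to search back from
end_time = dt.datetime.now() + dt.timedelta(hours=1)
all_tweets = []
while dt.datetime.now() < end_time:
    new_tweets, start_id = tweet_search(api, '#makedonalddrumpfagain', 100,
                                        max_id=start_id, since_id=1,
                                        geocode=None)
    if not new_tweets:
        break
    all_tweets.extend(new_tweets)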

 

After each call of tweet_search() in the main program, we append the new tweets to a file in JSON format:

def write_tweets(tweets, filename):
    ''' Function that appends tweets to a file. '''

    with open(filename, 'a') as f:
        for tweet in tweets:
            json.dump(tweet._json, f)
            f.write('\n')
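For example, a hypothetical call from the main loop, labelling the output file with the day being collected (the filename pattern here is illustrative, not the script's exact format):

day = dt.datetime.now() - dt.timedelta(days=1)
fname = 'makedonalddrumpfagain_{}.json'.format(day.date())
write_tweets(new_tweets, fname)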

 

The resulting JSON file can easily (although not necessarily quickly) be read and converted to a Pandas dataframe for analysis.

 

Reading in JSON files to a dataframe

The twitter_search.py file is only used for collecting tweets; I use IPython notebooks for the analysis. First we’ll need to read the JSON file(s):

import json

tweet_files = ['file_1.json', 'file_2.json', ...]
tweets = []
for file in tweet_files:
    with open(file, 'r') as f:
        for line in f.readlines():
            tweets.append(json.loads(line))
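A quick sanity check after loading (the field names below are standard Twitter API JSON fields):

print(len(tweets), 'tweets loaded')
print(tweets[0]['created_at'], tweets[0]['text'])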

 

We now have a list of dictionaries named “tweets”, one per tweet. This can be accessed to create a dataframe with the required information. We’ll include the location information of the user who published the tweet as well as the coordinates (if available) and the tweet text.

import pandas as pd

def populate_tweet_df(tweets):
    df = pd.DataFrame()

    df['text'] = list(map(lambda tweet: tweet['text'], tweets))

    df['location'] = list(map(lambda tweet: tweet['user']['location'], tweets))

    df['country_code'] = list(map(lambda tweet: tweet['place']['country_code']
                                  if tweet['place'] != None else '', tweets))

    df['long'] = list(map(lambda tweet: tweet['coordinates']['coordinates'][0]
                        if tweet['coordinates'] != None else 'NaN', tweets))

    df['latt'] = list(map(lambda tweet: tweet['coordinates']['coordinates'][1]
                        if tweet['coordinates'] != None else 'NaN', tweets))

    return df
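For example, to build the dataframe from the tweets loaded above:

df = populate_tweet_df(tweets)
print(df.head())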

 

Data analysis: plotting tweet coordinates

We can now, for example, plot the locations from which the tweets were sent using the Basemap library (which must be manually installed [2]).

import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap

# assumes df was created above with df = populate_tweet_df(tweets)

# plot the blank world map
my_map = Basemap(projection='merc', lat_0=50, lon_0=-100,
                     resolution = 'l', area_thresh = 5000.0,
                     llcrnrlon=-140, llcrnrlat=-55,
                     urcrnrlon=160, urcrnrlat=70)
# set resolution='h' for high quality

# draw elements onto the world map
my_map.drawcountries()
#my_map.drawstates()
my_map.drawcoastlines(antialiased=False,
                      linewidth=0.005)

# add coordinates as red dots
longs = list(df.loc[(df.long != 'NaN')].long)
latts = list(df.loc[df.latt != 'NaN'].latt)
x, y = my_map(longs, latts)
my_map.plot(x, y, 'ro', markersize=6, alpha=0.5)

plt.show()
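To also save the figure to disk, a line like the following can be added just before plt.show() (the filename is arbitrary):

plt.savefig('tweet_locations_world.png', dpi=200, bbox_inches='tight')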

 

In the next post we’ll look at a politically inspired analysis of tweets posted with the hashtag #MakeDonaldDrumpfAgain. The phrase was trending a couple of weeks ago in reaction to an episode of HBO’s “Last Week Tonight” with John Oliver. The phrase represents a negative sentiment towards Donald Trump – a Republican candidate in the upcoming American election. I’ve collected every #MakeDonaldDrumpfAgain tweet since the video was posted and, using the plotting script above, produced this illustration of tweet locations:

[Figure: drumpf_tweet_locations_world – world map of geotagged #MakeDonaldDrumpfAgain tweets]

 

 

Of the 550,000+ tweets I collected, only ~400 had longitude and latitude coordinates; these locations are all plotted above. As can be seen, most geocoded tweets about this topic came from the eastern USA.

Thanks for reading!  If you would like to discuss any of the plots or have any questions or corrections, please write a comment. You are also welcome to email me at agalea91@gmail.com or tweet me @agalea91

 

[1] – My function is based on one I found on Stack Overflow.

[2] – I used these instructions to install Basemap on Windows 10 for Python 3.

44 thoughts on “Collecting Twitter data with Python”

  1. Hi Alexander,

    Interesting code/post… thanks. I’m getting a weird error:

    longs = list(df.loc[(df.long != ‘NaN’)].long)
    NameError: name ‘df’ is not defined

    With that last block of code under #add coordinates as red dots. Any thoughts off the top of your head?

    1. It looks like your twitter data isn’t in a dataframe named df. Referring to the code cell above the one you are looking at, I define a function called populate_tweet_df(tweets). Make sure you are including a line like this:

      df = populate_tweet_df(tweets)

      to create a dataframe with your tweet data.

      1. I get the same error too even though I included the defined function to create the dataframe. I tried viewing the dataframe by calling df.head(), but I get an error indicated that df is not defined. Any ideas on why this error is occuring?

      2. I get the same error. I wasn’t able to fix it using the line
        df = populate_tweet_df(tweets)
        Maybe I am putting the line in the wrong place? If I enter it after the function populate_tweet_df it says that pd is not defined. Maybe there is more that I need to import?

      3. Hey Rick, I’m sure I wasn’t clear enough with this since you and others are having issues. The function uses pandas as a dependency, so make sure to run “import pandas as pd” to load that library

      4. Hi Alexander – thank you! I missed some of the imports. In case anyone missed them I think they are:

        import json
        import pandas as pd
        import matplotlib
        import matplotlib.pyplot as plt
        from mpl_toolkits.basemap import Basemap

        Now the red dots aren’t showing up. But I’m sure this is due to something I missed- I am very new at this. Thank you so much for the tutorials!

    2. Thank you for the code. It works great, however, doesn’t identify geolocation properly. Any ideas why?

  2. thanks for you “AGALE91” but i didn’t find how to make the search with emoticons
    i tried this but it doesn’t work
    try:
    import json
    except ImportError:
    import simplejson as json

    # Import the necessary methods from “twitter” library
    import twitter
    import nltk
    from twitter import OAuth, TwitterHTTPError, TwitterStream

    # Variables that contains the user credentials to access Twitter API
    ACCESS_TOKEN = ‘XXXXXXXXXXXXXXXXXXXXXXXXX’
    ACCESS_SECRET = ‘XXXXXXXXXXXXXXXXXXXXXXX’
    CONSUMER_KEY = ‘XXXXXXXXXXXXXXXXXXXXXXXXXX’
    CONSUMER_SECRET = ‘XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX’

    oauth = OAuth(ACCESS_TOKEN, ACCESS_SECRET, CONSUMER_KEY, CONSUMER_SECRET)

    # Initiate the connection to Twitter Streaming API
    twitter_stream = TwitterStream(auth=oauth)

    iterator = twitter_stream.statuses.filter(track=”Google”, language=”en”,emticon=”true”)
    tweet_count = 1
    for tweet in iterator:
    tweet_count -= 1
    print json.dumps(tweet)

    if tweet_count <= 0:
    break

  3. Hi Alexander,

    Great tutorial! Was just wondering if you had any idea off the top of your head why I keep encountering the error:
    File “location.py”, line 26, in
    df[‘text’] = list(map(lambda tweet: tweet[‘text’], tweets))
    TypeError: list indices must be integers, not str

  4. hi, how can i do to crate a function that saves in a csv file 50 tweets by their timestamp, the user and the tweet by searching a hashtag. Thank you

  5. Hello, I am very thankful for your code. Now I can crawl tweets that I need for topic modeling subject

    I have a question. When I investigated the result of the crawling process, I noticed that some tweets acquired are not from the boundary that I specified in the variable. Is this the behavior of Twitter API or is this from the code?

    1. Thanks Muhamad. I’m really not sure. One thing to consider is time zones. It’s best to gather more data and then filter down to the date range you want after.

  6. Thanks Alex.

    I was able to run the script and get twitter data. However, there is one issue that needs to be addressed: some of the texts would stop when reaching a certain number of words and end with “…”. Is there any way that you could improve this? Otherwise, the data would be the best data ever!

    1. Hey Xunyi, are you referring to the text in the .json file? If you were to load this file into python and look at the text field, is this where you see the … ? Please look into this and let me know if the issue exists there

      1. Thanks for replying, Agalea91!
        Yes, I am referring to the text in the .json file. In json file, it does not show “…” in the end. Instead, it shows “\u2026”. But in .csv file, I will get “…” at the same place.

        This is an example:

        In .json file:
        “text”: “RT @CUPE129: This #RemembranceDay @CUPE129 was represented by veterans Sean Ward and Chris Boal. Join us in honouring those who in past and\u2026”

        In .csv file:
        RT @CUPE129: This #RemembranceDay @CUPE129 was represented by veterans Sean Ward and Chris Boal. Join us in honouring those who in past and…

        Thanks again!
        Xunyi

      2. Hmm thanks for looking into this more. Yeah it’s definitely an issue. I did a quick google search and found this resource (among others) – https://github.com/tweepy/tweepy/issues/935

        Try looking for a field called “full_text” in the .json file. Otherwise it would require changing some code in the script – which has been unmaintained for quite some time. There appears to be some solutions though in that link above. If you add a solution to my codebase, please let me know or submit a pull request for me to commit here: https://github.com/agalea91/twitter_search

        Cheers,
        Alex

  7. Hi @agalea89 . Your code is so awesome for me!

    But i just do the tweet search part, I got a problem.

    I already changed the geocode to 3.127887, 101.594489,70km (selangor). But after I check the json result, it also give the tweets from others location. How to fix this issue ya? 🙂

  8. Thank you Alex. It is rare to find such complete and nicely crafted code examples with real utility. Just looking through some of the raw json I find instances of “place” elements not null where “coordinates” and “location” are null. I can figure that out now, riding down the highway you paved for us. Cheers!

  10. import json

    tweet_files = [‘file_1.json’, ‘file_2.json’, …]
    tweets = []
    for file in tweet_files:
    with open(file, ‘r’) as f:
    for line in f.readlines():
    tweets.append(json.loads(line))

    We now have a dictionary named “tweets”.

    In this code Tweets is a list, not a dictionary. I am reading something wrong here as because the tweets data which is a list needs to be finally loaded into a dataframe, need some suggestion around this pls.

  11. Please help me to fix this

    NameError Traceback (most recent call last)
    in
    15
    16 # add coordinates as red dots
    —> 17 longs = list(df.loc[(df.long != ‘NaN’)].long)
    18 latts = list(df.loc[df.latt != ‘NaN’].latt)
    19 x, y = my_map(longs, latts)

    NameError: name ‘df’ is not defined

  12. Hello, I am very thankful for your code.. but how to generate tweet[‘ text ‘] without anything else in its json file?
