Scraping Tweets with Tweepy in Python

This is a step-by-step guide to scraping Twitter tweets using the Python library Tweepy.

Case Study: Hong Kong Protest Movement 2019

In this example, we will extract tweets related to the Hong Kong Protest Movement 2019, which I have written an analysis on. The code can be configured to suit your own needs.

The first order of business was to obtain the tweets. I considered and tried out tools such as Octoparse, but they either only support Windows (I am using a MacBook), were unreliable, or only allow you to download a limited number of tweets unless you subscribe to a plan. In the end, I threw those ideas into the bin and decided to do it myself.

[GIF: Thanos saying "Fine, I'll do it myself." Source: https://tenor.com/view/thanos-fine-ill-do-it-myself-gif-11168108]

I tried out a few Python libraries and settled on Tweepy. It was the only library that did not throw any errors in my environment, and it was quite easy to get things going. One downside is that I could not find any documentation listing the parameter values for pulling specific metadata out of a tweet; I only managed to get most of the fields I needed after a few rounds of trial and error.

Prerequisites: Setting up a Twitter Developer Account

Before you start using Tweepy, you will need a Twitter Developer Account in order to call Twitter's APIs. Just follow the sign-up instructions, and after some time (only a few hours for me) they will grant you access.

[Image: the Twitter Developer app dashboard. You can view this page after you have been granted access and created an app.]

You will need 4 pieces of information ready: the API key, API secret key, Access token, and Access token secret.

Import Libraries

Switch over to Jupyter Notebook and import the following libraries:

from tweepy import OAuthHandler
from tweepy.streaming import StreamListener
import tweepy
import json
import pandas as pd
import csv
import re
from textblob import TextBlob   # sentiment analysis, used in later analysis
import string
import preprocessor as p        # tweet-preprocessor, for cleaning tweet text later
import os
import time

Authenticating Twitter API

If you run into any authentication errors, regenerate your keys and try again.

# Twitter credentials
# Obtain them from your Twitter developer account
consumer_key = "<your_consumer_key>"
consumer_secret = "<your_consumer_secret_key>"
access_key = "<your_access_key>"
access_secret = "<your_access_secret_key>"

# Pass your Twitter credentials to Tweepy via its OAuthHandler
auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_key, access_secret)
api = tweepy.API(auth)
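
Before scraping anything, it is worth confirming that the handshake worked. A minimal sanity check, assuming Tweepy 3.x (where authentication failures raise TweepError):

# Confirm the credentials are valid by fetching the authenticated user
try:
    me = api.verify_credentials()
    print('Authenticated as @{}'.format(me.screen_name))
except tweepy.TweepError as e:
    print('Authentication failed:', e)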

Batch Scraping

Due to the limited number of API calls one can make with a basic, free developer account (~900 calls every 15 minutes before your access is denied), I created a function that extracts 2,500 tweets per run, once every 15 minutes (I tried extracting 3,000 and above, but that got me denied after the second batch). In this function you specify the:

  1. search parameters, such as keywords and hashtags
  2. starting date, after which all tweets will be extracted (you can only extract tweets that are no older than the last 7 days)
  3. number of tweets to pull per run
  4. number of runs, which happen once every 15 minutes
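
As an aside, instead of sleeping manually between batches, Tweepy can also wait out rate limits for you. This is a minimal sketch assuming Tweepy 3.x, whose API constructor accepts these two flags (in Tweepy 4.x, wait_on_rate_limit_notify was removed):

# Let Tweepy sleep automatically whenever a rate limit is hit,
# printing a notice when it does (Tweepy 3.x constructor flags)
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)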

I only extracted the metadata that I deemed relevant to my case. You can explore the full list of metadata on the objects returned by the tweepy.Cursor in detail (this is the really messy part).
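
If you want to see every field a tweet carries before deciding what to keep, one trick is to dump a single tweet's raw payload. Treat this as a sketch for exploration only: _json is an internal attribute on Tweepy's Status model rather than a documented API, and the query here is just an example:

# Fetch one tweet and pretty-print its raw JSON payload to discover
# the available metadata fields (uses the json import from earlier)
sample = next(tweepy.Cursor(api.search, q='#hongkong', lang='en',
                            tweet_mode='extended').items(1))
print(json.dumps(sample._json, indent=2))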

def scraptweets(search_words, date_since, numTweets, numRuns):

    # We cannot make one large API call in a single go, so we loop over
    # numRuns batches at regular intervals.

    # Define a pandas dataframe to store the data:
    db_tweets = pd.DataFrame(columns = ['username', 'acctdesc', 'location', 'following',
                                        'followers', 'totaltweets', 'usercreatedts', 'tweetcreatedts',
                                        'retweetcount', 'text', 'hashtags']
                                )
    program_start = time.time()
    for i in range(0, numRuns):
        # Time how long it takes to scrape tweets for each run:
        start_run = time.time()

        # Collect tweets using the Cursor object.
        # .Cursor() returns an object you can iterate or loop over to access the data collected.
        # Each item in the iterator has attributes holding information about the tweet.
        tweets = tweepy.Cursor(api.search, q=search_words, lang="en",
                               since=date_since, tweet_mode='extended').items(numTweets)

        # Store these tweets in a Python list
        tweet_list = [tweet for tweet in tweets]

        # Metadata of interest (attributes to call on each tweet):
        # user.screen_name - Twitter handle
        # user.description - description of account
        # user.location - where the user is tweeting from
        # user.friends_count - no. of other users the user is following (following)
        # user.followers_count - no. of other users following this user (followers)
        # user.statuses_count - total tweets by the user
        # user.created_at - when the user account was created
        # created_at - when the tweet was created
        # retweet_count - no. of retweets
        # (deprecated) user.favourites_count - probably the total no. of tweets favourited by the user
        # retweeted_status.full_text - full text of the tweet
        # entities['hashtags'] - hashtags in the tweet

        # Begin scraping the tweets individually:
        noTweets = 0
        for tweet in tweet_list:
            # Pull the values
            username = tweet.user.screen_name
            acctdesc = tweet.user.description
            location = tweet.user.location
            following = tweet.user.friends_count
            followers = tweet.user.followers_count
            totaltweets = tweet.user.statuses_count
            usercreatedts = tweet.user.created_at
            tweetcreatedts = tweet.created_at
            retweetcount = tweet.retweet_count
            hashtags = tweet.entities['hashtags']

            try:
                text = tweet.retweeted_status.full_text
            except AttributeError:  # Not a retweet
                text = tweet.full_text

            # Add the 11 variables to a list - ith_tweet:
            ith_tweet = [username, acctdesc, location, following, followers, totaltweets,
                         usercreatedts, tweetcreatedts, retweetcount, text, hashtags]

            # Append to the dataframe - db_tweets
            db_tweets.loc[len(db_tweets)] = ith_tweet

            # Increase counter - noTweets
            noTweets += 1

        # Run ended:
        end_run = time.time()
        duration_run = round((end_run - start_run)/60, 2)

        print('no. of tweets scraped for run {} is {}'.format(i + 1, noTweets))
        print('time taken for run {} to complete is {} mins'.format(i + 1, duration_run))

        # Sleep for just over 15 minutes before the next batch
        time.sleep(920)

    # Once all runs have completed, save everything to a single csv file:
    from datetime import datetime

    # Obtain a timestamp in a readable format
    to_csv_timestamp = datetime.today().strftime('%Y%m%d_%H%M%S')

    # Define the working path and filename
    path = os.getcwd()
    filename = path + '/data/' + to_csv_timestamp + '_sahkprotests_tweets.csv'

    # Store the dataframe in a csv with a creation-date timestamp
    db_tweets.to_csv(filename, index = False)

    program_end = time.time()
    print('Scraping has completed!')
    print('Total time taken to scrape is {} minutes.'.format(round((program_end - program_start)/60, 2)))

With this function, I usually performed 6 runs in total, each extracting 2,500 tweets. One full round takes approximately 2.5 hours (the six 15-minute sleeps alone account for about 1.5 hours of that) and yields 15,000 tweets. Not bad.

Specific to the protests, I surveyed Twitter and noted the most common hashtags users were including in their tweets. I then used a combination of these related hashtags as my search criteria.

Note that hashtags not defined in your 'search_words' parameter can still appear in the results, because users often include several hashtags in a single tweet.
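
Once a run has completed and saved its csv, you can check which hashtags actually turned up. A rough sketch (the filename below is hypothetical; the hashtags column holds the entity dicts from Tweepy, which to_csv serialises as strings):

import ast
from collections import Counter

# Load one saved batch (replace with an actual filename from your data/ folder)
df = pd.read_csv('data/20191103_120000_sahkprotests_tweets.csv')

# Each row of 'hashtags' is a stringified list of entity dicts; parse and tally
counts = Counter()
for entities in df['hashtags']:
    for h in ast.literal_eval(entities):
        counts[h['text'].lower()] += 1

print(counts.most_common(10))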

# Initialise these variables:
search_words = "#hongkong OR #hkprotests OR #freehongkong OR #hongkongprotests OR #hkpolicebrutality OR #antichinazi OR #standwithhongkong OR #hkpolicestate OR #HKpoliceterrorist OR #standwithhk OR #hkpoliceterrorism"
date_since = "2019-11-03"
numTweets = 2500
numRuns = 6
# Call the function scraptweets
scraptweets(search_words, date_since, numTweets, numRuns)

I have been running the above script once daily since 3 Nov 2019 and have since amassed more than 200k tweets. The following are the first 5 rows of the dataset:

[Image: the first five rows of the scraped tweets dataset]
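
Since each day's run writes its own csv, here is a minimal sketch for stitching them into one dataframe (assuming all files live in the data/ folder and follow the scraper's naming scheme):

import glob

# Collect every daily csv produced by scraptweets and concatenate them
files = sorted(glob.glob('data/*_sahkprotests_tweets.csv'))
all_tweets = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)

print(all_tweets.shape)
print(all_tweets.head())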
