# Comprehensive Guide to Twitter Data Scraping with Twitter Toolbox
## Chapter 1: Introduction to Twitter Toolbox
Introducing the Twitter Toolbox, a robust platform aimed at streamlining the process of obtaining, preprocessing, and analyzing data from Twitter.
With recent price hikes for the Twitter API, acquiring data has become increasingly costly. A strategic approach to scraping, however, lets you work around these limitations: you can make requests flexibly without hitting rate limits and even access historical data, privileges that would otherwise cost $5,000 per month with Twitter's premium API. In this article, I will walk you through a clever yet straightforward technique for scraping Twitter data, covering time frames, specific users, hashtags, and keyword-based searches. Let's get started!
## Chapter 2: Understanding Twitter Toolbox's Functionality
The Twitter Toolbox is part of a new suite of tools dedicated to analyzing Twitter data. This toolbox arose as a response to the recent shifts in Twitter’s API and user interface following Elon Musk’s acquisition. As many existing libraries lagged in updates, I took it upon myself to develop this all-encompassing solution for Twitter data analysis, ensuring that it adapts to the latest changes on the platform.
If you haven't checked out my previous article on streaming Twitter data, I recommend reading it first.
### Section 2.1: Key Features of the Toolbox
The Twitter Toolbox provides a wide range of functionalities aimed at simplifying data acquisition, preprocessing, and analysis.
- Data Acquisition: This includes features for streaming and scraping data, making API calls, and processing tweets. These functionalities allow you to gather diverse data from Twitter in real-time or from historical records.
- Preprocessing: Essential for ensuring the accuracy and dependability of your analyses, the toolbox offers features for data cleaning, language filtering, labeling, and group generation tailored to your requirements.
- Natural Language Processing (NLP): This feature analyzes tweet content to extract meaningful insights. The toolbox includes sentiment analysis, emotion recognition, topic identification, and named entity recognition, providing tools to gauge public sentiment, identify emotional trends, and recognize key entities like organizations or individuals.
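To give a feel for what a sentiment layer produces, here is a small illustrative example using NLTK's VADER analyzer. This is a stand-in for demonstration purposes, not the toolbox's own implementation.

```python
# Illustrative stand-in for sentiment scoring; not the toolbox's own module.
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # one-time download of the VADER lexicon

sia = SentimentIntensityAnalyzer()
tweet = "Loving the new update, everything feels so much faster!"
print(sia.polarity_scores(tweet))
# e.g. {'neg': 0.0, 'neu': 0.5, 'pos': 0.5, 'compound': 0.8}  (values approximate)
```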
This toolbox is continuously evolving, with new features introduced as they become available. All code is open-source on my GitHub repository, and I will publish updates regularly to help guide you through new features.
Note: This project integrates and enhances existing libraries as necessary, giving due credit to the original sources in the relevant coding files.
### Section 2.2: Setting Up the Toolbox
To get started with the Twitter Toolbox, you need to download or clone the project from the GitHub repository, ensuring your system has all required dependencies. The data scraping feature necessitates the following libraries:
- selenium
- pandas
- requests
- python-dotenv
- argparse
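If you prefer to install these manually rather than through a requirements file, a single pip command covers the third-party packages (argparse already ships with the Python 3 standard library, so it needs no separate install):

```bash
pip install selenium pandas requests python-dotenv
```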
Additionally, you’ll need to install ChromeDriver for web navigation. Installation instructions are provided for both Mac and Linux systems.
For Mac users:
```bash
# Use Homebrew
brew install --cask chromedriver
# Verify installation
chromedriver --version
# Get the path of ChromeDriver
brew info chromedriver
```
For Linux users:
A shell script is provided for easy installation:
```bash
# Make the script executable (at the project's base)
chmod +x chromedriver.sh
# Run the script
./chromedriver.sh
# Verify installation
chromedriver --version
# Get path
which chromedriver
```
Before you start using the toolbox, ensure your environment is correctly set up. You will need to provide your credentials to connect and scrape data, as unauthenticated users are currently restricted from accessing Twitter's search functionalities. Save your credentials in a .env file in your working directory.
.env File:

```
USERNAME="YourUsername"
PASSWORD="YourPassword"
EMAIL="YourEmail@example.com"
CHROME_DRIVER_PATH="YourChromedriverPath"
```
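For reference, here is a minimal sketch of how these credentials can be read with python-dotenv. The repository's actual loading code may differ, but the variable names match the file above.

```python
# Minimal sketch of credential loading with python-dotenv;
# the toolbox's actual code may differ.
import os
from dotenv import load_dotenv

# Note: by default, load_dotenv does not override variables
# already set in the shell environment.
load_dotenv(".env")

username = os.getenv("USERNAME")
password = os.getenv("PASSWORD")
email = os.getenv("EMAIL")
chromedriver_path = os.getenv("CHROME_DRIVER_PATH")
```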
## Chapter 3: Utilizing the Toolbox for Data Scraping
To initiate scraping data from Twitter, execute the run-scraping.py file found in the src/dataAcquisition directory:
```bash
python3 src/dataAcquisition/run-scraping.py --env .env --a elonmusk --s 2023-01-01
```
In this command:
- --env or --e specifies the environment file with your credentials.
- --start or --s indicates the start date for scraping.
- --from_account or --a specifies the account you wish to investigate.
For instance, the command provided will scrape tweets from Elon Musk starting from January 1, 2023, to the present day.
You can customize the command using optional arguments:
- --end or --e: End date for scraping (default is today). Since --e also doubles as the short form of --env, the examples in this article use the long flags to avoid ambiguity.
- --interval or --i: Interval between scraping (default is one day).
- --headless: Run the script without the Chrome UI.
- --only_id: Save only the tweet IDs collected.
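For instance, a run combining several of these options might look like the following (the account name and dates are placeholders, and I am assuming --interval is expressed in days):

```bash
# Scrape @elonmusk from 2023-01-01 to 2023-06-01 in one-week windows,
# without the Chrome UI, keeping only tweet IDs
python3 src/dataAcquisition/run-scraping.py --env .env --from_account elonmusk \
    --start 2023-01-01 --end 2023-06-01 --interval 7 --headless --only_id
```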
Various search methods are available for gathering tweets:
- By Account: Use --from_account or --a.
- By Hashtag: Use --hashtag or --h.
- By Word: Use --word or --w.
Note: Only one search type can be conducted at a time.
For example, to scrape tweets from the year 2021 on the hashtag #covid:
```bash
python3 src/dataAcquisition/run-scraping.py --env .env --h covid --s 2021-01-01 --end 2022-01-01
```
To gather a sample of English tweets from a specific date using the word "you":
```bash
python3 src/dataAcquisition/run-scraping.py --env .env --w you --s 2023-01-01 --end 2023-01-02
```
## Chapter 4: Analyzing the Output
Upon completion of the scraping process, a .csv file will be generated in the data/scraping directory. The filename encodes the search target and the requested date range:
- data/scraping/<user>/<user>_<start>_<end>.csv
- data/scraping/<hashtag>/<hashtag>_<start>_<end>.csv
- data/scraping/<word>/<word>_<start>_<end>.csv
The data structure includes:
- tweet_id: Unique identifier for each tweet.
- user_id: Unique identifier for the Twitter account that posted the tweet.
- created_at: Timestamp of when the tweet was posted.
- text: Content of the tweet.
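Loading the resulting file for analysis is straightforward with pandas; the path below is a hypothetical example that follows the naming scheme above.

```python
import pandas as pd

# Hypothetical file following the data/scraping/<target>/<target>_<start>_<end>.csv scheme
df = pd.read_csv("data/scraping/elonmusk/elonmusk_2023-01-01_2023-06-01.csv")

df["created_at"] = pd.to_datetime(df["created_at"])  # parse timestamps
print(df.columns.tolist())  # expected: tweet_id, user_id, created_at, text
print(df.head())
```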
Note: Unlike the streaming method, language assessment cannot be conducted through scraping. However, you can apply Natural Language Processing and language detection using tools available in the toolbox on GitHub.
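If you want a quick stand-alone check before turning to the toolbox's own filters, the langdetect package offers a simple heuristic. This is illustrative only, not the toolbox's implementation, and detection is unreliable on very short tweets.

```python
# Illustrative post-hoc language filtering with langdetect;
# not the toolbox's own implementation.
from langdetect import detect, LangDetectException

def is_english(text: str) -> bool:
    try:
        return detect(text) == "en"
    except LangDetectException:
        return False  # empty or undetectable text

# Applied to the DataFrame from the previous snippet:
# df = df[df["text"].apply(is_english)]
print(is_english("Hello world, how are you doing today?"))  # True (probabilistic)
```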
## Chapter 5: Error Handling and Interruptions
During the scraping process, you may encounter challenges like connection errors or timeouts. The script is designed to manage these issues effectively: if an error occurs, it will pause and attempt to refresh the page to continue scraping. It is built to be resilient, capable of running for extended periods without supervision.
If you need to interrupt the process (CTRL+C), the script will ensure a smooth exit, closing the CSV file to maintain data integrity.
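The pattern behind this behavior looks roughly like the following simplified sketch; scrape_once and the Selenium driver object are placeholders, and the repository's actual code differs in detail.

```python
# Simplified sketch of the retry / clean-exit pattern; not the repository's exact code.
import csv
import time

def resilient_scrape(driver, scrape_once, out_path: str) -> None:
    """Scrape until interrupted; on any error, back off, refresh, and retry."""
    with open(out_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        try:
            while True:  # runs indefinitely; exit only via CTRL+C
                try:
                    scrape_once(driver, writer)  # one scraping pass (placeholder)
                except Exception:
                    time.sleep(10)    # pause briefly...
                    driver.refresh()  # ...then reload the page and continue
        except KeyboardInterrupt:
            pass  # CTRL+C: leave the loop; the with-block closes the CSV cleanly
```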
When testing the scraping functionality for the first time, I recommend running it without the --headless argument, as Twitter may request a verification code sent to your email. Enter this code to gain access. After this initial setup, you can utilize the --headless option for convenience. Running without --headless also allows you to visually monitor the scraping process, which can help identify and troubleshoot potential issues.
Note: If you encounter any issues with the code, feel free to open a ticket on GitHub or leave a comment on this article.
## Chapter 6: Practical Use Cases
The functionality of the Twitter Toolbox can be applied in various scenarios:
- Brand Monitoring: Companies can track mentions of their brands or products in real time, enabling quick responses to customer inquiries or complaints.
- Trend Analysis: By collecting extensive tweet data over time, analysts can identify and explore trending topics.
- Sentiment Analysis: When paired with NLP techniques, the tool can assess public sentiment regarding specific topics, events, or individuals.
In my case, I am working on quantifying the level of information and entropy in language across different topics, users, and dates.
## Chapter 7: Behind the Scenes of the Code
Once launched, the script operates in an infinite loop, automatically reconnecting after any disconnection, error, or timeout. To exit the loop, you will need to terminate the script manually (CTRL+C).
The script can run continuously for several days until the scraping process is finished. If the script is stopped for any reason, it will resume and append to the existing CSV file upon relaunch, provided the same user, hashtag, or word and the same start and end dates are used.
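A simplified sketch of how this resume-and-append behavior can be implemented, assuming the last created_at value in the existing file marks the resume point (the repository's actual logic may differ):

```python
# Sketch of resume logic: pick up from the last scraped day if a file exists.
import os
import pandas as pd

def resume_start(csv_path: str, default_start: str) -> str:
    """Return the date to resume from: the last scraped day if the file
    exists and is non-empty, otherwise the requested start date."""
    if os.path.exists(csv_path):
        existing = pd.read_csv(csv_path)
        if not existing.empty:
            return pd.to_datetime(existing["created_at"]).max().strftime("%Y-%m-%d")
    return default_start

# Hypothetical path following the naming scheme from Chapter 4
print(resume_start("data/scraping/elonmusk/elonmusk_2023-01-01_2023-06-01.csv",
                   "2023-01-01"))
```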
Keep in mind that when searching by user, Twitter restricts the number of tweets retrieved per search to 50. Therefore, if you set a one-day interval, a maximum of 50 tweets can be collected daily.
## Chapter 8: Upcoming Feature: Tweet Hydration via Scraping
I am excited to announce an upcoming feature that will allow users to hydrate tweets through scraping. Given recent changes to the API, the former method of hydrating tweet IDs is now exclusive to paid accounts. This enhancement will create a free and straightforward means to retrieve and hydrate a comprehensive dataset of tweet IDs, ensuring compliance with Twitter's Terms of Service.
Disclaimer regarding Twitter Privacy Policy:
Using the Twitter Toolbox to extract data from Twitter's API must comply with Twitter’s Developer Agreement, Privacy Policy, and all relevant terms of service. Data, particularly user IDs and personal information, should be handled ethically and in accordance with applicable laws and regulations. The toolbox also includes a dehydration script (src/dataAcquisition/dehydrate-tweets.py) to ensure adherence to Twitter's TOS by only sharing tweet IDs. Misuse of this data may lead to severe consequences, including termination of API access. Remember, the responsibility for ethical and legal data use lies solely with you.
## Conclusion
This concludes the overview of the Twitter Toolbox's scraping feature. This tool is an efficient and user-friendly resource for gathering user activity on Twitter, making it invaluable for researchers, marketers, data analysts, and anyone interested in current language trends or market dynamics.
While this article highlights the scraping functionality, future articles will explore additional features of the Twitter Toolbox. I encourage you to visit the GitHub repository, propose new features, or report any issues. The toolbox is more than just a tool; it is a collaborative platform designed to grow with input and feedback from users like you.
© All rights reserved, June 2023, Siméon FEREZ