TACIT Twitter Crawler

From CSSL

Overview

The Twitter Crawler tool collects text from the Twitter website and writes that data into text files readable by automated text-analysis programs. The crawler collects tweets from the "live stream," which may include tweets up to a few days old; Twitter does not allow crawlers to collect older tweets.

Basic Tutorial: Collecting data using Twitter Crawler

Setting Up Account Authorization
Before you can start using the TACIT crawler, Twitter requires that you register an API application, which provides account-tied authorization credentials that grant the tool permission to crawl the site.
To acquire this authentication information from Twitter:
1. Go to http://twitter.com/apps and log into your Twitter account (or create an account if you don't already have one).
2. The application creation page will open. Click on Create Application.
3. Fill in the TACIT crawler specifications
application name: [name that will help you keep track of your data]
app description: [description that will help you keep track of your data]
website: http://cssl.usc.edu
callback url: [leave blank]
4. You will then see an overview page with details about the application you just set up for crawling. Click on the Keys and Access Tokens tab. Here you will find the information needed for the TACIT Twitter User Configuration form. Copy the Consumer Key and Consumer Secret into the TACIT preferences page.
5. At the bottom of the page under Token Actions, click Generate My Access Token. Copy the Access Token and Access Token Secret into the spaces provided on the TACIT Preferences page and click apply.
6. Your name should now appear in the grey User Name box if the information was entered successfully. Click OK to close the preferences page.
Step-by-step instructions with pictures for setting up the Twitter API and retrieving access tokens can be found here.
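The four values you copy into the preferences page together form one set of credentials; all four must be filled in before TACIT can authenticate. The following sketch illustrates this (the field names are illustrative, not TACIT's internal names):

```python
# Illustrative sketch: the four values copied from the
# "Keys and Access Tokens" tab into the TACIT preferences page.
# Placeholder strings stand in for your real credentials.
credentials = {
    "consumer_key": "YOUR_CONSUMER_KEY",
    "consumer_secret": "YOUR_CONSUMER_SECRET",
    "access_token": "YOUR_ACCESS_TOKEN",
    "access_token_secret": "YOUR_ACCESS_TOKEN_SECRET",
}

def is_complete(creds):
    """Check that none of the four required fields is missing or blank."""
    required = ("consumer_key", "consumer_secret",
                "access_token", "access_token_secret")
    return all(creds.get(k, "").strip() for k in required)
```

If any field is blank, authentication will fail, which is why your name appears in the User Name box only after all four values are entered successfully.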

Crawler Filters

Twitter data can be crawled based on keywords (Word Filter) or locations (Geo Filter) of interest.
The Word Filter allows you to crawl for tweets containing a specific keyword or set of keywords. To use multiple keywords, separate them with semicolons.
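The semicolon-separated filter string can be thought of as a simple list of search terms. A minimal sketch of how such a string breaks down into individual keywords (the function name is an assumption, not part of TACIT):

```python
def parse_keywords(filter_text):
    """Split a semicolon-separated Word Filter string into keywords,
    trimming whitespace and dropping empty entries."""
    return [w.strip() for w in filter_text.split(";") if w.strip()]

# Example: three keywords entered with varying spacing.
keywords = parse_keywords("climate; flood ;storm")
# keywords == ["climate", "flood", "storm"]
```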

The Geo Filter lets you enter a specific geolocation box (latitude and longitude) to crawl tweets that originate from a certain location. To determine the bounding box for your area of interest, you can use the tool available here. Change the Copy & Paste option in the bottom panel's drop-down menu to CSV. You can then enter the name of the city, state, or country you want to crawl, or click and drag the selection box on the map to the desired location. The geo-coordinate box for that area will appear at the bottom of the page; copy and paste these numbers into the Geo Filter box in the TACIT tool. If you would like to crawl from multiple distinct locations, add the geo-coordinate box for each location, separated by semicolons.
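Each bounding box is just four comma-separated numbers, and multiple boxes are joined with semicolons. A sketch of how such a filter string decomposes (the coordinate order shown, south-west longitude/latitude then north-east longitude/latitude, is an assumption based on the CSV output of typical bounding-box tools):

```python
def parse_geo_boxes(filter_text):
    """Parse semicolon-separated bounding boxes, each a CSV of four
    numbers (assumed here to be SW longitude, SW latitude,
    NE longitude, NE latitude)."""
    boxes = []
    for chunk in filter_text.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue
        coords = [float(v) for v in chunk.split(",")]
        if len(coords) != 4:
            raise ValueError(f"expected 4 coordinates, got {len(coords)}: {chunk!r}")
        boxes.append(tuple(coords))
    return boxes

# Two boxes: roughly Los Angeles, then roughly New York City.
boxes = parse_geo_boxes("-118.7,33.7,-118.1,34.3; -74.3,40.5,-73.7,40.9")
```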

Crawler Limits

The Time Limit lets you specify how long the tool should continue to run and crawl data, which is useful for events that are infrequent or that you expect to unfold over a certain time frame. TACIT will collect tweets as long as the program remains running on your computer and you remain connected to the Internet.

The Maximum Limit section limits the number of tweets to crawl. The Twitter crawler is set by default to download 10 tweets. This option can be changed to any number of tweets you are interested in collecting; the tool will continue to run until that number is reached.
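The interaction of the two limits can be sketched as a loop that stops at whichever bound is hit first. This is illustrative only; the function and parameter names are assumptions, not TACIT's internals:

```python
import time

def crawl(tweet_source, max_tweets=10, time_limit_seconds=None):
    """Collect tweets from an iterable until either the maximum count
    or the optional time limit is reached, whichever comes first."""
    collected = []
    deadline = None
    if time_limit_seconds is not None:
        deadline = time.monotonic() + time_limit_seconds
    for tweet in tweet_source:
        if deadline is not None and time.monotonic() >= deadline:
            break  # time limit reached before the count limit
        collected.append(tweet)
        if len(collected) >= max_tweets:
            break  # maximum number of tweets reached
    return collected
```

With no time limit set, the loop runs until the maximum count is reached, which matches the behavior described above.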

Specifying Tweet Attributes for Collection

Users can select which types of information they would like the crawler to collect about each tweet by checking the box next to the following attributes:
User Name: The name of the user who wrote the tweet.
Geo Location: The geographical latitude and longitude coordinates of the tweet's origin (when available).
Status ID: The unique numeric identifier Twitter assigns to the tweet.
Text: The text of the tweet.
Created At: The time and date when the tweet was created.
Language: The language used in the tweet text.
Re-tweet Number: The number of times the tweet has been retweeted by other Twitter users.
Favorite Count: The number of times the tweet has been favorited by other Twitter users.
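Checking attribute boxes effectively selects which fields of each tweet record end up in the output. A minimal sketch of that selection, using illustrative field names rather than TACIT's exact JSON keys:

```python
def select_attributes(tweet, selected):
    """Keep only the checked attributes from a crawled tweet record.
    Field names here are illustrative, not TACIT's exact JSON keys."""
    return {k: v for k, v in tweet.items() if k in selected}

# A hypothetical crawled tweet with all eight attributes present.
sample = {
    "user_name": "example_user",
    "geo_location": (34.02, -118.28),
    "status_id": 123456789,
    "text": "Hello, world!",
    "created_at": "2015-06-01 12:00:00",
    "language": "en",
    "retweet_count": 3,
    "favorite_count": 5,
}

# Only the checked boxes survive into the output.
subset = select_attributes(sample, {"user_name", "text", "created_at"})
```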

Specifying Output Folder

To specify an output folder where crawled files will be saved, click on the Browse button to the right of the Output Location bar and select a folder. If you create a new folder within this menu and rename it from "New Folder", click on any other folder and then click back on your newly created and renamed folder to select it. After specifying all parameters, click the green and white play button (Image 1) located in the top right corner of the window to run the program. Output information will display in the console panel at the bottom of the tool.

Understanding Twitter Crawler Output

Crawl data is saved as a single .json file in the specified output folder, with the file name Twitter_Stream_[date/time]. This file is automatically stored in Corpus Management. The JSON format allows the program to store metadata about each crawled post, which the analysis plugins can use to select groups based on their properties for more focused analysis, or to sort data into groups (or classes) for comparison.
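The output scheme described above can be sketched as follows; the exact timestamp format in the file name is an assumption, as is the pretty-printed layout of the JSON:

```python
import json
import time
from pathlib import Path

def save_stream(tweets, output_dir):
    """Write crawled tweets to a single JSON file named
    Twitter_Stream_[date/time], mirroring the naming scheme above.
    The timestamp format here is illustrative."""
    stamp = time.strftime("%Y-%m-%d_%H-%M-%S")
    path = Path(output_dir) / f"Twitter_Stream_{stamp}.json"
    with open(path, "w", encoding="utf-8") as f:
        json.dump(tweets, f, ensure_ascii=False, indent=2)
    return path
```

Because each post is stored as a structured JSON object rather than bare text, per-tweet metadata (language, retweet count, and so on) remains available to the analysis plugins for grouping and comparison.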