TACIT Reddit Crawler

From CSSL

Overview

The Reddit Crawler provides access to the data available on Reddit.com. It offers filtering options for searching specific topics, keywords, and authors within subreddits, through either the full website, the trending data, or the Top/Controversial sections of the website. The collected data, including vote counts, is saved in a format that can be read and analyzed by the TACIT tool as well as by other language analysis algorithms.

Basic Tutorial: Collecting data using the Reddit Crawler

The Crawl panel provides three Reddit webpage stream options to focus your data collection. Selecting the Search option will allow you to search the entire Reddit website for posts that fit user-specified criteria such as keywords, titles, website names, authors, and subreddits. Selecting the Trending Data option will focus the crawl on the Hot, New, and Rising pages for information on new and upcoming Reddit topics. Selecting the Top/Controversial option will focus the crawl on the Top and Controversial pages for individuals interested in the topics with the most conflicted opinions (same number of up and down votes) or the highest scoring posts.

Selecting Input Parameters
The Text input box allows you to specify one or more keywords to base your search on. The crawler will return posts that include the keyword(s). Note: Reddit does not allow comments to be included as searchable text at this time.

The Subreddit input box lets you limit your search to specific subreddits.

The Sort Links By drop-down menu allows you to collect data that matches your keyword(s), sorted in one of the following orders:
the Relevance of the post,
the Top posts (overall posts with the highest rating),
the Hot posts (the new posts with the highest rating),
the New posts that have been posted most recently,
or the posts with the most Comments.


Crawler Filters

You can further tailor and filter your crawling results using the following options:
Title Keyword: search for your keyword within posts that include a separate, specific keyword in the title.
Site: search for your keyword within posts that mention a specific website URL.
Author: search for your keyword within posts by a specific author.
Link ID: search for your keyword within posts that reference a specific Reddit page link ID.
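As an illustration of how a keyword and these filters could combine into a single search, here is a minimal Python sketch. It assumes the filters map onto `field:value` terms of the kind Reddit's search box accepts (`subreddit:`, `title:`, `site:`, `author:`); the function name `build_search_query` is hypothetical, and this is not necessarily how the TACIT crawler builds its requests. The Link ID filter is left out because its query form is not described here.

```python
from typing import Optional


def build_search_query(
    text: str,
    subreddit: Optional[str] = None,
    title: Optional[str] = None,
    site: Optional[str] = None,
    author: Optional[str] = None,
) -> str:
    """Combine a keyword with optional field filters into one search string."""
    parts = [text.strip()]
    # Each filter becomes a `field:value` term (an assumed mapping, not TACIT's code).
    for field, value in (
        ("subreddit", subreddit),
        ("title", title),
        ("site", site),
        ("author", author),
    ):
        if value:
            parts.append(f"{field}:{value.strip()}")
    # Drop empty pieces so a blank keyword does not leave a leading space.
    return " ".join(p for p in parts if p)


# Example: keyword "climate", limited to one subreddit and one (made-up) author.
print(build_search_query("climate", subreddit="science", author="example_user"))
# prints: climate subreddit:science author:example_user
```

Filters that are left empty contribute nothing to the query, which mirrors leaving the corresponding box blank in the Crawl panel.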

Crawler Limits

The Time Frame drop-down menu allows you to specify the time frame that you would like to receive data for (all posts regardless of time frame, or the past hour, 24 hours, week, month, or year). The crawler defaults to limiting the amount of data collected to 10 links (posts) and 200 comments per link. Changing the numbers in the Limit Links Per Request and Limit Comments Per Link text boxes will allow you to change these default numbers. Note: Both links per request and comments per link must be specified in order for the crawler to work, due to the vast size of Reddit.
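The effect of the two limits can be sketched in a few lines of Python. This is only an illustration under assumptions: the post structure (a dictionary with a `comments` list) and the function name `apply_limits` are hypothetical, and the real crawler presumably enforces the limits while requesting data rather than afterwards.

```python
from typing import Any, Dict, List

DEFAULT_LINK_LIMIT = 10      # default stated above: 10 links (posts)
DEFAULT_COMMENT_LIMIT = 200  # default stated above: 200 comments per link


def apply_limits(
    posts: List[Dict[str, Any]],
    link_limit: int = DEFAULT_LINK_LIMIT,
    comment_limit: int = DEFAULT_COMMENT_LIMIT,
) -> List[Dict[str, Any]]:
    """Keep at most `link_limit` posts and `comment_limit` comments per post."""
    # Both limits are required, mirroring the note above.
    if link_limit is None or comment_limit is None:
        raise ValueError("both link_limit and comment_limit must be specified")
    if link_limit < 1 or comment_limit < 1:
        raise ValueError("limits must be positive integers")

    limited = []
    for post in posts[:link_limit]:
        trimmed = dict(post)  # shallow copy so the input is left untouched
        trimmed["comments"] = list(post.get("comments", []))[:comment_limit]
        limited.append(trimmed)
    return limited
```

With the defaults, a crawl returns at most 10 x 200 = 2,000 comments in total, which gives a rough upper bound on output size before you raise either number.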

Selecting the Trending Data Reddit Stream for Data Collection

Selecting the Stream Type to Crawl

The Select Stream panel provides the option to crawl data from either the Hot Stream, New Stream, or Rising Stream. The Hot Stream crawls the posts considered "hot" at that moment, which signifies the new posts with the highest rating (combining the "top" and "new" streams). The New Stream crawls the most recent Reddit posts first, regardless of their other attributes. The Rising Stream crawls the posts that are getting the most upvotes per minute at that moment.

Crawler Limits

The crawler defaults to limiting the amount of data collected to 10 links (posts) and 200 comments per link. Changing the numbers in the Limit Links Per Request and Limit Comments Per Link text boxes will allow you to change these default numbers. Note: Both links per request and comments per link must be specified in order for the crawler to work, due to the vast size of Reddit.

Selecting the Top/Controversial Reddit Stream for Data Collection

Selecting the Stream Type to Crawl

The Select Stream panel provides the option to crawl data from either the Top Reddit Stream or the Controversial Reddit Stream. The Top Reddit Stream is a listing of the highest scoring submissions (posts) to Reddit, and comments on submissions, regardless of their age. The Controversial Reddit Stream is a listing of Reddit submissions (posts) to which the community has given an equal number of upvotes and downvotes.

Crawler Limits

The Time Frame drop-down menu allows you to specify the time frame that you would like to receive data for (all posts regardless of time frame, or the past hour, 24 hours, week, month, or year). The crawler defaults to limiting the amount of data collected to 10 links (posts) and 200 comments per link. Changing the numbers in the Limit Links Per Request and Limit Comments Per Link text boxes will allow you to change these default numbers. Note: Both links per request and comments per link must be specified in order for the crawler to work, due to the vast size of Reddit.
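To make the idea of a "controversial" post concrete (roughly the same number of upvotes and downvotes), the following Python sketch scores how evenly a post's votes are split. This is an illustrative measure only, not Reddit's actual ranking formula; the name `vote_balance` and the `ups`/`downs` keys are assumptions made for the example.

```python
def vote_balance(upvotes: int, downvotes: int) -> float:
    """Return 1.0 for an even split and values near 0.0 for one-sided voting."""
    # A post with no votes on one side is not contested at all.
    if upvotes <= 0 or downvotes <= 0:
        return 0.0
    return min(upvotes, downvotes) / max(upvotes, downvotes)


# Hypothetical vote counts for three posts.
posts = [
    {"title": "A", "ups": 900, "downs": 100},
    {"title": "B", "ups": 50, "downs": 50},
    {"title": "C", "ups": 10, "downs": 40},
]

# Most evenly split posts first.
ranked = sorted(posts, key=lambda p: vote_balance(p["ups"], p["downs"]), reverse=True)
print([p["title"] for p in ranked])
# prints: ['B', 'C', 'A']
```

Note that post A has by far the highest net score, so it would rank well in a Top listing while ranking last by this balance measure.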

Specifying Output Folder

To specify an output folder where crawled files will be saved, click on the Browse button to the right of the Output Location bar and select a folder. If you create a new folder within this menu and change its name from "New Folder", click on any other folder and then click back on your newly created and renamed folder to select it. After specifying all parameters, click the green and white play button located in the top right corner of the window to run the program. Output information will be displayed in the console panel at the bottom of the tool.

Understanding Reddit Crawler Output

Crawl data is saved as a single JSON file in the specified output folder, with the file name Reddit_Crawl_[date/time]. This file is automatically stored in Corpus Management. The JSON format allows the program to store metadata about each crawled post; the analysis plugins can use this metadata to select groups of posts by their properties for more focused analysis, or to sort data into groups (or classes) for comparison.
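Because the output is plain JSON, it can also be inspected outside TACIT. The sketch below loads a crawl file and groups posts by a metadata field. The file layout is an assumption: the example treats the file as a JSON list of post objects with a `subreddit` key, which may not match the actual schema, and the helper names `load_crawl` and `group_by_field` are hypothetical. Check a real output file and adjust the field names before relying on it.

```python
import json
from collections import defaultdict
from pathlib import Path
from typing import Any, Dict, List


def load_crawl(path: str) -> List[Dict[str, Any]]:
    """Read a crawl file, assumed here to hold a JSON list of post objects."""
    with Path(path).open(encoding="utf-8") as handle:
        return json.load(handle)


def group_by_field(posts: List[Dict[str, Any]], field: str) -> Dict[Any, List[Dict[str, Any]]]:
    """Group posts by one metadata field; posts lacking it go under 'unknown'."""
    groups = defaultdict(list)
    for post in posts:
        groups[post.get(field, "unknown")].append(post)
    return dict(groups)
```

For example, `group_by_field(load_crawl(path), "subreddit")` would return one list of posts per subreddit, which is the kind of grouping (or class assignment) the analysis plugins use for comparison.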