TACIT Crawler Overview

From CSSL

Overview of Crawlers

We designed TACIT’s Crawler plugins to automate the data compilation stage of text analysis. The five built-in TACIT Crawlers search for and download relevant content from historical (the Latin Library Crawler), political (the US Supreme Court and US Congress Crawlers), and social media (the Reddit and Twitter Crawlers) sources, making this data available for further processing in TACIT or any other text-processing tool. All data downloaded with TACIT’s Crawlers is automatically converted by the Corpus Management tool into a corpus (an annotated collection of texts), which can be analyzed within TACIT or exported as plain text files for use in other analysis software.

Latin Library Crawler

The Latin Library Crawler collects text from The Latin Library Website and writes that data into text files that are readable by automated text analysis programs.

Reddit Crawler

The Reddit Crawler collects text from the Reddit website and writes that data into text files that are readable by automated text analysis programs.

Twitter Crawler

The Twitter Crawler collects text from the Twitter website and writes that data into text files that are readable by automated text analysis programs.

United States Congress Crawler

The United States Congress Crawler collects speech transcripts and metadata from the Library of Congress THOMAS Website (http://thomas.loc.gov/home/thomas.php), covering speeches from the present day back to the 101st Congress.

United States Supreme Court Case Crawler

The United States Supreme Court Crawler automatically scans and collects court case transcription data from the IIT Chicago-Kent College of Law Supreme Court Case Website and writes that data into text files that are readable by automated text analysis programs. It also saves metadata about each file (e.g., majority author, date decided; see output section below) and includes options to download case audio MP3 files.
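
The pattern of writing each case to a plain text file alongside its metadata can be illustrated with a short Python sketch. This is not TACIT's implementation: the function name `save_cases`, the file naming scheme, the `metadata.csv` file, and the dictionary keys are all hypothetical, chosen only to show one way such output could be organized for downstream text analysis tools.

```python
import csv
from pathlib import Path


def save_cases(cases, out_dir):
    """Write each case's text to its own .txt file and list metadata in a CSV.

    `cases` is a list of dicts with a "text" key and optional
    "majority_author" and "date_decided" keys (hypothetical field names).
    Returns the path of the metadata CSV.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    rows = []
    for i, case in enumerate(cases, start=1):
        # Hypothetical naming scheme: a running index keeps file names unique.
        text_path = out / f"case_{i:04d}.txt"
        text_path.write_text(case["text"], encoding="utf-8")
        rows.append({
            "file": text_path.name,
            "majority_author": case.get("majority_author", ""),
            "date_decided": case.get("date_decided", ""),
        })

    # One metadata row per text file, so analyses can join on the file name.
    meta_path = out / "metadata.csv"
    with meta_path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(
            fh, fieldnames=["file", "majority_author", "date_decided"]
        )
        writer.writeheader()
        writer.writerows(rows)
    return meta_path
```

Keeping the text in separate plain text files and the metadata in a single table means the texts stay readable by any text analysis program, while the metadata can be filtered or merged without parsing the texts themselves.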