TACIT Preprocessing


Overview

TACIT provides a number of preprocessing options that can be applied to text before analysis. To access these settings, click the Preprocess link within any of the tools. You can also reach them from the main toolbar by clicking View, followed by Preferences, TACIT, and Preprocess. Finally, if you want to preprocess your text without running an analysis yet, there is a stand-alone Preprocessing tool that can be accessed by clicking Preprocess under View in the main toolbar.

The Preprocessor has options for stop word removal, lowercase conversion, delimiter specification, stemming, and preprocessed-file storage.

Basic Tutorial: Preprocessing

Stop Words

Stop words are very common words, such as 'a' and 'the', that occur in more than 99% of documents. You can specify a text file listing the stop words you would like removed from the text, leaving the presumably meaningful words as the focus of analysis (Salton, 1989). However, the structure of the texts should only be altered in this way with clear theoretical justification and knowledge of the tradeoffs involved. When only a small amount of text is available, it is sometimes helpful to remove uninformative variation; however, elements like stop words can be useful indicators for tasks such as authorship identification.
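
The following is a minimal Python sketch of stop-word removal, not TACIT's own implementation. It assumes the stop-word file holds one word per line and that tokens are separated by whitespace; the function names are hypothetical.

```python
def load_stop_words(lines):
    """Build a stop-word set from an iterable of lines (one word per line)."""
    return {line.strip().lower() for line in lines if line.strip()}


def remove_stop_words(text, stop_words):
    """Drop whitespace-separated tokens whose lowercase form is a stop word."""
    kept = [tok for tok in text.split() if tok.lower() not in stop_words]
    return " ".join(kept)


# In practice the lines would come from the stop-word file, e.g. open(path).
stop_words = load_stop_words(["a", "the", "of", ""])
result = remove_stop_words("The price of a free press", stop_words)
print(result)  # price free press
```

Because the sketch splits only on whitespace, a token with attached punctuation (for example 'the,') would not match the list; a fuller version would strip punctuation first.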

Delimiters

The program uses the punctuation marks (delimiters) entered in this box to divide the text into meaningful units (sentences, clauses, etc.).
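
As a rough illustration of delimiter-based splitting, the Python sketch below divides text at any character from a user-supplied delimiter string. How TACIT actually parses the contents of the box is not documented here, so the function name and the one-character-per-delimiter convention are assumptions.

```python
import re


def split_on_delimiters(text, delimiters):
    """Split text at any single character in `delimiters`, dropping empty pieces."""
    if not delimiters:
        # No delimiters given: treat the whole text as one unit.
        return [text.strip()] if text.strip() else []
    # re.escape keeps characters such as '.' or '?' literal inside the class.
    pattern = "[" + re.escape(delimiters) + "]"
    return [piece.strip() for piece in re.split(pattern, text) if piece.strip()]


units = split_on_delimiters("We came. We saw; we left!", ".;!")
print(units)  # ['We came', 'We saw', 'we left']
```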

Stemming

When the Stemming box is checked, TACIT uses the Porter stemming algorithm to reduce words in the text to their stems (e.g., 'running' and 'runs' both reduce to 'run'; irregular forms such as 'ran' are left unchanged, because the algorithm only strips suffixes). This reduces the number of unique words in the documents and ensures that words that carry the same meaning but differ in inflection are treated the same (Porter, 2006).
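
To show what suffix stripping does to a vocabulary, here is a deliberately simplified Python sketch. It is a toy rule set written for illustration, not the Porter algorithm and not TACIT's code; the function name and the three suffixes it handles are assumptions.

```python
def simple_stem(word):
    """Strip a few common English suffixes. A toy rule set, NOT the Porter algorithm."""
    for suffix in ("ing", "ed", "s"):
        if suffix == "s" and word.endswith("ss"):
            continue  # leave words like "press" alone
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo a doubled final consonant, e.g. "runn" -> "run".
            if stem[-1] == stem[-2] and stem[-1] not in "aeioulsz":
                stem = stem[:-1]
            return stem
    return word


words = ["running", "runs", "ran"]
stems = [simple_stem(w) for w in words]
print(stems)            # ['run', 'run', 'ran']
print(len(set(stems)))  # 2 unique forms instead of 3
```

Pure suffix stripping cannot map an irregular form such as 'ran' to 'run'; that would require a dictionary-based lemmatizer.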

Convert to Lowercase

Selecting Convert to Lowercase converts all text to single case (all lowercase) so that algorithms can recognize that 'Freedom' and 'freedom' represent the same word.
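
The effect on word counts can be sketched in a few lines of Python. This is an illustration only, with a hypothetical function name and simple whitespace tokenization, not TACIT's implementation.

```python
from collections import Counter


def count_words(text, lowercase=True):
    """Count whitespace-separated tokens, optionally lowercasing first."""
    if lowercase:
        text = text.lower()
    return Counter(text.split())


folded = count_words("Freedom is freedom")
unfolded = count_words("Freedom is freedom", lowercase=False)
print(folded["freedom"])                          # 2
print(unfolded["Freedom"], unfolded["freedom"])   # 1 1
```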

Saving Preprocessed Files

Click Browse next to Pre-Processed File Location to specify where the newly created preprocessed files should be stored.

If you select Clean up preprocessed files, the program automatically deletes the newly created preprocessed files after running the analysis.
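
The store-then-clean-up behavior described above can be sketched in Python as follows. This is a hypothetical illustration of the workflow, not TACIT's code: the function name, the `.txt` naming scheme, and the `analyze` callback are all assumptions.

```python
import tempfile
from pathlib import Path


def run_with_preprocessed_files(texts, output_dir, analyze, clean_up=True):
    """Write each text to output_dir, run analyze on the paths, then optionally delete them."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for name, text in texts.items():
        path = output_dir / f"{name}.txt"
        path.write_text(text, encoding="utf-8")
        paths.append(path)
    try:
        return analyze(paths)
    finally:
        # Mirrors the "Clean up preprocessed files" option.
        if clean_up:
            for path in paths:
                path.unlink()


def count_tokens(paths):
    """Stand-in analysis: total number of whitespace-separated tokens."""
    return sum(len(p.read_text(encoding="utf-8").split()) for p in paths)


with tempfile.TemporaryDirectory() as tmp:
    total = run_with_preprocessed_files({"doc1": "free press"}, tmp, count_tokens)
    remaining = list(Path(tmp).iterdir())

print(total)      # 2
print(remaining)  # []
```

With `clean_up=False`, the files stay in the chosen location so they can be inspected or reused in a later analysis.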