TACIT WordCount

From CSSL

Overview

Word count programs (or dictionary methods) are text analysis tools that quantify the frequency of topics of interest across a document or set of documents. To accomplish this task, word count programs require a user-defined dictionary of topics, with pre-identified words for each topic. Using this dictionary, the word count program counts the number of times words in each topic domain occur and calculates the percentage of each document that reflects each topic.
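The counting step described above can be sketched in a few lines of Python. This is only an illustration of the general dictionary method, not TACIT's implementation: the in-memory DICTIONARY, the function name count_topics, and the simple regular-expression tokenizer are all assumptions made for the example (TACIT itself tokenizes with Apache OpenNLP and reads dictionaries from .txt files).

```python
import re
from collections import Counter

# Hypothetical in-memory dictionary: topic name -> set of words.
# This is NOT TACIT's dictionary file format.
DICTIONARY = {
    "positive": {"happy", "glad"},
    "negative": {"sad", "angry"},
}


def count_topics(text, dictionary):
    """Return (counts, percentages) per topic for one document.

    counts[topic] is the number of word occurrences that belong to the
    topic; percentages[topic] is that count as a share of all words.
    """
    # Simplified tokenizer (assumption): lowercase runs of letters/apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    tally = Counter()
    for word in words:
        for topic, vocabulary in dictionary.items():
            if word in vocabulary:  # words not in the dictionary are ignored
                tally[topic] += 1
    total = len(words)
    counts = {topic: tally[topic] for topic in dictionary}
    percentages = {
        topic: (100.0 * tally[topic] / total if total else 0.0)
        for topic in dictionary
    }
    return counts, percentages
```

For a ten-word document containing two "positive" words, this sketch would report a count of 2 and a percentage of 20.0 for that topic.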

The TACIT Word Count plugin was developed exclusively for the TACIT tool as a more comprehensive approach to word count techniques. Rather than relying only on static categories, the Standard Word Count tool uses Apache OpenNLP to automatically segment sentences, tokenize words, and find the part of speech tags of all words in the text. We believe that this is a more modern and comprehensive approach to segmenting sentences and counting parts of speech than LIWC's approach (e.g., locating ',' as a marker for sentences). By default, analyses using this tool report word counts for the categories in user-provided dictionaries. Our word count is a straightforward implementation: a word is counted if it exists in the user dictionary and ignored otherwise. The tool also uses Apache OpenNLP's POS tagging API to calculate part of speech tag counts and words per sentence.

This tool provides additional word count options, including weighted word count capabilities and default part of speech tag counts (e.g., numbers, symbols, defaults based on The Penn Treebank Project).

Basic Tutorial: Using TACIT Word Count

TACIT Standard Word Count vs. TACIT Weighted Word Count

Standard word count techniques are ideal when one is interested in assessing the content features of text because they are binary by nature; these techniques count dictionary words by presence (by adding a 1 to that word's score each time it occurs in the document) or absence (score of 0 if the word never appears) and treat all words within the dictionary as being equally important.

By comparison, weighted word count techniques are often preferred when assessing stylistic variables that capture varying degrees of a concept, so that dictionary words that are a weaker reflection of the topic count for less than dictionary words that reflect the topic more strongly.

For example, if you are interested in assessing negative affect, you may assign a weight of 0.5 to the word "sad" so that it counts for only half a point during the word count process, as "sad" usually reflects relatively low levels of negative affect. By comparison, you may assign a weight of 2 to the word "devastated" so that each occurrence counts for 2 points, as this word represents much stronger negative affect and thus may be more indicative of the topic.
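The difference between the two techniques can be illustrated with a minimal Python sketch that uses the weights from the example above. The names NEGATIVE_AFFECT, standard_score, and weighted_score are hypothetical, the tokenizer is a simplification, and the sketch reports raw sums only; it makes no claim about how TACIT normalizes or formats weighted output.

```python
import re

# Hypothetical weights for a negative-affect topic, taken from the example
# in the text. This is NOT TACIT's dictionary file format.
NEGATIVE_AFFECT = {"sad": 0.5, "devastated": 2.0}


def tokenize(text):
    """Simplified tokenizer (assumption): lowercase runs of letters."""
    return re.findall(r"[a-z']+", text.lower())


def standard_score(text, weights):
    """Standard count: every dictionary word occurrence adds 1."""
    return sum(1 for word in tokenize(text) if word in weights)


def weighted_score(text, weights):
    """Weighted count: every dictionary word occurrence adds its weight."""
    return sum(weights.get(word, 0.0) for word in tokenize(text))
```

For the sentence "She was sad, then devastated, then sad again.", the standard score is 3 (three dictionary words), while the weighted score is 0.5 + 2 + 0.5 = 3.0; a text with three occurrences of "devastated" would keep a standard score of 3 but reach a weighted score of 6.0.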

The method you select (standard or weighted) will determine the necessary format of your dictionary files (see the dictionary section below for details).

Specifying Input Text for Analysis

All input data files must be saved as a corpus in Corpus Management or in .txt file format to be compatible with TACIT. See the Corpus Management Help Section if you need to convert files to compatible formats.

To specify which files you would like to analyze, select Add Corpus, Add Folder, or Add File(s) under the Input panel. All files and folders added to the input panel are automatically selected for analysis, as indicated by the check box to the left of the corpus/file/folder name. The number of files selected for analysis is indicated at the bottom of the input panel. To de-select an unwanted folder or file, uncheck the box next to its name. To remove a file completely from the program list, click on the file name to highlight it, and then click the "Remove" button. Note: Files within folders/corpora cannot be removed from the tool without removing the entire folder or corpus, but de-selecting files using the check boxes will exclude them from analysis.

Add Corpus: The Add Corpus button will allow you to add a stored corpus from Corpus Management and all included subgroups and files for analysis. To expand the corpus and view the subgroups/files included, click on the arrow to the left of the corpus name in the input panel.
Add Folder: The Add Folder button will allow you to add a folder and all included subfolders and files for analysis. To expand the folder and view the subfolders/files included, click on the arrow to the left of the folder name in the input panel.
Add File(s): The Add Files button will allow you to add .txt files to be included for analysis. Multiple files within the same folder can be selected at the same time using standard multi-select functions.

Additional Input Options

Preprocess: The Preprocess check box allows users to apply a variety of data cleaning techniques to the input data before analysis. Click on the hyperlink in order to specify which cleaning processes you would like to apply (i.e., Stop Word Removal, Delimiters, Porter Stemming, Capitalization).
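As a rough illustration of what such cleaning steps do to a text, the Python sketch below applies three of them: case normalization, delimiter stripping, and stop word removal. Everything in it is an assumption for the example, including the function name preprocess, the short stop word list, and the default delimiter set; it does not reproduce TACIT's preprocessing options or defaults, and Porter stemming is left out.

```python
# Illustrative stop word list only; a real list would be much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to"}


def preprocess(text, stop_words=STOP_WORDS,
               delimiters=".,;:!?\"()", lowercase=True):
    """Return the cleaned tokens of a text.

    Steps (all simplified stand-ins for the options named above):
    1. optionally lowercase the text,
    2. replace each delimiter character with a space,
    3. split on whitespace and drop stop words (case-insensitively).
    """
    if lowercase:
        text = text.lower()
    table = str.maketrans({ch: " " for ch in delimiters})
    tokens = text.translate(table).split()
    return [token for token in tokens if token.lower() not in stop_words]
```

Called on "The cat, and the dog!", this sketch would keep only the tokens "cat" and "dog".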

Selecting the Create output for default tags option will create additional count columns in the output file for the default parts of speech defined by The Penn Treebank Project. In the current implementation, the weights associated with default tags are not customizable, though this feature may be added at a later date.

Selecting the Create .DAT File checkbox will generate an output data file in .dat format compatible for use with outside data analysis programs.

Selecting the Create Category-wise Word Distribution Files

Selecting a Dictionary

All dictionary files must be in .txt format to be compatible with TACIT. Standard word count requires that the dictionary specify the words for each topic. Weighted word count requires that you also provide the weights assigned to each word of interest. For dictionary specifications, see the Dictionary Help Page. It is important to note that TACIT dictionaries do not support phrases.

To add a dictionary to be used for word count analysis, click the Add Files button on the right side of the Dictionary panel. All files added to the Dictionary panel are automatically selected for analysis, as indicated by the check box to the left of the dictionary file name. The number of files selected for analysis is indicated at the bottom of the Dictionary panel. To de-select an unwanted file, uncheck the box next to its name. To remove a file completely from the program list, click on the file name to highlight it, and then click the "Remove" button.

Using Multiple Dictionaries

TACIT allows users to analyze text using multiple dictionaries simultaneously. However, category number assignment must be consistent across all included dictionaries. Specifically, if category 2 is pronoun in Dictionary_1, the program will assume that category 2 is also pronoun in Dictionary_2. A dictionary does not need to contain every category, however. You may introduce an additional set of categories in the second dictionary file, as long as the category numbers used are distinct from those found in the first dictionary file.
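One way to check this consistency requirement before running an analysis is sketched below in Python. The representation is hypothetical: each dictionary is modeled as a mapping with a "categories" table (number to name) and a "words" table (word to set of category numbers), which is not TACIT's file format. The conflict check is also an addition for illustration; as described above, TACIT itself simply assumes that a given number means the same category in every file.

```python
def merge_dictionaries(*dictionaries):
    """Merge dictionaries that share one category numbering scheme.

    Raises ValueError if the same category number is bound to two
    different names (a hypothetical pre-flight check, not TACIT behavior).
    """
    merged_categories = {}
    merged_words = {}
    for dictionary in dictionaries:
        for number, name in dictionary["categories"].items():
            existing = merged_categories.setdefault(number, name)
            if existing != name:
                raise ValueError(
                    f"category {number} is {existing!r} in one dictionary "
                    f"and {name!r} in another"
                )
        for word, numbers in dictionary["words"].items():
            # A word may belong to categories from several dictionaries.
            merged_words.setdefault(word, set()).update(numbers)
    return {"categories": merged_categories, "words": merged_words}
```

With this sketch, a second dictionary that reuses category 2 as pronoun and adds a new category 3 merges cleanly, while one that redefines category 2 under a different name is rejected.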

Additional Dictionary Options

Stem Dictionary: TACIT word count dictionaries do not support asterisk (*) wildcards. Selecting the Stem Dictionary option instead applies a Porter Stemmer to the words in the dictionary before conducting the word count analysis. Porter stemming strips common suffixes to reduce words to a shared stem (e.g., 'running' and 'runs' both reduce to 'run') so that they are counted as the same word during analysis. Irregular forms such as 'ran' are not affected.

Specifying Output Path

To specify the folder where output files will be saved, click on the Browse button to the right of the Output Location bar and select a folder. If you create a new folder within this menu and change its name from "New Folder", click on any other folder and then click back on your newly created and renamed folder to select it. After specifying all parameters, click the green and white play button located in the top right corner of the window to run the program. Output information will display in the console panel at the bottom of the tool.

Understanding Word Count Output

The data output will be in .csv file format. The file name includes the type of technique used for analysis and the time stamp for when the analysis was completed. WordCount-UserTags indicates word counts for categories in the user dictionary. The last line in the .csv file indicates the overall counts for all input documents. The word count output values are the percentage of each document that is captured by each category.

If Create output for default tags is selected, the program will create an additional output file called WordCount-DefaultTags which includes word counts for POS tags using OpenNLP.