TACIT WordCount Overview

From CSSL

TACIT Word Count Tools

List of TACIT Word Count Tools

Standard Word Count
Z-Label LIWC-Style Word Count
Co-Occurrence Analysis

Overview of Word Count Tools

Word count techniques measure how strongly a document reflects a predetermined topic of interest, using researcher-created dictionaries. To use this method, the researcher first creates, manually or through crowdsourcing, a list of words they believe are related to the topic; this list is called a dictionary. For example, a researcher interested in the use of moral words might create a dictionary containing ethic, moral, right, etc. The program then counts the number of times the dictionary words occur in each document of interest. These programs report the total number of words in each document as well as the percentage of words in each document that belong to each topic.
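
The counting step described above can be sketched in a few lines of Python. This is a minimal illustration only, not TACIT's or LIWC's implementation: the dictionary contents, the regular-expression tokenizer, and the names DICTIONARY, tokenize and word_count are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical dictionary: category name -> set of lowercase words.
DICTIONARY = {
    "moral": {"ethic", "moral", "right"},
}

def tokenize(text):
    """Lowercase the text and split it into simple word tokens (assumed tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

def word_count(text, dictionary):
    """Return the total token count and, per category, the raw count and percentage."""
    tokens = tokenize(text)
    total = len(tokens)
    counts = Counter()
    for token in tokens:
        for category, words in dictionary.items():
            if token in words:
                counts[category] += 1
    result = {}
    for category in dictionary:
        n = counts[category]
        percent = 100.0 * n / total if total else 0.0
        result[category] = {"count": n, "percent": percent}
    return total, result

total, result = word_count("It is right to make the moral choice.", DICTIONARY)
print(total, result)  # 8 tokens, 2 of them in the "moral" category (25.0 percent)
```

A real tool would also need to decide how to handle punctuation, hyphenation and word stems, which this sketch ignores.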

Arguably the most popular word count platform in social psychology is the Linguistic Inquiry and Word Count (LIWC), developed by James Pennebaker and his collaborators (Pennebaker, Booth, & Francis, 2007; Tausczik & Pennebaker, 2010). Word count techniques have been used in a variety of tasks, including demonstrating the relationship between the use of first-person singular pronouns and negative experiences and depression (Rude, Gortner, & Pennebaker, 2004; Stirman & Pennebaker, 2001), and the relationship between extraversion and the use of shorter words and less complex language, but longer written passages (Mehl, Gosling, & Pennebaker, 2006). TACIT provides three plugins that perform word count techniques: LIWC-style Word Count, Standard Word Count, and Co-occurrence. Detailed descriptions of each are given below.


Description of TACIT Word Count Tools

LIWC-Style Word Count

The LIWC-style word count plugin was developed from the LIWC documentation (Pennebaker, Francis, & Booth, 2001) and by reverse engineering the program to understand and implement its algorithm. We compared the results of our implementation with those of the original LIWC software on a variety of test texts, including files from Project Gutenberg. Except for very rare cases (e.g. some occurrences of double hyphens, documented here), the results of our plugin match those of LIWC exactly to the hundredths decimal place. While TACIT provides the word count capability of LIWC, it does not include any of the LIWC dictionary categories, so users must import their own dictionaries.

TACIT Word Count

The TACIT Word Count plugin (the Standard Word Count tool) was developed exclusively for TACIT as a more comprehensive approach to word count techniques. Rather than relying only on static categories, the Standard Word Count tool uses Apache OpenNLP to automatically segment sentences, tokenize words, and assign part-of-speech tags to all words in the text. Analyses using this tool report word counts for the categories in user-provided dictionaries. We believe that this is a more modern and comprehensive approach to segmenting sentences and counting parts of speech than LIWC’s approach (e.g. treating “.” as a sentence marker).

Weighted Word Count: Generally, all words in a category are treated as equally important and are therefore assigned the same weight during analysis (i.e. the count is unweighted). Weighted word count instead allows the program to use a different weight for each word. Higher weights are assigned to words that are particularly diagnostic of a category, resulting in higher output scores when such a word is present, even if the word occurs less frequently than other category words. Although LIWC word count is generally unweighted, both of TACIT’s word count plugins support weighted dictionaries.
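
The effect of weighting can be illustrated with a short Python sketch. It assumes that a category score is simply the sum of the weights of the matched words; how TACIT actually combines or normalizes weights is not stated here, and the weights, the WEIGHTED dictionary and the weighted_score function are hypothetical.

```python
import re

# Hypothetical weighted dictionary: category -> {word: weight}.
# "moral" is treated as highly diagnostic, "right" as weakly diagnostic.
WEIGHTED = {
    "moral": {"ethic": 2.0, "moral": 3.0, "right": 0.5},
}

def weighted_score(text, weighted_dictionary):
    """Sum the weights of all dictionary words found in the text, per category."""
    tokens = re.findall(r"[a-z']+", text.lower())
    scores = {}
    for category, weights in weighted_dictionary.items():
        # Words that are not in the category contribute nothing to the score.
        scores[category] = sum(weights.get(token, 0.0) for token in tokens)
    return scores

# One occurrence of a heavily weighted word outscores three occurrences
# of a lightly weighted word.
print(weighted_score("moral", WEIGHTED))              # {'moral': 3.0}
print(weighted_score("right right right", WEIGHTED))  # {'moral': 1.5}
```

Setting every weight to 1.0 reduces this to the unweighted count.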

Co-Occurrence Analysis

Co-occurrence analysis gives insight into the interconnections between terms or entities within a given text. Two words are said to “co-occur” if both appear within a defined window size, that is, within a certain number of words of each other (e.g. in “Fiji apples are red”, “Fiji” and “red” co-occur in a window size of 4). TACIT’s Co-occurrence plugin calculates both word-level and phrase-level co-occurrences. At the word level, the plugin computes how frequently each pair of words co-occurs in a text or corpus. At the phrase level, it locates “neighborhoods”, defined as multiple terms (“neighbors”) that co-occur with each other (e.g. “Fiji” and “red” are neighbors within the neighborhood “Fiji apples are red”).

To use this method, users specify a set of words whose co-occurrence they want to examine (e.g. “fiji”, “red”, “island”), a threshold for the maximum number of words allowed in a neighborhood (e.g. 4), and a threshold for the minimum number of words from the specified set that must occur in the group (e.g. 2 for “fiji” & “red”, “fiji” & “island”, “red” & “island” or “fiji”, “red” & “island”).

The output of this process is the set of text groups that satisfy these constraints, along with their location information (i.e. filename and line number). This output can be used to locate and verify relationships between the given entities. The overall matrix of word-pair frequencies can be used to locate clusters and synonyms and to learn the overall connections between entities.
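
Both levels of analysis can be sketched in Python as follows. This is an illustrative approximation under stated assumptions, not the plugin's algorithm: the tokenizer, the sliding-window strategy, and the function names pair_counts and neighborhoods are hypothetical, and a window of size N is taken to mean N consecutive tokens, consistent with the “Fiji apples are red” example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens (assumed tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

def pair_counts(tokens, window):
    """Word level: count unordered pairs of distinct words that fall
    inside a span of `window` consecutive tokens."""
    counts = Counter()
    for i, left in enumerate(tokens):
        for right in tokens[i + 1 : i + window]:
            if left != right:
                counts[tuple(sorted((left, right)))] += 1
    return counts

def neighborhoods(lines, seeds, window, min_seeds):
    """Phrase level: yield (line_number, span_text, seeds_found) for every
    window of at most `window` tokens that contains at least `min_seeds`
    distinct seed words. Overlapping windows are reported separately."""
    seeds = {seed.lower() for seed in seeds}
    for line_number, line in enumerate(lines, start=1):
        tokens = tokenize(line)
        for start in range(max(1, len(tokens) - window + 1)):
            span = tokens[start : start + window]
            found = sorted(seeds.intersection(span))
            if len(found) >= min_seeds:
                yield line_number, " ".join(span), found

lines = ["Fiji apples are red", "The island is far away"]
print(pair_counts(tokenize(lines[0]), window=4)[("fiji", "red")])  # 1
print(list(neighborhoods(lines, {"fiji", "red", "island"}, window=4, min_seeds=2)))
```

The sketch reports line numbers only; attaching a filename, as the plugin's output does, would be a matter of running it once per file.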