TACIT Cooccurrence

From CSSL

Overview

Studying the relationships and patterns between concepts is often a useful preliminary but important step in data analysis. Co-occurrence analysis gives insight into the interconnections between terms or entities within a text. Two words are said to co-occur if both appear within a defined window, that is, within a certain number of words of each other (e.g., in "Fiji apples are red", "Fiji" and "red" co-occur in a window size of 4).

TACIT's Co-occurrence plugin calculates co-occurrences at both the word-to-word level and the multi-word level. At the word-to-word level, the plugin computes how often each pair of words co-occurs in a text or corpus. At the multi-word level, it locates neighborhoods, defined as groups of multiple terms (neighbors) that co-occur with each other (e.g., "Fiji" and "red" are neighbors within the neighborhood "Fiji apples are red"). To use this method, users specify a set of words whose co-occurrence they want to examine (e.g., fiji, red, island), a threshold for the maximum number of words allowed in a neighborhood (e.g., 4), and a threshold for the minimum number of words from the specified set that must occur in the group (e.g., 2 for fiji & red, fiji & island, red & island, or fiji, red & island).

The output of this process is the groups of text satisfying these constraints, along with their location information (i.e., filename and line number). This output can be used to locate and verify relationships between given entities. The overall matrix of word-pair frequencies can be used to locate clusters and synonyms and to learn the overall connections between entities.
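As a rough illustration of the word-to-word level, the Python sketch below counts how often each pair of words falls within a window. It is not TACIT's implementation. The tokenizer, the function names `tokenize` and `pair_cooccurrences`, and the rule that two words co-occur when their positions differ by at most `window` are all assumptions made for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters (a simplifying assumption)."""
    return re.findall(r"[a-z']+", text.lower())


def pair_cooccurrences(tokens, window):
    """Count unordered word pairs whose positions are at most `window` apart."""
    counts = Counter()
    for i, left in enumerate(tokens):
        # Look only at the next `window` tokens so each pair is counted once.
        for right in tokens[i + 1 : i + 1 + window]:
            if left != right:
                counts[tuple(sorted((left, right)))] += 1
    return counts


counts = pair_cooccurrences(tokenize("Fiji apples are red"), window=4)
print(counts[("fiji", "red")])  # 1
```

The resulting `Counter` plays the role of a sparse pair-frequency matrix: each key is a word pair and each value is how many times that pair co-occurred.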

Basic Tutorial: Using TACIT Co-Occurrence Tool

Specifying Input Files for Analysis

All input data files must be saved as a corpus in Corpus Management or in .txt file format to be compatible with TACIT. See the Corpus Management Help Section if you need to convert files to compatible formats.

To specify which files you would like to analyze, select Add Corpus, Add Folder, or Add File(s) under the Input panel. All files and folders added to the input panel are automatically selected for analysis, as indicated by the check box to the left of the corpus/file/folder name. The number of files selected for analysis is shown at the bottom of the input panel. To de-select an unwanted folder or file, uncheck the box next to its name. To remove a file completely from the program list, click on the file name to highlight it, and then click the "Remove" button. Note: Files within folders/corpora cannot be removed from the tool without removing the entire folder or corpus, but de-selecting files using the check boxes will exclude them from analysis.

Add Corpus: The Add Corpus button will allow you to add a stored corpus from Corpus Management and all included subgroups and files for analysis. To expand the corpus and view the subgroups/files included, click on the arrow to the left of the corpus name in the input panel.
Add Folder: The Add Folder button will allow you to add a folder and all included subfolders and files for analysis. To expand the folder and view the subfolders/files included, click on the arrow to the left of the folder name in the input panel.
Add File(s): The Add Files button will allow you to add .txt files to be included for analysis. Multiple files within the same folder can be selected at the same time using standard multi-select functions.

Additional Analysis Specifications

Preprocess: The Preprocess check box allows users to apply a variety of data cleaning techniques to the input data before analysis. Click on the hyperlink in order to specify which cleaning processes you would like to apply (i.e., stopwords, delimiters, stemming, capitalization).

Word file: This file specifies the words you want to include in the co-occurrence analysis. Words of interest should be saved in a .txt file, one word per line.
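For illustration, a word file for the running example would contain the three lines fiji, red, and island. The sketch below writes such a file to a temporary folder and reads it back. The file name `words.txt` and the helper `load_word_file` are hypothetical, and skipping blank lines is an assumption rather than documented TACIT behavior.

```python
import tempfile
from pathlib import Path


def load_word_file(path):
    """Return the non-empty lines of a word file, stripped of whitespace."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical word file: one word of interest per line.
    word_file = Path(tmp) / "words.txt"
    word_file.write_text("fiji\nred\nisland\n", encoding="utf-8")
    words = load_word_file(word_file)

print(words)  # ['fiji', 'red', 'island']
```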

Window Size: Specify how far apart, in words, two words from the word file list can be for the program to count them as co-occurring. If two words are immediately next to each other, this number would be 1. Larger window sizes allow words to sit further from each other and still be counted as occurring together (e.g., a window size of 4 would mark "apple" and "Fiji" as co-occurring in the sentence "The apple is from Fiji").
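One way to reason about a suitable window size is to measure how far apart two words actually are in a sentence. The sketch below is a minimal example under the assumption that distance is the difference between word positions, so adjacent words are 1 apart. The helper `min_distance` is hypothetical and not part of TACIT.

```python
def min_distance(tokens, first, second):
    """Smallest positional gap between any occurrences of the two words, or None."""
    first_positions = [i for i, token in enumerate(tokens) if token == first]
    second_positions = [i for i, token in enumerate(tokens) if token == second]
    gaps = [abs(i - j) for i in first_positions for j in second_positions]
    return min(gaps) if gaps else None


tokens = "the apple is from fiji".split()
gap = min_distance(tokens, "apple", "fiji")
print(gap)                                   # 3
print(gap <= 4)                              # True: a window size of 4 covers the pair
print(min_distance(tokens, "from", "fiji"))  # 1: adjacent words
```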

Threshold: Threshold indicates the smallest number of words from the word list that must appear together in a combination. For example, a threshold of 2 would take the combination of each pair of words in the list. A threshold of 3 would take all combinations of 3 words from the list, but not word pairs.
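The interaction between the word list, the threshold, and the neighborhood size can be sketched as follows. This is an illustrative approximation, not TACIT's code. The function `find_neighborhoods` and its parameter `max_size` are hypothetical; it treats a neighborhood as up to `max_size` consecutive words on a single line, and it forms combinations of exactly `threshold` words, following the description of Threshold above.

```python
from itertools import combinations


def find_neighborhoods(lines, words, max_size, threshold):
    """Return (group, line_number) for each word group that fits in one span.

    A span is up to `max_size` consecutive tokens on a single line. Each
    (group, line_number) pair is reported at most once.
    """
    groups = list(combinations(words, threshold))
    hits = []
    for line_number, line in enumerate(lines, start=1):
        tokens = line.lower().split()
        last_start = max(len(tokens) - max_size, 0)
        for group in groups:
            for start in range(last_start + 1):
                span = tokens[start : start + max_size]
                if all(word in span for word in group):
                    hits.append((group, line_number))
                    break  # one report per group per line
    return hits


words = ["fiji", "red", "island"]
print(list(combinations(words, 2)))
# [('fiji', 'red'), ('fiji', 'island'), ('red', 'island')]

lines = ["Fiji apples are red", "The island of Fiji is far away"]
hits = find_neighborhoods(lines, words, max_size=4, threshold=2)
print(hits)
# [(('fiji', 'red'), 1), (('fiji', 'island'), 2)]
```

With a threshold of 3 the only group is (fiji, red, island), which appears on neither sample line, so the result would be empty.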

Specifying Output Path

To specify an output folder where output files will be saved, click on the Browse button to the right of the Output Location bar and select a folder. If you create a new folder within this menu and change its name from "New Folder", click on any other folder and then click back on your newly created and renamed folder to select it. After specifying all parameters, click the green and white play button located in the top right corner of the window to run the program. Output information will display in the console panel at the bottom of the tool.

Selecting the Build co-occurrence matrices check box will create an additional output file.

Understanding Co-Occurrence Output

The data output will be in .csv file format. The file name includes the type of technique used for analysis and the time stamp for when the analysis was completed. The output file contains the frequencies for each word combination and their location within the files (i.e., filename and line number).