TACIT zlabel Topic Modelling


Method Overview

TACIT's seeded z-label LDA plugin extends latent Dirichlet allocation (LDA) by implementing the z-label LDA algorithm (Andrzejewski & Zhu, 2009). Z-label LDA lets the researcher provide one or more seed words, that is, words representing the concepts they are interested in exploring. The algorithm then uses these seed words as the core of the topics it generates, building the rest of each topic's words around those key concepts.

Basic Tutorial: Using TACIT Z-Label LDA Tool

Specifying Input Files for Analysis

All input data files must be saved as a corpus in Corpus Management or in .txt file format to be compatible with TACIT. See the Corpus Management Help Section if you need to convert files to compatible formats.

To specify which files you would like to analyze, select Add Corpus, Add Folder, or Add File(s) under the Input panel. All files and folders added to the input panel are automatically selected for analysis, as indicated by the check box to the left of the corpus/file/folder name. The number of files selected for analysis is shown at the bottom of the input panel. To de-select an unwanted folder or file, uncheck the box next to its name. To remove a file completely from the program list, click on the file name to highlight it, and then click the "Remove" button. Note: Files within folders/corpora cannot be removed from the tool without removing the entire folder or corpus, but de-selecting files using the check boxes will exclude them from analysis.

Add Corpus: The Add Corpus button will allow you to add a stored corpus from Corpus Management and all included sub-groups and files for analysis. To expand the corpus and view the subgroups/files included, click on the arrow to the left of the corpus name in the input panel.
Add Folder: The Add Folder button will allow you to add a folder and all included subfolders and files for analysis. To expand the folder and view the subfolders/files included, click on the arrow to the left of the folder name in the input panel.
Add File(s): The Add File(s) button will allow you to add .txt files to be included for analysis. Multiple files within the same folder can be selected at the same time using standard multi-select functions.

In addition to specifying the input files for the analysis, the ZLabel LDA tool also requires a list of seed words. The seed words should be contained in a single .txt formatted file. Each line in the seed words .txt file will be interpreted by the ZLDA tool as seed words for a different topic. For example, to specify seed words for two topics, a seed words file would need to contain one line of seed words for topic 1 and one line of seed words for topic 2. To specify a seed words file, use the Browse button located next to the Seed File Location input box in the input path field.
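The one-line-per-topic layout described above can be produced with a short script. The following is a minimal sketch, not part of TACIT: the helper name write_seed_file, the file name seed_words.txt, and the example words are all hypothetical. The source does not state how words on a line should be separated, so the single-space separator used here is an assumption to check against your TACIT version.

```python
import tempfile
from pathlib import Path


def write_seed_file(path, topics, separator=" "):
    """Write one line of seed words per topic and return the lines written.

    ASSUMPTION: words on a line are separated by a single space. The TACIT
    documentation only states that each line holds the seed words for one topic.
    """
    lines = [separator.join(words) for words in topics]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines


# Hypothetical seed words for two topics (line 1 = topic 1, line 2 = topic 2).
seed_topics = [
    ["economy", "jobs", "taxes"],
    ["health", "doctor", "hospital"],
]

# Written to a temporary folder here; in practice, save the file wherever you
# plan to point the Seed File Location box.
seed_path = Path(tempfile.mkdtemp()) / "seed_words.txt"
written_lines = write_seed_file(seed_path, seed_topics)
```

The resulting file can then be selected with the Browse button next to the Seed File Location input box.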

Additional Options

Preprocessing can be implemented by selecting the Preprocess option, which is located in the Input Path field.

Specifying Output Path

To specify an output folder where the output files will be saved, click on the Browse button to the right of the Output Location bar and select a folder. If you create a new folder within this menu and change its name from "New Folder", click on any other folder and then click back on your newly created and renamed folder to select it. After specifying all parameters, click the green and white play button located in the top right corner of the window to run the program. Output information will display in the console panel at the bottom of the tool.

Understanding Z-Label LDA Output

The data output will be in .csv and .txt file formats. The file names will include the type of output and a time stamp. The ZLDA tool generates the following output files:

Run report: The run report lists the version of TACIT used for the analysis and a time stamp.

Phi: This file is output by the ZLDA implementation, but it is not readily interpretable.

Theta: The Theta file contains the distribution over topics for each document. For each document, the estimated probability that it contains each topic is provided. These probability estimates can be used to identify the most likely topics for a given document.
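
As an illustration of how these probability estimates might be used, the sketch below picks the most probable topic for each document. The exact layout of the Theta file is not described here, so the layout in SAMPLE_THETA (a header row, then one row per document with a document name followed by one probability per topic) is an assumption, and the file names and values are invented for the example. Adapt the parsing to the file TACIT actually produces.

```python
import csv
import io

# ASSUMED layout, with made-up values: header row, then one row per document
# holding the document name followed by one probability per topic.
SAMPLE_THETA = """document,topic_0,topic_1
doc_a.txt,0.80,0.20
doc_b.txt,0.35,0.65
"""


def most_likely_topics(theta_csv_text):
    """Map each document to its (most probable topic name, probability)."""
    reader = csv.reader(io.StringIO(theta_csv_text))
    header = next(reader)
    topic_names = header[1:]
    result = {}
    for row in reader:
        if not row:  # skip blank lines
            continue
        probabilities = [float(value) for value in row[1:]]
        # Index of the largest probability in this document's row.
        best = max(range(len(probabilities)), key=probabilities.__getitem__)
        result[row[0]] = (topic_names[best], probabilities[best])
    return result


best_topics = most_likely_topics(SAMPLE_THETA)
for document, (topic, probability) in best_topics.items():
    print(f"{document}: {topic} ({probability:.2f})")
```

To run this on a real output file, read the file's text and pass it to most_likely_topics in place of SAMPLE_THETA, after confirming the column layout matches.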

Topic Words: The topic words file lists the words that have the highest probability of occurrence for each topic. Both the words and their probabilities are contained in the file. This information can be used to gain insight into the semantic content of a given topic.