TACIT LDA Topic Modelling

From CSSL

Overview

TACIT's LDA plugin allows you to explore the structure of topics within a set of documents and the relations between them. Researchers can adjust the number of topics generated to control the level of granularity: a small number of topics yields a few overarching themes, while a larger number yields many tightly focused topics. Each topic is described by a distribution over the words in the vocabulary, allowing you to see which words are most associated with that topic and explore connections between them across the entire corpus. Additionally, each document is described by the mixture of topics that underlies it, providing a way to compare topic use across documents.
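To make the idea of comparing documents by their topic mixtures concrete, here is a minimal Python sketch. It is not part of TACIT. The function name hellinger and the three mixtures are invented for illustration, and the Hellinger distance is only one of several reasonable ways to compare two probability distributions.

```python
import math

def hellinger(p, q):
    """Hellinger distance between two topic mixtures of equal length.

    Returns 0.0 for identical mixtures and 1.0 for mixtures that share no topic.
    """
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared / 2)

# Made-up topic mixtures for three hypothetical documents (three topics each).
doc_a = [0.7, 0.2, 0.1]
doc_b = [0.6, 0.3, 0.1]
doc_c = [0.1, 0.1, 0.8]

dist_ab = hellinger(doc_a, doc_b)  # small: similar topic use
dist_ac = hellinger(doc_a, doc_c)  # larger: different dominant topic
```

In this toy example doc_a is closer to doc_b than to doc_c, because the first two documents draw mostly on the same topic.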

Basic Tutorial: Using TACIT LDA Topic Modelling Tool

Specifying Input Files for Analysis

All input data files must be saved as a corpus in Corpus Management or in .txt file format to be compatible with TACIT. See the Corpus Management Help Section if you need to convert files to compatible formats.

To specify which files you would like to analyze, select Add Corpus, Add Folder, or Add File(s) under the Input panel. All files and folders added to the input panel are automatically selected for analysis, as indicated by the checked box to the left of the corpus/file/folder name. The number of files selected for analysis is indicated at the bottom of the input panel. To de-select an unwanted folder or file, uncheck the box next to its name. To remove a file completely from the program list, click on the file name to highlight it, and then click the "Remove" button. Note: Files within folders/corpora cannot be removed from the tool without removing the entire folder or corpus, but de-selecting files using the check boxes will exclude them from analysis.

Add Corpus: The Add Corpus button will allow you to add a stored corpus from Corpus Management and all included sub-groups and files for analysis. To expand the corpus and view the subgroups/files included, click on the arrow to the left of the corpus name in the input panel.
Add Folder: The Add Folder button will allow you to add a folder and all included subfolders and files for analysis. To expand the folder and view the subfolders/files included, click on the arrow to the left of the folder name in the input panel.
Add File(s): The Add File(s) button will allow you to add .txt files to be included for analysis. Multiple files within the same folder can be selected at the same time using standard multi-select functions.
In addition to selecting files for analysis, users must specify the number of topics to extract from the corpus. This parameter can be set within the Input Path field within the LDA tool.

Additional Options

Preprocessing: Preprocessing can be implemented by selecting the Preprocess option, which is located in the Input Path field.

Word weights: The word weights calculated during the analysis can be saved as an output file by selecting the Create Word Weights File option in the Output Path field.

Specifying Output Path

To specify an output folder where the LDA output files will be saved, click on the Browse button to the right of the Output Location bar and select a folder. If you create a new folder within this menu and change its name from "New Folder", click on any other folder and then click back on your newly created & renamed folder to select it. After specifying all parameters, click the green and white play button (Image 1) located in the top right corner of the window to run the program. Output information will display in the console panel at the bottom of the tool.

The output file prefix can also be specified using the Output Prefix input box located in the Output Path field.

Understanding LDA Output

The data output will be in .csv and .txt file formats. The file names will include the type of technique used for analysis (LDA) and the output type. The LDA tool generates the following output files:

Run report: The run report lists the version of TACIT used for the analysis and a time stamp.

Topic composition: The topic composition file contains the topic probabilities for each document. Each row in the file is associated with a single document, and the columns are: Number, which specifies the document number; File Name, which lists the document file name; and Topic N Probability, with one such column per topic, where N ranges from 0 through the number of topics specified. Each Topic N Probability column contains the probability that the document is associated with that column's topic.
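As an illustration of how the topic composition file might be used, the following Python sketch finds the most probable topic for each document. It assumes the file is comma-separated with a header row using exactly the column names described above. The function name dominant_topics and the two sample rows are hypothetical, not real TACIT output.

```python
import csv
import io

def dominant_topics(csv_file):
    """Map each file name to its (most probable topic column, probability)."""
    result = {}
    for row in csv.DictReader(csv_file):
        # Keep only the "Topic N Probability" columns (assumed header names).
        probs = {
            name: float(value)
            for name, value in row.items()
            if name and name.startswith("Topic") and name.endswith("Probability")
        }
        best = max(probs, key=probs.get)
        result[row["File Name"]] = (best, probs[best])
    return result

# Made-up sample in the assumed layout; replace with open("your_file.csv", newline="").
sample = io.StringIO(
    "Number,File Name,Topic 0 Probability,Topic 1 Probability\n"
    "0,a.txt,0.8,0.2\n"
    "1,b.txt,0.35,0.65\n"
)
top = dominant_topics(sample)
```

If the actual file uses a different delimiter or header wording, the column filter and the csv.DictReader call would need to be adjusted accordingly.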

Topic keys: The topic keys file lists the top 19 words for each topic. This information can be used to gain a general idea of what each topic represents.

Word counts: The word-counts file lists the frequency of occurrence for each word within each topic. For example, for an LDA analysis with two topics, the line "Turing 1:32 2:1" would indicate that the word "Turing" occurs 32 times within topic 1 and 1 time within topic 2. If a word does not occur within a topic, that topic is not listed for that word within the word-counts file; that is, there are no entries of the form i:0.
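The line format described above can be read with a few lines of Python. This is a minimal sketch that assumes each line looks exactly like the "Turing 1:32 2:1" example (a word followed by space-separated topic:count pairs). The function name parse_word_counts_line is hypothetical.

```python
def parse_word_counts_line(line):
    """Parse a line such as 'Turing 1:32 2:1' into (word, {topic: count})."""
    word, *pairs = line.split()
    counts = {}
    for pair in pairs:
        topic, count = pair.split(":")
        counts[int(topic)] = int(count)
    return word, counts

word, counts = parse_word_counts_line("Turing 1:32 2:1")

# Topics with no occurrences are absent from the line, so default to 0.
count_in_topic_3 = counts.get(3, 0)

# Share of the word's occurrences that fall in each listed topic.
total = sum(counts.values())
shares = {topic: count / total for topic, count in counts.items()}
```

Because the file never contains i:0 entries, using counts.get(topic, 0) is a simple way to treat missing topics as zero occurrences.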

Word weights (optional): The word weights file lists the weight of each word for each topic. The words that have the highest weights for a given topic are the words that constitute the general focus of the topic. This output file is not automatically generated and must be requested by selecting the Create Word Weight File option, which is located in the Output Path field.