TACIT Kmeans Cluster

From CSSL

Overview

The K-means clustering plugin (MacQueen, 1967) partitions texts into a user-specified number of clusters (or groups) such that each text is closer to the centroid of its own cluster (the prototypical document of that cluster) than to the centroid of any other cluster. Once this optimization is performed, TACIT outputs a .csv file containing the membership information for the documents in the corpus.
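To make the optimization concrete, here is a minimal sketch of K-means in plain Python. It is not TACIT's implementation: the word-count vectors, the Euclidean distance, the choice of the first k documents as starting centroids, and the function names (vectorize, distance, kmeans) are all assumptions made for illustration.

```python
import math
from collections import Counter


def vectorize(texts):
    """Turn each text into a word-count vector over a shared vocabulary."""
    counts = [Counter(t.lower().split()) for t in texts]
    vocab = sorted(set().union(*counts))
    return [[c[w] for w in vocab] for c in counts]


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(vectors, k, max_iter=100):
    """Lloyd's algorithm; seeding with the first k vectors is an assumption."""
    centroids = [list(v) for v in vectors[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each document joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: distance(v, centroids[j])) for v in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the run has converged
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


texts = [
    "the cat sat on the mat",
    "stocks fell as markets closed",
    "the cat chased the mouse",
    "markets rallied and stocks rose",
]
labels, _ = kmeans(vectorize(texts), k=2)
print(labels)  # [0, 1, 0, 1]
```

In this toy corpus the two texts about cats end up in one cluster and the two texts about markets in the other. Real tools typically differ in how they weight words and pick starting centroids, so results on the same corpus may vary.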

Basic Tutorial: Using the TACIT K-Means Clustering Tool

Specifying Input Files for Analysis

All input data files must be saved as a corpus in Corpus Management or in .txt file format to be compatible with TACIT. See the Corpus Management Help Section if you need to convert files to compatible formats.

To specify which files you would like to analyze, select Add Corpus, Add Folder, or Add File(s) under the Input panel. All files and folders added to the input panel are automatically selected for analysis, as shown by the check box to the left of the corpus/file/folder name. The number of files selected for analysis is indicated at the bottom of the input panel. To de-select an unwanted folder or file, uncheck the box next to its name. To remove a file completely from the program list, click on the file name to highlight it, and then click the "Remove" button. Note: Files within folders/corpora cannot be removed from the tool without removing the entire folder or corpus, but de-selecting files using the check boxes will exclude them from analysis.

Add Corpus: The Add Corpus button will allow you to add a stored corpus from Corpus Management and all included sub-groups and files for analysis. To expand the corpus and view the subgroups/files included, click on the arrow to the left of the corpus name in the input panel.
Add Folder: The Add Folder button will allow you to add a folder and all included subfolders and files for analysis. To expand the folder and view the subfolders/files included, click on the arrow to the left of the folder name in the input panel.
Add File(s): The Add File(s) button will allow you to add .txt files to be included for analysis. Multiple files within the same folder can be selected at the same time using standard multi-select functions.
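As a rough illustration of what adding a folder "and all included subfolders and files" amounts to, the sketch below walks a folder tree and keeps only .txt files. It is not TACIT code: the helper name collect_txt_files is hypothetical, and filtering by the .txt extension is an assumption based on the format requirement stated above.

```python
from pathlib import Path
import tempfile


def collect_txt_files(folder):
    """Return every .txt file in folder and its subfolders, sorted by path."""
    return sorted(p for p in Path(folder).rglob("*.txt") if p.is_file())


# Build a throwaway folder tree so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as root:
    root_path = Path(root)
    (root_path / "sub").mkdir()
    (root_path / "a.txt").write_text("first document")
    (root_path / "sub" / "b.txt").write_text("second document")
    (root_path / "notes.csv").write_text("not a text file")

    # Show paths relative to the top folder, with forward slashes.
    found = [p.relative_to(root_path).as_posix() for p in collect_txt_files(root_path)]
    print(found)  # ['a.txt', 'sub/b.txt']
```

The file in the subfolder is picked up along with the top-level one, while the .csv file is skipped.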

Additional Input Parameters

In addition to selecting files for analysis, users must specify the number of clusters into which the corpus will be separated. This parameter can be set within the Output Path field within the K-Means tool.

Additional Options

Preprocess

Specifying Output Path

To specify an output folder where the output files will be saved, click on the Browse button to the right of the Output Location bar and select a folder. If you create a new folder within this menu and change its name from "New Folder", click on any other folder and then click back on your newly created and renamed folder to select it.

After specifying all parameters, click the green and white play button located in the top right corner of the window to run the program. Output information will display in the console panel at the bottom of the tool.

Understanding K-Means Clustering Output

The K-Means Cluster tool automatically generates two output files in .txt format: a run report and the cluster structure. Each file name includes the type of technique used for analysis and the time stamp for when the analysis was completed.

The cluster structure report (see Image 2 for an example) lists each cluster and the documents that comprise it. For example, Image 2 shows the results of an analysis of 8 documents that were separated into 3 clusters. As the report indicates, most (6) of the documents are contained in the 2nd cluster, and the remaining two documents each form a singleton cluster.
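The relationship between per-document membership and the cluster listing can be sketched in a few lines of Python. The document names and the membership dictionary below are hypothetical and do not reproduce TACIT's output format; they only mirror the 8-document, 3-cluster example described above.

```python
from collections import defaultdict

# Hypothetical membership data: document name -> cluster number.
membership = {
    "doc1.txt": 2, "doc2.txt": 2, "doc3.txt": 1, "doc4.txt": 2,
    "doc5.txt": 2, "doc6.txt": 3, "doc7.txt": 2, "doc8.txt": 2,
}

# Invert the mapping so each cluster lists the documents it contains.
clusters = defaultdict(list)
for doc, cluster in membership.items():
    clusters[cluster].append(doc)

# Count the documents per cluster, in cluster-number order.
sizes = {c: len(docs) for c, docs in sorted(clusters.items())}
print(sizes)  # {1: 1, 2: 6, 3: 1}
```

Clusters 1 and 3 each hold a single document, which is what a singleton cluster means here.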