TACIT Support Vector Machine Classifier

From CSSL

Overview

TACIT's SVM classification plugin can be used to investigate the degree to which two classes can be separated based on the provided data. The plugin requires that the researcher pre-organize their texts by class into separate directories/corpora. Once the data is organized by group, the SVM plugin automatically conducts training and test runs on the data, determines predictive accuracy, and calculates relevant classification statistics. The SVM plugin supports k-fold cross validation of accuracy rates and can report the language features it has identified as distinguishing the groups from each other. By exploring this output, users can identify the words that are most strongly associated with class membership.

Basic Tutorial: Using TACIT's SVM Plugin

Specifying Input Files for Classification

All input data files must be saved either as classes in a corpus using Corpus Management or as .txt format files to be compatible with TACIT. See the Corpus Management Help Section if you need to convert files to compatible formats.

The tool has two panels for class inputs (Class 1 and Class 2). For each class, users must enter a label for the group (e.g., Democrat) and specify the files which reflect that group. The Add Corpus, Add Folder, and Add File(s) buttons permit users to assign corpora, folders, or files to each class.
Add Corpus: The Add Corpus button will allow you to add a stored corpus from Corpus Management and all included sub-groups and files for analysis. To expand the corpus and view the subgroups/files included, click on the arrow to the left of the corpus name in the input panel.
Add Folder: The Add Folder button will allow you to add a folder and all included subfolders and files for analysis. To expand the folder and view the subfolders/files included, click on the arrow to the left of the folder name in the input panel.
Add File(s): The Add File(s) button will allow you to add .txt files to be included for analysis. Multiple files within the same folder can be selected at the same time using standard multi-select functions.
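For readers who want to picture how a folder of texts becomes a labeled class, the following Python sketch mirrors the Add Folder behavior described above: it gathers every .txt file under a folder, including subfolders, and pairs each file with the class label. This is not TACIT's own code; the function names (collect_txt_files, build_dataset), the folder names, and the labels are hypothetical and chosen only for illustration.

```python
import tempfile
from pathlib import Path


def collect_txt_files(folder):
    """Return all .txt files under `folder`, including subfolders, in sorted order."""
    return sorted(p for p in Path(folder).rglob("*.txt") if p.is_file())


def build_dataset(class_folders):
    """Pair every .txt file with the label of the class folder it came from.

    `class_folders` maps a class label to a folder path (hypothetical layout).
    """
    dataset = []
    for label, folder in class_folders.items():
        dataset.extend((path, label) for path in collect_txt_files(folder))
    return dataset


# Demo on a throwaway directory tree so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as root:
    root = Path(root)
    (root / "democrat" / "sub").mkdir(parents=True)
    (root / "republican").mkdir()
    (root / "democrat" / "a.txt").write_text("text one")
    (root / "democrat" / "sub" / "b.txt").write_text("text two")
    (root / "republican" / "c.txt").write_text("text three")
    (root / "republican" / "notes.csv").write_text("ignored, not a .txt file")

    dataset = build_dataset({
        "Democrat": root / "democrat",
        "Republican": root / "republican",
    })
    labels = [label for _, label in dataset]
    names = [path.name for path, _ in dataset]

print(list(zip(names, labels)))
```

The sketch skips the non-.txt file, which is consistent with the requirement that input files be in .txt format.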

Users can also elect to preprocess the data selected for analysis by checking the box next to Preprocess.

Specifying Output Details

K-value for Cross Validation: K-fold cross validation estimates how accurately the classifier sorts the texts into groups based on the features it has identified. The number entered in this text box specifies the number of equal-sized, randomly generated subsamples the texts should be split into. One of these subsamples is held out as the test set, and the algorithm is trained on the remaining subsamples. This process is repeated k times (the folds) so that each subsample is used exactly once as the test set. The error rates obtained from each fold are then averaged to estimate the actual error rate. Any number up to the total sample size can be specified as the k-value; however, k=10 is one of the most commonly used values.
Create Feature Weight Files: Selecting this check box will generate a list of words that the algorithm used to predict group/class membership and a weight for each word signifying its predictive power (higher weight=more predictive of class membership). The list of features will be generated for each k-fold analysis.
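The k-fold procedure described above can be sketched in a few lines of Python. This is a minimal illustration of the general technique, not TACIT's implementation: the function name k_fold_splits, the seed parameter, and the lower bound of k=2 are assumptions made for the example.

```python
import random


def k_fold_splits(n_items, k, seed=0):
    """Return k (train_indices, test_indices) pairs covering range(n_items).

    Assumption: k must be at least 2 and at most the number of items.
    """
    if not 2 <= k <= n_items:
        raise ValueError("k must be between 2 and the number of items")
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # random assignment to subsamples
    # Deal the shuffled indices round-robin so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i, test in enumerate(folds):
        # Train on every fold except the one held out for testing.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


splits = k_fold_splits(n_items=20, k=5)
for train, test in splits:
    print(len(train), "training texts,", len(test), "test texts")
```

With 20 texts and k=5, each fold trains on 16 texts and tests on the remaining 4, and every text serves as a test item exactly once across the five folds.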

To specify an output folder where output files will be saved, click on the Browse button to the right of the Output Location bar and select a folder. If you create a new folder within this menu and change its name from "New Folder", click on any other folder and then click back on your newly created and renamed folder to select it.

After specifying all parameters, click the green and white play button located in the top right corner of the window to run the program. Output information regarding the k-fold validation process as well as the average test accuracy, standard deviation, and standard error will be displayed in the console panel at the bottom of the tool.
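To make the three summary statistics concrete, the sketch below computes them from a list of per-fold accuracies. It assumes the sample standard deviation (n-1 denominator) and a standard error equal to the standard deviation divided by the square root of the number of folds; the source does not state which formulas TACIT uses, so treat these as common conventions rather than a description of the tool. The function name and the accuracy values are hypothetical.

```python
import math
import statistics


def summarize_accuracies(accuracies):
    """Return (mean, standard deviation, standard error) of per-fold accuracies."""
    mean = statistics.mean(accuracies)
    sd = statistics.stdev(accuracies)  # sample standard deviation (assumption)
    se = sd / math.sqrt(len(accuracies))  # standard error of the mean (assumption)
    return mean, sd, se


# Hypothetical accuracies (0 to 100) from a 5-fold run.
fold_accuracies = [82.0, 78.0, 85.0, 80.0, 75.0]
mean, sd, se = summarize_accuracies(fold_accuracies)
print(f"mean={mean:.2f} sd={sd:.2f} se={se:.2f}")
```

A small standard error relative to the mean suggests that the accuracy estimate is stable across folds.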

Understanding SVM Classification Output

The data output will be in .csv file format. The file name includes the type of technique used for analysis and the time-stamp for when the analysis was completed. Two files are automatically generated by the SVM tool: (1) a summary .txt file, which states the run time, TACIT version number, and the tool used, and (2) an accuracy report. The accuracy report contains an accuracy statistic that ranges from 0 to 100 (100 = perfect accuracy) for each k-fold run and an average accuracy statistic that is aggregated across cross-validation runs. These accuracy statistics can be used to evaluate the distinguishability of the analyzed classes.

If the Create Feature Weight Files option was selected for an analysis, feature weight .csv files will be created for each k-fold validation run. These feature weights can be used to gain insight into the role individual data features (e.g., words) played in the classification process. For example, by examining the features that are most heavily weighted, users can identify the features that had the greatest discriminative power. This can be valuable for identifying and understanding differences between classes.
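One way to explore a feature weight file outside of TACIT is to rank its rows by weight, as in the Python sketch below. The column names (feature, weight), the sample words, and the sample values are assumptions made for illustration; check the header of the actual .csv file before adapting this. The sketch ranks by absolute value in case weights are signed; if all weights are non-negative, this is the same as sorting from highest to lowest.

```python
import csv
import io

# Hypothetical file contents; the real column layout may differ.
SAMPLE_CSV = """feature,weight
taxes,0.82
healthcare,-0.75
the,0.01
freedom,0.40
"""


def top_features(csv_text, n=3):
    """Return the n (feature, weight) pairs with the largest absolute weight."""
    rows = csv.DictReader(io.StringIO(csv_text))
    weighted = [(row["feature"], float(row["weight"])) for row in rows]
    weighted.sort(key=lambda item: abs(item[1]), reverse=True)
    return weighted[:n]


for feature, weight in top_features(SAMPLE_CSV):
    print(f"{feature}\t{weight:+.2f}")
```

Because a separate file is produced for each fold, comparing the top-ranked features across files can indicate which features are consistently informative.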