TACIT Clustering Overview


TACIT Clustering Tools

List of TACIT Clustering Tools

Hierarchical Clustering
K-Means Clustering

Overview of Clustering Tools

Cluster analysis techniques automatically sort texts into groups based on similarities in the texts themselves, allowing researchers to discover groupings that they did not define in advance.

TACIT includes cluster analysis plugins that interface with two of the most widely used cluster analysis algorithms: k-means clustering and hierarchical clustering. The k-means clustering tool (MacQueen, 1967) aims to sort texts into a user-specified number of clusters (or groups) such that each text is closer to the centroid of its own cluster (the prototypical document of that cluster) than to the centroid of any other cluster.
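
The k-means idea described above can be illustrated with a minimal sketch in plain Python. It is not TACIT's implementation: the word-count representation, the Euclidean distance, and the helper names (vectorize, distance, mean, kmeans) are all assumptions made for this example.

```python
import math
import random
from collections import Counter

def vectorize(texts):
    """Turn each text into a word-count vector over a shared vocabulary."""
    counts = [Counter(t.lower().split()) for t in texts]
    vocab = sorted(set().union(*counts))
    return [[c[w] for w in vocab] for c in counts]

def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(vectors):
    """Component-wise mean of a non-empty list of vectors (the centroid)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def kmeans(vectors, k, iterations=100, seed=0):
    """Assign each vector to one of k clusters (hypothetical helper)."""
    rng = random.Random(seed)
    # Assumption: start from k randomly chosen documents as centroids.
    centroids = rng.sample(vectors, k)
    labels = None
    for _ in range(iterations):
        # Assignment step: each document joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: distance(v, centroids[j]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = mean(members)
    return labels, centroids

texts = [
    "cat dog cat",
    "dog cat dog",
    "stock market stock",
    "market stock market",
]
labels, centroids = kmeans(vectorize(texts), k=2)
print(labels)  # the two pet texts share one label, the two finance texts the other
```

Note that the number of clusters (k) has to be supplied by the user, which is the main practical difference from the hierarchical approach.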

Hierarchical clustering (Johnson, 1967), by comparison, does not require a prespecified number of clusters. Instead, this technique determines the number of clusters from the data by starting with the assumption that all data points (in this case documents) belong to one cluster. The algorithm then splits this root cluster into smaller child clusters based on the degree of similarity between the documents, separating the documents that are the greatest distance apart. These child clusters are recursively divided further until only singleton clusters remain.
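
The top-down splitting described above can be sketched as follows. This is an illustration only, not TACIT's or Johnson's procedure: the splitting rule (seed each split with the two most distant documents, then attach every other document to the nearer seed), the Euclidean distance, and the function names are assumptions chosen to keep the example short.

```python
import math
from itertools import combinations

def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def split(indices, vectors):
    """Split one cluster in two, seeded by its two most distant members."""
    a, b = max(
        combinations(indices, 2),
        key=lambda p: distance(vectors[p[0]], vectors[p[1]]),
    )
    left, right = [], []
    for i in indices:
        if distance(vectors[i], vectors[a]) <= distance(vectors[i], vectors[b]):
            left.append(i)
        else:
            right.append(i)
    if not right:
        # All documents are identical: fall back to an arbitrary split
        # so that the recursion still terminates.
        left, right = indices[:1], indices[1:]
    return left, right

def divisive(indices, vectors):
    """Return a nested tuple tree; each leaf is a single document index."""
    if len(indices) == 1:
        return indices[0]
    left, right = split(indices, vectors)
    return (divisive(left, vectors), divisive(right, vectors))

# Hypothetical word-count vectors for four short documents.
vectors = [
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 1, 2],
    [0, 0, 2, 1],
]
tree = divisive(list(range(len(vectors))), vectors)
print(tree)  # ((0, 1), (2, 3))
```

The result is a tree rather than a flat list of groups. Cutting the tree at a given depth yields a particular number of clusters, so no cluster count has to be fixed before the analysis starts.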