Monthly Archives: July 2013

Clustify 3.2 Features

The press release announcing Clustify 3.2 went out yesterday.  In this post I’ll talk a bit more about some of the features in the new version and provide some screenshots.

Keyword view for news articles from 2008. Lines connect words that are highly correlated. Groups of highly correlated words are given the same color to highlight main themes.

The biggest addition is the ability to export an SVG file that provides an interactive graphical visualization of the keyword and cluster relationships, which can be viewed in a web browser.  This is useful for getting a quick sense of the major themes in the document set and seeing which keywords are closely related to each other.  You can configure it to link the clusters and keywords to an arbitrary URL with the ID of the clicked cluster or keyword embedded in the URL.  That lets you link a cluster or keyword to a browser-based review platform, which can then display a list of the documents in the clicked cluster or containing the clicked keyword.
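To make the URL linking concrete, here is a minimal sketch of embedding a clicked ID in a link. The `{id}` placeholder, the `build_link` helper, and the example.com URLs are assumptions made for illustration; they are not Clustify's actual configuration syntax.

```python
from urllib.parse import quote

# Hypothetical URL template for a browser-based review platform.
# The "{id}" placeholder is an assumption, not Clustify's real syntax.
CLUSTER_URL_TEMPLATE = "https://review.example.com/documents?cluster={id}"
KEYWORD_URL_TEMPLATE = "https://review.example.com/search?q={id}"


def build_link(template: str, item_id: str) -> str:
    """Embed a cluster or keyword ID in a URL template.

    The ID is percent-encoded so that spaces and punctuation in a
    keyword cannot break the resulting URL.
    """
    return template.replace("{id}", quote(str(item_id), safe=""))


cluster_link = build_link(CLUSTER_URL_TEMPLATE, "1042")
keyword_link = build_link(KEYWORD_URL_TEMPLATE, "health care")

print(cluster_link)
print(keyword_link)
```

The review platform on the receiving end only needs to read the ID from the URL and run its own query for the matching documents.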

Cluster view after clicking “Obama” from the keyword view. Hovering the mouse over “Hillary” highlights clusters labeled with that word.

Version 3.2 also adds the ability to export tags into virtually any database having an ODBC driver (Microsoft SQL Server, Oracle, DB2, MySQL, PostgreSQL, etc.) as additional columns.  There is one column for each exported tag, containing “y” or “n” to indicate whether that tag was applied to the document.  Tagging can be done more efficiently in Clustify than in many review platforms because you can tag entire clusters, or all clusters labeled with a particular combination of keywords, with a single mouse click or key-press.  The addition of tag export allows you to export the tags from Clustify to virtually any review platform based on a standard database.  This makes it practical to do full tagging within Clustify, or to do very crude tagging within Clustify to prioritize the documents for full review in a review platform.
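Once the tags are in the database as “y”/“n” columns, a review platform can prioritize documents with an ordinary query. The sketch below uses Python's built-in sqlite3 module as a stand-in for an ODBC-accessible database. The table name `documents` and the columns `tag_responsive` and `tag_privileged` are hypothetical, chosen only to illustrate the one-column-per-tag layout described above.

```python
import sqlite3

# In-memory SQLite database standing in for the review platform's database.
# Table and column names are assumptions for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE documents (
        doc_id         INTEGER PRIMARY KEY,
        title          TEXT,
        tag_responsive TEXT,  -- 'y' or 'n', one column per exported tag
        tag_privileged TEXT   -- 'y' or 'n'
    )
    """
)
conn.executemany(
    "INSERT INTO documents VALUES (?, ?, ?, ?)",
    [
        (1, "Contract draft", "y", "n"),
        (2, "Lunch invitation", "n", "n"),
        (3, "Pricing memo", "y", "n"),
        (4, "Letter from counsel", "y", "y"),
    ],
)

# Build a review queue: documents tagged responsive but not privileged.
rows = conn.execute(
    """
    SELECT doc_id
    FROM documents
    WHERE tag_responsive = 'y' AND tag_privileged = 'n'
    ORDER BY doc_id
    """
).fetchall()
review_queue = [row[0] for row in rows]

print(review_queue)
conn.close()
```

A real deployment would connect through the database's own driver rather than SQLite, but the query against the tag columns would look much the same.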

The new version contains many smaller feature additions like the “Literal Jaccard” similarity function, the ability to sort clusters on any field in any context, and improvements to the algorithm for detecting email footers so they can be ignored (to avoid grouping emails together merely because they contain the same confidentiality notice).  It also adds two new command-line tools, clustify_svg_creator (create graphical visualization) and clustify_match_highlighter (create side-by-side near-dupe comparison in HTML), to allow users to tap into Clustify’s capabilities from within other programs without the Clustify GUI getting in the way.
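This post does not define “Literal Jaccard”, so the sketch below shows only the standard Jaccard index on word sets as background. It should not be read as Clustify's implementation. The function name and the whitespace tokenization are assumptions.

```python
def jaccard(text_a: str, text_b: str) -> float:
    """Standard Jaccard index: |A intersect B| / |A union B| over word sets.

    Words are lowercased and split on whitespace, which is a simplifying
    assumption and not necessarily how any particular product tokenizes.
    """
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    if not words_a and not words_b:
        return 1.0  # treat two empty texts as identical
    return len(words_a & words_b) / len(words_a | words_b)


# Three shared words (the, quick, fox) out of five distinct words overall.
score = jaccard("the quick brown fox", "the quick red fox")
print(score)
```

A score of 1.0 means the two texts use exactly the same set of words, and 0.0 means they share none.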