Carrot2 Explained

Carrot2
Developer: Carrot Search
Latest Release Version: 4.6.0
Operating System: Cross-platform
Programming Language: Java
Genre: Text mining and cluster analysis
License: BSD license

Carrot²[1] is an open source search results clustering engine.[2] It can automatically cluster small collections of documents, e.g. search results or document abstracts, into thematic categories. Carrot² is written in Java and distributed under the BSD license.

History

The initial version of Carrot² was implemented in 2001 by Dawid Weiss as part of his MSc thesis to validate the applicability of the STC clustering algorithm to clustering search results in Polish.[3] In 2003, a number of other search results clustering algorithms were added, including Lingo, a novel text clustering algorithm designed specifically for clustering search results. Although the source code of Carrot² had been available since 2002, version 1.0 was not officially released until 2006. In the same year, version 2.0 followed with an improved user interface and an extended tool set. In 2009, version 3.0 brought significant improvements in clustering quality, a simplified API and a new GUI application for tuning clustering, based on the Eclipse Rich Client Platform. In 2020, version 4.0.0 brought further simplification of the API, code cleanups and removal of the desktop Workbench. Version 4.1.0 brought the Workbench back as a web-based application.

Carrot² releases
Release | Release date | Major changes and new features
4.6.0 | May 2024 | Dependency updates, build system improvements.
4.5.2 | November 2023 | Dependency updates, build system improvements.
4.5.1 | May 2023 | Dependency updates, minor bug fixes.
4.5.0 | November 2022 | Dependency updates, bug fixes.
4.4.3 | August 2022 | Dependency updates, bug fixes to STC and stemming infrastructure.
4.4.0, 4.4.1, 4.4.2 | December 2021 | Security fixes and dependency updates.
4.3.0 | July 2021 | Minor API changes and bug fixes. Improvements to the workbench (DCS search frontend).
4.2.0, 4.2.1 | March 2021 | Improvements to JSON dictionaries and the workbench. Bug fixes.
4.1.0 | January 2021 | Web-based Workbench. JSON dictionaries and new filtering options. API polishing.
4.0.0 | July 2020 | API changes and simplifications across the codebase. Removal of deprecated technologies and tools. New documentation and code cleanups.
3.16.2 | September 2019 | Update third party libraries (security-related issues).
3.16.1 | January 2019 | Update of JS visualizations. Migration of Microsoft Bing API v5 to v7.
3.16.0 | May 2018 | An overhaul of Java 9+ compatibility issues. Workbench compatibility for Ubuntu distros. Document source updates and removals of non-functional document sources.
3.15.1 | March 2017 | A bugfix for .NET release that could result in unchecked I/O exceptions on inaccessible current working directory.
3.15.0 | October 2016 | Bing API V2 to V5 transition. Upgrade of third party dependencies. Internal cosmetics.
3.14.0 | September 2016 | Workbench improvements (high DPI support, Mac OS X improvements, bug fixes). PubMed switching to HTTPS. Other minor improvements.
3.13.0 | July 2016 | Servlet API bug fixes, Workbench bug fixes, removed Google document source, fixed language codes for a few languages.
3.12.0 | February 2016 | Upgrade of Morfologik Polish dictionary, infrastructural changes and adjustments allowing C2 to operate under more strict security manager policies.
3.11.0 | October 2015 | Upgrade of Apache Lucene, bug fixes and a rollup of changes from 3.10.x minors.
3.10.4 | October 2015 | Upgrade of Morfologik library.
3.10.3 | August 2015 | Repackaged Google Guava to avoid conflicts in Solr.
3.10.2 | July 2015 | Minor fixes to the Workbench (Arabic cluster display).
3.10.1 | May 2015 | Aduna visualization dropped from MacOS distribution. Minor fixes to the Workbench.
3.10.0 | May 2015 | Visualization updates. Bug fixes. Library dependency updates.
3.9.4 | November 2014 | FoamTree update. New attributes for multilingual clustering. Visualization fixes.
3.9.3 | July 2014 | FoamTree update. Infrastructure fixes and tweaks (jflex, sonatype repository URLs).
3.9.2 | April 2014 | Bug fix to FoamTree HTML5.
3.9.1 | April 2014 | Bug fixes, upgrades of HTML5 visualizations.
3.9.0 | February 2014 | HTML5 visualizations replacing Flash, library dependencies update, bugfixes.
3.8.1 | October 2013 | Bug fixes, minor tweaks to functionality.
3.8.0 | July 2013 | Bug fixes, library dependency updates.
3.7.1 | May 2013 | Minor bug fixes (3.7.0 maintenance release).
3.7.0 | April 2013 | Infrastructure changes to the core (string IDs), better Solr integration XSLT, Workbench tweaks for larger inputs, updated dependencies.
3.6.3 | April 2013 | Minor bug fixes and improvements: customization of Solr adapter XSLT, Workbench tweaks for larger inputs, updated dependencies.
3.6.2 | November 2012 | Minor bug fixes and improvements.
3.6.1 | August 2012 | Minor bug fixes.
3.6.0 | June 2012 | Infrastructural changes, refactorings and bug fixes.
3.5.3 | December 2011 | Infrastructure updates resulting from migration to GitHub. Workbench update to SWT 3.7.1.
3.5.2 | September 2011 | Ajax support in Document Clustering Server, Bing document source improved, Workbench improvements, bug fixes.
3.5.1 | June 2011 | Bug fixes, visualization integration improvements, support for Yahoo BOSS API removed.
3.5.0 | May 2011 | FoamTree visualization, bisecting k-means clustering, resource management improvements
3.4.3 | March 2011 | Distribution to Maven central repository
3.4.2 | October 2010 | Bug fixes
3.4.1 | September 2010 | Solr 1.4.x compatibility package, bug fixes
3.4.0 | August 2010 | .NET API for calling Carrot² clustering
3.3.0 | April 2010 | Significant scalability improvements in the STC clustering algorithm
3.2.0 | March 2010 | Experimental support for clustering Arabic and Korean content, command line application for clustering in batch mode, LGPL-licensed dependencies removed
3.1.0 | September 2009 | Experimental support for clustering Chinese content, search results clustering plugin for Apache Solr
3.0.1 | March 2009 | Document Clustering Workbench available for Mac OS X
3.0.0 | January 2009 | Document Clustering Workbench added for easy experimenting with Carrot² clustering, radically simplified Java API, search results clustering web application re-implemented, user manual[4] available
2.1.0 | August 2007 | Document Clustering Server added for exposing clustering as a REST service
2.0.0 | September 2006 | New user interface of the search results clustering web application
1.0.0 | January 2006 | First official release, binaries available on SourceForge
0.0.0 | since 2002 | Incubation releases, source code available on SourceForge

Architecture

Carrot² 4.0 is predominantly a Java programming library with public APIs for managing language-specific resources and for configuring and executing algorithms. An HTTP/REST component (the document clustering server) is provided for interoperability with other languages.
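
To make the HTTP/REST route more concrete, the following sketch prepares a JSON clustering request using only the Java standard library (Java 11 or later). It is an illustration under stated assumptions, not a description of the server's documented contract: the endpoint address `http://localhost:8080/service/cluster`, the JSON field names `language`, `algorithm` and `documents`, and the algorithm name `Lingo` are assumptions that should be checked against the document clustering server documentation for the installed version. The class and method names are hypothetical.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class DcsRequestSketch {
  // ASSUMPTION: address of a locally running document clustering server.
  static final URI ASSUMED_ENDPOINT = URI.create("http://localhost:8080/service/cluster");

  /** Wraps a string in quotes and escapes it for use as a JSON string literal. */
  static String quote(String s) {
    StringBuilder sb = new StringBuilder("\"");
    for (char c : s.toCharArray()) {
      switch (c) {
        case '"': sb.append("\\\""); break;
        case '\\': sb.append("\\\\"); break;
        case '\n': sb.append("\\n"); break;
        case '\r': sb.append("\\r"); break;
        case '\t': sb.append("\\t"); break;
        default:
          if (c < 0x20) {
            sb.append(String.format("\\u%04x", (int) c));
          } else {
            sb.append(c);
          }
      }
    }
    return sb.append('"').toString();
  }

  /** Builds the request body; the field names are ASSUMED, not taken from the article. */
  static String buildRequestJson(
      String language, String algorithm, List<Map<String, String>> documents) {
    String docs = documents.stream()
        .map(doc -> doc.entrySet().stream()
            .map(e -> quote(e.getKey()) + ":" + quote(e.getValue()))
            .collect(Collectors.joining(",", "{", "}")))
        .collect(Collectors.joining(",", "[", "]"));
    return "{" + quote("language") + ":" + quote(language) + ","
        + quote("algorithm") + ":" + quote(algorithm) + ","
        + quote("documents") + ":" + docs + "}";
  }

  /** Creates (but does not send) an HTTP POST request carrying the JSON body. */
  static HttpRequest buildRequest(URI endpoint, String json) {
    return HttpRequest.newBuilder(endpoint)
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(json))
        .build();
  }

  public static void main(String[] args) {
    List<Map<String, String>> documents = List.of(
        Map.of("title", "Data mining overview"),
        Map.of("title", "Text mining and clustering"));
    String json = buildRequestJson("English", "Lingo", documents);
    HttpRequest request = buildRequest(ASSUMED_ENDPOINT, json);
    System.out.println(request.method() + " " + request.uri());
    System.out.println(json);
  }
}
```

The request is only constructed and printed here, so the sketch runs without a server. Sending it would typically go through `java.net.http.HttpClient`, and any client able to issue an HTTP POST with a JSON body could play the same role from another language.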

Clustering algorithms

Carrot² offers a few document clustering algorithms that place emphasis on the quality of cluster labels, among them Lingo,[5] STC[6] and bisecting k-means.
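
The idea of putting labels first can be illustrated with a deliberately simplified sketch. The code below is not Lingo, STC or any other algorithm shipped with Carrot²; it is a toy stand-in that picks terms shared by several documents as candidate labels and then attaches to each label the documents that contain it. The class name, the tiny stop-word list and the `minDocs` threshold are all hypothetical choices made for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

final class LabelFirstClusteringSketch {
  // Tiny, hypothetical stop-word list; real systems use per-language resources.
  private static final Set<String> STOP_WORDS =
      Set.of("a", "an", "and", "for", "in", "of", "the", "to", "with");

  /** Returns the distinct lower-cased terms of a text, minus stop words. */
  static Set<String> terms(String text) {
    Set<String> result = new TreeSet<>();
    for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
      if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
        result.add(token);
      }
    }
    return result;
  }

  /**
   * Maps each candidate label (a term found in at least minDocs documents)
   * to the indices of the documents containing it. Clusters may overlap.
   */
  static Map<String, List<Integer>> cluster(List<String> documents, int minDocs) {
    // Step 1: collect, for every term, the documents it occurs in.
    Map<String, List<Integer>> byTerm = new TreeMap<>();
    for (int i = 0; i < documents.size(); i++) {
      for (String term : terms(documents.get(i))) {
        byTerm.computeIfAbsent(term, k -> new ArrayList<>()).add(i);
      }
    }
    // Step 2: keep only terms shared by enough documents; these become labels.
    List<Map.Entry<String, List<Integer>>> candidates = new ArrayList<>();
    for (Map.Entry<String, List<Integer>> e : byTerm.entrySet()) {
      if (e.getValue().size() >= minDocs) {
        candidates.add(Map.entry(e.getKey(), e.getValue()));
      }
    }
    // Step 3: larger clusters first; the stable sort keeps ties alphabetical.
    candidates.sort((x, y) -> Integer.compare(y.getValue().size(), x.getValue().size()));
    Map<String, List<Integer>> clusters = new LinkedHashMap<>();
    for (Map.Entry<String, List<Integer>> e : candidates) {
      clusters.put(e.getKey(), e.getValue());
    }
    return clusters;
  }

  public static void main(String[] args) {
    List<String> documents = List.of(
        "Apache Solr search server",
        "Clustering search results with Solr",
        "Jaguar cars and dealers",
        "Jaguar habitat in the Amazon",
        "Search results clustering engine");
    cluster(documents, 2).forEach(
        (label, members) -> System.out.println(label + " -> " + members));
  }
}
```

In this toy version every cluster is readable by construction, because the label is chosen before any document is assigned to it. Production algorithms add much more on top, such as phrase extraction, label scoring and merging of overlapping groups.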

Spin-offs

Carrot Search

Carrot Search,[7] a commercial spin-off of the Carrot² project, continues the development of Carrot², offers a real-time text clustering algorithm[8] compliant with the Carrot² framework, and provides text mining consulting services based on open source and proprietary software.

Carrot Search Labs

Carrot² gave rise to a number of independent open source projects released under the umbrella of Carrot Search Labs.[9] The following projects are or were published as part of this initiative:

Discontinued projects:

Notes and References

  1. Carrot2 - Open Source Search Results Clustering Engine. Carrot2 Project, Stanislaw Osinski, Dawid Weiss.
  2. Carrot2 search results clustering demo. https://search.carrot2.org
  3. Dawid Weiss: A Clustering Interface for Web Search Results in Polish and English. MSc thesis, Poznan University of Technology, Poznań, Poland, 2001.
  4. Carrot2.
  5. Stanisław Osiński, Dawid Weiss: A Concept-Driven Algorithm for Clustering Search Results. IEEE Intelligent Systems, vol. 20, no. 3, May/June 2005, pp. 48–54.
  6. Oren Zamir, Oren Etzioni: Web Document Clustering: A Feasibility Demonstration. Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, 1998, pp. 46–54.
  7. Carrot Search: document clustering and visualization software. Carrot Search s.c.
  8. Carrot Search: Lingo3G: Text Document Clustering Engine. Carrot Search s.c.
  9. Carrot Search Labs. Carrot Search s.c.