List of text mining methods explained
Text mining is the process of extracting information from unstructured text and finding patterns or relations in it. Different text mining methods are chosen according to how well they suit a given data set. Below is a list of text mining methods.
- Centroid-based Clustering: Unsupervised learning method. Each cluster is represented by a central point (centroid), and data points are assigned to the cluster whose centroid is nearest.[1]
- Fast Global KMeans: A variant designed to accelerate Global KMeans.[2]
- Global KMeans: An incremental algorithm that begins with one cluster and adds one cluster at a time until the required number of clusters is reached.
- KMeans: An algorithm that requires two inputs: K (the number of clusters) and a set of data.
- FW-KMeans: Used with the vector space model. Assigns weights to features to reduce the effect of noise.
- Two-Level-KMeans: The regular KMeans algorithm runs first. Clusters that do not reach a threshold are then selected for subdivision into subclasses.
- Cluster Algorithm
- Agglomerative Clustering: Bottom-up approach. Small clusters are merged into larger clusters.[3]
- Divisive Clustering: Top-down approach. Large clusters are split into smaller clusters.
- Density-based Clustering: Clusters are formed from regions where data points are densely packed, separated by sparser regions.[4]
- Distribution-based Clustering: Data points are grouped according to how likely they are to belong to the same probability distribution, such as a Gaussian.
- Stemming Algorithm
- Truncating Methods: Remove the suffix or prefix of a word.
- Lovins Stemmer: Removes the longest suffix from a word.
- Porter Stemmer: Strips suffixes by applying condition-based rewrite rules in five successive steps.
- Statistical Methods: Rely on a statistical procedure and typically result in affixes being removed.
- N-Gram Stemmer: An n-gram is a sequence of n consecutive characters taken from a word. Words that share many n-grams are treated as related.
- Hidden Markov Model (HMM) Stemmer: Transitions between states are governed by probability functions.
- Yet Another Suffix Stripper (YASS) Stemmer: Hierarchical approach to creating clusters. Each cluster is treated as a class of words, and the centroid of the class is its stem.
- Inflectional & Derivational Methods
- Krovetz Stemmer: Changes words to word stems that are valid English words.
- Xerox Stemmer: Uses a lexical database to reduce inflected and derived forms to their base form, and also removes prefixes.[5]
- Term Frequency
- Topic Modeling
- Wordscores: First estimates a score for each word type from reference texts. It then applies these word scores to a non-reference text to obtain a document score. Lastly, the scores of the non-reference documents are rescaled so that they can be compared with the reference texts.[6]
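The KMeans entry in the list above names two inputs, K and a set of data. The following is a minimal Python sketch of the usual assign-and-update iteration, not an implementation taken from the cited sources. It assumes squared Euclidean distance, initial centroids drawn at random from the data, and a hypothetical function name `kmeans` with illustrative `iterations` and `seed` parameters.

```python
import random


def kmeans(points, k, iterations=100, seed=0):
    """Cluster equal-length numeric tuples into k groups (illustrative sketch)."""
    rng = random.Random(seed)
    # Assumption: start from k input points chosen at random.
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_assignments = [
            min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            for p in points
        ]
        if new_assignments == assignments:
            break  # no point changed cluster, so the result is stable
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments


# Example: two well-separated groups of 2-D points.
data = [(0, 0), (0, 1), (1, 0), (100, 100), (100, 101), (101, 100)]
centroids, labels = kmeans(data, k=2)
```

In a text mining setting the tuples would typically be document vectors, for example term weights in a vector space model, rather than 2-D coordinates.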
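The first two steps of the Wordscores entry above can be sketched in Python as follows. This is a simplified illustration under stated assumptions: each reference text has a known position on some scale, a word's score is the average of those positions weighted by how strongly the word is associated with each reference text, and a document's score is the mean score of its scored words. The function names (`relative_freqs`, `word_scores`, `text_score`) and the toy data are hypothetical, and the final rescaling step is left out.

```python
from collections import Counter


def relative_freqs(tokens):
    """Relative frequency of each word within one text."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def word_scores(reference_texts, reference_positions):
    """Step 1: estimate a score per word type from the reference texts.

    reference_texts: dict mapping a text name to its list of tokens.
    reference_positions: dict mapping the same names to known positions.
    """
    freqs = {name: relative_freqs(toks) for name, toks in reference_texts.items()}
    vocab = set().union(*(f.keys() for f in freqs.values()))
    scores = {}
    for word in vocab:
        total = sum(f.get(word, 0.0) for f in freqs.values())
        # Weight each reference position by the word's share in that text.
        scores[word] = sum(
            f.get(word, 0.0) / total * reference_positions[name]
            for name, f in freqs.items()
        )
    return scores


def text_score(tokens, scores):
    """Step 2: score a non-reference text as the mean score of its scored words."""
    scored = [t for t in tokens if t in scores]  # unseen words are ignored
    if not scored:
        return None
    return sum(scores[t] for t in scored) / len(scored)


# Toy example with two reference texts placed at -1 and +1.
refs = {"a": ["welfare", "welfare", "tax"], "b": ["market", "market", "tax"]}
positions = {"a": -1.0, "b": 1.0}
scores = word_scores(refs, positions)
doc_score = text_score(["welfare", "tax", "unknown"], scores)
```

Here a word that appears equally often in both reference texts ("tax") receives a score midway between the two positions, while words unique to one reference text take that text's position.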
Notes and References
- "Different Types of Clustering Algorithm". GeeksforGeeks. 2018-01-15. Retrieved 2024-04-04.
- Jalil, Abdennour Mohamed; Hafidi, Imad; Alami, Lamiae; Khouribga, Ensa (2016). "Comparative Study of Clustering Algorithms in Text Mining Context". International Journal of Interactive Multimedia and Artificial Intelligence. 3 (7): 42. doi:10.9781/ijimai.2016.376. ISSN 1989-1660.
- "Agglomerative Methods in Machine Learning". GeeksforGeeks. 2021-02-01. Retrieved 2024-04-04.
- Hahsler, Michael; et al. "dbscan: Fast Density-based Clustering with R". cran.r-project.org. Retrieved 4 March 2024.
- Ganesh Jivani, Anjali. "A Comparative Study of Stemming Algorithms".
- Lowe, Will (2008). "Understanding Wordscores". Methods and Data Institute, School of Politics and International Relations, University of Nottingham, Nottingham. doi:10.2139/ssrn.1095280. ISSN 1556-5068.