CRM114 (program) explained

CRM114 (full name: "The CRM114 Discriminator") is a program that classifies data using a statistical approach, used especially for filtering email spam.

Origin of the name

The name comes from the CRM-114 Discriminator, a piece of radio equipment in the Stanley Kubrick movie Dr. Strangelove that is designed to filter out messages lacking a specific code prefix.

Operation

Whereas other statistical Bayesian spam filters rely on the frequency of single words in email, CRM114 achieves a higher rate of spam recognition by generating hits from phrases of up to five words. These phrases are used to form a Markov Random Field representing the incoming texts. With this additional contextual recognition, it is one of the more accurate spam filters available. Initial testing in 2002 by author Bill Yerazunis[1] gave an accuracy of 99.87%;[2] Holden[3] and TREC 2005 and 2006[4] [5] reported results of better than 99%, with significant variation depending on the particular corpus.
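As an illustration of phrase-based features, the following Python sketch generates, for each word, every phrase of up to five words that ends at that word, optionally replacing earlier words with a placeholder. It is a minimal sketch under stated assumptions, not CRM114's actual code: the function name phrase_features, the whitespace tokenizer, and the "<skip>" placeholder are all hypothetical choices made for this example.

```python
from itertools import product


def phrase_features(text, window=5, skip="<skip>"):
    """Return phrase features of up to `window` words ending at each token.

    Illustrative only: the tokenizer (whitespace split) and the `skip`
    placeholder are assumptions, not CRM114's real implementation.
    """
    tokens = text.split()
    features = []
    for end, current in enumerate(tokens):
        start = max(0, end - window + 1)
        context = tokens[start:end]  # up to window-1 preceding words
        # Each preceding word is either kept (True) or skipped (False).
        for mask in product((False, True), repeat=len(context)):
            phrase = [w if keep else skip for w, keep in zip(context, mask)]
            # Leading placeholders carry no information, so drop them.
            while phrase and phrase[0] == skip:
                phrase.pop(0)
            features.append(" ".join(phrase + [current]))
    return features


if __name__ == "__main__":
    print(phrase_features("a b c"))
```

For the input "a b c" this prints ['a', 'b', 'a b', 'c', 'b c', 'a <skip> c', 'a b c']. With a full five-word window, each new word contributes 16 features, which is how a phrase-based filter sees far more context than a single-word filter.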

CRM114's classifier can also be switched to use Littlestone's Winnow algorithm, character-by-character correlation, a variant of KNN (K-nearest neighbor algorithm) classification called Hyperspace, a bit-entropic classifier that uses entropy encoding to determine similarity, an SVM, mutual compressibility as calculated by a modified LZ77 algorithm, or other more experimental classifiers. The actual features matched are based on a generalization of skip-grams.
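The idea behind classification by mutual compressibility can be sketched in a few lines of Python: a message is assigned to the class whose sample text helps compress it the most. This sketch substitutes the standard zlib module (DEFLATE, which combines LZ77 with Huffman coding) for CRM114's modified LZ77 algorithm, and the function names and the tiny sample corpora are hypothetical.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of `data` after zlib compression at the highest level."""
    return len(zlib.compress(data, 9))


def extra_bytes(corpus: str, message: str) -> int:
    """Extra compressed bytes needed when `message` is appended to `corpus`."""
    base = corpus.encode("utf-8")
    combined = base + b"\n" + message.encode("utf-8")
    return compressed_size(combined) - compressed_size(base)


def classify(message: str, corpora: dict) -> str:
    """Pick the label whose corpus makes the message cheapest to compress."""
    return min(corpora, key=lambda label: extra_bytes(corpora[label], message))


# Hypothetical sample texts, far smaller than any real training set.
CORPORA = {
    "spam": "Buy cheap pills now and win a free prize today. "
            "Limited offer, click here to claim your free prize now.",
    "ham": "The project meeting is moved to Thursday afternoon. "
           "Please review the attached agenda before the call.",
}

if __name__ == "__main__":
    print(classify("click here to claim your free prize now", CORPORA))
```

A message that repeats material from one corpus can be encoded mostly as back-references into that corpus, so it adds few bytes there and many bytes elsewhere. Real use would need much larger corpora for the size differences to be meaningful.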

The CRM114 algorithms are multi-lingual (compatible with UTF-8 encodings) and null-safe. A voting set of CRM114 classifiers has been demonstrated to distinguish confidential from non-confidential documents written in Japanese with a detection rate better than 99.9% and a false alarm rate of 5.3%.[6]

CRM114 is a good example of pattern recognition software, demonstrating how machine learning can be accomplished with a reasonably simple algorithm. The program's C source code is available under the GPL.

At a deeper level, CRM114 is also a string pattern-matching language, similar to grep or even Perl. Although it is Turing complete, it is highly tuned for matching text, and even a simple (recursive) definition of the factorial takes almost ten lines. Part of the reason is that the CRM114 language syntax is not positional but declensional. As a programming language, it may be used for many applications other than detecting spam. CRM114 uses the TRE approximate-match regex engine, so it is possible to write programs that do not depend on strings matching exactly in order to function correctly.
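To show what approximate matching means in practice, the Python sketch below reports whether some substring of a text lies within a given number of single-character edits (insertions, deletions, or substitutions) of a literal pattern. It is only an illustration of the concept using a standard dynamic-programming approach; it is not TRE's API or algorithm, it handles literal strings rather than regular expressions, and the name approx_find is hypothetical.

```python
def approx_find(pattern: str, text: str, max_errors: int) -> bool:
    """Return True if a substring of `text` is within `max_errors` edits
    of `pattern` (insertions, deletions, substitutions).

    Illustrative literal-string matcher; not the TRE engine's interface.
    """
    # prev[j] = fewest edits to match pattern[:j] against some substring
    # of the text ending at the previous position.
    prev = list(range(len(pattern) + 1))
    if prev[-1] <= max_errors:
        return True  # the empty substring is already close enough
    for ch in text:
        cur = [0]  # a match may begin at any text position at no cost
        for j, pch in enumerate(pattern, start=1):
            cur.append(min(
                prev[j] + 1,               # extra character in the text
                cur[j - 1] + 1,            # pattern character missing
                prev[j - 1] + (pch != ch)  # exact match or substitution
            ))
        if cur[-1] <= max_errors:
            return True
        prev = cur
    return False


if __name__ == "__main__":
    print(approx_find("viagra", "cheap v1agra here", 1))  # one substitution
    print(approx_find("viagra", "cheap v1agra here", 0))  # exact match only
```

The first call prints True and the second prints False: allowing a single error lets the pattern match an obfuscated spelling that an exact match would miss.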

CRM114 has been applied to email filtering in the KMail client[7] [8] and to a number of other applications, including detection of bots on Twitter and Yahoo,[9] [10] and it serves as the first-level filter in the US Department of Transportation's vehicle defect detection system.[11] It has also been used as a predictive method for classifying fault-prone software modules.[12]

Notes and References

  1. Garretson, Cara (2007-03-19). "The antispam man". Network World.
  2. "CRM114 gets 99.87%". Paul Graham's website. 2002-10-16.
  3. "Spam Filtering II". https://web.archive.org/web/20050307062526/http://sam.holden.id.au/writings/spam2/
  4. "Spam Track Overview (2005)". http://trec.nist.gov/pubs/trec14/papers/SPAM.OVERVIEW.pdf
  5. "Spam Track Overview (2006)". http://trec.nist.gov/pubs/trec15/papers/SPAM06.OVERVIEW.pdf
  6. "Archived copy". media.blackhat.com. Archived 2011-07-08 at https://web.archive.org/web/20110708011918/https://media.blackhat.com/bh-us-10/whitepapers/Yerazunis/BlackHat-USA-2010-Yerazunis-Confidential-Mail-Filtering-wp.pdf
  7. "Removing spam mail with CRM114 and KMail". Archived 2019-10-01 at https://web.archive.org/web/20191001092857/http://www.nnc3.com/mags/LM10/Magazine/Archive/2007/77/074-077_kmail/article.html. Retrieved 2019-10-01.
  8. "kmail.antispamrc at KDE/kdepim-addons". GitHub. 12 June 2022.
  9. Chu, Zi; Gianvecchio, Steven; Wang, Haining; Jajodia, Sushil (November 2012). "Detecting Automation of Twitter Accounts: Are You a Human, Bot, or Cyborg?". IEEE Transactions on Dependable and Secure Computing. 9 (6): 811–824. doi:10.1109/TDSC.2012.75. S2CID 351844. ISSN 1545-5971.
  10. "Measurement and Classification of Humans and Bots in Internet Chat". Usenix. 2023-01-16.
  11. Scovel III, Calvin L. (2015-06-18). "Inadequate Data and Analysis Undermine NHTSA's Efforts To Identify and Investigate Vehicle Safety Concerns". Office of Inspector General - U.S. Department of Transportation.
  12. Mizuno, Osamu; Ikami, Shiro; Nakaichi, Shuya; Kikuno, Tohru (May 2007). "Spam Filter Based Approach for Finding Fault-Prone Software Modules". Fourth International Workshop on Mining Software Repositories (MSR'07:ICSE Workshops 2007). p. 4. doi:10.1109/MSR.2007.29. ISBN 978-0-7695-2950-9. S2CID 5867386. https://ieeexplore.ieee.org/document/4228641