Predictive Model Markup Language Explained
The Predictive Model Markup Language (PMML) is an XML-based predictive model interchange format conceived by Robert Lee Grossman, then the director of the National Center for Data Mining at the University of Illinois at Chicago. PMML provides a way for analytic applications to describe and exchange predictive models produced by data mining and machine learning algorithms. It supports common models such as logistic regression and feedforward neural networks. Version 0.9 was published in 1998.[1] Subsequent versions have been developed by the Data Mining Group.[2]
Since PMML is an XML-based standard, the specification comes in the form of an XML schema. PMML is a mature standard: more than 30 organizations have announced products that support it.[3]
PMML components
A PMML file can be described by the following components:[4] [5]
- Header: contains general information about the PMML document, such as copyright information for the model, its description, and information about the application used to generate the model, such as its name and version. It can also contain a timestamp (the Timestamp element) that specifies the date of model creation.
- Data Dictionary: contains definitions for all the possible fields used by the model. It is here that a field is defined as continuous, categorical, or ordinal (attribute optype). Depending on this definition, the appropriate value ranges are then defined, as well as the data type (such as string or double).
- Data Transformations: transformations map user data into a form better suited for use by the mining model. PMML defines several kinds of simple data transformations:
  - Normalization: map values to numbers; the input can be continuous or discrete.
  - Discretization: map continuous values to discrete values.
  - Value mapping: map discrete values to discrete values.
  - Functions (custom and built-in): derive a value by applying a function to one or more parameters.
  - Aggregation: used to summarize or collect groups of values.
- Model: contains the definition of the data mining model. For example, a multi-layered feedforward neural network is represented in PMML by a "NeuralNetwork" element, which contains attributes such as:
  - Model Name (attribute modelName)
  - Function Name (attribute functionName)
  - Algorithm Name (attribute algorithmName)
  - Activation Function (attribute activationFunction)
  - Number of Layers (attribute numberOfLayers)
  This information is then followed by three kinds of neural layers, which specify the architecture of the neural network model being represented in the PMML document. These layers are given by the elements NeuralInputs, NeuralLayer, and NeuralOutputs. Besides neural networks, PMML allows for the representation of many other types of models, including support vector machines, association rules, Naive Bayes classifiers, clustering models, text models, decision trees, and different regression models.
- Mining Schema: a list of all fields used in the model. This can be a subset of the fields defined in the data dictionary. It contains specific information about each field, such as:
  - Name (attribute name): must refer to a field in the data dictionary.
  - Usage type (attribute usageType): defines the way a field is to be used in the model. Typical values are: active, predicted, and supplementary. Predicted fields are those whose values are predicted by the model.
  - Outlier Treatment (attribute outliers): defines the outlier treatment to be used. In PMML, outliers can be treated as missing values, as extreme values (based on the definition of high and low values for a particular field), or as is.
  - Missing Value Replacement Policy (attribute missingValueReplacement): if this attribute is specified, a missing value is automatically replaced by the given value.
  - Missing Value Treatment (attribute missingValueTreatment): indicates how the missing value replacement was derived (e.g. as value, mean or median).
- Targets: allows for post-processing of the predicted value in the format of scaling if the output of the model is continuous. Targets can also be used for classification tasks. In this case, the attribute priorProbability specifies a default probability for the corresponding target category. It is used if the prediction logic itself did not produce a result. This can happen, e.g., if an input value is missing and there is no other method for treating missing values.
- Output: this element can be used to name all the desired output fields expected from the model. These are features of the predicted field and so are typically the predicted value itself, the probability, cluster affinity (for clustering models), standard error, etc. PMML 4.1 extended Output to allow for generic post-processing of model outputs: all the built-in and custom functions that were originally available only for pre-processing became available for post-processing too.
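To make the components above concrete, the following is a minimal sketch in Python of a hand-written PMML document and a toy scorer for it. The document itself, the names `PMML_DOC`, `list_components`, and `score`, and the field names `x` and `y` are illustrative assumptions, not taken from any real product. The sketch assumes the PMML 4.4 namespace and handles only a single `RegressionTable` with `NumericPredictor` terms; it is not a conforming PMML consumer.

```python
import xml.etree.ElementTree as ET

# Namespace used by PMML 4.4 documents.
NS = {"p": "http://www.dmg.org/PMML-4_4"}

# Hypothetical hand-written model: y = 1.5 + 2.0 * x
PMML_DOC = """<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
  <Header copyright="Example" description="Toy linear regression">
    <Application name="hand-written-example" version="0.1"/>
    <Timestamp>2020-01-01T00:00:00</Timestamp>
  </Header>
  <DataDictionary numberOfFields="2">
    <DataField name="x" optype="continuous" dataType="double"/>
    <DataField name="y" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel modelName="toy" functionName="regression">
    <MiningSchema>
      <MiningField name="x" usageType="active" missingValueReplacement="0"/>
      <MiningField name="y" usageType="predicted"/>
    </MiningSchema>
    <RegressionTable intercept="1.5">
      <NumericPredictor name="x" coefficient="2.0"/>
    </RegressionTable>
  </RegressionModel>
</PMML>"""


def list_components(pmml_text):
    """Return the local names of the top-level elements of a PMML document."""
    root = ET.fromstring(pmml_text)
    return [child.tag.split("}", 1)[1] for child in root]


def score(pmml_text, record):
    """Score one record against the toy regression model (sketch only)."""
    root = ET.fromstring(pmml_text)
    model = root.find("p:RegressionModel", NS)

    # Collect active inputs, applying missingValueReplacement when needed.
    inputs = {}
    for field in model.findall("p:MiningSchema/p:MiningField", NS):
        if field.get("usageType", "active") != "active":
            continue
        name = field.get("name")
        value = record.get(name)
        if value is None:
            value = field.get("missingValueReplacement")
        if value is None:
            raise ValueError(f"no value or replacement for field {name!r}")
        inputs[name] = float(value)

    # Linear predictor: intercept plus coefficient * input for each term.
    table = model.find("p:RegressionTable", NS)
    result = float(table.get("intercept"))
    for term in table.findall("p:NumericPredictor", NS):
        result += float(term.get("coefficient")) * inputs[term.get("name")]
    return result


print(list_components(PMML_DOC))    # ['Header', 'DataDictionary', 'RegressionModel']
print(score(PMML_DOC, {"x": 3.0}))  # 7.5
print(score(PMML_DOC, {}))          # 1.5 (x is replaced by 0)
```

The point of the sketch is that everything needed for scoring (field definitions, missing-value policy, and model parameters) travels inside the document, so the consuming application needs no knowledge of the tool that produced the model.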
PMML 4.0, 4.1, 4.2 and 4.3
PMML 4.0 was released on June 16, 2009.[6] [7] [8]
Examples of new features included:
PMML 4.1 was released on December 31, 2011.[9] [10]
New features included:
- New model elements for representing Scorecards, k-Nearest Neighbors (KNN) and Baseline Models.
- Simplification of multiple models. In PMML 4.1, the same element is used to represent model segmentation, ensemble, and chaining.
- Overall definition of field scope and field names.
- A new attribute that identifies for each model element if the model is ready or not for production deployment.
- Enhanced post-processing capabilities (via the Output element).
PMML 4.2 was released on February 28, 2014.[11] [12]
New features include:
- Transformations: New elements for implementing text mining
- New built-in functions for implementing regular expressions: matches, concat, and replace
- Simplified outputs for post-processing
- Enhancements to Scorecard and Naive Bayes model elements
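As an illustration of how built-in functions such as concat and replace are applied, here is a small Python sketch that evaluates a nested expression tree. The fragment `EXPR`, the field name `code`, and the function `evaluate` are hypothetical. Several simplifying assumptions apply: the XML namespace is omitted, only concat and replace are handled, replace is assumed to take its arguments in the order input, pattern, replacement, and the pattern is assumed to be simple enough that Python's `re` module interprets it the same way a PMML engine would.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical expression: replace digit runs in "code" with "#",
# then append a fixed suffix. Namespace omitted for brevity.
EXPR = """
<Apply function="concat">
  <Apply function="replace">
    <FieldRef field="code"/>
    <Constant>[0-9]+</Constant>
    <Constant>#</Constant>
  </Apply>
  <Constant>-checked</Constant>
</Apply>
"""


def evaluate(node, record):
    """Recursively evaluate a FieldRef / Constant / Apply tree (sketch only)."""
    if node.tag == "FieldRef":
        return record[node.get("field")]
    if node.tag == "Constant":
        return node.text
    if node.tag == "Apply":
        # Evaluate arguments first, then apply the named function.
        args = [evaluate(child, record) for child in node]
        function = node.get("function")
        if function == "concat":
            return "".join(str(arg) for arg in args)
        if function == "replace":
            text, pattern, replacement = args
            return re.sub(pattern, replacement, text)
        raise ValueError(f"unsupported function: {function}")
    raise ValueError(f"unsupported element: {node.tag}")


print(evaluate(ET.fromstring(EXPR), {"code": "AB12CD345"}))  # AB#CD#-checked
```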
PMML 4.3 was released on August 23, 2016.[13] [14]
New features include:
- New Model Types:
  - Gaussian Process
  - Bayesian Network
- New built-in functions
- Usage clarifications
- Documentation improvements
Version 4.4 was released in November 2019.[15] [16]
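Because several versions are in circulation, a consuming application usually needs to check which version a given file declares before interpreting it. A minimal Python sketch, assuming the document declares its version both through the `version` attribute of the root element and through its XML namespace (for example `http://www.dmg.org/PMML-4_4` for version 4.4); the function name `declared_version` is hypothetical:

```python
import xml.etree.ElementTree as ET


def declared_version(pmml_text):
    """Return (version attribute, namespace URI) of a PMML document root."""
    root = ET.fromstring(pmml_text)
    # ElementTree stores namespaced tags as "{namespace}localname".
    namespace = root.tag[1:].split("}")[0] if root.tag.startswith("{") else None
    return root.get("version"), namespace


doc = '<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4"><Header/></PMML>'
print(declared_version(doc))  # ('4.4', 'http://www.dmg.org/PMML-4_4')
```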
Release history
Version | Release date
---|---
Version 0.7 | July 1997
Version 0.9 | July 1998
Version 1.0 | August 1999
Version 1.1 | August 2000
Version 2.0 | August 2001
Version 2.1 | March 2003
Version 3.0 | October 2004
Version 3.1 | December 2005
Version 3.2 | May 2007
Version 4.0 | June 2009
Version 4.1 | December 2011
Version 4.2 | February 2014
Version 4.2.1 | March 2015
Version 4.3 | August 2016
Version 4.4 | November 2019
Data Mining Group
The Data Mining Group is a consortium managed by the Center for Computational Science Research, Inc., a nonprofit founded in 2008.[17] The Data Mining Group also developed a standard called Portable Format for Analytics, or PFA, which is complementary to PMML.
Notes and References
- The management and mining of multiple predictive models using the predictive modeling markup language. ResearchGate. 2015-12-21. doi:10.1016/S0950-5849(99)00022-1.
- Data Mining Group. December 14, 2017. "The DMG is proud to host the working groups that develop the Predictive Model Markup Language (PMML) and the Portable Format for Analytics (PFA), two complementary standards that simplify the deployment of analytic models."
- PMML Powered. Data Mining Group. December 14, 2017.
- A. Guazzelli, M. Zeller, W. Chen, and G. Williams. PMML: An Open Standard for Sharing Models. The R Journal, Volume 1/1, May 2009.
- A. Guazzelli, W. Lin, T. Jena (2010). PMML in Action (2nd Edition): Unleashing the Power of Open Standards for Data Mining and Predictive Analytics. CreateSpace.
- Data Mining Group website. PMML 4.0 - Changes from PMML 3.2. http://www.dmg.org/v4-0/Changes.html
- Zementis website. PMML 4.0 is here! 2009-06-17. Archived 2011-10-03 at https://web.archive.org/web/20111003223232/http://adapasupport.zementis.com/2009/06/pmml-40-is-here.html
- R. Pechter. What's PMML and What's New in PMML 4.0? The ACM SIGKDD Explorations Newsletter, Volume 11/1, July 2009.
- Data Mining Group website. PMML 4.1 - Changes from PMML 4.0. http://www.dmg.org/v4-1/Changes.html
- Predictive Analytics Info website. PMML 4.1 is here! http://www.predictive-analytics.info/2012/01/pmml-41-is-here-mature-standard-for.html
- Data Mining Group website. PMML 4.2 - Changes from PMML 4.1. http://www.dmg.org/v4-2/Changes.html
- Predictive Analytics Info website. PMML 4.2 is here! http://www.predictive-analytics.info/2014/02/pmml-42-is-here-what-changed-what-is-new.html
- Data Mining Group website. PMML 4.3 - Changes from PMML 4.2.1. http://dmg.org/pmml/v4-3/Changes.html
- Predictive Model Markup Language product website. Project activity. https://sourceforge.net/projects/pmml/
- The Data Mining Group releases Predictive Model Markup Language v4.4. 12 July 2021.
- PMML 4.4.1 - General Structure. Data Mining Group. 12 July 2021.
- 2008 EO 990. 16 Oct 2014.