Multi-label classification explained

Multi-label classification should not be confused with multiclass classification.

In machine learning, multi-label classification or multi-output classification is a variant of the classification problem where multiple nonexclusive labels may be assigned to each instance. Multi-label classification is a generalization of multiclass classification, which is the single-label problem of categorizing instances into precisely one of several (greater than or equal to two) classes. In the multi-label problem the labels are nonexclusive and there is no constraint on how many of the classes the instance can be assigned to.

Formally, multi-label classification is the problem of finding a model that maps inputs x to binary vectors y; that is, it assigns a value of 0 or 1 for each element (label) in y.

Problem transformation methods

Several problem transformation methods exist for multi-label classification, and can be roughly broken down into:

Transformation into binary classification problems

The baseline approach, called the binary relevance method,[1] amounts to independently training one binary classifier for each label. Given an unseen sample, the combined model then predicts all labels for this sample for which the respective classifiers predict a positive result. Although this method of dividing the task into multiple binary tasks may resemble superficially the one-vs.-all (OvA) and one-vs.-rest (OvR) methods for multiclass classification, it is essentially different from both, because a single classifier under binary relevance deals with a single label, without any regard to other labels whatsoever. A classifier chain is an alternative method for transforming a multi-label classification problem into several binary classification problems. It differs from binary relevance in that labels are predicted sequentially, and the output of all previous classifiers (i.e. positive or negative for a particular label) are input as features to subsequent classifiers. Classifier chains have been applied, for instance, in HIV drug resistance prediction.[2] [3] Bayesian network has also been applied to optimally order classifiers in Classifier chains.[4]

In case of transforming the problem to multiple binary classifications, the likelihood function reads

n
L=\prod
i=1

(\prodk

(\prod
jk
(p
k,jk
\delta
yi,k,jk
(x
i)

)))

where index

i

runs over the samples, index

k

runs over the labels,

jk

indicates the binary outcomes 0 or 1,

\deltaa,b

indicates the Kronecker delta,

yi,k\in{0,1}

indicates the multiple hot encoded labels of sample

i

.

Transformation into multi-class classification problem

The label powerset (LP) transformation creates one binary classifier for every label combination present in the training set. For example, if possible labels for an example were A, B, and C, the label powerset representation of this problem is a multi-class classification problem with the classes [0 0 0], [1 0 0], [0 1 0], [0 0 1], [1 1 0], [1 0 1], [0 1 1], and [1 1 1] where for example [1 0 1] denotes an example where labels A and C are present and label B is absent.[5]

Ensemble methods

A set of multi-class classifiers can be used to create a multi-label ensemble classifier. For a given example, each classifier outputs a single class (corresponding to a single label in the multi-label problem). These predictions are then combined by an ensemble method, usually a voting scheme where every class that receives a requisite percentage of votes from individual classifiers (often referred to as the discrimination threshold[6]) is predicted as a present label in the multi-label output. However, more complex ensemble methods exist, such as committee machines. Another variation is the random -labelsets (RAKEL) algorithm, which uses multiple LP classifiers, each trained on a random subset of the actual labels; label prediction is then carried out by a voting scheme.[7] A set of multi-label classifiers can be used in a similar way to create a multi-label ensemble classifier. In this case, each classifier votes once for each label it predicts rather than for a single label.

Adapted algorithms

Some classification algorithms/models have been adapted to the multi-label task, without requiring problem transformations. Examples of these including for multi-label data are

Learning paradigms

Based on learning paradigms, the existing multi-label classification techniques can be classified into batch learning and online machine learning. Batch learning algorithms require all the data samples to be available beforehand. It trains the model using the entire training data and then predicts the test sample using the found relationship. The online learning algorithms, on the other hand, incrementally build their models in sequential iterations. In iteration t, an online algorithm receives a sample, xt and predicts its label(s) ŷt using the current model; the algorithm then receives yt, the true label(s) of xt and updates its model based on the sample-label pair: (xt, yt).

Multi-label stream classification

Data streams are possibly infinite sequences of data that continuously and rapidly grow over time.[14] Multi-label stream classification (MLSC) is the version of multi-label classification task that takes place in data streams. It is sometimes also called online multi-label classification. The difficulties of multi-label classification (exponential number of possible label sets, capturing dependencies between labels) are combined with difficulties of data streams (time and memory constraints, addressing infinite stream with finite means, concept drifts).

Many MLSC methods resort to ensemble methods in order to increase their predictive performance and deal with concept drifts. Below are the most widely used ensemble methods in the literature:

Statistics and evaluation metrics

Considering

Yi

to be a set of labels for

ith

data sample (do not confuse it with a one-hot vector; it is simply a collection of all of the labels that belong to this sample), the extent to which a dataset is multi-label can be captured in two statistics:
1
N
N
\sum
i=1

|Yi|

where

N

is the total number of data samples;
1
N

\sum

N
i=1
|Yi|
|L|
where

L=

N
cup
i=1

Yi

, the total number of available classes (which is the maximum number of elements that can make up

Yi

).

Evaluation metrics for multi-label classification performance are inherently different from those used in multi-class (or binary) classification, due to the inherent differences of the classification problem. If denotes the true set of labels for a given sample, and the predicted set of labels, then the following metrics can be defined on that sample:

1
|N||L|
|N|
\sum
i=1
|L|
\sum
j=1

\operatorname{xor}(yi,j,zi,j)

, where

yi,j

is the target,

zi,j

is the prediction, and

\operatorname{xor}()

is the "Exclusive, or" operator that returns zero when the target and prediction are identical and one otherwise. This is a loss function, so the optimal value is zero and its upper bound is one.
|T\capP|
|T\cupP|
, where

P

and

T

are sets of predicted labels and true labels respectively.
|T\capP|
|P|
, recall is
|T\capP|
|T|
, and

F1

is their harmonic mean.[22]

Cross-validation in multi-label settings is complicated by the fact that the ordinary (binary/multiclass) way of stratified sampling will not work; alternative ways of approximate stratified sampling have been suggested.[23]

Implementations and datasets

Java implementations of multi-label algorithms are available in the Mulan and Meka software packages, both based on Weka.

The scikit-learn Python package implements some multi-labels algorithms and metrics.

The scikit-multilearn Python package specifically caters to the multi-label classification. It provides multi-label implementation of several well-known techniques including SVM, kNN and many more. The package is built on top of scikit-learn ecosystem.

The binary relevance method, classifier chains and other multilabel algorithms with a lot of different base learners are implemented in the R-package mlr[24]

A list of commonly used multi-label data-sets is available at the Mulan website.

See also

Further reading

Notes and References

  1. Jesse Read, Bernhard Pfahringer, Geoff Holmes, Eibe Frank. Classifier Chains for Multi-label Classification. Machine Learning Journal. Springer. Vol. 85(3), (2011).
  2. Heider. D. Senge. R. Cheng. W. Hüllermeier. E. 2013. Multilabel classification for exploiting cross-resistance information in HIV-1 drug resistance prediction. Bioinformatics. 29. 16. 1946–52. 10.1093/bioinformatics/btt331. 23793752. free.
  3. Riemenschneider. M. Senge. R. Neumann. U. Hüllermeier. E. Heider. D. 2016. Exploiting HIV-1 protease and reverse transcriptase cross-resistance information for improved drug resistance prediction by means of multi-label classification. BioData Mining. 9. 10. 10.1186/s13040-016-0089-1. 4772363. 26933450. free.
  4. Soufan. Othman. Ba-Alawi. Wail. Afeef. Moataz. Essack. Magbubah. Kalnis. Panos. Bajic. Vladimir B.. 2016-11-10. DRABAL: novel method to mine large high-throughput screening assays using Bayesian active learning. Journal of Cheminformatics. 8. 64. 10.1186/s13321-016-0177-8. 27895719. 5105261. 1758-2946 . free .
  5. Spolaôr. Newton. Cherman. Everton Alvares. Monard. Maria Carolina. Lee. Huei Diana. March 2013. A Comparison of Multi-label Feature Selection Methods using the Problem Transformation Approach. Electronic Notes in Theoretical Computer Science. 292. 135–151. 10.1016/j.entcs.2013.02.010. 1571-0661. free.
  6. Web site: Discrimination Threshold — yellowbrick 0.9 documentation. www.scikit-yb.org. 2018-11-29.
  7. Tsoumakas . Grigorios . Ioannis . Vlahavas . Random -labelsets: An ensemble method for multilabel classification . ECML . 2007 . 2014-07-26 . https://web.archive.org/web/20140729055455/http://talos.csd.auth.gr/tsoumakas/publications/C19.pdf . 2014-07-29 . dead .
  8. M.L. . Zhang . Z.H. . Zhou . ML-KNN: A lazy learning approach to multi-label learning . Pattern Recognition . 40 . 7 . 2007 . 10.1016/j.patcog.2006.12.019 . 2038–2048. 2007PatRe..40.2038Z . 10.1.1.538.9597 . 14886376 .
  9. Gjorgji . Madjarov . Dragi . Kocev . Dejan . Gjorgjevikj . Sašo . Džeroski . An extensive experimental comparison of methods for multi-label learning . Pattern Recognition . 45 . 9 . 2012 . 10.1016/j.patcog.2012.03.004 . 3084–3104. 2012PatRe..45.3084M . 14064264 .
  10. Hsu. Chang-Ling. Chou. Shih-chieh. 2003. Constructing a multi-valued and multi-labeled decision tree. Expert Systems with Applications. 25. 2. 199–209. 10.1016/S0957-4174(03)00047-2. Yen-Liang. Chen.
  11. Chou. Shihchieh. Hsu. Chang-Ling. 2005-05-01. MMDT: a multi-valued and multi-labeled decision tree classifier for data mining. Expert Systems with Applications. 28. 4. 799–812. 10.1016/j.eswa.2004.12.035.
  12. Li. Hong. Guo. Yue-jian. Wu. Min. Li. Ping. Xiang. Yao. 2010-12-01. Combine multi-valued attribute decomposition with multi-label learning. Expert Systems with Applications. 37. 12. 8721–8728. 10.1016/j.eswa.2010.06.044.
  13. M.L. . Zhang . Z.H. . Zhou . Multi-label neural networks with applications to functional genomics and text categorization . IEEE Transactions on Knowledge and Data Engineering . 18 . 2006 . 1338–1351 .
  14. Book: 2007. Aggarwal. Charu C.. Data Streams. 31. 10.1007/978-0-387-47534-9. 978-0-387-28759-1 . Advances in Database Systems .
  15. Oza. Nikunj. 2005. Online Bagging and Boosting. IEEE International Conference on Systems, Man and Cybernetics. 2060/20050239012. free.
  16. Book: Read. Jesse. Pfahringer. Bernhard. Holmes. Geoff. 2008 Eighth IEEE International Conference on Data Mining . Multi-label Classification Using Ensembles of Pruned Sets . 2008-12-15. IEEE Computer Society. 995–1000. 10.1109/ICDM.2008.74. 9780769535029. 10289/8077. 16059274.
  17. Osojnik. Aljaź. Panov. PanăźE. DźEroski. Sašo. 2017-06-01. Multi-label classification via multi-target regression on data streams. Machine Learning. 106. 6. 745–770. 10.1007/s10994-016-5613-5. 0885-6125. free.
  18. Sousa. Ricardo. Gama. João. 2018-01-24. Multi-label classification from high-speed data streams with adaptive model rules and random rules. Progress in Artificial Intelligence. 7. 3. 177–187. 10.1007/s13748-018-0142-z. 32376722. 2192-6352.
  19. Read. Jesse. Bifet. Albert. Holmes. Geoff. Pfahringer. Bernhard. 2012-02-21. Scalable and efficient multi-label classification for evolving data streams. Machine Learning. 88. 1–2. 243–272. 10.1007/s10994-012-5279-6. 0885-6125. free.
  20. Book: Büyükçakir. Alican. Bonab. Hamed. Can. Fazli. Proceedings of the 27th ACM International Conference on Information and Knowledge Management . A Novel Online Stacked Ensemble for Multi-Label Stream Classification . 2018-10-17. ACM. 1063–1072. 10.1145/3269206.3271774. 9781450360142. 1809.09994. 52843253.
  21. Book: Xioufis. Eleftherios Spyromitros. Spiliopoulou. Myra. Tsoumakas. Grigorios. Vlahavas. Ioannis. 2011-07-16. Dealing with concept drift and class imbalance in multi-label stream classification. AAAI Press. 1583–1588. 10.5591/978-1-57735-516-8/IJCAI11-266. 9781577355144.
  22. Godbole . Shantanu . Sunita . Sarawagi. Sunita Sarawagi . Discriminative methods for multi-labeled classification . Advances in Knowledge Discovery and Data Mining . 2004 . 22–30 .
  23. Sechidis . Konstantinos . Grigorios . Tsoumakas . Ioannis . Vlahavas . On the stratification of multi-label data . . 2011 . 145–158 .
  24. Philipp Probst, Quay Au, Giuseppe Casalicchio, Clemens Stachl, Bernd Bischl. Multilabel Classification with R Package mlr. The R Journal (2017) 9:1, pages 352-369.