Multi-task learning (MTL) is a subfield of machine learning in which multiple learning tasks are solved at the same time, while exploiting commonalities and differences across tasks. This can result in improved learning efficiency and prediction accuracy for the task-specific models, when compared to training the models separately.[1] [2] [3] Inherently, multi-task learning is a multi-objective optimization problem with trade-offs between different tasks.[4] Early versions of MTL were called "hints".[5] [6]
In a widely cited 1997 paper, Rich Caruana gave the following characterization:
Multitask Learning is an approach to inductive transfer that improves generalization by using the domain information contained in the training signals of related tasks as an inductive bias. It does this by learning tasks in parallel while using a shared representation; what is learned for each task can help other tasks be learned better.[3]
In the classification context, MTL aims to improve the performance of multiple classification tasks by learning them jointly. One example is a spam filter, which can be treated as distinct but related classification tasks across different users. To make this more concrete, consider that different people have different distributions of features which distinguish spam emails from legitimate ones: an English speaker may find that all emails in Russian are spam, whereas a Russian speaker would not. Yet there is a definite commonality in this classification task across users; for example, one common feature might be text related to money transfer. Solving each user's spam classification problem jointly via MTL can let the solutions inform each other and improve performance. Further examples of settings for MTL include multiclass classification and multi-label classification.[7]
Multi-task learning works because regularization induced by requiring an algorithm to perform well on a related task can be superior to regularization that prevents overfitting by penalizing all complexity uniformly. One situation where MTL may be particularly helpful is when the tasks share significant commonalities and are generally slightly undersampled. However, as discussed below, MTL has also been shown to be beneficial for learning unrelated tasks.[8]
The key challenge in multi-task learning is how to combine learning signals from multiple tasks into a single model. This may strongly depend on how well different tasks agree with each other, or contradict each other. There are several ways to address this challenge:
Within the MTL paradigm, information can be shared across some or all of the tasks. Depending on the structure of task relatedness, one may want to share information selectively across the tasks. For example, tasks may be grouped or exist in a hierarchy, or be related according to some general metric. Suppose, as developed more formally below, that the parameter vector modeling each task is a linear combination of some underlying basis. Similarity in terms of this basis can indicate the relatedness of the tasks. For example, with sparsity, overlap of nonzero coefficients across tasks indicates commonality. A task grouping then corresponds to those tasks lying in a subspace generated by some subset of basis elements, where tasks in different groups may be disjoint or overlap arbitrarily in terms of their bases.[9] Task relatedness can be imposed a priori or learned from the data.[7] [10] Hierarchical task relatedness can also be exploited implicitly without assuming a priori knowledge or learning relations explicitly.[11] [12] For example, the explicit learning of sample relevance across tasks can be done to guarantee the effectiveness of joint learning across multiple domains.[11]
One can attempt to learn a group of principal tasks using a group of auxiliary tasks that are unrelated to the principal ones. In many applications, joint learning of unrelated tasks which use the same input data can be beneficial. The reason is that prior knowledge about task relatedness can lead to sparser and more informative representations for each task grouping, essentially by screening out idiosyncrasies of the data distribution. Methods have been proposed that build on a prior multitask methodology by favoring a shared low-dimensional representation within each task grouping. The programmer can impose a penalty on tasks from different groups which encourages the two representations to be orthogonal. Experiments on synthetic and real data have indicated that incorporating unrelated tasks can result in significant improvements over standard multi-task learning methods.[8]
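One possible form for such an orthogonality penalty is the squared Frobenius norm of the product of the two representation matrices, which vanishes exactly when their column spaces are orthogonal. The sketch below is only an illustration under that assumption; the function name `orthogonality_penalty` and the matrices are hypothetical and are not taken from the cited work.

```python
import numpy as np

def orthogonality_penalty(U, V):
    """Squared Frobenius norm of U^T V; zero iff the column spaces are orthogonal."""
    return float(np.sum((U.T @ V) ** 2))

# Hypothetical representations: 4 input features, 2 basis vectors per task group.
U_principal = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]])
V_orthogonal = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
V_overlapping = np.array([[1.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 1.0]])

print(orthogonality_penalty(U_principal, V_orthogonal))   # 0.0
print(orthogonality_penalty(U_principal, V_overlapping))  # 1.0
```

Adding this quantity, scaled by a weight, to the training objective would discourage the auxiliary group from reusing directions already used by the principal group.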
Related to multi-task learning is the concept of knowledge transfer. Whereas traditional multi-task learning implies that a shared representation is developed concurrently across tasks, transfer of knowledge implies a sequentially shared representation. Large-scale machine learning projects such as the deep convolutional neural network GoogLeNet,[13] an image-based object classifier, can develop robust representations which may be useful to further algorithms learning related tasks. For example, the pre-trained model can be used as a feature extractor to perform pre-processing for another learning algorithm. Alternatively, the pre-trained model can be used to initialize a model with a similar architecture, which is then fine-tuned to learn a different classification task.[14]
Traditionally, multi-task learning and transfer of knowledge are applied to stationary learning settings. Their extension to non-stationary environments is termed group online adaptive learning (GOAL).[15] Sharing information could be particularly useful if learners operate in continuously changing environments, because a learner could benefit from the previous experience of another learner to quickly adapt to its new environment. Such group-adaptive learning has numerous applications, from predicting financial time-series, through content recommendation systems, to visual understanding for adaptive autonomous agents.
In some cases, the simultaneous training of seemingly related tasks may hinder performance compared to single-task models.[16] Commonly, MTL models employ task-specific modules on top of a joint feature representation obtained using a shared module. Since this joint representation must capture useful features across all tasks, MTL may hinder individual task performance if the different tasks seek conflicting representations, i.e., the gradients of different tasks point in opposing directions or differ significantly in magnitude. This phenomenon is commonly referred to as negative transfer. To mitigate this issue, various MTL optimization methods have been proposed. Commonly, the per-task gradients are combined into a joint update direction through various aggregation algorithms or heuristics.
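As a rough illustration of gradient conflict and of one aggregation heuristic, the sketch below measures the cosine similarity between two task gradients and then combines them with a projection step in the style of PCGrad ("gradient surgery"), using a fixed rather than random task order. The gradient values and the function names `cosine` and `project_conflicts` are made up for the example; this is one heuristic among many, not a method prescribed by the text.

```python
import numpy as np

def cosine(g1, g2):
    """Cosine similarity between two gradient vectors."""
    return float(g1 @ g2 / (np.linalg.norm(g1) * np.linalg.norm(g2)))

def project_conflicts(grads):
    """For each task gradient, remove its component along any other task
    gradient it conflicts with (negative inner product), then sum the results."""
    projected = []
    for i, g in enumerate(grads):
        g = g.copy()
        for j, other in enumerate(grads):
            if i != j and g @ other < 0:
                g -= (g @ other) / (other @ other) * other
        projected.append(g)
    return np.sum(projected, axis=0)

# Hypothetical gradients of two task losses with respect to shared parameters.
g_a = np.array([1.0, 0.0])
g_b = np.array([-1.0, 1.0])

print(cosine(g_a, g_b) < 0)             # True: the two tasks conflict
print(g_a + g_b)                        # naive sum: [0. 1.]
print(project_conflicts([g_a, g_b]))    # projected sum: [0.5 1.5]
```

In this example the projected update has a non-negative inner product with both task gradients, whereas the naive sum makes no first-order progress on the first task.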
The MTL problem can be cast within the context of RKHSvv (a complete inner product space of vector-valued functions equipped with a reproducing kernel). In particular, recent focus has been on cases where task structure can be identified via a separable kernel, described below. The presentation here derives from Ciliberto et al., 2015.
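Suppose the training data set is $\mathcal{S}_t = \{(x_i^t, y_i^t)\}_{i=1}^{n_t}$, with $x_i^t \in \mathcal{X}$ and $y_i^t \in \mathcal{Y}$, where $t$ indexes the task and $t \in 1, \ldots, T$. Let $n = \sum_{t=1}^{T} n_t$. In this setting there is a consistent input and output space and the same loss function $\mathcal{L}: \mathbb{R} \times \mathbb{R} \to \mathbb{R}_+$ for each task. This results in the regularized machine learning problem
$$\min_{f \in \mathcal{H}} \sum_{t=1}^{T} \frac{1}{n_t} \sum_{i=1}^{n_t} \mathcal{L}(y_i^t, f_t(x_i^t)) + \lambda \|f\|_{\mathcal{H}}^2$$
where $\mathcal{H}$ is a vector-valued reproducing kernel Hilbert space with functions $f: \mathcal{X} \to \mathcal{Y}^T$ having components $f_t: \mathcal{X} \to \mathcal{Y}$.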
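The reproducing kernel for the space $\mathcal{H}$ of functions $f: \mathcal{X} \to \mathbb{R}^T$ is a symmetric matrix-valued function $\Gamma: \mathcal{X} \times \mathcal{X} \to \mathbb{R}^{T \times T}$ such that $\Gamma(\cdot, x)c \in \mathcal{H}$ and the following reproducing property holds:
$$\langle f(x), c \rangle_{\mathbb{R}^T} = \langle f, \Gamma(x, \cdot)c \rangle_{\mathcal{H}}$$
The reproducing kernel gives rise to a representer theorem showing that any solution to the regularized problem has the form
$$f(x) = \sum_{t=1}^{T} \sum_{i=1}^{n_t} \Gamma(x, x_i^t) c_i^t$$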
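The form of the kernel $\Gamma$ both determines the representation of the feature space and structures the output across tasks. A natural simplification is to choose a separable kernel, which factors into separate kernels on the input space $\mathcal{X}$ and on the tasks $\{1, \ldots, T\}$. In this case the kernel relating scalar components $f_t$ and $f_s$ is given by $\gamma((x_i, t), (x_j, s)) = k(x_i, x_j) A_{s,t}$. For vector-valued functions $f \in \mathcal{H}$ we can write $\Gamma(x_i, x_j) = k(x_i, x_j) A$, where $k$ is a scalar reproducing kernel and $A$ is a symmetric positive semi-definite $T \times T$ matrix. Henceforth denote $S_+^T = \{\text{PSD matrices}\} \subset \mathbb{R}^{T \times T}$.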
This factorization property, separability, implies that the input feature space representation does not vary by task. That is, there is no interaction between the input kernel and the task kernel. The structure on tasks is represented solely by $A$. Methods for non-separable kernels $\Gamma$ are a current field of research.
For the separable case, the representer theorem reduces to $f(x) = \sum_{i=1}^{n} k(x, x_i) A c_i$. The model output on the training data is then $KCA$, where $K$ is the $n \times n$ empirical kernel matrix with entries $K_{i,j} = k(x_i, x_j)$, and $C$ is the $n \times T$ matrix with rows $c_i$.
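The matrix form of the model output can be checked numerically against the point-by-point sum from the representer theorem. The following is a minimal sketch under assumed choices: a Gaussian input kernel, a small hand-picked task matrix `A`, and random coefficients `C`, none of which come from the source.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, T = 5, 3, 2

X = rng.normal(size=(n, d))          # training inputs x_i
C = rng.normal(size=(n, T))          # rows are the coefficient vectors c_i
A = np.array([[1.0, 0.5],
              [0.5, 1.0]])           # symmetric PSD task matrix (assumed)

def k(x, z, gamma=0.5):
    """Scalar Gaussian kernel on the input space (an illustrative choice)."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

K = np.array([[k(xi, xj) for xj in X] for xi in X])

# Matrix form of the model output on the training inputs.
F_matrix = K @ C @ A

# Same outputs from f(x) = sum_i k(x, x_i) A c_i, evaluated point by point.
F_sum = np.array([sum(k(x, X[i]) * (A @ C[i]) for i in range(n)) for x in X])

print(np.allclose(F_matrix, F_sum))  # True
```

Row `j` of `F_matrix` holds the predictions of all `T` tasks at input `x_j`; the two forms agree because `A` is symmetric.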
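With the separable kernel, the regularized problem can be rewritten as
$$\min_{C \in \mathbb{R}^{n \times T}} V(Y, KCA) + \lambda \operatorname{tr}(KCAC^\top) \qquad (P)$$
where $V$ is a (weighted) average of $\mathcal{L}$ applied entry-wise to $Y$ and $KCA$. (The weight is zero if $Y_i^t$ is a missing observation.)
Note that the second term in (P) can be derived as follows:
$$\begin{aligned} \|f\|_{\mathcal{H}}^2 &= \left\langle \sum_{i=1}^{n} k(\cdot, x_i) A c_i, \sum_{j=1}^{n} k(\cdot, x_j) A c_j \right\rangle_{\mathcal{H}} \\ &= \sum_{i,j=1}^{n} \langle k(\cdot, x_i) A c_i, k(\cdot, x_j) A c_j \rangle_{\mathcal{H}} && \text{(bilinearity)} \\ &= \sum_{i,j=1}^{n} \langle k(x_i, x_j) A c_i, c_j \rangle_{\mathbb{R}^T} && \text{(reproducing property)} \\ &= \sum_{i,j=1}^{n} k(x_i, x_j)\, c_i^\top A c_j = \operatorname{tr}(KCAC^\top) \end{aligned}$$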
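Because the penalty reduces to a trace, the problem becomes finite-dimensional in the coefficient matrix. As a hedged illustration, the sketch below assumes a squared loss with every task observed at the same $n$ inputs, so that $V(Y, KCA) = \frac{1}{n}\|Y - KCA\|_F^2$; under that assumption any $C$ satisfying $KCA + n\lambda C = Y$ is a stationary point, and this linear system can be solved through a Kronecker product. The kernel, the task matrix and all variable names are illustrative choices, not taken from the source.

```python
import numpy as np

rng = np.random.default_rng(1)
n, T, lam = 6, 3, 0.1

X = rng.normal(size=(n, 2))
Y = rng.normal(size=(n, T))
sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-0.5 * sq_dist)                     # Gaussian kernel matrix (illustrative)
A = 0.5 * np.eye(T) + 0.5 * np.ones((T, T))    # assumed PSD task matrix

def objective(C):
    """Squared-loss fit plus the trace penalty lam * tr(K C A C^T)."""
    fit = np.sum((Y - K @ C @ A) ** 2) / n
    return fit + lam * np.trace(K @ C @ A @ C.T)

def gradient(C):
    """Gradient of the objective with respect to C (K and A symmetric)."""
    return -2.0 / n * K @ (Y - K @ C @ A) @ A + 2.0 * lam * K @ C @ A

# K C A + n*lam*C = Y, written with column-major vec as
# (A^T kron K + n*lam*I) vec(C) = vec(Y).
system = np.kron(A.T, K) + n * lam * np.eye(n * T)
C_hat = np.linalg.solve(system, Y.flatten(order="F")).reshape((n, T), order="F")

print(np.linalg.norm(gradient(C_hat)) < 1e-8)  # True: C_hat is a stationary point
```

Since both terms of the objective are convex in `C`, the stationary point found this way is a global minimizer for this particular loss.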
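There are three largely equivalent ways to represent task structure: through a regularizer, through an output metric, and through an output mapping.
With the separable kernel, the regularizer can be written as $\|f\|_{\mathcal{H}}^2 = \sum_{s,t=1}^{T} A^\dagger_{t,s} \langle f_s, f_t \rangle_{\mathcal{H}_k}$, where $A^\dagger_{t,s}$ is the $(t,s)$ element of the pseudoinverse of $A$ and $\mathcal{H}_k$ is the RKHS based on the scalar kernel $k$, so that $A^\dagger_{t,s}$ controls the weight of the penalty associated with $\langle f_s, f_t \rangle_{\mathcal{H}_k}$. Via this regularizer formulation, one can represent a variety of task structures easily.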
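Letting $A^\dagger = \gamma I_T + (\gamma - \lambda)\frac{1}{T}\mathbf{1}\mathbf{1}^\top$ (where $I_T$ is the $T \times T$ identity matrix and $\mathbf{1}\mathbf{1}^\top$ is the $T \times T$ matrix of ones) is equivalent to letting $\gamma$ control the variance $\sum_t \|f_t - \bar{f}\|_{\mathcal{H}_k}$ of tasks from their mean $\bar{f} = \frac{1}{T}\sum_t f_t$. For example, blood levels of some biomarker may be taken on $T$ patients at $n_t$ time points during the course of a day, and interest may lie in regularizing the variance of the predictions across patients.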
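Letting $A^\dagger = \alpha I_T + (\alpha - \lambda)M$, where $M_{t,s} = \frac{1}{|G_r|}\mathbb{I}(t, s \in G_r)$, is equivalent to letting $\alpha$ control the variance measured with respect to a group mean: $\sum_r \sum_{t \in G_r} \left\|f_t - \frac{1}{|G_r|}\sum_{s \in G_r} f_s\right\|$. (Here $|G_r|$ is the cardinality of group $r$, and $\mathbb{I}$ is the indicator function.)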
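Letting $A^\dagger = \delta I_T + (\delta - \lambda)L$, where $L = D - M$ is the Laplacian for the graph with adjacency matrix $M$ giving pairwise similarities of tasks (and $D$ is the diagonal matrix of row sums of $M$), is equivalent to giving a larger penalty to the distance separating tasks $t$ and $s$ when they are more similar according to the weight $M_{t,s}$; that is, $\delta$ regularizes $\sum_{t,s} \|f_t - f_s\|_{\mathcal{H}_k}^2 M_{t,s}$.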
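The link between these structure matrices and the penalties they induce can be checked numerically. In the sketch below, finite-dimensional vectors (the rows of `W`) stand in for the task functions $f_t$, squared norms are used throughout, and the similarity matrix is named `S` to keep it apart from the group-averaging matrix `M`. The groups and similarities are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
T, d = 5, 3
W = rng.normal(size=(T, d))            # row t stands in for the task function f_t
groups = [[0, 1, 2], [3, 4]]           # hypothetical task groups G_r

# Group-averaging matrix: M[t, s] = 1/|G_r| if t and s are both in G_r.
M = np.zeros((T, T))
for g in groups:
    M[np.ix_(g, g)] = 1.0 / len(g)

# Within-group variance equals the quadratic form of I - M.
within = sum(np.sum((W[g] - W[g].mean(axis=0)) ** 2) for g in groups)
print(np.isclose(within, np.trace(W.T @ (np.eye(T) - M) @ W)))   # True

# Graph Laplacian L = D - S for a symmetric similarity (adjacency) matrix S.
S = rng.uniform(size=(T, T))
S = (S + S.T) / 2
np.fill_diagonal(S, 0.0)
L = np.diag(S.sum(axis=1)) - S

# Similarity-weighted pairwise distances equal twice the Laplacian quadratic form.
pairwise = sum(S[t, s] * np.sum((W[t] - W[s]) ** 2)
               for t in range(T) for s in range(T))
print(np.isclose(pairwise, 2 * np.trace(W.T @ L @ W)))           # True
```

Both checks show that a quadratic form in a suitably chosen matrix reproduces a penalty on deviations between tasks, which is what makes the regularizer view convenient.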
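The learning problem with the separable kernel can be generalized to admit learning the task matrix $A$ as follows:
$$\min_{C \in \mathbb{R}^{n \times T},\, A \in S_+^T} V(Y, KCA) + \lambda \operatorname{tr}(KCAC^\top) + F(A) \qquad (Q)$$
The choice of $F: S_+^T \to \mathbb{R}_+$ must be designed to learn matrices $A$ of a given type.
Restricting to the case of convex losses and coercive penalties, Ciliberto et al. have shown that although (Q) is not convex jointly in $C$ and $A$, a related problem is jointly convex.
Specifically, on the convex set $\mathcal{C} = \{(C, A) \in \mathbb{R}^{n \times T} \times S_+^T \mid \operatorname{Range}(C^\top K C) \subseteq \operatorname{Range}(A)\}$, the equivalent problem
$$\min_{(C, A) \in \mathcal{C}} V(Y, KC) + \lambda \operatorname{tr}(A^\dagger C^\top K C) + F(A) \qquad (R)$$
is convex with the same minimum value. And if $(C_R, A_R)$ is a minimizer for (R), then $(C_R A_R^\dagger, A_R)$ is a minimizer for (Q).
(R) may be solved by a barrier method on a closed set by introducing the following perturbation:
$$\min_{C \in \mathbb{R}^{n \times T},\, A \in S_+^T} V(Y, KC) + \lambda \operatorname{tr}\left(A^\dagger (C^\top K C + \delta^2 I_T)\right) + F(A) \qquad (S)$$
The perturbation via the barrier $\delta^2 \operatorname{tr}(A^\dagger)$ forces the objective function to be equal to $+\infty$ on the boundary of $\mathbb{R}^{n \times T} \times S_+^T$.
(S) can be solved with a block coordinate descent method, alternating in $C$ and $A$. This results in a sequence of minimizers $(C_m, A_m)$ in (S) that converges to the solution of (R) as $\delta_m \to 0$, and hence gives the solution to (Q).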
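A block coordinate descent of this kind can be sketched for one concrete, assumed instance: squared loss $V(Y, KC) = \frac{1}{n}\|Y - KC\|_F^2$ and the penalty $F(A) = \operatorname{tr}(A)$, a choice made here only because both block updates then have closed forms. For brevity the perturbation $\delta$ is held fixed rather than driven to zero, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
n, T, lam = 8, 3, 0.1
X = rng.normal(size=(n, 2))
Y = rng.normal(size=(n, T))
K = np.exp(-0.5 * np.sum((X[:, None] - X[None, :]) ** 2, axis=-1))

def objective(C, A, delta):
    """Perturbed objective with squared loss and F(A) = tr(A) (both assumed)."""
    fit = np.sum((Y - K @ C) ** 2) / n
    inner = C.T @ K @ C + delta**2 * np.eye(T)
    return fit + lam * np.trace(np.linalg.inv(A) @ inner) + np.trace(A)

def update_C(A):
    """Exact minimizer in C: solve K C + n*lam*C A^{-1} = Y via column-major vec."""
    A_inv = np.linalg.inv(A)
    system = np.kron(np.eye(T), K) + n * lam * np.kron(A_inv.T, np.eye(n))
    c = np.linalg.solve(system, Y.flatten(order="F"))
    return c.reshape((n, T), order="F")

def update_A(C, delta):
    """Exact minimizer in A for F(A) = tr(A): the PSD square root of
    lam * (C^T K C + delta^2 I)."""
    G = lam * (C.T @ K @ C + delta**2 * np.eye(T))
    evals, evecs = np.linalg.eigh((G + G.T) / 2)
    return (evecs * np.sqrt(evals)) @ evecs.T

A, delta = np.eye(T), 1e-2
C = update_C(A)
history = [objective(C, A, delta)]
for _ in range(20):
    A = update_A(C, delta)      # block step in A with C fixed
    C = update_C(A)             # block step in C with A fixed
    history.append(objective(C, A, delta))

print(history[0], history[-1])  # the objective does not increase across sweeps
```

The $\delta^2 I_T$ term keeps every iterate of `A` strictly positive definite, which is the role the barrier plays in the text; a fuller implementation would repeat the loop for a decreasing sequence of `delta` values.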
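Spectral penalties - Dinuzzo et al.[17] suggested setting $F$ as the Frobenius norm $\sqrt{\operatorname{tr}(A^\top A)}$. They optimized the joint problem in $C$ and $A$ directly using block coordinate descent, not accounting for difficulties at the boundary of $\mathbb{R}^{n \times T} \times S_+^T$.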
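Clustered tasks learning - Jacob et al.[18] suggested learning $A$ in the setting where $T$ tasks are organized in $R$ disjoint clusters. In this case let $E \in \{0,1\}^{T \times R}$ be the matrix with $E_{t,r} = \mathbb{I}(\text{task } t \in \text{group } r)$. Setting $M = E E^\dagger = E(E^\top E)^{-1}E^\top$ and $U = \frac{1}{T}\mathbf{1}\mathbf{1}^\top$, the task matrix $A^\dagger$ can be parameterized as a function of $M$: $A^\dagger(M) = \epsilon_M U + \epsilon_B (M - U) + \epsilon (I - M)$, with terms that penalize the average, the between-cluster variance and the within-cluster variance of the task predictions, respectively. The set of such $M$ is not convex, but there is a convex relaxation $\mathcal{S}_c = \{M \in S_+^T : I - M \in S_+^T \land \operatorname{tr}(M) = R\}$. In this formulation, $F(A) = \mathbb{I}(A(M) \in \{A : M \in \mathcal{S}_c\})$.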
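The clustered parameterization can be made concrete with a small numerical sketch. It assumes that $M$ is the orthogonal projection onto the span of the cluster indicator columns, $E(E^\top E)^{-1}E^\top$, whose trace equals the number of clusters; the cluster assignment and the three weights (named `eps_M`, `eps_B`, `eps_W` here) are arbitrary illustrative values.

```python
import numpy as np

T, R = 5, 2
assignment = np.array([0, 0, 0, 1, 1])       # hypothetical cluster of each task

E = np.zeros((T, R))
E[np.arange(T), assignment] = 1.0            # E[t, r] = 1 if task t is in cluster r

M = E @ np.linalg.pinv(E)                    # projection onto span(E), assumed form of M
U = np.ones((T, T)) / T
I = np.eye(T)

eps_M, eps_B, eps_W = 0.1, 1.0, 2.0          # illustrative weights
A_dagger = eps_M * U + eps_B * (M - U) + eps_W * (I - M)

print(np.isclose(np.trace(M), R))                      # True
print(np.all(np.linalg.eigvalsh(M) > -1e-10))          # True: M is PSD
print(np.all(np.linalg.eigvalsh(I - M) > -1e-10))      # True: I - M is PSD
```

The three printed checks are exactly the conditions that define the relaxed set: the relaxation keeps them but no longer requires $M$ to come from a hard cluster assignment.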
Non-convex penalties - Penalties can be constructed such that A is constrained to be a graph Laplacian, or such that A has a low-rank factorization. However, these penalties are not convex, and the analysis of the barrier method proposed by Ciliberto et al. does not go through in these cases.
Non-separable kernels - Separable kernels are limited; in particular, they do not account for structures in the interaction space between the input and output domains jointly. Future work is needed to develop models for these kernels.
A MATLAB package called Multi-Task Learning via StructurAl Regularization (MALSAR)[19] implements the following multi-task learning algorithms: Mean-Regularized Multi-Task Learning,[20] [21] Multi-Task Learning with Joint Feature Selection,[22] Robust Multi-Task Feature Learning,[23] Trace-Norm Regularized Multi-Task Learning,[24] Alternating Structural Optimization,[25] [26] Incoherent Low-Rank and Sparse Learning,[27] Robust Low-Rank Multi-Task Learning, Clustered Multi-Task Learning,[28] [29] and Multi-Task Learning with Graph Structures.