Weak supervision (also known as semi-supervised learning) is a paradigm in machine learning whose relevance and notability increased with the advent of large language models, owing to the large amount of data required to train them. It is characterized by combining a small amount of human-labeled data (the only kind of data used in the more expensive and time-consuming supervised learning paradigm) with a large amount of unlabeled data (the only kind used in the unsupervised learning paradigm). In other words, the desired output values are provided only for a subset of the training data; the remaining data is unlabeled or imprecisely labeled. Intuitively, the learning problem can be seen as an exam, with the labeled data as sample problems that the teacher solves for the class as an aid in solving another set of problems. In the transductive setting, these unsolved problems act as exam questions. In the inductive setting, they become practice problems of the sort that will make up the exam. Technically, it can be viewed as performing clustering and then labeling the clusters with the labeled data, pushing the decision boundary away from high-density regions, or learning an underlying low-dimensional manifold where the data reside.
The acquisition of labeled data for a learning problem often requires a skilled human agent (e.g. to transcribe an audio segment) or a physical experiment (e.g. determining the 3D structure of a protein or determining whether there is oil at a particular location). The cost associated with the labeling process thus may render large, fully labeled training sets infeasible, whereas acquisition of unlabeled data is relatively inexpensive. In such situations, semi-supervised learning can be of great practical value. Semi-supervised learning is also of theoretical interest in machine learning and as a model for human learning.
See also: Active learning (machine learning).
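More formally, semi-supervised learning assumes a set of $l$ labeled examples $x_1,\dots,x_l \in X$ with corresponding labels $y_1,\dots,y_l \in Y$, together with $u$ unlabeled examples $x_{l+1},\dots,x_{l+u} \in X$.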
Semi-supervised learning may refer to either transductive learning or inductive learning. The goal of transductive learning is to infer the correct labels for the given unlabeled data $x_{l+1},\dots,x_{l+u}$ only. The goal of inductive learning is to infer the correct mapping from $X$ to $Y$.
It is unnecessary (and, according to Vapnik's principle, imprudent) to perform transductive learning by way of inferring a classification rule over the entire input space; however, in practice, algorithms formally designed for transduction or induction are often used interchangeably.
In order to make any use of unlabeled data, some relationship to the underlying distribution of data must exist. Semi-supervised learning algorithms make use of at least one of the following assumptions:
Continuity (smoothness) assumption: Points that are close to each other are more likely to share a label. This is also generally assumed in supervised learning and yields a preference for geometrically simple decision boundaries. In the case of semi-supervised learning, the smoothness assumption additionally yields a preference for decision boundaries in low-density regions, so that few points are close to each other yet in different classes.[1]
Cluster assumption: The data tend to form discrete clusters, and points in the same cluster are more likely to share a label (although data that share a label may spread across multiple clusters). This is a special case of the smoothness assumption and gives rise to feature learning with clustering algorithms.
Manifold assumption (see also: Manifold hypothesis): The data lie approximately on a manifold of much lower dimension than the input space. In this case, learning the manifold using both the labeled and unlabeled data can avoid the curse of dimensionality. Learning can then proceed using distances and densities defined on the manifold.
The manifold assumption is practical when high-dimensional data are generated by some process that may be hard to model directly, but which has only a few degrees of freedom. For instance, human voice is controlled by a few vocal folds,[2] and images of various facial expressions are controlled by a few muscles. In these cases, it is better to consider distances and smoothness in the natural space of the generating problem, rather than in the space of all possible acoustic waves or images, respectively.
The heuristic approach of self-training (also known as self-learning or self-labeling) is historically the oldest approach to semi-supervised learning, with examples of applications starting in the 1960s.[3]
The transductive learning framework was formally introduced by Vladimir Vapnik in the 1970s.[4] Interest in inductive learning using generative models also began in the 1970s. A probably approximately correct learning bound for semi-supervised learning of a Gaussian mixture was demonstrated by Ratsaby and Venkatesh in 1995.[5]
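Generative approaches to statistical learning first seek to estimate $p(x|y)$, the distribution of data points belonging to each class. The probability $p(y|x)$ that a given point $x$ has label $y$ is then proportional to $p(x|y)p(y)$ by Bayes' rule. Semi-supervised learning with generative models can be viewed either as an extension of supervised learning (classification plus information about $p(x)$) or as an extension of unsupervised learning (clustering plus some labels).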
Generative models assume that the distributions take some particular form $p(x|y,\theta)$ parameterized by the vector $\theta$.
The unlabeled data are distributed according to a mixture of individual-class distributions. In order to learn the mixture distribution from the unlabeled data, it must be identifiable, that is, different parameters must yield different summed distributions. Gaussian mixture distributions are identifiable and commonly used for generative models.
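The parameterized joint distribution can be written as $p(x,y|\theta)=p(y|\theta)p(x|y,\theta)$ by the chain rule. Each parameter vector $\theta$ is associated with a decision function $f_\theta(x)=\underset{y}{\operatorname{argmax}}\ p(y|x,\theta)$. The parameter is then chosen based on fit to both the labeled and unlabeled data, weighted by $\lambda$:
$$\underset{\Theta}{\operatorname{argmax}}\left(\log p(\{x_i,y_i\}_{i=1}^{l}|\theta)+\lambda\log p(\{x_i\}_{i=l+1}^{l+u}|\theta)\right)$$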
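As a rough illustration of the weighted objective above, the following sketch evaluates it for a one-dimensional model with one Gaussian per class. The data, the two candidate parameter settings (`theta_good`, `theta_bad`) and all function names are assumptions made up for this example. It only scores given parameters and does not fit them.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a 1-D normal distribution N(mean, std^2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def joint(x, y, theta):
    """p(x, y | theta) = p(y | theta) * p(x | y, theta)."""
    prior, mean, std = theta[y]
    return prior * gaussian_pdf(x, mean, std)

def objective(labeled, unlabeled, theta, lam):
    """log p(labeled | theta) + lam * log p(unlabeled | theta)."""
    labeled_term = sum(math.log(joint(x, y, theta)) for x, y in labeled)
    # For unlabeled points the label is marginalized out: p(x) = sum_y p(x, y).
    unlabeled_term = sum(
        math.log(sum(joint(x, y, theta) for y in theta)) for x in unlabeled
    )
    return labeled_term + lam * unlabeled_term

def decide(x, theta):
    """f_theta(x) = argmax_y p(y | x, theta); p(x) does not depend on y,
    so comparing the joint p(x, y | theta) gives the same argmax."""
    return max(theta, key=lambda y: joint(x, y, theta))

# Hypothetical toy data: two labeled points and six unlabeled ones.
labeled = [(-2.1, 0), (1.9, 1)]
unlabeled = [-2.5, -1.8, -2.2, 2.3, 1.7, 2.0]

# theta maps each class to (prior, mean, std); both settings are illustrative.
theta_good = {0: (0.5, -2.0, 0.5), 1: (0.5, 2.0, 0.5)}
theta_bad = {0: (0.5, -0.5, 0.5), 1: (0.5, 0.5, 0.5)}

score_good = objective(labeled, unlabeled, theta_good, lam=1.0)
score_bad = objective(labeled, unlabeled, theta_bad, lam=1.0)
```

In this toy setting the parameters whose class means sit on the two groups of unlabeled points score higher, which is the sense in which unlabeled data inform the choice of $\theta$. In practice the maximization is usually carried out with an iterative procedure such as expectation-maximization rather than by comparing hand-picked candidates.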
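Another major class of methods attempts to place decision boundaries in regions with few data points, whether labeled or unlabeled. A commonly used algorithm of this kind is the transductive support vector machine (TSVM). In addition to the standard hinge loss $(1-yf(x))_+$ for labeled data, a loss function $(1-|f(x)|)_+$ is introduced over the unlabeled data by letting $y=\operatorname{sign}{f(x)}$. TSVM then selects $f^*(x)=h^*(x)+b$ from a reproducing kernel Hilbert space $\mathcal{H}$ by minimizing the regularized empirical risk:
$$f^*=\underset{f}{\operatorname{argmin}}\left(\sum_{i=1}^{l}(1-y_if(x_i))_+ +\lambda_1\|h\|_{\mathcal{H}}^2+\lambda_2\sum_{i=l+1}^{l+u}(1-|f(x_i)|)_+\right)$$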
An exact solution is intractable due to the non-convex term $(1-|f(x)|)_+$, so research focuses on useful approximations.
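To make the role of the unlabeled-data term concrete, here is a minimal sketch that evaluates a TSVM-style objective (hinge loss on labeled points, the term $(1-|f(x)|)_+$ on unlabeled points, plus a norm penalty) for a linear decision function. It assumes a linear kernel, so that the norm of $h$ reduces to the Euclidean norm of a weight vector `w`. The data and the two candidate boundaries are hypothetical, and the code only evaluates the objective rather than minimizing it.

```python
def hinge(t):
    """(t)_+ = max(0, t)."""
    return max(0.0, t)

def f(x, w, b):
    """Linear decision function f(x) = <w, x> + b.
    Assumption: linear kernel, so ||h||^2 is simply ||w||^2."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def tsvm_objective(w, b, labeled, unlabeled, lam1, lam2):
    """Hinge loss on labeled data + lam1 * ||w||^2
    + lam2 * sum of (1 - |f(x)|)_+ over unlabeled data."""
    labeled_loss = sum(hinge(1.0 - y * f(x, w, b)) for x, y in labeled)
    norm_penalty = lam1 * sum(wi * wi for wi in w)
    # Unlabeled points are penalized for lying inside the margin, whatever
    # side they fall on; this is what pushes the boundary to sparse regions.
    unlabeled_loss = sum(hinge(1.0 - abs(f(x, w, b))) for x in unlabeled)
    return labeled_loss + norm_penalty + lam2 * unlabeled_loss

# Hypothetical 2-D data: two clusters around x1 = -2 and x1 = +2.
labeled = [((-2.0, 0.0), -1), ((2.0, 0.0), 1)]
unlabeled = [(-2.0, 0.5), (-1.5, -0.5), (1.5, 0.5), (2.0, -0.5)]

# Boundary x1 = 0 runs through the empty gap between the clusters.
through_gap = tsvm_objective((1.0, 0.0), 0.0, labeled, unlabeled, 0.1, 1.0)
# Boundary x1 = -1.5 cuts through the left cluster.
through_cluster = tsvm_objective((1.0, 0.0), 1.5, labeled, unlabeled, 0.1, 1.0)
```

On this toy data the boundary through the gap has the lower objective value, because no unlabeled point falls inside its margin.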
Other approaches that implement low-density separation include Gaussian process models, information regularization, and entropy minimization (of which TSVM is a special case).
Laplacian regularization has historically been approached through the graph Laplacian. Graph-based methods for semi-supervised learning use a graph representation of the data, with a node for each labeled and unlabeled example. The graph may be constructed using domain knowledge or similarity of examples; two common methods are to connect each data point to its $k$ nearest neighbors or to all examples within some distance $\epsilon$. The weight $W_{ij}$ of an edge between $x_i$ and $x_j$ is then set to a Gaussian kernel value of the form $e^{-\|x_i-x_j\|^2/\epsilon^2}$.
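The two graph constructions mentioned above (nearest neighbors and a distance threshold) can be sketched as follows. This is an illustrative implementation under stated assumptions: edges are weighted by a Gaussian kernel of the squared distance with width `eps`, the nearest-neighbor graph is symmetrized by keeping an edge if either endpoint selects the other, and the function names and sample points are made up for the example.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def knn_graph(points, k, eps):
    """Symmetric k-nearest-neighbor graph with Gaussian edge weights."""
    n = len(points)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Indices of the k points closest to point i (excluding i itself).
        neighbors = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: sq_dist(points[i], points[j]),
        )[:k]
        for j in neighbors:
            w = math.exp(-sq_dist(points[i], points[j]) / eps ** 2)
            # Symmetrize: keep the edge if either endpoint selects the other.
            W[i][j] = W[j][i] = w
    return W

def epsilon_graph(points, eps):
    """Connect every pair of points at distance at most eps."""
    n = len(points)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d2 = sq_dist(points[i], points[j])
            if d2 <= eps ** 2:
                W[i][j] = W[j][i] = math.exp(-d2 / eps ** 2)
    return W

# Hypothetical data: a tight group of three points and a distant pair.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]
W_knn = knn_graph(points, k=1, eps=1.0)
W_eps = epsilon_graph(points, eps=1.0)
```

Both constructions leave the two groups disconnected here, so any label information can only spread within a group. The choice of `k` or `eps` controls how readily distant regions become linked.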
Within the framework of manifold regularization,[8] [9] the graph serves as a proxy for the manifold. A term is added to the standard Tikhonov regularization problem to enforce smoothness of the solution relative to the manifold (in the intrinsic space of the problem) as well as relative to the ambient input space. The minimization problem becomes
$$\underset{f\in\mathcal{H}}{\operatorname{argmin}}\left(\frac{1}{l}\sum_{i=1}^{l}V(f(x_i),y_i)+\lambda_A\|f\|_{\mathcal{H}}^2+\lambda_I\int_{\mathcal{M}}\|\nabla_{\mathcal{M}}f(x)\|^2\,dp(x)\right)$$
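where $\mathcal{H}$ is a reproducing kernel Hilbert space and $\mathcal{M}$ is the manifold on which the data lie. The regularization parameters $\lambda_A$ and $\lambda_I$ control smoothness in the ambient and intrinsic spaces respectively.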
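The graph is used to approximate the intrinsic regularization term. Defining the graph Laplacian $L=D-W$, where $D_{ii}=\sum_{j=1}^{l+u}W_{ij}$ and $\mathbf{f}$ is the vector $[f(x_1)\dots f(x_{l+u})]$ with components $f_i=f(x_i)$, we have, for symmetric weights and up to a constant scaling in the approximation,
$$\mathbf{f}^TL\mathbf{f}=\frac{1}{2}\sum_{i,j=1}^{l+u}W_{ij}(f_i-f_j)^2\approx\int_{\mathcal{M}}\|\nabla_{\mathcal{M}}f(x)\|^2\,dp(x)$$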
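The link between the graph Laplacian and the pairwise smoothness penalty can be checked numerically. The sketch below builds $L=D-W$ for a small, hypothetical symmetric weight matrix and compares the quadratic form with the weighted sum of squared differences. Note the factor 1/2, which appears when the sum runs over ordered pairs $(i,j)$, so that each edge is counted twice.

```python
def laplacian(W):
    """Graph Laplacian L = D - W, with D the diagonal matrix of row sums of W."""
    n = len(W)
    return [
        [(sum(W[i]) if i == j else 0.0) - W[i][j] for j in range(n)]
        for i in range(n)
    ]

def quadratic_form(L, f):
    """Compute f^T L f."""
    n = len(f)
    return sum(f[i] * L[i][j] * f[j] for i in range(n) for j in range(n))

def pairwise_penalty(W, f):
    """(1/2) * sum over ordered pairs (i, j) of W_ij * (f_i - f_j)^2."""
    n = len(f)
    return 0.5 * sum(
        W[i][j] * (f[i] - f[j]) ** 2 for i in range(n) for j in range(n)
    )

# Hypothetical symmetric weights on four nodes: strong edges 0-1 and 0-2,
# and a weak edge 2-3.
W = [
    [0.0, 1.0, 0.5, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.0, 0.1],
    [0.0, 0.0, 0.1, 0.0],
]
L = laplacian(W)

smooth = [1.0, 1.0, 1.0, -1.0]  # changes sign only across the weak edge 2-3
rough = [1.0, -1.0, 1.0, -1.0]  # also changes sign across the strong edge 0-1

smooth_cost = quadratic_form(L, smooth)
rough_cost = quadratic_form(L, rough)
```

A function that changes value only across weakly weighted edges incurs a small penalty, which is how the regularizer favors labelings that are constant within densely connected regions.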
The graph-based approach to Laplacian regularization can be related to the finite difference method.
The Laplacian can also be used to extend two supervised learning algorithms, regularized least squares and support vector machines (SVM), to their semi-supervised versions: Laplacian regularized least squares and Laplacian SVM.
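Some methods for semi-supervised learning are not intrinsically geared to learning from both unlabeled and labeled data, but instead make use of unlabeled data within a supervised learning framework. For instance, the labeled and unlabeled examples $x_1,\dots,x_{l+u}$ may inform a choice of representation, distance metric, or kernel for the data in an unsupervised first step. Supervised learning then proceeds from only the labeled examples.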
Self-training is a wrapper method for semi-supervised learning.[12] First, a supervised learning algorithm is trained on the labeled data only. This classifier is then applied to the unlabeled data to generate more labeled examples as input for the supervised learning algorithm. Generally, only the labels the classifier is most confident in are added at each step.[13] In natural language processing, a common self-training algorithm is the Yarowsky algorithm for problems like word sense disambiguation, accent restoration, and spelling correction.[14]
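The self-training loop described above can be sketched as follows. This is a toy illustration rather than any particular published algorithm: the base learner is a nearest-centroid classifier on one-dimensional inputs, confidence is taken to be the gap between the distances to the two nearest centroids, and the names `self_train` and `per_round` are assumptions introduced for the example.

```python
def fit_centroids(labeled):
    """Supervised base learner: one centroid (mean) per class."""
    sums, counts = {}, {}
    for x, y in labeled:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict_with_confidence(x, centroids):
    """Return (label, confidence). Confidence is the gap between the
    distances to the second-nearest and the nearest centroid."""
    ranked = sorted(centroids, key=lambda y: abs(x - centroids[y]))
    best, runner_up = ranked[0], ranked[1]
    confidence = abs(x - centroids[runner_up]) - abs(x - centroids[best])
    return best, confidence

def self_train(labeled, unlabeled, per_round=2):
    """Repeatedly retrain and promote the most confident pseudo-labels."""
    labeled, pool = list(labeled), list(unlabeled)
    while pool:
        centroids = fit_centroids(labeled)
        # Score every remaining unlabeled point with the current classifier.
        scored = [(predict_with_confidence(x, centroids), x) for x in pool]
        scored.sort(key=lambda item: item[0][1], reverse=True)
        # Only the most confident predictions are added in this round.
        for (label, _), x in scored[:per_round]:
            labeled.append((x, label))
            pool.remove(x)
    return fit_centroids(labeled)

# Hypothetical data: one labeled example per class, eight unlabeled points.
labeled = [(0.0, "a"), (10.0, "b")]
unlabeled = [1.0, 2.0, 3.0, 4.0, 7.0, 8.0, 9.0, 9.5]
final_centroids = self_train(labeled, unlabeled)
```

Adding only a few high-confidence examples per round limits the damage from early mistakes, since a wrong pseudo-label is treated as ground truth in every later round.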
Co-training is an extension of self-training in which multiple classifiers are trained on different (ideally disjoint) sets of features and generate labeled examples for one another.[15]
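A minimal sketch of the co-training idea follows, assuming each example is a pair of features that serve as two separate views. Each view gets its own nearest-centroid classifier. In every round, each classifier pseudo-labels the pool example it is most confident about and adds it to a shared labeled set, which the other classifier then trains on. The learner, the confidence measure and all names are illustrative assumptions, not a reference implementation.

```python
def fit(labeled, view):
    """Nearest-centroid learner that sees only feature index `view`."""
    groups = {}
    for x, y in labeled:
        groups.setdefault(y, []).append(x[view])
    return {y: sum(values) / len(values) for y, values in groups.items()}

def predict(x, centroids, view):
    """Return (label, confidence) using only feature index `view`."""
    ranked = sorted(centroids, key=lambda y: abs(x[view] - centroids[y]))
    nearest, second = ranked[0], ranked[1]
    gap = abs(x[view] - centroids[second]) - abs(x[view] - centroids[nearest])
    return nearest, gap

def co_train(labeled, unlabeled, rounds=10):
    """Alternate between the two views, growing a shared labeled set."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        for view in (0, 1):
            if not pool:
                break
            model = fit(labeled, view)
            # This view's classifier picks its most confident pool example...
            x = max(pool, key=lambda p: predict(p, model, view)[1])
            # ...and its pseudo-label becomes training data for both views.
            labeled.append((x, predict(x, model, view)[0]))
            pool.remove(x)
    return fit(labeled, 0), fit(labeled, 1)

# Hypothetical data: each example is (feature for view 0, feature for view 1).
labeled = [((0.0, 0.0), "neg"), ((10.0, 10.0), "pos")]
unlabeled = [(1.0, 4.0), (4.0, 1.0), (9.0, 6.0), (6.0, 9.0)]
model0, model1 = co_train(labeled, unlabeled)
```

The benefit comes from the views disagreeing on which examples are easy: a point that is ambiguous in one view may be labeled confidently by the other and then passed back as training data.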
Human responses to formal semi-supervised learning problems have yielded varying conclusions about the degree of influence of the unlabeled data.[16] More natural learning problems may also be viewed as instances of semi-supervised learning. Much of human concept learning involves a small amount of direct instruction (e.g. parental labeling of objects during childhood) combined with large amounts of unlabeled experience (e.g. observation of objects without naming or counting them, or at least without feedback).
Human infants are sensitive to the structure of unlabeled natural categories such as images of dogs and cats or male and female faces.[17] Infants and children take into account not only unlabeled examples, but the sampling process from which labeled examples arise.[18] [19]