The softmax function, also known as softargmax[1] or normalized exponential function, converts a vector of K real numbers into a probability distribution over K possible outcomes. It is a generalization of the logistic function to multiple dimensions and is used in multinomial logistic regression. The softmax function is often used as the last activation function of a neural network to normalize the output of the network to a probability distribution over predicted output classes.
The softmax function takes as input a vector z of K real numbers and normalizes it into a probability distribution consisting of K probabilities proportional to the exponentials of the inputs: after applying softmax, each component lies in the interval (0,1) and the components add up to 1.

Formally, the standard (unit) softmax function \sigma\colon\R^K\to(0,1)^K, where K\ge1, takes a vector z=(z_1,\dots,z_K)\in\R^K and computes each component of \sigma(z)\in(0,1)^K as

\sigma(z)_i=\frac{e^{z_i}}{\sum_{j=1}^{K}e^{z_j}}.
In words, the softmax applies the standard exponential function to each element z_i of the input vector z (consisting of K real numbers) and normalizes these values by dividing by the sum of all these exponentials; the normalization ensures that the components of the output vector \sigma(z) sum to 1. Larger input components correspond to larger output probabilities: for example, the standard softmax of (1,2,8) is approximately (0.001,0.002,0.997), which assigns almost all of the unit weight to the position of the largest input component.
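As a quick check of the definition, here is a minimal NumPy sketch that computes the standard softmax of the example vector (1, 2, 8) above:

```python
import numpy as np

def softmax(z):
    """Standard (unit) softmax: exponentiate each component, then normalize."""
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

print(softmax(np.array([1.0, 2.0, 8.0])))
# ~[0.00091 0.00247 0.99662], i.e. approximately (0.001, 0.002, 0.997)
```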
In general, instead of e a different base b>0 can be used. As above, if b>1 then larger input components will result in larger output probabilities, and increasing the value of b will create probability distributions that are more concentrated around the positions of the largest input values. Conversely, if 0<b<1 then smaller input components will result in larger output probabilities, and decreasing the value of b will create probability distributions that are more concentrated around the positions of the smallest input values. Writing b=e^{\beta} or b=e^{-\beta} (for real \beta) yields the expressions

\sigma(z)_i=\frac{e^{\beta z_i}}{\sum_{j=1}^{K}e^{\beta z_j}} \quad\text{or}\quad \sigma(z)_i=\frac{e^{-\beta z_i}}{\sum_{j=1}^{K}e^{-\beta z_j}}, \qquad i=1,\dots,K.
A value proportional to the reciprocal of \beta is sometimes referred to as the temperature: \beta=1/(kT), where k is typically 1 or the Boltzmann constant and T is the temperature. A higher temperature results in a more uniform output distribution (i.e. with higher entropy; it is "more random"), while a lower temperature results in a sharper output distribution, with one value dominating.
In some fields, the base is fixed, corresponding to a fixed scale, while in others the parameter \beta (or T) is varied.
See also: Arg max. The softmax function is a smooth approximation to the arg max function: the function whose value is the index of a vector's largest element. The name "softmax" may therefore be misleading: softmax is not a smooth maximum (that is, a smooth approximation to the maximum function). The term "softmax" is also used for the closely related LogSumExp function, which is a smooth maximum. For this reason, some prefer the more accurate term "softargmax", though the term "softmax" is conventional in machine learning. This section uses the term "softargmax" for clarity.
Formally, instead of considering the arg max as a function with categorical output 1,\dots,n (corresponding to the index), consider the arg max function with one-hot representation of the output (assuming there is a unique maximal argument):

\operatorname{argmax}(z_1,\dots,z_n)=(y_1,\dots,y_n)=(0,\dots,0,1,0,\dots,0),

where the output coordinate y_i=1 if and only if i is the arg max of (z_1,\dots,z_n), meaning z_i is the unique maximum value of (z_1,\dots,z_n). For example, in this encoding

\operatorname{argmax}(1,5,10)=(0,0,1),

since the third argument is the maximum.
This can be generalized to multiple arg max values (multiple equal z_i attaining the maximum) by dividing the 1 evenly between all maximal arguments. For example,

\operatorname{argmax}(1,5,5)=(0,1/2,1/2),

since the second and third arguments are both the maximum. In case all arguments are equal, this is simply

\operatorname{argmax}(z,\dots,z)=(1/n,\dots,1/n).

Points z with multiple arg max values are singular points (they form the singular set); these are the points where arg max is discontinuous.
With the \beta-parametrized expression given above, softargmax is now a smooth approximation of arg max: as \beta\to\infty, softargmax converges to arg max. There are various notions of convergence of a function; softargmax converges to arg max pointwise, meaning that for each fixed input z, as \beta\to\infty,

\sigma_\beta(z)\to\operatorname{argmax}(z).

However, the convergence is not uniform: points near the singular set converge arbitrarily slowly. For example,

\sigma_\beta(1,1.0001)\to(0,1) \quad\text{but}\quad \sigma_\beta(1,0.9999)\to(1,0),

while \sigma_\beta(1,1)=(1/2,1/2) for every \beta: the closer a point is to the singular set (x,x), the slower it converges.
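The pointwise convergence, and its slowdown near the singular set, can be observed numerically; the following sketch evaluates \sigma_\beta at the inputs above for increasing \beta:

```python
import numpy as np

def softmax(z, beta=1.0):
    """Softmax with inverse-temperature parameter beta."""
    z = beta * np.asarray(z, dtype=float)
    z = z - z.max()              # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

for beta in (1, 100, 10_000, 1_000_000):
    print(beta, softmax([1.0, 1.0001], beta), softmax([1.0, 0.9999], beta))
# As beta grows, (1, 1.0001) tends to (0, 1) and (1, 0.9999) tends to (1, 0),
# but only once beta is large compared to 1/0.0001; softmax([1, 1], beta)
# stays at (0.5, 0.5) for every beta.
```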
Conversely, as \beta\to-\infty, softargmax converges to arg min in the same way, where here the singular set consists of points with two arg min values. In the language of tropical analysis, the softmax is a deformation or "quantization" of arg max and arg min, corresponding to using the log semiring instead of the max-plus semiring (respectively min-plus semiring), and recovering the arg max or arg min by taking the limit is called "tropicalization" or "dequantization".
It is also the case that, for any fixed \beta, if one input is much larger than the others relative to the temperature T=1/\beta, the output is approximately the arg max. As \beta\to\infty, the temperature T=1/\beta\to0, so eventually all differences become large relative to the (shrinking) temperature, which gives another interpretation of the limit behaviour.
In probability theory, the output of the softargmax function can be used to represent a categorical distribution – that is, a probability distribution over a finite number of different possible outcomes.
In statistical mechanics, the softargmax function is known as the Boltzmann distribution (or Gibbs distribution):[2] the index set {1,\dots,k} labels the microstates of the system; the inputs z_i are the energies of the corresponding states; the denominator is known as the partition function, often denoted Z; and the factor \beta is called the coldness (or thermodynamic beta, or inverse temperature).
The softmax function is used in various multiclass classification methods, such as multinomial logistic regression (also known as softmax regression),[3][4] multiclass linear discriminant analysis, naive Bayes classifiers, and artificial neural networks.[5] Specifically, in multinomial logistic regression and linear discriminant analysis, the input to the function is the result of K distinct linear functions, and the predicted probability for the jth class given a sample vector x and a weighting vector w is:

P(y=j\mid x)=\frac{e^{x^{\mathsf T}w_j}}{\sum_{k=1}^{K}e^{x^{\mathsf T}w_k}}.
This can be seen as the composition of K linear functions x\mapsto x^{\mathsf T}w_1,\ldots,x\mapsto x^{\mathsf T}w_K and the softmax function (where x^{\mathsf T}w denotes the inner product of x and w). The operation is equivalent to applying a linear operator defined by w to the vectors x, thus transforming the original, probably highly-dimensional, input to vectors in a K-dimensional space \R^K.
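As an illustration, the sketch below computes these class probabilities for a single sample; the weight vectors w_j are stacked as the rows of a hypothetical matrix W, and all numbers are made up for the example:

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical weights for K = 3 classes and 4 input features.
W = np.array([[ 0.2, -0.5,  1.0,  0.1],
              [-0.3,  0.8,  0.0,  0.4],
              [ 0.5,  0.1, -0.7,  0.2]])
x = np.array([1.0, 2.0, -1.0, 0.5])

scores = W @ x               # the K linear functions x -> x^T w_j
probs = softmax(scores)      # P(y = j | x)
print(probs, probs.sum())    # class probabilities summing to 1
```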
The standard softmax function is often used in the final layer of a neural network-based classifier. Such networks are commonly trained under a log loss (or cross-entropy) regime, giving a non-linear variant of multinomial logistic regression.
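For concreteness, here is a minimal sketch of the cross-entropy (log loss) objective computed from softmax outputs, assuming integer class labels and made-up network outputs (logits):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # stabilize each row
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, labels):
    """Mean negative log-probability assigned to the true class."""
    probs = softmax(logits)
    return -np.mean(np.log(probs[np.arange(len(labels)), labels]))

logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2,  3.0]])   # made-up network outputs
labels = np.array([0, 2])               # true classes
print(cross_entropy(logits, labels))
```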
Since the function maps a vector and a specific index i to a real value, the derivative needs to take the index into account:

\frac{\partial}{\partial q_k}\sigma(\mathbf{q},i)=\sigma(\mathbf{q},i)(\delta_{ik}-\sigma(\mathbf{q},k)).

This expression is symmetrical in the indexes i,k and thus may also be expressed as

\frac{\partial}{\partial q_k}\sigma(\mathbf{q},i)=\sigma(\mathbf{q},k)(\delta_{ik}-\sigma(\mathbf{q},i)).
Here, the Kronecker delta is used for simplicity (cf. the derivative of a sigmoid function, being expressed via the function itself).
To ensure stable numerical computation, it is common to subtract the maximum value from the input vector. This approach, while not altering the output or the derivative theoretically, enhances stability by directly controlling the largest exponent value computed.
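A minimal sketch of this max-subtraction trick in NumPy (the shift cancels in the ratio, so the output is unchanged in exact arithmetic):

```python
import numpy as np

def softmax_stable(z):
    """Softmax with the max subtracted first, so the largest exponent is exp(0) = 1."""
    z = np.asarray(z, dtype=float)
    shifted = z - z.max()
    e = np.exp(shifted)
    return e / e.sum()

print(softmax_stable([1000.0, 1001.0, 1002.0]))   # no overflow
# A naive np.exp([1000, 1001, 1002]) would overflow to inf and give nan probabilities.
```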
If the function is scaled with the parameter \beta, then these expressions must be multiplied by \beta.
See multinomial logit for a probability model which uses the softmax activation function.
In the field of reinforcement learning, a softmax function can be used to convert values into action probabilities. The function commonly used is:[6]

P_t(a)=\frac{\exp(q_t(a)/\tau)}{\sum_{i=1}^{n}\exp(q_t(i)/\tau)},

where the action value q_t(a) corresponds to the expected reward of following action a, and \tau is called a temperature parameter (in allusion to statistical mechanics). For high temperatures (\tau\to\infty), all actions have nearly the same probability, and the lower the temperature, the more the expected rewards affect the probability. For a low temperature (\tau\to0^{+}), the probability of the action with the highest expected reward tends to 1.
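A small sketch of softmax (Boltzmann) action selection following this scheme, using made-up action-value estimates q_t(a):

```python
import numpy as np

rng = np.random.default_rng(0)

def action_probabilities(q_values, tau):
    """Boltzmann/softmax action-selection probabilities at temperature tau."""
    z = np.asarray(q_values, dtype=float) / tau
    z = z - z.max()                     # numerical stability
    e = np.exp(z)
    return e / e.sum()

q = [1.0, 2.0, 1.5]                     # made-up expected rewards per action
for tau in (10.0, 1.0, 0.1):
    p = action_probabilities(q, tau)
    print(tau, p, "sampled action:", rng.choice(len(q), p=p))
# High tau: nearly uniform; low tau: mass concentrates on the best action.
```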
In neural network applications, the number of possible outcomes K is often large, e.g. in the case of neural language models that predict the most likely outcome out of a vocabulary which might contain millions of possible words.[7] This can make the calculations for the softmax layer (i.e. the matrix multiplications to determine the z_i, followed by the application of the softmax function itself) computationally expensive.
Approaches that reorganize the softmax layer for more efficient calculation include the hierarchical softmax and the differentiated softmax. The hierarchical softmax (introduced by Morin and Bengio in 2005) uses a binary tree structure where the outcomes (vocabulary words) are the leaves and the intermediate nodes are suitably selected "classes" of outcomes, forming latent variables.[9] The desired probability (softmax value) of a leaf (outcome) can then be calculated as the product of the probabilities of all nodes on the path from the root to that leaf. Ideally, when the tree is balanced, this would reduce the computational complexity from O(K) to O(\log_2 K).
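The following sketch illustrates the root-to-leaf product for a toy hierarchical softmax, assuming a complete binary tree over the outcomes and one hypothetical weight vector per internal node; it is an illustration of the idea, not Morin and Bengio's exact model:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def leaf_probability(leaf, depth, node_weights, h):
    """P(leaf | h) as a product of binary decisions on the root-to-leaf path.

    The tree is a complete binary tree with 2**depth leaves; the path to a leaf
    is read off from the bits of its index, and node_weights maps each internal
    node (identified by its path prefix) to a weight vector.
    """
    prob, prefix = 1.0, ""
    for bit in format(leaf, f"0{depth}b"):
        p_right = sigmoid(node_weights[prefix] @ h)
        prob *= p_right if bit == "1" else (1.0 - p_right)
        prefix += bit
    return prob

depth, d = 3, 4                          # 2**3 = 8 outcomes, 4-dim hidden vector
rng = np.random.default_rng(0)
node_weights = {}
for level in range(depth):               # one weight vector per internal node
    for i in range(2 ** level):
        prefix = format(i, f"0{level}b") if level else ""
        node_weights[prefix] = rng.normal(size=d)
h = rng.normal(size=d)

probs = [leaf_probability(leaf, depth, node_weights, h) for leaf in range(2 ** depth)]
print(np.round(probs, 3), sum(probs))    # the 8 leaf probabilities sum to 1
```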
A second kind of remedy is based on approximating the softmax (during training) with modified loss functions that avoid the calculation of the full normalization factor. These include methods that restrict the normalization sum to a sample of outcomes (e.g. Importance Sampling, Target Sampling).
Geometrically the softmax function maps the vector space \R^K to the (open) interior of the standard (K-1)-simplex, cutting the dimension by one (the range is a (K-1)-dimensional simplex in K-dimensional space), due to the linear constraint that the outputs sum to 1, meaning they lie on a hyperplane. Along the main diagonal (x,x,\dots,x), softmax is just the uniform distribution on outputs, (1/K,\dots,1/K): equal scores yield equal probabilities.
More generally, softmax is invariant under translation by the same value in each coordinate: adding c=(c,\dots,c) to the inputs z yields \sigma(z+c)=\sigma(z), because the translation multiplies each exponential by the same factor e^{c} (since e^{z_i+c}=e^{z_i}\cdot e^{c}), so the ratios do not change:

\sigma(z+c)_j=\frac{e^{z_j+c}}{\sum_{k=1}^{K}e^{z_k+c}}=\frac{e^{z_j}\cdot e^{c}}{\sum_{k=1}^{K}e^{z_k}\cdot e^{c}}=\sigma(z)_j.
Geometrically, softmax is constant along diagonals: this is the dimension that is eliminated, and corresponds to the softmax output being independent of a translation in the input scores (a choice of 0 score). One can normalize input scores by assuming that the sum is zero (subtract the average c=\tfrac{1}{K}\sum_k z_k), and then the softmax takes the hyperplane of points that sum to zero to the open simplex of positive values that sum to 1, analogously to how the exponential takes 0 to 1 (e^{0}=1) and is positive.
By contrast, softmax is not invariant under scaling. For instance,

\sigma\bigl((0,1)\bigr)=\bigl(1/(1+e),\,e/(1+e)\bigr)\quad\text{but}\quad\sigma\bigl((0,2)\bigr)=\bigl(1/(1+e^{2}),\,e^{2}/(1+e^{2})\bigr).
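Both properties (translation invariance, but not scale invariance) are easy to verify numerically, as in this short sketch:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([0.0, 1.0])
print(softmax(z), softmax(z + 5.0))   # identical: translation invariance
print(softmax(z), softmax(2.0 * z))   # different: not scale invariant
```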
The standard logistic function is the special case for a 1-dimensional axis in 2-dimensional space, say the x-axis in the (x,y)-plane. One variable is fixed at 0 (say z_2=0), so e^{0}=1, and the other variable can vary, denoted z_1=x; the first output is then e^{x}/\left(e^{x}+1\right), the standard logistic function, and the second is 1/\left(e^{x}+1\right), its complement. The 1-dimensional input could alternatively be expressed as the line (x/2,-x/2), with outputs

e^{x/2}/\left(e^{x/2}+e^{-x/2}\right)=e^{x}/\left(e^{x}+1\right)\quad\text{and}\quad e^{-x/2}/\left(e^{x/2}+e^{-x/2}\right)=1/\left(e^{x}+1\right).
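A quick numerical check of this special case (a small NumPy sketch):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

for x in (-3.0, 0.0, 2.5):
    # The first softmax component of (x, 0) equals the standard logistic of x.
    print(softmax(np.array([x, 0.0]))[0], logistic(x))
```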
The softmax function is also the gradient of the LogSumExp function, a smooth maximum:

\frac{\partial}{\partial z_i}\operatorname{LSE}(z)=\frac{\exp z_i}{\sum_{j=1}^{K}\exp z_j}=\sigma(z)_i,\qquad i=1,\dots,K,\quad z=(z_1,\dots,z_K)\in\R^K,
where the LogSumExp function is defined as

\operatorname{LSE}(z_1,\dots,z_n)=\log\left(\exp(z_1)+\cdots+\exp(z_n)\right).
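As a sanity check, this gradient identity can be verified numerically with central finite differences; the sketch below compares the numerical gradient of LSE at an arbitrary point with the softmax of the same vector:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def logsumexp(z):
    m = np.max(z)
    return m + np.log(np.sum(np.exp(z - m)))

z = np.array([0.3, -1.2, 2.0])
eps = 1e-6
grad = np.array([
    (logsumexp(z + eps * np.eye(len(z))[i]) - logsumexp(z - eps * np.eye(len(z))[i]))
    / (2 * eps)
    for i in range(len(z))
])
print(grad)        # finite-difference gradient of LSE
print(softmax(z))  # matches the softmax of z to high precision
```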
The softmax function was used in statistical mechanics as the Boltzmann distribution in a foundational paper,[10] and it was formalized and popularized in the influential textbook by Gibbs (1902).[11]
The use of the softmax in decision theory is credited to R. Duncan Luce,[12] who used the axiom of independence of irrelevant alternatives in rational choice theory to deduce the softmax in Luce's choice axiom for relative preferences.
In machine learning, the term "softmax" is credited to John S. Bridle in two 1989 conference papers.[12]
For the example input, the softmax output has most of its weight where the "4" (the largest entry) was in the original input. This is what the function is normally used for: to highlight the largest values and suppress values which are significantly below the maximum value. But note: a change of temperature changes the output. When the temperature is multiplied by 10, the inputs are effectively divided by 10, and the softmax output becomes much closer to uniform. This shows that high temperatures de-emphasize the maximum value.
Computation of this example using Python code:
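(The exact numbers of the original example are not preserved in this copy, so the sketch below uses an illustrative seven-component input whose largest entry is 4, and assumes a temperature parameter T that scales the inputs as z/T.)

```python
import numpy as np

def softmax_with_temperature(z, T=1.0):
    """Softmax of z / T; higher T flattens the distribution."""
    z = np.asarray(z, dtype=float) / T
    e = np.exp(z - z.max())
    return e / e.sum()

x = np.array([1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0])   # illustrative input
print(softmax_with_temperature(x, T=1.0))    # weight concentrates on the 4
print(softmax_with_temperature(x, T=10.0))   # much closer to uniform
```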
The softmax function generates probability predictions densely distributed over its support. Other functions like sparsemax or α-entmax can be used when sparse probability predictions are desired.[13]
Gibbs, Josiah Willard (1902). Elementary Principles in Statistical Mechanics.