The principle of transformation groups is a methodology for assigning prior probabilities in statistical inference issues, initially proposed by physicist E. T. Jaynes.[1] It is regarded as an extension of the principle of indifference.
Prior probabilities determined by this principle are objective in that they rely solely on the inherent characteristics of the problem, ensuring that any two individuals applying the principle to the same issue would assign identical prior probabilities. Thus, this principle is integral to the objective Bayesian interpretation of probability.
The principle is motivated by the following normative principle, or desideratum:
In scenarios where the prior information is identical, individuals should assign the same prior probabilities.
This rule is implemented by identifying symmetries, defined by transformation groups, that allow a problem to be converted into an equivalent one, and then utilizing these symmetries to calculate the prior probabilities.
For problems with discrete variables (such as dice, cards, or categorical data), symmetries are characterized by permutation groups and, in these instances, the principle simplifies to the principle of indifference. In cases involving continuous variables, the symmetries may be represented by other types of transformation groups. Determining the prior probabilities in such cases often requires solving a differential equation, which may not yield a unique solution. However, many continuous variable problems do have prior probabilities which are uniquely defined by the principle of transformation groups, which Jaynes referred to as "well-posed" problems.
Consider a coin with sides head (H) and tail (T). Denote this information by I. The prior probabilities to be assigned are P(H|I) and P(T|I), the probabilities of heads and tails respectively.
In applying the desideratum, consider the information contained in the problem as framed: it describes no distinction between heads and tails. Given no other information, the labels "head" and "tail" are interchangeable. Application of the desideratum then demands that
P(H|I)=P(T|I)

Since {H, T} is exhaustive and mutually exclusive, the two probabilities must sum to one, so P(H|I) = P(T|I) = 1/2. This argument extends to N categories, giving the "flat" prior probability 1/N.
This provides a consistency-based argument for the principle of indifference: If someone is truly ignorant about a discrete or countable set of outcomes apart from their potential existence but does not assign them equal prior probabilities, then they are assigning different probabilities when given the same information.
Alternatively, this can be phrased as: someone who does not use the principle of indifference to assign prior probabilities to discrete variables, either has information about those variables, or is reasoning inconsistently.
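As an informal illustration (not part of the original argument), the following Python sketch checks numerically that the flat assignment is left unchanged by every relabelling of a small set of outcomes, while a non-flat assignment is not; the helper name invariant_under_all_permutations is ours.

```python
import itertools
import numpy as np

# Sketch: the only prior over N exchangeable outcomes that is unchanged by
# every permutation (relabelling) of the outcomes is the flat 1/N assignment.
N = 4

def invariant_under_all_permutations(p, tol=1e-12):
    """Return True if the prior p equals itself under every relabelling."""
    return all(np.allclose(p, p[list(perm)], atol=tol)
               for perm in itertools.permutations(range(N)))

flat_prior = np.full(N, 1.0 / N)
asymmetric_prior = np.array([0.4, 0.3, 0.2, 0.1])   # some non-flat prior

print(invariant_under_all_permutations(flat_prior))        # True
print(invariant_under_all_permutations(asymmetric_prior))  # False
```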
This is the easiest example for continuous variables. It is given by stating that one is "ignorant" of the location parameter in a given problem. The statement that a parameter μ is a "location parameter" means that the sampling distribution, or likelihood, of an observation X depends on μ only through the difference X − μ:

p(X|\mu,I)=f(X-\mu)

for some normalized probability distribution f(⋅). Note that the given information specifies only that f(⋅) is normalized; the particular form of f(⋅) plays no role in the argument below, so the prior assigned to μ cannot depend on it.
Examples of location parameters include the mean parameter of a normal distribution with known variance, and the median parameter of a Cauchy distribution with a known interquartile range.
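As a quick numerical check (a sketch, not from the original text), the snippet below verifies that both cited likelihoods depend on X and μ only through X − μ, by confirming they are unchanged when observation and parameter are shifted by the same amount; the particular variance and interquartile range are arbitrary choices.

```python
import numpy as np
from scipy.stats import norm, cauchy

# Sketch: both likelihoods below are functions of X - mu only, so shifting
# the observation and the location parameter by the same amount b leaves
# the likelihood value unchanged.
known_sd = 2.0        # known standard deviation of the normal
known_scale = 1.5     # Cauchy scale = half of the known interquartile range

def normal_likelihood(x, mu):
    return norm.pdf(x, loc=mu, scale=known_sd)

def cauchy_likelihood(x, mu):
    return cauchy.pdf(x, loc=mu, scale=known_scale)

x, mu, b = 1.2, -0.4, 3.7
for likelihood in (normal_likelihood, cauchy_likelihood):
    assert np.isclose(likelihood(x, mu), likelihood(x + b, mu + b))
print("Both likelihoods depend only on X - mu.")
```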
The two "equivalent problems" in this case, given one's knowledge of the sampling distribution
p(X|\mu,I)=f(X-\mu)
\mu
\mu
f(X-\mu)=f([X+b]-[\mu+b])=f(X(1)-\mu(1))
"Shifting" all quantities up by some number b and solving in the "shifted space" and then "shifting" back to the original one should give exactly the same answer as if we just worked on the original space. Making the transformation from
\mu
\mu(1)
g(\mu)=p(\mu|I)
g(\mu)=\left|{\partial\mu(1)\over\partial\mu}\right|g(\mu(1))=g(\mu+b)
And the only function that satisfies this equation is the "constant prior":
p(\mu|I)\propto1
Therefore, the uniform prior is justified for expressing complete ignorance about a continuous location parameter, at least over a finite range on which such a prior can be normalized.
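To make the invariance concrete, here is a minimal numerical sketch of our own construction, assuming a normal likelihood with known standard deviation and a wide grid standing in for the flat prior: the posterior obtained by solving the shifted problem coincides with the shifted original posterior.

```python
import numpy as np
from scipy.stats import norm

# Sketch: with the flat prior, "shift data and parameter, solve, shift back"
# reproduces the posterior of the original problem exactly.
sigma = 1.0                                  # known standard deviation
data = np.array([0.3, -1.1, 0.7])
b = 5.0                                      # arbitrary shift
mu_grid = np.linspace(-10.0, 10.0, 2001)

def posterior_flat_prior(x, grid):
    log_lik = norm.logpdf(x[:, None], loc=grid[None, :], scale=sigma).sum(axis=0)
    post = np.exp(log_lik - log_lik.max())   # flat prior: posterior proportional to likelihood
    return post / (post.sum() * (grid[1] - grid[0]))

original = posterior_flat_prior(data, mu_grid)
shifted = posterior_flat_prior(data + b, mu_grid + b)
print(np.allclose(original, shifted))        # True
```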
As in the above argument, a statement that σ is a scale parameter means that the sampling distribution has the functional form

p(X|\sigma,I)={1\over\sigma}f\left({X\over\sigma}\right)

where, as before, f(⋅) is a normalized probability distribution and σ > 0. The relevant symmetry is found by rescaling both the observation and the parameter by the same factor a:

{X\over\sigma}={aX\over a\sigma}={X^{(1)}\over\sigma^{(1)}},\qquad X^{(1)}=aX,\quad\sigma^{(1)}=a\sigma,\quad a>0

Unlike the location parameter case, the Jacobian of this transformation, in both the sample space and the parameter space, is a rather than 1, so the sampling distribution changes to

p(X^{(1)}|\sigma^{(1)},I)={1\over a}\cdot{1\over\sigma}f\left({aX\over a\sigma}\right)={1\over\sigma^{(1)}}f\left({X^{(1)}\over\sigma^{(1)}}\right)
which is invariant (i.e., has the same form before and after the transformation). Requiring that the prior be similarly unchanged under this rescaling gives the functional equation

p(\sigma|I)={1\over a}\,p\!\left({\sigma\over a}\,\Big|\,I\right)
which has the unique solution (up to proportionality)
p(\sigma|I)\propto{1\over\sigma}\quad\implies\quad p(\log\sigma\,|\,I)\propto 1
This is the well-known Jeffreys prior for scale parameters, which is "flat" on the log scale, although it is derived via a different argument from the one here, based on the Fisher information. The fact that the two methods give the same result in this case does not imply that they do so in general.
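The same kind of numerical check can be run for the scale case. The sketch below is our own construction, assuming a zero-mean normal likelihood and the 1/σ prior on a grid: it confirms that rescaling the data and the parameter by a factor a changes the posterior density only by the Jacobian 1/a, exactly as the invariance argument requires.

```python
import numpy as np
from scipy.stats import norm

# Sketch: with the 1/sigma prior, the posterior from rescaled data a*X,
# evaluated at a*sigma, equals the original posterior divided by a.
data = np.array([0.8, -2.3, 1.1, 0.4])
a = 3.0                                       # arbitrary rescaling factor
sigma_grid = np.linspace(0.05, 20.0, 4000)

def posterior_jeffreys(x, grid):
    log_lik = norm.logpdf(x[:, None], loc=0.0, scale=grid[None, :]).sum(axis=0)
    log_post = log_lik - np.log(grid)         # prior p(sigma|I) proportional to 1/sigma
    post = np.exp(log_post - log_post.max())
    return post / (post.sum() * (grid[1] - grid[0]))

original = posterior_jeffreys(data, sigma_grid)
rescaled = posterior_jeffreys(a * data, a * sigma_grid)
print(np.allclose(original, a * rescaled))    # True: densities differ only by the Jacobian
```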
Edwin Jaynes used this principle to provide a resolution to Bertrand's Paradox[2] by stating his ignorance about the exact position of the circle.
This argument depends crucially on I; changing the information may result in a different probability assignment. To illustrate, suppose that the coin flipping example also states, as part of the information, that the coin has a side (S) (i.e., it is a real coin). Denote this new information by N. The same argument, using only the information actually described, now gives

P(H|I,N)=P(T|I,N)=P(S|I,N)=1/3
Intuition, however, tells us that P(S) should be very close to zero. This is because most people perceive no symmetry between a coin landing on its side and it landing heads or tails; the intuition draws on unstated background information, such as the thickness of the coin. A thick, nearly cylindrical coin is far more likely to land on its side than a thin one, so we expect

P(S|thin coin) ≠ P(S|thick coin)
Note that this new information probably wouldn't break the symmetry between "heads" and "tails," so that permutation would still apply in describing "equivalent problems", and we would require:
P(T|thin coin)=P(H|thin coin) ≠ P(H|thick coin)=P(T|thick coin)
This is a good example of how the principle of transformation groups can be used to "flesh out" personal opinions. All of the information used in the derivation is explicitly stated. If a prior probability assignment does not "seem right" according to intuition, then there must be some "background information" that has not been put into the problem.[3] The task is then to work out what that information is. In this sense, combining the method of transformation groups with one's intuition can be used to "weed out" the assumptions one actually holds, which makes it a powerful tool for prior elicitation.
Introducing the thickness of the coin as a variable is permissible because its existence was implied (by being a real coin) but its value was not specified in the problem. Introducing a "nuisance parameter" and then making the answer invariant to this parameter is a very useful technique for solving supposedly "ill-posed" problems like Bertrand's Paradox. This has been called "the well-posing strategy" by some.[4]
A strength of this principle lies in its application to continuous parameters, where the notion of "complete ignorance" is not as well-defined as in the discrete case. However, if applied with infinite limits, it often gives improper prior distributions. Note that the discrete case for a countably infinite set, such as {0, 1, 2, ...}, also produces an improper (flat) prior. An improper prior can often be treated as the limit of a sequence of proper priors: for a location parameter M, for example, one can use the uniform prior on [−b, b],

f(M)={I(M\in[-b,b])\over 2b}

apply Bayes' theorem, and then take the limit b → ∞.
If this limit does not exist or diverges, then the result is an improper posterior (i.e., a posterior that does not integrate to one). This indicates that the data are so uninformative about the parameters that the prior probability of arbitrarily large values still matters in the final answer. In some sense, an improper posterior means that the information contained in the data has not "ruled out" arbitrarily large values. Looking at improper priors this way, it seems reasonable that "complete ignorance" priors should be improper, because the information used to derive them is so meagre that it cannot rule out absurd values on its own. From a state of complete ignorance, only the data or some other form of additional information can rule out such absurdities.
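As an illustration of this limiting-procedure reading (a sketch under the assumption of a single normal observation with known unit standard deviation), the posterior below is computed with proper uniform priors on [−b, b] for two values of b; because the data rule out large values of the location, the posterior is already insensitive to b, so the limit exists.

```python
import numpy as np
from scipy.stats import norm

# Sketch: posterior for a location M under a proper uniform prior on [-b, b];
# if the data are informative, the posterior stabilises as b grows, so the
# improper flat prior can be treated as the limit b -> infinity.
x_obs = 1.3
grid = np.linspace(-50.0, 50.0, 20001)
dx = grid[1] - grid[0]

def posterior_box_prior(b):
    prior = np.where(np.abs(grid) <= b, 1.0 / (2.0 * b), 0.0)
    post = norm.pdf(x_obs, loc=grid, scale=1.0) * prior
    return post / (post.sum() * dx)

difference = np.abs(posterior_box_prior(10.0) - posterior_box_prior(40.0)).max()
print(difference)   # negligibly small: the limiting posterior exists here
```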