Information content
In information theory, the information content, self-information, surprisal, or Shannon information is a basic quantity derived from the probability of a particular event occurring from a random variable. It can be thought of as an alternative way of expressing probability, much like odds or log-odds, but which has particular mathematical advantages in the setting of information theory.
The Shannon information can be interpreted as quantifying the level of "surprise" of a particular outcome. As it is such a basic quantity, it also appears in several other settings, such as the length of a message needed to transmit the event given an optimal source coding of the random variable.
The Shannon information is closely related to entropy, which is the expected value of the self-information of a random variable, quantifying how surprising the random variable is "on average". This is the average amount of self-information an observer would expect to gain about a random variable when measuring it.[1]
The information content can be expressed in various units of information, of which the most common is the "bit" (more formally called the shannon), as explained below.
The term 'perplexity' has been used in language modelling to quantify the uncertainty inherent in a set of prospective events.
Definition
Claude Shannon's definition of self-information was chosen to meet several axioms:
- An event with probability 100% is perfectly unsurprising and yields no information.
- The less probable an event is, the more surprising it is and the more information it yields.
- If two independent events are measured separately, the total amount of information is the sum of the self-informations of the individual events.
The detailed derivation is below, but it can be shown that there is a unique function of probability that meets these three axioms, up to a multiplicative scaling factor. Broadly, given a real number b > 1 and an event x with probability P, the information content is defined as follows:

\operatorname{I}(x) := -\log_b[\Pr(x)] = -\log_b(P)

The base b corresponds to the scaling factor above. Different choices of b correspond to different units of information: when b = 2, the unit is the shannon (symbol Sh), often called a 'bit'; when b = e, the unit is the natural unit of information (symbol nat); and when b = 10, the unit is the hartley (symbol Hart).
Formally, given a discrete random variable X with probability mass function p_X(x), the self-information of measuring X as outcome x is defined as

\operatorname{I}_X(x) := -\log[p_X(x)] = \log\left(\frac{1}{p_X(x)}\right) [2]

The use of the notation \operatorname{I}_X(x) for self-information above is not universal. Since the notation \operatorname{I}(X;Y) is also often used for the related quantity of mutual information, many authors use a lowercase h_X(x) for self-entropy instead, mirroring the use of the capital \Eta(X) for the entropy.
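A minimal Python sketch of this definition (illustrative only; the function name self_information is an arbitrary choice, not a standard library routine). The base of the logarithm selects the unit: 2 for shannons, e for nats, 10 for hartleys.

```python
import math

def self_information(p: float, base: float = 2.0) -> float:
    """Self-information log_b(1/p) of an event with probability p.

    base=2 gives shannons (bits), base=math.e gives nats, base=10 gives hartleys.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return math.log(1.0 / p, base)   # equivalently -log_b(p)

# A certain event carries no information; rarer events carry more.
print(self_information(1.0))          # 0.0 Sh
print(self_information(0.5))          # 1.0 Sh
print(self_information(0.5, math.e))  # ~0.693 nat
print(self_information(0.5, 10))      # ~0.301 Hart
```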
Properties
Monotonically decreasing function of probability
For a given probability space, the measurement of rarer events is intuitively more "surprising", and yields more information content, than the measurement of more common values. Thus, self-information is a strictly decreasing monotonic function of probability, sometimes called an "antitonic" function.
While standard probabilities are represented by real numbers in the interval [0, 1], self-informations are represented by extended real numbers in the interval [0, \infty]. In particular, we have the following, for any choice of logarithmic base:
- If a particular event has a 100% probability of occurring, then its self-information is -\log(1) = 0: its occurrence is "perfectly non-surprising" and yields no information.
- If a particular event has a 0% probability of occurring, then its self-information is -\log(0) = \infty: its occurrence is "infinitely surprising".
From this, we can get a few general properties:
- Intuitively, more information is gained from observing an unexpected event—it is "surprising".
- For example, if there is a one-in-a-million chance of Alice winning the lottery, her friend Bob will gain significantly more information from learning that she won than that she lost on a given day. (See also Lottery mathematics; a numerical sketch follows this list.)
- This establishes an implicit relationship between the self-information of a random variable and its variance.
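To put rough numbers on the lottery illustration, here is a small numerical sketch in Python (the one-in-a-million chance is the hypothetical figure from the example above):

```python
import math

p_win = 1e-6               # hypothetical probability that Alice wins
p_lose = 1 - p_win

print(-math.log2(p_win))   # ~19.93 Sh: learning she won is very surprising
print(-math.log2(p_lose))  # ~0.0000014 Sh: learning she lost conveys almost nothing
```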
Relationship to log-odds
The Shannon information is closely related to the log-odds. In particular, given some event x, suppose that p(x) is the probability of x occurring, and that p(\lnot x) = 1 - p(x) is the probability of x not occurring. Then we have the following definition of the log-odds:

\text{log-odds}(x) = \log\left(\frac{p(x)}{p(\lnot x)}\right)

This can be expressed as a difference of two Shannon informations:

\text{log-odds}(x) = \operatorname{I}(\lnot x) - \operatorname{I}(x)
In other words, the log-odds can be interpreted as the level of surprise when the event doesn't happen, minus the level of surprise when the event does happen.
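This identity can be checked numerically; the following sketch uses an arbitrary probability of 0.8 for the event x:

```python
import math

p = 0.8                          # arbitrary probability of the event x
q = 1 - p                        # probability of x not occurring

log_odds = math.log2(p / q)
surprisal_not_x = -math.log2(q)  # I(not x)
surprisal_x = -math.log2(p)      # I(x)

# The log-odds equals I(not x) - I(x).
assert math.isclose(log_odds, surprisal_not_x - surprisal_x)
print(log_odds)                  # 2.0 Sh for p = 0.8
```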
Additivity of independent events
The information content of two independent events is the sum of each event's information content. This property is known as additivity in mathematics, and sigma additivity in particular in measure and probability theory. Consider two independent random variables X and Y with probability mass functions p_X(x) and p_Y(y) respectively. The joint probability mass function is

p_{X,Y}(x, y) = \Pr(X = x,\, Y = y) = p_X(x)\, p_Y(y)

because X and Y are independent. The information content of the outcome (X, Y) = (x, y) is

\operatorname{I}_{X,Y}(x, y) = -\log_2\left[p_{X,Y}(x, y)\right] = -\log_2\left[p_X(x)\, p_Y(y)\right] = -\log_2\left[p_X(x)\right] - \log_2\left[p_Y(y)\right] = \operatorname{I}_X(x) + \operatorname{I}_Y(y)

See Two independent, identically distributed dice below for an example.
The corresponding property for likelihoods is that the log-likelihood of independent events is the sum of the log-likelihoods of each event. Interpreting log-likelihood as "support" or negative surprisal (the degree to which an event supports a given model: a model is supported by an event to the extent that the event is unsurprising, given the model), this states that independent events add support: the information that the two events together provide for statistical inference is the sum of their independent information.
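As a sketch of this log-likelihood reading of additivity (the Bernoulli model and the four observations below are invented purely for illustration), the log-likelihood of independent observations is the sum of the per-observation log-likelihoods:

```python
import math

theta = 0.7                  # hypothetical model: coin with bias theta
observations = [1, 0, 1, 1]  # invented data: 1 = heads, 0 = tails

# Log-likelihood ("support") contributed by each independent observation.
per_obs_loglik = [math.log(theta if x == 1 else 1 - theta) for x in observations]

# Probability of the whole sample under the model.
joint_prob = math.prod(theta if x == 1 else 1 - theta for x in observations)

# Support of the sample equals the sum of the individual supports.
assert math.isclose(sum(per_obs_loglik), math.log(joint_prob))
print(sum(per_obs_loglik))
```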
Relationship to entropy
The Shannon entropy of the random variable X above is defined as

\Eta(X) = \sum_x -p_X(x) \log p_X(x) = \sum_x p_X(x) \operatorname{I}_X(x) = \operatorname{E}[\operatorname{I}_X(X)]

by definition equal to the expected information content of measurement of X.[3] [4] The expectation is taken over the discrete values over its support.
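The following sketch computes the entropy directly as the probability-weighted average of the self-informations, using an arbitrary three-outcome distribution:

```python
import math

pmf = {"a": 0.5, "b": 0.25, "c": 0.25}   # arbitrary example distribution

def self_info(p: float) -> float:
    return -math.log2(p)                 # self-information in shannons

# H(X) = E[I_X(X)]: average self-information weighted by probability.
entropy = sum(p * self_info(p) for p in pmf.values())
print(entropy)                           # 1.5 Sh for this distribution
```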
Sometimes, the entropy itself is called the "self-information" of the random variable, possibly because the entropy satisfies \Eta(X) = \operatorname{I}(X; X), where \operatorname{I}(X; X) is the mutual information of X with itself.[5]

For continuous random variables the corresponding concept is differential entropy.
Notes
This measure has also been called surprisal, as it represents the "surprise" of seeing the outcome (a highly improbable outcome is very surprising). This term (as a log-probability measure) was coined by Myron Tribus in his 1961 book Thermostatics and Thermodynamics.[6] [7]
When the event is a random realization (of a variable) the self-information of the variable is defined as the expected value of the self-information of the realization.
Self-information is an example of a proper scoring rule.
Examples
Fair coin toss
Consider the Bernoulli trial of tossing a fair coin X. The probabilities of the events of the coin landing as heads H and tails T (see fair coin and obverse and reverse) are one half each, p_X(H) = p_X(T) = 1/2 = 0.5. Upon measuring the variable as heads, the associated information gain is

\operatorname{I}_X(H) = -\log_2 p_X(H) = -\log_2 \tfrac{1}{2} = 1,

so the information gain of a fair coin landing as heads is 1 shannon. Likewise, the information gain of measuring tails T is

\operatorname{I}_X(T) = -\log_2 p_X(T) = -\log_2 \tfrac{1}{2} = 1 \text{ Sh}.
Fair die roll
Suppose we have a fair six-sided die. The value of a die roll is a discrete uniform random variable X \sim \mathrm{DU}[1, 6] with probability mass function

p_X(k) = \begin{cases} \tfrac{1}{6}, & k \in \{1, 2, 3, 4, 5, 6\} \\ 0, & \text{otherwise} \end{cases}

The probability of rolling a 4 is 1/6, as for any other valid roll. The information content of rolling a 4 is thus

\operatorname{I}_X(4) = -\log_2 \tfrac{1}{6} = \log_2 6 \approx 2.585 \text{ Sh}

of information.
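A quick numerical check of the coin and die figures above (a sketch using only the stated probabilities):

```python
import math

# Fair coin: each outcome has probability 1/2.
print(-math.log2(1 / 2))   # 1.0 Sh

# Fair six-sided die: each outcome has probability 1/6.
print(-math.log2(1 / 6))   # ~2.585 Sh
```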
Two independent, identically distributed dice
Suppose we have two independent, identically distributed random variables X, Z \sim \mathrm{DU}[1, 6], each corresponding to an independent fair 6-sided die roll. The joint distribution of X and Z is

p_{X,Z}(x, z) = \Pr(X = x,\, Z = z) = p_X(x)\, p_Z(z) = \tfrac{1}{6} \cdot \tfrac{1}{6} = \tfrac{1}{36}

The information content of the random variate (X, Z) = (x, z) is

\operatorname{I}_{X,Z}(x, z) = -\log_2\left[p_{X,Z}(x, z)\right] = -\log_2 \tfrac{1}{36} = \log_2 36 \approx 5.169925 \text{ Sh}

and can also be calculated by additivity of events:

\operatorname{I}_{X,Z}(x, z) = -\log_2\left[p_X(x)\, p_Z(z)\right] = -\log_2 p_X(x) - \log_2 p_Z(z) = 2 \log_2 6 \approx 5.169925 \text{ Sh}
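The additivity computation above can be verified with a short sketch:

```python
import math

p_x = 1 / 6                    # P(X = x) for one fair die
p_z = 1 / 6                    # P(Z = z) for the other die

joint_info = -math.log2(p_x * p_z)               # information of the pair (x, z)
sum_of_infos = -math.log2(p_x) - math.log2(p_z)  # sum of the individual informations

assert math.isclose(joint_info, sum_of_infos)
print(joint_info)              # ~5.169925 Sh
```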
Information from frequency of rolls
If we receive information about the value of the dice without knowledge of which die had which value, we can formalize the approach with so-called counting variables

C_k := \delta_k(X) + \delta_k(Z)

for k \in \{1, 2, 3, 4, 5, 6\}, where \delta_k(\cdot) is the indicator that a die shows the value k; then \sum_{k=1}^{6} C_k = 2 and the counts have the multinomial distribution

f(c_1, \ldots, c_6) = \Pr(C_1 = c_1 \text{ and } \ldots \text{ and } C_6 = c_6) = \begin{cases} \dfrac{2!}{c_1! \cdots c_6!} \cdot \dfrac{1}{36}, & \text{when } \sum_{k=1}^{6} c_k = 2 \\ 0, & \text{otherwise} \end{cases}

To verify this, the 6 outcomes (X, Z) \in \{(k, k)\}_{k=1}^{6} correspond to the event C_k = 2 and a total probability of 6 \cdot \tfrac{1}{36} = \tfrac{1}{6}. These are the only events that are faithfully preserved with identity of which die rolled which outcome because the outcomes are the same. Without knowledge to distinguish the dice rolling the other numbers, the other \binom{6}{2} = 15 combinations correspond to one die rolling one number and the other die rolling a different number, each having probability \tfrac{2}{36} = \tfrac{1}{18}. Indeed, 6 \cdot \tfrac{1}{36} + 15 \cdot \tfrac{1}{18} = 1, as required.

Unsurprisingly, the information content of learning that both dice were rolled as the same particular number is more than the information content of learning that one die was one number and the other was a different number. Take for examples the events A_k = \{(X, Z) = (k, k)\} and B_{j,k} = \{c_j = 1\} \cap \{c_k = 1\} for j \ne k, 1 \le j, k \le 6. For example, A_2 = \{X = 2 \text{ and } Z = 2\} and B_{3,4} = \{(3, 4), (4, 3)\}.

The information contents are

\operatorname{I}(A_2) = -\log_2 \tfrac{1}{36} \approx 5.169925 \text{ Sh}
\operatorname{I}(B_{3,4}) = -\log_2 \tfrac{1}{18} \approx 4.169925 \text{ Sh}

Let \mathrm{Same} be the event that both dice rolled the same value and \mathrm{Diff} be the event that the dice differed. Then \Pr(\mathrm{Same}) = \tfrac{1}{6} and \Pr(\mathrm{Diff}) = \tfrac{5}{6}. The information contents of the events are

\operatorname{I}(\mathrm{Same}) = -\log_2 \tfrac{1}{6} \approx 2.5849625 \text{ Sh}
\operatorname{I}(\mathrm{Diff}) = -\log_2 \tfrac{5}{6} \approx 0.2630344 \text{ Sh}
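The probabilities and information contents quoted above can be verified by enumerating all 36 equally likely ordered outcomes (a sketch):

```python
import math
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))   # 36 ordered (X, Z) pairs

same = [o for o in outcomes if o[0] == o[1]]      # 6 outcomes
diff = [o for o in outcomes if o[0] != o[1]]      # 30 outcomes

print(-math.log2(1 / 36))            # I(A_k)      ~5.169925 Sh
print(-math.log2(2 / 36))            # I(B_{j,k})  ~4.169925 Sh
print(-math.log2(len(same) / 36))    # I(Same)     ~2.5849625 Sh
print(-math.log2(len(diff) / 36))    # I(Diff)     ~0.2630344 Sh
```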
Information from sum of dice
The probability mass or density function (collectively probability measure) of the sum of two independent random variables is the convolution of each probability measure. In the case of independent fair 6-sided dice rolls, the random variable Y := X + Z has probability mass function p_Y(y) = p_X(x) * p_Z(z), where * represents the discrete convolution. The outcome Y = 5 has probability p_Y(5) = \tfrac{4}{36} = \tfrac{1}{9}. Therefore, the information asserted is

\operatorname{I}_Y(5) = -\log_2 \tfrac{1}{9} = \log_2 9 \approx 3.169925 \text{ Sh}
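The convolution and the resulting information content can be reproduced with a short sketch:

```python
import math
from itertools import product

# pmf of the sum Y = X + Z of two fair dice, built by direct enumeration
# (equivalent to the discrete convolution of the two uniform pmfs).
pmf_y = {}
for x, z in product(range(1, 7), repeat=2):
    pmf_y[x + z] = pmf_y.get(x + z, 0) + 1 / 36

print(pmf_y[5])              # 4/36 = 1/9
print(-math.log2(pmf_y[5]))  # ~3.169925 Sh
```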
General discrete uniform distribution
Generalizing the example above, consider a general discrete uniform random variable (DURV) X \sim \mathrm{DU}[a, b];\ a, b \in \mathbb{Z},\ b \ge a. For convenience, define N := b - a + 1. The probability mass function is

p_X(k) = \begin{cases} \tfrac{1}{N}, & k \in [a, b] \cap \mathbb{Z} \\ 0, & \text{otherwise} \end{cases}

In general, the values of the DURV need not be integers, or for the purposes of information theory even uniformly spaced; they need only be equiprobable. The information gain of any observation X = k is

\operatorname{I}_X(k) = -\log_2 \tfrac{1}{N} = \log_2 N \text{ Sh}
Special case: constant random variable
If b = a above, X degenerates to a constant random variable with probability distribution deterministically given by X = b and probability measure the Dirac measure \delta_b. The only value X can take is deterministically b, so the information content of any measurement of X is

\operatorname{I}_X(b) = -\log_2 1 = 0

In general, there is no information gained from measuring a known value.
Categorical distribution
Generalizing all of the above cases, consider a categorical discrete random variable with support \mathcal{S} = \{s_i\}_{i=1}^{N} and probability mass function given by

p_X(k) = \begin{cases} p_i, & k = s_i \in \mathcal{S} \\ 0, & \text{otherwise} \end{cases}

For the purposes of information theory, the values s \in \mathcal{S} do not have to be numbers; they can be any mutually exclusive events on a measure space of finite measure that has been normalized to a probability measure p. Without loss of generality, we can assume the categorical distribution is supported on the set [N] = \{1, 2, \ldots, N\}; the mathematical structure is isomorphic in terms of probability theory and therefore information theory as well.

The information of the outcome X = x is given as

\operatorname{I}_X(x) = -\log_2 p_X(x)
From these examples, it is possible to calculate the information of any set of independent DRVs with known distributions by additivity.
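As a closing sketch for these examples (the two categorical distributions below are invented for illustration), self-information can be read off any categorical probability mass function, and the information of independent outcomes adds:

```python
import math

# Two independent categorical variables with invented pmfs.
pmf_x = {"red": 0.5, "green": 0.3, "blue": 0.2}
pmf_y = {"yes": 0.9, "no": 0.1}

def info(p: float) -> float:
    return -math.log2(p)     # self-information in shannons

i_joint = info(pmf_x["blue"] * pmf_y["no"])      # joint outcome (blue, no)
i_sum = info(pmf_x["blue"]) + info(pmf_y["no"])  # additivity over independent DRVs

assert math.isclose(i_joint, i_sum)
print(i_joint)               # ~5.644 Sh
```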
Derivation
By definition, information is transferred from an originating entity possessing the information to a receiving entity only when the receiver had not known the information a priori. If the receiving entity had previously known the content of a message with certainty before receiving the message, the amount of information of the message received is zero. Only when the advance knowledge of the content of the message by the receiver is less than 100% certain does the message actually convey information.
For example, quoting a character (the Hippy Dippy Weatherman) of comedian George Carlin:
Weather forecast for tonight: dark.
Continued dark overnight, with widely scattered light by morning.[8]
Assuming that one does not reside near the
polar regions, the amount of information conveyed in that forecast is zero because it is known, in advance of receiving the forecast, that darkness always comes with the night.
Accordingly, the amount of self-information contained in a message conveying content informing an occurrence of event, \omega_n, depends only on the probability of that event.

\operatorname{I}(\omega_n) = f(\operatorname{P}(\omega_n))

for some function f(\cdot) to be determined below. If \operatorname{P}(\omega_n) = 1, then \operatorname{I}(\omega_n) = 0. If \operatorname{P}(\omega_n) < 1, then \operatorname{I}(\omega_n) > 0.
Further, by definition, the measure of self-information is nonnegative and additive. If a message informing of event C is the intersection of two independent events A and B, then the information of event C occurring is that of the compound message of both independent events A and B occurring. The quantity of information of compound message C would be expected to equal the sum of the amounts of information of the individual component messages A and B respectively:

\operatorname{I}(C) = \operatorname{I}(A \cap B) = \operatorname{I}(A) + \operatorname{I}(B)

Because of the independence of events A and B, the probability of event C is

\operatorname{P}(C) = \operatorname{P}(A \cap B) = \operatorname{P}(A) \cdot \operatorname{P}(B)

However, applying function f(\cdot) results in

\operatorname{I}(C) = \operatorname{I}(A) + \operatorname{I}(B)
f(\operatorname{P}(A) \cdot \operatorname{P}(B)) = f(\operatorname{P}(A)) + f(\operatorname{P}(B))
Thanks to work on Cauchy's functional equation, the only monotone functions f(\cdot) having the property such that

f(x \cdot y) = f(x) + f(y)

are the logarithm functions \log_b(x). The only operational difference between logarithms of different bases is that of different scaling constants, so we may assume

f(x) = K \ln(x)

where \ln is the natural logarithm. Since the probabilities of events are always between 0 and 1 and the information associated with these events must be nonnegative, that requires that K < 0.
Taking into account these properties, the self-information \operatorname{I}(\omega_n) associated with outcome \omega_n with probability \operatorname{P}(\omega_n) is defined as:

\operatorname{I}(\omega_n) = -\log(\operatorname{P}(\omega_n)) = \log\left(\frac{1}{\operatorname{P}(\omega_n)}\right)
The smaller the probability of event \omega_n, the larger the quantity of self-information associated with the message that the event indeed occurred. If the above logarithm is base 2, the unit of \operatorname{I}(\omega_n) is the shannon. This is the most common practice. When using the natural logarithm of base e, the unit will be the nat. For the base 10 logarithm, the unit of information is the hartley.
As a quick illustration, the information content associated with an outcome of 4 heads (or any specific outcome) in 4 consecutive tosses of a coin would be 4 shannons (probability 1/16), and the information content associated with getting a result other than the one specified would be ~0.09 shannons (probability 15/16). See above for detailed examples.
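A quick check of the figures in this illustration:

```python
import math

p_specified = (1 / 2) ** 4      # probability 1/16 of one particular 4-toss sequence
p_other = 1 - p_specified       # probability 15/16 of any other result

print(-math.log2(p_specified))  # 4.0 Sh
print(-math.log2(p_other))      # ~0.093 Sh
```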
Notes and References
- Jones, D. S. (1979). Elementary Information Theory. Oxford: Clarendon Press. pp. 11–15.
- McMahon, David M. (2008). Quantum Computing Explained. Hoboken, NJ: Wiley-Interscience. ISBN 9780470181386. OCLC 608622533.
- Borda, Monica (2011). Fundamentals in Information Theory and Coding. Springer. ISBN 978-3-642-20346-6.
- Han, Te Sun; Kobayashi, Kingo (2002). Mathematics of Information and Coding. American Mathematical Society. ISBN 978-0-8218-4256-0.
- Cover, Thomas M.; Thomas, Joy A. (1991). Elements of Information Theory. p. 20.
- Bernstein, R. B.; Levine, R. D. (1972). "Entropy and Chemical Change. I. Characterization of Product (and Reactant) Energy Distributions in Reactive Molecular Collisions: Information and Entropy Deficiency". The Journal of Chemical Physics. 57: 434–449.
- "Myron Tribus". http://www.eoht.info/page/Myron+Tribus
- "A quote by George Carlin". www.goodreads.com. Retrieved 2021-04-01.