Independent and identically distributed random variables explained

In probability theory and statistics, a collection of random variables is independent and identically distributed if each random variable has the same probability distribution as the others and all are mutually independent.[1] This property is usually abbreviated as i.i.d., iid, or IID. IID was first defined in statistics and finds application in different fields such as data mining and signal processing.

Introduction

Statistics commonly deals with random samples. A random sample can be thought of as a set of objects that are chosen randomly. More formally, it is "a sequence of independent, identically distributed (IID) random data points."

In other words, the terms random sample and IID are synonymous. In statistics, "random sample" is the typical terminology, but in probability, it is more common to say "IID."

Application

Independent and identically distributed random variables are often used as an assumption, which tends to simplify the underlying mathematics. In practical applications of statistical modeling, however, this assumption may or may not be realistic.[3]

The i.i.d. assumption is also used in the central limit theorem, which states that the probability distribution of the sum (or average) of i.i.d. variables with finite variance approaches a normal distribution.[4]
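The convergence described by the central limit theorem can be observed numerically. The following is a minimal sketch, assuming i.i.d. draws from an exponential distribution with rate 1; that distribution, the function name `standardized_means`, and the sample sizes are choices made only for illustration.

```python
import math
import random
import statistics

def standardized_means(n, trials, seed=0):
    """Return `trials` standardized means of n i.i.d. Exponential(1) draws."""
    rng = random.Random(seed)
    mu, sigma = 1.0, 1.0  # mean and standard deviation of Exponential(1)
    results = []
    for _ in range(trials):
        sample_mean = sum(rng.expovariate(1.0) for _ in range(n)) / n
        # Centre and scale so that the limiting distribution is N(0, 1).
        results.append((sample_mean - mu) / (sigma / math.sqrt(n)))
    return results

z = standardized_means(n=200, trials=2000)
# Fraction of standardized means within one unit of zero;
# for a standard normal distribution this fraction is about 0.68.
within_one = sum(abs(v) <= 1 for v in z) / len(z)
print(round(statistics.mean(z), 2), round(statistics.stdev(z), 2), round(within_one, 2))
```

Although each draw comes from a strongly skewed distribution, the standardized means should have a mean near 0, a standard deviation near 1, and roughly 68% of their values within one unit of zero.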

The i.i.d. assumption frequently arises in the context of sequences of random variables. Then, "independent and identically distributed" implies that an element in the sequence is independent of the random variables that came before it. In this way, an i.i.d. sequence is different from a Markov sequence, where the probability distribution for the nth random variable is a function of the previous random variable in the sequence (for a first-order Markov sequence). An i.i.d. sequence does not imply that the probabilities for all elements of the sample space or event space must be the same.[5] For example, repeated throws of loaded dice will produce a sequence that is i.i.d., despite the outcomes being biased.
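The contrast between an i.i.d. sequence and a first-order Markov sequence can be sketched with a simulation. The loaded-die weights, the "sticky" Markov rule, and all function names below are hypothetical, chosen only to make the difference visible.

```python
import random

FACES = [1, 2, 3, 4, 5, 6]
LOADED = [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]  # hypothetical bias towards 6

def iid_rolls(n, rng):
    """n independent rolls of the same loaded die."""
    return rng.choices(FACES, weights=LOADED, k=n)

def sticky_rolls(n, rng, stay=0.8):
    """First-order Markov sequence: repeat the previous face with probability `stay`,
    otherwise roll a fair die."""
    seq = [rng.choice(FACES)]
    for _ in range(n - 1):
        seq.append(seq[-1] if rng.random() < stay else rng.choice(FACES))
    return seq

def freq_six_after_six(seq):
    """Relative frequency of a 6 immediately following a 6."""
    followers = [b for a, b in zip(seq, seq[1:]) if a == 6]
    return sum(b == 6 for b in followers) / len(followers)

rng = random.Random(1)
iid = iid_rolls(50_000, rng)
markov = sticky_rolls(50_000, rng)
p_six_iid = iid.count(6) / len(iid)
p_six_markov = markov.count(6) / len(markov)
print(p_six_iid, freq_six_after_six(iid))
print(p_six_markov, freq_six_after_six(markov))
```

For the loaded but i.i.d. die, the frequency of a 6 after a 6 should match the overall frequency of a 6, because earlier rolls carry no information. For the Markov sequence, the two frequencies should differ markedly.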

In signal processing and image processing, the notion of transformation to i.i.d. implies two specifications, the "i.d." part and the "i." part:

i.d. – The signal level must be balanced on the time axis.

i. – The signal spectrum must be flattened, i.e. transformed by filtering (such as deconvolution) to a white noise signal (i.e. a signal where all frequencies are equally present).

Definition

Definition for two random variables

Suppose that the random variables $X$ and $Y$ are defined to assume values in $I \subseteq \mathbb{R}$. Let $F_X(x) = \operatorname{P}(X \leq x)$ and $F_Y(y) = \operatorname{P}(Y \leq y)$ be the cumulative distribution functions of $X$ and $Y$, respectively, and denote their joint cumulative distribution function by $F_{X,Y}(x,y) = \operatorname{P}(X \leq x \land Y \leq y)$.

Two random variables $X$ and $Y$ are identically distributed if and only if $F_X(x) = F_Y(x)$ for all $x \in I$.

Two random variables $X$ and $Y$ are independent if and only if $F_{X,Y}(x,y) = F_X(x)\,F_Y(y)$ for all $x, y \in I$.

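Two random variables $X$ and $Y$ are i.i.d. if they are independent and identically distributed, i.e. if and only if

$$F_X(x) = F_Y(x) \quad \forall x \in I,$$
$$F_{X,Y}(x,y) = F_X(x)\,F_Y(y) \quad \forall x, y \in I.$$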

Definition for more than two random variables

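The definition extends naturally to more than two random variables. We say that $n$ random variables $X_1, \ldots, X_n$ are i.i.d. if they are independent and identically distributed, i.e. if and only if

$$F_{X_1}(x) = F_{X_k}(x) \quad \forall k \in \{1, \ldots, n\} \text{ and } \forall x \in I,$$
$$F_{X_1,\ldots,X_n}(x_1, \ldots, x_n) = F_{X_1}(x_1) \cdot \ldots \cdot F_{X_n}(x_n) \quad \forall x_1, \ldots, x_n \in I,$$

where $F_{X_1,\ldots,X_n}(x_1, \ldots, x_n) = \operatorname{P}(X_1 \leq x_1 \land \ldots \land X_n \leq x_n)$ denotes the joint cumulative distribution function of $X_1, \ldots, X_n$.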
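For discrete random variables on a small finite support, the two conditions in the definition can be checked exactly. The sketch below assumes a two-point support and a hypothetical success probability of 3/10; the helper names `cdfs` and `is_iid` are illustrative only. Checking the cumulative distribution functions at the support points is sufficient here because the variables take no other values.

```python
from fractions import Fraction
from itertools import product

def cdfs(joint, support):
    """Build marginal and joint CDFs from a joint pmf {(x, y): probability}."""
    def F_XY(x, y):
        return sum(p for (a, b), p in joint.items() if a <= x and b <= y)
    top = max(support)
    F_X = lambda x: F_XY(x, top)   # P(X <= x) = P(X <= x, Y <= largest value)
    F_Y = lambda y: F_XY(top, y)
    return F_X, F_Y, F_XY

def is_iid(joint, support):
    """Check F_X = F_Y and F_{X,Y} = F_X * F_Y at every support point."""
    F_X, F_Y, F_XY = cdfs(joint, support)
    identical = all(F_X(x) == F_Y(x) for x in support)
    independent = all(F_XY(x, y) == F_X(x) * F_Y(y)
                      for x, y in product(support, repeat=2))
    return identical and independent

support = [0, 1]
p = Fraction(3, 10)  # hypothetical P(X = 1)
marginal = {0: 1 - p, 1: p}
# Two independent copies of the same biased coin.
iid_joint = {(a, b): marginal[a] * marginal[b] for a, b in product(support, repeat=2)}
# Y is an exact copy of X: identically distributed but not independent.
copied_joint = {(0, 0): 1 - p, (1, 1): p}

print(is_iid(iid_joint, support))     # True
print(is_iid(copied_joint, support))  # False
```

The second case shows that identical distribution alone is not enough: both variables have the same marginal distribution, but the joint cumulative distribution function does not factorize.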

Definition for independence

In probability theory, two events, $A$ and $B$, are called independent if and only if $P(A \text{ and } B) = P(A)P(B)$. In the following, $P(AB)$ is short for $P(A \text{ and } B)$.

Suppose there are two events of the experiment, $A$ and $B$. If $P(A) > 0$, the conditional probability $P(B \mid A)$ is defined. Generally, the occurrence of $A$ has an effect on the probability of $B$; this is called conditional probability. Only when the occurrence of $A$ has no effect on the occurrence of $B$ does $P(B \mid A) = P(B)$ hold.

Note: If $P(A) > 0$ and $P(B) > 0$, then $A$ and $B$ cannot be both independent and mutually exclusive, since independence requires $P(AB) = P(A)P(B) > 0$ whereas mutual exclusivity requires $P(AB) = 0$. That is, independent events must be compatible, and mutually exclusive events must be dependent.

Suppose $A$, $B$, and $C$ are three events. If $P(AB) = P(A)P(B)$, $P(BC) = P(B)P(C)$, $P(AC) = P(A)P(C)$, and $P(ABC) = P(A)P(B)P(C)$ are satisfied, then the events $A$, $B$, and $C$ are mutually independent.

A more general definition covers $n$ events $A_1, A_2, \ldots, A_n$. If, for every choice of $2, 3, \ldots, n$ of these events, the probability of their intersection is equal to the product of the probabilities of the individual events, then the events $A_1, A_2, \ldots, A_n$ are independent of each other.
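The requirement that the product rule hold for every sub-collection, and not only for pairs, can be illustrated with a small exact computation. The sketch below assumes two tosses of a fair coin as the sample space; the events and the helper `mutually_independent` are illustrative choices.

```python
from fractions import Fraction
from itertools import combinations, product

# Sample space: two tosses of a fair coin, all four outcomes equally likely.
OMEGA = list(product("HT", repeat=2))

def prob(event):
    """Probability of an event given as a set of outcomes."""
    return Fraction(len(event), len(OMEGA))

def mutually_independent(events):
    """Check the product rule for every sub-collection of 2 or more events."""
    for k in range(2, len(events) + 1):
        for subset in combinations(events, k):
            intersection = set(OMEGA).intersection(*subset)
            expected = Fraction(1)
            for e in subset:
                expected *= prob(e)
            if prob(intersection) != expected:
                return False
    return True

A = {w for w in OMEGA if w[0] == "H"}    # first toss is heads
B = {w for w in OMEGA if w[1] == "H"}    # second toss is heads
C = {w for w in OMEGA if w[0] == w[1]}   # both tosses agree

print(mutually_independent([A, B]))      # True
print(mutually_independent([A, B, C]))   # False
```

Here every pair of events satisfies the product rule, but the triple does not: the intersection of all three has probability 1/4, while the product of the three probabilities is 1/8.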

Examples

Example 1

A sequence of outcomes of spins of a fair or unfair roulette wheel is i.i.d. One implication of this is that if the roulette ball lands on "red", for example, 20 times in a row, the next spin is no more or less likely to be "black" than on any other spin (see the gambler's fallacy).

Example 2

Toss a coin 10 times and record how many times the coin lands on heads.

  1. Independent – The outcome of one toss does not affect the outcome of any other toss, which means the 10 results are independent of each other.
  2. Identically distributed – Regardless of whether the coin is fair (probability 1/2 of heads) or unfair, as long as the same coin is used for each flip, each flip will have the same probability as each other flip.

Such a sequence of i.i.d. variables, each with two possible outcomes, is also called a Bernoulli process.

Example 3

Roll a die 10 times and record how many times the result is 1.

  1. Independent – The outcome of one roll does not affect the outcome of any other roll, which means the 10 results are independent of each other.
  2. Identically distributed – Regardless of whether the die is fair or weighted, each roll will have the same probability as every other roll. In contrast, rolling 10 different dice, some of which are weighted and some of which are not, would not produce i.i.d. variables.

Example 4

Choose a card from a standard deck of cards containing 52 cards, then place the card back in the deck. Repeat this 52 times. Record the number of kings that appear.

  1. Independent – The outcome of one draw does not affect the outcome of any other draw, which means the 52 results are independent of each other. In contrast, if each drawn card were kept out of the deck, subsequent draws would be affected by it (drawing one king would make drawing a second king less likely), and the results would not be independent.
  2. Identically distributed – Because each drawn card is returned to the deck before the next draw, the probability of drawing a king is 4/52 every time, which means the distribution is identical for each draw.
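The effect of replacement on independence in the card example can be checked with exact arithmetic. This is a small sketch; the function name `p_second_king_given_first_king` is introduced here only for illustration.

```python
from fractions import Fraction

KINGS, DECK = 4, 52

def p_second_king_given_first_king(replace):
    """P(second card is a king | first card was a king)."""
    if replace:
        # The deck is restored before the second draw.
        return Fraction(KINGS, DECK)
    # One king is missing from the remaining 51 cards.
    return Fraction(KINGS - 1, DECK - 1)

p_king = Fraction(KINGS, DECK)
with_replacement = p_second_king_given_first_king(True)
without_replacement = p_second_king_given_first_king(False)
print(p_king, with_replacement, without_replacement)  # 1/13 1/13 1/17
```

With replacement, the conditional probability equals the unconditional one, as independence requires. Without replacement, it drops from 1/13 to 1/17.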

Generalizations

Many results that were first proven under the assumption that the random variables are i.i.d. have been shown to be true even under a weaker distributional assumption.

Exchangeable random variables

See main article: Exchangeable random variables. The most general notion that shares the main properties of i.i.d. variables is that of exchangeable random variables, introduced by Bruno de Finetti. Exchangeability means that while the variables may not be independent, future ones behave like past ones. Formally, any value of a finite sequence is as likely as any permutation of those values; that is, the joint probability distribution is invariant under the symmetric group.

This provides a useful generalization — for example, sampling without replacement is not independent, but is exchangeable.
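The claim about sampling without replacement can be verified by enumeration. The sketch below assumes a hypothetical urn with two red balls and one blue ball, from which two balls are drawn without replacement; all names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import permutations

URN = ["R", "R", "B"]  # hypothetical urn: two red balls, one blue ball
DRAWS = 2

# Every ordered selection of distinct balls is equally likely.
orders = list(permutations(range(len(URN)), DRAWS))
joint = Counter()
for order in orders:
    colours = tuple(URN[i] for i in order)
    joint[colours] += Fraction(1, len(orders))

def marginal(position, colour):
    """Probability that the draw at `position` has the given colour."""
    return sum(p for seq, p in joint.items() if seq[position] == colour)

# Exchangeable: swapping the order of the two draws leaves the probability unchanged.
exchangeable = all(joint[seq] == joint[seq[::-1]] for seq in list(joint))
# Independent only if the joint probability is the product of the marginals.
independent = all(joint[(a, b)] == marginal(0, a) * marginal(1, b)
                  for a in "RB" for b in "RB")
print(exchangeable, independent)  # True False
```

Both draws have the same marginal distribution, and the joint distribution is symmetric in the two positions, yet drawing two red balls has probability 1/3 and not (2/3)(2/3) = 4/9.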

Lévy process

See main article: Lévy process. In stochastic calculus, i.i.d. variables are thought of as a discrete time Lévy process: each variable gives how much one changes from one time to another. For example, a sequence of Bernoulli trials is interpreted as the Bernoulli process.

One may generalize this to include continuous time Lévy processes, and many Lévy processes can be seen as limits of i.i.d. variables—for instance, the Wiener process is the limit of the Bernoulli process.
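The limiting behaviour can be sketched numerically. The example below assumes the usual scaling for this limit, which is not spelled out in the text: i.i.d. steps of +1 or -1 with equal probability, summed and divided by the square root of the number of steps. The function name and the simulation sizes are illustrative.

```python
import math
import random
import statistics

def scaled_walk_endpoint(n, rng):
    """Value at time 1 of a walk with n i.i.d. +/-1 steps, scaled by 1/sqrt(n)."""
    steps = (rng.choice((-1, 1)) for _ in range(n))
    return sum(steps) / math.sqrt(n)

rng = random.Random(42)
endpoints = [scaled_walk_endpoint(200, rng) for _ in range(2000)]
# For a standard Wiener process W, the value W(1) has mean 0 and variance 1.
walk_mean = statistics.mean(endpoints)
walk_var = statistics.variance(endpoints)
print(round(walk_mean, 2), round(walk_var, 2))
```

Under these assumptions the simulated endpoints should have a mean near 0 and a variance near 1, matching the distribution of a standard Wiener process at time 1.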

In machine learning

Machine learning utilizes the vast amounts of data currently available to deliver faster and more accurate results.[6] To train machine learning models effectively, it is crucial to use historical data that is broadly generalizable. If the training data is not representative of the overall situation, the model's performance on new, unseen data may be inaccurate.

The i.i.d. (independent and identically distributed) hypothesis allows for a significant reduction in the number of individual cases required in the training sample.

This assumption simplifies mathematical maximization calculations. In optimization problems, the assumption of independent and identical distribution simplifies the calculation of the likelihood function. Due to the independence assumption, the likelihood function can be expressed as:

$$l(\theta) = P(x_1, x_2, x_3, \ldots, x_n \mid \theta) = P(x_1 \mid \theta)\,P(x_2 \mid \theta)\,P(x_3 \mid \theta) \cdots P(x_n \mid \theta).$$

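To maximize the probability of the observed event, the logarithm of the likelihood is maximized with respect to the parameter $\theta$; because the logarithm is monotonically increasing, this yields the same maximizer as the likelihood itself. Specifically, one computes

$$\underset{\theta}{\operatorname{argmax}}\ \log(l(\theta)),$$

where

$$\log(l(\theta)) = \log(P(x_1 \mid \theta)) + \log(P(x_2 \mid \theta)) + \log(P(x_3 \mid \theta)) + \ldots + \log(P(x_n \mid \theta)).$$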

Computers are very efficient at performing multiple additions, but not as efficient at performing multiplications. This simplification enhances computational efficiency. The log transformation, in the process of maximizing, converts many exponential functions into linear functions.
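The log-likelihood computation can be sketched for a simple model. The example below assumes i.i.d. Bernoulli observations with a hypothetical data set and a coarse grid search over $\theta$; none of these choices come from the text. It also shows a further practical reason for working with sums of logarithms: in double-precision arithmetic, the direct product of many probabilities can underflow to zero.

```python
import math

def log_likelihood(theta, data):
    """Log-likelihood of i.i.d. Bernoulli(theta) observations (each 0 or 1)."""
    return sum(math.log(theta if x == 1 else 1.0 - theta) for x in data)

def likelihood(theta, data):
    """Direct product of the per-observation probabilities."""
    return math.prod(theta if x == 1 else 1.0 - theta for x in data)

# Hypothetical data set: 3000 observations, 30% of them equal to 1.
data = [1] * 900 + [0] * 2100

# Grid search over theta; the maximizer coincides with the sample mean.
grid = [k / 100 for k in range(1, 100)]
theta_hat = max(grid, key=lambda t: log_likelihood(t, data))

print(theta_hat)                        # 0.3
print(likelihood(theta_hat, data))      # 0.0, the product underflows
print(log_likelihood(theta_hat, data))  # a finite negative number
```

A grid search is used only to keep the sketch short; for the Bernoulli model the maximizer is available in closed form as the sample mean.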

There are two main reasons why this hypothesis is practically useful with the central limit theorem:

  1. Even if the sample originates from a complex non-Gaussian distribution, it can be well-approximated because the central limit theorem allows it to be simplified to a Gaussian distribution. For a large number of observable samples, "the sum of many random variables will have an approximately normal distribution".
  2. The second reason is that the accuracy of the model depends on the simplicity and representational power of the model unit, as well as the quality of the data. The simplicity of the unit makes it easy to interpret and scale, while the representational power and scalability improve model accuracy. In a deep neural network, for instance, each neuron is simple yet powerful in representation, layer by layer, capturing more complex features to enhance model accuracy.

Notes and References

  1. Clauset, Aaron (2011). "A brief primer on probability distributions". Archived from the original on 2012-01-20 at https://web.archive.org/web/20120120154739/http://tuvalu.santafe.edu/~aaronc/courses/7000/csci7000-001_2011_L0.pdf . Retrieved 2011-11-29.
  2. Stephanie (2016-05-11). "IID Statistics: Independent and Identically Distributed Definition and Examples". Statistics How To. Retrieved 2021-12-09.
  3. (§8).
  4. Blum, J. R.; Chernoff, H.; Rosenblatt, M.; Teicher, H. (1958). "Central Limit Theorems for Interchangeable Processes". Canadian Journal of Mathematics. 10: 222–229. doi:10.4153/CJM-1958-026-0. S2CID 124843240.
  5. Cover, T. M.; Thomas, J. A. (2006). Elements of Information Theory. Wiley-Interscience. pp. 57–58. ISBN 978-0-471-24195-9.
  6. "What is Machine Learning? A Definition". Expert.ai. 2020-05-05. Retrieved 2021-12-16.