In statistics, the grouped Dirichlet distribution (GDD) is a multivariate generalization of the Dirichlet distribution. It was first described by Ng et al. in 2008.[1] The grouped Dirichlet distribution arises in the analysis of categorical data where some observations could fall into any one of a set of 'crisp' categories. For example, one may have a data set consisting of cases and controls under two different conditions. With complete data, the cross-classification of disease status forms a 2 × 2 (case/control by condition/no condition) table with cell probabilities
| | Treatment | No Treatment |
|---|---|---|
| Controls | θ1 | θ2 |
| Cases | θ3 | θ4 |
If, however, the data include, say, non-respondents who are known to be controls or cases but whose condition is not recorded, then the cross-classification of disease status forms a 2 × 3 table. The probability of the last column is the sum of the probabilities of the first two columns in each row:
| | Treatment | No Treatment | Missing |
|---|---|---|---|
| Controls | θ1 | θ2 | θ1+θ2 |
| Cases | θ3 | θ4 | θ3+θ4 |
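
As a rough illustration of how the "Missing" column enters an analysis, the following Python sketch evaluates a log-likelihood kernel for such a table. It rests on assumptions that are not stated in the text above: that non-response is ignorable (missing at random), so each non-respondent contributes only its row probability, and that additive constants can be dropped. The function name `incomplete_table_loglik` and all counts are hypothetical.

```python
import math


def incomplete_table_loglik(theta, complete_counts, missing_counts):
    """Log-likelihood kernel for the 2 x 3 table (additive constants dropped).

    theta           : (theta1, theta2, theta3, theta4), cell probabilities summing to 1
    complete_counts : counts in the four fully classified cells, same order as theta
    missing_counts  : (controls, cases) counts whose condition is not recorded

    Assumption (not from the source text): non-response is ignorable, so a
    non-respondent in a given row contributes that row's total probability.
    """
    t1, t2, t3, t4 = theta
    row_probs = (t1 + t2, t3 + t4)  # probability of being a control / a case
    loglik = sum(n * math.log(t) for n, t in zip(complete_counts, theta))
    loglik += sum(m * math.log(p) for m, p in zip(missing_counts, row_probs))
    return loglik


# Hypothetical counts, for illustration only.
theta = (0.25, 0.25, 0.25, 0.25)
print(incomplete_table_loglik(theta, (10, 20, 30, 40), (5, 15)))
```

With the missing counts set to zero the function reduces to the ordinary multinomial log-likelihood kernel for the complete 2 × 2 table.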
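Consider the closed simplex set

$$\mathcal{T}_n=\left\{\left(x_1,\ldots,x_n\right)\left|\,x_i\geq 0,\ i=1,\ldots,n,\ \sum_{i=1}^{n}x_i=1\right.\right\}$$

and $\mathbf{x}\in\mathcal{T}_n$. Writing $\mathbf{x}_{-n}=\left(x_1,\ldots,x_{n-1}\right)$ for the first $n-1$ elements of a member of $\mathcal{T}_n$, the distribution of $\mathbf{x}$ for two partitions has a density function given by

$$\operatorname{GD}_{n,2,s}\left(\left.\mathbf{x}_{-n}\right|\mathbf{a},\mathbf{b}\right)=\frac{\left(\prod_{i=1}^{n}x_i^{a_i-1}\right)\cdot\left(\sum_{i=1}^{s}x_i\right)^{b_1}\cdot\left(\sum_{i=s+1}^{n}x_i\right)^{b_2}}{\operatorname{B}\left(a_1,\ldots,a_s\right)\cdot\operatorname{B}\left(a_{s+1},\ldots,a_n\right)\cdot\operatorname{B}\left(b_1+\sum_{i=1}^{s}a_i,\ b_2+\sum_{i=s+1}^{n}a_i\right)}$$

where $\operatorname{B}\left(\mathbf{a}\right)$ is the multivariate beta function.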
Ng et al. went on to define an m-partition grouped Dirichlet distribution, with the density of $\mathbf{x}_{-n}$ given by

$$\operatorname{GD}_{n,m,\mathbf{s}}\left(\left.\mathbf{x}_{-n}\right|\mathbf{a},\mathbf{b}\right)=c_m^{-1}\cdot\left(\prod_{i=1}^{n}x_i^{a_i-1}\right)\cdot\prod_{j=1}^{m}\left(\sum_{k=s_{j-1}+1}^{s_j}x_k\right)^{b_j}$$

where $\mathbf{s}=\left(s_1,\ldots,s_m\right)$ is a vector of integers with $0=s_0<s_1\leqslant\cdots\leqslant s_m=n$. The normalizing constant is given by

$$c_m=\left\{\prod_{j=1}^{m}\operatorname{B}\left(a_{s_{j-1}+1},\ldots,a_{s_j}\right)\right\}\cdot\operatorname{B}\left(b_1+\sum_{k=1}^{s_1}a_k,\ldots,b_m+\sum_{k=s_{m-1}+1}^{s_m}a_k\right)$$
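
The m-partition density can be evaluated numerically on the log scale. The following is a minimal Python sketch, not the authors' implementation: the function names `log_multivariate_beta` and `gdd_logpdf` are hypothetical, the multivariate beta function is computed as a ratio of gamma functions via `math.lgamma`, and the point `x` is assumed to lie in the interior of the simplex (all coordinates strictly positive) with each group containing at least one index.

```python
import math


def log_multivariate_beta(alpha):
    """log B(alpha) = sum_i lgamma(alpha_i) - lgamma(sum_i alpha_i)."""
    return sum(math.lgamma(v) for v in alpha) - math.lgamma(sum(alpha))


def gdd_logpdf(x, a, b, s):
    """Log-density of the m-partition grouped Dirichlet at x (hypothetical helper).

    x : point on the simplex, length n, all entries > 0 and summing to 1
    a : positive parameters, length n
    b : one exponent per group, length m
    s : group end indices (s_1, ..., s_m) with s_m == n; s_0 = 0 is implied
    """
    n, m = len(x), len(b)
    if len(a) != n or len(s) != m or s[-1] != n:
        raise ValueError("inconsistent dimensions")
    bounds = [0] + list(s)

    # Dirichlet-type part of the kernel: prod_i x_i^(a_i - 1)
    log_kernel = sum((ai - 1.0) * math.log(xi) for ai, xi in zip(a, x))

    log_c = 0.0          # accumulates log c_m
    group_params = []    # arguments of the final beta term in c_m
    for j in range(m):
        lo, hi = bounds[j], bounds[j + 1]
        # Group-sum part of the kernel: (sum_{k in group j} x_k)^(b_j)
        log_kernel += b[j] * math.log(sum(x[lo:hi]))
        log_c += log_multivariate_beta(a[lo:hi])
        group_params.append(b[j] + sum(a[lo:hi]))
    log_c += log_multivariate_beta(group_params)

    return log_kernel - log_c


# Two-partition example (n = 4, s = 2) with illustrative parameter values.
x = [0.1, 0.2, 0.3, 0.4]
a = [1.5, 2.0, 2.5, 3.0]
print(gdd_logpdf(x, a, b=[1.0, 2.0], s=[2, 4]))
```

One sanity check on the sketch: when every $b_j$ is zero, the beta terms in $c_m$ telescope to $\operatorname{B}(\mathbf{a})$, so `gdd_logpdf` should agree with the ordinary Dirichlet log-density for the same `a`.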
The authors went on to use these distributions in the context of three different applications in medical science.