Within statistics, dynamic topic models are generative models that can be used to analyze how the (unobserved) topics of a collection of documents evolve over time. This family of models was proposed by David Blei and John Lafferty as an extension of Latent Dirichlet Allocation (LDA) that can handle sequential documents.[1]
In LDA, the model ignores both the order in which words appear in a document and the order in which documents appear in the corpus. In a dynamic topic model, words are still assumed to be exchangeable, but the order of the documents plays a fundamental role. More precisely, the documents are grouped by time slice (e.g., by year), and the documents of each slice are assumed to come from a set of topics that evolved from the set of the previous slice.
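As an illustration of the grouping step, the following minimal sketch partitions a toy corpus into chronologically ordered time slices and reduces each document to a bag of words. The corpus contents and the helper name `group_by_time_slice` are hypothetical and chosen only for this example.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus of (year, text) pairs, for illustration only.
corpus = [
    (1999, "network growth model"),
    (2001, "zipf law network"),
    (1999, "random graph percolation"),
    (2000, "percolation threshold network"),
]

def group_by_time_slice(docs):
    """Return a chronologically sorted list of (year, bag-of-words list) pairs."""
    slices = defaultdict(list)
    for year, text in docs:
        # Words are exchangeable, so only term counts are kept per document.
        slices[year].append(Counter(text.split()))
    return sorted(slices.items())

time_slices = group_by_time_slice(corpus)
for year, bags in time_slices:
    print(year, len(bags), "document(s)")
```

Within a slice, the order of the documents is irrelevant; only the order of the slices matters to the model.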
Similarly to LDA and pLSA, in a dynamic topic model, each document is viewed as a mixture of unobserved topics. Furthermore, each topic defines a multinomial distribution over a set of terms. Thus, for each word of each document, a topic is drawn from the mixture and a term is subsequently drawn from the multinomial distribution corresponding to that topic.
The topics, however, evolve over time. For instance, the two most likely terms of a topic at time $t$ could be "network" and "Zipf" (in descending order), while the most likely ones at time $t+1$ could be "Zipf" and "percolation" (in descending order).
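Define
$\alpha_t$ as the parameters of the per-document topic distribution at time slice $t$,
$\beta_{t,k}$ as the parameters of the word distribution of topic $k$ at time slice $t$,
$\eta_{t,d}$ as the parameters of the topic distribution of document $d$ in time slice $t$,
$z_{t,d,n}$ as the topic assigned to the $n$-th word of document $d$ in time slice $t$ (written $Z_{t,d,n}$ in the generative process), and
$w_{t,d,n}$ as the $n$-th word of document $d$ in time slice $t$ (written $W_{t,d,n}$ in the generative process).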
In this model, the multinomial distributions $\alpha_{t+1}$ and $\beta_{t+1,k}$ are generated from $\alpha_t$ and $\beta_{t,k}$, respectively.
Multinomial distributions are usually written in terms of their mean parameters, but this representation has some disadvantages because the parameters are constrained to be non-negative and to sum to one.[2] When defining the evolution of these distributions, one would need to ensure that these constraints remain satisfied. Since both distributions are in the exponential family, one solution to this problem is to represent them in terms of their natural parameters, which can take any real value and can be changed individually.
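The difference between the two parameterizations can be seen in a small numerical sketch. It assumes a four-term vocabulary, uses log-probabilities as one valid choice of natural parameters, and maps back to probabilities with a softmax; the vector `p` and the noise scale are illustrative values, not taken from the original paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    """Map an arbitrary real vector to a probability vector."""
    z = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return z / z.sum()

# Mean parameterization: a probability vector over a 4-term vocabulary.
p = np.array([0.7, 0.2, 0.05, 0.05])

# Adding Gaussian noise to the mean parameters generally leaves the simplex:
# the entries no longer sum to one and may even become negative.
p_perturbed = p + rng.normal(scale=0.1, size=p.shape)

# Adding the same kind of noise to natural parameters is always safe,
# because any real vector maps back to a valid distribution.
beta = np.log(p)
beta_perturbed = beta + rng.normal(scale=0.1, size=beta.shape)
q = softmax(beta_perturbed)

print("perturbed mean parameters sum to", p_perturbed.sum())
print("perturbed natural parameters map to a sum of", q.sum())
```

This is why a Gaussian random walk can be placed directly on the natural parameters without any projection or renormalization step.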
Using the natural parameterization, the dynamics of the topic model are given by
$$\beta_{t,k} \mid \beta_{t-1,k} \sim \mathcal{N}(\beta_{t-1,k}, \sigma^2 I)$$
and
$$\alpha_t \mid \alpha_{t-1} \sim \mathcal{N}(\alpha_{t-1}, \delta^2 I).$$
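The Gaussian random walk on the natural parameters of a single topic can be simulated as in the sketch below. The three-term vocabulary, the number of slices, and the value of `sigma` are assumptions made for illustration, and `simulate_topic_drift` is a hypothetical helper name.

```python
import numpy as np

def simulate_topic_drift(beta0, num_slices, sigma, rng):
    """Simulate beta_t | beta_{t-1} ~ N(beta_{t-1}, sigma^2 I) for one topic."""
    betas = [np.asarray(beta0, dtype=float)]
    for _ in range(num_slices - 1):
        noise = rng.normal(scale=sigma, size=betas[-1].shape)
        betas.append(betas[-1] + noise)
    return np.stack(betas)  # shape: (num_slices, vocab_size)

rng = np.random.default_rng(42)
vocab = ["network", "zipf", "percolation"]  # illustrative vocabulary
trajectory = simulate_topic_drift(np.zeros(len(vocab)), num_slices=10, sigma=0.5, rng=rng)

# Convert each slice's natural parameters to term probabilities (row-wise softmax).
exp_traj = np.exp(trajectory - trajectory.max(axis=1, keepdims=True))
probs = exp_traj / exp_traj.sum(axis=1, keepdims=True)

for t, row in enumerate(probs):
    ranked = [vocab[i] for i in np.argsort(-row)]
    print(t, ranked)
```

Printing the ranked terms per slice shows how the most likely terms of a topic can change order as the parameters drift.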
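The generative process at time slice $t$ is therefore:
1. Draw topics $\beta_{t,k} \mid \beta_{t-1,k} \sim \mathcal{N}(\beta_{t-1,k}, \sigma^2 I)$ for all $k$.
2. Draw the mixture parameters $\alpha_t \mid \alpha_{t-1} \sim \mathcal{N}(\alpha_{t-1}, \delta^2 I)$.
3. For each document $d$:
   1. Draw $\eta_{t,d} \sim \mathcal{N}(\alpha_t, a^2 I)$.
   2. For each word $n$:
      1. Draw a topic $Z_{t,d,n} \sim \mathrm{Mult}(\pi(\eta_{t,d}))$.
      2. Draw a word $W_{t,d,n} \sim \mathrm{Mult}(\pi(\beta_{t,Z_{t,d,n}}))$.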
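where $\pi(x)$ is a mapping from the natural parameterization $x$ to the mean parameterization, namely
$$\pi(x)_i = \frac{\exp(x_i)}{\sum_j \exp(x_j)}.$$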
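The per-slice sampling steps can be sketched in code as follows. This is a minimal illustration that assumes the natural parameters of the current slice are already given, uses a fixed document length, and represents words as integer vocabulary indices; the function name `generate_slice` and all numeric values are hypothetical.

```python
import numpy as np

def softmax(x):
    """The mapping pi: natural parameters -> mean parameters."""
    z = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return z / z.sum()

def generate_slice(alpha_t, beta_t, num_docs, doc_length, a, rng):
    """Sample documents for one time slice.

    alpha_t: shape (K,), natural parameters for topic proportions.
    beta_t:  shape (K, V), natural parameters of the K topics.
    Returns a list of integer arrays of word indices.
    """
    num_topics, vocab_size = beta_t.shape
    topic_term = np.array([softmax(b) for b in beta_t])  # pi(beta_{t,k})
    docs = []
    for _ in range(num_docs):
        eta = rng.normal(loc=alpha_t, scale=a)        # eta ~ N(alpha_t, a^2 I)
        theta = softmax(eta)                          # pi(eta)
        z = rng.choice(num_topics, size=doc_length, p=theta)  # topic per word
        words = np.array([rng.choice(vocab_size, p=topic_term[k]) for k in z])
        docs.append(words)
    return docs

rng = np.random.default_rng(0)
K, V = 2, 5  # illustrative numbers of topics and vocabulary terms
alpha_t = np.zeros(K)
beta_t = rng.normal(size=(K, V))
docs = generate_slice(alpha_t, beta_t, num_docs=3, doc_length=8, a=1.0, rng=rng)
print(docs)
```

Chaining this function over successive slices, with `alpha_t` and `beta_t` updated by their Gaussian random walks, would give samples from the full model.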
In the dynamic topic model, only $W_{t,d,n}$ is observed; all other variables are latent and must be inferred.
In the original paper, a dynamic topic model is applied to a corpus of articles from the journal Science published between 1881 and 1999, with the aim of showing that the method can be used to analyze trends of word usage within topics. The authors also show that a model trained on past documents fits the documents of the following year better than LDA does.
A continuous-time dynamic topic model was developed by Wang et al. and applied to predict the timestamps of documents.[3]
Going beyond text documents, dynamic topic models were used to study musical influence, by learning musical topics and how they evolve in recent history.[4]