Vecchia approximation explained

Vecchia approximation is a Gaussian process approximation technique originally developed by Aldo Vecchia, a statistician at the United States Geological Survey.[1] It is one of the earliest attempts to use Gaussian processes in high-dimensional settings. It has since been extensively generalized, giving rise to many contemporary approximations.

Intuition

A joint probability distribution for events $A$, $B$, and $C$, denoted $P(A,B,C)$, can be expressed exactly as

$P(A,B,C) = P(A)\,P(B \mid A)\,P(C \mid A,B).$

Vecchia's approximation takes the form, for example,

$P(A,B,C) \approx P(A)\,P(B \mid A)\,P(C \mid A)$

and is accurate when events $B$ and $C$ are close to conditionally independent given knowledge of $A$. Of course, one could have alternatively chosen the approximation

$P(A,B,C) \approx P(A)\,P(B \mid A)\,P(C \mid B)$

and so use of the approximation requires some knowledge of which events are close to conditionally independent given others. Moreover, one could have chosen a different ordering, for example

$P(A,B,C) \approx P(C)\,P(A \mid C)\,P(B \mid A).$

Fortunately, in many cases there are good heuristics for making decisions about how to construct the approximation.
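The intuition above can be checked numerically. The following sketch uses an assumed toy zero-mean Gaussian model in which $B$ and $C$ are exactly conditionally independent given $A$ (all numbers are illustrative, not from the source); in that case the factorization $P(A)\,P(B \mid A)\,P(C \mid A)$ reproduces the joint density exactly:

```python
import numpy as np

# Toy zero-mean Gaussian model: B = A + e_B and C = A + e_C with
# independent unit-variance noise, so B and C are conditionally
# independent given A.  Covariance of (A, B, C):
Sigma = np.array([[1.0, 1.0, 1.0],
                  [1.0, 2.0, 1.0],
                  [1.0, 1.0, 2.0]])

def norm_logpdf(v, mean, var):
    """Log density of a univariate normal."""
    return -0.5 * (np.log(2 * np.pi * var) + (v - mean) ** 2 / var)

def exact_logpdf(v, Sigma):
    """Log density of a zero-mean multivariate normal."""
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (len(v) * np.log(2 * np.pi) + logdet
                   + v @ np.linalg.solve(Sigma, v))

def vecchia_logpdf(a, b, c):
    """P(A) P(B|A) P(C|A) on the log scale; in this model
    both B|A=a and C|A=a are N(a, 1)."""
    return (norm_logpdf(a, 0.0, 1.0)
            + norm_logpdf(b, a, 1.0)
            + norm_logpdf(c, a, 1.0))

v = np.array([0.3, -1.2, 0.7])
# Because P(C | A, B) = P(C | A) in this model, both routes
# give the same log density at v.
```

When $B$ and $C$ are only approximately conditionally independent given $A$, the two values differ, and the gap measures the quality of the approximation.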

More technically, general versions of the approximation lead to a sparse Cholesky factor of the precision matrix. Using the standard Cholesky factorization produces entries which can be interpreted[2] as conditional correlations, with zeros indicating conditional independence (since the model is Gaussian). These independence relations can alternatively be expressed using graphical models, and there exist theorems linking graph structure and vertex ordering with zeros in the Cholesky factor. In particular, it is known[3] that independencies encoded in a moral graph lead to Cholesky factors of the precision matrix that have no fill-in.
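A small numpy illustration of this link (using an assumed toy Gaussian model in which $B$ and $C$ are conditionally independent given $A$): the precision matrix has a zero in its $(B, C)$ entry, and ordering the variables so that the conditioning variable $A$ comes last leaves that zero intact in the Cholesky factor, i.e. no fill-in occurs:

```python
import numpy as np

# Toy model: B = A + e_B, C = A + e_C with independent unit noise,
# hence B and C are conditionally independent given A.
Sigma = np.array([[1.0, 1.0, 1.0],   # variable order (A, B, C)
                  [1.0, 2.0, 1.0],
                  [1.0, 1.0, 2.0]])
Q = np.linalg.inv(Sigma)             # precision matrix

# Conditional independence of B and C appears as a zero at Q[B, C].
# Reorder the variables so the conditioning variable A comes last ...
perm = [1, 2, 0]                     # new order (B, C, A)
L = np.linalg.cholesky(Q[np.ix_(perm, perm)])

# ... and the zero survives in the lower-triangular Cholesky factor:
# L[C, B] == 0, i.e. this ordering produces no fill-in.
```

With an unfavourable ordering (e.g. $A$ first among $B$ and $C$ in the factor), the same zero would generally fill in, which is why ordering matters for sparsity.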

Formal description

The problem

Let $x$ be a Gaussian process indexed by $\mathcal{S}$ with mean function $\mu$ and covariance function $K$. Assume that $S = \{s_1, \ldots, s_n\} \subset \mathcal{S}$ is a finite subset of $\mathcal{S}$ and $x = (x_1, \ldots, x_n)$ is the vector of values of $x$ evaluated at $S$, i.e. $x_i = x(s_i)$ for $i = 1, \ldots, n$. Assume further that one observes $y = (y_1, \ldots, y_n)$, where $y_i = x_i + \varepsilon_i$ with $\varepsilon_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2)$. In this context the two most common inference tasks are evaluating the likelihood

$\mathcal{L}(y) = \int f(y, x)\, dx,$

or making predictions of values of $x$ for $s^* \in \mathcal{S}$ with $s^* \notin S$, i.e. calculating

$f(x(s^*) \mid y_1, \ldots, y_n).$
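Since both $x$ and $\varepsilon$ are Gaussian, the likelihood integral above is available in closed form, which makes the cost of exact inference explicit (a standard identity, stated here for completeness):

```latex
% Writing \mu_S = (\mu(s_1), \ldots, \mu(s_n)) and (K_{SS})_{ij} = K(s_i, s_j),
% the latent vector x integrates out analytically:
\mathcal{L}(y) = \int f(y \mid x)\, f(x)\, dx
               = \mathcal{N}_n\!\left(y;\; \mu_S,\; K_{SS} + \sigma^2 I_n\right).
```

Evaluating this density directly requires factorizing an $n \times n$ covariance matrix at $\mathcal{O}(n^3)$ cost, which is precisely what the approximations below avoid.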

Original formulation

The original Vecchia method starts with the observation that the joint density of the observations $y = (y_1, \ldots, y_n)$ can be written as a product of conditional densities:

$f(y) = f(y_1) \prod_{i=2}^{n} f(y_i \mid y_{i-1}, \ldots, y_1).$

The Vecchia approximation assumes instead that for some $k \ll n$

$\hat{f}(y) = f(y_1) \prod_{i=2}^{n} f(y_i \mid y_{i-1}, \ldots, y_{\max(i-k,1)}).$

Vecchia also suggested that the above approximation be applied to observations reordered lexicographically by their spatial coordinates. While this simple method has many weaknesses, it reduced the computational complexity to $\mathcal{O}(nk^3)$. Many of its deficiencies were addressed by subsequent generalizations.

General formulation

While conceptually simple, the assumption of the Vecchia approximation often proves to be fairly restrictive and inaccurate.[4] This inspired important generalizations and improvements to the basic version introduced over the years: the inclusion of latent variables, more sophisticated conditioning, and better ordering. Different special cases of the general Vecchia approximation can be described in terms of how these three elements are selected.[5]

Latent variables

To describe extensions of the Vecchia method in its most general form, define $z_i = (x_i, y_i)$ and notice that, for $z = (z_1, \ldots, z_n)$ and as in the previous section,

$f(z) = f(x_1, y_1) \left( \prod_{i=2}^{n} f(x_i \mid z_{1:i-1}) \right) \left( \prod_{i=2}^{n} f(y_i \mid x_i) \right)$

because, given $x_i$, all other variables are independent of $y_i$.

Ordering

It has been widely noted that the original lexicographic ordering based on coordinates produces poor results when $\mathcal{S}$ is two-dimensional.[6] More recently, other orderings have been proposed, some of which ensure that points are ordered in a quasi-random fashion. Besides being highly scalable, these orderings have been shown to also drastically improve accuracy.[4]
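One popular family of such orderings is greedy max-min, which spreads early points across the whole domain. The sketch below is a naive $O(n^2)$ illustration of the idea only (the function name and starting rule are my choices, and production implementations use faster algorithms):

```python
import numpy as np

def maxmin_ordering(pts):
    """Greedy max-min ordering: start near the centre of the point
    cloud, then repeatedly add the point farthest from all points
    chosen so far, so early points cover the domain quasi-randomly."""
    first = int(np.argmin(np.linalg.norm(pts - pts.mean(axis=0), axis=1)))
    order = [first]
    # mindist[j] = distance from point j to the nearest ordered point
    mindist = np.linalg.norm(pts - pts[first], axis=1)
    for _ in range(len(pts) - 1):
        nxt = int(np.argmax(mindist))
        order.append(nxt)
        mindist = np.minimum(mindist, np.linalg.norm(pts - pts[nxt], axis=1))
    return order

rng = np.random.default_rng(0)
pts = rng.uniform(0, 1, (50, 2))   # random 2-D locations
order = maxmin_ordering(pts)       # a permutation of 0..49
```

Because already-chosen points have `mindist` zero, each point is selected exactly once and the result is a permutation of the input indices.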

Conditioning

Similar to the basic version described above, for a given ordering a general Vecchia approximation can be defined as

$\hat{f}(z) = f(x_1, y_1) \left( \prod_{i=2}^{n} f(x_i \mid z_{q(i)}) \right) \left( \prod_{i=2}^{n} f(y_i \mid x_i) \right),$

where $q(i) \subset \{1, \ldots, i-1\}$. Since $y_i \perp x_{-i}, y_{-i} \mid x_i$, it follows that

$f(x_i \mid z_{q(i)}) = f(x_i \mid x_{q(i)}, y_{q(i)}) = f(x_i \mid x_{q(i)}),$

suggesting that the terms $f(x_i \mid z_{q(i)})$ be replaced with $f(x_i \mid x_{q(i)})$. It turns out, however, that sometimes conditioning on some of the observations $y_i$ increases the sparsity of the Cholesky factor of the precision matrix of $(x, y)$. Therefore, one might instead consider sets $q_y(i)$ and $q_x(i)$ such that $q(i) = q_y(i) \cup q_x(i)$ and express $\hat{f}$ as

$\hat{f}(z) = f(x_1, y_1) \left( \prod_{i=2}^{n} f(x_i \mid x_{q_x(i)}, y_{q_y(i)}) \right) \left( \prod_{i=2}^{n} f(y_i \mid x_i) \right).$

Multiple methods of choosing $q_y(i)$ and $q_x(i)$ have been proposed. Most notably, the nearest-neighbour Gaussian process (NNGP),[7] the meshed Gaussian process[8] and the multi-resolution approximation (MRA) use $q(i) = q_x(i)$; standard Vecchia uses $q(i) = q_y(i)$; and sparse general Vecchia takes both $q_y(i)$ and $q_x(i)$ non-empty.[5]
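A nearest-neighbour choice of conditioning set, in the spirit of NNGP, can be sketched as follows. The function name, the fixed ordering, and the size parameter are illustrative assumptions, not taken from any of the cited packages:

```python
import numpy as np

def nearest_prev_neighbours(pts, order, m):
    """For each position i in the ordering, return q(i): the indices of
    the (up to) m previously ordered points nearest to point order[i]."""
    q = [[]]  # the first point in the ordering conditions on nothing
    for i in range(1, len(order)):
        prev = np.array(order[:i])                       # candidates
        d = np.linalg.norm(pts[prev] - pts[order[i]], axis=1)
        q.append(prev[np.argsort(d)[:m]].tolist())       # m closest
    return q

rng = np.random.default_rng(1)
pts = rng.uniform(0, 1, (30, 2))   # random 2-D locations
order = list(range(30))            # any fixed ordering works here
q = nearest_prev_neighbours(pts, order, m=4)
```

Each $q(i)$ then has at most $m$ elements drawn only from points earlier in the ordering, which is what keeps the resulting Cholesky factor sparse.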

Software

Several packages have been developed which implement some variants of the Vecchia approximation.

Notes and References

  1. Vecchia, A. V. (1988). "Estimation and Model Identification for Continuous Spatial Processes". Journal of the Royal Statistical Society, Series B (Methodological), 50(2), 297–312. doi:10.1111/j.2517-6161.1988.tb01729.x.
  2. Pourahmadi, M. (2007). "Cholesky Decompositions and Estimation of a Covariance Matrix: Orthogonality of Variance Correlation Parameters". Biometrika, 94(4), 1006–1013. doi:10.1093/biomet/asm073.
  3. Khare, K.; Rajaratnam, B. (2011). "Wishart distributions for decomposable covariance graph models". The Annals of Statistics, 39(1), 514–555. doi:10.1214/10-AOS841. arXiv:1103.1768.
  4. Guinness, J. (2018). "Permutation and Grouping Methods for Sharpening Gaussian Process Approximations". Technometrics, 60(4), 415–429. doi:10.1080/00401706.2018.1437476.
  5. Katzfuss, M.; Guinness, J. (2021). "A General Framework for Vecchia Approximations of Gaussian Processes". Statistical Science, 36. doi:10.1214/19-STS755. arXiv:1708.06302.
  6. Banerjee, S.; Carlin, B. P.; Gelfand, A. E. (2014). Hierarchical Modeling and Analysis for Spatial Data, Second Edition. CRC Press. ISBN 978-1-4398-1917-3.
  7. Datta, A.; Banerjee, S.; Finley, A.; Gelfand, A. (2016). "Hierarchical Nearest-Neighbor Gaussian Process Models for Large Spatial Data". Journal of the American Statistical Association, 111(514), 800–812. doi:10.1080/01621459.2015.1044091.
  8. Peruzzi, M.; Banerjee, S.; Finley, A. (2020). "Highly Scalable Bayesian Geostatistical Modeling Via Meshed Gaussian Processes on Partitioned Domains". Journal of the American Statistical Association, 117(538), 969–982. doi:10.1080/01621459.2020.1833889. arXiv:2003.11208.