Random coordinate descent explained

The randomized (block) coordinate descent method is an optimization algorithm popularized by Nesterov (2010) and Richtárik and Takáč (2011). Nesterov (2010) gave the first analysis of the method for the problem of minimizing a smooth convex function. In Nesterov's analysis the method must be applied to a quadratic perturbation of the original function with an unknown scaling factor. Richtárik and Takáč (2011) give iteration complexity bounds that do not require this, i.e., the method is applied to the objective function directly. Furthermore, they generalize the setting to the problem of minimizing a composite function, i.e., the sum of a smooth convex function and a (possibly nonsmooth) convex block-separable function:

F(x)=f(x)+\Psi(x),

where

\Psi(x)=\sum_{i=1}^{n}\Psi_i(x^{(i)}),

x\in\mathbb{R}^N is decomposed into n blocks of variables/coordinates:

x=(x^{(1)},\dots,x^{(n)}),

and \Psi_1,\dots,\Psi_n are (simple) convex functions.

Example (block decomposition): If x=(x_1,x_2,\dots,x_5)\in\mathbb{R}^5 and n=3, one may choose x^{(1)}=(x_1,x_3), x^{(2)}=(x_2,x_5) and x^{(3)}=x_4.
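To make the indexing concrete, the block decomposition above can be written as lists of coordinate indices. This is only an illustrative sketch: the helper name `get_block`, the sample values in `x`, and the 0-based indexing (so x_1 is `x[0]`) are assumptions and not part of the original example.

```python
# Sample vector x = (x_1, ..., x_5); the values are arbitrary placeholders.
x = [10.0, 20.0, 30.0, 40.0, 50.0]

# 0-based index lists for x^(1) = (x_1, x_3), x^(2) = (x_2, x_5), x^(3) = x_4.
blocks = [[0, 2], [1, 4], [3]]


def get_block(x, idx):
    """Return the sub-vector of x selected by the index list idx."""
    return [x[j] for j in idx]


x_blocks = [get_block(x, idx) for idx in blocks]
print(x_blocks)  # [[10.0, 30.0], [20.0, 50.0], [40.0]]

# The blocks form a partition: every coordinate index appears exactly once.
covered = sorted(j for idx in blocks for j in idx)
print(covered == list(range(len(x))))  # True
```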

Example (block-separable regularizers):

n=N; \Psi(x)=\|x\|_1=\sum_{i=1}^{n}|x_i|

N=N_1+N_2+\dots+N_n; \Psi(x)=\sum_{i=1}^{n}\|x^{(i)}\|_2, where x^{(i)}\in\mathbb{R}^{N_i} and \|\cdot\|_2 is the standard Euclidean norm.
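Both regularizers are sums of terms that each depend on a single block, which is what block-separability means. The following sketch evaluates them for a small vector; the function names, the sample vector and the block partition are illustrative assumptions.

```python
import math


def l1_norm(x):
    """Psi(x) = ||x||_1: every coordinate is its own block (n = N)."""
    return sum(abs(v) for v in x)


def group_norm(x, blocks):
    """Psi(x) = sum_i ||x^(i)||_2 for blocks given as 0-based index lists."""
    return sum(math.sqrt(sum(x[j] ** 2 for j in idx)) for idx in blocks)


x = [3.0, -4.0, 0.0, 12.0, 5.0]   # arbitrary sample vector
blocks = [[0, 1], [2], [3, 4]]    # assumed partition, N = 2 + 1 + 2

print(l1_norm(x))             # 24.0
print(group_norm(x, blocks))  # 18.0  (5 + 0 + 13)
```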

Algorithm

Consider the optimization problem

\min_{x\in\mathbb{R}^n} f(x),

where f is a convex and smooth function.

Smoothness: By smoothness we mean the following: we assume that the gradient of f is coordinate-wise Lipschitz continuous with constants L_1,L_2,\dots,L_n. That is, we assume that

|\nabla_i f(x+he_i)-\nabla_i f(x)|\leq L_i|h|,

for all x\in\mathbb{R}^n and h\in\mathbb{R}, where e_i is the i-th standard basis vector and \nabla_i denotes the partial derivative with respect to variable x^{(i)}.
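As an illustration of this assumption, consider a quadratic f(x)=\tfrac{1}{2}x^TAx-b^Tx with symmetric A. Then \nabla_i f(x+he_i)-\nabla_i f(x)=A_{ii}h, so the inequality holds with L_i=A_{ii}. The sketch below checks this numerically; the matrix `A`, the vector `b`, the test point and the helper names are illustrative assumptions.

```python
# Assumed test problem: f(x) = 0.5 * x^T A x - b^T x with symmetric A.
A = [[1.0, 0.5], [0.5, 1.0]]
b = [1.5, 1.5]


def partial_grad(x, i):
    """i-th partial derivative of f at x, i.e. (A x - b)_i."""
    return sum(A[i][j] * x[j] for j in range(len(x))) - b[i]


def gradient_change(x, i, h):
    """|grad_i f(x + h e_i) - grad_i f(x)| for the quadratic above."""
    shifted = list(x)
    shifted[i] += h
    return abs(partial_grad(shifted, i) - partial_grad(x, i))


# For a quadratic, the diagonal entries are valid coordinate-wise constants.
L = [A[i][i] for i in range(len(A))]

x = [0.3, -1.2]  # arbitrary test point
checks = [
    gradient_change(x, i, h) <= L[i] * abs(h) + 1e-12
    for i in range(len(A))
    for h in (-2.0, 0.5, 3.0)
]
print(all(checks))  # True
```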

Nesterov, and Richtárik and Takáč, showed that the following algorithm converges to the optimal point:

Input: x_0\in\mathbb{R}^n // starting point
Output: x

set x := x_0
for k := 1, ... do
    choose coordinate i\in\{1,2,\dots,n\} uniformly at random
    update x^{(i)}=x^{(i)}-\frac{1}{L_i}\nabla_i f(x)
end for
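The pseudocode above can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the function name, the fixed iteration count `num_iters` (the pseudocode loops indefinitely), the `seed` parameter and the small test function are all assumptions introduced here.

```python
import random


def random_coordinate_descent(partial_grad, L, x0, num_iters, seed=0):
    """Run num_iters steps of x_i <- x_i - (1 / L_i) * grad_i f(x).

    partial_grad(x, i) must return the i-th partial derivative of f at x,
    and L[i] must be a coordinate-wise Lipschitz constant for coordinate i.
    """
    rng = random.Random(seed)   # fixed seed only for reproducibility
    x = list(x0)
    n = len(x)
    for _ in range(num_iters):
        i = rng.randrange(n)    # coordinate chosen uniformly at random
        x[i] -= partial_grad(x, i) / L[i]
    return x


# Assumed test function: f(x) = (x_1 - 1)^2 + (x_1 - x_2)^2, minimized at (1, 1).
def partial_grad(x, i):
    if i == 0:
        return 2.0 * (x[0] - 1.0) + 2.0 * (x[0] - x[1])
    return 2.0 * (x[1] - x[0])


L = [4.0, 2.0]  # second partial derivatives of the test function
x_final = random_coordinate_descent(partial_grad, L, [0.0, 0.0], num_iters=200)
print(x_final)  # close to [1.0, 1.0]
```

Each step only needs one partial derivative, which is the main attraction of the method when evaluating the full gradient is expensive.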

Convergence rate

Since the iterates of this algorithm are random vectors, a complexity result gives a bound on the number of iterations needed for the method to output an approximate solution with high probability. It was shown that if

k\geq\frac{2nR_L(x_0)}{\epsilon}\log\left(\frac{f(x_0)-f^*}{\epsilon\rho}\right),

where

R_L(x)=\max_y\left\{\max_{x^*\in X^*}\|y-x^*\|_L : f(y)\leq f(x)\right\},

X^* denotes the set of optimal solutions, f^* is the optimal value (f^*=\min_{x\in\mathbb{R}^n}\{f(x)\}), \rho\in(0,1) is a confidence level and \epsilon>0 is the target accuracy, then

\text{Prob}(f(x_k)-f^*>\epsilon)\leq\rho.
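To get a feel for the size of this bound, it can be evaluated for concrete numbers. The sketch below simply plugs values into the formula; the function name is an assumption, and so are the inputs, since R_L(x_0) and f(x_0)-f^* are usually unknown in practice and are treated here as given.

```python
import math


def iteration_bound(n, R_L, initial_gap, eps, rho):
    """Smallest integer k with k >= (2 n R_L / eps) * log(initial_gap / (eps * rho)).

    n           -- number of coordinates
    R_L         -- value of R_L(x_0), assumed known
    initial_gap -- f(x_0) - f^*, assumed known
    eps, rho    -- target accuracy and confidence level
    """
    return math.ceil(2.0 * n * R_L / eps * math.log(initial_gap / (eps * rho)))


# Hypothetical inputs, chosen only for illustration.
k = iteration_bound(n=2, R_L=3.0, initial_gap=1.5, eps=1e-3, rho=0.1)
print(k)  # 115390
```

The bound grows like 1/\epsilon (up to the logarithm) in the accuracy but only logarithmically in 1/\rho, so asking for higher confidence is cheap compared with asking for higher accuracy.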

Example on particular function

The following figure shows how x_k develops during the iterations, in principle. The problem is

f(x)=\tfrac{1}{2}x^T\left(\begin{array}{cc} 1&0.5\\ 0.5&1 \end{array}\right)x-\left(\begin{array}{cc} 1.5&1.5 \end{array}\right)x, \quad x_0=\left(\begin{array}{cc} 0&0 \end{array}\right)^T.
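The iterates for this problem can be generated with a short script. The sketch below records the path of x_k and the objective values. The step sizes use L_i=A_{ii}=1, which follows from the quadratic form; the seed, the iteration count and the variable names are assumptions. Setting the gradient Ax-b to zero gives the minimizer (1,1) with f^*=-1.5.

```python
import random

A = [[1.0, 0.5], [0.5, 1.0]]
b = [1.5, 1.5]


def f(x):
    """Objective f(x) = 0.5 * x^T A x - b^T x."""
    quad = sum(x[i] * A[i][j] * x[j] for i in range(2) for j in range(2))
    return 0.5 * quad - sum(b[i] * x[i] for i in range(2))


def partial_grad(x, i):
    """i-th partial derivative, (A x - b)_i."""
    return sum(A[i][j] * x[j] for j in range(2)) - b[i]


L = [A[0][0], A[1][1]]   # coordinate-wise Lipschitz constants of the quadratic
rng = random.Random(0)   # fixed seed (an assumption) for reproducibility

x = [0.0, 0.0]           # starting point x_0
path = [tuple(x)]
for k in range(100):
    i = rng.randrange(2)             # coordinate chosen uniformly at random
    x[i] -= partial_grad(x, i) / L[i]
    path.append(tuple(x))

values = [f(p) for p in path]
print(path[:4])      # first few iterates; each step moves along one axis
print(x, values[-1])  # close to [1.0, 1.0] and -1.5
```

Every step changes a single coordinate, so the path is a staircase of axis-parallel segments, and the objective value never increases along it.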

Extension to block coordinate setting

One can naturally extend this algorithm from individual coordinates to blocks of coordinates. Assume that we have the space \mathbb{R}^5. This space has 5 coordinate directions, concretely

e_1=(1,0,0,0,0)^T, e_2=(0,1,0,0,0)^T, e_3=(0,0,1,0,0)^T, e_4=(0,0,0,1,0)^T, e_5=(0,0,0,0,1)^T,

in which the random coordinate descent method can move. However, one can group some coordinate directions into blocks, so that instead of those 5 coordinate directions there are 3 block coordinate directions (see image).
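A block version of the method can be sketched as follows: at each step a block is chosen uniformly at random and all of its coordinates are updated together with step size 1/L_i, where L_i is a Lipschitz constant for that block. Everything specific in this sketch is an assumption made for illustration: the grouping of the 5 coordinates into 3 blocks, the quadratic test problem, the function names, and the use of the maximum absolute row sum of the diagonal block as an upper bound on the block Lipschitz constant.

```python
import random


def block_grad(A, b, x, idx):
    """Gradient of f(x) = 0.5 x^T A x - b^T x restricted to the block idx."""
    return [sum(A[j][k] * x[k] for k in range(len(x))) - b[j] for j in idx]


def block_lipschitz_bound(A, idx):
    """Upper bound on the largest eigenvalue of the diagonal block A[idx, idx],
    computed as the maximum absolute row sum of that block."""
    return max(sum(abs(A[j][k]) for k in idx) for j in idx)


def random_block_coordinate_descent(A, b, x0, blocks, num_iters, seed=0):
    rng = random.Random(seed)
    x = list(x0)
    L = [block_lipschitz_bound(A, idx) for idx in blocks]
    for _ in range(num_iters):
        i = rng.randrange(len(blocks))        # block chosen uniformly at random
        g = block_grad(A, b, x, blocks[i])    # gradient computed before updating
        for j, g_j in zip(blocks[i], g):
            x[j] -= g_j / L[i]
    return x


# Assumed test problem in R^5: tridiagonal A (4 on the diagonal, -1 next to it),
# with b chosen so that the minimizer is (1, 1, 1, 1, 1).
n = 5
A = [[4.0 if r == c else (-1.0 if abs(r - c) == 1 else 0.0) for c in range(n)]
     for r in range(n)]
b = [sum(row) for row in A]

# Assumed grouping of the 5 coordinate directions into 3 blocks (0-based).
blocks = [[0, 2], [1, 4], [3]]

x_final = random_block_coordinate_descent(A, b, [0.0] * n, blocks, num_iters=500)
print([round(v, 6) for v in x_final])
```

With blocks of size one this reduces to the coordinate-wise method, so the block setting is a strict generalization.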
