Real NVP Networks

Real NVP networks are generative models specifically designed to encode invertible transformations. They provide efficient sampling and exact probability density evaluation, which can be leveraged for unsupervised learning of high-dimensional distributions through maximum likelihood estimation. This article summarizes the original paper by Dinh et al.¹

Alejandro Sztrajman

June 26, 2021

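A Real NVP (RNVP) network relates a data variable $x$, with probability density $p_X(x)$, to a latent variable $z$, with probability density $p_Z(z)$. It encodes both the mapping $x \rightarrow z$, given by $z = f(x)$, and the inverse mapping $z \rightarrow x$, given by $x = f^{-1}(z) = g(z)$.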

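In this sense, the network can work as a random number generator for $p_X$: by drawing samples from $p_Z$ and mapping them to $x$, we obtain new values of $x$ distributed according to $p_X$. For higher-dimensional data, such as images, this can be leveraged to learn an unknown distribution from unlabeled samples, and to create new, unseen images from the same distribution.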
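The sampling idea can be illustrated with a minimal one-dimensional sketch. The affine map `g` below, the values of `mu` and `sigma`, and the choice of a standard normal for $p_Z$ are assumptions made only for illustration; a trained RNVP network would replace `g` with its learned inverse mapping.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical invertible map, chosen only for illustration (not a trained network):
# z = f(x) = (x - mu) / sigma, so x = g(z) = mu + sigma * z.
mu, sigma = 2.0, 0.5

def g(z):
    """Map latent samples z to data samples x."""
    return mu + sigma * z

z = rng.standard_normal(100_000)  # samples from the assumed base density p_Z = N(0, 1)
x = g(z)                          # samples distributed as p_X = N(mu, sigma^2)

print(x.mean(), x.std())  # close to mu and sigma, up to sampling noise
```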

Network Architecture

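Real NVP networks are built from coupling layers, which are invertible by construction. This makes it possible to evaluate both $z = f(x)$ and $x = g(z) = f^{-1}(z)$ with the same RNVP network.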

The mapping from $\mathbf{x}$ to $\mathbf{z}$ is obtained by splitting the input and applying invertible operations alternately to each component.

In the first coupling layer, the input, split into components $x_1$ and $x_2$, is transformed as follows:

\begin{equation}
\label{eq:fore}
\begin{aligned}
x'_1 &= x_1 \\
x'_2 &= x_2\, e^{s(x_1)} + t(x_1)
\end{aligned}
\end{equation}

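where $s$ and $t$ are functions $\mathbb{R} \rightarrow \mathbb{R}$, each encoded by a network. The transformation in Eq. $\ref{eq:fore}$ is straightforward to invert, which gives the mapping in the opposite direction (from $\mathbf{z}$ to $\mathbf{x}$):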

\begin{equation}
\label{eq:back}
\begin{aligned}
x_1 &= x'_1 \\
x_2 &= \frac{x'_2 - t(x_1)}{e^{s(x_1)}}
\end{aligned}
\end{equation}
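The forward and inverse passes of a single coupling layer can be sketched in a few lines of NumPy. The functions `s` and `t` below are hypothetical toy functions standing in for the scale and translation networks. The inverse never needs to invert `s` or `t`; it only evaluates them at the component that passed through unchanged.

```python
import numpy as np

# Hypothetical stand-ins for the scale and translation networks s and t.
def s(x1):
    return np.tanh(x1)

def t(x1):
    return 0.5 * x1

def coupling_forward(x1, x2):
    """Forward pass: x1 passes through, x2 is scaled and shifted."""
    return x1, x2 * np.exp(s(x1)) + t(x1)

def coupling_inverse(y1, y2):
    """Inverse pass: recover x2 using s and t evaluated at the untouched y1."""
    x1 = y1
    return x1, (y2 - t(x1)) / np.exp(s(x1))

x1, x2 = np.array([0.3, -1.2]), np.array([2.0, 0.7])
y1, y2 = coupling_forward(x1, x2)
r1, r2 = coupling_inverse(y1, y2)

# The round trip should reproduce the input up to floating-point error.
roundtrip_ok = np.allclose(r1, x1) and np.allclose(r2, x2)
print(roundtrip_ok)
```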

For inputs of higher dimensionality, the split of $x$ into $x_1$ and $x_2$ can be done arbitrarily. In practice, this is implemented using two complementary binary masks.
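A masked coupling layer might look like the following sketch. The alternating mask pattern, the dimensionality `D`, and the fixed random linear maps used in place of the `s` and `t` networks are all assumptions made for illustration. Entries where the mask is 1 pass through unchanged and feed `s` and `t`; entries where the mask is 0 are transformed. Applying a second layer with the complementary mask transforms the remaining entries.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4

# Hypothetical stand-ins for the s and t networks (fixed random linear maps).
W_s = 0.1 * rng.normal(size=(D, D))
W_t = 0.1 * rng.normal(size=(D, D))

def s(h):
    return np.tanh(h @ W_s)

def t(h):
    return h @ W_t

def masked_forward(x, mask):
    h = mask * x  # entries with mask == 1 pass through and feed s and t
    return h + (1 - mask) * (x * np.exp(s(h)) + t(h))

def masked_inverse(y, mask):
    h = mask * y  # the pass-through entries are identical in x and y
    return h + (1 - mask) * (y - t(h)) * np.exp(-s(h))

mask_a = np.array([1.0, 0.0, 1.0, 0.0])
mask_b = 1.0 - mask_a  # complementary mask

x = rng.normal(size=(5, D))
z = masked_forward(masked_forward(x, mask_a), mask_b)
x_rec = masked_inverse(masked_inverse(z, mask_b), mask_a)
```

Note that the inverse applies the layers in reverse order, each with the same mask it used in the forward direction.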

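Since the first coupling layer leaves $x_1$ unchanged, a second coupling layer is applied, which transforms $x'_1$ and leaves $x'_2$ unmodified: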

\begin{equation}
\label{eq:fore2}
\begin{aligned}
z_1 &= x'_1\, e^{s'(x'_2)} + t'(x'_2) \\
z_2 &= x'_2
\end{aligned}
\end{equation}

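where $s'$ and $t'$ are a second pair of functions. The composition of Eqs. $\ref{eq:fore}$ and $\ref{eq:fore2}$ defines the forward mapping $\mathbf{z} = f(\mathbf{x})$, and inverting the two layers in reverse order gives $\mathbf{x} = g(\mathbf{z})$.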
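For a two-dimensional input, the full two-layer mapping and its inverse can be sketched as follows. The four functions `s`, `t`, `s2` and `t2` (the last two playing the role of $s'$ and $t'$) are arbitrary toy choices, not the networks of the original paper.

```python
import numpy as np

# Hypothetical toy functions standing in for the four networks.
def s(u):  return np.tanh(u)
def t(u):  return 0.5 * u
def s2(u): return 0.3 * np.sin(u)  # plays the role of s'
def t2(u): return u ** 2           # plays the role of t'

def f(x1, x2):
    """Forward mapping z = f(x): two coupling layers in sequence."""
    y1, y2 = x1, x2 * np.exp(s(x1)) + t(x1)         # layer 1 transforms x2
    z1, z2 = y1 * np.exp(s2(y2)) + t2(y2), y2       # layer 2 transforms x'_1
    return z1, z2

def g(z1, z2):
    """Inverse mapping x = g(z): undo layer 2 first, then layer 1."""
    y2 = z2
    y1 = (z1 - t2(y2)) * np.exp(-s2(y2))
    x1 = y1
    x2 = (y2 - t(x1)) * np.exp(-s(x1))
    return x1, x2

x1, x2 = np.array([0.3, -1.2, 2.0]), np.array([1.5, 0.7, -0.4])
z1, z2 = f(x1, x2)
r1, r2 = g(z1, z2)
```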

Density Estimation and Training

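The latent variable $z$ follows a known density $p_Z(z)$, while the data $x$ follows a density $p_X(x)$. The RNVP network maps between $p_X$ and $p_Z$; in the example used here, $p_Z$ is mapped to a distribution $p_X$ in the shape of two moons.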

The mapping between the two probability distributions is governed by the change-of-variables formula, which follows from the conservation of probability in a differential element. For univariate distributions, this can be expressed as:

\begin{equation}
\label{eq:prob}
|p_X(x)\,dx| = |p_Z(z)\,dz| \qquad\rightarrow\qquad p_X(x) = p_Z(z)\left|\frac{dz}{dx}\right| = p_Z(z)\left|\frac{df(x)}{dx}\right|
\end{equation}
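As a quick worked example of the univariate formula, assume (purely for illustration) that $p_Z$ is a standard normal and that the mapping is affine, $z = f(x) = (x - \mu)/\sigma$ with $\sigma > 0$. Then $df/dx = 1/\sigma$, and the formula gives:

\begin{equation*}
p_X(x) = p_Z\!\left(\frac{x-\mu}{\sigma}\right)\frac{1}{\sigma} = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
\end{equation*}

which is the density of a normal distribution with mean $\mu$ and standard deviation $\sigma$, as expected.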

Here, $f(x)$ is one of the two functions encoded by the RNVP network (the other being its inverse, $g$). For multivariate distributions, the derivative becomes the determinant of the Jacobian matrix of the transformation:

\begin{equation}
\label{eq:prob2}
p_X(x) = p_Z(z)\left|\det J\right| = p_Z(z)\left|\det(J_1)\,\det(J_2)\right|
\end{equation}

Here, $J_1$ and $J_2$ are the Jacobian matrices of the two coupling layers. $J_1$ follows from Eq. $\ref{eq:fore}$:

\begin{equation*}
J_1 = \begin{bmatrix} \dfrac{\partial x'_1}{\partial x_1} & \dfrac{\partial x'_1}{\partial x_2} \\[2ex] \dfrac{\partial x'_2}{\partial x_1} & \dfrac{\partial x'_2}{\partial x_2} \end{bmatrix} = \begin{bmatrix} 1 & 0 \\[1ex] \dfrac{\partial x'_2}{\partial x_1} & e^{s(x_1)} \end{bmatrix}
\end{equation*}

$J_2$ is obtained analogously from Eq. $\ref{eq:fore2}$. The determinant is easy to compute, since both matrices are triangular:

\begin{equation}
\label{eq:prob3}
\det(J) = \det(J_1)\,\det(J_2) = e^{s(x_1)}\,e^{s'(x'_2)}
\end{equation}
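The closed-form determinant can be checked numerically against a finite-difference Jacobian. The sketch below uses hypothetical toy functions for the four networks (`s2` and `t2` play the role of $s'$ and $t'$). The second scale function is evaluated at the intermediate value $x'_2$ produced by the first layer, since that is the input the second layer actually receives.

```python
import numpy as np

# Hypothetical toy functions standing in for the four networks.
def s(u):  return np.tanh(u)
def t(u):  return 0.5 * u
def s2(u): return 0.3 * np.sin(u)  # plays the role of s'
def t2(u): return u ** 2           # plays the role of t'

def f(x):
    """Two coupling layers applied to a 2D point x = (x1, x2)."""
    x1, x2 = x
    y1, y2 = x1, x2 * np.exp(s(x1)) + t(x1)
    return np.array([y1 * np.exp(s2(y2)) + t2(y2), y2])

def numerical_jacobian(func, x, eps=1e-6):
    """Central finite-difference estimate of the 2x2 Jacobian of func at x."""
    J = np.zeros((2, 2))
    for j in range(2):
        dx = np.zeros(2)
        dx[j] = eps
        J[:, j] = (func(x + dx) - func(x - dx)) / (2 * eps)
    return J

x = np.array([0.4, -0.9])
y2 = x[1] * np.exp(s(x[0])) + t(x[0])            # intermediate value x'_2

det_numeric = np.linalg.det(numerical_jacobian(f, x))
det_closed = np.exp(s(x[0]) + s2(y2))            # exp(s(x_1)) * exp(s'(x'_2))
print(det_numeric, det_closed)
```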

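The change-of-variables formula, Eq. $\ref{eq:prob}$, expresses $p_X(x)$ in terms of $p_Z(z)$. Since the distribution of $z$ is known, $p_Z(z)$ can be evaluated directly: given $x$, we compute $z$ through $z = f(x)$, and from it obtain the exact value of $p_X(x)$ at $x$.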

Note that we use $x^{(i)}$ to refer to different samples of $x$. We reserve the notation $x_1$ and $x_2$ for the two branches of input/output of the coupling layers.

To fit $p_X(x)$ to the data, we maximize the likelihood $p(x^{(1)}, \ldots, x^{(N)} | \theta)$ of the training samples $x^{(i)}$ given the parameters of the model (in this case, the network weights). For independent samples this can be expressed as:

\begin{equation}
\label{eq:likelihood}
p(x^{(1)}, \ldots, x^{(N)} | \theta) = \prod_{i=1}^N p(x^{(i)}|\theta)
\end{equation}

Taking the negative logarithm of the likelihood and combining Eqs. $\ref{eq:prob2}$, $\ref{eq:prob3}$ and $\ref{eq:likelihood}$, the training loss becomes:

\begin{align*}
\text{Loss} &= -\log\left(\prod_{i=1}^N p_X(x^{(i)}|\theta)\right) = -\log\left(\prod_{i=1}^N p_Z(z^{(i)}|\theta)\,\left|\det(J^{(i)})\right|\right) \\
&= -\sum_{i=1}^N \log\left(p_Z(z^{(i)}|\theta)\right) - \sum_{i=1}^N \log\left|\det(J^{(i)})\right| \\
&= -\sum_{i=1}^N \log\left(p_Z(z^{(i)}|\theta)\right) - \sum_{i=1}^N \left(s(x^{(i)}_1) + s'({x'}^{(i)}_2)\right)
\end{align*}

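For each training sample $x^{(i)}$, a forward pass through the network gives $z^{(i)}$, from which $p_Z(z^{(i)})$ can be evaluated directly. The remaining terms are the outputs of the $s$ and $s'$ operations within the coupling layers, which only require a simple inference of their corresponding networks.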
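The loss can be sketched in NumPy for the two-layer, two-dimensional case. This is a minimal sketch under stated assumptions: $p_Z$ is taken to be a 2D standard normal, and `s`, `t`, `s2`, `t2` are hypothetical toy functions rather than trainable networks, so the code only evaluates the loss and does not optimize it.

```python
import numpy as np

# Hypothetical toy functions standing in for the four networks.
def s(u):  return np.tanh(u)
def t(u):  return 0.5 * u
def s2(u): return 0.3 * np.sin(u)  # plays the role of s'
def t2(u): return u ** 2           # plays the role of t'

def negative_log_likelihood(x):
    """Loss for a batch x of shape (N, 2), assuming p_Z is a 2D standard normal."""
    x1, x2 = x[:, 0], x[:, 1]
    y1, y2 = x1, x2 * np.exp(s(x1)) + t(x1)        # first coupling layer
    z1, z2 = y1 * np.exp(s2(y2)) + t2(y2), y2      # second coupling layer
    log_pz = -0.5 * (z1 ** 2 + z2 ** 2) - np.log(2 * np.pi)
    log_det = s(x1) + s2(y2)                       # log-determinant of the Jacobian
    return -np.sum(log_pz) - np.sum(log_det)

rng = np.random.default_rng(0)
batch = rng.normal(size=(8, 2))
loss = negative_log_likelihood(batch)
print(loss)
```

In an actual training loop the toy functions would be replaced by networks whose weights are updated to minimize this quantity, typically with an automatic differentiation framework.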

References

[1] L. Dinh, J. Sohl-Dickstein, S. Bengio (2017). Density Estimation Using Real NVP. International Conference on Learning Representations.

$p_X(x)$ corresponds to the probability density encoded by the RNVP network. Before training, this PDF will not be related to the actual distribution of the data samples $x$, but its evaluation will still be exact.
