Real NVP networks are designed to encode an invertible mapping between samples from two probability distributions, and provide a simple way to compute its inverse. In the simple 1D example of the diagram, the network transforms samples $x$, which follow a complex distribution $p_X(x)$, into values $z$ following a simple Gaussian distribution $p_Z(z)$. This transformation can be expressed as a function $z = f(x)$, encoded by the network. The architecture also allows us to perform the transformation in the opposite direction, $z \to x$, effectively providing the inverse function $x = f^{-1}(z)$.
This two-way transformation can be used to create a random number generator for the arbitrary distribution $p_X(x)$, simply by drawing samples $z$ from a Gaussian and transforming them into samples $x$. As we will see, the RNVP network can be trained using only samples of $x$, without requiring knowledge of the underlying distribution $p_X(x)$. For higher-dimensional data, such as images, this can be leveraged to learn an unknown distribution from unlabeled samples and to generate new, unseen images from the same distribution.
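The generator idea can be sketched in a few lines. The snippet below is only an illustration under a strong assumption: instead of a trained RNVP network, it uses the hand-picked invertible map `f(x) = log(x)` (so the target distribution is log-normal). The names `f`, `f_inv`, `z_samples`, and `x_samples` are hypothetical.

```python
import math
import random
import statistics

def f(x):
    """Forward map x -> z (hand-picked stand-in for a trained network)."""
    return math.log(x)

def f_inv(z):
    """Inverse map z -> x."""
    return math.exp(z)

rng = random.Random(0)  # fixed seed for reproducibility

# Draw Gaussian samples and push them through the inverse map.
z_samples = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
x_samples = [f_inv(z) for z in z_samples]

# Applying the forward map to the generated samples recovers the Gaussian draws.
max_roundtrip_error = max(abs(f(x) - z) for x, z in zip(x_samples, z_samples))

print("median of x samples:", statistics.median(x_samples))
print("max round-trip error:", max_roundtrip_error)
```

With a trained RNVP, `f` and `f_inv` would be replaced by the two evaluation directions of the network; the sampling logic stays the same.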
Real NVPs are composed of coupling layers, which perform invertible operations. The combination of these invertible layers results in an overall invertible transformation encoded by the network. In practice, the architecture is designed to allow evaluations in both directions; that is, we can compute $z = f(x)$ and its inverse $x = f^{-1}(z)$ with the same RNVP network.
We'll analyze the implementation of an RNVP model for the simple case of a 2-dimensional input. As shown in the Figure, the transformation from $x = (x_1, x_2)$ to $z = (z_1, z_2)$ is obtained by splitting the input and applying invertible operations alternately to each component.
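In the first coupling layer, $x_1$ is left unchanged and $x_2$ is transformed as follows:
$$
\begin{aligned}
y_1 &= x_1 \\
y_2 &= x_2 \, e^{s_1(x_1)} + t_1(x_1)
\end{aligned}
$$
where $s_1$ and $t_1$ are arbitrary functions, in this case from $\mathbb{R} \to \mathbb{R}$. In practice, we will use neural networks to encode these functions, and their weights will be determined during the RNVP training. These equations define an invertible transformation. Its inverse, which never requires inverting $s_1$ or $t_1$ themselves, can be used to evaluate the RNVP network in the opposite direction (from $z$ to $x$):
$$
\begin{aligned}
x_1 &= y_1 \\
x_2 &= \left( y_2 - t_1(y_1) \right) e^{-s_1(y_1)}
\end{aligned}
$$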
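However, the transformation provided by the first coupling layer lacks flexibility, since it leaves $x_1$ unchanged. The solution is to apply a second, complementary coupling layer, which takes the output $(y_1, y_2)$ of the first layer, transforms $y_1$, and leaves $y_2$ unmodified:
$$
\begin{aligned}
z_1 &= y_1 \, e^{s_2(y_2)} + t_2(y_2) \\
z_2 &= y_2
\end{aligned}
$$
with new functions/networks $s_2$ and $t_2$. Combining the forward and inverse equations of the two coupling layers, we can evaluate the RNVP network in both directions: $z = f(x)$ and $x = f^{-1}(z)$.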
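The two coupling layers and their inverses can be sketched as follows. This is a minimal illustration, not a trained model: the scale and translation functions `s1`, `t1`, `s2`, `t2` are hypothetical closed-form stand-ins for the neural networks, chosen only so the example runs without a deep learning library.

```python
import math

# Hypothetical stand-ins for the learned scale/translation networks.
def s1(u): return 0.5 * math.tanh(u)
def t1(u): return 0.3 * u
def s2(u): return -0.4 * math.tanh(u)
def t2(u): return math.sin(u)

def forward(x1, x2):
    """Evaluate x -> z through both coupling layers."""
    # Layer 1: x1 passes through, x2 is scaled and shifted using x1.
    y1 = x1
    y2 = x2 * math.exp(s1(x1)) + t1(x1)
    # Layer 2: y2 passes through, y1 is scaled and shifted using y2.
    z1 = y1 * math.exp(s2(y2)) + t2(y2)
    z2 = y2
    return z1, z2

def inverse(z1, z2):
    """Evaluate z -> x by undoing the layers in reverse order."""
    # Undo layer 2 (y2 is available directly from z2).
    y2 = z2
    y1 = (z1 - t2(y2)) * math.exp(-s2(y2))
    # Undo layer 1 (x1 is available directly from y1).
    x1 = y1
    x2 = (y2 - t1(x1)) * math.exp(-s1(x1))
    return x1, x2

x = (0.7, -1.2)
z = forward(*x)
x_back = inverse(*z)
print("x      =", x)
print("z      =", z)
print("x_back =", x_back)
```

Note that `inverse` only ever calls `s` and `t` in their forward direction, which is why these functions can be arbitrary networks.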
During a typical training, the RNVP network learns to map samples from a known simple probability density $p_Z(z)$, usually a Gaussian distribution, to samples from an unknown distribution $p_X(x)$. Once the training is complete, we can generate new unseen samples from the distribution $p_X(x)$ by drawing random samples from $p_Z(z)$ and transforming them with the RNVP. The diagram illustrates the transformation of 2D samples from a Gaussian to a distribution in the shape of two moons.
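The mapping between the two probability distributions obeys the change-of-variables formula, which follows from the conservation of probability in a differential area. For univariate distributions, this can be expressed as:
$$
p_X(x) \, |dx| = p_Z(z) \, |dz| \quad \Longrightarrow \quad p_X(x) = p_Z\big(f(x)\big) \left| \frac{df}{dx} \right|
$$
where $z = f(x)$ is one of the two functions encoded by the RNVP network (the other being its inverse). For multivariate distributions, the derivative becomes the determinant of the Jacobian matrix of the transformation:
$$
p_X(x) = p_Z\big(f(x)\big) \left| \det \frac{\partial z}{\partial x} \right| = p_Z\big(f(x)\big) \left| \det \frac{\partial z}{\partial y} \right| \left| \det \frac{\partial y}{\partial x} \right|
$$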
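The univariate change-of-variables formula can be checked numerically. The sketch below assumes a hand-picked map `f(x) = log(x)` rather than a trained network, builds the density of `x` from a standard normal density of `z`, and verifies with a simple midpoint rule that the result integrates to approximately one. All names are hypothetical.

```python
import math

def f(x):
    """Assumed forward map x -> z."""
    return math.log(x)

def df_dx(x):
    """Derivative of the assumed forward map."""
    return 1.0 / x

def p_z(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def p_x(x):
    """Density of x obtained from the change-of-variables formula."""
    return p_z(f(x)) * abs(df_dx(x))

# Midpoint-rule integral of p_x over (0, 60); the tail beyond 60 is negligible.
h = 0.001
total = sum(p_x((k + 0.5) * h) for k in range(60_000)) * h
print("integral of p_x:", total)
```

The factor `abs(df_dx(x))` is what keeps the transformed density normalized; dropping it would leave a function that no longer integrates to one.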
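The determinant splits into one factor per coupling layer, because the Jacobian of a composition is the product of the Jacobians of its layers. For the first coupling layer, with output $y_1 = x_1$ and $y_2 = x_2 \, e^{s_1(x_1)} + t_1(x_1)$, the Jacobian matrix collects the partial derivatives:
$$
\frac{\partial y}{\partial x} =
\begin{pmatrix}
1 & 0 \\
x_2 \, e^{s_1(x_1)} s_1'(x_1) + t_1'(x_1) & e^{s_1(x_1)}
\end{pmatrix}
$$
with a similar result for $\partial z / \partial y$ from the second coupling layer. The determinants are easy to compute because these matrices are triangular:
$$
\det \frac{\partial y}{\partial x} = e^{s_1(x_1)}, \qquad \det \frac{\partial z}{\partial y} = e^{s_2(y_2)}
$$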
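The triangular structure can be verified with finite differences. The following sketch assumes hypothetical closed-form functions `s1` and `t1` in place of trained networks, and compares a numerically estimated Jacobian determinant of the first coupling layer with the closed-form value `exp(s1(x1))`.

```python
import math

# Hypothetical stand-ins for the learned scale/translation networks.
def s1(u): return 0.5 * math.tanh(u)
def t1(u): return 0.3 * u

def layer1(x1, x2):
    """First coupling layer: x1 unchanged, x2 scaled and shifted."""
    return x1, x2 * math.exp(s1(x1)) + t1(x1)

def numerical_jacobian(fn, x1, x2, h=1e-6):
    """Central finite differences; J[i][j] approximates d(out_i)/d(in_j)."""
    plus, minus = fn(x1 + h, x2), fn(x1 - h, x2)
    col1 = [(plus[i] - minus[i]) / (2.0 * h) for i in range(2)]
    plus, minus = fn(x1, x2 + h), fn(x1, x2 - h)
    col2 = [(plus[i] - minus[i]) / (2.0 * h) for i in range(2)]
    return [[col1[0], col2[0]], [col1[1], col2[1]]]

x1, x2 = 0.7, -1.2
J = numerical_jacobian(layer1, x1, x2)
det_numeric = J[0][0] * J[1][1] - J[0][1] * J[1][0]
det_analytic = math.exp(s1(x1))

print("upper-right entry:", J[0][1])  # zero, since y1 does not depend on x2
print("numeric det :", det_numeric)
print("analytic det:", det_analytic)
```

The lower-left entry of the Jacobian can be arbitrarily complicated, but it never enters the determinant, so the derivatives of `s1` and `t1` are not needed.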
Hence, the multivariate change-of-variables formula gives us a computationally cheap way of computing $p_X(x)$ if we know $p_Z(z)$, the probability density of its corresponding point in $z$-space. Notice that $p_Z$ is a simple and known distribution, and that given an arbitrary $x$, we can find its corresponding $z$ simply by evaluating the RNVP network: $z = f(x)$. Thus we now have a recipe to compute $p_X(x)$ for any sample $x$.
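This recipe can be written as a short log-density function. The sketch below is an illustration under assumptions: `s1`, `t1`, `s2`, `t2` are hypothetical closed-form stand-ins for trained networks, and $p_Z$ is taken to be a 2D standard normal. As a sanity check, it integrates the resulting density on a grid to confirm that it is approximately normalized.

```python
import math

# Hypothetical stand-ins for the learned scale/translation networks.
def s1(u): return 0.5 * math.tanh(u)
def t1(u): return 0.3 * u
def s2(u): return -0.4 * math.tanh(u)
def t2(u): return math.sin(u)

def log_p_z(z1, z2):
    """Log-density of a 2D standard normal."""
    return -0.5 * (z1 * z1 + z2 * z2) - math.log(2.0 * math.pi)

def log_p_x(x1, x2):
    """log p_X(x) = log p_Z(f(x)) + log|det J|, with log|det J| = s1 + s2."""
    y1, y2 = x1, x2 * math.exp(s1(x1)) + t1(x1)
    z1, z2 = y1 * math.exp(s2(y2)) + t2(y2), y2
    log_det = s1(x1) + s2(y2)
    return log_p_z(z1, z2) + log_det

# Midpoint-rule integral of p_X over [-12, 12] x [-16, 16].
h = 0.1
total = 0.0
for i in range(-120, 120):
    for j in range(-160, 160):
        total += math.exp(log_p_x((i + 0.5) * h, (j + 0.5) * h))
total *= h * h
print("integral of p_X:", total)
```

Working with log-densities rather than densities is a common choice here, since the log-determinant reduces to a plain sum of the scale outputs.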
The unsupervised training of RNVP networks leverages this capacity to compute the probability density to perform maximum likelihood estimation, which seeks to maximize the likelihood of observing a series of data samples given the parameters $\theta$ of the model (in this case, the network weights). For $N$ independent samples $x^{(1)}, \dots, x^{(N)}$ this can be expressed as:
$$
\theta^{*} = \arg\max_{\theta} \prod_{i=1}^{N} p_X\big(x^{(i)}; \theta\big)
$$
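In practice, the likelihood is maximized by minimizing a loss defined as the negative log-likelihood. Combining the change-of-variables formula, the Jacobian determinants of the coupling layers, and the likelihood above, the training loss becomes:
$$
\mathcal{L}(\theta) = -\sum_{i=1}^{N} \left[ \log p_Z\big(f(x^{(i)})\big) + s_1\big(x_1^{(i)}\big) + s_2\big(y_2^{(i)}\big) \right]
$$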
This expression is easy and fast to compute. For each sample $x^{(i)}$, we can obtain the corresponding $z^{(i)}$ by direct evaluation of the RNVP network, and then compute $p_Z(z^{(i)})$, which is chosen as a standard normal distribution. The second term, the log-determinant of the Jacobian, only requires evaluating the $s_1$ and $s_2$ operations within the coupling layers, which amounts to a simple inference pass through their corresponding networks.
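The loss can be sketched for a small batch as follows. This is not a training loop: the functions `s1`, `t1`, `s2`, `t2` are fixed, hypothetical stand-ins for the networks, so nothing is optimized. The example only shows how the two terms are accumulated, and that samples generated by the model itself score a lower loss than the same samples shifted away from it.

```python
import math
import random

# Hypothetical stand-ins for the learned scale/translation networks.
def s1(u): return 0.5 * math.tanh(u)
def t1(u): return 0.3 * u
def s2(u): return -0.4 * math.tanh(u)
def t2(u): return math.sin(u)

def forward(x1, x2):
    """Return z = f(x) and log|det J| = s1(x1) + s2(y2)."""
    y1, y2 = x1, x2 * math.exp(s1(x1)) + t1(x1)
    z1, z2 = y1 * math.exp(s2(y2)) + t2(y2), y2
    return (z1, z2), s1(x1) + s2(y2)

def inverse(z1, z2):
    """Return x = f^{-1}(z)."""
    y2 = z2
    y1 = (z1 - t2(y2)) * math.exp(-s2(y2))
    return y1, (y2 - t1(y1)) * math.exp(-s1(y1))

def nll_loss(batch):
    """Negative log-likelihood of a batch of 2D samples."""
    total = 0.0
    for x1, x2 in batch:
        (z1, z2), log_det = forward(x1, x2)
        log_pz = -0.5 * (z1 * z1 + z2 * z2) - math.log(2.0 * math.pi)
        total -= log_pz + log_det
    return total

rng = random.Random(0)
# Samples that follow the model's own distribution p_X.
model_samples = [inverse(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))
                 for _ in range(1000)]
# The same samples displaced far from the model's high-density region.
shifted_samples = [(x1 + 5.0, x2 + 5.0) for x1, x2 in model_samples]

loss_model = nll_loss(model_samples)
loss_shifted = nll_loss(shifted_samples)
print("NLL per sample (model samples):  ", loss_model / 1000)
print("NLL per sample (shifted samples):", loss_shifted / 1000)
```

In an actual training setup, `s1`, `t1`, `s2`, and `t2` would be parameterized networks, and a gradient-based optimizer would adjust their weights to lower `nll_loss` on the data.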
 L. Dinh, J. Sohl-Dickstein, S. Bengio (2017). Density Estimation Using Real NVP. International Conference on Learning Representations.