
I am trying to apply a VAE to a simple toy example to familiarize myself with its properties. However, I am stuck training the model: the total loss and the reconstruction error do not seem to decrease.

The toy example is as follows:

  1. Randomly generate 5000 observations from a 2-dimensional multivariate normal distribution.
  2. Apply the transformation f: [x, y] --> [sin(x), sin(y)].
  3. Train a VAE with 1 hidden layer of 5 neurons in both the encoder and the decoder. The VAE has 2 latent variables.

In this example, I am not able to decrease the training loss to a sufficiently low level, and the reconstruction is also messy.

I have made several attempts:

  1. Increase the number of hidden layers to 2 and then 3 (this does not help) --> I think the problem is not the complexity of the model.
  2. Check the network on MNIST (the result is comparable with examples I found in other sources) --> the model design is right.
  3. Delete the KL divergence from the loss function (the model then reconstructs well) --> the model design is right.
  4. Balance the weight (beta) on the KL divergence; see the sketch after this list --> when the beta on the KL divergence is low, the model reconstructs well but the latent space is too far from standard normal; when the beta is high, it cannot reconstruct well but the latent space behaves well.
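
To make attempt 4 concrete: beta is just the penalty argument of the loss function defined in the code below. One hedged variant I considered is annealing beta instead of fixing it (the schedule values here are illustrative assumptions, not tuned):

def kl_weight(epoch, warmup=100, beta_max=1.0):
    # Linear KL annealing: start from pure reconstruction (beta = 0)
    # and ramp up to beta_max over the first `warmup` epochs.
    return min(beta_max, beta_max * epoch / warmup)

# usage inside the training loop (simple_vae_loss is defined below):
# loss = simple_vae_loss(x, recon, mu, logvar, penalty=kl_weight(epoch))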

I now suspect several potential reasons, but I cannot tell which one is the actual cause.

  1. It seems that I need to find a balance between the weights on the KL divergence and the reconstruction loss.
  2. Is it appropriate to use MSE loss + KL divergence as the loss function? (See the sketch after this list.)
  3. Does the VAE perform poorly in low dimensions because the ELBO is not tight enough?
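
On question 2: MSE + KL is the standard pairing when the decoder is modeled as a Gaussian with fixed variance, because up to an additive constant the summed MSE is exactly that Gaussian's negative log-likelihood. A minimal sketch making the correspondence explicit (the sigma parameter is mine, added for illustration):

import torch

def gaussian_recon_nll(recon, real, sigma=1.0):
    # Negative log-likelihood of real under N(recon, sigma^2 * I),
    # dropping the constant 0.5*log(2*pi*sigma^2) per dimension.
    # With sigma**2 = 0.5 this equals the summed MSE exactly; a smaller
    # sigma up-weights reconstruction relative to the KL term.
    return ((recon - real) ** 2).sum() / (2 * sigma ** 2)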

Could anyone help?

The code is attached.

This part defines the model

import torch
import torch.nn as nn
import numpy as np


class VAE_Encoder(nn.Module):
    
    def __init__(self,input_size,hidden_size_list,latent_size):

        """
        The class is the builder of the encoder part of VAE. It does not need to be directly called.

        :param input_size: int
        :param hidden_size_list: list(int)
        :param latent_size: int
        """

        super().__init__()
        
        encoder_size = [input_size]+hidden_size_list
        encoder_layers = []
        
        for in_size,out_size in zip(encoder_size[:-1],encoder_size[1:]):
            
            encoder_layers.append(nn.Linear(in_size,out_size))
            encoder_layers.append(nn.ReLU())
            
        self.encoder = nn.Sequential(*encoder_layers)
        
        self.encoder_mu = nn.Linear(encoder_size[-1],latent_size)
        self.encoder_logvar = nn.Linear(encoder_size[-1],latent_size)
        
    
    def encode(self,x):
        
        return self.encoder(x)
     
    
    def encode_gaussian_param(self,encode_x):
        
        return self.encoder_mu(encode_x),self.encoder_logvar(encode_x)
     
    
    def reparametrize(self,mu,logvar):
        
        std = torch.exp(0.5*logvar)
        eps = torch.randn_like(std)
        
        return mu+eps*std
    
    
    def forward(self,x):
        
        encode_x = self.encode(x)
        mu,logvar = self.encode_gaussian_param(encode_x)
        z = self.reparametrize(mu,logvar)
        
        return z,mu,logvar


class VAE_Decoder(nn.Module):
    
    
    def __init__(self,input_size,hidden_size_list,latent_size):

        """
        The class is the builder of the decoder part of VAE. It does not need to be directly called.

        :param input_size: int
        :param hidden_size_list: list(int)
        :param latent_size: int

        """
        
        
        super().__init__()
        
        decoder_size = [latent_size] + hidden_size_list
        decoder_layers = []
        
        for in_size,out_size in zip(decoder_size[:-1],decoder_size[1:]):
            
            decoder_layers.append(nn.Linear(in_size,out_size))
            decoder_layers.append(nn.ReLU())
        
        decoder_layers.append(nn.Linear(decoder_size[-1],input_size))
        self.decoder = nn.Sequential(*decoder_layers)
    
    def forward(self,z):
        
        return self.decoder(z)


class VAE(nn.Module):
    
    
    def __init__(self,input_size,encoder_size,latent_size,decoder_size=None):


        """
        The class builds the whole VAE. It consists of a encoder model and a decoder model. 
        The user has flexibility to choose the number of layers in each part of the model by
        setting the encoder size and decoder size.

        :param input_size: int
        :param encoder_size: list(int)
        :param latent_size: int
        :param decoder_size: list(int)

        """
        
        super().__init__()
        
        if decoder_size is None:
            
            decoder_size = encoder_size[::-1]
        
        self.encoder = VAE_Encoder(input_size,encoder_size,latent_size)
        self.decoder = VAE_Decoder(input_size,decoder_size,latent_size)
    
    
    def decode(self,z):
        
        return self.decoder(z)
    
    
    def forward(self,x):
        
        
        z,mu,logvar = self.encoder(x)
        x = self.decoder(z)
        
        return x,mu,logvar


def simple_vae_loss(real,recon,mu,logvar,penalty=1):
    
    # Reconstruction term: summed squared error over all elements.
    MSE = nn.functional.mse_loss(recon,real,reduction="sum")
    # Closed-form KL(q(z|x) || N(0, I)) = -0.5 * sum(1 + log(sigma^2) - mu^2 - sigma^2).
    KLD = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    
    return MSE+penalty*KLD

This part generates the toy example


def generate_value(x,y):
    
    # Apply the transformation f: [x, y] --> [sin(x), sin(y)].
    v1 = np.sin(x)
    v2 = np.sin(y)
    
    return (v1,v2)


rand_number = torch.randn((5000,2))

x = rand_number.numpy()[:,0]
y = rand_number.numpy()[:,1]

new_value = generate_value(x,y)


new_x = new_value[0].reshape((5000,1))
new_y = new_value[1].reshape((5000,1))
new_data = np.concatenate((new_x,new_y),axis=1)
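
The training loop itself is the standard one; here is a minimal sketch of what I run on this data (the optimizer, learning rate, and epoch count are placeholders rather than my exact settings):

model = VAE(input_size=2, encoder_size=[5], latent_size=2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
data = torch.as_tensor(new_data, dtype=torch.float32)

for epoch in range(500):
    optimizer.zero_grad()
    recon, mu, logvar = model(data)                  # full-batch for simplicity
    loss = simple_vae_loss(data, recon, mu, logvar)  # penalty defaults to 1
    loss.backward()
    optimizer.step()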
If I understand your model correctly, your last decoder layer doesn't have an activation. Try changing your last activation to tanh, which outputs in the range of -1 to 1 (the range you want). And then I'd use MSE for the loss. Maybe that helps! – Theodor Peifer
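
For concreteness, this suggestion would amount to a one-line change at the end of VAE_Decoder.__init__ (a sketch of the commenter's proposal, not part of the original code):

        decoder_layers.append(nn.Linear(decoder_size[-1],input_size))
        decoder_layers.append(nn.Tanh())  # squashes outputs into [-1, 1], the range of sin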

1 Answer


Just to suggest a potential reason for this result.

In the paper introducing Hyperspherical Variational Auto-Encoders, the authors point out a problem with using a Gaussian prior in low-dimensional settings.

The problem is called "origin gravity". Quoting the paper:

In low dimensions, the Gaussian density presents a concentrated probability mass around the origin, encouraging points to cluster in the center. This is particularly problematic when the data is divided into multiple clusters. Although an ideal latent space should separate clusters for each class, the normal prior will encourage all the cluster centers towards the origin. An ideal prior would only stimulate the variance of the posterior without forcing its mean to be close to the center. A prior satisfying these properties is a uniform over the entire space. Such a uniform prior, however, is not well defined on the hyperplane.

To verify whether this is the case, I generated 100 synthetic data points from the VAE model. The most interesting finding is that all the latent variables concentrate at the origin (0, 0).
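
One way to reproduce this check, as a sketch (it assumes the trained model and the (5000, 2) training tensor data from the question):

with torch.no_grad():
    z, mu, logvar = model.encoder(data[:100])  # encode 100 points
print(mu.mean(dim=0), mu.abs().max())          # the posterior means all sit near (0, 0)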

"If we decrease the weight on the KL divergence, the latent variable starts to spread out. This is consistent with the origin gravity. When the gaussian prior is strong, the latent variable starts to cluster around origin. In this case, we have to reduce the influence of gaussian prior by reducing the weight on KL divergence."

I guess this is one of the reasons.