Contextual RNN-GANs for Abstract Reasoning Diagram Generation

Arnab Ghosh*, Viveka Kulharia*, Amitabha Mukerjee, Vinay Namboodiri, Mohit Bansal

*Equal contribution

Motivation

An Example with an Explanation

An explanation of the ground truth: the dashed line first appears on the left, then on the right, and finally on both sides, while also changing from single to double; the answer should therefore have double dashed lines on both sides. In the corners, the number of slanted lines increases by one after every two images, so the answer should have four slanted lines in each corner.

Some More Example Problems From DAT-DAR Dataset


The Model


Contextual RNN-GAN

The figure above shows our Context-RNN-GAN model, in which both the generator G and the discriminator D (where Di denotes its snapshot at the ith timestep) are RNNs. G generates an image at every timestep, while D receives all preceding images as context to decide whether the image output by G at that timestep is real or generated. The xi are the input images.
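The forward pass described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the image and hidden sizes, vanilla-RNN cells, and random weights are all illustrative stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
IMG, HID = 64, 32  # flattened-image and hidden sizes (illustrative)

def rnn_cell(x, h, Wx, Wh):
    # one vanilla-RNN step: h' = tanh(Wx x + Wh h)
    return np.tanh(Wx @ x + Wh @ h)

# generator G: consumes the preceding images, emits the next image
G = {"Wx": rng.normal(0, .1, (HID, IMG)),
     "Wh": rng.normal(0, .1, (HID, HID)),
     "Wo": rng.normal(0, .1, (IMG, HID))}

def generate(seq):                       # seq: list of preceding images
    h = np.zeros(HID)
    for x in seq:
        h = rnn_cell(x, h, G["Wx"], G["Wh"])
    return np.tanh(G["Wo"] @ h)          # generated image for this timestep

# discriminator Di: scores the current image given all preceding images
D = {"Wx": rng.normal(0, .1, (HID, IMG)),
     "Wh": rng.normal(0, .1, (HID, HID)),
     "Wo": rng.normal(0, .1, (1, HID))}

def discriminate(context, candidate):
    h = np.zeros(HID)
    for x in context:                    # context images condition D
        h = rnn_cell(x, h, D["Wx"], D["Wh"])
    h = rnn_cell(candidate, h, D["Wx"], D["Wh"])
    return 1 / (1 + np.exp(-(D["Wo"] @ h)[0]))  # probability "real"

seq = [rng.normal(size=IMG) for _ in range(5)]
x_hat = generate(seq)                    # generated sixth image
p = discriminate(seq, x_hat)             # D's score for it, given context
```

In training, G would be updated to push p toward 1 while D is updated to push it toward 0 for generated images and toward 1 for real ones, as in a standard GAN objective.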

Impact of Adversarial Loss

Some Generations (Contextual-RNN-GAN)


Modeling Consecutive Timesteps with Siamese Networks for Better Accuracy
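The idea of a Siamese network is that two inputs pass through the same weights, so consecutive images are embedded into a shared feature space and compared there. A minimal sketch, with illustrative sizes and a contrastive-style loss (the exact loss and architecture here are assumptions, not the paper's specification):

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(0, .1, (16, 64))  # shared ("Siamese") weights; sizes illustrative

def embed(x):
    # both branches share W, so images map into the same feature space
    return np.tanh(W @ x)

def distance(x1, x2):
    return np.linalg.norm(embed(x1) - embed(x2))

def contrastive_loss(x1, x2, y, margin=1.0):
    # pull related pairs (y=1) together; push unrelated pairs (y=0)
    # at least `margin` apart in embedding space
    d = distance(x1, x2)
    return y * d**2 + (1 - y) * max(0.0, margin - d)**2

x1, x2 = rng.normal(size=64), rng.normal(size=64)
pos_loss = contrastive_loss(x1, x1, y=1)  # identical pair: zero loss
neg_loss = contrastive_loss(x1, x2, y=0)
```

Once trained, the `embed` features replace raw pixels as the representation that the Context-RNN-GAN operates on.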


Comparison With Human Performance

              College-grade   10th-grade
Age range     20-22           14-16
#Students     21              48
Mean          44.17%          36.67%
Std           16.67%          17.67%
Max           66.67%          75.00%
Min           8.33%           8.33%

Context-RNN-GAN with features obtained from the Siamese CNN is competitive with 10th-grade humans: it achieves 35.4% accuracy when its generated features are compared with the features of the actual answer images. Note that humans see the answer options and can pick the one that yields the most consistent overall six-image sequence, whereas our model only compares its generation (produced from the five problem images) against the options to select the best match. The model is therefore a strong generator, comparable even to 10th-grade humans. Interestingly, the model is never trained on the correct answers; it is trained only on multiple subsequences of the problem images, yet it still performs remarkably well.
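The option-selection step described above reduces to a nearest-feature search: generate features for the missing sixth image and pick the option whose features are closest. A sketch with hypothetical names (`pick_option` and the toy vectors are illustrative; in practice the features come from the Siamese CNN):

```python
import numpy as np

def pick_option(gen_feat, option_feats):
    # choose the answer option whose features are closest (in L2 distance)
    # to the features of the generated sixth image
    dists = [np.linalg.norm(gen_feat - f) for f in option_feats]
    return int(np.argmin(dists))

gen = np.array([1.0, 0.0])                       # generated features (toy)
options = [np.array([5.0, 5.0]),
           np.array([1.1, 0.1]),                 # nearest to `gen`
           np.array([-3.0, 2.0])]
best = pick_option(gen, options)
```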


Interesting Cases


Application of the Model to Moving-MNIST

The model can also be applied to video prediction tasks, as illustrated in the figure above. The Moving MNIST task consists of videos of two moving MNIST digits, where the next frame must be predicted from the preceding frames.
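For concreteness, a Moving-MNIST-style clip can be synthesized as below; the canvas size, velocities, and bouncing rule follow the common setup, but the exact parameters here are illustrative assumptions. The last frame then serves as the prediction target for the model, with the earlier frames as context.

```python
import numpy as np

def moving_mnist_sequence(digits, T=10, canvas=64, d=28):
    # toy Moving-MNIST-style clip: each d x d digit patch drifts across a
    # canvas x canvas frame and bounces off the borders
    rng = np.random.default_rng(0)
    seq = np.zeros((T, canvas, canvas))
    pos = rng.integers(0, canvas - d, size=(len(digits), 2)).astype(float)
    vel = rng.integers(-3, 4, size=(len(digits), 2)).astype(float)
    for t in range(T):
        for k, patch in enumerate(digits):
            pos[k] += vel[k]
            for ax in range(2):            # bounce off each border
                if pos[k, ax] < 0 or pos[k, ax] > canvas - d:
                    vel[k, ax] *= -1
                    pos[k, ax] = np.clip(pos[k, ax], 0, canvas - d)
            r, c = pos[k].astype(int)
            seq[t, r:r+d, c:c+d] = np.maximum(seq[t, r:r+d, c:c+d], patch)
    return seq  # frames 0..T-2 are the context; frame T-1 is the target

clip = moving_mnist_sequence([np.ones((28, 28)), np.ones((28, 28))], T=6)
```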